Bug Description
When using AWS Step Functions with METAFLOW_SFN_COMPRESS_STATE_MACHINE=True, AWS Batch tasks fail with shell and Python syntax errors: the step command keeps its escaped quotes (\") when it is uploaded to S3, so the downloaded /tmp/step_command.sh is not a valid script.
Environment
- Metaflow version: 2.19.9 (also present in master branch as of commit 2738793)
- Deployment: AWS Step Functions + AWS Batch
- Configuration: METAFLOW_SFN_COMPRESS_STATE_MACHINE=True
Error
The batch job log shows the following:
2025-11-24T19:33:38.464Z /tmp/step_command.sh: line 1: 0: command not found
2025-11-24T19:33:38.464Z /tmp/step_command.sh: line 1: 2025-11-24T19:33:38.463533Z: command not found
2025-11-24T19:33:38.464Z /tmp/step_command.sh: line 1: task: command not found
2025-11-24T19:33:38.464Z /tmp/step_command.sh: line 1: 2025-11-24T19:33:38.463533117+00:00]Setting: command not found
2025-11-24T19:33:38.464Z Setting up task environment.
2025-11-24T19:33:38.831Z /tmp/step_command.sh: line 1: 0: command not found
2025-11-24T19:33:38.831Z /tmp/step_command.sh: line 1: 2025-11-24T19:33:38.830367Z: command not found
2025-11-24T19:33:38.831Z /tmp/step_command.sh: line 1: task: command not found
2025-11-24T19:33:38.831Z /tmp/step_command.sh: line 1: 2025-11-24T19:33:38.830367306+00:00]Downloading: command not found
2025-11-24T19:33:38.831Z Downloading code package...
2025-11-24T19:33:38.840Z File "<string>", line 1
2025-11-24T19:33:38.840Z import boto3, os; ep=os.getenv(\"METAFLOW_S3_ENDPOINT_URL\"); boto3.client(\"s3\", **({\"endpoint_url\":ep} if ep else {})).download_file(\"e2-twilio-ai-ml-s3-mflow-main-dev\", \"metaflow/e2TestS3OffloadFlow/data/58/588877b6636a4db90a81c9b8230b708fa54964ba\", \"job.tar\")
2025-11-24T19:33:38.840Z ^
2025-11-24T19:33:38.840Z SyntaxError: unexpected character after line continuation character
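For reference, here is a minimal, standalone sketch of the failure mode (not Metaflow code; the single-quoted python -c wrapper and the temp-file setup are my assumptions, and it assumes python is on PATH). It shows how a command whose double quotes are pre-escaped as \" breaks once it is written to a script and executed, reproducing the SyntaxError above, and how stripping the escaping (as the patch below does) avoids it:

import os
import subprocess
import tempfile

# Hypothetical stand-in for the generated step command: the python -c payload
# is single-quoted for the shell, so backslashes inside it reach Python verbatim.
escaped_cmd = "python -c 'print(\\\"downloaded\\\")'"   # quotes escaped as \" like in the log
fixed_cmd = escaped_cmd.replace('\\"', '"')              # the transformation the patch applies

for label, cmd in [("escaped", escaped_cmd), ("fixed", fixed_cmd)]:
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(cmd + "\n")
    # "escaped" exits 1 with "SyntaxError: unexpected character after line
    # continuation character"; "fixed" prints "downloaded" and exits 0.
    print(label, subprocess.run(["bash", f.name]).returncode)
    os.unlink(f.name)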
Reproduction
git clone https://github.com/Netflix/metaflow.git
cd metaflow
pip install -e ".[conda]"
echo "from metaflow import FlowSpec, step, batch, conda_base
class e2TestS3OffloadFlow(FlowSpec):
@batch(cpu=1, memory=2048)
@step
def start(self):
print('Hello from batch!')
self.next(self.end)
@step
def end(self):
print('Done')
if __name__ == '__main__':
e2TestS3OffloadFlow()" > sample_flow.py
METAFLOW_SFN_COMPRESS_STATE_MACHINE=True METAFLOW_PROFILE=dev python sample_flow.py step-functions create && METAFLOW_PROFILE=dev python sample_flow.py step-functions trigger
Then verify the failure in the AWS console; the Batch job log shows the errors above.
Temporary fix
Just FYI, the following patch fixes the issue on my end.
diff --git a/metaflow/plugins/aws/batch/batch.py b/metaflow/plugins/aws/batch/batch.py
index 4f26eeb..cd4a262 100644
--- a/metaflow/plugins/aws/batch/batch.py
+++ b/metaflow/plugins/aws/batch/batch.py
@@ -122,7 +122,8 @@ class Batch(object):
# Get the command that was created
# Upload the command to S3 during deployment
try:
- command_bytes = cmd_str.encode("utf-8")
+ unescaped_cmd_str = cmd_str.replace('\\"', '"')
+ command_bytes = unescaped_cmd_str.encode("utf-8")
result_paths = self.flow_datastore.save_data([command_bytes], len_hint=1)
s3_path, _key = result_paths[0]
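As a quick sanity check (again just a sketch, reusing a fragment of the command string from the log above), the same replace turns the escaped one-liner back into syntax Python accepts:

escaped = 'ep=os.getenv(\\"METAFLOW_S3_ENDPOINT_URL\\")'   # fragment from /tmp/step_command.sh

try:
    compile(escaped, "<cmd>", "exec")
except SyntaxError as err:
    print(err)   # unexpected character after line continuation character

# After the patch's replace the fragment is valid Python again.
compile(escaped.replace('\\"', '"'), "<cmd>", "exec")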