AWS Batch S3 command offloading fails with quote escaping bug when METAFLOW_SFN_COMPRESS_STATE_MACHINE=True #2706

@AshkiX

Description

Bug Description

When deploying to AWS Step Functions with METAFLOW_SFN_COMPRESS_STATE_MACHINE=True, the step command is offloaded to S3, and AWS Batch tasks fail with shell syntax errors because the downloaded command script still contains shell-escaped quotes.

Environment

  • Metaflow version: 2.19.9 (also present in master branch as of commit 2738793)
  • Deployment: AWS Step Functions + AWS Batch
  • Configuration: METAFLOW_SFN_COMPRESS_STATE_MACHINE=True

Error

The batch job log shows the following:

2025-11-24T19:33:38.464Z /tmp/step_command.sh: line 1: 0: command not found
2025-11-24T19:33:38.464Z /tmp/step_command.sh: line 1: 2025-11-24T19:33:38.463533Z: command not found
2025-11-24T19:33:38.464Z /tmp/step_command.sh: line 1: task: command not found
2025-11-24T19:33:38.464Z /tmp/step_command.sh: line 1: 2025-11-24T19:33:38.463533117+00:00]Setting: command not found
2025-11-24T19:33:38.464Z Setting up task environment.
2025-11-24T19:33:38.831Z /tmp/step_command.sh: line 1: 0: command not found
2025-11-24T19:33:38.831Z /tmp/step_command.sh: line 1: 2025-11-24T19:33:38.830367Z: command not found
2025-11-24T19:33:38.831Z /tmp/step_command.sh: line 1: task: command not found
2025-11-24T19:33:38.831Z /tmp/step_command.sh: line 1: 2025-11-24T19:33:38.830367306+00:00]Downloading: command not found
2025-11-24T19:33:38.831Z Downloading code package...
2025-11-24T19:33:38.840Z File "<string>", line 1
2025-11-24T19:33:38.840Z import boto3, os; ep=os.getenv(\"METAFLOW_S3_ENDPOINT_URL\"); boto3.client(\"s3\", **({\"endpoint_url\":ep} if ep else {})).download_file(\"e2-twilio-ai-ml-s3-mflow-main-dev\", \"metaflow/e2TestS3OffloadFlow/data/58/588877b6636a4db90a81c9b8230b708fa54964ba\", \"job.tar\")
2025-11-24T19:33:38.840Z ^
2025-11-24T19:33:38.840Z SyntaxError: unexpected character after line continuation character
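The root cause can be reproduced locally without AWS: the stored command still contains shell-escaped quotes (\"), and when the downloaded script hands that text to the Python interpreter verbatim, the bare backslash is parsed as a line-continuation character. A minimal sketch (the snippet below is an abbreviated stand-in for the real generated command):

```python
# Stand-in for the escaped command the batch job downloads from S3.
# The \" sequences are literal backslash + quote, exactly as they
# appear in the error log above.
snippet = 'import os; ep = os.getenv(\\"METAFLOW_S3_ENDPOINT_URL\\")'

try:
    compile(snippet, "<string>", "exec")
except SyntaxError as exc:
    # Python rejects the quote following the bare backslash, producing
    # the same message seen in the batch job log.
    print(exc.msg)
```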

Reproduction

git clone https://github.com/Netflix/metaflow.git
cd metaflow
pip install -e ".[conda]"
echo "from metaflow import FlowSpec, step, batch, conda_base

class e2TestS3OffloadFlow(FlowSpec):

    @batch(cpu=1, memory=2048)
    @step
    def start(self):
        print('Hello from batch!')
        self.next(self.end)

    @step
    def end(self):
        print('Done')

if __name__ == '__main__':
    e2TestS3OffloadFlow()" > sample_flow.py

METAFLOW_SFN_COMPRESS_STATE_MACHINE=True METAFLOW_PROFILE=dev python sample_flow.py step-functions create && \
  METAFLOW_PROFILE=dev python sample_flow.py step-functions trigger

Then verify the failure in the AWS console.

Temporary fix

Just FYI, the following patch fixes the issue on my end.

diff --git a/metaflow/plugins/aws/batch/batch.py b/metaflow/plugins/aws/batch/batch.py
index 4f26eeb..cd4a262 100644
--- a/metaflow/plugins/aws/batch/batch.py
+++ b/metaflow/plugins/aws/batch/batch.py
@@ -122,7 +122,8 @@ class Batch(object):
         # Get the command that was created
         # Upload the command to S3 during deployment
         try:
-            command_bytes = cmd_str.encode("utf-8")
+            unescaped_cmd_str = cmd_str.replace('\\"', '"')
+            command_bytes = unescaped_cmd_str.encode("utf-8")
             result_paths = self.flow_datastore.save_data([command_bytes], len_hint=1)
             s3_path, _key = result_paths[0]

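As a minimal illustration of what the patch does (the sample command string below is hypothetical, not the real generated command):

```python
# Hypothetical command string carrying shell-escaped quotes, standing
# in for the cmd_str that Batch uploads during deployment.
cmd_str = 'python -c \\"import boto3\\"'

# The patch strips the shell-level escaping before encoding the bytes,
# since the downloaded script is executed without a re-quoting shell
# layer that would otherwise consume the backslashes.
unescaped_cmd_str = cmd_str.replace('\\"', '"')
print(unescaped_cmd_str)  # python -c "import boto3"
```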