
Conversation

@herin049
Contributor

Adds support for running on AWS Lambda managed instances.

Lambda managed instances differ from standard lambda functions in several areas, but the differences most relevant for the OpenTelemetry collector layer are:

  • For managed instances, the Extensions API does not allow subscribing to the Invoke event type.
  • For managed instances, the Telemetry API does not report platform.runtimeDone events.
  • A managed Lambda instance is never frozen, which removes the need for the decouple processor.
  • Multiple Lambda processes can be created within a single execution environment (particularly relevant for ensuring that auto-instrumentation works with the bootstrap script).
  • Multiple Lambda function invocations may be in progress simultaneously within a given execution environment.

For more information see: https://docs.aws.amazon.com/lambda/latest/dg/lambda-managed-instances.html

To accommodate the differences above, the following changes are made to the layer when the extension determines, via the AWS_LAMBDA_INITIALIZATION_TYPE environment variable, that the initialization type is lambda-managed-instances (a minimal sketch of this check follows the list):

  • The extension no longer subscribes to the Invoke event type, and it no longer subscribes to the Telemetry API to listen for the platform.runtimeDone event.
  • The decouple processor is no longer added to any pipelines.
  • The FunctionInvoked() and FunctionFinished() lifecycle methods are no longer invoked for lifecycle listeners.
  • The wrapper script has been updated to apply instrumentation inside the wrapper script itself, ensuring that instrumentation is properly applied to newly created Python processes (this applies to all initialization types).
  • A few changes have been made to the Telemetry API receiver to accommodate changes in the events reported by the Telemetry API.
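
As a rough illustration of the gating concept (the collector extension itself is written in Go; this Python sketch is purely illustrative, and the environment variable value is taken from the list above):

```python
import os

def is_managed_instance() -> bool:
    # The extension checks AWS_LAMBDA_INITIALIZATION_TYPE to decide whether it is
    # running on a managed instance; the value below comes from this PR's description.
    return os.environ.get("AWS_LAMBDA_INITIALIZATION_TYPE") == "lambda-managed-instances"

if is_managed_instance():
    # Skip the Invoke subscription, the platform.runtimeDone handling, and the
    # decouple processor, per the bullet points above.
    pass
```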

I have added relevant unit tests and have manually verified the implementation using the following repo, which configures the layer to export signals to Grafana Cloud: https://github.com/herin049/aws-lambda-managed

@herin049 herin049 requested a review from a team as a code owner December 20, 2025 04:12
@serkan-ozal
Contributor

Hi @herin049,

For the Python SDK related changes, as far as I understand from your explanation, AWS_LAMBDA_EXEC_WRAPPER is not applied to the spawned Lambda processes (only to the main Lambda runtime process) on Lambda managed instances. Is that correct? Is there any official AWS documentation mentioning or explaining this behavior?

Otherwise (if AWS_LAMBDA_EXEC_WRAPPER could still be used for the spawned Lambda processes), the opentelemetry-instrument CLI would instrument the spawned Lambda processes.

@herin049
Contributor Author

herin049 commented Dec 23, 2025

Hi @serkan-ozal, thanks for the review.

Yes, your understanding is correct. To reiterate, here is what I assume happens internally for AWS Lambda managed instances:

  1. During initialization of the EC2 VM, Lambda first executes the AWS_LAMBDA_EXEC_WRAPPER as usual.
  2. After this initialization phase completes, Lambda spawns N child Python processes and imports the handler module directly in each child process by calling importlib.import_module. This is not an issue for most wrapper scripts, since they typically just set environment variables or manipulate the file system, and those side effects remain visible to each process. It is an issue for the auto instrumentation library, however, because the child processes are fresh interpreters that no longer have the patching applied by the auto instrumentation libraries (a rough sketch of this assumption follows below).
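
To make the assumption concrete, here is a hypothetical sketch of what each spawned worker might do (only the use of importlib.import_module and the _HANDLER convention come from the description above; everything else, including the use of multiprocessing, is an assumption for illustration):

```python
import importlib
import multiprocessing
import os

def worker_loop() -> None:
    # A freshly spawned interpreter imports the handler module directly, so any
    # monkey-patching applied by opentelemetry-instrument in the parent process
    # is absent here.
    module_name, handler_name = os.environ["_HANDLER"].rsplit(".", 1)
    handler_module = importlib.import_module(module_name)
    handler = getattr(handler_module, handler_name)
    # ... poll for invocations and call handler(event, context) ...

if __name__ == "__main__":
    # The parent process (where AWS_LAMBDA_EXEC_WRAPPER already ran) spawns N
    # workers with the "spawn" start method, i.e. fresh interpreters rather than forks.
    ctx = multiprocessing.get_context("spawn")
    workers = [ctx.Process(target=worker_loop) for _ in range(4)]
    for worker in workers:
        worker.start()
```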

I can't find any documentation on this beyond the docs stating that managed Lambda instances can serve many requests concurrently, and the only way to do that with true parallelism in Python is to use multiple processes.

Regardless, this change is always safe to make, because the behavior is identical even for standard Lambda instances. That is, running opentelemetry-instrument python main.py is nearly identical to running python main.py and calling auto_instrumentation.initialize() at the top of the file (see the sketch below). These changes also make PR #2069 irrelevant. Essentially, even if my assumptions are incorrect, regular Lambda functions will still be instrumented properly, which is why this is the most straightforward and safest approach to take.
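
A minimal sketch of the equivalence being claimed (the handler body is hypothetical; auto_instrumentation.initialize() is the same call mentioned above):

```python
# main.py -- roughly what `opentelemetry-instrument python main.py` achieves,
# but done explicitly at module import time instead of via the CLI wrapper.
from opentelemetry.instrumentation import auto_instrumentation

# Apply the configured distro/instrumentors before any instrumented libraries
# are imported by the application code below.
auto_instrumentation.initialize()

def handler(event, context):  # hypothetical Lambda handler
    return {"statusCode": 200}
```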

@serkan-ozal
Contributor

@herin049

Yes, I know that using a wrapper handler which delegates to the user handler is a very common approach. But when asking to verify this behavioral change with the AWS_LAMBDA_EXEC_WRAPPER env var, I was mostly thinking about the other runtimes. Except for the Ruby runtime, auto instrumentation for the other runtimes works without a wrapper (NODE_OPTIONS for Node.js and JAVA_TOOL_OPTIONS for the Java agent).

However, even though the AWS_LAMBDA_EXEC_WRAPPER env var is not applied to spawned processes, this should not be an issue for the Node.js and Java runtimes, as they only set some env vars to configure/activate OTEL instrumentation, and these env vars should be inherited by the spawned worker/child processes, unless the main process filters env vars before passing them on (that is the point we need to check). One more point on this: loading the user handler in the wrapper handler is a little more complex in Node.js, as we need to take care of some additional cases (paths, CJS vs ESM, etc.). Another point is that we may also need to be sure that the OTEL SDK is not instrumenting the main process itself, because otherwise, depending on the implementation of the main process, there might be spans reported by the OTEL SDK which are not related to the user code.

In addition to the points above, for Python, instead of a wrapper handler, another approach would be to use sitecustomize.py, made available via PYTHONPATH. Basically, you should be able to initialize the OpenTelemetry SDK automatically at Python startup by placing your OTEL initialization code in a sitecustomize.py file and ensuring that its directory is included in PYTHONPATH. Python imports sitecustomize on interpreter startup, allowing OTEL to be configured before any application code runs.
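
A minimal sketch of that sitecustomize.py approach, assuming the layer ships the file and the wrapper script prepends its (hypothetical) directory to PYTHONPATH:

```python
# /opt/otel-python/sitecustomize.py  (hypothetical path inside the layer; the wrapper
# script would export PYTHONPATH=/opt/otel-python:$PYTHONPATH before starting the runtime)
#
# Python imports the `sitecustomize` module automatically at interpreter startup,
# so every spawned worker process would run this before any application code.
try:
    from opentelemetry.instrumentation import auto_instrumentation

    auto_instrumentation.initialize()
except Exception:  # never break the function because instrumentation failed to load
    pass
```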

@herin049
Contributor Author

Thanks @serkan-ozal

I see where you are coming from now. To limit the scope of these changes, I've focused on the collector-level changes and on adding support for Python for now. If you'd like, I can create an issue for verifying/making changes for all of the supported Lambda runtimes on Lambda managed instances. I am not as familiar with how auto instrumentation works for the other runtimes, but I can certainly look into this more and make any required changes based on my findings in order to support managed instances. In either case, the collector changes I have made in this PR will not change even if there are substantial changes to the auto instrumentation logic for some runtimes.

With regard to your concern about instrumenting the original parent process, I don't think this is an issue: I have not observed any irregular spans being reported, even with all of the auto instrumentation libraries enabled.

From what I have found so far, it seems the auto instrumentation wrapper command is not working properly for the worker Python processes because sitecustomize.py is somehow not being loaded for them (the wrapper command already adds a directory containing a custom sitecustomize.py to the PYTHONPATH environment variable). There are two possible reasons I can think of: either the updated PYTHONPATH environment variable is not being propagated to the worker processes correctly, or sitecustomize.py is simply not being loaded in the new Python processes. I think the latter is more likely, because I know that environment variables set in the wrapper script are propagated to the worker processes correctly: otherwise the modifications to the _HANDLER and ORIG_HANDLER environment variables would not take effect and auto instrumentation would not be applied at all, yet my tests show telemetry being reported properly in all cases. We could switch to an approach where we create our own sitecustomize.py file and have the wrapper script add it to PYTHONPATH, but I suspect we would run into the same issues as with the wrapper command. One way to narrow this down is sketched below.
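
One hypothetical way to narrow down which of the two scenarios is happening (a diagnostic handler, not part of this PR) would be to report the interpreter state from inside each worker process:

```python
import os
import sys

def debug_handler(event, context):
    # Hypothetical handler that reports, per worker process, whether sitecustomize
    # was imported and whether the PYTHONPATH modification actually propagated.
    return {
        "pid": os.getpid(),
        "pythonpath": os.environ.get("PYTHONPATH", ""),
        "sitecustomize_loaded": "sitecustomize" in sys.modules,
        "opentelemetry_loaded": any(name.startswith("opentelemetry") for name in sys.modules),
    }
```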

The reason I made the changes to the wrapper script the way I did is that they are relatively minimal and backwards compatible, ensuring that the behavior matches that of the previous wrapper script.

@serkan-ozal
Contributor

@herin049 I think it is better to limit the scope of this PR to only changes related to the collector. For the SDK related changes, first, I would like to understand the behavior of the main process and spawned processes (I will be looking into it too) when AWS Lambda managed instances are used.

So, my take is please create issue(s) for the SDK related changes and remove the Python related changes from this PR.
@tylerbenson @wpessers @pragmaticivan @maxday WDYT?

@herin049
Contributor Author

@herin049 I think it is better to limit the scope of this PR to only changes related to the collector. For the SDK related changes, first, I would like to understand the behavior of the main process and spawned processes (I will be looking into it too) when AWS Lambda managed instances are used.

So, my take is please create issue(s) for the SDK related changes and remove the Python related changes from this PR. @tylerbenson @wpessers @pragmaticivan @maxday WDYT?

Sounds good to me @serkan-ozal. I have reverted the Python SDK related changes in this PR. I can work on a follow-up PR to update all of the SDKs where necessary to support managed instance types, and do some additional research myself.

@wpessers
Contributor

So, my take is please create issue(s) for the SDK related changes and remove the Python related changes from this PR. @tylerbenson @wpessers @pragmaticivan @maxday WDYT?

Yes I agree!

@herin049 herin049 force-pushed the feat/managed-instances branch from 2b22ef9 to 114e673 on December 27, 2025 04:50