-
Notifications
You must be signed in to change notification settings - Fork 576
fix: always return PhysicalResourceID for CFn CustomResources #613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…e shouldn't change it
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #613 +/- ##
==========================================
+ Coverage 73.59% 73.75% +0.15%
==========================================
Files 35 35
Lines 1337 1345 +8
==========================================
+ Hits 984 992 +8
Misses 277 277
Partials 76 76 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Agree. For updates, I think defaulting to the input physical resource id (as CDK does) is correct, and fixes a bug. I do wonder if changing the default for As I review this though, I realize that defaulting to the
Porting of the CDK logic makes sense, and looks correct to me. What's the breaking behavior you think could happen here? The scenario I could think of is a delete that always returns physical resource id |
I'll try to make some time to test CloudFormation's behavior here, and form a firmer opinion :) |
| // A previous physical resource id exists unless this is a create request. | ||
| fallbackPhysicalResourceID := event.PhysicalResourceID | ||
| if event.RequestType == RequestCreate { | ||
| // If this is a create request, the fallback should be the request ID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking falling back to the LogStreamName for RequestCreate is more backwards compatible.
However, switching to RequestId may be better for correctness with multiple custom resources in the same stack. There may well be another bug lurking here that needs fixing. But that could also land as a separate PR.
| r.Status = StatusFailed | ||
| r.Reason = err.Error() | ||
| log.Printf("sending status failed: %s", r.Reason) | ||
| } else if event.RequestType == RequestDelete && event.PhysicalResourceID != r.PhysicalResourceID { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to test a scenario I'm worried about, will come back to this. Even though the CDK throws a validation error, this might be legal/ignored by CloudFormation, in which case this validation could break existing users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is legal/ignored by CloudFromation. That's why I called out this change as breaking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, it definitely is legal/ignored because for all the CustomResources that were created and then deleted with this code, the deletion request would've had old PhysicalResourceId as "lambda-log-stream" and the new PhysicalResourceId would be "".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though personally, I think it very unlikely for a customer workload to break because of this. They'd have to intentially try to break things.
About Create/UpdateI'm making a clarification to my earlier statements to ease this conversation.
Specifically, when the lambda instance changes according to Lambda's docs
We can do logstream name in create. But it'd mostly be perpetuating tech debt. I don't think using
I agree it seems wrong. I'm not sure if it causes issues, though, because it felt like the It doesn't affect this PR, but I can POC if something breaks if I hardcode the PhysicalResourceId and create two resources. About Delete
It's an obscure case: if the customer returns a different Before:
After:
I POC'd that. CFn didn't fail. It succeeded the delete even when a different There was one time where the stack got stuck but I believe that was my error because I failed to replicate it multiple times afterwards while paying more attention to what I was doing. But I may have done something wrong because testing this in general is very confusing. References:
No, this shouldn't fail because the
I'll share more testing details in the description. |
|
I recreated a repo with the test setup. You should be able to just clone and test with it: https://github.com/Youssef-Beltagy/test-go-cdk-PR-613 As a reminder (because I always forget), to trigger a CFn custom resource update, you need to update the input in the CDK/CFn. Updating the lambda handler doesn't trigger an CFn update. Best wishes. |
|
|
||
| if r.PhysicalResourceID == "" { | ||
| r.PhysicalResourceID = fallbackPhysicalResourceID | ||
| log.Printf("PhysicalResourceID not set. Using fallback PhysicalResourceID: %s\n", r.PhysicalResourceID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remembered #237, which resulted in a revert of a similar change.
As I understand from re-reading that, if a create failure status response contains a physical resource id, CloudFormation will issue a followup delete request. Which makes sense, however, some existing setups might not handle a delete with an unrecognized physical resource id. And such a request today would not be issued, as CloudFormation instead swallows the creation failure with "invalid physical resource id".
Which sucks, because it means resolving the related #107 means re-introducing these (unexpected for some) deletes on create errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand from re-reading that, if a create failure status response contains a physical resource id, CloudFormation will issue a followup delete request.
As far as I could see, CFn always issues a delete request if create failed regardless of whether the PhysicalResourceID is returned or not. I can double check if you wish.
The problem with #237 was that, for DELETE request, a new PhysicalResourceID was returned. This is definitely wrong because #108 was implemented incorrectly. Do you see how it always defaults to the lambda logstream and doesn't try to use the old PhysicalResourceID. That was the problem.
if r.PhysicalResourceID == "" {
log.Println("PhysicalResourceID must exist, copying Log Stream name")
r.PhysicalResourceID = lambdacontext.LogStreamName
}
Actually, this may tie back to the discussion on whether Delete request should validate the returned PhysicalResourceId.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised #237 saw delete fail because of different Physical Resource ID. I wonder how/why they saw it. I didn't. Maybe CFn behavior changed in all this time, maybe it had to do with their logic (which I agree we should still support).
Actually, what they described is similar to the one time I saw my stack get stuck and then failed to replicate it. It may be that I just need to think more about the methodology for testing delete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My assumption about about a delete not being issued is wrong. Or at least, it's wrong today. Perhaps there was a behavior change since 2019 🤷♂️
When I try to validate this
package main
import (
"context"
"errors"
"github.com/aws/aws-lambda-go/cfn"
"github.com/aws/aws-lambda-go/lambda"
)
func main() {
lambda.Start(cfn.LambdaWrap(func(ctx context.Context, e cfn.Event) (string, map[string]any, error) {
if e.RequestType == cfn.RequestCreate {
return "", nil, errors.New("fail") // in v1.51.1, "" won't fallback to a sane default, a delete will not be issued
}
return e.PhysicalResourceID, nil, errors.New("in v1.51.1, this should never occur")
}))
}
That error infact, does occur, a delete is issued, and we can see that in the stack events, CloudFormation came up with it's own PhysicalResourceId to pass in a delete when we failed to provide a valid one. Notice below, after the create fails for our invalid physical resource id, CloudFormation starts plumbing something else. "PhysicalResourceId": "Test3-Resouce-1FUJL4XBGXTTC",
...
{
...
"LogicalResourceId": "Resouce",
"PhysicalResourceId": "Test3-Resouce-1FUJL4XBGXTTC",
"ResourceType": "AWS::CloudFormation::CustomResource",
"Timestamp": "2026-01-10T00:21:36.084000+00:00",
"ResourceStatus": "DELETE_FAILED",
"ResourceStatusReason": "Received response status [FAILED] from custom resource. Message returned: in v1.51.1, this should never occur (RequestId: b6be9d6b-6ed3-47c1-b115-74d16dcab99c)",
},
...
{
...
"LogicalResourceId": "Resouce",
"PhysicalResourceId": "Test3-Resouce-1FUJL4XBGXTTC",
"ResourceType": "AWS::CloudFormation::CustomResource",
"Timestamp": "2026-01-10T00:21:34.300000+00:00",
"ResourceStatus": "DELETE_IN_PROGRESS",
...
},
...
{
...
"LogicalResourceId": "Resouce",
"PhysicalResourceId": "Test3-Resouce-1FUJL4XBGXTTC",
"ResourceType": "AWS::CloudFormation::CustomResource",
"Timestamp": "2026-01-10T00:21:31.490000+00:00",
"ResourceStatus": "CREATE_FAILED",
"ResourceStatusReason": "Invalid PhysicalResourceId",
...
},
{
...
"LogicalResourceId": "Resouce",
"PhysicalResourceId": "",
"ResourceType": "AWS::CloudFormation::CustomResource",
"Timestamp": "2026-01-10T00:21:28.684000+00:00",
"ResourceStatus": "CREATE_IN_PROGRESS",
...
},
...
I'm surprised #237 saw delete fail because of different Physical Resource ID. I wonder how/why they saw it. I didn't. Maybe CFn behavior changed in all this time, maybe it had to do with their logic (which I agree we should still support).
I suspect whatever issue occurred for this customer, (assuming the revert resolved this issue), could have began re-occuring at some point, but with that funny logical id looking thing rather than the funny log stream id that got reverted.
Actually, what they described is similar to the one time I saw my stack get stuck and then failed to replicate it. It may be that I just need to think more about the methodology for testing delete.
This feels familiar to me too, for an app I had where the API call was timing out.
I'll comment on #107 with this context too. #108 could probably be re-applied (either directly, or as a part of this PR's changes) given this observation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not obvious if that was the behavior at the time of #237,
After reading more, I suspect #237 was a user-error. They were facing the issue all the way back in 2016 (they referenced this: remind101/empire/918). This was before #108 was introduced. CFn behavior was the same back then.
As you noted, according to remind101/empire/918, it seems that even if we don't return a PhysicalResourceID upon creation failure, CFn generates one anyway when it sends the delete. At least if we are the ones who create that dummy PhysicalResourceID, the actual error gets surfaced instead of "Invalid PhysicalResourceID".
|
So on the delete validation part of this PR, here's a trivial app with custom resource function that would break.
This app does an update-with-delete on every deploy. Editing the but not failing or rolling back, which is good I guess. Final stack state is destroying the stack however results in the classic stuck state |
This is the "breaking change" I tried to call out (also quoted below).
I did the same thing with timestamps in the lambda response before opening the PR. I'm aware of this behavior after my change. The Delete behavior change is controversial. I only did it to maintain alignment with typescript CDK. So I was trying to understand why typescript CDK does what it does. I was wondering if CFn goes into a recursive delete cycle, fails in a problematic way, or something similar before my changes. That's what I failed to replicate/prove: your Always changing
|
Current
lambdaWrapWithClienthandlesPhysicalResourceIDincorrectly and is inconsistent with documentation.The current behavior doesn't set a fallback
PhysicalResourceIDwhen an error is returned by thelambdaFunction. It also doesn't reuse the oldPhysicalResourceIDwhen an update succeeds. This causes multiple problems:PhysicalResourceIDcan be returned. CFn perceives this as a resource replacement and triggers a deletion request for the "old"PhysicalResourceID. This behavior is wrong and obscure.PhysicalResourceIDas the input. This is the only breaking change in this PR. I tested returning a differentPhysicalResourceIDbut, honestly, I didn't see any issues (maybe point 2 happened and I didn't notice). I still made the change to maintain alignment with documentation and the typescript code.References:
Tests
Before:
Create Failure:

Update Failure:

When update succeeds, there often is an undesired followup delete.
After:
Create Failure:

Update Failure:

Tests
Edit: Please use this repository to replicate/test https://github.com/Youssef-Beltagy/test-go-cdk-PR-613
I used this setup along with go workspaces.
CDK:
Golang lambda: