Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 14 additions & 7 deletions cfn/wrap.go
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,6 @@ import (
"net/http"

"github.com/aws/aws-lambda-go/events"
"github.com/aws/aws-lambda-go/lambdacontext"
)

// CustomResourceLambdaFunction is a standard form Lambda for a Custom Resource.
Expand All @@ -29,11 +28,19 @@ func lambdaWrapWithClient(lambdaFunction CustomResourceFunction, client httpClie
fn = func(ctx context.Context, event Event) (reason string, err error) {
r := NewResponse(&event)

// A previous physical resource id exists unless this is a create request.
fallbackPhysicalResourceID := event.PhysicalResourceID
if event.RequestType == RequestCreate {
// If this is a create request, the fallback should be the request ID
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking falling back to the LogStreamName for RequestCreate is more backwards compatible.

However, switching to RequestId may be better for correctness with multiple custom resources in the same stack. There may well be another bug lurking here that needs fixing. But that could also land as a separate PR.

fallbackPhysicalResourceID = event.RequestID
}

funcDidPanic := true
defer func() {
if funcDidPanic {
r.Status = StatusFailed
r.Reason = "Function panicked, see log stream for details"
r.PhysicalResourceID = fallbackPhysicalResourceID
// FIXME: something should be done if an error is returned here
_ = r.sendWith(client)
}
Expand All @@ -42,17 +49,17 @@ func lambdaWrapWithClient(lambdaFunction CustomResourceFunction, client httpClie
r.PhysicalResourceID, r.Data, err = lambdaFunction(ctx, event)
funcDidPanic = false

if r.PhysicalResourceID == "" {
r.PhysicalResourceID = fallbackPhysicalResourceID
log.Printf("PhysicalResourceID not set. Using fallback PhysicalResourceID: %s\n", r.PhysicalResourceID)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remembered #237, which resulted in a revert of a similar change.

As I understand from re-reading that, if a create failure status response contains a physical resource id, CloudFormation will issue a followup delete request. Which makes sense, however, some existing setups might not handle a delete with an unrecognized physical resource id. And such a request today would not be issued, as CloudFormation instead swallows the creation failure with "invalid physical resource id".

Which sucks, because it means resolving the related #107 means re-introducing these (unexpected for some) deletes on create errors.

Copy link
Contributor Author

@Youssef-Beltagy Youssef-Beltagy Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand from re-reading that, if a create failure status response contains a physical resource id, CloudFormation will issue a followup delete request.

As far as I could see, CFn always issues a delete request if create failed regardless of whether the PhysicalResourceID is returned or not. I can double check if you wish.


The problem with #237 was that, for DELETE request, a new PhysicalResourceID was returned. This is definitely wrong because #108 was implemented incorrectly. Do you see how it always defaults to the lambda logstream and doesn't try to use the old PhysicalResourceID. That was the problem.

if r.PhysicalResourceID == "" {
	log.Println("PhysicalResourceID must exist, copying Log Stream name")
	r.PhysicalResourceID = lambdacontext.LogStreamName
}

Actually, this may tie back to the discussion on whether Delete request should validate the returned PhysicalResourceId.

Copy link
Contributor Author

@Youssef-Beltagy Youssef-Beltagy Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised #237 saw delete fail because of different Physical Resource ID. I wonder how/why they saw it. I didn't. Maybe CFn behavior changed in all this time, maybe it had to do with their logic (which I agree we should still support).

Actually, what they described is similar to the one time I saw my stack get stuck and then failed to replicate it. It may be that I just need to think more about the methodology for testing delete.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assumption about about a delete not being issued is wrong. Or at least, it's wrong today. Perhaps there was a behavior change since 2019 🤷‍♂️

When I try to validate this

package main

import (
        "context"
        "errors"

        "github.com/aws/aws-lambda-go/cfn"
        "github.com/aws/aws-lambda-go/lambda"
)

func main() {
        lambda.Start(cfn.LambdaWrap(func(ctx context.Context, e cfn.Event) (string, map[string]any, error) {
                if e.RequestType == cfn.RequestCreate {
                        return "", nil, errors.New("fail") // in v1.51.1, "" won't fallback to a sane default, a delete will not be issued
                }
                return e.PhysicalResourceID, nil, errors.New("in v1.51.1, this should never occur")
        }))
}

That error infact, does occur, a delete is issued, and we can see that in the stack events, CloudFormation came up with it's own PhysicalResourceId to pass in a delete when we failed to provide a valid one. Notice below, after the create fails for our invalid physical resource id, CloudFormation starts plumbing something else. "PhysicalResourceId": "Test3-Resouce-1FUJL4XBGXTTC",

...
        {
            ...
            "LogicalResourceId": "Resouce",
            "PhysicalResourceId": "Test3-Resouce-1FUJL4XBGXTTC",
            "ResourceType": "AWS::CloudFormation::CustomResource",
            "Timestamp": "2026-01-10T00:21:36.084000+00:00",
            "ResourceStatus": "DELETE_FAILED",
            "ResourceStatusReason": "Received response status [FAILED] from custom resource. Message returned: in v1.51.1, this should never occur (RequestId: b6be9d6b-6ed3-47c1-b115-74d16dcab99c)",
        },
        ...
        {
            ...
            "LogicalResourceId": "Resouce",
            "PhysicalResourceId": "Test3-Resouce-1FUJL4XBGXTTC",
            "ResourceType": "AWS::CloudFormation::CustomResource",
            "Timestamp": "2026-01-10T00:21:34.300000+00:00",
            "ResourceStatus": "DELETE_IN_PROGRESS",
            ...
        },
        ...
        {
            ...
            "LogicalResourceId": "Resouce",
            "PhysicalResourceId": "Test3-Resouce-1FUJL4XBGXTTC",
            "ResourceType": "AWS::CloudFormation::CustomResource",
            "Timestamp": "2026-01-10T00:21:31.490000+00:00",
            "ResourceStatus": "CREATE_FAILED",
            "ResourceStatusReason": "Invalid PhysicalResourceId",
            ...
        },
        {
            ...
            "LogicalResourceId": "Resouce",
            "PhysicalResourceId": "",
            "ResourceType": "AWS::CloudFormation::CustomResource",
            "Timestamp": "2026-01-10T00:21:28.684000+00:00",
            "ResourceStatus": "CREATE_IN_PROGRESS",
            ...
        },
...

I'm surprised #237 saw delete fail because of different Physical Resource ID. I wonder how/why they saw it. I didn't. Maybe CFn behavior changed in all this time, maybe it had to do with their logic (which I agree we should still support).

I suspect whatever issue occurred for this customer, (assuming the revert resolved this issue), could have began re-occuring at some point, but with that funny logical id looking thing rather than the funny log stream id that got reverted.

Actually, what they described is similar to the one time I saw my stack get stuck and then failed to replicate it. It may be that I just need to think more about the methodology for testing delete.

This feels familiar to me too, for an app I had where the API call was timing out.


I'll comment on #107 with this context too. #108 could probably be re-applied (either directly, or as a part of this PR's changes) given this observation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obvious if that was the behavior at the time of #237,

After reading more, I suspect #237 was a user-error. They were facing the issue all the way back in 2016 (they referenced this: remind101/empire/918). This was before #108 was introduced. CFn behavior was the same back then.

As you noted, according to remind101/empire/918, it seems that even if we don't return a PhysicalResourceID upon creation failure, CFn generates one anyway when it sends the delete. At least if we are the ones who create that dummy PhysicalResourceID, the actual error gets surfaced instead of "Invalid PhysicalResourceID".

}

if err != nil {
r.Status = StatusFailed
r.Reason = err.Error()
log.Printf("sending status failed: %s", r.Reason)
log.Printf("sending status failed: %s\n", r.Reason)
} else {
r.Status = StatusSuccess

if r.PhysicalResourceID == "" {
log.Println("PhysicalResourceID must exist on creation, copying Log Stream name")
r.PhysicalResourceID = lambdacontext.LogStreamName
}
}

err = r.sendWith(client)
Expand Down
114 changes: 86 additions & 28 deletions cfn/wrap_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -7,48 +7,103 @@ import (
"context"
"encoding/json"
"errors"
"fmt"
"io/ioutil" //nolint: staticcheck
"net/http"
"testing"

"github.com/aws/aws-lambda-go/lambdacontext"
"github.com/stretchr/testify/assert"
)

var testEvent = &Event{
RequestType: RequestCreate,
RequestID: "unique id for this create request",
ResponseURL: "http://pre-signed-S3-url-for-response",
LogicalResourceID: "MyTestResource",
StackID: "arn:aws:cloudformation:us-west-2:EXAMPLE/stack-name/guid",
RequestType: RequestUpdate,
RequestID: "unique id for this create request",
ResponseURL: "http://pre-signed-S3-url-for-response",
LogicalResourceID: "MyTestResource",
PhysicalResourceID: "prevPhysicalResourceID",
StackID: "arn:aws:cloudformation:us-west-2:EXAMPLE/stack-name/guid",
}

func TestCopyLambdaLogStream(t *testing.T) {
lgs := lambdacontext.LogStreamName
lambdacontext.LogStreamName = "DUMMYLOGSTREAMNAME"
func TestLambdaPhysicalResourceId(t *testing.T) {

tests := []struct {
// Input to the lambda
inputRequestType RequestType

// Output from the lambda
returnErr error
returnPhysicalResourceID string

// The PhysicalResourceID to test
expectedPhysicalResourceID string
}{
// For Create with no returned PhysicalResourceID
{RequestCreate, nil, "", testEvent.RequestID}, // Use RequestID as default for PhysicalResourceID
{RequestCreate, fmt.Errorf("dummy error"), "", testEvent.RequestID},

// For Create with PhysicalResourceID
{RequestCreate, nil, "newPhysicalResourceID", "newPhysicalResourceID"},
{RequestCreate, fmt.Errorf("dummy error"), "newPhysicalResourceID", "newPhysicalResourceID"},

// For Update with no returned PhysicalResourceID
{RequestUpdate, nil, "", "prevPhysicalResourceID"},
{RequestUpdate, fmt.Errorf("dummy error"), "", "prevPhysicalResourceID"},

// For Update with returned PhysicalResourceID
{RequestUpdate, nil, "newPhysicalResourceID", "newPhysicalResourceID"},
{RequestUpdate, fmt.Errorf("dummy error"), "newPhysicalResourceID", "newPhysicalResourceID"},

// For Delete with no returned PhysicalResourceID
{RequestDelete, nil, "", "prevPhysicalResourceID"},
{RequestDelete, fmt.Errorf("dummy error"), "", "prevPhysicalResourceID"},

// For Delete with returned PhysicalResourceID = old PhysicalResourceID
{RequestDelete, nil, "prevPhysicalResourceID", "prevPhysicalResourceID"},
{RequestDelete, fmt.Errorf("dummy error"), "prevPhysicalResourceID", "prevPhysicalResourceID"},

// For Delete with returned PhysicalResourceID != old PhysicalResourceID
// Technically, lambda handlers shouldn't return a different physical resource id upon deletion.
// Typescript CDK implementation for handlers would return an error here.
// But CFn ignores the returned PhysicalResourceID for deletion so it doesn't matter.
{RequestDelete, nil, "newPhysicalResourceID", "newPhysicalResourceID"},
{RequestDelete, fmt.Errorf("dummy error"), "newPhysicalResourceID", "newPhysicalResourceID"},
}
for _, test := range tests {

client := &mockClient{
DoFunc: func(req *http.Request) (*http.Response, error) {
response := extractResponseBody(t, req)
curTestEvent := *testEvent
curTestEvent.RequestType = test.inputRequestType

assert.Equal(t, StatusSuccess, response.Status)
assert.Equal(t, testEvent.LogicalResourceID, response.LogicalResourceID)
assert.Equal(t, "DUMMYLOGSTREAMNAME", response.PhysicalResourceID)
if curTestEvent.RequestType == RequestCreate {
curTestEvent.PhysicalResourceID = ""
}

return &http.Response{
StatusCode: http.StatusOK,
Body: nopCloser{bytes.NewBufferString("")},
}, nil
},
}
client := &mockClient{
DoFunc: func(req *http.Request) (*http.Response, error) {
response := extractResponseBody(t, req)

fn := func(ctx context.Context, event Event) (physicalResourceID string, data map[string]interface{}, err error) {
return
}
if test.returnErr == nil {
assert.Equal(t, StatusSuccess, response.Status)
} else {
assert.Equal(t, StatusFailed, response.Status)
}

_, err := lambdaWrapWithClient(fn, client)(context.TODO(), *testEvent)
assert.NoError(t, err)
lambdacontext.LogStreamName = lgs
assert.Equal(t, curTestEvent.LogicalResourceID, response.LogicalResourceID)
assert.Equal(t, test.expectedPhysicalResourceID, response.PhysicalResourceID)

return &http.Response{
StatusCode: http.StatusOK,
Body: nopCloser{bytes.NewBufferString("")},
}, nil
},
}

fn := func(ctx context.Context, event Event) (physicalResourceID string, data map[string]interface{}, err error) {
return test.returnPhysicalResourceID, nil, test.returnErr
}

_, err := lambdaWrapWithClient(fn, client)(context.TODO(), curTestEvent)
assert.NoError(t, err)
}
}

func TestPanicSendsFailure(t *testing.T) {
Expand All @@ -58,6 +113,9 @@ func TestPanicSendsFailure(t *testing.T) {
DoFunc: func(req *http.Request) (*http.Response, error) {
response := extractResponseBody(t, req)
assert.Equal(t, StatusFailed, response.Status)

// Even in a panic, a dummy PhysicalResourceID should be returned to ensure the error is surfaced correctly
assert.Equal(t, testEvent.PhysicalResourceID, response.PhysicalResourceID)
didSendStatus = response.Status == StatusFailed

return &http.Response{
Expand Down Expand Up @@ -111,7 +169,7 @@ func TestWrappedError(t *testing.T) {
response := extractResponseBody(t, req)

assert.Equal(t, StatusFailed, response.Status)
assert.Empty(t, response.PhysicalResourceID)
assert.Equal(t, testEvent.PhysicalResourceID, response.PhysicalResourceID)
assert.Equal(t, "failed to create resource", response.Reason)

return &http.Response{
Expand Down
Loading