[Bug] When using Async IO Engine pending ops cause resume to freeze #5554

@dobrac

Describe the bug

When the Async IO Engine is used for the rootfs and the guest is doing a lot of I/O while Firecracker (FC) pauses the VM and creates a snapshot, some read/write completions can still be pending between the pause and the snapshot (pending ops). When the VM is resumed later on, the guest kernel freezes: it is waiting for I/O that never finishes.

This issue doesn't seem to happen when using the Sync IO Engine.

To Reproduce

Here is a test case that reproduces the issue most of the time with added debug messages for the pending ops: https://github.com/e2b-dev/firecracker/pull/6/files#diff-d960bea365831acfb0eb3b1b548e6d22293710c9ed558f5dfbf68e016457870dR595

Example error output:

Starting iteration 1/100 - Testing for non-zero async I/O drain
================================================================================
Free space on sandbox start: 18G
DRAIN: pending_ops=17
Restoring from snapshot...
(frozen, nothing happens after)
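
For readers who do not want to open the linked test, the sequence that leads to the freeze can be sketched as the Firecracker API requests below. This is only an illustration, not the linked test case: the helper name repro_requests and all file paths are hypothetical, and the field names follow the public Firecracker API (io_engine on the drive, PATCH /vm, PUT /snapshot/create, PUT /snapshot/load) as I understand it for 1.13.

```python
def repro_requests(rootfs_path, snapshot_path, mem_path):
    """Build the (method, endpoint, body) sequence: pause -> snapshot -> restore.

    Hypothetical helper for illustration; it only builds request bodies and
    does not talk to a Firecracker API socket.
    """
    before_snapshot = [
        # Configured before boot: the rootfs drive uses the Async (io_uring) engine.
        ("PUT", "/drives/rootfs", {
            "drive_id": "rootfs",
            "path_on_host": rootfs_path,
            "is_root_device": True,
            "is_read_only": False,
            "io_engine": "Async",
        }),
        # ... the guest boots and generates heavy disk I/O here ...
        # Pause while I/O is still in flight.
        ("PATCH", "/vm", {"state": "Paused"}),
        # Snapshot the paused VM.
        ("PUT", "/snapshot/create", {
            "snapshot_type": "Full",
            "snapshot_path": snapshot_path,
            "mem_file_path": mem_path,
        }),
    ]
    # Sent to a fresh Firecracker process; this is where the guest freezes.
    restore = [
        ("PUT", "/snapshot/load", {
            "snapshot_path": snapshot_path,
            "mem_backend": {"backend_type": "File", "backend_path": mem_path},
            "resume_vm": True,
        }),
    ]
    return before_snapshot, restore
```

Switching "io_engine" to "Sync" in the first request is the variant that did not show the problem.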

Expected behavior

The resume should succeed even when there are pending ops during the FC pause/snapshot.

Environment

  • Firecracker version: 1.13.1
  • Host and guest kernel versions: Guest kernels provided by the test suite, Host: Linux codespaces-2de7d3 6.8.0-1030-azure #35~22.04.1-Ubuntu SMP Mon May 26 18:08:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
  • Rootfs used: Provided by the test suite (extended size to fit the operations)
  • Architecture: x86_64
  • Any other relevant software versions: -

Additional context

How has this bug affected you? The resume is occasionally failing.

What are you trying to achieve? Resume a VM that has been paused previously.

Do you have any idea of what the solution might be? Not yet. My guess would be that the completions are not properly acknowledged to the guest OS.
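
To make that guess concrete, here is a toy model in Python. It is not Firecracker code and does not claim to describe the real root cause; ToyVm and all of its methods are made up for illustration. It only shows why draining the host-side operations is not enough if the completions never become visible to the guest before the snapshot is taken.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVm:
    # Requests the guest still expects a completion for (guest-visible state).
    guest_waiting: set = field(default_factory=set)
    # Requests the host I/O engine is still processing (host-only state).
    host_in_flight: set = field(default_factory=set)

    def submit(self, req_id):
        self.guest_waiting.add(req_id)
        self.host_in_flight.add(req_id)

    def drain(self, acknowledge_to_guest):
        """Finish all host-side operations; optionally publish them to the guest."""
        drained = len(self.host_in_flight)
        if acknowledge_to_guest:
            self.guest_waiting -= self.host_in_flight
        self.host_in_flight.clear()
        return drained

    def snapshot_and_restore(self):
        # Only guest-visible state survives; the restored VM starts with
        # no host-side operations in flight.
        return ToyVm(guest_waiting=set(self.guest_waiting))

    def guest_hangs(self):
        # The guest waits forever on requests nobody is processing anymore.
        return bool(self.guest_waiting - self.host_in_flight)


def run(acknowledge_to_guest, n_requests=17):
    vm = ToyVm()
    for req_id in range(n_requests):
        vm.submit(req_id)
    pending_ops = vm.drain(acknowledge_to_guest)
    restored = vm.snapshot_and_restore()
    return pending_ops, restored.guest_hangs()
```

In this model run(False) drains 17 pending ops and the restored guest hangs, while run(True) drains the same 17 ops and the guest continues. Whether Firecracker actually behaves like the first case is exactly what is unknown here.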

Checks

  • Have you searched the Firecracker Issues database for similar problems?
  • Have you read the existing relevant Firecracker documentation?
  • Are you certain the bug being reported is a Firecracker issue? Not fully sure. It might be related to io_uring.
