[feat] added fal-flashpack support #12999
base: main
Conversation
Thanks for the comprehensive PR. To benchmark model loading, are we reporting the pipeline loading time or just the denoiser? Could we benchmark just the denoiser loading? Also, during the standard load, let's specify `device_map="cuda"`.
sayakpaul
left a comment
Thanks for starting this. I left some questions.
Force-pushed from 68949d0 to e5bb10c (compare)
For benchmarking model loading, I used the official benchmarking script from the original flashpack repository, which measures only the diffusion transformer (DiT / denoiser) loading time, not the full pipeline. The results reported in the table above therefore correspond exclusively to denoiser loading for the respective Diffusers models. Here is the script:

import csv
import gc
import os
import shutil
import tempfile
import time

import torch
from huggingface_hub import snapshot_download

from diffusers.models import AutoModel as DiffusersAutoModel


def test_model(
    repo_id: str,
    subfolder: str | None = None,
    accelerate_device: str | torch.device = "cuda",
    flashpack_device: str | torch.device = "cuda",
    dtype: torch.dtype | None = None,
    use_transformers: bool = False,
) -> tuple[float, float, int]:
    """
    Test a model from a repository.
    """
    repo_dir = snapshot_download(
        repo_id, allow_patterns=None if subfolder is None else [f"{subfolder}/*"]
    )
    model_dir = repo_dir if subfolder is None else os.path.join(repo_dir, subfolder)
    saved_flashpack_path = os.path.join(model_dir, "model.flashpack")
    saved_flashpack_config_path = os.path.join(model_dir, "flashpack_config.json")

    with tempfile.TemporaryDirectory() as tmpdir:
        # Make a new model directory with the model in it so it isn't cached
        temp_model_dir = os.path.join(tmpdir, "model")
        flashpack_dir = os.path.join(tmpdir, "flashpack")
        os.makedirs(flashpack_dir, exist_ok=True)
        print("Copying model to temporary directory")
        shutil.copytree(model_dir, temp_model_dir)

        # Load from the temporary model directory
        print("Loading model from temporary directory using from_pretrained")
        start_time = time.time()
        model = DiffusersAutoModel.from_pretrained(
            temp_model_dir,
            torch_dtype=dtype,
            device_map={"": accelerate_device},
        )
        end_time = time.time()
        accelerate_time = end_time - start_time
        print(f"Time taken with from_pretrained: {accelerate_time} seconds")

        if os.path.exists(saved_flashpack_path) and os.path.exists(
            saved_flashpack_config_path
        ):
            print("Copying flashpack to temporary directory")
            shutil.copy(
                saved_flashpack_path, os.path.join(flashpack_dir, "model.flashpack")
            )
            shutil.copy(
                saved_flashpack_config_path, os.path.join(flashpack_dir, "config.json")
            )
        else:
            print("Packing model to flashpack")
            pack_start_time = time.time()
            model.save_pretrained(
                flashpack_dir, target_dtype=dtype, use_flashpack=True
            )
            pack_end_time = time.time()
            print(
                f"Time taken for flashpack packing: {pack_end_time - pack_start_time} seconds"
            )
            # Copy back to the original model directory
            shutil.copy(
                os.path.join(flashpack_dir, "model.flashpack"), saved_flashpack_path
            )
            shutil.copy(
                os.path.join(flashpack_dir, "config.json"), saved_flashpack_config_path
            )

        del model
        sync_and_flush()

        print("Loading model from flashpack directory using from_pretrained_flashpack")
        flashpack_start_time = time.time()
        flashpack_model = DiffusersAutoModel.from_pretrained(
            flashpack_dir, device=flashpack_device, target_dtype=dtype, use_flashpack=True
        )
        flashpack_end_time = time.time()
        flashpack_time = flashpack_end_time - flashpack_start_time
        print(f"Time taken with flashpack loading: {flashpack_time} seconds")

        total_numel = 0
        for param in flashpack_model.parameters():
            total_numel += param.numel()
        total_bytes = total_numel * dtype.itemsize

        del flashpack_model
        sync_and_flush()
        return accelerate_time, flashpack_time, total_bytes


def test_wan_small_transformer() -> tuple[float, float, int]:
    return test_model(
        repo_id="Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
        subfolder="transformer",
        accelerate_device="cuda:0" if torch.cuda.is_available() else "cpu",
        flashpack_device="cuda:1" if torch.cuda.is_available() else "cpu",
        dtype=torch.bfloat16,
    )


def test_wan_large_transformer() -> tuple[float, float, int]:
    return test_model(
        repo_id="Wan-AI/Wan2.1-T2V-14B-Diffusers",
        subfolder="transformer",
        accelerate_device="cuda:0" if torch.cuda.is_available() else "cpu",
        flashpack_device="cuda:1" if torch.cuda.is_available() else "cpu",
        dtype=torch.bfloat16,
    )


def test_flux_transformer() -> tuple[float, float, int]:
    return test_model(
        repo_id="black-forest-labs/FLUX.1-dev",
        subfolder="transformer",
        accelerate_device="cuda:0" if torch.cuda.is_available() else "cpu",
        flashpack_device="cuda:1" if torch.cuda.is_available() else "cpu",
        dtype=torch.bfloat16,
    )


def test_qwen_transformer() -> tuple[float, float, int]:
    return test_model(
        repo_id="Qwen/Qwen-Image-Edit",
        subfolder="transformer",
        accelerate_device="cuda:0" if torch.cuda.is_available() else "cpu",
        flashpack_device="cuda:1" if torch.cuda.is_available() else "cpu",
        dtype=torch.bfloat16,
        use_transformers=True,
    )


def print_test_result(
    model_name: str,
    accelerate_time: float,
    flashpack_time: float,
    total_bytes: int,
) -> None:
    print(f"{model_name}: Accelerate time: {accelerate_time} seconds")
    print(f"{model_name}: Flashpack time: {flashpack_time} seconds")
    accelerate_gbps = (total_bytes / 1000**3) / accelerate_time
    flashpack_gbps = (total_bytes / 1000**3) / flashpack_time
    print(f"{model_name}: Accelerate GB/s: {accelerate_gbps} GB/s")
    print(f"{model_name}: Flashpack GB/s: {flashpack_gbps} GB/s")


def sync_and_flush() -> None:
    torch.cuda.empty_cache()
    gc.collect()
    os.system("sync")
    if os.geteuid() == 0:
        os.system("echo 3 | tee /proc/sys/vm/drop_caches")


if __name__ == "__main__":
    with open("benchmark_results_finall.csv", "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(
            ["model", "accelerate_time", "flashpack_time", "total_overhead_time", "total_bytes"]
        )
        for i in range(10):
            for test_model_name, test_func in [
                ("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", test_wan_small_transformer),
                ("Wan-AI/Wan2.1-T2V-14B-Diffusers", test_wan_large_transformer),
                ("black-forest-labs/FLUX.1-dev", test_flux_transformer),
                ("Qwen/Qwen-Image-Edit", test_qwen_transformer),
            ]:
                accelerate_time, flashpack_time, total_bytes = test_func()
                # ---- main benchmark CSV (unchanged, written every run) ----
                writer.writerow(
                    [
                        test_model_name,
                        accelerate_time,
                        flashpack_time,
                        total_bytes,
                    ]
                )
                print_test_result(
                    test_model_name,
                    accelerate_time,
                    flashpack_time,
                    total_bytes,
                )

Regarding FlashPack overhead, the total overhead time is the sum of:
Note: All timings use direct GPU loading (device_map="cuda" or equivalent) to ensure a fair comparison between standard loading and FlashPack.
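For reference, "direct GPU loading" on the standard path corresponds roughly to a call like the one below; this is a minimal sketch, and the checkpoint is simply one of the benchmarked models:

import torch
from diffusers.models import AutoModel

# Standard (non-FlashPack) load, with the weights materialized directly on the GPU.
model = AutoModel.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)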
Thanks, so it should be `device_map="cuda"`.
The table doesn't make it clear what the time with the standard load corresponds to.
sayakpaul
left a comment
Left some further comments.
except EnvironmentError:
    resolved_model_file = None
with no_init_weights():
    model = cls.from_config(config, **unused_kwargs)
Why do we need to initialize the model here? It is already done here:
diffusers/src/diffusers/models/modeling_utils.py, line 1278 in ec37629:

model = cls.from_config(config, **unused_kwargs)
IMO, the flashpack weight-name resolution code can go around the following block:
diffusers/src/diffusers/models/modeling_utils.py, lines 1242 to 1256 in ec37629:

if resolved_model_file is None and not is_sharded:
    resolved_model_file = _get_model_file(
        pretrained_model_name_or_path,
        weights_name=_add_variant(WEIGHTS_NAME, variant),
        cache_dir=cache_dir,
        force_download=force_download,
        proxies=proxies,
        local_files_only=local_files_only,
        token=token,
        revision=revision,
        subfolder=subfolder,
        user_agent=user_agent,
        commit_hash=commit_hash,
        dduf_entries=dduf_entries,
    )
And then the rest of the code can follow.
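To illustrate the suggested ordering only (this is not the PR's actual code), a standalone sketch of "try the FlashPack file first, then fall back to standard weight resolution" could look like this; the `resolve_weights` helper and the exact file names are assumptions for illustration:

import os

def resolve_weights(model_dir: str, use_flashpack: bool) -> str:
    # Hypothetical helper, not part of diffusers: prefer a packed file when the
    # user asked for FlashPack, otherwise fall back to the usual weight files.
    if use_flashpack:
        flashpack_file = os.path.join(model_dir, "model.flashpack")
        if os.path.exists(flashpack_file):
            return flashpack_file
    for name in ("diffusion_pytorch_model.safetensors", "diffusion_pytorch_model.bin"):
        candidate = os.path.join(model_dir, name)
        if os.path.exists(candidate):
            return candidate
    raise EnvironmentError(f"No weight files found in {model_dir}")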
I've changed the `from_pretrained()` code path and removed the redundant initialization. Happy to clarify or discuss further if you have any questions.
Also, WDYT of running a conversion on the fly from bin / safetensors -> flashpack when a user requests to load in flashpack for a non-flashpack checkpoint?
I hope it is clear now.
I don't think automatic conversion should be enabled by default. FlashPack conversion can take a long time for large models, so triggering it automatically when a user loads a non-FlashPack checkpoint with `use_flashpack=True` could add a large, unexpected delay to that first load.
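For illustration only, an opt-in on-the-fly conversion along the lines discussed here could look roughly like the sketch below; the `load_with_flashpack` helper is hypothetical, and the `use_flashpack` / `target_dtype` / `device` arguments follow the benchmark script above:

import os
import torch
from diffusers.models import AutoModel

def load_with_flashpack(model_dir: str, dtype: torch.dtype = torch.bfloat16):
    # Hypothetical convenience wrapper: pack the checkpoint once if no
    # FlashPack file exists yet, then load from the packed weights.
    packed_file = os.path.join(model_dir, "model.flashpack")
    if not os.path.exists(packed_file):
        model = AutoModel.from_pretrained(model_dir, torch_dtype=dtype)
        model.save_pretrained(model_dir, target_dtype=dtype, use_flashpack=True)
        del model
    return AutoModel.from_pretrained(
        model_dir, use_flashpack=True, target_dtype=dtype, device="cuda"
    )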
This makes sense. We can make all of this explicitly documented. Thanks for the context!
It is not. I would have expected to see a table with the following columns:
LMK if this is still unclear.
So, according to you, the benchmarking script is okay? For the standard loading time I use:

model = DiffusersAutoModel.from_pretrained(
    temp_model_dir,
    torch_dtype=dtype,
    device_map={"": accelerate_device},
)

I just need to change the CSV header, right?
So do we need an explicit opt-in (for example, …)?
I think we can skip that for now and advise that users run the conversion with `save_pretrained(..., use_flashpack=True)` themselves.
I only see two columns in the table provided in #12999 (comment). It has the flashpack timing but not the standard `from_pretrained` timing.
Sorry for the confusion, but I'm referring to the tables in the first comment, where "Standard load" is equivalent to `from_pretrained()` with `device_map="cuda"`. I had uploaded those results with the same benchmarking script I mentioned afterwards.
Force-pushed from 321542b to 3bc3fdb (compare)
What does this PR do?
This PR adds FlashPack support to Diffusers, enabling significantly faster model loading and improved inference throughput through an optimized on-disk weight format.
What is FlashPack?
FlashPack is a weight-packing format that serializes model parameters into a contiguous, GPU-friendly layout.
It reduces the overhead of reading weights from disk and materializing them on the target device. As a result, FlashPack provides much shorter model load times.
FlashPack is particularly beneficial for large transformer-based diffusion models, including video and multi-modal pipelines.
Architecture & Design
FlashPack is implemented as a storage-level optimization, not a runtime execution hook.
Design principles
- Opt-in via `use_flashpack=True`

Workflow
- Save packed weights with `save_pretrained(..., use_flashpack=True)`
- Load them back with `use_flashpack=True`

Note: To push a FlashPack model to the Hugging Face Hub, users must first save with `use_flashpack=True` and then manually upload the resulting directory, as `push_to_hub=True` currently only supports standard PyTorch/safetensors weights.
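For that manual upload step, something along the lines of `huggingface_hub`'s `upload_folder` should work; this is a sketch, and the local directory and repo name are placeholders:

from huggingface_hub import upload_folder

# Upload the directory produced by save_pretrained(..., use_flashpack=True),
# including model.flashpack, to a Hub repo you own.
upload_folder(
    repo_id="your-username/wan2.1-t2v-1.3b-transformer-flashpack",
    folder_path="wan2.1-t2v-1.3b-transformer-flashpack",
)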
Benchmark Results

Model Load Time Comparison
Effective Weight Loading Throughput & FlashPack Conversion Cost
Note: All timings use direct GPU loading (device_map="cuda" or equivalent) to ensure a fair comparison between standard loading and FlashPack.
Benchmark Setup
- Precision: `bfloat16`

Usage
One-time FlashPack conversion
This produces the FlashPack-packed weights alongside the regular weights in the repo.
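A minimal sketch of the one-time conversion, based on the `save_pretrained(..., use_flashpack=True)` call used in the benchmark script above; the checkpoint and output directory are just examples:

import torch
from diffusers.models import AutoModel

# Load the denoiser once from the standard weights...
model = AutoModel.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

# ...and save it again with FlashPack packing enabled (a one-time cost).
model.save_pretrained(
    "wan2.1-t2v-1.3b-transformer-flashpack",
    target_dtype=torch.bfloat16,
    use_flashpack=True,
)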
Loading with FlashPack
FlashPack load/inference falls back gracefully to standard weights if FlashPack output does not exist.
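A corresponding loading sketch, mirroring the `from_pretrained(..., use_flashpack=True)` call from the benchmark script (directory name as in the previous sketch):

import torch
from diffusers.models import AutoModel

# Loads model.flashpack if present; otherwise falls back to the standard weights.
model = AutoModel.from_pretrained(
    "wan2.1-t2v-1.3b-transformer-flashpack",
    use_flashpack=True,
    target_dtype=torch.bfloat16,
    device="cuda",
)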
Inference-Time Results (FlashPack)
FlashPack optimizes model loading, not transformer compute. Inference speedups are expected to be minimal.
Fixes #12550
Who can review?
@sayakpaul @yiyixuxu @DN6