[feat] added fal-flashpack support #12999
base: main
Conversation
Thanks for the comprehensive PR. To benchmark model loading, are we reporting the pipeline loading time or just the denoiser? Could we benchmark just the denoiser loading? Also, during the standard load, let's specify `device_map="cuda"`.
sayakpaul
left a comment
Thanks for starting this. I left some questions.
Force-pushed from 68949d0 to e5bb10c (compare)
For benchmarking model loading, I used the official benchmarking script from the original flashpack repository, which measures only the diffusion transformer (DiT / denoiser) loading time, not the full pipeline. The results reported in the table above therefore correspond exclusively to denoiser loading for the respective Diffusers models. Here is the script:

import csv
import gc
import os
import shutil
import tempfile
import time

import torch
from huggingface_hub import snapshot_download

from diffusers.models import AutoModel as DiffusersAutoModel


def test_model(
    repo_id: str,
    subfolder: str | None = None,
    accelerate_device: str | torch.device = "cuda",
    flashpack_device: str | torch.device = "cuda",
    dtype: torch.dtype | None = None,
    use_transformers: bool = False,
) -> tuple[float, float, int]:
    """
    Test a model from a repository.
    """
    repo_dir = snapshot_download(
        repo_id, allow_patterns=None if subfolder is None else [f"{subfolder}/*"]
    )
    model_dir = repo_dir if subfolder is None else os.path.join(repo_dir, subfolder)
    saved_flashpack_path = os.path.join(model_dir, "model.flashpack")
    saved_flashpack_config_path = os.path.join(model_dir, "flashpack_config.json")

    with tempfile.TemporaryDirectory() as tmpdir:
        # Make a new model directory with the model in it so it isn't cached
        temp_model_dir = os.path.join(tmpdir, "model")
        flashpack_dir = os.path.join(tmpdir, "flashpack")
        os.makedirs(flashpack_dir, exist_ok=True)
        print("Copying model to temporary directory")
        shutil.copytree(model_dir, temp_model_dir)

        # Load from the temporary model directory
        print("Loading model from temporary directory using from_pretrained")
        start_time = time.time()
        model = DiffusersAutoModel.from_pretrained(
            temp_model_dir,
            torch_dtype=dtype,
            device_map={"": accelerate_device},
        )
        end_time = time.time()
        accelerate_time = end_time - start_time
        print(f"Time taken with from_pretrained: {accelerate_time} seconds")

        if os.path.exists(saved_flashpack_path) and os.path.exists(
            saved_flashpack_config_path
        ):
            print("Copying flashpack to temporary directory")
            shutil.copy(
                saved_flashpack_path, os.path.join(flashpack_dir, "model.flashpack")
            )
            shutil.copy(
                saved_flashpack_config_path, os.path.join(flashpack_dir, "config.json")
            )
        else:
            print("Packing model to flashpack")
            pack_start_time = time.time()
            model.save_pretrained(
                flashpack_dir, target_dtype=dtype, use_flashpack=True
            )
            pack_end_time = time.time()
            print(
                f"Time taken for flashpack packing: {pack_end_time - pack_start_time} seconds"
            )
            # Copy back to the original model directory
            shutil.copy(
                os.path.join(flashpack_dir, "model.flashpack"), saved_flashpack_path
            )
            shutil.copy(
                os.path.join(flashpack_dir, "config.json"), saved_flashpack_config_path
            )

        del model
        sync_and_flush()

        print("Loading model from flashpack directory using from_pretrained_flashpack")
        flashpack_start_time = time.time()
        flashpack_model = DiffusersAutoModel.from_pretrained(
            flashpack_dir, device=flashpack_device, target_dtype=dtype, use_flashpack=True
        )
        flashpack_end_time = time.time()
        flashpack_time = flashpack_end_time - flashpack_start_time
        print(f"Time taken with flashpack loading: {flashpack_time} seconds")

        total_numel = 0
        for param in flashpack_model.parameters():
            total_numel += param.numel()
        total_bytes = total_numel * dtype.itemsize

        del flashpack_model
        sync_and_flush()
        return accelerate_time, flashpack_time, total_bytes


def test_wan_small_transformer() -> tuple[float, float, int]:
    return test_model(
        repo_id="Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
        subfolder="transformer",
        accelerate_device="cuda:0" if torch.cuda.is_available() else "cpu",
        flashpack_device="cuda:1" if torch.cuda.is_available() else "cpu",
        dtype=torch.bfloat16,
    )


def test_wan_large_transformer() -> tuple[float, float, int]:
    return test_model(
        repo_id="Wan-AI/Wan2.1-T2V-14B-Diffusers",
        subfolder="transformer",
        accelerate_device="cuda:0" if torch.cuda.is_available() else "cpu",
        flashpack_device="cuda:1" if torch.cuda.is_available() else "cpu",
        dtype=torch.bfloat16,
    )


def test_flux_transformer() -> tuple[float, float, int]:
    return test_model(
        repo_id="black-forest-labs/FLUX.1-dev",
        subfolder="transformer",
        accelerate_device="cuda:0" if torch.cuda.is_available() else "cpu",
        flashpack_device="cuda:1" if torch.cuda.is_available() else "cpu",
        dtype=torch.bfloat16,
    )


def test_qwen_transformer() -> tuple[float, float, int]:
    return test_model(
        repo_id="Qwen/Qwen-Image-Edit",
        subfolder="transformer",
        accelerate_device="cuda:0" if torch.cuda.is_available() else "cpu",
        flashpack_device="cuda:1" if torch.cuda.is_available() else "cpu",
        dtype=torch.bfloat16,
        use_transformers=True,
    )


def print_test_result(
    model_name: str,
    accelerate_time: float,
    flashpack_time: float,
    total_bytes: int,
) -> None:
    print(f"{model_name}: Accelerate time: {accelerate_time} seconds")
    print(f"{model_name}: Flashpack time: {flashpack_time} seconds")
    accelerate_gbps = (total_bytes / 1000**3) / accelerate_time
    flashpack_gbps = (total_bytes / 1000**3) / flashpack_time
    print(f"{model_name}: Accelerate GB/s: {accelerate_gbps} GB/s")
    print(f"{model_name}: Flashpack GB/s: {flashpack_gbps} GB/s")


def sync_and_flush() -> None:
    torch.cuda.empty_cache()
    gc.collect()
    os.system("sync")
    if os.geteuid() == 0:
        os.system("echo 3 | tee /proc/sys/vm/drop_caches")


if __name__ == "__main__":
    with open("benchmark_results_finall.csv", "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(
            ["model", "accelerate_time", "flashpack_time", "total_overhead_time", "total_bytes"]
        )
        for i in range(10):
            for test_model_name, test_func in [
                ("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", test_wan_small_transformer),
                ("Wan-AI/Wan2.1-T2V-14B-Diffusers", test_wan_large_transformer),
                ("black-forest-labs/FLUX.1-dev", test_flux_transformer),
                ("Qwen/Qwen-Image-Edit", test_qwen_transformer),
            ]:
                accelerate_time, flashpack_time, total_bytes = test_func()
                # ---- main benchmark CSV (unchanged, written every run) ----
                writer.writerow(
                    [
                        test_model_name,
                        accelerate_time,
                        flashpack_time,
                        total_bytes,
                    ]
                )
                print_test_result(
                    test_model_name,
                    accelerate_time,
                    flashpack_time,
                    total_bytes,
                )

Regarding FlashPack overhead, the total overhead time is the sum of:
Note: All timings use direct GPU loading (device_map="cuda" or equivalent) to ensure a fair comparison between standard loading and FlashPack.
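For reference, "direct GPU loading" on the standard path corresponds roughly to a call like the one below; this is a minimal sketch, and the checkpoint is simply one of the benchmarked models:

import torch
from diffusers.models import AutoModel

# Standard (non-FlashPack) load, with the weights materialized directly on the GPU.
model = AutoModel.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)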
Thanks, so it should be `device_map="cuda"`.
The table doesn't make it clear what the time with the standard load corresponds to.
sayakpaul
left a comment
Left some further comments.
except EnvironmentError:
    resolved_model_file = None
with no_init_weights():
    model = cls.from_config(config, **unused_kwargs)
Why do we need to initialize the model here? It is already done here:
diffusers/src/diffusers/models/modeling_utils.py, line 1278 in ec37629:

model = cls.from_config(config, **unused_kwargs)
IMO, the flashpack weight-name resolution code can go around the following block:
diffusers/src/diffusers/models/modeling_utils.py, lines 1242 to 1256 in ec37629:

if resolved_model_file is None and not is_sharded:
    resolved_model_file = _get_model_file(
        pretrained_model_name_or_path,
        weights_name=_add_variant(WEIGHTS_NAME, variant),
        cache_dir=cache_dir,
        force_download=force_download,
        proxies=proxies,
        local_files_only=local_files_only,
        token=token,
        revision=revision,
        subfolder=subfolder,
        user_agent=user_agent,
        commit_hash=commit_hash,
        dduf_entries=dduf_entries,
    )
And then the rest of the code can follow.
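To illustrate the suggested ordering only (this is not the PR's actual code), a standalone sketch of "try the FlashPack file first, then fall back to standard weight resolution" could look like this; the `resolve_weights` helper and the exact file names are assumptions for illustration:

import os

def resolve_weights(model_dir: str, use_flashpack: bool) -> str:
    # Hypothetical helper, not part of diffusers: prefer a packed file when the
    # user asked for FlashPack, otherwise fall back to the usual weight files.
    if use_flashpack:
        flashpack_file = os.path.join(model_dir, "model.flashpack")
        if os.path.exists(flashpack_file):
            return flashpack_file
    for name in ("diffusion_pytorch_model.safetensors", "diffusion_pytorch_model.bin"):
        candidate = os.path.join(model_dir, name)
        if os.path.exists(candidate):
            return candidate
    raise EnvironmentError(f"No weight files found in {model_dir}")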
I've changed the `from_pretrained()` code path and removed the redundant initialization. Happy to clarify or discuss further if you have any questions.
Also, WDYT of running a conversion on the fly from bin / safetensors -> flashpack when a user requests to load in flashpack for a non-flashpack checkpoint?
I hope it is clear now.
I don't think automatic conversion should be enabled by default. FlashPack conversion can take a long time for large models, so triggering it automatically when a user loads a non-FlashPack checkpoint with `use_flashpack=True` could add a large, unexpected delay to that first load.
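For illustration only, an opt-in on-the-fly conversion along the lines discussed here could look roughly like the sketch below; the `load_with_flashpack` helper is hypothetical, and the `use_flashpack` / `target_dtype` / `device` arguments follow the benchmark script above:

import os
import torch
from diffusers.models import AutoModel

def load_with_flashpack(model_dir: str, dtype: torch.dtype = torch.bfloat16):
    # Hypothetical convenience wrapper: pack the checkpoint once if no
    # FlashPack file exists yet, then load from the packed weights.
    packed_file = os.path.join(model_dir, "model.flashpack")
    if not os.path.exists(packed_file):
        model = AutoModel.from_pretrained(model_dir, torch_dtype=dtype)
        model.save_pretrained(model_dir, target_dtype=dtype, use_flashpack=True)
        del model
    return AutoModel.from_pretrained(
        model_dir, use_flashpack=True, target_dtype=dtype, device="cuda"
    )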
This makes sense. We can make all of this explicitly documented. Thanks for the context!
It is not. I would have expected to see a table with the following columns:
LMK if this is still unclear.
So, according to you, the benchmarking script is okay? For the standard loading time I use:

model = DiffusersAutoModel.from_pretrained(
    temp_model_dir,
    torch_dtype=dtype,
    device_map={"": accelerate_device},
)

I just need to change the CSV header, right?
So do we need an explicit opt-in (for example, …)?
I think we can skip that for now and advise that users run the conversion with `save_pretrained(..., use_flashpack=True)` themselves.
I only see two columns in the table provided in #12999 (comment). It has the flashpack timing but not the standard `from_pretrained` timing.
Sorry for the confusion, but I'm referring to the tables in the first comment, where "Standard load" is equivalent to `from_pretrained()` with `device_map="cuda"`. I had uploaded those results with the same benchmarking script I mentioned afterwards.
Force-pushed from 321542b to 3bc3fdb (compare)
What does this PR do?
This PR adds FlashPack support to Diffusers, enabling significantly faster model loading and improved inference throughput through an optimized on-disk weight format.
What is FlashPack?
FlashPack is a weight-packing format that serializes model parameters into a contiguous, GPU-friendly layout.
It reduces the overhead of reading weights from disk and materializing them on the target device. As a result, FlashPack provides much shorter model load times.
FlashPack is particularly beneficial for large transformer-based diffusion models, including video and multi-modal pipelines.
Architecture & Design
FlashPack is implemented as a storage-level optimization, not a runtime execution hook.
Design principles
- Opt-in via `use_flashpack=True`

Workflow
- Save packed weights with `save_pretrained(..., use_flashpack=True)`
- Load them back with `use_flashpack=True`

Note: To push a FlashPack model to the Hugging Face Hub, users must first save with `use_flashpack=True` and then manually upload the resulting directory, as `push_to_hub=True` currently only supports standard PyTorch/safetensors weights.
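For that manual upload step, something along the lines of `huggingface_hub`'s `upload_folder` should work; this is a sketch, and the local directory and repo name are placeholders:

from huggingface_hub import upload_folder

# Upload the directory produced by save_pretrained(..., use_flashpack=True),
# including model.flashpack, to a Hub repo you own.
upload_folder(
    repo_id="your-username/wan2.1-t2v-1.3b-transformer-flashpack",
    folder_path="wan2.1-t2v-1.3b-transformer-flashpack",
)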
Benchmark Results

Model Load Time Comparison
Effective Weight Loading Throughput & FlashPack Conversion Cost
Note: All timings use direct GPU loading (device_map="cuda" or equivalent) to ensure a fair comparison between standard loading and FlashPack.
Benchmark Setup
- Precision: `bfloat16`

Usage
One-time FlashPack conversion
This produces the FlashPack-packed weights alongside the regular weights in the repo.
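A minimal sketch of the one-time conversion, based on the `save_pretrained(..., use_flashpack=True)` call used in the benchmark script above; the checkpoint and output directory are just examples:

import torch
from diffusers.models import AutoModel

# Load the denoiser once from the standard weights...
model = AutoModel.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

# ...and save it again with FlashPack packing enabled (a one-time cost).
model.save_pretrained(
    "wan2.1-t2v-1.3b-transformer-flashpack",
    target_dtype=torch.bfloat16,
    use_flashpack=True,
)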
Loading with FlashPack
FlashPack load/inference falls back gracefully to standard weights if FlashPack output does not exist.
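A corresponding loading sketch, mirroring the `from_pretrained(..., use_flashpack=True)` call from the benchmark script (directory name as in the previous sketch):

import torch
from diffusers.models import AutoModel

# Loads model.flashpack if present; otherwise falls back to the standard weights.
model = AutoModel.from_pretrained(
    "wan2.1-t2v-1.3b-transformer-flashpack",
    use_flashpack=True,
    target_dtype=torch.bfloat16,
    device="cuda",
)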
Inference-Time Results (FlashPack)
FlashPack optimizes model loading, not transformer compute. Inference speedups are expected to be minimal.
Fixes #12550
Who can review?
@sayakpaul @yiyixuxu @DN6