
Conversation

@kashif (Contributor) commented Jan 16, 2026

What does this PR do?

This pull request sets attention masks that are all ones to None.


Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@kashif kashif requested a review from sayakpaul January 16, 2026 15:10
@sayakpaul (Member) left a comment:

Thanks!

Could you update the PR with the following?

  • Shed light on what caused the speed regression
  • Add a test with masks to the compilation tests here
  • Do a before/after comparison of the outputs with this PR

@kashif (Contributor, Author) commented Jan 16, 2026

Will do!

```python
model = self.model_class(**init_dict).to(torch_device)
model.eval()

compiled_model = torch.compile(model, mode="default", fullgraph=False)
```
@sayakpaul (Member) commented on this diff:

Some notes:

  • Usually, it should be `model.compile()` as it doesn't wrap the underlying model into a dynamo wrapper. This way, we don't have to add any extra code to handle it.
  • Why is `fullgraph=False` here?
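For illustration, a minimal sketch of the difference (reusing `model` and `inputs_no_mask` from the test; the `mode` choice is assumed):

```python
import torch

# torch.compile(...) returns an OptimizedModule that wraps the original model,
# so test utilities have to unwrap it to reach the module's own attributes.
compiled_model = torch.compile(model, mode="default")
assert isinstance(compiled_model, torch._dynamo.eval_frame.OptimizedModule)

# nn.Module.compile() compiles in place: `model` keeps its class and attributes,
# and no extra unwrapping code is needed.
model.compile(mode="default")
output = model(**inputs_no_mask)  # runs the compiled forward
```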

Comment on lines 292 to 293:

```python
with torch.no_grad():
    output_no_mask = compiled_model(**inputs_no_mask)
```
@sayakpaul (Member) commented:

Does it lead to graph breaks? If not, then we should add additional contexts:

`torch._dynamo.config.patch(error_on_recompile=True)`
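For reference, a minimal sketch of how that guard could wrap the forward passes (`inputs_with_mask` is a hypothetical name for the masked inputs):

```python
import torch

# error_on_recompile=True makes dynamo raise instead of silently recompiling,
# so the test fails loudly if the masked inputs trigger a new graph.
with torch._dynamo.config.patch(error_on_recompile=True), torch.no_grad():
    output_no_mask = compiled_model(**inputs_no_mask)
    output_with_mask = compiled_model(**inputs_with_mask)  # hypothetical masked inputs
```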

@dxqb (Contributor) commented Jan 19, 2026

Does this PR lack a merge, or is the amount of code changes intentional and really part of this PR only? (+318 −170)
If intentional, could you explain what it does and why all these changes are necessary?

Avoiding masks when they're not necessary used to be a one-liner (on a boolean mask):

`attention_mask = attention_mask if not torch.all(text_attention_mask) else None`

@kashif (Contributor, Author) commented Jan 19, 2026

@dxqb I initially miscalculated and was adding support for compiling the pipeline; I will revert and simplify.

@dxqb (Contributor) commented Jan 19, 2026

> @dxqb I initially miscalculated and was adding support for compiling the pipeline; I will revert and simplify.

Thanks! Regarding compile:

  • Regional compilation is usually as efficient as full compilation (even if full compilation were possible). Regional compilation compiles the transformer blocks, but not the entire pipeline.
  • Whenever you want to branch on GPU data, that's a graph break for torch.compile: either it's inefficient, or it fails, depending on fullgraph. Or, in less abstract terms:
    you want to set your attention mask to None if the entire attention mask is True. But the attention mask lives on the GPU. Checking whether all values of a GPU tensor are True requires a transfer back to the CPU - that's always a graph break and cannot be compiled (efficiently).

Therefore, I'd suggest to:

  • check for an all-True mask in the pipeline, and
  • pass a None mask to the transformer block in that case - but don't check any GPU data in the transformer block, so the transformer block can be compiled. (A sketch of this pattern follows below.)
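A minimal sketch of that split, under the assumption that the pipeline stays eager and only the transformer blocks are compiled (function names here are illustrative, not the PR's actual call sites):

```python
import torch
import torch.nn.functional as F


def prepare_attention_mask(attention_mask):
    # Pipeline side (eager): the .all() check syncs GPU -> CPU once per call,
    # which is fine here but would be a graph break inside a compiled region.
    if attention_mask is not None and bool(attention_mask.all()):
        return None
    return attention_mask


def attention(q, k, v, attention_mask=None):
    # Transformer-block side (compiled): only a Python-level None check,
    # which torch.compile specializes on without reading GPU data.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attention_mask)
```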

@dxqb (Contributor) commented Jan 19, 2026

> Thanks!
>
> Could you update the PR with the following?
>
> • Shed light on what caused the speed regression

Here is a benchmark of the impact of using a mask unnecessarily (second graph): #12870 (comment)
torch SDPA falls back to a non-flash algorithm if a mask is used.
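A quick way to observe the fallback (a sketch: shapes and dtype are assumed, and `torch.nn.attention` needs PyTorch ≥ 2.3):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q = k = v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
mask = torch.ones(2, 1, 1024, 1024, device="cuda", dtype=torch.bool)

# Restricted to the flash backend, SDPA works without a mask...
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
    # ...but the call below would raise, because the flash kernel accepts no
    # explicit attn_mask; outside this context SDPA silently falls back to
    # the slower efficient/math backends instead.
    # F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```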

@kashif (Contributor, Author) commented Jan 19, 2026

Yes, I also benchmarked the mask and got:

[Nsight profile: nsight_sdpa_mask_regression]

@kashif (Contributor, Author) commented Jan 19, 2026

Thanks @dxqb, please check now.
