
Sage Attention sm90 causes confetti/noisy output #12783

@dylanprins

Describe the bug

When using _sage_qk_int8_pv_fp8_cuda_sm90 as the attention backend on Wan2.2 I2V (the reproduction below uses the T2V checkpoint), I notice that the output is broken:

[Image: example output frame showing the confetti/noise artifacts]

It works fine with _flash_3_hub and _sage_qk_int8_pv_fp16_cuda.

Does it matter whether we use the original fp16 model? Do we need the fp8 model, or should this just work out of the box?

Can somebody explain this type of artifact? Does it point to a specific issue?
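
As a sanity check (a hedged suggestion on my part, not from the original report): the backend name hard-codes sm90, so it may be worth confirming the device actually reports compute capability 9.0 before digging into the kernel itself:

import torch

# An H100 should report (9, 0); anything else would mean the sm90-specific
# kernel is being selected on hardware it was not built for.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm{major}{minor}")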

Reproduction

import torch
from diffusers import WanPipeline, AutoencoderKLWan
from diffusers.utils import export_to_video

dtype = torch.bfloat16
device = "cuda"
vae = AutoencoderKLWan.from_pretrained("Wan-AI/Wan2.2-T2V-A14B-Diffusers", subfolder="vae", torch_dtype=torch.float32)  # VAE kept in fp32
pipe = WanPipeline.from_pretrained("Wan-AI/Wan2.2-T2V-A14B-Diffusers", vae=vae, torch_dtype=dtype)
pipe.transformer.set_attention_backend("_sage_qk_int8_pv_fp8_cuda_sm90")  # backend under test

pipe.to(device)

height = 720
width = 1280

prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
# Standard Wan negative prompt (Chinese). Roughly: garish tones, overexposed, static, blurry details,
# subtitles, style, artwork, painting, frame, still, overall gray, worst quality, low quality,
# JPEG compression artifacts, ugly, mutilated, extra fingers, poorly drawn hands, poorly drawn face,
# deformed, disfigured, misshapen limbs, fused fingers, motionless frame, cluttered background,
# three legs, many people in the background, walking backwards.
negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"
output = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=81,
    guidance_scale=4.0,
    guidance_scale_2=3.0,
    num_inference_steps=40,
).frames[0]
export_to_video(output, "t2v_out.mp4", fps=16)
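
For what it's worth, a minimal A/B harness (a sketch that reuses the pipeline object above; the fixed seed and output file naming are illustrative, not from the original report) that renders the same prompt under each backend mentioned in this issue, so the sm90 kernel can be isolated on identical inputs:

# Hypothetical comparison loop: same seed, one video per backend.
backends = [
    "_flash_3_hub",
    "_sage_qk_int8_pv_fp16_cuda",
    "_sage_qk_int8_pv_fp8_cuda_sm90",
]
for backend in backends:
    pipe.transformer.set_attention_backend(backend)
    generator = torch.Generator(device=device).manual_seed(0)  # fixed seed for a fair comparison
    frames = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=height,
        width=width,
        num_frames=81,
        guidance_scale=4.0,
        guidance_scale_2=3.0,
        num_inference_steps=40,
        generator=generator,
    ).frames[0]
    export_to_video(frames, f"t2v_out{backend}.mp4", fps=16)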

Logs

System Info

diffusers = 0.36.0.dev0
python = 3.11
cuda = 12.8
nvidia driver = 570.195.03

Using H100 80GB HBM3

Who can help?

@DN6 @yiyixuxu
