Replies: 2 comments
-
Experiments on mel-spectrogram generation are far from satisfactory with the
-
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
-
Excuse me, I am learning the code of `class StableAudioDiTModel` and have a few questions.

What is the argument `global_states_input_dim` used for? It seems to be a required component that gets packed in front of the `hidden_states` sequence, and its default dimension appears to be larger than the transformer `inner_dim`. What does that component represent? If it is meant to take in additional conditions, that could apparently be done in the encoder outside the model. Also, compared with this concatenation, I think it may be better to repeat the condition embedding to the sequence length and concatenate on the hidden dimension.

What is the `sample_size: int = 1024` parameter used for in model creation? It does not seem to be used during the `forward` call.

The docstring of `StableAudioDiTModel.forward` says `encoder_attention_mask` (`torch.Tensor` of shape `(batch_size, sequence_len)`, *optional*). Why is the shape of `encoder_attention_mask` `batch_size x sequence_len` instead of `batch_size x encoder_sequence_len`, matching the shape of the input `encoder_hidden_states`? And why is the return value of this `forward` directly `(hidden_states,)` rather than `(hidden_states * attention_mask,)`?

About `StableAudioDiTModel.forward`, what are the shapes of the parameters `rotary_embedding` and `timestep`?

Why is the global embedding concatenated in front of `hidden_states`? I think `hidden_states` is what we want to generate in the DiT pipeline, while `encoder_hidden_states` is the conditioning signal, so the global embedding should be used to enrich `encoder_hidden_states`. Moreover, concatenating the global embedding in front of the input `hidden_states` sequence changes the input sequence length; according to [1], should the concatenation instead be done along the feature dimension?
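To make the sequence-length concern concrete, here is a minimal sketch (the shapes are illustrative assumptions, not the actual `StableAudioDiTModel` code) contrasting prepending a projected global embedding as an extra token with repeating it and concatenating along the feature dimension:

```python
# Minimal sketch with assumed shapes (batch=2, seq_len=1024, inner_dim=1536);
# NOT the diffusers implementation, just a shape comparison of the two
# concatenation strategies discussed above.
import torch

batch, seq_len, inner_dim = 2, 1024, 1536
hidden_states = torch.randn(batch, seq_len, inner_dim)   # latent sequence the DiT denoises
global_embedding = torch.randn(batch, inner_dim)          # pooled global condition, projected to inner_dim

# (a) Sequence-dimension concatenation: the global embedding becomes one extra
#     "token" prepended to the sequence, so every self-attention layer can
#     attend to it, at the cost of growing seq_len by 1.
seq_concat = torch.cat([global_embedding.unsqueeze(1), hidden_states], dim=1)
print(seq_concat.shape)   # torch.Size([2, 1025, 1536])

# (b) Feature-dimension concatenation (the alternative raised above): repeat
#     the global embedding over the sequence and concatenate on the channel
#     axis, which keeps seq_len but doubles the feature dim and would need an
#     extra projection back to inner_dim.
feat_concat = torch.cat(
    [hidden_states, global_embedding.unsqueeze(1).expand(-1, seq_len, -1)],
    dim=-1,
)
print(feat_concat.shape)  # torch.Size([2, 1024, 3072])
```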
Also, it seems to use a normal `LayerNorm` layer instead of an adaLN layer?
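For reference, here is a minimal sketch of what an adaLN layer does compared with a plain `LayerNorm`, written from the general DiT recipe rather than from the diffusers source; the class and argument names are illustrative assumptions:

```python
# Sketch of adaptive LayerNorm (adaLN) following the general DiT recipe; the
# class name and conditioning_dim are hypothetical, not taken from diffusers.
import torch
import torch.nn as nn

class AdaLayerNormSketch(nn.Module):
    def __init__(self, dim: int, conditioning_dim: int):
        super().__init__()
        # Plain LayerNorm without learned affine parameters ...
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # ... whose scale and shift are predicted from the conditioning
        # (e.g. a timestep or pooled global embedding) instead.
        self.to_scale_shift = nn.Linear(conditioning_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(cond).unsqueeze(1).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale) + shift

x = torch.randn(2, 1024, 1536)   # (batch, seq_len, inner_dim)
cond = torch.randn(2, 768)       # pooled conditioning vector
out = AdaLayerNormSketch(1536, 768)(x, cond)
print(out.shape)                 # torch.Size([2, 1024, 1536])
```

A plain `nn.LayerNorm(dim)` applies the same learned scale and shift to every sample, whereas adaLN makes them a function of the conditioning, which is how DiT-style blocks typically inject the timestep or class signal.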