Replies: 2 comments
-
Experiments on mel-spectrogram generation are far from satisfactory with the
-
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
-
Excuse me, I am learning the code of `class StableAudioDiTModel` and have a few questions.

What is the argument `global_states_input_dim` used for? It seems to be a required component that gets packed in front of the `hidden_states` sequence, and its default dimension appears to be larger than the transformer `inner_dim`. What does that component represent? If it is meant to take in additional conditions, that could apparently be done in the encoder outside the model. Also, compared with this concatenation, I think it may be better to repeat the condition embedding to the sequence length and concatenate on the hidden dimension.

What is the `sample_size: int = 1024` parameter used for in model creation? It does not seem to be used during the `forward` call.

The docstring of `StableAudioDiTModel.forward` says `encoder_attention_mask` (`torch.Tensor` of shape `(batch_size, sequence_len)`, *optional*). Why is the shape of `encoder_attention_mask` `batch_size x sequence_len` instead of `batch_size x encoder_sequence_len`, matching the shape of the input `encoder_hidden_states`? And why is the return value of this `forward` directly `(hidden_states,)` rather than `(hidden_states * attention_mask,)`?

About `StableAudioDiTModel.forward`, what are the shapes of the parameters `rotary_embedding` and `timestep`?

Why is the global embedding concatenated in front of `hidden_states`? I think `hidden_states` is what we want to generate in the DiT pipeline, while `encoder_hidden_states` is the conditioning signal, so the global embedding should be used to enrich `encoder_hidden_states`. Moreover, concatenating the global embedding in front of the input `hidden_states` sequence changes the input sequence length; according to [1], should the concatenation instead be done along the feature dimension?
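To make the sequence-length concern concrete, here is a minimal sketch (the shapes are illustrative assumptions, not the actual `StableAudioDiTModel` code) contrasting prepending a projected global embedding as an extra token with repeating it and concatenating along the feature dimension:

```python
# Minimal sketch with assumed shapes (batch=2, seq_len=1024, inner_dim=1536);
# NOT the diffusers implementation, just a shape comparison of the two
# concatenation strategies discussed above.
import torch

batch, seq_len, inner_dim = 2, 1024, 1536
hidden_states = torch.randn(batch, seq_len, inner_dim)   # latent sequence the DiT denoises
global_embedding = torch.randn(batch, inner_dim)          # pooled global condition, projected to inner_dim

# (a) Sequence-dimension concatenation: the global embedding becomes one extra
#     "token" prepended to the sequence, so every self-attention layer can
#     attend to it, at the cost of growing seq_len by 1.
seq_concat = torch.cat([global_embedding.unsqueeze(1), hidden_states], dim=1)
print(seq_concat.shape)   # torch.Size([2, 1025, 1536])

# (b) Feature-dimension concatenation (the alternative raised above): repeat
#     the global embedding over the sequence and concatenate on the channel
#     axis, which keeps seq_len but doubles the feature dim and would need an
#     extra projection back to inner_dim.
feat_concat = torch.cat(
    [hidden_states, global_embedding.unsqueeze(1).expand(-1, seq_len, -1)],
    dim=-1,
)
print(feat_concat.shape)  # torch.Size([2, 1024, 3072])
```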
Also, it seems to use a normal `LayerNorm` layer instead of an adaLN layer?
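For reference, here is a minimal sketch of what an adaLN layer does compared with a plain `LayerNorm`, written from the general DiT recipe rather than from the diffusers source; the class and argument names are illustrative assumptions:

```python
# Sketch of adaptive LayerNorm (adaLN) following the general DiT recipe; the
# class name and conditioning_dim are hypothetical, not taken from diffusers.
import torch
import torch.nn as nn

class AdaLayerNormSketch(nn.Module):
    def __init__(self, dim: int, conditioning_dim: int):
        super().__init__()
        # Plain LayerNorm without learned affine parameters ...
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # ... whose scale and shift are predicted from the conditioning
        # (e.g. a timestep or pooled global embedding) instead.
        self.to_scale_shift = nn.Linear(conditioning_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(cond).unsqueeze(1).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale) + shift

x = torch.randn(2, 1024, 1536)   # (batch, seq_len, inner_dim)
cond = torch.randn(2, 768)       # pooled conditioning vector
out = AdaLayerNormSketch(1536, 768)(x, cond)
print(out.shape)                 # torch.Size([2, 1024, 1536])
```

A plain `nn.LayerNorm(dim)` applies the same learned scale and shift to every sample, whereas adaLN makes them a function of the conditioning, which is how DiT-style blocks typically inject the timestep or class signal.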