add-qwen3-omni-thinker #1202

Open
SangChengC wants to merge 7 commits into main from add-qwen3-omni-moe

Conversation

@SangChengC
Contributor

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @SangChengC, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request extends the LightLLM framework with support for the Qwen3 Omni Thinker model. It adds the core model definition, dedicated components for processing audio and visual inputs, and updates to the tokenizer and inference pipeline so that the new multimodal model can be loaded, configured, and served.

Highlights

  • New Model Support: Introduced support for the Qwen3 Omni Thinker multimodal model, integrating it into the LightLLM framework.
  • Multimodal Capabilities: Added dedicated audio and visual processing components, including Qwen3OmniMoeAudioEncoder and Qwen3OmniMoeVisionTransformerPretrainedModel, to handle diverse input modalities.
  • Custom Tokenization and Inference: Implemented QWen3OmniTokenizer for specialized multimodal tokenization and Qwen3OmniMOEInferStateInfo for managing the model's unique inference state.
  • Configuration System Updates: Modified configuration utilities and server-side model initialization logic to correctly parse and load model parameters from a new thinker_config structure.


Changelog
  • lightllm/models/__init__.py
    • Added import for Qwen3OmniMOETpPartModel to register the new model.
  • lightllm/models/qwen2/layer_weights/pre_and_post_layer_weight.py
    • Modified Qwen2PreAndPostLayerWeight to initialize lm_head_weight_ using LMHeadWeight, referencing 'thinker.lm_head.weight'.
  • lightllm/models/qwen2_vl/infer_struct.py
    • Added use_image_h parameter to the get_mrope_position function call.
  • lightllm/models/qwen2_vl/triton_kernel/get_mrope_position_ids.py
    • Added use_image_h parameter to get_mrope_position_triton and included conditional logic to modify b_image_thwd.
  • lightllm/models/qwen3_omni_moe_thinker/audio_process.py
    • Added WhisperFeatureExtractor class for audio feature extraction, enabling audio modality processing.
  • lightllm/models/qwen3_omni_moe_thinker/infer_struct.py
    • Added Qwen3OmniMOEInferStateInfo class, inheriting from Qwen3VLInferStateInfo and setting use_image_h to False for specific inference handling.
  • lightllm/models/qwen3_omni_moe_thinker/layer_infer/transformer_layer_infer.py
    • Added Qwen3OmniMOETransformerLayerInfer class, inheriting from Qwen3VLMOETransformerLayerInfer and initializing mrope_section for multimodal Rotary Position Embeddings.
  • lightllm/models/qwen3_omni_moe_thinker/layer_weights/meta_weights/code2wav_causal_conv_net.py
    • Added Qwen3OmniMoeCausalConvNetWeight class for handling causal convolutional network weights.
  • lightllm/models/qwen3_omni_moe_thinker/layer_weights/meta_weights/code2wav_causal_trans_conv_net.py
    • Added Qwen3OmniMoeCode2wavCausalTransConvNetWeight class for handling causal transposed convolutional network weights.
  • lightllm/models/qwen3_omni_moe_thinker/layer_weights/meta_weights/code2wav_conv_ne_xt.py
    • Added Qwen3OmniMoeConvNeXtBlockWeight class for handling ConvNeXt block weights.
  • lightllm/models/qwen3_omni_moe_thinker/layer_weights/meta_weights/talker_resize_mlp_weight.py
    • Added Qwen3OmniMoeTalkerResizeMLPWeight class for handling MLP weights.
  • lightllm/models/qwen3_omni_moe_thinker/layer_weights/pre_and_post_layer_weight.py
    • Added Qwen3OmniMOEThinkerPreAndPostLayerWeight class, including a rename_weight_keys function to adjust weight prefixes during loading.
  • lightllm/models/qwen3_omni_moe_thinker/layer_weights/transformers_layer_weight.py
    • Added Qwen3OmniMOEThinkerTransformerLayerWeight class for transformer layer weights.
  • lightllm/models/qwen3_omni_moe_thinker/model.py
    • Added QWen3OmniTokenizer and Qwen3OmniMOETpPartModel classes, registering the new model and defining its multimodal tokenization logic, including audio token length calculation.
  • lightllm/models/qwen3_omni_moe_thinker/qwen3_omni_audio.py
    • Added Qwen3OmniMoeAudioEncoder and related components (Qwen3OmniMoeAudioEncoderLayer, Qwen3OmniMoeAudioAttention, SinusoidsPositionEmbedding) for detailed audio processing.
  • lightllm/models/qwen3_omni_moe_thinker/qwen3_omni_visual.py
    • Added Qwen3OmniMoeVisionTransformerPretrainedModel and related components (Qwen3OmniMoeVisionMLP, Qwen3OmniMoeVisionPatchEmbed, Qwen3OmniMoeVisionPatchMerger, Qwen3OmniMoeVisionBlock) for detailed visual processing.
  • lightllm/server/audioserver/model_infer/model_rpc.py
    • Modified exposed_init_model to support Qwen3OmniMoeAudioEncoder and adjusted config loading to check for thinker_config.
  • lightllm/server/tokenizer.py
    • Modified get_tokenizer to include QWen3OmniTokenizer when a thinker_config is present in the model configuration.
  • lightllm/server/visualserver/model_infer/model_rpc.py
    • Modified exposed_init_model to support Qwen3OmniMoeVisionTransformerPretrainedModel and adjusted config loading to check for thinker_config.
  • lightllm/utils/config_utils.py
    • Modified _get_config_llm_keyvalue, get_eos_token_ids, and get_vocab_size to retrieve configuration values from the nested thinker_config section (see the config sketch after this changelog).
    • Hardcoded the eos_token_id to [151645] (with a comment indicating potential future refinement).
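
As referenced above, a rough sketch of the nested configuration layout these utilities now read. Only keys quoted in the diff are shown; the vocab_size value is a placeholder, and a real config.json contains many more fields:

config_json = {
    "thinker_config": {
        "text_config": {
            "vocab_size": 152064,  # placeholder value
            "eos_token_id": 151645,
        }
    }
}

# The updated helpers look one level deeper when thinker_config is present:
text_config = config_json["thinker_config"]["text_config"]
print(text_config["eos_token_id"])  # 151645
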
Activity
  • The pull request introduces a new multimodal model, Qwen3 Omni Thinker, by adding numerous new files and modifying existing ones to integrate its audio and visual processing capabilities, tokenizer, and inference logic.
  • Configuration utilities and server-side model initialization have been updated to recognize and load this new model architecture, ensuring proper setup for multimodal inference.
  • Specific changes include new classes for audio feature extraction, visual patch embedding and merging, and specialized transformer layers, all tailored for the Qwen3 Omni Thinker model.

@gemini-code-assist (bot) left a comment


Code Review

This pull request adds support for the qwen3-omni-thinker model. The changes include new model files and modifications to existing server and utility components to handle the new model's configuration and architecture. While the overall direction is correct, I've identified several critical issues, including hardcoded values that can break other models, incorrect return types causing runtime errors, and buggy logic in utility functions. There are also some medium-severity issues, such as the use of eval and leftover debug print statements. Addressing these points will significantly improve the robustness and maintainability of the code.

Comment on lines +108 to +180
    ) -> Tuple[torch.Tensor, torch.Tensor]:

        is_batched_numpy = isinstance(raw_speech, np.ndarray) and len(raw_speech.shape) > 1
        if is_batched_numpy and len(raw_speech.shape) > 2:
            raise ValueError(f"Only mono-channel audio is supported for input to {self}")
        is_batched = is_batched_numpy or (
            isinstance(raw_speech, (list, tuple)) and (isinstance(raw_speech[0], (np.ndarray, tuple, list)))
        )

        if is_batched:
            raw_speech = [np.asarray([speech], dtype=np.float32).T for speech in raw_speech]
        elif not is_batched and not isinstance(raw_speech, np.ndarray):
            raw_speech = np.asarray(raw_speech, dtype=np.float32)
        elif isinstance(raw_speech, np.ndarray) and raw_speech.dtype is np.dtype(np.float64):
            raw_speech = raw_speech.astype(np.float32)

        # always return batch
        if not is_batched:
            raw_speech = [np.asarray([raw_speech]).T]

        batched_speech = BatchFeature({"input_features": raw_speech})

        # convert into correct format for padding

        padded_inputs = self.pad(
            batched_speech,
            padding=padding,
            max_length=max_length if max_length else self.n_samples,
            truncation=truncation,
            pad_to_multiple_of=pad_to_multiple_of,
            return_attention_mask=return_attention_mask or do_normalize,
        )

        # zero-mean and unit-variance normalization
        if do_normalize:
            padded_inputs["input_features"] = self.zero_mean_unit_var_norm(
                padded_inputs["input_features"],
                attention_mask=padded_inputs["attention_mask"],
                padding_value=self.padding_value,
            )
            padded_inputs["input_features"] = np.stack(padded_inputs["input_features"], axis=0)

        # make sure list is in array format
        input_features = padded_inputs.get("input_features").transpose(2, 0, 1)

        input_features = self._torch_extract_fbank_features(input_features[0], device)

        if isinstance(input_features[0], list):
            padded_inputs["input_features"] = [np.asarray(feature, dtype=np.float32) for feature in input_features]
        else:
            padded_inputs["input_features"] = input_features

        if return_attention_mask:
            # rescale from sample (48000) to feature (3000)
            rescaled_attention_mask = padded_inputs["attention_mask"][:, :: self.hop_length]

            # The STFT computation produces L//hop_length + 1 frames,
            # but we skip the last frame (see `_torch_extract_fbank_features`).
            # This means we need to trim the rescaled attention mask to match
            # the actual number of frames (L//hop_length) when the input length
            # is not perfectly divisible by the hop length.
            if padded_inputs["attention_mask"].shape[1] % self.hop_length != 0:
                rescaled_attention_mask = rescaled_attention_mask[:, :-1]
            padded_inputs["attention_mask"] = rescaled_attention_mask

        if return_token_timestamps is not None:
            padded_inputs["num_frames"] = [len(raw_speech_i) // self.hop_length for raw_speech_i in raw_speech]

        if return_tensors is not None:
            padded_inputs = padded_inputs.convert_to_tensors(return_tensors)

        return padded_inputs

critical

The _preprocess method is type-hinted to return a Tuple[torch.Tensor, torch.Tensor], but it currently returns a single BatchFeature object. This will cause a ValueError: too many values to unpack at the call site in qwen3_omni_audio.py, which expects two return values. The method should be updated to return the input features and their corresponding lengths as a tuple to match the type hint and the caller's expectation.

        if return_tensors is not None:
            padded_inputs = padded_inputs.convert_to_tensors(return_tensors)

        lengths = [len(raw_speech_i) // self.hop_length for raw_speech_i in raw_speech]
        if return_tensors == "pt":
            lengths = torch.tensor(lengths, device=padded_inputs["input_features"].device)
        elif return_tensors == "np":
            lengths = np.array(lengths)

        if return_token_timestamps is not None:
            padded_inputs["num_frames"] = lengths

        return padded_inputs["input_features"], lengths
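
For context, a self-contained illustration of the failure mode described above (simplified stand-ins, not the real classes):

from typing import Tuple

def _preprocess_buggy() -> Tuple[list, list]:
    # the real method builds a BatchFeature (a dict-like container) and returns it directly
    padded_inputs = {"input_features": [[0.1]], "attention_mask": [[1]], "num_frames": [1]}
    return padded_inputs

try:
    # a caller trusting the Tuple annotation unpacks two values
    features, lengths = _preprocess_buggy()
except ValueError as err:
    print(err)  # too many values to unpack (expected 2): iterating the dict yields its 3 keys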


def get_eos_token_ids(model_path: str) -> Optional[List[int]]:
    eos_token_id = _get_config_llm_keyvalue(model_path=model_path, key_name=["eos_token_id"])
    return [151645]  # TODO: revisit how to handle this later; change config.json directly?

critical

Hardcoding the return value [151645] for get_eos_token_ids is a critical issue. This will break any model that relies on this function to get its correct EOS token ID(s) from the configuration file. The hardcoded line should be removed to restore the original, correct logic of reading from the config.
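
A minimal sketch of the intended behaviour once the hardcoded value is removed: read eos_token_id from config.json and normalize it to a list. The fallback into thinker_config.text_config mirrors the quoted code; the exact structure of the real helper is an assumption:

import json
import os
from typing import List, Optional

def get_eos_token_ids_sketch(model_path: str) -> Optional[List[int]]:
    with open(os.path.join(model_path, "config.json")) as f:
        config_json = json.load(f)
    eos_token_id = config_json.get("eos_token_id")
    if eos_token_id is None and config_json.get("thinker_config") is not None:
        eos_token_id = config_json["thinker_config"].get("text_config", {}).get("eos_token_id")
    if eos_token_id is None:
        return None
    return [eos_token_id] if isinstance(eos_token_id, int) else list(eos_token_id)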

Comment on lines 8 to 17
        hidden_size = network_config["hidden_size"]
        vocab_size = network_config["vocab_size"]
        tie_word_embeddings = network_config.get("tie_word_embeddings", False)
        self.lm_head_weight_ = LMHeadWeight(
            dim=hidden_size,
            vocab_size=vocab_size,
            weight_name="thinker.lm_head.weight",
            data_type=self.data_type_,
            embedding_weight=self.wte_weight_ if tie_word_embeddings else None,
        )

high

The lm_head_weight_ is being overridden with a hardcoded weight name "thinker.lm_head.weight". This change is in the Qwen2PreAndPostLayerWeight class, which may be used by other qwen2 models. Hardcoding a model-specific weight name here can break other models that don't use this weight naming scheme. It's better to make this configurable or handle it within the specific model's weight class (Qwen3OmniMOEThinkerPreAndPostLayerWeight).
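
A simplified, self-contained sketch of the suggested structure. The real classes take more constructor arguments; the class names follow the PR, and everything else is illustrative:

class Qwen2PreAndPostLayerWeightSketch:
    # default weight name that plain qwen2 checkpoints use
    lm_head_weight_name = "lm_head.weight"

    def __init__(self):
        # the real code would build an LMHeadWeight with this name here
        self.lm_head_weight_name_used = self.lm_head_weight_name


class Qwen3OmniMOEThinkerPreAndPostLayerWeightSketch(Qwen2PreAndPostLayerWeightSketch):
    # the omni thinker checkpoint prefixes its weights with "thinker."
    lm_head_weight_name = "thinker.lm_head.weight"


assert Qwen2PreAndPostLayerWeightSketch().lm_head_weight_name_used == "lm_head.weight"
assert Qwen3OmniMOEThinkerPreAndPostLayerWeightSketch().lm_head_weight_name_used == "thinker.lm_head.weight"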

Comment on lines +28 to +29
if config_json.get("thinker_config") is not None:
value = config_json.get("thinker_config", {}).get("text_config").get(key)

high

The logic to handle thinker_config is flawed. It unconditionally overwrites the value variable if thinker_config exists, even if a value was already found from other keys. This will lead to incorrect behavior. This check should be part of the fallback chain, only executing if the value hasn't been found in other locations.
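
A self-contained sketch of the fallback order being asked for: thinker_config.text_config is consulted only when none of the regular keys produced a value. Key names follow the quoted code; the shape of the existing lookup is an assumption:

def get_llm_config_value_sketch(config_json: dict, key_names: list):
    value = None
    for key in key_names:
        value = config_json.get(key)
        if value is not None:
            break
    # last-resort fallback instead of an unconditional override
    if value is None and config_json.get("thinker_config") is not None:
        text_config = config_json["thinker_config"].get("text_config", {})
        for key in key_names:
            value = text_config.get(key)
            if value is not None:
                break
    return value

# a top-level value must win over the nested thinker_config one
cfg = {"vocab_size": 100, "thinker_config": {"text_config": {"vocab_size": 200}}}
assert get_llm_config_value_sketch(cfg, ["vocab_size"]) == 100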

mel_len = chunk_len // 160
dilation = 1
L_in = mel_len
for (padding, kernel_size, stride) in eval("[(1,3,1)] + [(1,3,2)] "):

medium

Using eval() on a string, even if it's a literal, is generally considered unsafe and can be a performance bottleneck. It's better to use the literal value directly.

Suggested change
- for (padding, kernel_size, stride) in eval("[(1,3,1)] + [(1,3,2)] "):
+ for (padding, kernel_size, stride) in [(1, 3, 1), (1, 3, 2)]:
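
For reference, the loop above appears to apply the standard Conv1d output-length formula to estimate how many audio frames survive the encoder's front-end convolutions. A hedged, self-contained rendering (the 48000-sample chunk and the helper name are illustrative; the parameter tuples are the ones from the quoted snippet):

def conv_out_len(l_in: int, padding: int, kernel_size: int, stride: int, dilation: int = 1) -> int:
    # L_out = floor((L_in + 2*padding - dilation*(kernel_size - 1) - 1) / stride) + 1
    return (l_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

chunk_len = 48000                 # e.g. a 3 s chunk at 16 kHz
l_in = chunk_len // 160           # 300 mel frames (hop length 160)
for padding, kernel_size, stride in [(1, 3, 1), (1, 3, 2)]:
    l_in = conv_out_len(l_in, padding, kernel_size, stride)
print(l_in)                       # 150: the stride-2 convolution halves the frame count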

all_config = json.load(json_file)
self.config = all_config["thinker_config"]["text_config"]
# rename keys
print(f"self.config is {self.config}")

medium

This print statement seems to be for debugging purposes. It should be removed from the production code.
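
If the configuration dump is still useful during bring-up, a debug-level log is a less intrusive alternative (sketch using the standard library logger; lightllm may already expose its own logging utility):

import logging

logger = logging.getLogger(__name__)

def log_text_config(text_config: dict) -> None:
    # emitted only when the logger is configured at DEBUG level
    logger.debug("thinker text_config: %s", text_config)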

raise ValueError(f"cannot read audio which type is {type(item)}!")

# padding to min audio len
MIN_AUDIO_LEN = 480

medium

The magic number 480 is used multiple times in this method. It's better to define it as a module-level constant to improve readability and maintainability. For example: MIN_AUDIO_LEN = 480 at the top of the file.
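
A sketch of the suggested refactor, assuming the padding is a simple right-pad with zeros (the constant and helper names are illustrative):

import numpy as np

MIN_AUDIO_LEN = 480  # minimum number of samples the audio pipeline accepts

def pad_to_min_audio_len(audio: np.ndarray) -> np.ndarray:
    # right-pad a 1-D waveform with zeros so it has at least MIN_AUDIO_LEN samples
    if audio.shape[0] < MIN_AUDIO_LEN:
        audio = np.pad(audio, (0, MIN_AUDIO_LEN - audio.shape[0]))
    return audio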

deepstack_feature_lists.append(deepstack_feature)

hidden_states = self.merger(hidden_states)
print(f"hidden_states is {hidden_states}, deepstack is {deepstack_feature_lists}")

medium

This print statement appears to be for debugging. It should be removed before merging.
