Conversation
Summary of Changes

Hello @SangChengC, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the LightLLM framework by integrating the Qwen3 Omni Thinker model. The changes encompass the addition of core model definitions, specialized components for processing both audio and visual data, and updates to the tokenizer and inference pipeline. These modifications ensure that the new multimodal model can be seamlessly loaded, configured, and utilized for various tasks requiring understanding across different data types.
Code Review
This pull request adds support for the qwen3-omni-thinker model. The changes include new model files and modifications to existing server and utility components to handle the new model's configuration and architecture. While the overall direction is correct, I've identified several critical issues, including hardcoded values that can break other models, incorrect return types causing runtime errors, and buggy logic in utility functions. There are also some medium-severity issues such as the use of eval and leftover debug print statements. Addressing these points will significantly improve the robustness and maintainability of the code.
) -> Tuple[torch.Tensor, torch.Tensor]:

    is_batched_numpy = isinstance(raw_speech, np.ndarray) and len(raw_speech.shape) > 1
    if is_batched_numpy and len(raw_speech.shape) > 2:
        raise ValueError(f"Only mono-channel audio is supported for input to {self}")
    is_batched = is_batched_numpy or (
        isinstance(raw_speech, (list, tuple)) and (isinstance(raw_speech[0], (np.ndarray, tuple, list)))
    )

    if is_batched:
        raw_speech = [np.asarray([speech], dtype=np.float32).T for speech in raw_speech]
    elif not is_batched and not isinstance(raw_speech, np.ndarray):
        raw_speech = np.asarray(raw_speech, dtype=np.float32)
    elif isinstance(raw_speech, np.ndarray) and raw_speech.dtype is np.dtype(np.float64):
        raw_speech = raw_speech.astype(np.float32)

    # always return batch
    if not is_batched:
        raw_speech = [np.asarray([raw_speech]).T]

    batched_speech = BatchFeature({"input_features": raw_speech})

    # convert into correct format for padding
    padded_inputs = self.pad(
        batched_speech,
        padding=padding,
        max_length=max_length if max_length else self.n_samples,
        truncation=truncation,
        pad_to_multiple_of=pad_to_multiple_of,
        return_attention_mask=return_attention_mask or do_normalize,
    )

    # zero-mean and unit-variance normalization
    if do_normalize:
        padded_inputs["input_features"] = self.zero_mean_unit_var_norm(
            padded_inputs["input_features"],
            attention_mask=padded_inputs["attention_mask"],
            padding_value=self.padding_value,
        )
        padded_inputs["input_features"] = np.stack(padded_inputs["input_features"], axis=0)

    # make sure list is in array format
    input_features = padded_inputs.get("input_features").transpose(2, 0, 1)

    input_features = self._torch_extract_fbank_features(input_features[0], device)

    if isinstance(input_features[0], list):
        padded_inputs["input_features"] = [np.asarray(feature, dtype=np.float32) for feature in input_features]
    else:
        padded_inputs["input_features"] = input_features

    if return_attention_mask:
        # rescale from sample (48000) to feature (3000)
        rescaled_attention_mask = padded_inputs["attention_mask"][:, :: self.hop_length]

        # The STFT computation produces L//hop_length + 1 frames,
        # but we skip the last frame (see `_torch_extract_fbank_features`).
        # This means we need to trim the rescaled attention mask to match
        # the actual number of frames (L//hop_length) when the input length
        # is not perfectly divisible by the hop length.
        if padded_inputs["attention_mask"].shape[1] % self.hop_length != 0:
            rescaled_attention_mask = rescaled_attention_mask[:, :-1]
        padded_inputs["attention_mask"] = rescaled_attention_mask

    if return_token_timestamps is not None:
        padded_inputs["num_frames"] = [len(raw_speech_i) // self.hop_length for raw_speech_i in raw_speech]

    if return_tensors is not None:
        padded_inputs = padded_inputs.convert_to_tensors(return_tensors)

    return padded_inputs
The _preprocess method is type-hinted to return a Tuple[torch.Tensor, torch.Tensor], but it currently returns a single BatchFeature object. This will cause a ValueError: too many values to unpack at the call site in qwen3_omni_audio.py, which expects two return values. The method should be updated to return the input features and their corresponding lengths as a tuple to match the type hint and the caller's expectation.
if return_tensors is not None:
    padded_inputs = padded_inputs.convert_to_tensors(return_tensors)

lengths = [len(raw_speech_i) // self.hop_length for raw_speech_i in raw_speech]
if return_tensors == "pt":
    lengths = torch.tensor(lengths, device=padded_inputs["input_features"].device)
elif return_tensors == "np":
    lengths = np.array(lengths)

if return_token_timestamps is not None:
    padded_inputs["num_frames"] = lengths

return padded_inputs["input_features"], lengths
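For reference, a call site matching this contract would unpack both values. The snippet below is illustrative only; the variable names and exact arguments are assumptions, not code taken from qwen3_omni_audio.py:

    # hypothetical caller in qwen3_omni_audio.py
    input_features, feature_lens = self._preprocess(
        raw_speech,
        return_tensors="pt",
        device=device,
    )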
lightllm/utils/config_utils.py (outdated)
def get_eos_token_ids(model_path: str) -> Optional[List[int]]:
    eos_token_id = _get_config_llm_keyvalue(model_path=model_path, key_name=["eos_token_id"])
    return [151645]  # Figure out how to change this later? Modify config.json directly?
Hardcoding the return value [151645] for get_eos_token_ids is a critical issue. This will break any model that relies on this function to get its correct EOS token ID(s) from the configuration file. The hardcoded line should be removed to restore the original, correct logic of reading from the config.
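A minimal sketch of the restored logic, assuming _get_config_llm_keyvalue returns the raw eos_token_id entry from config.json (an int, a list of ints, or None):

    def get_eos_token_ids(model_path: str) -> Optional[List[int]]:
        # Read the EOS id(s) from the model config instead of hardcoding 151645.
        eos_token_id = _get_config_llm_keyvalue(model_path=model_path, key_name=["eos_token_id"])
        if eos_token_id is None:
            return None
        if isinstance(eos_token_id, int):
            return [eos_token_id]
        return list(eos_token_id)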
hidden_size = network_config["hidden_size"]
vocab_size = network_config["vocab_size"]
tie_word_embeddings = network_config.get("tie_word_embeddings", False)
self.lm_head_weight_ = LMHeadWeight(
    dim=hidden_size,
    vocab_size=vocab_size,
    weight_name="thinker.lm_head.weight",
    data_type=self.data_type_,
    embedding_weight=self.wte_weight_ if tie_word_embeddings else None,
)
The lm_head_weight_ is being overridden with a hardcoded weight name "thinker.lm_head.weight". This change is in the Qwen2PreAndPostLayerWeight class, which may be used by other qwen2 models. Hardcoding a model-specific weight name here can break other models that don't use this weight naming scheme. It's better to make this configurable or handle it within the specific model's weight class (Qwen3OmniMOEThinkerPreAndPostLayerWeight).
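One way to express that, sketched under the assumption that the weight name can be exposed as a class attribute; the attribute name, the _build_lm_head hook, and the omitted base class are hypothetical:

    class Qwen2PreAndPostLayerWeight:  # base class omitted in this sketch
        lm_head_weight_name = "lm_head.weight"  # default qwen2 naming stays intact

        def _build_lm_head(self, network_config):
            self.lm_head_weight_ = LMHeadWeight(
                dim=network_config["hidden_size"],
                vocab_size=network_config["vocab_size"],
                weight_name=self.lm_head_weight_name,
                data_type=self.data_type_,
                embedding_weight=self.wte_weight_ if network_config.get("tie_word_embeddings", False) else None,
            )

    class Qwen3OmniMOEThinkerPreAndPostLayerWeight(Qwen2PreAndPostLayerWeight):
        # Only the omni-thinker checkpoint prefixes the head with "thinker.".
        lm_head_weight_name = "thinker.lm_head.weight"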
if config_json.get("thinker_config") is not None:
    value = config_json.get("thinker_config", {}).get("text_config").get(key)
The logic to handle thinker_config is flawed. It unconditionally overwrites the value variable if thinker_config exists, even if a value was already found from other keys. This will lead to incorrect behavior. This check should be part of the fallback chain, only executing if the value hasn't been found in other locations.
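In code, the guarded fallback the comment asks for could look like the following; only the value-is-None guard is the point, and the earlier lookups are whatever the helper already does:

    if value is None and config_json.get("thinker_config") is not None:
        # Only consult thinker_config.text_config if no earlier lookup found the key.
        value = config_json["thinker_config"].get("text_config", {}).get(key)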
mel_len = chunk_len // 160
dilation = 1
L_in = mel_len
for (padding, kernel_size, stride) in eval("[(1,3,1)] + [(1,3,2)] "):
Using eval() on a string, even if it's a literal, is generally considered unsafe and can be a performance bottleneck. It's better to use the literal value directly.
- for (padding, kernel_size, stride) in eval("[(1,3,1)] + [(1,3,2)] "):
+ for (padding, kernel_size, stride) in [(1, 3, 1), (1, 3, 2)]:
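If this sequence ever did have to be parsed from a string (for example, read from a config file), ast.literal_eval from the standard library would be the safe alternative to eval; a small illustrative sketch:

    import ast

    # Parses Python literals only; it will never execute arbitrary code.
    conv_layers = ast.literal_eval("[(1, 3, 1), (1, 3, 2)]")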
all_config = json.load(json_file)
self.config = all_config["thinker_config"]["text_config"]
# rename keys
print(f"self.config is {self.config}")
raise ValueError(f"cannot read audio which type is {type(item)}!")

# padding to min audio len
MIN_AUDIO_LEN = 480
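The excerpt only shows the constant; the padding step implied by the comment presumably looks something like the sketch below. The helper name and the np.pad usage are assumptions, not code from the PR:

    import numpy as np

    MIN_AUDIO_LEN = 480  # minimum number of samples downstream stages can handle

    def pad_to_min_audio_len(audio: np.ndarray) -> np.ndarray:
        # Right-pad very short mono clips with zeros (silence) up to MIN_AUDIO_LEN.
        if audio.shape[-1] < MIN_AUDIO_LEN:
            audio = np.pad(audio, (0, MIN_AUDIO_LEN - audio.shape[-1]))
        return audio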
deepstack_feature_lists.append(deepstack_feature)

hidden_states = self.merger(hidden_states)
print(f"hidden_states is {hidden_states}, deepstack is {deepstack_feature_lists}")