Why does the Llama architecture use LLAMA_ROPE_TYPE_NORM? #18127
I've found that Hugging Face implements NeoX-style RoPE for Llama-based models (as quoted below). However, in llama.cpp these models are assigned LLAMA_ROPE_TYPE_NORM, and when I manually switch the RoPE type to ROPE_TYPE_NEOX and run the PPL evaluation in llama.cpp, performance degrades severely. Could anyone explain the underlying reason for applying a different RoPE type than PyTorch?
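For context, the two layouts differ only in which coordinates get paired: NORM rotates adjacent pairs `(x[2i], x[2i+1])`, while NeoX-style (what HF's `rotate_half` computes) rotates split-half pairs `(x[i], x[i + d/2])`. A minimal numpy sketch (the function names `rope_norm`/`rope_neox` are mine, not llama.cpp's) showing that the two coincide once the channels are interleaved:

```python
import numpy as np

def rope_norm(x, theta):
    # "Normal" RoPE: rotate adjacent pairs (x[2i], x[2i+1]) by theta[i].
    out = x.copy()
    for i in range(x.shape[-1] // 2):
        c, s = np.cos(theta[i]), np.sin(theta[i])
        x0, x1 = x[..., 2 * i], x[..., 2 * i + 1]
        out[..., 2 * i] = x0 * c - x1 * s
        out[..., 2 * i + 1] = x0 * s + x1 * c
    return out

def rope_neox(x, theta):
    # NeoX-style RoPE (HF's rotate_half): rotate pairs (x[i], x[i + d/2]).
    out = x.copy()
    half = x.shape[-1] // 2
    for i in range(half):
        c, s = np.cos(theta[i]), np.sin(theta[i])
        x0, x1 = x[..., i], x[..., i + half]
        out[..., i] = x0 * c - x1 * s
        out[..., i + half] = x0 * s + x1 * c
    return out

d = 8
theta = 10000.0 ** (-np.arange(d // 2) * 2.0 / d)  # angles at position 1
q = np.random.default_rng(0).standard_normal(d)

# Interleave the two halves: [0, d/2, 1, d/2+1, ...]
perm = np.arange(d).reshape(2, d // 2).T.reshape(-1)
assert np.allclose(rope_norm(q[perm], theta), rope_neox(q, theta)[perm])
```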
Because Q/K gets permuted on conversion: llama.cpp/convert_hf_to_gguf.py, lines 2506 to 2509 in 3d86c6c
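The referenced converter code is the Q/K row permutation. A paraphrased sketch of that helper (check the pinned lines for the exact version in the repo):

```python
import torch

def permute(weights: torch.Tensor, n_head: int, n_head_kv: int | None) -> torch.Tensor:
    # Under GQA the K projection has n_head_kv heads rather than n_head.
    if n_head_kv is not None and n_head != n_head_kv:
        n_head = n_head_kv
    # Per head, view the output rows as (2 halves, head_dim/2) and swap to
    # (head_dim/2, 2 halves): row 2i of the result is original row i and
    # row 2i+1 is original row i + head_dim/2, i.e. the two halves of each
    # head get interleaved.
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))
```

Because the Q/K rows stored in the GGUF are already interleaved this way, llama.cpp's adjacent-pair (NORM) rotation ends up rotating exactly the coordinates that HF's `rotate_half` (NeoX) rotation would have. Switching the runtime RoPE type to NEOX without reconverting applies the split-half rotation on top of already-permuted weights, so mismatched pairs get rotated, which is why your PPL collapses.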