
Quantize BF16 weights to TL1 #383

@davyuan

Hello!

I'm working on a TL1 implementation for BitNet. I'm using the weights from the .safetensors here: https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-bf16.

I'm quantizing the weights, following your paper, using the absmean() method. Here is a snippet of my code.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<int8_t> bitnet_158_quantize(const std::vector<float>& weight_array,
                                        float* weight_scale, int M, int K) {
    const float epsilon = 1e-7f;
    int size = static_cast<int>(weight_array.size());

    // absmean: gamma is the mean absolute value of the whole weight matrix
    float sum_abs = 0.0f;
    for (int m = 0; m < M; m++) {
        for (int k = 0; k < K; k++) {
            sum_abs += std::fabs(weight_array[m * K + k]);
        }
    }
    float gamma = sum_abs / (M * K);
    weight_scale[0] = gamma;

    // RoundClip(w / (gamma + eps), -1, 1) -> ternary {-1, 0, 1}
    std::vector<int8_t> quantized_w(size);
    for (int m = 0; m < M; m++) {
        for (int k = 0; k < K; k++) {
            int idx = m * K + k;
            float normalized = weight_array[idx] / (gamma + epsilon);
            float rounded = std::round(normalized);
            // Clip to [-1, 1] range
            int8_t clipped = static_cast<int8_t>(
                std::max(-1.0f, std::min(1.0f, rounded)));
            quantized_w[idx] = clipped;
        }
    }

    return quantized_w;
}

The problem I'm having is that the ternary weights produce results that differ hugely from those of the original .safetensors. If I compute the cosine similarity between my MatMul output with the ternary weights and the output with the reference weights, it falls below 0.7, which I don't believe preserves the signal, and it results in numerical explosion in the deeper layers.

Things I have checked:

  1. Yes, I'm using the absmean() method as described in your paper.
  2. Yes, I'm using a global weight scale, which is later applied as a multiplier in my LUT kernel.
  3. My LUT kernel produces correct results when tested with weights initialized uniformly at random from {-1, 0, 1} and a global weight scale of 1.0, so I know the other aspects of my algorithm are solid.

Please let me know how I should quantize this BF16 model for TL1. Does it need a block/tile weight scale? Do I need to customize the ggml code to add some magic sauce somewhere?

thanks!
David
