
Quantize BF16 weights to TL1 #383

@davyuan

Hello!

I'm working on a TL1 implementation for BitNet. I'm using the weights from the .safetensors here: https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-bf16.

I'm quantizing the weights, following your paper, using the absmean() method. Here is a snippet of my code.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<int8_t> bitnet_158_quantize(const std::vector<float>& weight_array,
                                        float* weight_scale, int M, int K) {
    const float epsilon = 1e-7f;
    int size = static_cast<int>(weight_array.size());

    // absmean: gamma is the mean absolute value of the whole weight matrix
    float sum_abs = 0.0f;
    for (int m = 0; m < M; m++) {
        for (int k = 0; k < K; k++) {
            sum_abs += std::fabs(weight_array[m * K + k]);
        }
    }
    float gamma = sum_abs / (M * K);
    weight_scale[0] = gamma;

    // RoundClip(w / (gamma + eps), -1, 1) -> ternary {-1, 0, 1}
    std::vector<int8_t> quantized_w(size);
    for (int m = 0; m < M; m++) {
        for (int k = 0; k < K; k++) {
            int idx = m * K + k;
            float normalized = weight_array[idx] / (gamma + epsilon);
            float rounded = std::round(normalized);
            // Clip to [-1, 1] range
            int8_t clipped = static_cast<int8_t>(
                std::max(-1.0f, std::min(1.0f, rounded)));
            quantized_w[idx] = clipped;
        }
    }

    return quantized_w;
}

The problem I'm having is that the ternary weights produce results that differ hugely from those of the original .safetensors. If I compute the cosine similarity between my MatMul output with the ternary weights and the output with the reference weights, it falls below 0.7, which I don't believe preserves the signal, and it results in numerical explosion in the deeper layers.

Things I have checked:

  1. Yes, I'm using the absmean() method as described in your paper.
  2. Yes, I'm using a global weight scale, which is later applied as a multiplier in my LUT kernel.
  3. My LUT kernel produces correct results when tested with weights initialized uniformly at random from {-1, 0, 1} and a global weight scale of 1.0, so I know the other aspects of my algorithm are solid.

Please let me know how I should quantize this BF16 model for TL1. Does it need a block/tile weight scale? Do I need to customize the ggml code to add some magic sauce somewhere?

thanks!
David
