Describe the bug
16-bit models have some operations performed at FP32 (e.g., LayerNorm). This is expected and the subsequent (non-quantized) linear layer will cast the activations back to 16 bits during the matmul with 16-bit weights.
However, QLinear uses the datatype of the inputs to set the datatype of the clip values.
As a result, every QLinear layer that follows a LayerNorm uses FP32 as the datatype for its clip values. The matmul is still correctly performed at 16 bits; only the datatype of the clips (and of the related scale and zero_point) is inconsistent.
On the other hand, subsequent layers receive the output of QLinear and will use 16 bits for the clips... up until the next LayerNorm.
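Minimal self-contained sketch of the pattern (a toy stand-in, not the actual QLinear implementation; names like `ToyQLinear` and `clip_val` are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative toy module, NOT the library's QLinear: its clip value takes the
# dtype of the incoming activations, mirroring the behavior described above.
class ToyQLinear(nn.Linear):
    def forward(self, x):
        self.clip_val = x.abs().amax()        # dtype follows the input activations
        x = torch.clamp(x, -self.clip_val, self.clip_val)
        # the matmul itself is still performed in the weight dtype (16-bit)
        return nn.functional.linear(x.to(self.weight.dtype), self.weight, self.bias)

ln = nn.LayerNorm(8)                          # kept in FP32, as in 16-bit models
q1, q2 = ToyQLinear(8, 8).half(), ToyQLinear(8, 8).half()

x = torch.randn(2, 8, dtype=torch.float16)
out = q2(q1(ln(x.float())))

print(q1.clip_val.dtype)   # torch.float32 -- follows the FP32 LayerNorm output
print(q2.clip_val.dtype)   # torch.float16 -- follows q1's 16-bit output
print(out.dtype)           # torch.float16 -- matmuls run in the weight dtype
```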
Expected behavior
Datatype for clips should be consistent: always FP32, or always FP16, or always match the datatype of the model being quantized.
Proposed solution
We could use the datatype of the weights of the linear layer being processed as the reference, and have the quantizers set the clip datatype from it.
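A minimal sketch of that idea, reusing the toy module from the snippet above (again illustrative, not the actual QLinear code): the clip dtype is derived from the weights rather than from the inputs.

```python
class FixedToyQLinear(ToyQLinear):
    def forward(self, x):
        # Proposed behavior: clips (and hence scale/zero_point) follow the
        # weight dtype instead of the input dtype.
        self.clip_val = x.abs().amax().to(self.weight.dtype)
        x = torch.clamp(x, -self.clip_val.to(x.dtype), self.clip_val.to(x.dtype))
        return nn.functional.linear(x.to(self.weight.dtype), self.weight, self.bias)

q1_fixed = FixedToyQLinear(8, 8).half()
print(q1_fixed(ln(x.float())).dtype)   # torch.float16 -- matmul unchanged
print(q1_fixed.clip_val.dtype)         # torch.float16 -- now matches the weights
```

With this, a 16-bit model keeps FP16 clips everywhere, regardless of whether the preceding op ran at FP32.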