inconsistent datatypes using QLinear #142

@andrea-fasoli

Description

Describe the bug

16-bit models have some operations performed at FP32 (e.g., LayerNorm). This is expected, and the subsequent (non-quantized) linear layer casts the activations back to 16 bits during the matmul with the 16-bit weights.
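
For illustration, a minimal sketch of that dtype flow in plain PyTorch (not the library's actual code):

```python
import torch

# LayerNorm runs at FP32 while the following linear layer casts the
# activations back to 16 bits for the matmul with 16-bit weights.
x = torch.randn(4, 16, dtype=torch.float16)
ln_out = torch.nn.functional.layer_norm(x.float(), (16,))  # FP32 activations
w = torch.randn(32, 16, dtype=torch.float16)               # 16-bit weights
out = ln_out.to(w.dtype) @ w.T                             # matmul at FP16
print(ln_out.dtype, out.dtype)  # torch.float32 torch.float16
```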

However, QLinear uses the datatype of its inputs to set the datatype of the clip values.
As a result, every QLinear layer that follows a LayerNorm uses FP32 as the datatype for its clip values. The matmul itself is still correctly performed at 16 bits; only the datatype of the clips (and of the related scale and zero_point) is inconsistent.
Downstream layers, on the other hand, receive the 16-bit output of QLinear and use 16 bits for their clips, up until the next LayerNorm.
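
A sketch of the problematic pattern (QLinear internals paraphrased from the description above; names such as `QLinearSketch`, `clip_val`, and `scale` are illustrative assumptions, not the actual API):

```python
import torch

class QLinearSketch(torch.nn.Module):
    """Illustrative only; paraphrases the pattern described in this issue."""

    def __init__(self, in_features, out_features, dtype=torch.float16):
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.randn(out_features, in_features, dtype=dtype)
        )

    def forward(self, x):
        # Problematic pattern: the clip value (and the derived scale) inherits
        # the dtype of the *input*, so an FP32 input coming from a LayerNorm
        # yields FP32 clips even though the weights and the matmul are 16-bit.
        clip_val = x.detach().abs().amax()       # dtype follows x
        scale = clip_val / 127.0                 # inherits x.dtype
        x_c = torch.clamp(x, -clip_val, clip_val)
        return x_c.to(self.weight.dtype) @ self.weight.T  # matmul at 16 bits
```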

Expected behavior

The datatype for clips should be consistent: always FP32, always FP16, or always matching the datatype of the model being quantized.

Proposed solution

We could take the datatype of the weights of the linear layer being processed as the reference, and use it in the quantizers to set the clip datatype.
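
A sketch of what that could look like inside the activation quantizer (hypothetical names, assuming the quantizer can see the layer's `weight`):

```python
# Proposed fix (sketch): derive the clip dtype from the layer's weights
# rather than from the incoming activations, so it stays consistent
# across layers regardless of FP32 upcasts such as LayerNorm.
clip_val = x.detach().abs().amax().to(self.weight.dtype)
scale = clip_val / 127.0  # now consistently 16-bit for a 16-bit model
x_c = torch.clamp(x.to(self.weight.dtype), -clip_val, clip_val)
```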
