Describe the bug
16-bit models have some operations performed at FP32 (e.g., LayerNorm). This is expected and the subsequent (non-quantized) linear layer will cast the activations back to 16 bits during the matmul with 16-bit weights.
However, QLinear uses the datatype of the inputs to set the datatype of the clip values.
As a result, every QLinear layer that follows a LayerNorm uses FP32 as the datatype for its clip values. The matmul is still correctly performed at 16 bits; only the datatype of the clips (and of the related scale and zero_point) is inconsistent.
On the other hand, subsequent layers receive the output of QLinear and will use 16 bits for the clips... up until the next LayerNorm.
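Minimal self-contained sketch of the pattern (a toy stand-in, not the actual QLinear implementation; names like `ToyQLinear` and `clip_val` are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative toy module, NOT the library's QLinear: its clip value takes the
# dtype of the incoming activations, mirroring the behavior described above.
class ToyQLinear(nn.Linear):
    def forward(self, x):
        self.clip_val = x.abs().amax()        # dtype follows the input activations
        x = torch.clamp(x, -self.clip_val, self.clip_val)
        # the matmul itself is still performed in the weight dtype (16-bit)
        return nn.functional.linear(x.to(self.weight.dtype), self.weight, self.bias)

ln = nn.LayerNorm(8)                          # kept in FP32, as in 16-bit models
q1, q2 = ToyQLinear(8, 8).half(), ToyQLinear(8, 8).half()

x = torch.randn(2, 8, dtype=torch.float16)
out = q2(q1(ln(x.float())))

print(q1.clip_val.dtype)   # torch.float32 -- follows the FP32 LayerNorm output
print(q2.clip_val.dtype)   # torch.float16 -- follows q1's 16-bit output
print(out.dtype)           # torch.float16 -- matmuls run in the weight dtype
```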
Expected behavior
Datatype for clips should be consistent: always FP32, or always FP16, or always match the datatype of the model being quantized.
Proposed solution
We could use the datatype of the weights of the linear layer being processed as the reference, and have the quantizers set the clip datatype from it.
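A minimal sketch of that idea, reusing the toy module from the snippet above (again illustrative, not the actual QLinear code): the clip dtype is derived from the weights rather than from the inputs.

```python
class FixedToyQLinear(ToyQLinear):
    def forward(self, x):
        # Proposed behavior: clips (and hence scale/zero_point) follow the
        # weight dtype instead of the input dtype.
        self.clip_val = x.abs().amax().to(self.weight.dtype)
        x = torch.clamp(x, -self.clip_val.to(x.dtype), self.clip_val.to(x.dtype))
        return nn.functional.linear(x.to(self.weight.dtype), self.weight, self.bias)

q1_fixed = FixedToyQLinear(8, 8).half()
print(q1_fixed(ln(x.float())).dtype)   # torch.float16 -- matmul unchanged
print(q1_fixed.clip_val.dtype)         # torch.float16 -- now matches the weights
```

With this, a 16-bit model keeps FP16 clips everywhere, regardless of whether the preceding op ran at FP32.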