Added ability to accumulate in FP16 and to convert BF16 to FP32 for FP16 and BF16 GEMM in RISC-V (BF16 now works for pre-RVA23) #5640
Conversation
|
Unfortunately, BF16 only has widening MADD instructions, so the same changes cannot be made for BF16.
|
Currently these are for VLEN = 256 only.
|
It now works for VLEN = 128. …X faster on BananaPi.
|
Main loop now uses LMUL = 2 |
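For reference, a minimal sketch of an FP32 inner loop at LMUL = 2 using the standard RVV C intrinsics. This is illustrative only; the function, its shape, and its names are not the PR's actual kernel:

```c
#include <stddef.h>
#include <riscv_vector.h>

/* Sketch: with LMUL = 2 each vector operand is a group of two architectural
 * registers, so every iteration processes twice as many elements and loop
 * overhead is roughly halved. */
void axpy_f32_lmul2(float a, const float *x, float *y, size_t n) {
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m2(n);             /* elements this pass */
        vfloat32m2_t vx = __riscv_vle32_v_f32m2(x, vl);  /* load x chunk */
        vfloat32m2_t vy = __riscv_vle32_v_f32m2(y, vl);  /* load y chunk */
        vy = __riscv_vfmacc_vf_f32m2(vy, a, vx, vl);     /* y += a * x */
        __riscv_vse32_v_f32m2(y, vy, vl);                /* store result */
        x += vl; y += vl; n -= vl;
    }
}
```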
|
Even faster!!! |
|
Convert inputs from BF16 to FP32 and use FP32 vector MADDs. 18% faster.
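A scalar sketch of the approach (the PR's code is vectorized, and these helper names are made up): BF16 is the top 16 bits of an IEEE-754 FP32, so widening is a 16-bit shift, after which products can be accumulated with ordinary FP32 FMAs:

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* BF16 is the high half of an FP32, so widening is just a shift. */
static inline float bf16_to_f32(uint16_t b) {
    uint32_t u = (uint32_t)b << 16;   /* place BF16 bits in the FP32 high half */
    float f;
    memcpy(&f, &u, sizeof f);         /* bit-cast without aliasing issues */
    return f;
}

/* Scalar analogue of the kernel: widen both inputs, accumulate in FP32. */
static float dot_bf16_via_f32(const uint16_t *a, const uint16_t *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc = fmaf(bf16_to_f32(a[i]), bf16_to_f32(b[i]), acc);
    return acc;
}
```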
|
Convert BF16 values once (and vectorized): 3-4% faster.
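A hypothetical scalar stand-in for that one-time conversion pass (the real pass is vectorized; the function name and buffer handling here are assumptions): widen the whole BF16 operand into an FP32 scratch buffer up front, so the GEMM inner loops never re-convert a value:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Convert each BF16 value to FP32 exactly once, into a scratch buffer. */
static float *widen_bf16_once(const uint16_t *src, size_t n) {
    float *dst = malloc(n * sizeof *dst);
    if (!dst) return NULL;
    for (size_t i = 0; i < n; ++i) {
        uint32_t u = (uint32_t)src[i] << 16;  /* BF16 -> FP32 bit pattern */
        memcpy(&dst[i], &u, sizeof dst[i]);   /* bit-cast */
    }
    return dst;   /* caller frees */
}
```

Keeping the FP32 copy alongside the original BF16 data is presumably where the 1.5X memory overhead mentioned below comes from.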
|
The latest BF16 version is 48% faster than the current version, and 3% faster than the FP32 version. Unfortunately it uses 1.5X more memory. It would be possible to make this version work on the BananaPi, which doesn't support BF16 vector instructions: the conversions could be done with a uint32 conversion and shift, and the remaining BF16 vector code fixed to not use vector BF16 MADDs.
|
BF16 GEMM now works on pre-RVA23 systems like the BananaPi.
Added the ability to accumulate in FP16 for GEMM; results are widened once at the end of the loops.
Accumulation differences are about 4 epsilons compared to the previous (per-step widening) version; FP16 epsilon is 2^-10 ≈ 9.8e-4, so that is roughly 4e-3 relative. But performance is up to 2.7X faster. Note: the BananaPi shows only 1.85X faster.
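A scalar sketch of the FP16-accumulation scheme, assuming a compiler with _Float16 support (the real kernel uses RVV FP16 vectors, and this function name is made up): the accumulator stays in FP16 through the loop and is widened to FP32 exactly once at the end, which is where the few-epsilon differences versus the per-step widening version come from:

```c
/* Requires _Float16 support, e.g. Clang/GCC targeting RISC-V with Zfh. */
static float dot_f16_acc(const _Float16 *a, const _Float16 *b, int n) {
    _Float16 acc = (_Float16)0;
    for (int i = 0; i < n; ++i)
        acc += a[i] * b[i];       /* FP16 multiply-add; no per-step widening */
    return (float)acc;            /* single widen to FP32 after the loop */
}
```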