
Added ability to accumulate in FP16. Convert BF16 to FP32. For FP16 and BF16 GEMM in RISC-V (BF16 now works for pre-RVA23) #5640

Open

ChipKerchner wants to merge 15 commits into OpenMathLib:develop from ChipKerchner:RVV_Narrow_Accumulate_FP16_GEMM

Conversation

ChipKerchner (Contributor) commented Feb 10, 2026

Added the ability to accumulate in FP16 for GEMM; the accumulators are widened once at the end of the loops.

Testing LLVM FP16 LMUL1 VLEN256 GEMM 1 0 0  512  512  512   1  2.0  1.0  1

Total time =         24948910

Testing LLVM FP16_N LMUL1 VLEN256 GEMM 1 0 0  512  512  512   1  2.0  1.0  1

Total time =         18968190

Accumulation differences are about 4 epsilons compared to the widening (previous) version, but performance is up to 2.7X faster. Note: the BananaPi shows only 1.85X.
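For reference, the narrow-accumulate pattern looks roughly like the sketch below. This is a minimal illustration, not the PR's actual microkernel: the function name, tiling, and loop structure are invented, and it assumes the Zvfh extension for FP16 vector arithmetic.

```c
#include <riscv_vector.h>
#include <stddef.h>

// Minimal sketch of FP16 narrow accumulation: keep the running sum in
// FP16 inside the hot loop, and widen to FP32 only once at the end.
// Assumes Zvfh; names and tiling are illustrative.
void hgemm_panel_sketch(const _Float16 *a, const _Float16 *b,
                        float *c, size_t k, size_t vl)
{
    vfloat16m1_t acc = __riscv_vfmv_v_f_f16m1((_Float16)0.0f, vl);

    for (size_t p = 0; p < k; p++) {
        vfloat16m1_t va = __riscv_vle16_v_f16m1(&a[p * vl], vl);
        // FP16 multiply-accumulate: no widening inside the loop.
        acc = __riscv_vfmacc_vf_f16m1(acc, b[p], va, vl);
    }

    // Widen once, at the end of the loop.
    vfloat32m2_t wide = __riscv_vfwcvt_f_f_v_f32m2(acc, vl);
    __riscv_vse32_v_f32m2(c, wide, vl);
}
```

The ~4-epsilon difference is the expected cost of this trade: every intermediate sum is rounded to FP16 rather than carried in FP32.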

ChipKerchner (Contributor Author) commented Feb 10, 2026

Unfortunately, BF16 only has widening MADD instructions, so the same changes cannot be made for BF16.
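For context, the BF16 widening MADD is sketched below: the accumulator must already be FP32, so there is no narrow-accumulate variant to switch to. This assumes the Zvfbfwma extension; names and loop structure are illustrative.

```c
#include <riscv_vector.h>
#include <stddef.h>

// Sketch: BF16's only vector MADD is the widening form (vfwmaccbf16),
// which takes BF16 inputs and an FP32 accumulator. Assumes Zvfbfwma.
void bgemm_widening_sketch(const __bf16 *a, const __bf16 *b,
                           float *c, size_t k, size_t vl)
{
    vfloat32m2_t acc = __riscv_vfmv_v_f_f32m2(0.0f, vl);

    for (size_t p = 0; p < k; p++) {
        vbfloat16m1_t va = __riscv_vle16_v_bf16m1(&a[p * vl], vl);
        vbfloat16m1_t vb = __riscv_vle16_v_bf16m1(&b[p * vl], vl);
        // BF16 x BF16 -> FP32 accumulate; the widening is unavoidable.
        acc = __riscv_vfwmaccbf16_vv_f32m2(acc, va, vb, vl);
    }

    __riscv_vse32_v_f32m2(c, acc, vl);
}
```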

ChipKerchner (Contributor Author):

These changes are currently for VLEN = 256 only.

ChipKerchner (Contributor Author):

It now works for VLEN = 128.

ChipKerchner marked this pull request as draft February 10, 2026 22:04
ChipKerchner (Contributor Author):

The main loop now uses LMUL = 2.

ChipKerchner marked this pull request as ready for review February 11, 2026 00:38
ChipKerchner (Contributor Author) commented Feb 11, 2026

Even faster!!!

Testing LLVM FP16_N LMUL1 VLEN256 GEMM 1 0 0  512  512  512   1  2.0  1.0  1

Total time =         13400067

ChipKerchner (Contributor Author):

Convert the inputs from BF16 to FP32 and use FP32 vector MADDs. 18% faster.
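A rough sketch of that convert-then-FP32-MADD shape, assuming Zvfbfmin for the vector BF16-to-FP32 convert (identifiers invented for illustration):

```c
#include <riscv_vector.h>
#include <stddef.h>

// Sketch: widen the BF16 inputs to FP32, then use plain FP32 MADDs.
// The conversion is exact, since BF16 is a truncated FP32.
// Assumes Zvfbfmin; names are illustrative.
void bgemm_convert_sketch(const __bf16 *a, const __bf16 *b,
                          float *c, size_t k, size_t vl)
{
    vfloat32m2_t acc = __riscv_vfmv_v_f_f32m2(0.0f, vl);

    for (size_t p = 0; p < k; p++) {
        vbfloat16m1_t va16 = __riscv_vle16_v_bf16m1(&a[p * vl], vl);
        vbfloat16m1_t vb16 = __riscv_vle16_v_bf16m1(&b[p * vl], vl);
        // Widening convert BF16 -> FP32, then a standard FP32 MADD.
        vfloat32m2_t va = __riscv_vfwcvtbf16_f_f_v_f32m2(va16, vl);
        vfloat32m2_t vb = __riscv_vfwcvtbf16_f_f_v_f32m2(vb16, vl);
        acc = __riscv_vfmacc_vv_f32m2(acc, va, vb, vl);
    }

    __riscv_vse32_v_f32m2(c, acc, vl);
}
```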

ChipKerchner (Contributor Author):

Convert the BF16 values once (and vectorized): 3-4% faster.
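One way to realize "convert once" is to move the conversion into the packing step, so each element is widened a single time instead of on every tile reuse. A hedged sketch, with a hypothetical packing signature:

```c
#include <riscv_vector.h>
#include <stddef.h>

// Sketch: convert a BF16 panel to a packed FP32 buffer once during
// packing; the microkernel then sees only FP32 data. Assumes Zvfbfmin.
void pack_bf16_to_f32_sketch(const __bf16 *src, float *dst, size_t n)
{
    size_t i = 0;
    while (i < n) {
        size_t vl = __riscv_vsetvl_e16m1(n - i);
        vbfloat16m1_t v16 = __riscv_vle16_v_bf16m1(&src[i], vl);
        vfloat32m2_t  v32 = __riscv_vfwcvtbf16_f_f_v_f32m2(v16, vl);
        __riscv_vse32_v_f32m2(&dst[i], v32, vl);
        i += vl;
    }
}
```

The packed FP32 buffers are twice the size of the BF16 data they replace, which would account for the extra memory noted in the next comment.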

ChipKerchner marked this pull request as draft February 13, 2026 15:20
ChipKerchner (Contributor Author) commented Feb 13, 2026

The latest BF16 version is 48% faster than the current version, and is 3% faster than FP32.

Unfortunately, it uses 1.5X more memory.

It would be possible to make this version work on the BananaPi, which doesn't support BF16 vector instructions. The conversions could be done with a uint32 widening and shift, and the remaining BF16 vector code could be fixed to not use vector BF16 MADDs.
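That integer fallback can look like the sketch below: since a BF16 value is just the top 16 bits of the corresponding FP32, zero-extend each element to 32 bits, shift left by 16, and reinterpret the bits. Only the base V extension is needed; the helper name is invented.

```c
#include <riscv_vector.h>
#include <stdint.h>
#include <stddef.h>

// Sketch: BF16 -> FP32 without any BF16 vector instructions, for
// pre-RVA23 hardware. Zero-extend each 16-bit value, shift it into
// the high half, and reinterpret the bits as FP32.
static inline vfloat32m2_t bf16_to_f32_fallback(const uint16_t *src,
                                                size_t vl)
{
    vuint16m1_t v16 = __riscv_vle16_v_u16m1(src, vl);
    vuint32m2_t v32 = __riscv_vzext_vf2_u32m2(v16, vl);  // widen to 32 bits
    v32 = __riscv_vsll_vx_u32m2(v32, 16, vl);            // bits into high half
    return __riscv_vreinterpret_v_u32m2_f32m2(v32);      // view as FP32
}
```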

ChipKerchner marked this pull request as ready for review February 13, 2026 17:56
ChipKerchner changed the title from "Added ability to accumulate in FP16 for GEMM for RISC-V" to "Added ability to accumulate in FP16 and one set of conversions for BF16 for GEMM in RISC-V" Feb 13, 2026
ChipKerchner (Contributor Author):

BF16 GEMM now works for pre-RVA23 systems like the BananaPi.

ChipKerchner changed the title from "Added ability to accumulate in FP16 and one set of conversions for BF16 for GEMM in RISC-V" to "Added ability to accumulate in FP16. Convert BF16 to FP32. For FP16 and BF16 GEMM in RISC-V (BF16 now works for pre-RVA23)" Feb 15, 2026