
Added ability to accumulate in FP16. Convert BF16 to FP32. For FP16 and BF16 GEMM in RISC-V (BF16 now works for pre-RVA23) #5640

Open

ChipKerchner wants to merge 15 commits into OpenMathLib:develop from ChipKerchner:RVV_Narrow_Accumulate_FP16_GEMM

Conversation

ChipKerchner (Contributor) commented Feb 10, 2026

Added the ability to accumulate in FP16 for GEMM; the accumulators are widened once at the end of the loops.

Testing LLVM FP16 LMUL1 VLEN256 GEMM 1 0 0  512  512  512   1  2.0  1.0  1

Total time =         24948910

Testing LLVM FP16_N LMUL1 VLEN256 GEMM 1 0 0  512  512  512   1  2.0  1.0  1

Total time =         18968190

Accumulation differences are about 4 epsilons compared to the widening (previous) version, but performance is up to 2.7X faster. Note: the BananaPi shows only 1.85X.
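For reference, the narrow-accumulate pattern looks roughly like the sketch below. This is a minimal illustration, not the PR's actual microkernel: the function name, tiling, and loop structure are invented, and it assumes the Zvfh extension for FP16 vector arithmetic.

```c
#include <riscv_vector.h>
#include <stddef.h>

// Minimal sketch of FP16 narrow accumulation: keep the running sum in
// FP16 inside the hot loop, and widen to FP32 only once at the end.
// Assumes Zvfh; names and tiling are illustrative.
void hgemm_panel_sketch(const _Float16 *a, const _Float16 *b,
                        float *c, size_t k, size_t vl)
{
    vfloat16m1_t acc = __riscv_vfmv_v_f_f16m1((_Float16)0.0f, vl);

    for (size_t p = 0; p < k; p++) {
        vfloat16m1_t va = __riscv_vle16_v_f16m1(&a[p * vl], vl);
        // FP16 multiply-accumulate: no widening inside the loop.
        acc = __riscv_vfmacc_vf_f16m1(acc, b[p], va, vl);
    }

    // Widen once, at the end of the loop.
    vfloat32m2_t wide = __riscv_vfwcvt_f_f_v_f32m2(acc, vl);
    __riscv_vse32_v_f32m2(c, wide, vl);
}
```

The ~4-epsilon difference is the expected cost of this trade: every intermediate sum is rounded to FP16 rather than carried in FP32.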

ChipKerchner (Contributor Author) commented Feb 10, 2026

Unfortunately, BF16 only has widening MADD instructions, so the same changes cannot be made for BF16.
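For context, the BF16 widening MADD is sketched below: the accumulator must already be FP32, so there is no narrow-accumulate variant to switch to. This assumes the Zvfbfwma extension; names and loop structure are illustrative.

```c
#include <riscv_vector.h>
#include <stddef.h>

// Sketch: BF16's only vector MADD is the widening form (vfwmaccbf16),
// which takes BF16 inputs and an FP32 accumulator. Assumes Zvfbfwma.
void bgemm_widening_sketch(const __bf16 *a, const __bf16 *b,
                           float *c, size_t k, size_t vl)
{
    vfloat32m2_t acc = __riscv_vfmv_v_f_f32m2(0.0f, vl);

    for (size_t p = 0; p < k; p++) {
        vbfloat16m1_t va = __riscv_vle16_v_bf16m1(&a[p * vl], vl);
        vbfloat16m1_t vb = __riscv_vle16_v_bf16m1(&b[p * vl], vl);
        // BF16 x BF16 -> FP32 accumulate; the widening is unavoidable.
        acc = __riscv_vfwmaccbf16_vv_f32m2(acc, va, vb, vl);
    }

    __riscv_vse32_v_f32m2(c, acc, vl);
}
```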

ChipKerchner (Contributor Author):

These changes are currently for VLEN = 256 only.

ChipKerchner (Contributor Author):

It now works for VLEN = 128.

ChipKerchner marked this pull request as draft February 10, 2026 22:04
ChipKerchner (Contributor Author):

The main loop now uses LMUL = 2.

ChipKerchner marked this pull request as ready for review February 11, 2026 00:38
ChipKerchner (Contributor Author) commented Feb 11, 2026

Even faster!!!

Testing LLVM FP16_N LMUL1 VLEN256 GEMM 1 0 0  512  512  512   1  2.0  1.0  1

Total time =         13400067

ChipKerchner (Contributor Author):

Convert the inputs from BF16 to FP32 and use FP32 vector MADDs. 18% faster.
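A rough sketch of that convert-then-FP32-MADD shape, assuming Zvfbfmin for the vector BF16-to-FP32 convert (identifiers invented for illustration):

```c
#include <riscv_vector.h>
#include <stddef.h>

// Sketch: widen the BF16 inputs to FP32, then use plain FP32 MADDs.
// The conversion is exact, since BF16 is a truncated FP32.
// Assumes Zvfbfmin; names are illustrative.
void bgemm_convert_sketch(const __bf16 *a, const __bf16 *b,
                          float *c, size_t k, size_t vl)
{
    vfloat32m2_t acc = __riscv_vfmv_v_f_f32m2(0.0f, vl);

    for (size_t p = 0; p < k; p++) {
        vbfloat16m1_t va16 = __riscv_vle16_v_bf16m1(&a[p * vl], vl);
        vbfloat16m1_t vb16 = __riscv_vle16_v_bf16m1(&b[p * vl], vl);
        // Widening convert BF16 -> FP32, then a standard FP32 MADD.
        vfloat32m2_t va = __riscv_vfwcvtbf16_f_f_v_f32m2(va16, vl);
        vfloat32m2_t vb = __riscv_vfwcvtbf16_f_f_v_f32m2(vb16, vl);
        acc = __riscv_vfmacc_vv_f32m2(acc, va, vb, vl);
    }

    __riscv_vse32_v_f32m2(c, acc, vl);
}
```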

ChipKerchner (Contributor Author):

Convert the BF16 values once (and vectorized): 3-4% faster.
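One way to realize "convert once" is to move the conversion into the packing step, so each element is widened a single time instead of on every tile reuse. A hedged sketch, with a hypothetical packing signature:

```c
#include <riscv_vector.h>
#include <stddef.h>

// Sketch: convert a BF16 panel to a packed FP32 buffer once during
// packing; the microkernel then sees only FP32 data. Assumes Zvfbfmin.
void pack_bf16_to_f32_sketch(const __bf16 *src, float *dst, size_t n)
{
    size_t i = 0;
    while (i < n) {
        size_t vl = __riscv_vsetvl_e16m1(n - i);
        vbfloat16m1_t v16 = __riscv_vle16_v_bf16m1(&src[i], vl);
        vfloat32m2_t  v32 = __riscv_vfwcvtbf16_f_f_v_f32m2(v16, vl);
        __riscv_vse32_v_f32m2(&dst[i], v32, vl);
        i += vl;
    }
}
```

The packed FP32 buffers are twice the size of the BF16 data they replace, which would account for the extra memory noted in the next comment.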

ChipKerchner marked this pull request as draft February 13, 2026 15:20
ChipKerchner (Contributor Author) commented Feb 13, 2026

The latest BF16 version is 48% faster than the current version, and is 3% faster than FP32.

Unfortunately, it uses 1.5X more memory.

It would be possible to make this version work on the BananaPi, which doesn't support BF16 vector instructions. The conversions could be done with a uint32 widening and shift, and the remaining BF16 vector code could be fixed to not use vector BF16 MADDs.
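That integer fallback can look like the sketch below: since a BF16 value is just the top 16 bits of the corresponding FP32, zero-extend each element to 32 bits, shift left by 16, and reinterpret the bits. Only the base V extension is needed; the helper name is invented.

```c
#include <riscv_vector.h>
#include <stdint.h>
#include <stddef.h>

// Sketch: BF16 -> FP32 without any BF16 vector instructions, for
// pre-RVA23 hardware. Zero-extend each 16-bit value, shift it into
// the high half, and reinterpret the bits as FP32.
static inline vfloat32m2_t bf16_to_f32_fallback(const uint16_t *src,
                                                size_t vl)
{
    vuint16m1_t v16 = __riscv_vle16_v_u16m1(src, vl);
    vuint32m2_t v32 = __riscv_vzext_vf2_u32m2(v16, vl);  // widen to 32 bits
    v32 = __riscv_vsll_vx_u32m2(v32, 16, vl);            // bits into high half
    return __riscv_vreinterpret_v_u32m2_f32m2(v32);      // view as FP32
}
```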

ChipKerchner marked this pull request as ready for review February 13, 2026 17:56
ChipKerchner changed the title from "Added ability to accumulate in FP16 for GEMM for RISC-V" to "Added ability to accumulate in FP16 and one set of conversions for BF16 for GEMM in RISC-V" Feb 13, 2026
ChipKerchner (Contributor Author):

BF16 GEMM now works for pre-RVA23 systems like the BananaPi.

ChipKerchner changed the title from "Added ability to accumulate in FP16 and one set of conversions for BF16 for GEMM in RISC-V" to "Added ability to accumulate in FP16. Convert BF16 to FP32. For FP16 and BF16 GEMM in RISC-V (BF16 now works for pre-RVA23)" Feb 15, 2026