Is your feature request related to a problem? Please describe.
fms-mo's Triton matmul kernels already support emulating a reduced number of accumulation bits, but there is no documentation or example showing how to use this feature.
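To clarify what "emulating accumulation bits" means in general, here is a minimal NumPy sketch. It is not fms-mo's API. `truncate_mantissa` and `matmul_emulated_acc` are hypothetical names, and the sketch assumes a float32 accumulator whose mantissa is truncated (rounded toward zero) after every multiply-accumulate step along K. The actual Triton kernels may round differently or apply the reduction per tile.

```python
import numpy as np


def truncate_mantissa(x, mantissa_bits: int) -> np.ndarray:
    """Keep only the top `mantissa_bits` of each float32 mantissa (round toward zero).

    Hypothetical helper for illustration; not part of fms-mo.
    """
    if not 0 <= mantissa_bits <= 23:
        raise ValueError("mantissa_bits must be in [0, 23] for float32")
    drop = 23 - mantissa_bits
    # Reinterpret the float32 bit pattern as uint32 so the low mantissa bits can be masked.
    bits = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    mask = np.uint32((0xFFFFFFFF >> drop) << drop)
    return (bits & mask).view(np.float32)


def matmul_emulated_acc(a, b, mantissa_bits: int) -> np.ndarray:
    """Compute a @ b with a reduced-precision accumulator (hypothetical sketch)."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    if a.ndim != 2 or b.ndim != 2 or a.shape[1] != b.shape[0]:
        raise ValueError("expected 2-D inputs with matching inner dimension")
    acc = np.zeros((a.shape[0], b.shape[1]), dtype=np.float32)
    for k in range(a.shape[1]):
        # One rank-1 update per step along K, then drop the accumulator's low mantissa bits.
        acc = truncate_mantissa(acc + np.outer(a[:, k], b[k, :]), mantissa_bits)
    return acc
```

For example, summing 64 ones with a 4-bit mantissa accumulator stalls at 32.0, because 33 is not representable with 4 mantissa bits and truncates back to 32. With 23 bits the result is the exact 64.0. This kind of "swamping" effect is what an accumulation-bit demo would make visible.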
Describe the solution you'd like
A clear and concise example to demonstrate how to use these kernels.
Describe alternatives you've considered
NA
Additional context
The documentation should also include a demo of accumulation-bit emulation running on GPU.