ggml-hexagon: gelu optimization #18151
base: master
Conversation
Obviously, I need to do more cleanup for this PR. But if you think this approach is good, I can probably apply it to all the activation/unary functions and the binary ops (mul, add, sub) and similar functions. I think this should bring some improvement for all models, especially for long contexts.
ggml/src/ggml-hexagon/htp/main.c (outdated)
  ctx->n_threads = n_hvx;
  for (int i = 0; i < ctx->n_threads; i++) {
-     ctx->dma[i] = dma_queue_create(HTP_SPAD_SRC0_NROWS * 2);
+     ctx->dma[i] = dma_queue_create(ctx->vtcm_size); // NOTE: for now, whole vtcm size
Nit: it looks like the capacity here is the max DMA queue depth, which in the mul_mat case is HTP_SPAD_SRC0_NROWS * 2 for each thread. It's not the VTCM size for each thread, I think...
Yes. But with this PR, each thread could now theoretically fetch up to the whole VTCM size, so I set it to the size of the VTCM instead.
Though theoretically it's not a size in bytes, it's the count of rows that are going to be fetched into VTCM...
Actually, taking another look, it is not a size in bytes or a count of rows either. It seems to me it is the maximum size of the DMA linked list that the queue can support. Maybe I should set it to something like total_threads * (4 * 2), where 4 stands for the number of inputs/outputs and 2 for the ping-pong buffers.
It's a number of DMA descriptors, i.e. the number of DMA transactions that can be enqueued at a time.
The DMA engine is per thread, i.e. each HW/SW thread literally has its own DMA HW state.
So there is no need to include the number of threads in the sizing.
Also, the size of the queue doesn't affect performance; it's just an array of preallocated descriptors.
So we can just set it to something reasonable like 64 (i.e. 64 outstanding transactions per thread), add an error message for a full queue in debug builds, and that should be good enough.
see f3e9f6a
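For reference, a minimal sketch of what the suggested sizing could look like, mirroring the main.c snippet above (the actual change is in f3e9f6a; HTP_DMA_QUEUE_DEPTH is a hypothetical name, not taken from the PR):

// Hypothetical sketch only: assumes dma_queue_create() takes the number of
// preallocated DMA descriptors, i.e. how many transactions can be outstanding
// per thread at once.
#define HTP_DMA_QUEUE_DEPTH 64  // 64 outstanding transactions per thread

for (int i = 0; i < ctx->n_threads; i++) {
    ctx->dma[i] = dma_queue_create(HTP_DMA_QUEUE_DEPTH);
}
// In debug builds, the enqueue path could additionally report an error (or assert)
// when the queue is full, so that an undersized queue is caught early.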
Good timing.
Thanks for confirming! Should I go ahead and apply the same optimization to all the other functions, or wait until you have had a thorough look at it and I have fixed the code format/style before extending it to the other functions?
Following the discussion regarding the refactor idea here, I have re-evaluated the approach. Upon reviewing the current implementations for activation, unary, and binary functions, I observed that the existing code heavily relies on L2 prefetching while under-utilizing the VTCM and DMA.
While L2 prefetching offers some benefits, it is limited by two main factors:
This PR optimizes the code (using GELU as the initial implementation) by shifting the workload to make heavy use of the VTCM and DMA, thereby freeing up the L2 cache to serve the L1 instruction and data caches.
Optimization Strategy
Instead of relying on L2 prefetching, this implementation employs a DMA ping-pong buffering approach:
This allows for overlapping computation and memory transfer, resulting in significantly higher throughput.
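Roughly, the scheme looks like the sketch below. This is an illustration of the ping-pong idea rather than the PR code: dma_start_read/dma_start_write/dma_wait_for and compute_gelu_chunk are hypothetical stand-ins for the HTP DMA queue API and the HVX kernel, and n is assumed to be a multiple of chunk for brevity.

#include <stddef.h>

// Hypothetical helpers standing in for the HTP DMA queue API and the HVX kernel.
void dma_start_read (float * vtcm_dst, const float * ddr_src, size_t n);  // async DDR -> VTCM
void dma_start_write(float * ddr_dst, const float * vtcm_src, size_t n);  // async VTCM -> DDR
void dma_wait_for   (const float * buf);  // wait until transfers touching buf are done (no-op if none)
void compute_gelu_chunk(float * vtcm_buf, size_t n);  // HVX GELU kernel, in place

// Process n elements in chunk-sized pieces using two VTCM buffers:
// while one chunk is being computed, the next one is already streaming in via DMA.
void gelu_pingpong(const float * src, float * dst, size_t n, size_t chunk,
                   float * vtcm[2] /* two chunk-sized, HVX-aligned VTCM buffers */) {
    int cur = 0;
    dma_start_read(vtcm[cur], src, chunk);                            // prefetch the first chunk
    for (size_t off = 0; off < n; off += chunk) {
        if (off + chunk < n) {
            dma_wait_for(vtcm[cur ^ 1]);                              // its previous write-back must be done
            dma_start_read(vtcm[cur ^ 1], src + off + chunk, chunk);  // fetch next chunk in the background
        }
        dma_wait_for(vtcm[cur]);                                      // current chunk has landed in VTCM
        compute_gelu_chunk(vtcm[cur], chunk);                         // compute on aligned VTCM data
        dma_start_write(dst + off, vtcm[cur], chunk);                 // push results back to DDR
        cur ^= 1;                                                     // swap ping and pong
    }
    dma_wait_for(vtcm[0]);                                            // drain outstanding write-backs
    dma_wait_for(vtcm[1]);
}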
Performance Benchmarks
The performance improvements are significant. Below is a comparison between the existing implementation and the new DMA/VTCM approach:
NOTE: I used GELU as an example, but this approach can easily be extended to other operations.
Unaligned Load Resolution:
This approach inherently solves the unaligned load issues encountered previously. Since data is fetched from DDR via DMA, the DMA engine ensures that data is stored at aligned addresses within the VTCM, even if the source data in DDR is unaligned.
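As a small illustration of why the alignment issue goes away (a hypothetical sketch: dma_copy_to_vtcm() is a stand-in for the DMA helper, the buffer size is arbitrary, and placement of the buffer into VTCM is assumed to be handled elsewhere):

#include <stddef.h>
#include <stdint.h>

#define HVX_VEC_BYTES 128

void dma_copy_to_vtcm(void * vtcm_dst, const void * ddr_src, size_t nbytes);  // hypothetical DMA helper

// Staging buffer aligned for HVX; in the real code this lives in VTCM.
static uint8_t vtcm_chunk[8 * 1024] __attribute__((aligned(HVX_VEC_BYTES)));

void stage_chunk(const uint8_t * ddr_src /* possibly unaligned */, size_t nbytes) {
    // The DMA engine absorbs the unaligned DDR side; the VTCM destination is
    // aligned by construction, so the HVX kernel can use aligned 128-byte loads.
    dma_copy_to_vtcm(vtcm_chunk, ddr_src, nbytes);
}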
@max-krasnyansky