This could run faster #17

Open

ryao opened this issue Nov 29, 2024 · 0 comments

ryao commented Nov 29, 2024

I have been experimenting with a local fork on my Ryzen 7 5800X, where a single core can saturate memory bandwidth (so OpenMP doesn't make things faster). I found that this code can run significantly faster. Here are some numbers:

This code (upstream llama3.c):

pp tok/s: ~1.0
tg tok/s: ~1.0

My fork's performance:

pp tok/s: 53.70
tg tok/s: 1.33

llama.cpp performance:

pp tok/s: 46.14
tg tok/s: 1.29

I should note that these are close to best-case numbers, and there is run-to-run variance. My fork is very sensitive to other things sharing the CPU, and performance will drop significantly if you do something else while it runs.

These performance numbers are from running the following test prompt:

env OMP_PLACES=cores ./run "llama3_8b_base.bin" -z "tokenizer.bin" -s 10000 -i "Once upon a time, there was a beautiful princess who fell in love with a prince. It was love at first sight, and they couldn’t stop thinking about each other.
The princess and the prince lived in different kingdoms, but they met at a grand ball. The prince was enchanted by the princess’s beauty, and they danced all night long. As the sun rose, they promised to meet again.
The princess and the prince were separated by their kingdoms, but their love was stronger than any obstacle. They wrote letters to each other and exchanged gifts. They were determined to be together, no matter what.
One day, the princess received news that the prince had been captured by a powerful enemy. She knew she had to save him, no matter the cost. She gathered her courage and set out on a journey to rescue her prince.
The princess faced many dangers on her journey, but she never gave up. She fought off dragons, navigated treacherous terrain, and outsmarted her enemies. Finally, she reached the castle where the prince was being held captive.
The princess confronted the enemy and demanded that the prince be released. The enemy tried to thwart her plans, but she was too strong and too determined. Eventually, the prince was freed, and the princess and prince were reunited.
They lived happily ever after, and their love was stronger than anything the world could throw at them. They knew that they would always be together, no matter what happened.
In the end, the princess and the prince’s love story reminds us that true love is worth fighting for. It shows us that with determination, courage, and love, anything is"

The prompt was generated from "Once upon a time", and I included all but the last few tokens so that I could test prompt processing performance. Interestingly, even though the same seed should mean different things to llama3.c and llama.cpp, giving the same input to llama.cpp produces the same output:

./llama-cli -m Llama-3-8B-f32.ggml -s 10000 -p "Once upon a time, there was a beautiful princess who fell in love with a prince. It was love at first sight, and they couldn’t stop thinking about each other.
The princess and the prince lived in different kingdoms, but they met at a grand ball. The prince was enchanted by the princess’s beauty, and they danced all night long. As the sun rose, they promised to meet again.
The princess and the prince were separated by their kingdoms, but their love was stronger than any obstacle. They wrote letters to each other and exchanged gifts. They were determined to be together, no matter what.
One day, the princess received news that the prince had been captured by a powerful enemy. She knew she had to save him, no matter the cost. She gathered her courage and set out on a journey to rescue her prince.
The princess faced many dangers on her journey, but she never gave up. She fought off dragons, navigated treacherous terrain, and outsmarted her enemies. Finally, she reached the castle where the prince was being held captive.
The princess confronted the enemy and demanded that the prince be released. The enemy tried to thwart her plans, but she was too strong and too determined. Eventually, the prince was freed, and the princess and prince were reunited.
They lived happily ever after, and their love was stronger than anything the world could throw at them. They knew that they would always be together, no matter what happened.
In the end, the princess and the prince’s love story reminds us that true love is worth fighting for. It shows us that with determination, courage, and love, anything is"

Note that I am compiling with GCC 14.1.1 and linking against Intel MKL 2023.1.0.

My fork is here:

https://github.com/ryao/llama3.c

I had not originally intended to post my results yet, but a potential collaborator whom I contacted about my WIP results asked me to push to GitHub. Since it is now on GitHub, I might as well share my results.

The performance comes from a few things:

  • A dedicated prompt processing function that uses matrix-matrix multiplication to process all input tokens in parallel (except for the attention calculations, which are not parallelized across input tokens).
  • The use of the Intel MKL, especially cblas_sgemm_batch(), which increases performance in ways that I don't fully understand.
  • Modifying the attention calculations to use cblas_sgemm_batch(). I don't fully understand this part yet: after I repeatedly asked Gemini 1.5 Pro whether it could think of a way to use BLAS for the attention calculations, it produced a cblas_sgemm() call for the loop after softmax that shouldn't be compatible with SGEMM, yet the outputs are identical when I run my test prompts. Why this even works needs research. Converting it to cblas_sgemm_batch() pushed performance above llama.cpp (see the sketch after this list).
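For reference, here is a minimal sketch (my reconstruction, not the fork's actual code) of how the per-head score computation scores_h = Q_h · K_h^T can be expressed as a single cblas_sgemm_batch() call; the memory layout and all names below are assumptions for illustration:

```c
#include <math.h>
#include <mkl.h>

/* Hypothetical sketch: batch the per-head attention score computation
 * scores_h = (1/sqrt(head_dim)) * Q_h * K_h^T into one cblas_sgemm_batch()
 * call. The layout assumptions (contiguous per head) are mine:
 *   q, k:   [n_heads][seq_len][head_dim]
 *   scores: [n_heads][seq_len][seq_len]
 */
static void attention_scores_batched(const float *q, const float *k,
                                     float *scores, int n_heads,
                                     int seq_len, int head_dim) {
    /* All heads share the same shape, so a single group suffices. */
    CBLAS_TRANSPOSE transA[1] = { CblasNoTrans };
    CBLAS_TRANSPOSE transB[1] = { CblasTrans };
    MKL_INT m[1] = { seq_len }, n[1] = { seq_len }, kk[1] = { head_dim };
    MKL_INT lda[1] = { head_dim }, ldb[1] = { head_dim }, ldc[1] = { seq_len };
    float alpha[1] = { 1.0f / sqrtf((float)head_dim) }; /* fold in the softmax scale */
    float beta[1] = { 0.0f };
    MKL_INT group_size[1] = { n_heads };

    const float *a[n_heads], *b[n_heads];
    float *c[n_heads];
    for (int h = 0; h < n_heads; h++) {
        a[h] = q + (size_t)h * seq_len * head_dim;
        b[h] = k + (size_t)h * seq_len * head_dim;
        c[h] = scores + (size_t)h * seq_len * seq_len;
    }
    cblas_sgemm_batch(CblasRowMajor, transA, transB, m, n, kk,
                      alpha, a, lda, b, ldb, beta, c, ldc,
                      1 /* group_count */, group_size);
}
```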

There are also some miscellaneous performance changes, such as precomputing certain "expensive" operations that were being redundantly computed many times in the loop that processes the llama 3 layers, although these are cheap compared to the matrix operations, so the improvement is not very visible in benchmarks. I also changed how timing is done so that the new prompt processing code's performance can be measured properly (since you cannot really skip the first token when all input tokens are processed in parallel).
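I won't enumerate the hoisted operations here, but as one plausible illustration (my guess, not necessarily what the fork does): the RoPE angles in llama-style loops depend only on the position and head dimension, so the powf()/cosf()/sinf() calls can be computed once per position instead of once per layer:

```c
#include <math.h>

/* Illustrative sketch (names are mine, not the fork's): precompute the
 * RoPE rotation coefficients once per position. Inside the per-layer
 * loop these would otherwise be recomputed identically for every layer.
 * 500000.0f is the llama 3 rope theta. */
static void rope_precompute(float *cos_tab, float *sin_tab,
                            int pos, int head_dim) {
    for (int i = 0; i < head_dim / 2; i++) {
        float freq = 1.0f / powf(500000.0f, (float)(2 * i) / (float)head_dim);
        cos_tab[i] = cosf((float)pos * freq);
        sin_tab[i] = sinf((float)pos * freq);
    }
}
```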

There is support for building with OpenBLAS or standalone as alternatives to the Intel MKL, but:

  • The OpenBLAS build has a regression in token generation performance (reverting forward() to use matmul() should fix it).
  • The standalone build is very slow because its generic sgemm routines are not properly optimized. I eventually want to make this fully OSS (so that the MKL blobs are not needed to run fast), which will likely mean working out how to write my own fast sgemm routines, but for now the standalone build is slow.

I was able to make the matmul() routine (which is really a simplified sgemv) at least 20% faster on my machine by using AVX2 instructions. On my Ryzen 7 5800X, it runs faster than both OpenBLAS's and the Intel MKL's cblas_sgemv() (and cblas_sgemm() when abused as an sgemv, for that matter). However, the moment you can leverage the Intel MKL's cblas_sgemm_batch() to do the equivalent work, my routine is slower, and I am not sure how that is even possible for something memory-bandwidth bound. It shouldn't be the multithreading, since my CPU in particular allows one core to use the full bandwidth.
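As a rough sketch of the shape of that kernel (my reconstruction under stated assumptions, not the fork's exact code): compute eight output elements per pass with FMA accumulators, then hand all eight accumulators to the 8-wide horizontal reduction described next:

```c
#include <immintrin.h>

/* Sketch of an AVX2/FMA sgemv-style matmul: y = W * x, with W stored
 * row-major as (d x n). Eight rows are processed per pass so that the
 * final reduction can finish all eight accumulators at once via
 * hsum8_avx2() (sketched below). Assumes n % 8 == 0 and d % 8 == 0. */
static __m256 hsum8_avx2(__m256, __m256, __m256, __m256,
                         __m256, __m256, __m256, __m256);

static void matmul_avx2(float *y, const float *x, const float *W,
                        int n, int d) {
    for (int i = 0; i < d; i += 8) {
        __m256 acc[8];
        for (int r = 0; r < 8; r++) acc[r] = _mm256_setzero_ps();
        for (int j = 0; j < n; j += 8) {
            __m256 xv = _mm256_loadu_ps(x + j);
            for (int r = 0; r < 8; r++) {
                const float *row = W + (size_t)(i + r) * n + j;
                acc[r] = _mm256_fmadd_ps(_mm256_loadu_ps(row), xv, acc[r]);
            }
        }
        /* One 8-wide horizontal reduction finishes eight dot products. */
        _mm256_storeu_ps(y + i, hsum8_avx2(acc[0], acc[1], acc[2], acc[3],
                                           acc[4], acc[5], acc[6], acc[7]));
    }
}
```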

I believe the innovation that makes my matmul() (sgemv) faster than the existing cblas_sgemv() routines in OpenBLAS and the Intel MKL is the way I do horizontal addition, which I have not seen described elsewhere. Instead of doing one horizontal addition at a time, as everyone else seems to do, I do eight simultaneously. This uses 9 instructions (10 if you count the memory write) and is faster than the AVX2 horizontal-addition techniques I have seen discussed elsewhere, all of which do only one horizontal addition at a time. Since the loop is already unrolled eight ways, I worked out how to do eight in parallel, which is faster than doing one at a time eight times.
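Here is one way to do an 8-wide reduction in exactly nine instructions (six hadds, a blend, a lane permute, and an add); this is my reconstruction of the idea, and the fork's actual sequence may differ:

```c
#include <immintrin.h>

/* Reduce eight AVX2 accumulators, each holding 8 partial sums, into a
 * single __m256 whose lane r is the full horizontal sum of vr.
 * Nine instructions; ten counting the caller's store. */
static __m256 hsum8_avx2(__m256 v0, __m256 v1, __m256 v2, __m256 v3,
                         __m256 v4, __m256 v5, __m256 v6, __m256 v7) {
    __m256 s01 = _mm256_hadd_ps(v0, v1);  /* pairwise sums, per 128-bit lane */
    __m256 s23 = _mm256_hadd_ps(v2, v3);
    __m256 s45 = _mm256_hadd_ps(v4, v5);
    __m256 s67 = _mm256_hadd_ps(v6, v7);
    /* q0: low lane = sums of elements 0-3 of v0..v3,
     *     high lane = sums of elements 4-7 of v0..v3; q1 likewise for v4..v7. */
    __m256 q0 = _mm256_hadd_ps(s01, s23);
    __m256 q1 = _mm256_hadd_ps(s45, s67);
    /* Combine so each final lane gets both halves of its sum. */
    __m256 a = _mm256_blend_ps(q0, q1, 0xF0);        /* q0.low | q1.high */
    __m256 b = _mm256_permute2f128_ps(q0, q1, 0x21); /* q0.high | q1.low */
    return _mm256_add_ps(a, b);
}
```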

That said, I hope others find the techniques I used to make this faster useful and educational; I know I have. I have not fully exhausted my ideas for going faster, although each improvement makes the next one harder. I also really want to figure out how to write fast sgemm routines, since that would open optimization opportunities I simply cannot exploit right now, such as folding the max-value search of softmax into the matrix operation essentially for free, an idea inspired by flash attention (although flash attention itself sadly does not appear applicable to llama 3 at first glance). I do not expect that to change performance significantly, since nearly all CPU time is spent in matrix operations, but it is an example of what becomes possible once I have my own fast sgemm code.
