Pesky little details

or: Help Needed!

If you're not ashamed of your first demo, you released too late.

I definitely didn't release too late.

Performance

If you lower the effort to 0%, you will see that each loop still takes a noticeable amount of time - 15ms on my M2 Pro. In the GPU profiler you can see gaps between kernel invocations.

I know that llama.cpp and the others don't have this issue, so I assume there's something wrong in the way I implemented it all.

This is the one thing I have no idea how to fix myself. To anyone who helps out, I promise to ship a bottle of the best whatever-you-choose, wherever in the world you are.

Attention and longer contexts

I haven't bothered optimising the attention calculations so far, so the implementation slows down quickly once you go into larger contexts. This is a limitation of the current implementation, though, and it can be fixed to match regular inference.

KL Distance metric

Right after publication, I was provided with a wonderful KL Distance metric for measuring how far an approximation is from the base model. It's fast to implement, the initial tests are done, and I'll be publishing the results soon.

The gist so far is that the results are okay-ish. The algorithm doesn't seem to win against quantization yet. It will take a few days to do proper testing and write a summary, but if you want a sneak peek - the first charts are here and here. This is the top priority now, and I will publish a detailed writeup on the front page as soon as the testing is completed.
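For readers who haven't met the metric, here is a minimal sketch of how such a distance could be computed. It assumes that "KL Distance" means the standard Kullback-Leibler divergence between the base model's next-token distribution and the approximation's, computed from raw logits. The function names are made up for illustration and are not part of Effort.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating, for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(base_logits, approx_logits, eps=1e-12):
    # KL(P || Q), where P comes from the base model and Q from the approximation.
    # eps guards against log(0); it is an illustrative choice, not a tuned value.
    p = softmax(base_logits)
    q = softmax(approx_logits)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mean_kl(base_batch, approx_batch):
    # Average the per-token divergence over a batch of token positions.
    pairs = list(zip(base_batch, approx_batch))
    return sum(kl_divergence(b, a) for b, a in pairs) / len(pairs)
```

A value of 0 means both models produce identical distributions for that token, and larger values mean the approximation has drifted further from the base model.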

Testing on MathEval and HellaSWAG

Please keep in mind that this all has been a one-person operation so far.

I would love to have the method tested on MathEval, HumanEval and HellaSWAG, but the main issues right now are the implementation speed, and the lack of proper testing libraries.

Because of that, it would take a good setup of M2/M3 machines in the cloud to do the tests above in a sensible time. The existing testing libraries also assume you are testing a model, not an inference method - so I need to either rewrite them to use Effort's API, or rewrite them from scratch in Swift. If anyone's up for the task, please reach out. Otherwise, for the time being, I will be focusing on simpler tests.

From internal tests so far, the better the model and the implementation, the more resilient it is to a drop in effort, so I feel it's honest to publish it with the tests as they are.

Finally, once the bugs are fixed, the speed needs to be fixed as well. Without that, it will either be very costly to rent a server farm filled with Apple Silicon, or it will take forever to gather reliable data. Remember that we need to rerun the same batch of tests for the whole range of effort - from 100% to 10%.
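To show why the cost multiplies, here is a rough sketch of such a sweep. `run_tests` is a hypothetical callable that stands in for whatever benchmark harness ends up being used, and the 10-point step is an assumption. Neither comes from the Effort codebase.

```python
def sweep_effort(run_tests, start=100, stop=10, step=10):
    """Run the same test batch once per effort level, from start down to stop."""
    results = {}
    for effort in range(start, stop - 1, -step):
        # run_tests is assumed to take effort as a fraction (1.0 = 100%)
        # and return a score for the whole batch.
        results[effort] = run_tests(effort / 100.0)
    return results
```

With the default settings this means ten full runs of the benchmark, so any per-run slowness is paid ten times over.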

Help will be much appreciated

Feel free to reach out to kolinko@gmail.com, or on Twitter, @kolinko.

Especially if you have experience with LLM implementation, GPU programming, or if you'd like to implement bucketMul in other architectures (llama.cpp / transformers lib / MLX).

Thank you, and thanks for the understanding.

What's next?

- Download and Run

- About the Author(s)

- Citations, notes and so on

Or going back...

- The landing page.

- The basics.

- Introducing bucketMul.

- GPU implementation.