Effort
A possibly new algorithm for LLM Inference
With Effort, you can smoothly adjust, in real time, the number of calculations performed during the inference of an LLM.
At 50% effort, it performs as fast as regular matrix multiplications on Apple Silicon chips; at 25% effort, it is twice as fast while still retaining most of the quality.
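To build intuition for what the effort setting controls, here is a minimal NumPy sketch of the general idea: perform only the largest contributions to a matrix-vector product and skip the rest. The function name, the column-selection rule, and the matrix sizes are my own simplifications for illustration; the actual implementation on Apple Silicon works differently and is covered in the deep dive linked below.

```python
import numpy as np

def approx_matvec(W, x, effort=0.5):
    """Approximate W @ x by multiplying only the most important columns.

    `effort` is the fraction of multiplications actually performed;
    effort=1.0 reproduces the exact product. Illustrative only."""
    k = max(1, int(effort * x.shape[0]))      # how many input dims to keep
    top = np.argsort(np.abs(x))[-k:]          # indices of the largest |x[i]|
    return W[:, top] @ x[top]                 # skip all the other products

rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096)).astype(np.float32)
x = rng.standard_normal(4096).astype(np.float32)

y_full = approx_matvec(W, x, effort=1.0)      # exact result
y_half = approx_matvec(W, x, effort=0.5)      # half the multiplications
y_quarter = approx_matvec(W, x, effort=0.25)  # a quarter of them
```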
You also have the option to skip loading the least important weights.
It is currently implemented for Mistral, but it should work equally well for all other models without retraining—only conversion to a different format and some precomputation are necessary.
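To give a flavor of what that precomputation might involve, here is a hedged sketch: ordering each row's weights by magnitude ahead of time, so the most important ones can be processed first at inference. The names and the exact layout are hypothetical and are not the project's converted format.

```python
import numpy as np

def precompute_ordering(W):
    """One-time conversion sketch (hypothetical layout): for each output row,
    store the columns sorted by descending |weight|, so that at inference the
    biggest weights can be visited first and the rest skipped."""
    order = np.argsort(-np.abs(W), axis=1)                # per-row importance order
    sorted_weights = np.take_along_axis(W, order, axis=1)
    return order.astype(np.int32), sorted_weights

# Would be run once per weight matrix of the model, offline.
rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096)).astype(np.float32)
col_order, W_sorted = precompute_ordering(W)
```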
You can download the implementation from GitHub. It should run right after fetching the converted weights.
The implementation currently supports FP16 only. The multiplications themselves are fast, but the overall inference still needs improvement in some non-essential parts, such as the softmax and attention-summation operations.
Mixtral and Q8 are in the works.
Oh, and there’s also the option to dynamically adjust how much of the model loads into memory. You can leave out the least important 10-20-30% of weights during loading. No conversion steps needed—it simply loads less data. It’s kind of like an ad-hoc distillation, if you will.
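Here is a rough simulation of that idea, assuming importance is simply weight magnitude: zero out the smallest fraction of a matrix instead of loading it. In the actual loader this just means reading less data from the converted files; the threshold rule and names below are my own.

```python
import numpy as np

def load_truncated(W, skip_fraction=0.2):
    """Sketch of the 'ad-hoc distillation' idea: drop the least important
    weights at load time. Here importance is assumed to be |weight|, and
    dropping is simulated by zeroing; the real loader just reads less data."""
    k = int(skip_fraction * W.size)                 # number of weights to drop
    if k == 0:
        return W.copy()
    cutoff = np.partition(np.abs(W).ravel(), k)[k]  # magnitude threshold
    W_small = W.copy()
    W_small[np.abs(W_small) < cutoff] = 0.0         # pretend these were never loaded
    return W_small

rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096)).astype(np.float32)
W_80pct = load_truncated(W, skip_fraction=0.20)     # leave out the bottom ~20%
```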
Let's see it in action now.
You can download and test it yourself from GitHub.
Returning to the topic of benchmarks...
Let's now discuss quality, starting with the multiplication approximation itself.
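As a sanity check you can run yourself, here is one way to quantify the approximation error, using the same simplified column-selection rule as the sketch above: cosine similarity between the exact and the approximated product at several effort levels. The numbers it prints for random Gaussian data are not the project's benchmarks; real activations behave differently.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096)).astype(np.float32)
x = rng.standard_normal(4096).astype(np.float32)
exact = W @ x

for effort in (1.0, 0.5, 0.25, 0.1):
    k = max(1, int(effort * x.size))
    top = np.argsort(np.abs(x))[-k:]       # same simplified selection as above
    approx = W[:, top] @ x[top]
    print(f"effort {effort:>4.0%}  cosine similarity {cosine(exact, approx):.4f}")
```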
Turning our attention to the model itself.
And basic QA tests:
If you're still skeptical—as I would be—please visit the 'Help Needed!' section to understand what is required for improved testing.
The initial results (and undocumented experiments with Mixtral) seem robust enough to warrant publication. I hope, though, that the above is enough to convince you to play with the 0.0.1B version.
- Deep dive into the algorithm
- MoE, quantization and the others
- Pesky details (or: Help Needed!)

And of course...

- Citations, additional notes, and related resources