# MoE, Quantization and the others

## Mixture of Experts

Mixtral is one bug away from working fully and nicely.

It was working perfectly a few iterations ago, with an even higher tolerance for low Effort levels (~16-20% giving decent results IIRC). But I messed something up when implementing Mistral, and I just cannot find the bug.

**I hope someone helps find the bug,** although the best approach is probably to refactor the whole code and clean everything up - the bug should pop up then.

By "one bug away" I mean that it will deliver some text when loaded at 100% effort and 75% mem load, but it switches to garbage very fast.

As for bigger models - my bet is that they will be even more robust against lowering the effort. Having said that, this implementation has the Mistral/Mixtral weights hardcoded in multiple places, and you may start bumping into GPU cache limitations with the larger models. That shouldn't be too difficult to fix, because you can always cut up a weight matrix into smaller ones, but someone needs to do the engineering part.
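A minimal sketch of the "cut up a weight matrix" idea, in plain Python rather than the repo's Swift/Metal. The helper names (`matvec`, `matvec_in_chunks`) and the chunk size are made up for illustration, and splitting along rows is an assumption; the only point is that multiplying the blocks separately and concatenating the partial outputs gives the same result as one big multiplication.

```python
def matvec(weights, vector):
    """Plain dense matrix-vector product; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def matvec_in_chunks(weights, vector, rows_per_chunk):
    """Same product, computed one block of rows at a time.

    Each block is an independent, smaller matrix, so it could be
    dispatched separately; the partial outputs are simply concatenated.
    """
    output = []
    for start in range(0, len(weights), rows_per_chunk):
        block = weights[start:start + rows_per_chunk]
        output.extend(matvec(block, vector))
    return output


weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
vector = [0.5, -1.0]

full = matvec(weights, vector)
chunked = matvec_in_chunks(weights, vector, rows_per_chunk=2)
print(full == chunked)
```

The last chunk here is smaller than the others (5 rows, chunks of 2), which is the case a real implementation would also have to handle.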

Llama 7B should work without problems if implemented - bucketMul was originally developed and tested with it. From this I assume bucketMul should generalize to most LLMs.

## Quantization

I did some testing of Q8 quantization and the results look promising, but I didn't have enough time for the full implementation before the first release.

My guess is that Q8 is possible and relatively easy to do. Lower-bit quantizations would be very tricky to pull off.

If you're new to quantization, I highly recommend reading "Introduction to Weight Quantization" by Maxime Labonne. It outlines the methods, and ours are the same.

The two main challenges we face in bringing BucketMul to Q8 are the number of bits and the speed.

### Speed

With Q8 quantization, the first thing we need to do is change the bucket size to 8.

Coincidentally, for Q8 we need size 8 buckets and for FP16 we need size 16. This is determined by the GPU architecture; on other architectures we might want to use size 16 buckets for Q8, or the other way around.

### Number of bits

Our approach with quantization needs to account for the positional information we have encoded within each weight.

Out of the 8 bits we have available, 1 bit is taken for the sign and 3 bits for the position, so we are left with just 4 bits to encode the value.
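To make the bit budget concrete, here is a small sketch of packing the three fields into one byte. The field order (sign in the top bit, then position, then value) and the function names are assumptions for illustration; the actual layout used in convert.swift/convert.metal may differ.

```python
def pack_weight(sign, position, value):
    """Pack sign (0/1), position (0..7) and value (0..15) into one byte.

    Assumed layout, high bit to low bit: S PPP VVVV
    """
    if sign not in (0, 1):
        raise ValueError("sign must be 0 or 1")
    if not 0 <= position < 8:
        raise ValueError("position must fit in 3 bits")
    if not 0 <= value < 16:
        raise ValueError("value must fit in 4 bits")
    return (sign << 7) | (position << 4) | value


def unpack_weight(byte):
    """Split a packed byte back into (sign, position, value)."""
    sign = (byte >> 7) & 0b1
    position = (byte >> 4) & 0b111
    value = byte & 0b1111
    return sign, position, value


packed = pack_weight(sign=1, position=5, value=9)
print(f"{packed:08b}", unpack_weight(packed))
```

Three position bits address exactly 8 slots, which matches a bucket size of 8, and the 4 value bits give only 16 distinct magnitudes per weight.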

Not all is lost though.

Since we sorted/bucketed the weights, **we can calculate min/max ranges for each bucket separately**. The values in a bucket therefore span a relatively small interval - less than about two orders of magnitude. For example, bucket 1 will have items ranging from 0.001 to 0.06. So our encoding needs to cover just a small span, whereas a traditional Q8 needs to cover a much wider range - say, from 0.00001 to 0.1.
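A rough sketch of why the per-bucket range helps, using the example numbers above. It assumes the 4 value bits are spread linearly between the bucket's min and max; that spacing and the helper names are assumptions for illustration, not necessarily what the repo does.

```python
LEVELS = 16  # 4 value bits give 16 codes


def quantize(magnitude, lo, hi):
    """Map a magnitude in [lo, hi] to a 4-bit code (0..15), linearly."""
    if hi == lo:
        return 0
    scaled = (magnitude - lo) / (hi - lo) * (LEVELS - 1)
    return max(0, min(LEVELS - 1, round(scaled)))


def dequantize(code, lo, hi):
    """Map a 4-bit code back to a magnitude in [lo, hi]."""
    return lo + code / (LEVELS - 1) * (hi - lo)


def max_error(magnitudes, lo, hi):
    """Worst absolute round-trip error over the given magnitudes."""
    return max(
        abs(m - dequantize(quantize(m, lo, hi), lo, hi)) for m in magnitudes
    )


# Magnitudes from one hypothetical bucket, within the 0.001..0.06 example range.
bucket = [0.001, 0.004, 0.012, 0.03, 0.06]

per_bucket_error = max_error(bucket, min(bucket), max(bucket))
global_error = max_error(bucket, 0.00001, 0.1)
print(per_bucket_error, global_error)
```

With the same 16 codes, the narrower per-bucket range gives a finer step, so the worst-case error for these sample values is smaller than with a single range covering every bucket.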

So when encoding Q8, we note down the min/max values for each slice stat (see convert.swift/convert.metal), and then in decoding the numbers we

There is code for BucketMulQ8 available in the repo - it's from a few iterations ago, but it should be working, and so should the converted weights.

### What is missing in Quantization

- outliers, if necessary

- testing - the model seemed to produce total garbage, but my bet is that it was due to other parts of the code being messed up, not Q8 itself

### The path to full Q8

- wrap up Q8 implementation

- test the cosine similarity (cosSim) score between a multiplication by the quantized matrix and a multiplication by the original one

- if it's decent enough (0.99 for effort = 1), proceed with reimplementing it into the inference engine

- in case the sim score is not good enough, or the inference engine breaks - implement outliers, and that should do the trick
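The cosSim check from the list above could look roughly like this. The quantizer here is a generic symmetric 8-bit round-trip used as a stand-in, not BucketMulQ8, and the matrix is random data rather than real model weights; all names are hypothetical.

```python
import math
import random


def matvec(weights, vector):
    """Plain dense matrix-vector product; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def fake_quantize_row(row, bits=8):
    """Round-trip a row through symmetric integer quantization.

    Stand-in for the real Q8 path: quantize, then immediately dequantize.
    """
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in row) / levels
    if scale == 0.0:
        return list(row)
    return [round(w / scale) * scale for w in row]


rng = random.Random(0)  # fixed seed so the run is repeatable
weights = [[rng.gauss(0.0, 0.05) for _ in range(64)] for _ in range(64)]
vector = [rng.gauss(0.0, 1.0) for _ in range(64)]

reference = matvec(weights, vector)
approx = matvec([fake_quantize_row(row) for row in weights], vector)

score = cosine_similarity(reference, approx)
passes = score >= 0.99  # the threshold mentioned above for effort = 1
print(score, passes)
```

In the real test, `approx` would come from the BucketMulQ8 kernel at effort = 1 and `reference` from the unquantized multiplication on the same input.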

## Turbo Mode!

Right now, Effort is set statically for all the multiplications. But that absolutely doesn't have to be the case. I'm quite sure w2's effort can be lowered without loss of quality (hint: look at the histogram of the input to w2), and this paper (link) suggests the middle layers are not that necessary for good results. There are many experiments that could be done here, and a lot of low-hanging research fruit to be picked.

Another option, although slightly more difficult, is a variable Effort that depends on the given context and the cutoff calculations. One of the early iterations had effort/cutoff chosen such that the areas to the left and to the right of the cutoff, on the probe chart, were similar in size. I removed this functionality to simplify the code for now, but there is research to be done here.

One of the things I was tracking originally was the divergence of states between the effort model and the original model. It was super interesting: at least in Llama, the states diverged up until layer ~10, with their cosine similarity scores dropping to as low as 80%, and then they either went downhill from there, or climbed back up to 98-99% in the final layer.

Having said all of the above, I can imagine variable, token-by-token effort selection. You ask a dumb question and the model answers quickly; you ask a difficult question and the model takes more time to be precise. Or more realistically - probably half of the words are easily predictable in any given text (or 95% in the case of the author of The Da Vinci Code) - so a model could perhaps get by with much lower precision.

## Vector search

I didn't tackle vector database search here at all. But vector search can be seen as just another matrix multiplication. I wonder how this solution stacks up against SOTA vector search algorithms.
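To spell out the "just another matrix multiplication" view: a brute-force search scores every stored vector against the query with one matrix-vector product and then ranks the scores. The sketch below is plain exact search with made-up data and names; whether and how bucketMul could approximate the scoring step is the open question.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def search(database, query, top_k=1):
    """Return indices of the top_k stored vectors by cosine similarity.

    The scoring line is a matrix-vector product: database (rows) times query.
    """
    rows = [normalize(row) for row in database]
    q = normalize(query)
    scores = [sum(r * x for r, x in zip(row, q)) for row in rows]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


database = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
nearest = search(database, [0.9, 0.8, 0.1], top_k=2)
print(nearest)
```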

## Where do we go from here?

- Pesky details (or: Help Needed!)

## At any time

## And of course...

- Citations, notes and so on