Very interesting. I've been looking for an easy way to run a large LLM that could use the 64GB of unified memory on my M1 Max chip.
I don't like interacting with AI in the cloud that collects my data and isn't private.
I was surprised to learn that Apple has released an M3 Max with 128GB of unified memory. That would be really powerful and could run huge models.
I'll let you know how it goes.
Mac Studios and even Mac Minis are very popular options for LLMs because of how unified memory works. Nowhere else can you get ~188GB of VRAM for less than the cost of even a single A100 40GB.
I'm getting 23 tokens per second using the 5-bit Mixtral model.
Macs have a big edge for this.
I would recommend 4-bit: 5-bit isn't much better and takes a lot more RAM. Stick with 4-bit, or go to something like 8-bit if you can get there.
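To see why the bit width matters so much, here's a back-of-the-envelope sketch of how a model's weight footprint scales with bits per parameter. The parameter count and function name are illustrative, and it ignores quantization block overhead and the KV cache, so real files run a bit larger:

```python
# Rough weight-only memory estimate for a quantized model:
# size_bytes ~= parameter_count * bits_per_weight / 8
# (ignores quantization block overhead and the KV cache)
def model_size_gb(params_billions: float, bits: int) -> float:
    return params_billions * 1e9 * bits / 8 / 1e9

# Hypothetical 47B-parameter model at common quantization levels
for bits in (4, 5, 8):
    print(f"{bits}-bit: ~{model_size_gb(47, bits):.1f} GB")
```

So going from 4-bit to 5-bit adds roughly 25% more RAM for the weights alone, which is why 4-bit tends to be the sweet spot unless you have headroom for 8-bit.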