TinyBERT: A distilled, encoder-only BERT variant optimized for tasks like classification. As an encoder, it is not a generative model, so it won't handle free-form text generation, though it can back extractive summarization setups. At roughly 66M parameters (6-layer variant), it should run smoothly on your APU.
Alpaca and Vicuna models (7B, quantized): Alpaca and Vicuna are instruction-tuned variants of LLaMA 7B rather than smaller architectures, but their quantized (e.g. 4-bit) versions are far lighter. Try 4-bit or 8-bit quantization, which retains most of the accuracy while drastically reducing memory use and compute load; see the 4-bit loading sketch after this list.
GPT-Neo 125M: This model is small and relatively fast, especially if you quantize it, and it performs decently on shorter prompts; a minimal generation example also follows the list.
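As a minimal sketch of 4-bit loading with transformers and bitsandbytes: the Vicuna checkpoint name below is just an example (substitute whichever Alpaca/Vicuna variant you actually use), and note that bitsandbytes primarily targets CUDA GPUs, so on an AMD APU the GGML route mentioned further down may be more practical.

```python
# Minimal sketch: load a 7B model in 4-bit with transformers + bitsandbytes.
# The checkpoint name is an example, not a requirement.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "lmsys/vicuna-7b-v1.5"  # example checkpoint; swap in your own

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # let accelerate place the layers
)

inputs = tokenizer("Explain quantization in one sentence:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```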
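And a quick GPT-Neo 125M example running on CPU via the standard transformers pipeline (assuming the public EleutherAI/gpt-neo-125m checkpoint on Hugging Face):

```python
from transformers import pipeline

# Text generation with GPT-Neo 125M on CPU; small enough for modest hardware.
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125m", device=-1)
print(generator("The easiest way to speed up inference is", max_new_tokens=40)[0]["generated_text"])
```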
To run these models faster, you could consider quantization libraries like bitsandbytes, or GGML for LLaMA-based models; these reduce the model size and make the models feasible on lower-end hardware without sacrificing too much accuracy.
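If you go the GGML route, the llama-cpp-python bindings are one common way to run quantized LLaMA-family checkpoints entirely on CPU, which tends to suit integrated-graphics machines. A sketch, where the model path is a placeholder for whatever quantized file you download:

```python
from llama_cpp import Llama

# Load a quantized LLaMA-family checkpoint; the path is a placeholder.
llm = Llama(model_path="./vicuna-7b-q4_0.bin", n_ctx=512)

result = llm(
    "Q: Why does 4-bit quantization reduce memory use? A:",
    max_tokens=64,
    stop=["\n"],  # stop at the end of the answer line
)
print(result["choices"][0]["text"])
```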