Technical Advancements
- Adapter Weights: Meta integrated an image encoder into the language model using adapter weights, which gives the model image reasoning capabilities.
- Cross-Attention Layers: The adapter consists of cross-attention layers that feed image encoder representations into the language model.
- Alignment Training: Meta refined the model with alignment training that combined supervised fine-tuning, rejection sampling, and direct preference optimization (DPO).
- Synthetic Data Generation: Llama 3.1 was used to generate synthetic question-answer pairs on top of in-domain images.
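
To make the cross-attention adapter idea concrete, here is a minimal single-head sketch in plain Python. It is not Meta's implementation: the function name `cross_attention_adapter`, the toy dimensions, the random weights, and the scalar tanh gate are all assumptions chosen for illustration. It only shows the general pattern in which text hidden states act as queries over image encoder outputs and the result is added back to the language model's hidden states.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Toy weight or activation matrix; real weights would be learned."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def cross_attention_adapter(text_states, image_feats, w_q, w_k, w_v, w_o, gate):
    """Single-head cross-attention where text tokens query image features.

    text_states: [n_text][d_model] hidden states from the language model
    image_feats: [n_img][d_img] outputs of the image encoder
    gate: scalar; tanh(gate) scales the adapter update (illustrative assumption)
    """
    q = matmul(text_states, w_q)            # [n_text][d_head]
    k = matmul(image_feats, w_k)            # [n_img][d_head]
    v = matmul(image_feats, w_v)            # [n_img][d_head]
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]    # transpose of k
    scores = matmul(q, k_t)                 # [n_text][n_img]
    weights = [softmax([s * scale for s in row]) for row in scores]
    attended = matmul(weights, v)           # [n_text][d_head]
    update = matmul(attended, w_o)          # [n_text][d_model]
    g = math.tanh(gate)
    # Residual connection: the language model state plus the gated image update.
    out = [
        [h + g * u for h, u in zip(h_row, u_row)]
        for h_row, u_row in zip(text_states, update)
    ]
    return out, weights


# Toy example: 5 text tokens attend over 6 image patches.
rng = random.Random(0)
d_model, d_img, d_head = 4, 3, 2
text_states = random_matrix(5, d_model, rng)
image_feats = random_matrix(6, d_img, rng)
w_q = random_matrix(d_model, d_head, rng)
w_k = random_matrix(d_img, d_head, rng)
w_v = random_matrix(d_img, d_head, rng)
w_o = random_matrix(d_head, d_model, rng)

out_closed, _ = cross_attention_adapter(
    text_states, image_feats, w_q, w_k, w_v, w_o, gate=0.0
)
out_open, weights = cross_attention_adapter(
    text_states, image_feats, w_q, w_k, w_v, w_o, gate=1.0
)
```

With `gate=0.0` the adapter leaves the text hidden states untouched, which illustrates why an adapter of this kind can be added to an existing language model without disturbing its text-only behavior. With a nonzero gate, each text token receives a weighted mix of image features.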