RE: LeoThread 2024-09-25 05:16

Technical Advancements

Adapter Weights: Meta integrated an image encoder into the language model using adapter weights, enabling image reasoning capabilities.
Cross-Attention Layers: The adapter consists of cross-attention layers that feed image encoder representations into the language model.
Alignment Training: Meta refined the model through alignment training that combined supervised fine-tuning, rejection sampling, and direct preference optimization.
Synthetic Data Generation: LLaMA 3.1 was used to generate synthetic question-answer pairs on top of in-domain images.
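
To make the adapter idea above more concrete, here is a rough sketch of a cross-attention layer in plain Python. It is not Meta's implementation. The class name `CrossAttentionAdapter`, the single attention head, the random weights, and the tanh gate are all assumptions made for illustration. The text positions act as queries and the image encoder outputs act as keys and values, so each text position pulls in a weighted mix of image features.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or activations."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class CrossAttentionAdapter:
    """Hypothetical single-head cross-attention adapter (illustrative only)."""

    def __init__(self, dim, rng, gate=0.0):
        self.dim = dim
        self.w_q = random_matrix(dim, dim, rng)  # projects text states to queries
        self.w_k = random_matrix(dim, dim, rng)  # projects image features to keys
        self.w_v = random_matrix(dim, dim, rng)  # projects image features to values
        self.gate = gate  # assumed gate; 0.0 leaves the text states unchanged

    def attention_weights(self, text_states, image_features):
        """One row per text position, one column per image feature."""
        q = matmul(text_states, self.w_q)
        k = matmul(image_features, self.w_k)
        k_t = [list(col) for col in zip(*k)]  # transpose keys
        scores = matmul(q, k_t)
        scale = math.sqrt(self.dim)
        return [softmax([s / scale for s in row]) for row in scores]

    def __call__(self, text_states, image_features):
        weights = self.attention_weights(text_states, image_features)
        v = matmul(image_features, self.w_v)
        attended = matmul(weights, v)
        g = math.tanh(self.gate)
        # Residual connection: text state plus gated image information.
        return [[t + g * a for t, a in zip(t_row, a_row)]
                for t_row, a_row in zip(text_states, attended)]


rng = random.Random(0)
text_states = random_matrix(3, 4, rng)      # 3 text positions, hidden size 4
image_features = random_matrix(5, 4, rng)   # 5 image patches, hidden size 4

adapter = CrossAttentionAdapter(dim=4, rng=rng)          # gate closed
out_closed = adapter(text_states, image_features)

adapter_open = CrossAttentionAdapter(dim=4, rng=rng, gate=1.0)  # gate open
out_open = adapter_open(text_states, image_features)
```

With the gate at zero the layer passes the text states through untouched, which is one plausible way to bolt an image pathway onto an already trained language model without disturbing it at the start of training. Whether Meta's adapter uses this exact gating is not stated in the post.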