Instead of writing captions, the team asked annotators to record 60- to 90-second verbal descriptions answering a list of questions about each image. They then transcribed the descriptions—which often stretched across several pages—and used other large language models to clean up, crunch down, and standardize them. They found that this simple switch, from written to verbal annotation, yielded far more detail with little extra effort.