Cohere for AI, AI startup Cohere's nonprofit research lab, released Aya Vision, a multimodal "open" AI model, this week.
Aya Vision can perform tasks such as writing image captions, answering questions about photos, translating text, and generating summaries in 23 major languages. Cohere is also making Aya Vision available for free through WhatsApp, which it called "an important step to ensuring technical breakthroughs are accessible to researchers around the world."
"Although AI has made great strides, there is still a big gap in how models perform across different languages, one that becomes even more pronounced in multimodal tasks that involve both text and images," Cohere wrote in a blog post. "Aya Vision aims to explicitly help close that gap."
Aya Vision comes in two flavors: Aya Vision 32B and Aya Vision 8B. The more sophisticated of the two, Aya Vision 32B, sets a "new frontier," Cohere said, outperforming models twice its size, including Meta's Llama-3.2 90B Vision, on certain visual understanding benchmarks. Meanwhile, Aya Vision 8B scores better on some evaluations than models 10x its size, according to Cohere.
Both models are available from AI development platform Hugging Face under a Creative Commons 4.0 license with Cohere's acceptable use addendum. They can't be used for commercial applications.
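For researchers who want to try the weights, the sketch below shows one plausible way to prompt the 8B model with Hugging Face's transformers library. It is a minimal example under stated assumptions: the model ID CohereForAI/aya-vision-8b and chat-template support for image inputs in a recent transformers release reflect common usage patterns, not details confirmed in this article.

```python
# Minimal sketch: prompting Aya Vision 8B via Hugging Face transformers.
# Assumptions (not confirmed by the article): the model ID below, and
# image-text-to-text chat-template support in a recent transformers release.
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "CohereForAI/aya-vision-8b"  # assumed Hugging Face model ID
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

# One of the 23 supported languages: ask for a caption in Spanish.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder
        {"type": "text", "text": "Describe esta imagen en una frase."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```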
Cohere said that Aya Vision was trained using a "diverse pool" of English datasets, which the lab translated and used to create synthetic annotations. Annotations, also known as tags or labels, help models understand and interpret data during the training process. For example, an annotation for training an image recognition model might take the form of a marking drawn around an object, or a caption referring to each person, place, or object depicted in an image.
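To make the idea concrete, here is what a single annotated training example might look like. This is an illustrative, COCO-style sketch; it is not Cohere's actual data format, and every field name is hypothetical.

```python
# Illustrative only: one annotated image in a COCO-style layout.
# NOT Cohere's actual training format; all field names are hypothetical.
annotation = {
    "image": "street_scene.jpg",
    "objects": [
        # Each bounding box marks where an object appears: [x, y, width, height].
        {"label": "person", "bbox": [34, 120, 58, 170]},
        {"label": "bicycle", "bbox": [102, 160, 90, 65]},
    ],
    # A caption describing the people, places, and objects in the image.
    "caption": "A person walking a bicycle down a city street.",
}
```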

Synthetic annotations, that is, annotations generated by AI, have their drawbacks. Despite those potential downsides, rivals including OpenAI are increasingly utilizing synthetic data to train models as the well of real-world data dries up. Research firm Gartner estimates that 60% of the data used in AI and analytics projects last year was synthetically generated.
According to Cohere, training Aya Vision on synthetic annotations allowed the lab to use fewer resources while achieving competitive performance.
"This showcases our critical focus on efficiency and doing more with less compute," Cohere wrote in its blog post. "This also enables greater support for the research community, which often has more limited access to compute resources."
Along with Aya Vision, Cohere released AyaVisionBench, a new benchmark suite designed to probe a model's skills on "vision-language" tasks, including identifying differences between two images and converting screenshots into code.
The AI industry is in the midst of what some call an "evaluation crisis," a consequence of the widespread adoption of benchmarks that give aggregate scores correlating poorly with proficiency on the tasks most AI users care about. Cohere asserts that AyaVisionBench is a step toward fixing this, providing a "broad and challenging" framework for assessing a model's cross-lingual and multimodal understanding.
With any luck, that will indeed be the case.