Ollama leaps from text-only to vision

Ollama, one of the most popular open-source tools for running artificial intelligence (AI) models on local machines, has updated its free platform to support multimodal large language models (LLMs). Now it is capable of analyzing images alongside text-based LLMs, and content generation is next.

The tool enables users to run LLMs directly on their computer and for free, given they have enough computational power.

Ollama announced support for multimodal models such as Meta Llama 4, Google Gemma 3, Qwen 2.5 VL, Mistral Small 3.1, and others.

Users will also be able to input multiple images for analysis, or ask questions about the location, contents, character recognition, etc. In a blog post, developers also provided examples that LLMs can be used for translating vertical Chinese spring couplets to English from images.

Models like Llama 4 Scout can interpret images, identify landmarks, and provide contextual information. Another example was a photo analysis of the Ferry Building in San Francisco – the model determined its distance from Stanford University.

Until now, Ollama has relied on the ggml-org/llama.cpp project for model support and has focused on ease of use and model portability. This project offers “first-class” support for text-only models. The new engine was required to make use of multimodal models.

“As more multimodal models are released by major research labs, the task of supporting these models the way Ollama intends became more and more challenging,” the team behind the project said.

The blog post explains that the text decoder and vision encoder are split into separate models and executed independently.

Don't miss our latest stories on Google News

Add us as your Preferred Source on Google.

Ollama said that its platform has now set the foundations for supporting even more capabilities in the future, such as speech, image generation, video generation, longer context sizes, and improved tool support.

While many users will enjoy the improvements, cybercriminals will also find this tool useful to analyze images, and later, potentially, forge documents, manipulate digital images, create fraudulent records, etc.

Security experts are warning that hackers will employ AI agents to launch sophisticated and difficult-to-detect cyberattacks at scale. Financial institutions have seen an increase in the use of fraudulent, AI-generated identity documents.

Ollama engine now supports multimodal models capable of image recognition

More from Cybernews