Alibaba’s Qwen team has released the Qwen2.5-VL family of AI models, a notable step forward in AI’s ability to interact with software.
The models can handle a range of text and image analysis tasks, including video understanding, document analysis, and object counting.
The models are also designed to control PCs and mobile devices, similar to OpenAI’s Operator.
According to the Qwen team’s benchmarks, the Qwen2.5-VL models, particularly the flagship Qwen2.5-VL-72B, outperform OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 2.0 Flash in areas such as math, document analysis, and question answering. They can also parse charts, extract data from invoices and forms, and comprehend lengthy videos.
However, the models come with restrictions stemming from China’s internet regulations. For instance, when asked about politically sensitive topics, such as Xi Jinping’s mistakes, the AI returned an error message instead of answering. This is consistent with Chinese regulatory requirements that AI responses adhere to core socialist values.
One of the most striking features of Qwen2.5-VL is its ability to control software on both PCs and mobile devices. In a demonstration, the AI successfully launched the Booking.com app on an Android phone and booked a flight. However, its performance on a Linux desktop was less impressive, as it struggled to do more than switch tabs.
While the Qwen2.5-VL-3B and Qwen2.5-VL-7B models are available under a permissive license, the flagship Qwen2.5-VL-72B is released under a custom license.
Under that license, companies with more than 100 million monthly active users must request permission from Alibaba before deploying the model commercially.