Apple has announced updates to the AI models that power its suite of Apple Intelligence features across iOS, macOS, and more. But according to the company’s own benchmarks, the models underperform older models from rival tech firms, including OpenAI.
Apple said in a blog post Monday that human testers rated the quality of text generated by its newest “Apple On-Device” model — which runs offline on products including the iPhone — “comparably” to, but not better than, text from similarly sized Google and Alibaba models. Meanwhile, those same testers rated Apple’s more capable new model, called “Apple Server” and designed to run in the company’s data centers, behind OpenAI’s year-old GPT-4o.
In a separate test evaluating the ability of Apple’s models to analyze images, human raters preferred Meta’s Llama 4 Scout model over Apple Server, according to Apple. That’s a bit surprising. On a number of tests, Llama 4 Scout performs worse than leading models from AI labs like Google, Anthropic, and OpenAI.
The benchmark results add credence to reports suggesting Apple’s AI research division has struggled to catch up to competitors in the cutthroat AI race. Apple’s AI capabilities in recent years have underwhelmed, and a promised Siri upgrade has been delayed indefinitely. Some customers have sued Apple, accusing the firm of marketing AI features for its products that it hasn’t yet delivered.
In addition to generating text, Apple On-Device, which is roughly 3 billion parameters in size, drives features like summarization and text analysis. (Parameters roughly correspond to a model’s problem-solving skills, and models with more parameters generally perform better than those with fewer parameters.) As of Monday, third-party developers can tap into it via Apple’s Foundation Models framework.
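For developers curious what that access might look like, here is a minimal Swift sketch of prompting the on-device model through the Foundation Models framework. It assumes a session-style API along the lines Apple has shown; the `LanguageModelSession` type, `respond(to:)` call, and the summarization prompt are illustrative assumptions rather than confirmed details of the final API.

```swift
import FoundationModels

// Sketch: ask the on-device model to summarize a piece of text.
// Assumes a default session bound to the ~3B-parameter Apple On-Device model.
func summarize(_ text: String) async throws -> String {
    // Create a session with the system's default on-device language model.
    let session = LanguageModelSession()

    // Send a prompt and await the model's response (API shape assumed).
    let response = try await session.respond(
        to: "Summarize the following text in two sentences:\n\(text)"
    )
    return response.content
}
```

Because the model runs entirely on the device, a call like this would not require a network round trip, which is the main draw of the framework for features like summarization and text analysis.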
Apple says both Apple On-Device and Apple Server boast improved tool use and efficiency compared to their predecessors, and can understand around 15 languages. That’s thanks in part to an expanded training dataset that includes image data, PDFs, documents, manuscripts, infographics, tables, and charts.