Chinese AI lab DeepSeek recently released an updated version of its reasoning model, called R1-0528. The model performs well on math and coding benchmarks. However, DeepSeek has not revealed the source of the data used to train it.
Some AI researchers believe DeepSeek may have trained on outputs from Google's Gemini family of models. Sam Paech, an AI developer based in Melbourne, shared what he says is evidence that DeepSeek's model favors words and phrases similar to those preferred by Google's Gemini 2.5 Pro.
Another developer, the pseudonymous creator of an AI evaluation tool called SpeechMap, said the "thoughts" DeepSeek's model produces as it works toward an answer read like Gemini traces.
This is not proof, but it raises questions, and DeepSeek has faced similar accusations before. Last year, its V3 model sometimes identified itself as ChatGPT, OpenAI's chatbot, suggesting it may have been trained on ChatGPT output.
Earlier in 2025, OpenAI said it had found signs that DeepSeek used a technique called distillation, in which a smaller "student" model is trained to reproduce the outputs of a larger "teacher" model.
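For readers unfamiliar with the technique, here is a minimal sketch of how distillation works in general. The networks, data, and hyperparameters below are toy placeholders for illustration only, not anything from DeepSeek's or OpenAI's actual systems:

```python
# Minimal sketch of knowledge distillation: a small "student" network
# learns to imitate a larger "teacher" network's output distribution.
# All models and data here are toy placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy teacher: the larger model whose soft outputs carry the "knowledge".
teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
# Toy student: a much smaller model we want to train cheaply.
student = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 10))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0  # softens the teacher's distribution

for step in range(200):
    x = torch.randn(64, 32)  # stand-in for real training inputs
    with torch.no_grad():
        teacher_logits = teacher(x)  # teacher is queried, not trained
    student_logits = student(x)

    # KL divergence between the softened teacher and student distributions;
    # minimizing it pushes the student to mimic the teacher's outputs.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key point is that the teacher only needs to be queried, never retrained, which is why a lab could in principle distill from a rival's model simply by collecting its outputs through an API.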
Microsoft, a close partner of OpenAI, also detected large amounts of data being pulled out through OpenAI developer accounts that OpenAI believes are linked to DeepSeek.
Training on another model's outputs this way is common, but it is against OpenAI's terms of service, which forbid using the company's model outputs to build competing AI.
Many AI models sound similar because they are trained on much of the same web data. And because the internet is increasingly filled with AI-generated content, it is hard to tell where any model's training data really comes from.