Dec. 2024 Chinese OS LLMs

Xianbo QIAN
Dec 4, 2024


Eight months ago, I had the privilege of speaking at Seoul National University about the rise of open-source Chinese-speaking LLMs (link). At that time, these models were largely unnoticed in the West and faced significant skepticism.

Today, they have not only flourished but have become major players in the AI landscape, gaining widespread recognition. Disregarding the contributions of the Chinese community in the open-source model space would mean ignoring a vital and dynamic force within the field: their innovations are actively reshaping the future of AI.

Number of models / datasets released by major Chinese companies / labs per month. Source: https://huggingface.co/spaces/zh-ai-community/zh-model-release-heatmap

The Rise of Qwen

One of the most notable changes in the open-source world this year is the rise of the Qwen family, developed by Alibaba Group, Amazon’s competitor in both online shopping and cloud services.

Qwen made its first release in mid-2023 without attracting much attention. Since then, it has quickly become popular. The Qwen models are updated frequently and come in many sizes to meet different needs. Importantly, most of them are released under the Apache 2.0 license, which is valuable for companies because it lets them use the models without complicated legal issues.

The Qwen family has built one of the largest ecosystems on Hugging Face, with more than 70,000 derived models, surpassing well-known families like LLaMA and Mistral. An ecosystem this large shows how widely the AI community is adopting and adapting Qwen models.

Thom Wolf, co-founder and Chief Science Officer of Hugging Face, has pointed this out as well.

DeepSeek: Leading the Way to Affordable AI

Generative AI applications have long faced a significant hurdle for both model and application developers: the high cost of running LLMs, i.e., inference costs. This expense has limited widespread adoption in both the open-source and closed-source communities.

In May 2024, DeepSeek made a groundbreaking contribution by introducing their state-of-the-art open source model, DeepSeek v2. This model not only delivered top-tier performance but also introduced an innovative technique called Multi-head Latent Attention (MLA) in the attention layer. Instead of storing the full key-value vectors in cache, MLA uses compressed latent KV vectors. This approach dramatically reduces GPU memory requirements, allowing servers to handle more requests simultaneously and significantly lowering operational costs.
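To make the idea concrete, here is a minimal PyTorch sketch of MLA-style KV compression. The layer names and dimensions are illustrative, not DeepSeek v2’s actual configuration, and real MLA also treats rotary position embeddings separately; this only shows the compression step that shrinks the cache.

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Illustrative MLA-style KV compression (dimensions are made up).

    A standard KV cache stores 2 * n_heads * d_head floats per token.
    MLA caches one small latent vector of size d_latent per token and
    reconstructs full keys/values from it at attention time.
    """
    def __init__(self, d_model=4096, d_latent=512, n_heads=32, d_head=128):
        super().__init__()
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to K
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand to V

    def step(self, h, cache):
        c_kv = self.down_kv(h)                   # (batch, 1, d_latent)
        cache = torch.cat([cache, c_kv], dim=1)  # cache grows by d_latent per token
        k = self.up_k(cache)                     # full keys, rebuilt on the fly
        v = self.up_v(cache)                     # full values, rebuilt on the fly
        return k, v, cache

# With these numbers, the cache holds 512 floats per token instead of
# 2 * 32 * 128 = 8192: a 16x reduction in KV memory.
```

Per the DeepSeek-V2 paper, the up-projections can also be absorbed into the query and output projections at inference time, so reconstructing keys and values adds little overhead.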

Thanks to this innovation, DeepSeek reduced its pricing to an unprecedented $0.14 per million tokens, and as low as $0.014 per million tokens on a prefix-cache hit.

To put this into perspective: at the time of writing (Dec. 2024), OpenAI’s latest gpt-4o costs $1.25 per million input tokens on a prefix-cache hit. In 2023, GPT-3.5 was priced at $4 per million tokens, and GPT-4’s 32K-context output tokens cost a steep $120 per million.

Prices of major LLM providers when DeepSeek v2 was released.

DeepSeek’s cache-hit pricing is roughly 100 times cheaper than contemporary alternatives and nearly 10,000 times less than GPT-4’s initial price. Importantly, these cost reductions were achieved without sacrificing model performance.
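To make these ratios tangible, here is a quick back-of-the-envelope script using the input-token prices quoted above (a real bill also depends on output-token rates, which are priced differently):

```python
# Back-of-the-envelope cost of processing 1 billion input tokens,
# using the per-million-token prices quoted in this article.
prices = {
    "DeepSeek v2 (cache miss)": 0.14,
    "DeepSeek v2 (cache hit)": 0.014,
    "gpt-4o (cache hit, Dec 2024)": 1.25,
}
tokens = 1_000_000_000
for name, per_million in prices.items():
    print(f"{name}: ${per_million * tokens / 1_000_000:,.2f}")
# DeepSeek v2 (cache miss): $140.00
# DeepSeek v2 (cache hit): $14.00
# gpt-4o (cache hit, Dec 2024): $1,250.00
```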

This monumental price reduction unintentionally sparked a price war in the industry. Competitors, inspired or pressured by DeepSeek’s move, followed suit, and the Chinese LLM ecosystem changed dramatically. For example, you can now get free API access to decent open-source models like GLM-4-9B and Yi-1.5-6B-Chat from SiliconFlow, one of the leading AI infrastructure providers in China.
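Since SiliconFlow exposes an OpenAI-compatible API, trying one of these models takes only a few lines. The base URL and model ID below reflect my understanding of their service at the time of writing and may change, so check their documentation:

```python
from openai import OpenAI

# Assumes an OpenAI-compatible endpoint; base_url and model ID are my
# best understanding of SiliconFlow's service and may change.
client = OpenAI(
    api_key="YOUR_SILICONFLOW_API_KEY",
    base_url="https://api.siliconflow.cn/v1",
)
resp = client.chat.completions.create(
    model="THUDM/glm-4-9b-chat",
    messages=[{"role": "user", "content": "Introduce open-source LLMs in one sentence."}],
)
print(resp.choices[0].message.content)
```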

With the significant reduction in inference costs and the rise of zero-code environments like Dify, developing AI agents and RAG applications has never been more accessible. These tools have turned AI development into a commodity skill, enabling individuals without a strong technical background to create sophisticated applications.

On-Device Models: Bringing Privacy-Preserving AI to Personal Devices

The reduction of inference costs is not just happening on the cloud; it’s increasingly extending to personal devices. As smaller models become more powerful, we’re witnessing a shift toward running advanced AI directly on laptops and smartphones.

This transformation has users bearing the cost of inference on their own devices in exchange for greater control and privacy. It will fundamentally alter the economics of AI applications, shifting from per-usage billing to upfront device costs, and redefine how we interact with AI technologies.

It’s not surprising that models with fewer than 10 billion parameters can now run efficiently on laptops, thanks to tools like llama.cpp, fastllm, and transformers.js, which serve as a crucial foundation for local LLM inference.
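As a quick illustration, a quantized sub-10B model can be served locally with llama-cpp-python, the Python bindings for llama.cpp. The GGUF file name below is illustrative; any locally downloaded quantized checkpoint works:

```python
from llama_cpp import Llama

# Load a locally downloaded, quantized GGUF checkpoint (file name is
# illustrative). 4-bit quantization keeps a 7B model under ~5 GB of RAM.
llm = Llama(model_path="qwen2-7b-instruct-q4_k_m.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize why on-device LLMs matter."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```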

How to run AI models efficiently on smartphones is becoming a new frontier. Smartphones hold sensitive user data, and running LLMs on-device is the best way to preserve privacy. The Chinese AI community has contributed quite a few strong mobile-friendly models, such as Qwen2-1.5B, the MiniCPM series, and DeepSeek Janus.
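For instance, Qwen2-1.5B-Instruct is small enough to test with plain transformers on a laptop before targeting a phone. This uses the standard chat-template API:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# A mobile-class model that is easy to evaluate on a laptop first.
model_id = "Qwen/Qwen2-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

messages = [{"role": "user", "content": "What is on-device AI?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(inputs, max_new_tokens=64)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```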

Notably, the recently released GLM Edge 1.5B has achieved inference speeds of up to 65 tokens per second on phones equipped with the latest Snapdragon 8 Gen 4 processor, through joint optimization with Qualcomm’s GenAI inference extension (which, unfortunately, is not yet open-sourced). That comfortably exceeds the average rate of human speech, meaning real-time, on-device AI interactions are increasingly viable. And this is just the beginning of what on-device models can achieve.

However, mass adoption faces challenges such as battery drain and memory usage. Advanced AI models require substantial resources, particularly if each application runs its own LLM instead of sharing a system-wide model. Collaborative efforts (example) among industry stakeholders will be essential to overcoming these obstacles and enabling a new era of privacy-preserving, on-device AI.

Inference Scaling Law

Small AI models can be further enhanced through the inference scaling law: performance improves greatly when a model spends time thinking before it responds.

As the QwQ blog post puts it: “when given time to ponder, to question, and to reflect, the model’s understanding of mathematics and programming blossoms like a flower opening to the sun.”

OpenAI’s o1 was one of the first models to use this approach, attracting significant attention from an open-source AI community eager to replicate its success. Many projects have emerged from the Chinese AI community, including Alibaba’s Marco-o1, QwQ from the Qwen team, LLaMA-O1 from Shanghai AI Lab, Llama-3.2V-11B-cot from Tsinghua University, and Skywork’s Open-PRM-Qwen-2.5-7B. Each adopts a unique strategy to explore new possibilities in the field. Notably, all of these projects are publicly shared, providing a foundation anyone can access and build upon, fostering further innovation and collaboration in the AI domain.
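The simplest way to see what “thinking longer” buys is self-consistency sampling: draw several reasoning chains and keep the majority answer. The sketch below is a generic illustration of inference-time scaling, not the method of any project above; generate stands for any chat-completion call that samples with temperature above zero.

```python
from collections import Counter

def answer_by_self_consistency(generate, question, n=8):
    """Spend more inference compute to get a better answer: sample n
    reasoning chains and majority-vote their final answers.

    `generate` is a hypothetical function mapping a prompt string to a
    completion string (any chat API with temperature > 0 works).
    """
    votes = []
    for _ in range(n):
        reply = generate(
            f"{question}\nThink step by step, then finish with a line "
            f"'Answer: <your answer>'."
        )
        if "Answer:" in reply:
            votes.append(reply.rsplit("Answer:", 1)[1].strip())
    # The most frequent final answer wins; None if no sample produced one.
    return Counter(votes).most_common(1)[0][0] if votes else None
```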

Governance

China has moved quickly on AI regulation since 2023. Following the “Interim Measures for the Management of Generative AI Services”, more than 190 AI models have completed the regulatory filing process, reflecting the country’s swift construction of an AI governance framework.

In September 2024, China’s Cyberspace Administration proposed a draft regulation requiring both visible markers and embedded invisible identifiers for AI-generated content. The regulation also encourages providers to implement digital watermarks as an additional invisible marking method. Significantly, content platforms must verify these identifiers and display prominent AI-content notices when such content is published or distributed. This marking system establishes one of the world’s first standardized frameworks for AI content transparency, giving users clear information about AI-generated content.

In November 2024, China launched a three-month regulatory campaign to enhance the social value of algorithmic recommendations. The initiative addresses five key challenges in the digital economy: content echo chambers, trending list manipulation, gig worker protection, price discrimination, and vulnerable user safeguards. By requiring platforms to optimize their algorithms for social benefit rather than pure engagement, this regulation aims to balance technological innovation with public welfare.

On the global stage, China also proposed the “AI Capability Building Inclusive Plan”, aiming to bridge digital and AI divides, particularly benefiting the Global South. The program emphasizes AI infrastructure sharing and education through South-South cooperation. As part of its action plan, China commits to working with developing countries on AI language resource development, with specific measures to eliminate racial, algorithmic, and cultural discrimination while preserving linguistic and cultural diversity.

PS: This article turned out much longer than I expected. I didn’t talk much about the progress of multimodal models, or about the interesting non-profit AI research organizations in China. If you’re interested, please let me know!
