Personal insights on the release of Kimi K2, exploring its evolution beyond a traditional ChatBot, technical choices, open-source philosophy, and thoughts on the future of AI Agents and AGI.
A few days ago, after more than half a year of work, Kimi K2 was finally released. Following an all-nighter before launch and a solid two days of sleep, I finally found the time to share some of my thoughts today.
Disclaimer: All opinions below are my own and do not represent the company's official stance.
Second Disclaimer: Everything that follows is handcrafted by me (I only used GitHub Copilot as an advanced input method).
Since Claude 3.5 Sonnet, AI's frontend coding has reached a genuinely practical level, and nearly every new model since has showcased some frontend coding ability; Kimi K2 is of course no exception. I want to share some personal reflections on this.
For a long time, all text AIs have output Markdown by default, and all the products built on them have been advanced ChatBots. What people expect from a ChatBot is basically to answer questions, write articles, and provide some emotional value; in short, to act like a person. Once, in user feedback, I saw someone ask Kimi to "reformat an article so it fits on one A4 page." Clearly that can't be done in pure text mode, so I treated it as a kind of product manager/programmer joke.
Around March this year, Kimi Researcher started development. At the time, both OpenAI and Gemini's Deep Research ultimately delivered purely textual research reports. We wondered if we could do something different, leveraging the already strong frontend programming abilities of AI to output richer interactive reports for users. The final form of this idea went public with Kimi Researcher, and it was well received.
But when I saw this idea, something completely different popped into my mind: who says text AI has to output markdown? If "frontend programming" became an AI's default way of interacting, what would that product look like?
In other words, shift human-AI interaction from chat-first to artifact-first: the point of interacting with the AI isn't just to have it spit out a passage of content, but to have it understand the user's needs, immediately start a mini project, and deliver a frontend application. Users can keep asking for modifications and iterating, but everything centers on the deliverable.
Sharp-eyed folks may recognize: isn't this just cursor/aider/openhands? Yes, technically, that's what AI programming does. But with some product design finesse, the coding process could be hidden. To users who don't understand programming, it's "I say something, and the AI makes me a PPT/flowchart/little game," etc. This time, the AI can not only "reformat the article onto A4" but also change colors, add animations—it's an experience that totally surpasses traditional ChatBots.
So over the Qingming holiday, I hacked together a demo in a day, borrowing the workflow and prompts from aider. The interaction was still in ChatBot form, but when a user asked, "Tell me about the Xiaomi SU7," where a regular ChatBot would spit out a text summary, my demo directly produced an image-rich, interactive, PPT-like webpage. The user could keep modifying it: "make the background black," "add info about the SU7 Ultra," and so on.
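For the curious, here is a minimal sketch of what such an artifact-first loop can look like, assuming any OpenAI-compatible chat endpoint; the system prompt, model name, and helper below are my illustrative placeholders, not the actual demo code.

# Artifact-first chat loop: the model is asked to return one self-contained HTML page
# instead of Markdown, and the user iterates on that page.
from openai import OpenAI

client = OpenAI()  # point api_key/base_url at whichever OpenAI-compatible provider you use

SYSTEM = (
    "You are a frontend generator. For every user request, reply with a single "
    "self-contained HTML file (inline CSS and JS), never Markdown or plain text."
)
history = [{"role": "system", "content": SYSTEM}]

def ask(user_message: str) -> str:
    # Send one user turn and return the HTML artifact the model produced.
    history.append({"role": "user", "content": user_message})
    resp = client.chat.completions.create(model="kimi-k2", messages=history)  # placeholder model id
    html = resp.choices[0].message.content
    history.append({"role": "assistant", "content": html})
    return html

# The user iterates on the artifact rather than on paragraphs of text.
page = ask("Tell me about the Xiaomi SU7")
page = ask("Make the background black and add a section on the SU7 Ultra")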
I took this demo to pitch the idea to the product team. Everyone thought it was neat, but they were all too busy; maybe next time. Now K2 has been released and Kimi Researcher is live, and I believe Kimi's products will soon see some amazing changes.
I recall that in 2009, my sophomore year, a senior classmate said: "Maybe 20 years from now, a programmer will just tell the compiler 'I want a Firefox,' and after grinding away for two days it will produce a working Firefox." We laughed at it as a fantasy back then, but now it seems it won't even take 20 years.
MCP (Model Context Protocol) became popular early this year, and we wondered whether Kimi could connect to various third-party tools via MCP. When developing K1.5, we had achieved quite good results with RLVR (Reinforcement Learning with Verifiable Rewards), so we wanted to replicate that by hooking a set of real MCP servers directly into the RL environment for joint training.
But we quickly hit walls. First, deployment was a hassle. For example, Blender MCP is easy for Blender users, but running Blender in an RL environment is a burden. Worse, many third-party tools require login—you can't realistically register tons of Notion accounts just for Notion MCP training.
We changed perspective: my hypothesis was that the model already learns how to use tools during pretraining, and we just need to unlock that ability. This is easy to see: pretraining exposes the model to massive amounts of code, full of API calls in many languages and formats; if each API is treated as a tool, the model should already know how to use them. Another point: a pretrained model absorbs vast world knowledge. Ask it to role-play a Linux terminal, for example, and it does so convincingly. Clearly, for terminal-style tool use, only a small amount of data should be needed to trigger the ability.
We designed a clever workflow that let the model synthesize a huge set of tool specs and usage scenarios itself. By using multi-agent simulations, it generated very diverse tool-use data—and it worked well.
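The actual pipeline isn't something I can paste here, but a toy sketch of the idea, with every function name and prompt being my own illustrative assumption, would look roughly like this: one model invents tool specs out of thin air, and a simulated user, assistant, and tool executor then play out a session whose transcript becomes training data.

import json
import random
from typing import Callable

def invent_tool_specs(llm: Callable[[str], str], domain: str, n: int) -> list[dict]:
    # No real service is deployed; the tool specs themselves are synthesized by the model.
    raw = llm(f"Invent {n} realistic tools for the domain '{domain}'. "
              "Return a JSON list of objects with name, description, and parameters.")
    return json.loads(raw)

def simulate_session(llm: Callable[[str], str], tools: list[dict]) -> list[dict]:
    # Multi-agent simulation: a simulated user poses a task, a simulated assistant
    # decides on tool calls, and a simulated executor fabricates plausible tool outputs.
    task = llm(f"Write a realistic user task solvable with these tools: {json.dumps(tools)}")
    trajectory = [{"role": "user", "content": task}]
    for _ in range(random.randint(2, 8)):
        step = json.loads(llm(
            f"Tools: {json.dumps(tools)}\nHistory: {json.dumps(trajectory)}\n"
            "Answer as the assistant: return JSON with either a 'tool_call' or a final 'content'."))
        trajectory.append({"role": "assistant", **step})
        if "tool_call" not in step:
            break  # the assistant gave a final answer
        result = llm(f"Pretend to execute {json.dumps(step['tool_call'])} "
                     "and return a plausible result.")
        trajectory.append({"role": "tool", "content": result})
    return trajectory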
As for "Agent," my understanding is: if a model can do this, it's a pretty capable Agentic Model:
task = get_user_input()
history = [task]
while True:
    # the model sees the conversation so far plus the available tool specs
    resp = model(history, toolset)
    history.append(resp)
    # no tool calls means the model considers the task finished
    if not resp.tool_calls:
        break
    for tool_call in resp.tool_calls:
        # execute each requested tool and feed the result back into the context
        result = call_tool(tool_call)
        history.append(result)
(Of course, this could get fancier—e.g., "toolset" could be dynamically generated by the model itself, see alita.)
From a training point of view, this data isn't hard to synthesize: rewrite a long task into a mix of exploration, reasoning, tool use, environment feedback, error retries, and final output, and the capability is readily evoked.
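To make that concrete, a single synthesized training example might be a message list shaped roughly like the one below; the task, tool name, and error text are all made up for illustration.

# Illustrative shape of one synthesized agentic trajectory: reasoning, a failed tool call,
# an error retry driven by environment feedback, and a final answer.
trajectory = [
    {"role": "user", "content": "Find the cheapest flight from Beijing to Shanghai tomorrow."},
    {"role": "assistant", "content": "I'll search for flights first.",
     "tool_calls": [{"name": "search_flights",
                     "arguments": {"from": "PEK", "to": "SHA", "date": "tomorrow"}}]},
    {"role": "tool", "content": "Error: 'date' must be in YYYY-MM-DD format."},
    {"role": "assistant", "content": "The date format was wrong; retrying with an explicit date.",
     "tool_calls": [{"name": "search_flights",
                     "arguments": {"from": "PEK", "to": "SHA", "date": "2025-07-15"}}]},
    {"role": "tool", "content": '[{"flight": "MU5101", "price": 560}, {"flight": "CA1858", "price": 610}]'},
    {"role": "assistant", "content": "The cheapest option is MU5101 at 560 CNY."},
]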
I think, at this stage, we're still early in developing Agent abilities in models—much pretraining data is missing (e.g., hard-to-verbalize experiences), so the next generation of pretraining models has great potential.
So why open source? First, of course, for recognition. If K2 were closed-source, it wouldn't get nearly as much attention and discussion; it might even catch flak, as Grok 4 did, despite its strengths.
Second, community contributions greatly strengthen the technical ecosystem. Within 24 hours of open-sourcing, community members had already produced an MLX implementation and 4-bit quantizations, things our small team could never have accomplished alone.
But more importantly: open source means higher technical standards, and will force us to build better models, aligning more closely with AGI goals.
Why does releasing model weights "force progress"? Because once it's out there, you can't hide behind hacks and bespoke tricks—your results have to be generalizable, so any third party with the weights can easily reproduce what you claim.
For a closed-source ChatBot, users have no idea what's going on in the backend—they don't know the workflow, or how many models are running. I've heard rumors of "big companies" running dozens of models, hundreds of scene classifications, and uncountable workflows behind a single entry point, calling it an "MoE model." In an "application-first" or "user experience-first" mindset, this is natural and far more cost-effective than a monolithic model. But clearly, that's not the path to AGI. For startups like Kimi, this approach only leads to mediocrity, hinders technical progress, and can never out-compete big companies with armies of PMs for every button.
So, when open source forces you not to take shortcuts, it actually helps you build better models and products. (If someone uses Kimi K2 to make a more interesting app than Kimi, I'll definitely nudge our product team.)
Last year, Kimi's big ad campaigns drew controversy, and there are still detractors today.
Haha, I'm just a coder; I don't know the rationale behind those decisions and won't comment on it.
I'll just state one fact: after we stopped ad spending earlier this year, many domestic app stores no longer even show Kimi on the front page; searching for Kimi in Apple's App Store recommends Doubao, and searching for Kimi on a certain search engine lands you on "Baidu DeepSeek-R1 Full Power Edition."
Even in such an unfavorable internet environment, Kimi has not resumed ad spending.
After DeepSeek-R1's massive surge, many wondered whether Kimi was failing, or whether we resented DeepSeek. On the contrary, many colleagues see DeepSeek-R1's success as great news: it demonstrates that core capability is the best marketing, because if the model is good, the market will recognize it. It proved that the path we chose is not only viable but the right one. The only regret is that we weren't the ones who blazed that trail.
At the beginning-of-the-year review, I proposed some radical ideas, but CEO Zhiling's follow-up actions were even more radical than I expected: no more K1 updates, all resources on core models and K2 (plus more I can't discuss).
Recently, Agent products are all the rage, and some say Kimi shouldn't focus on big models but on Agent products. My take: the vast majority of Agent products would be nothing without Claude, as the Windsurf/Claude supply-cut episode showed. In 2025, model capability still determines the upper bound of intelligence, and for a company aiming at AGI, if we stopped pursuing that upper bound, I wouldn't stay here a single day longer.
AGI is a razor-thin, perilous path; there's no room for distraction or hesitation. You may not succeed by chasing AGI, but hesitation guarantees failure. At the June 2024 BAAI conference, I heard Dr. Kai-Fu Lee blurt out, "as an investor, I care about the ROI of AI applications," and I knew his company wouldn't last long. There are also some self-proclaimed AGI companies now talking about the importance of "closed loops"; all I can do is wish them luck.
I know Kimi K2 still has countless shortcomings; more than ever, I want K3.