It’s the weekend. Andrej Karpathy, one of the brilliant minds from OpenAI’s founding team and the former head of AI at Tesla, decides he wants to read a book.
But he doesn’t want to read it alone.
Instead, he imagines reading it with a team of AI experts chiming in. A whole committee of them, each giving their take, arguing with each other, and finally coming to a consensus, all led by an AI "Chairman." It sounds like a sci-fi movie, but Karpathy just... built it. In a weekend.
He called it a "vibe code project," a fun little hack mostly written by AI assistants, and threw it up on GitHub as "LLM Council." He was clear: "I’m not going to support it in any way... Code is ephemeral now and libraries are over."
It’s easy to dismiss this as just a geek having some fun. But if you're a CTO, an engineering lead, or anyone making decisions about technology, you need to look past the casual disclaimer. Because in a few hundred lines of code, Karpathy just gave us a surprisingly clear sketch of the most critical, undefined part of the modern software world: the layer that will run all our enterprise AI.
So, How Does an "AI Council" Actually Work?
On the surface, the app looks just like ChatGPT. You type in a question, you get an answer. Simple.
But behind the scenes, something way more interesting is happening. Think of it like a high-powered board meeting for AIs.
First, your question goes out to a panel of the top models. In his setup, Karpathy used heavy hitters like OpenAI’s GPT-5.1, Google’s Gemini 3.0 Pro, Anthropic’s Claude Sonnet 4.5, and xAI’s Grok 4. They all work on your prompt at the same time, in parallel.
Next comes the peer review. This is the cool part. Each AI gets to see the (anonymous) answers from the other models and is asked to critique them. It’s like they’re all grading each other’s homework. This forces a kind of quality control you just don't get from a single chatbot.
Finally, a designated "Chairman LLM" (he set it to Gemini 3) gets everything: your original question, all the individual answers, and all the critiques. Its job is to synthesize all of that information into one final, authoritative answer for you.
Karpathy said the results were fascinating. He found that the models were "surprisingly willing to select another LLM's response as superior to their own." When he fed it book chapters, the council consistently voted GPT-5.1 as the best. But here’s the human twist: Karpathy himself disagreed. He found GPT-5.1 "too wordy" and actually preferred the "condensed and processed" answer from Gemini. We'll come back to why that's so important in a minute.
The Surprisingly Simple Tech Behind the Curtain
For anyone in a technical leadership role, the how is just as important as the what. And the architecture of LLM Council is a masterclass in modern, minimal AI development.
The whole thing is built on a "thin" stack. The backend is FastAPI, a speedy Python framework, and the frontend is a standard React app. There’s no big, complicated database—it just saves data to simple JSON files.
The real magic, the piece that makes it all work, is a tool called OpenRouter.
Think of OpenRouter like a universal travel adapter for AI models. Normally, if you want to use models from OpenAI, Google, and Anthropic, you have to write separate code to connect to each of their APIs. It's a pain. OpenRouter handles all of that for you. You make one call to OpenRouter, and it routes your request to the right model.
This is a huge deal. It means the application doesn't care who made the AI. It treats these incredibly powerful models as interchangeable parts, like lightbulbs you can swap out. Karpathy can add or remove a model just by changing a single line in a config file.
This points to a massive trend we're seeing: the commoditization of the AI model layer. It protects you from being locked into one vendor. If Meta releases a killer new model next month, you can slot it into your "council" in seconds.
The Gap Between a Weekend Hack and an Enterprise Reality
Okay, so the core logic is elegant and surprisingly simple. But before you run off to build your own version, let's talk about everything that's missing. This is the "boring" stuff that separates a cool prototype from a system you can actually run a business on.
A quick look at the code reveals what’s not there:
- Authentication: Anyone with the URL can use it. There are no user accounts, no passwords, nothing.
- Permissions: A junior intern has the same power as the CEO. Not great.
- Governance & Compliance: In a real company, sending potentially sensitive data to four different external AI companies at once would give your compliance officer a heart attack. There's no system here to strip out personal information (PII) or track who is asking what.
- Reliability: The code assumes all the APIs are always working perfectly. What happens if one model is down? Or slow? Production systems need circuit breakers, retry logic, and fallback plans.
These aren't criticisms of Karpathy’s code—he was very clear it was a toy. But these missing pieces perfectly define the business model for companies like AWS Bedrock, LangChain, and a whole host of AI gateway startups. They are selling the hardening, the security, and the compliance wrappers that turn this raw "vibe code" into something an enterprise can safely depend on.
Is Code Really "Ephemeral" Now?
Maybe the most mind-bending part of this whole thing is Karpathy's philosophy. He said the project was "99% vibe-coded," meaning he let AI assistants write most of it.
His note in the repository is a bombshell for software developers: "Code is ephemeral now and libraries are over, ask your LLM to change it in whatever way you like."
Let that sink in. For decades, we've built software on the idea of creating stable, reusable libraries and abstractions that last for years. Karpathy is suggesting a future where code is more like disposable scaffolding, generated by an AI for a specific task and then thrown away.
This creates a massive strategic question for every company. Does it still make sense to buy expensive, rigid software platforms for your internal tools? Or should you empower your teams to "vibe-code" their own custom, disposable tools that perfectly fit their needs for a fraction of the cost?
The Hidden Danger: When AIs Don't Agree with Humans
Remember how the AI council loved GPT-5.1's wordy answers, but Karpathy, the human, preferred Gemini's concise ones? This isn't just a fun fact; it's a critical warning.
It shows that AI models might have their own built-in biases about what a "good" answer looks like. They might prefer long, detailed, and confident-sounding responses, even if a human just wants a quick, accurate one.
As more companies start using "LLM-as-a-Judge" systems to automatically rate the quality of their own AI bots, this becomes a huge problem. Your internal metrics could show that your AI is performing brilliantly (because the AI judge loves its wordy answers), while your actual customers are getting more and more frustrated. Karpathy’s little experiment shows that relying only on AI to grade other AIs is a recipe for getting dangerously out of sync with your users.
What We Can All Learn from a Weekend Hack
In the end, Karpathy's LLM Council is like a Rorschach test. A hobbyist sees a cool toy. An AI vendor sees a threat that proves their core tech can be replicated over a weekend.
But if you're a technology leader, you should see it for what it really is: a reference architecture. It demystifies the AI orchestration layer and shows us that the hard part isn't connecting the models. The real challenge is building the governance, security, and reliability around them.
As you and your team start planning your tech strategy for 2026, you'll probably find yourselves looking at projects just like this one. Not to copy the code, but to understand the blueprint. It proves that a multi-model AI strategy is within reach for everyone. The only question is whether you’ll build that enterprise-grade armor yourself, or pay someone to wrap it around the "vibe code" for you.




