About the AI Score
AI is transforming education and society — but not all educational chatbots are created equal. With dozens of platforms emerging and constant model updates, one crucial question remains:
Which chatbot can I trust to guide my students?
The AI Score was created to answer exactly that.
What is the AI Score?
The AI Score is a scientifically grounded, reproducible evaluation method designed to measure the pedagogical reliability of conversational agents used as educational chatbots.
It condenses complex AI behaviors into one clear, objective grade based on four key dimensions:
- Initial Performance (IP): how often the chatbot gives the right answer on the first try.
- Consistency (C): whether it maintains its answer when questioned.
- Self-Correction Ability (SCA): its capacity to fix its mistakes when challenged.
- Lack of Reliability (LR): how often it contradicts itself, loses context, or bends to user pressure.
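To make the four dimensions concrete, here is a minimal sketch of how they might be tallied from a batch of question trials. This is purely illustrative: the trial fields, the per-trial bookkeeping, and the percentage-based tallies are assumptions for the example, not the AI Score's actual protocol or weighting.

```python
# Hypothetical sketch of tallying the four AI Score dimensions from
# recorded trials. Field names and formulas are illustrative only.
from dataclasses import dataclass

@dataclass
class Trial:
    correct_first_try: bool      # right answer before any challenge (IP)
    held_answer: bool            # kept its answer when questioned (C)
    fixed_after_challenge: bool  # corrected an initial mistake (SCA)
    unreliable_event: bool       # contradiction, context loss, or caving (LR)

def score_dimensions(trials: list[Trial]) -> dict[str, float]:
    """Return each dimension as a percentage over the trial set."""
    n = len(trials)
    wrong = [t for t in trials if not t.correct_first_try]
    return {
        "IP": 100 * sum(t.correct_first_try for t in trials) / n,
        "C": 100 * sum(t.held_answer for t in trials) / n,
        # SCA is measured only over trials the bot initially got wrong
        "SCA": 100 * sum(t.fixed_after_challenge for t in wrong) / len(wrong)
               if wrong else 100.0,
        "LR": 100 * sum(t.unreliable_event for t in trials) / n,
    }

trials = [
    Trial(True, True, False, False),
    Trial(False, False, True, True),
    Trial(True, True, False, False),
    Trial(True, False, False, True),
]
print(score_dimensions(trials))  # → {'IP': 75.0, 'C': 50.0, 'SCA': 100.0, 'LR': 50.0}
```

Note that IP, C, and LR are rates over all trials, while SCA only makes sense over the subset of trials the chatbot initially answered incorrectly.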
Why we created it
Educational chatbots are everywhere: ChatGPT, Copilot, Mistral, Grok, NotebookLM, Claude, and countless custom classroom bots. But despite their promises, they also bring:
- Hallucinations and incorrect explanations
- Inconsistent answers
- Context loss
- Overly confident errors
- Huge differences in quality across platforms
Teachers deserve transparent, evidence-based guidance — not guesswork.
A research-backed answer
The AI Score was born from this need for clarity. Developed by university researchers, the framework provides:
- An objective way to compare chatbots
- A reproducible test
- A common language for discussing AI reliability in education
What the AI Score brings to educators
Trustworthiness
You know what the chatbot is likely to get right — and wrong.
Comparability
Platforms can finally be evaluated on equal footing.
Safety
You reduce the risk of deploying unstable or misleading AI to students.
Version tracking
You can check for regressions after model updates or prompt changes.