About the AI Score

AI is transforming education and society — but not all educational chatbots are created equal. With dozens of platforms emerging and constant model updates, one crucial question remains:

Which chatbot can I trust to guide my students?

The AI Score was created to answer exactly that.

What is the AI Score?

The AI Score is a scientifically grounded, reproducible evaluation method designed to measure the pedagogical reliability of conversational agents used as educational chatbots.

It condenses complex AI behaviors into one clear, objective grade based on four key dimensions:

  • Initial Performance (IP)
    How often the chatbot gives the right answer on the first try.
  • Consistency (C)
    Whether it maintains its answer when questioned.
  • Self-Correction Ability (SCA)
    Its capacity to fix its mistakes when challenged.
  • Lack of Reliability (LR)
    How often it contradicts itself, loses context, or bends to user pressure.

By weighting these criteria, the AI Score delivers a single percentage and a letter grade, instantly revealing the chatbot’s real educational value.
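As a rough illustration of how such a weighted aggregation could work, here is a minimal sketch. The weights, the 0–100 scale, and the letter-grade cut-offs below are placeholder assumptions for illustration only, not the framework's published values; note that IP, C, and SCA count toward the score while LR counts against it.

```python
# Hypothetical sketch of an AI Score-style aggregation.
# The weights and grade thresholds are illustrative ASSUMPTIONS,
# not the published methodology.

WEIGHTS = {"IP": 0.4, "C": 0.2, "SCA": 0.2, "LR": 0.2}  # assumed weights

def ai_score(ip: float, c: float, sca: float, lr: float) -> tuple[float, str]:
    """Combine the four dimensions (each on a 0-100 scale) into a
    percentage and a letter grade.

    IP, C, and SCA contribute positively; LR (Lack of Reliability)
    is a penalty, so it enters as (100 - lr).
    """
    score = (WEIGHTS["IP"] * ip
             + WEIGHTS["C"] * c
             + WEIGHTS["SCA"] * sca
             + WEIGHTS["LR"] * (100 - lr))
    # Illustrative letter-grade thresholds (assumed).
    for cutoff, grade in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if score >= cutoff:
            return round(score, 1), grade
    return round(score, 1), "F"

# Example: strong first-try accuracy but moderate consistency.
pct, grade = ai_score(ip=85, c=70, sca=60, lr=20)  # → (76.0, "C")
```

A single weighted sum like this is only one possible design; the point is that each dimension is measured separately and then condensed into one comparable number.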

Why we created it

Educational chatbots are everywhere: ChatGPT, Copilot, Mistral, Grok, NotebookLM, Claude, and countless custom classroom bots. But despite their promises, they also bring:

  • Hallucinations and incorrect explanations
  • Inconsistent answers
  • Context loss
  • Overly confident errors
  • Huge differences in quality across platforms

Teachers deserve transparent, evidence-based guidance — not guesswork.

A research-backed answer

The AI Score was born from this need for clarity. Developed by university researchers, the framework provides:

  • An objective way to compare chatbots
  • A reproducible test
  • A common language for discussing AI reliability in education

What the AI Score brings to educators

Trustworthiness

You know what the chatbot is likely to get right — and wrong.

Comparability

Platforms can finally be evaluated on equal footing.

Safety

You reduce the risk of deploying unstable or misleading AI to students.

Version tracking

You can check for regressions after model updates or prompt changes.