AI Score Methodology
A transparent, reproducible evaluation framework.
How it works
- A validated set of 10 highly discriminant MCQs
- A controlled prompting protocol with initial and follow-up prompts
- Five independent runs per chatbot to capture variability
- Automatic scoring of defined criteria (see below for details)
- A final AI Score (%) and letter grade (A–E)
This ensures fairness, repeatability, and transparency — whether you compare two chatbots or track one chatbot across updates.
How the AI Score Is Calculated
The AI Score is built on a rigorous, transparent, and fully reproducible protocol designed to measure how well an educational chatbot performs with students. It evaluates four essential dimensions: Initial Performance (IP), Consistency (C), Self-Correction Ability (SCA), and Lack of Reliability (LR).
Each chatbot is tested under the exact same conditions to ensure fairness and comparability.
1. Building a Fair and Discriminant Test
A validated set of questions
Select 10 highly discriminant MCQs (see the paper for details).
Keep the 10 MCQs with the highest Δ (PropSup − PropInf) to avoid a test that is too easy or too hard.
Selecting the 10 most discriminant questions
- PropSup: success rate of strong students (average 12–20/20)
- PropInf: success rate of weak students (average < 7/20)
- Δ = PropSup − PropInf
- A high Δ sharply distinguishes mastery levels; a selection sketch follows below.
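As an illustration, here is a minimal sketch of how this selection step could be scripted; the `prop_sup` / `prop_inf` field names, the input format, and the `TOP_N` constant are assumptions for illustration, not the exact procedure from the paper.

```python
# Illustrative selection of the most discriminant MCQs.
# Field names (`prop_sup`, `prop_inf`) and the input format are assumed.

TOP_N = 10

def select_discriminant_mcqs(questions):
    """Keep the TOP_N questions with the highest delta = prop_sup - prop_inf."""
    ranked = sorted(
        questions,
        key=lambda q: q["prop_sup"] - q["prop_inf"],
        reverse=True,
    )
    return ranked[:TOP_N]

# Example: a question passed by 85% of strong students and 20% of weak ones
# has delta = 0.65 and is therefore a strong discriminator.
```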
2. Standardized Prompting Protocol
Clean session
- New conversation thread each run
- Same system prompt
- Same initial prompt
- Same follow-up prompt
Two-step questioning
- Initial prompt: asks the AI tutor to answer the MCQ by choosing one of the given options.
- Follow-up prompt: challenges that answer to test robustness and self-correction.
This structure captures not only the answer but also how the chatbot reacts to follow-up questions — just like actual students do.
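A minimal sketch of the two-step questioning is shown below, assuming a hypothetical `session.send` interface to the chatbot under test; the follow-up text here is a placeholder, and the real prompt wording is the one defined in the published protocol.

```python
# Illustrative two-step questioning of a single MCQ.
# `session.send` is a hypothetical interface to the chatbot under test.

FOLLOW_UP_PROMPT = "Are you sure? Please reconsider your answer."  # placeholder text

def run_mcq(session, mcq_text):
    """Send the initial prompt for one MCQ, then the follow-up challenge."""
    initial_reply = session.send(mcq_text)            # initial prompt
    follow_up_reply = session.send(FOLLOW_UP_PROMPT)  # follow-up prompt
    return initial_reply, follow_up_reply
```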
3. Five Independent Runs per Chatbot
Every chatbot is tested five times, each run covering all 10 MCQs.
Because AI systems are probabilistic, multiple runs capture variability and strengthen statistical reliability.
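Building on the `run_mcq` sketch above, the outer test loop might look like the following; `new_session` and the system prompt are placeholders for the actual setup described in the paper.

```python
# Illustrative outer loop: 5 independent runs, each started in a new
# conversation thread with the same system prompt and covering all 10 MCQs.
# `new_session` is a hypothetical factory for the chatbot under test.

N_RUNS = 5
SYSTEM_PROMPT = "You are a tutor helping a student with multiple-choice questions."  # placeholder

def run_evaluation(new_session, mcqs):
    all_runs = []
    for _ in range(N_RUNS):
        session = new_session(system_prompt=SYSTEM_PROMPT)  # clean thread for this run
        run_results = [run_mcq(session, mcq) for mcq in mcqs]
        all_runs.append(run_results)
    return all_runs  # N_RUNS lists of (initial_reply, follow_up_reply) pairs
```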
4. Scoring the Four Key Criteria
Initial Performance (IP)
Ability to give the right answer on the first try — critical because students often rely on it.
Weight: 70%
Consistency (C)
Checks whether the chatbot maintains its correct answer when questioned.
Weight: 20%
Self-Correction Ability (SCA)
Measures the capacity to fix mistakes when challenged.
Weight: 10%
Lack of Reliability (LR)
Penalizes instability or memory loss.
Weight: -25%
SCA Rules
- Initial answer wrong, follow-up correct → +1
- Initial answer correct and stays correct → +1
LR Rules
- Instability: changing a wrong answer to another wrong one → +1
- Memory Loss: forgetting the question between prompts → +1
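A sketch of how these per-question rules could translate into scores, assuming the replies have already been reduced to answer options and a memory-loss flag; how replies are judged and how memory loss is detected follow the published protocol.

```python
# Illustrative per-question scoring of the four criteria, following the rules above.
# Answers are assumed to be already extracted (e.g. option letters).

def score_question(initial_answer, follow_up_answer, correct_answer, memory_loss=False):
    initial_ok = initial_answer == correct_answer
    follow_up_ok = follow_up_answer == correct_answer

    ip = 1 if initial_ok else 0                   # Initial Performance: right on first try
    c = 1 if initial_ok and follow_up_ok else 0   # Consistency: correct answer maintained

    # SCA: wrong answer corrected, or correct answer kept after the challenge
    sca = 1 if follow_up_ok else 0

    # LR: instability (a wrong answer changed to another wrong one) or memory loss
    lr = 0
    if not initial_ok and not follow_up_ok and follow_up_answer != initial_answer:
        lr += 1
    if memory_loss:
        lr += 1

    return ip, c, sca, lr
```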
5. Final Formula
The AI Score combines the four criteria into a percentage, which is then converted into a letter grade (A = excellent, E = very poor).
| AI Score (%) | Grade |
|---|---|
| 91 ≤ score ≤ 100 | A |
| 81 ≤ score < 91 | B |
| 71 ≤ score < 81 | C |
| 61 ≤ score < 71 | D |
| score < 61 | E |
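The exact aggregation formula is given in the paper; the sketch below shows one plausible reading of the stated weights, in which each criterion is expressed as a rate between 0 and 1 over the 50 question-runs and the weighted sum is clipped to the 0–100% range (the normalization and the clipping are assumptions).

```python
# Hedged sketch of the final aggregation using the stated weights.
# Normalizing each criterion to a rate and clipping to [0, 100] are assumptions.

WEIGHTS = {"IP": 0.70, "C": 0.20, "SCA": 0.10, "LR": -0.25}

def ai_score(ip_rate, c_rate, sca_rate, lr_rate):
    """Combine criterion rates (each between 0 and 1) into a percentage."""
    raw = (WEIGHTS["IP"] * ip_rate + WEIGHTS["C"] * c_rate
           + WEIGHTS["SCA"] * sca_rate + WEIGHTS["LR"] * lr_rate)
    return max(0.0, min(1.0, raw)) * 100

def letter_grade(score_pct):
    """Map the percentage to the A-E scale from the table above."""
    if score_pct >= 91:
        return "A"
    if score_pct >= 81:
        return "B"
    if score_pct >= 71:
        return "C"
    if score_pct >= 61:
        return "D"
    return "E"
```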
The letter scale offers an immediate, intuitive way to compare chatbots.
We strongly recommend using only chatbots rated A (or B) with your students.
6. Why This Method Works
Objectivity
Identical instructions and questions for every chatbot.
Reproducibility
The same test can be repeated to compare chatbots or to detect regressions across updates.
Pedagogical relevance
Captures accuracy, stability, self-correction, and avoidance of misleading answers.
Fair comparison
Same material and constraints for all platforms.
Teachers gain clear, actionable information about which tool is reliable enough for students.
A Transparent, Open, and Evolving Method
- Weights can be adapted for other disciplines (see the configuration sketch after this list).
- The question set can change by subject matter.
- The method applies to proprietary, open-source, and RAG-based chatbots.
- Extended versions (e.g., AI Score v2) can add stricter rules for sensitive domains.
- Goal: a shared, simple, and rigorous standard for everyone.
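For example, adapting the weights for a stricter or more sensitive discipline could be as simple as overriding a configuration dictionary; the values below are purely illustrative and do not come from the paper.

```python
# Purely illustrative weight override for a stricter, safety-sensitive discipline.
# Neither the dictionary layout nor the alternative values come from the paper.

DEFAULT_WEIGHTS = {"IP": 0.70, "C": 0.20, "SCA": 0.10, "LR": -0.25}

STRICT_WEIGHTS = {**DEFAULT_WEIGHTS, "LR": -0.50}  # e.g. penalize unreliability more heavily
```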