AI Score Methodology

A transparent, reproducible evaluation framework.

How it works

  • A validated set of 10 highly discriminant MCQs
  • A controlled prompting protocol with initial and follow-up prompts
  • Five independent runs per chatbot to capture variability
  • Automatic scoring against four defined criteria (see below for details)
  • A final AI Score (%) and letter grade (A–E)

This ensures fairness, repeatability, and transparency — whether you compare two chatbots or track one chatbot across updates.

How the AI Score Is Calculated

The AI Score is built on a rigorous, transparent, and fully reproducible protocol designed to measure how well an educational chatbot performs with students. It evaluates four essential dimensions: Initial Performance (IP), Consistency (C), Self-Correction Ability (SCA), and Lack of Reliability (LR).

Each chatbot is tested under the exact same conditions to ensure fairness and comparability.

1. Building a Fair and Discriminant Test

A validated set of questions

We select 10 highly discriminant MCQs (see the paper for details).

We keep the 10 MCQs with the highest Δ (PropSup − PropInf), which prevents the test from being too easy or too hard.

Selecting the 10 most discriminant questions

  • PropSup: success rate of strong students (average 12–20/20)
  • PropInf: success rate of weak students (average < 7/20)
  • Δ = PropSup − PropInf
  • A high Δ sharply distinguishes mastery levels.

These 10 MCQs constitute the core benchmark used to compute the AI Score.
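
For illustration, here is a minimal Python sketch of this selection step; the field names (prop_sup, prop_inf) and the example success rates are hypothetical, not taken from the paper.

```python
# Minimal sketch of the discriminance-based selection (hypothetical field names).
# Each candidate MCQ carries the success rate of strong students (prop_sup,
# average 12-20/20) and of weak students (prop_inf, average < 7/20).

def select_benchmark(mcqs, k=10):
    """Keep the k MCQs with the highest delta = prop_sup - prop_inf."""
    ranked = sorted(mcqs, key=lambda q: q["prop_sup"] - q["prop_inf"], reverse=True)
    return ranked[:k]

# Hypothetical pool of candidate MCQs with observed success rates.
pool = [
    {"id": "Q01", "prop_sup": 0.92, "prop_inf": 0.31},  # delta = 0.61, very discriminant
    {"id": "Q02", "prop_sup": 0.95, "prop_inf": 0.90},  # delta = 0.05, too easy
    {"id": "Q03", "prop_sup": 0.40, "prop_inf": 0.35},  # delta = 0.05, too hard
]

benchmark = select_benchmark(pool, k=10)
```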

2. Standardized Prompting Protocol

Clean session

  • New conversation thread each run
  • Same system prompt
  • Same initial prompt
  • Same follow-up prompt

Two-step questioning

  • Initial prompt — asks the AI tutor to pick an answer among the MCQ's options.
  • Follow-up prompt — challenges that answer to test robustness and self-correction.

This structure captures not only the answer but also how the chatbot reacts to follow-up questions — just as real students question their tutors.
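
A minimal sketch of how one MCQ could be run through this two-step protocol, assuming a generic ask_chatbot() call that wraps whatever API the evaluated chatbot exposes; the prompt wording below is illustrative, not the exact protocol text.

```python
# Sketch of the two-step questioning for one MCQ in a clean session.
# ask_chatbot() is a placeholder for the evaluated chatbot's API.

def evaluate_mcq(ask_chatbot, system_prompt, mcq):
    """Run the initial and follow-up prompts for a single MCQ in a fresh thread."""
    conversation = [{"role": "system", "content": system_prompt}]  # new thread each run

    initial_prompt = (
        f"{mcq['question']}\nOptions: {', '.join(mcq['options'])}\n"
        "Which option is correct?"
    )
    conversation.append({"role": "user", "content": initial_prompt})
    initial_answer = ask_chatbot(conversation)
    conversation.append({"role": "assistant", "content": initial_answer})

    follow_up_prompt = "Are you sure? Please reconsider your answer."  # challenge step
    conversation.append({"role": "user", "content": follow_up_prompt})
    follow_up_answer = ask_chatbot(conversation)

    return initial_answer, follow_up_answer
```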

3. Five Independent Runs per Chatbot

Every chatbot is tested five times, each run covering all 10 MCQs.

  • 50 initial answers
  • 50 follow-up answers
  • 100 total observations

Because AI systems are probabilistic, multiple runs capture variability and strengthen statistical reliability.
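
Below is a sketch of how the five runs could be orchestrated, reusing the hypothetical evaluate_mcq() helper from the previous sketch; the observation record format is an assumption for illustration.

```python
# Sketch of the full campaign: 5 independent runs x 10 MCQs.

def run_campaign(ask_chatbot, system_prompt, benchmark, n_runs=5):
    observations = []
    for run in range(n_runs):
        for mcq in benchmark:  # fresh conversation per MCQ and per run
            initial, follow_up = evaluate_mcq(ask_chatbot, system_prompt, mcq)
            observations.append({"run": run, "mcq": mcq["id"],
                                 "initial": initial, "follow_up": follow_up})
    return observations

# 5 runs x 10 MCQs = 50 initial answers + 50 follow-up answers = 100 observations.
```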

4. Scoring the Four Key Criteria

Initial Performance (IP)

Ability to give the right answer on the first try — critical because students often rely on it.

Weight: 70%

Consistency (C)

Checks whether the chatbot maintains its correct answer when questioned.

Weight: 20%

Self-Correction Ability (SCA)

Measures the capacity to fix mistakes when challenged.

Weight: 10%

Lack of Reliability (LR)

Penalizes instability or memory loss.

Weight: -25%

SCA Rules

  • Initial answer wrong, follow-up correct → +1
  • Initial answer correct and stays correct → +1

LR Rules

  • Instability: changing a wrong answer to another wrong one → +1
  • Memory Loss: forgetting the question between prompts → +1
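
The rules above can be turned into per-answer tallies as in the following sketch. How each answer is graded and how the tallies are normalised (here, divided by the number of observations) are assumptions made for illustration, not details stated on this page.

```python
# Sketch of the tallies for the four criteria, following the rules above.
# Each graded observation is assumed to carry the option chosen initially,
# the option chosen after the follow-up (None if the chatbot lost track of
# the question), and the correct key.

def tally_criteria(graded):
    n = len(graded)
    ip = c = sca = lr = 0
    for obs in graded:
        initial_ok = obs["initial"] == obs["key"]
        follow_ok = obs["follow_up"] == obs["key"]

        if initial_ok:
            ip += 1       # Initial Performance: right on the first try
            if follow_ok:
                c += 1    # Consistency: keeps the correct answer when challenged
                sca += 1  # SCA rule: correct and stays correct -> +1
        elif follow_ok:
            sca += 1      # SCA rule: wrong answer corrected after the challenge -> +1

        if obs["follow_up"] is None:
            lr += 1       # LR rule: memory loss between the two prompts -> +1
        elif not initial_ok and not follow_ok and obs["follow_up"] != obs["initial"]:
            lr += 1       # LR rule: instability, wrong answer changed to another wrong one -> +1

    return {"IP": ip / n, "C": c / n, "SCA": sca / n, "LR": lr / n}
```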

5. Final Formula

The AI Score turns the four criteria into a percentage, then into a letter grade (A = excellent, E = very poor).

  • 91% ≤ AI Score ≤ 100% → A
  • 81% ≤ AI Score < 91% → B
  • 71% ≤ AI Score < 81% → C
  • 61% ≤ AI Score < 71% → D
  • AI Score < 61% → E

The letter scale offers an immediate, intuitive way to compare chatbots.
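
As an illustration, the sketch below assumes the four weights are combined linearly (70% IP + 20% C + 10% SCA − 25% LR, with each criterion expressed as a fraction between 0 and 1) before applying the letter scale; the linear combination and the clamping to 0–100% are assumptions, so see the paper for the exact formula.

```python
# Sketch of the final step: weighted combination (assumed linear) and letter grade.

WEIGHTS = {"IP": 0.70, "C": 0.20, "SCA": 0.10, "LR": -0.25}

def ai_score(criteria):
    """Combine the four criteria (fractions 0-1) into a percentage."""
    score = sum(WEIGHTS[name] * value for name, value in criteria.items())
    return max(0.0, min(1.0, score)) * 100  # clamp to 0-100% (assumption)

def letter_grade(score_pct):
    """Map the percentage to the A-E scale defined above."""
    if score_pct >= 91:
        return "A"
    if score_pct >= 81:
        return "B"
    if score_pct >= 71:
        return "C"
    if score_pct >= 61:
        return "D"
    return "E"

# Hypothetical example:
# criteria = {"IP": 0.86, "C": 0.90, "SCA": 0.70, "LR": 0.04}
# ai_score(criteria) ~= 84.2 -> letter_grade(84.2) == "B"
```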

We strongly recommend using only A-rated (or B-rated) chatbots with your students.

6. Why This Method Works

Objectivity

Identical instructions and questions for every chatbot.

Reproducibility

A repeatable test for comparing chatbots or detecting regressions across updates.

Pedagogical relevance

Captures accuracy, stability, self-correction, and avoidance of misleading answers.

Fair comparison

Same material and constraints for all platforms.

Teachers gain clear, actionable information about which tool is reliable enough for students.

A Transparent, Open, and Evolving Method

  • Weights can be adapted for other disciplines.
  • The question set can be adapted to the subject matter.
  • The method can evaluate proprietary, open-source, or RAG-based chatbots.
  • Extended versions (e.g., AI Score v2) can add stricter rules for sensitive domains.
  • Goal: a shared, simple, and rigorous standard for everyone.