Note

How Letrate scores a writing paper

Scoring free-form English to the official band descriptors, consistently, at exam pace — the problem, the decisions, and what it does in production.

2026-06-24 · 4 min read

Listening and Reading are the easy half to score. There is an answer key; a response is right or wrong; the raw count maps to a band on a fixed scale. Writing is the hard half. A candidate hands you several hundred words of free-form English, and you have to return a band that an examiner would recognize as fair — and return it the same way every time, at the speed of a test that just closed for a whole cohort. This is the part of Letrate that took the most care to get right: scoring IELTS writing to a standard, automatically. Here is how it works.

The target is a rubric, not a vibe

The mistake that sinks most automated essay scoring is asking the model for a single number. "Rate this essay from 0 to 9" produces something — it just isn't a band. It's an impression, and impressions don't hold still.

IELTS writing isn't graded as one impression. It's graded against four published criteria, each scored on the nine-band scale:

Task Achievement / Response  — did the answer address the task, in full
Coherence & Cohesion         — does it organize and connect ideas so they track
Lexical Resource             — range and precision of vocabulary
Grammatical Range & Accuracy — variety and accuracy of structures

The overall writing band is built from those four. So the system scores those four — separately, each against its own descriptor — and assembles the result the way the rubric does. A score is never a guess at the whole; it is four grounded judgments combined by a rule.
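A minimal sketch of "four grounded judgments combined by a rule" might look like the following. The rule used here, the mean of the four criterion bands rounded down to the nearest half band, is an assumption for illustration; this note does not state the exact rule Letrate applies. The names `CRITERIA` and `combine_bands` are hypothetical.

```python
import math

# Hypothetical keys for the four published criteria.
CRITERIA = (
    "task_achievement",
    "coherence_cohesion",
    "lexical_resource",
    "grammatical_range_accuracy",
)


def combine_bands(scores: dict[str, float]) -> float:
    """Combine four criterion bands into one overall band.

    Assumed rule (not confirmed by the source): take the mean of the
    four criteria and round down to the nearest half band.
    """
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing criterion scores: {missing}")
    for name in CRITERIA:
        if not 0 <= scores[name] <= 9:
            raise ValueError(f"{name} is outside the 0-9 scale: {scores[name]}")
    mean = sum(scores[name] for name in CRITERIA) / len(CRITERIA)
    # Multiply by 2, floor, divide by 2: rounds down to a multiple of 0.5.
    return math.floor(mean * 2) / 2
```

The point of the design is that the overall band is never asked of the model directly: it is computed from the four criterion scores, so the combining rule can be read, tested, and changed by a person.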

Ground the model in the descriptors it must use

A model left to its own standard will invent one, and its standard drifts essay to essay. So it doesn't get to bring its own. Each criterion is scored against the official band descriptor for that criterion — the actual language that distinguishes a 6 from a 7 from an 8 — and the model's job is to place the script on that ladder, with the evidence that puts it there.

That changes the question from "how good is this essay" to "which band's description does this writing match." The first is taste. The second is a comparison against a fixed text, and a comparison is something you can hold a system to.
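One way to picture that shift is in how the request to the model is assembled. The sketch below is an assumption about shape only, not Letrate's actual prompt: `build_criterion_prompt` is a hypothetical helper, and the descriptor strings are placeholders standing in for the official text, which is not reproduced here.

```python
def build_criterion_prompt(criterion: str, descriptors: dict[int, str], script: str) -> str:
    """Build a prompt that asks for a comparison against fixed descriptor text.

    `descriptors` maps a band number to the descriptor text for that band
    of this one criterion. The wording of the prompt is illustrative.
    """
    if not descriptors:
        raise ValueError("at least one band descriptor is required")
    # List the ladder from the highest band down so adjacent bands sit together.
    ladder = "\n".join(
        f"Band {band}: {text}"
        for band, text in sorted(descriptors.items(), reverse=True)
    )
    return (
        f"Criterion: {criterion}\n\n"
        "Band descriptors (fixed text, do not reinterpret):\n"
        f"{ladder}\n\n"
        "Script:\n"
        f"{script}\n\n"
        "Which band's description does this writing match? "
        "Give the band and quote the evidence from the script that places it there."
    )


# Placeholder descriptor text; the real descriptors would be loaded verbatim.
example_descriptors = {
    6: "<official band 6 descriptor text>",
    7: "<official band 7 descriptor text>",
    8: "<official band 8 descriptor text>",
}
prompt = build_criterion_prompt("Lexical Resource", example_descriptors, "Some candidates argue that ...")
```

Note that the prompt never asks how good the essay is. It carries the ladder for one criterion and asks where on that ladder the script sits.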

Consistency is the whole product

A human marker is excellent and inconsistent. The same script can earn slightly different bands from two examiners, or from one examiner marking at nine in the morning versus marking a fortieth script after midnight. That variance is invisible to the student and unfair to them, and it is the thing automated marking is actually positioned to beat.

The bar Letrate is held to is not "smarter than an examiner." It is steadier than one. The same script scores the same band. A script that is genuinely a half-band stronger scores a half-band higher — for a reason tied to a descriptor, not to fatigue. Getting there is less about the model's peak ability and more about removing the room it has to wander: scoring criterion by criterion, against fixed descriptions, with the reasoning made explicit so it can be checked.
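"The same script scores the same band" is a property that can be tested directly. The sketch below is one hypothetical way to check it, assuming a scorer is any callable that takes a script and returns a band. `band_spread`, `is_steady`, and the two stub scorers are invented for illustration and stand in for a real scoring call.

```python
import itertools
from typing import Callable

Scorer = Callable[[str], float]


def band_spread(score: Scorer, script: str, runs: int = 5) -> float:
    """Score the same script `runs` times and return max band minus min band."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    bands = [score(script) for _ in range(runs)]
    return max(bands) - min(bands)


def is_steady(score: Scorer, script: str, runs: int = 5) -> bool:
    """True when every repeated run returned exactly the same band."""
    return band_spread(score, script, runs) == 0


def steady_stub(script: str) -> float:
    # Stand-in for a scorer that always lands on the same band.
    return 6.5


def make_drifting_stub() -> Scorer:
    # Stand-in for a scorer that wanders between two adjacent bands.
    bands = itertools.cycle([6.5, 7.0])
    return lambda script: next(bands)
```

Run over a fixed set of scripts, a check like this turns "without the drift" from a claim into a regression test.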

A band with its reasons attached

A number on its own doesn't teach anyone anything. The most useful thing the system returns isn't the band — it's the why, per criterion, in plain English: where the task response fell short, which cohesion devices were missing, what a stronger lexical range would have looked like here. So "6.5" becomes "here is exactly what to fix to reach 7." That is the difference between a verdict and a lesson, and for a student preparing for a real exam, the lesson is the point.
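A result that carries its reasons could be modeled as a small record per criterion. This is a sketch under assumptions: the field names, the `CriterionFeedback` type, and the sample feedback text are all invented for illustration and do not come from Letrate's output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionFeedback:
    criterion: str
    band: float
    evidence: str   # what in the script placed it at this band
    next_step: str  # what to change to reach the next band up


def render_feedback(items: list[CriterionFeedback]) -> str:
    """Render per-criterion bands with their reasons as plain text."""
    lines = []
    for item in items:
        # ':g' prints 6.0 as "6" and 6.5 as "6.5".
        lines.append(f"{item.criterion}: band {item.band:g}")
        lines.append(f"  why: {item.evidence}")
        lines.append(f"  to improve: {item.next_step}")
    return "\n".join(lines)


# Invented sample data, for illustration only.
sample = [
    CriterionFeedback(
        criterion="Coherence & Cohesion",
        band=6.0,
        evidence="Paragraphs follow a clear order, but few linking devices connect them.",
        next_step="Link each paragraph to the previous one with an explicit reference or connective.",
    ),
]
report = render_feedback(sample)
```

Keeping the evidence and the next step as required fields means a band cannot be returned without its explanation.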

Where a human still signs

The scoring standard is not the model's to set. The rubric mapping, the calibration against official descriptors, the rule that combines four criteria into a band — those are decisions a person owns and answers for, the same way every load-bearing decision at Diversive is owned. The model proposes a score with its evidence; the standard it's measured against is human-defined and human-maintained. When a better model arrives, it slots into the same standard and the marking simply gets sharper. The descriptors don't move. That is what keeps the score trustworthy as the technology underneath it keeps changing.
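The separation between a human-owned standard and a replaceable model can be sketched as an interface. Everything named here (`Standard`, `CriterionModel`, `mark`, `FixedModel`, the version string) is hypothetical, and the descriptor text is a placeholder; the sketch only illustrates the idea that the model is passed in while the standard stays fixed.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Standard:
    """Human-defined, human-maintained: descriptors per criterion, per band."""
    version: str
    descriptors: dict[str, dict[int, str]]


class CriterionModel(Protocol):
    def score(self, criterion: str, descriptors: dict[int, str], script: str) -> float: ...


def mark(model: CriterionModel, standard: Standard, script: str) -> dict[str, float]:
    """Score every criterion in the standard with whichever model is supplied."""
    return {
        criterion: model.score(criterion, ladder, script)
        for criterion, ladder in standard.descriptors.items()
    }


class FixedModel:
    """Stub model that returns one band for everything, for demonstration."""

    def __init__(self, band: float) -> None:
        self.band = band

    def score(self, criterion: str, descriptors: dict[int, str], script: str) -> float:
        return self.band


STANDARD = Standard(
    version="example-1",
    descriptors={
        "Task Achievement / Response": {7: "<official descriptor text>"},
        "Coherence & Cohesion": {7: "<official descriptor text>"},
        "Lexical Resource": {7: "<official descriptor text>"},
        "Grammatical Range & Accuracy": {7: "<official descriptor text>"},
    },
)

old_scores = mark(FixedModel(6.0), STANDARD, "script text")
new_scores = mark(FixedModel(6.5), STANDARD, "script text")  # model swapped, standard untouched
```

Swapping the model changes only the first argument to `mark`; the criteria and descriptors it is measured against are the same object in both calls.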

The result, in production: a student submits, and a second or two later gets back four criterion scores, an overall band, and a specific account of what stood between this attempt and the next band up — held to the same published standard, every script, without the drift.