Typos, noise, and LLM performance
Large language models are trained on clean, carefully tokenised text. Real users are not clean. They autocorrect into the wrong word, skip punctuation, hold down a key too long, and type on glass. This study asks a simple question: how much does it actually matter?
The short answer is: more than most benchmarks suggest, and less uniformly than you might expect. Small amounts of noise are often handled gracefully. At higher noise rates, models diverge sharply in how they fail—and the failure modes are not always the ones you would predict.
Setup
I constructed a noise injection pipeline that applies five perturbation types independently and in combination: character swaps, insertions, deletions, keyboard-adjacency substitutions, and random case flips. Each is parameterised by a noise rate p applied per character.
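The five perturbation types can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the function name, the truncated QWERTY neighbour map, and the control flow are all assumptions made for clarity.

```python
import random
import string

# Illustrative (incomplete) QWERTY adjacency map; the real pipeline
# would cover the full keyboard layout.
QWERTY_NEIGHBOURS = {
    "a": "qwsz", "s": "awedxz", "d": "serfcx", "e": "wsdr",
    "r": "edft", "t": "rfgy", "o": "iklp", "i": "ujko",
    "n": "bhjm", "u": "yhji", "l": "kop", "c": "xdfv",
}

def perturb(text: str, p: float, kind: str, rng: random.Random) -> str:
    """Apply one perturbation type at per-character rate p."""
    out = []
    chars = list(text)
    i = 0
    while i < len(chars):
        c = chars[i]
        if rng.random() < p:
            if kind == "swap" and i + 1 < len(chars):
                # Transpose this character with the next one.
                out.append(chars[i + 1])
                out.append(c)
                i += 2
                continue
            elif kind == "insert":
                # Keep the character and inject a random one after it.
                out.append(c)
                out.append(rng.choice(string.ascii_lowercase))
            elif kind == "delete":
                pass  # drop the character entirely
            elif kind == "adjacent":
                # Substitute a physically neighbouring key.
                out.append(rng.choice(QWERTY_NEIGHBOURS.get(c.lower(), c)))
            elif kind == "case":
                out.append(c.swapcase())
            else:
                out.append(c)
        else:
            out.append(c)
        i += 1
    return "".join(out)

rng = random.Random(0)
print(perturb("the quick brown fox", 0.1, "adjacent", rng))
```

Combining perturbation types is then just function composition: apply each `perturb` call in sequence with its own rate.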
Prompts from three task categories were tested: factual question answering, multi-step reasoning, and instruction following. For each (prompt, noise level, perturbation type) combination, outputs were scored with an LLM judge using a rubric calibrated against human ratings.
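The experiment grid can be sketched as a nested sweep. Everything here is a stand-in: the example prompts, the noise levels, and `judge_score` (which returns a dummy value where the real pipeline perturbs the prompt, queries the model, and scores the output with the rubric-calibrated judge).

```python
from itertools import product
from statistics import mean

# Hypothetical prompt lists, one per task category.
prompts = {
    "factual_qa": ["Who wrote Middlemarch?"],
    "reasoning": ["If x + 2 = 5, what is x?"],
    "instruction": ["Summarise the text below in two sentences."],
}
noise_levels = [0.0, 0.02, 0.04, 0.08, 0.15]
kinds = ["swap", "insert", "delete", "adjacent", "case"]

def judge_score(category, prompt, p, kind):
    # Placeholder: the real pipeline applies the perturbation, sends the
    # noisy prompt to the model, and has an LLM judge rate the output.
    return max(0.0, 1.0 - p)

# Collect one score per (prompt, noise level, perturbation type) cell.
scores = {}
for category, plist in prompts.items():
    for prompt, p, kind in product(plist, noise_levels, kinds):
        scores.setdefault((category, p), []).append(
            judge_score(category, prompt, p, kind))

# Aggregate: mean judge score per (task category, noise level).
summary = {key: mean(vals) for key, vals in scores.items()}
```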
Findings
Performance on factual QA degrades roughly linearly with noise rate up to around p = 0.08, after which it drops sharply. Reasoning tasks show a cliff at p = 0.04—a much lower threshold, and a much steeper drop.
Keyboard-adjacency substitutions are the most damaging perturbation type per unit of noise, likely because they produce plausible-looking alternate words rather than obviously garbled text. Models appear to commit to the misread interpretation rather than flagging uncertainty.
Instruction following is the most robust category. Even at p = 0.15, models generally extract the correct intent from noisy directives. The hypothesis is that instruction prompts have high structural redundancy—the meaning survives because it is over-specified.
Implications
If you are building anything that feeds user-typed input directly to a model, normalise or lightly correct that input first. This is not a controversial claim, but the data makes the magnitude concrete.
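A minimal normalisation pass might look like the sketch below. The repeat-collapse heuristic and the tiny vocabulary are illustrative assumptions; a production system would use a proper spell-correction model and a real lexicon.

```python
import re
import difflib

# Hypothetical vocabulary; in practice this would be a full lexicon or
# a learned correction model.
VOCAB = ["what", "is", "the", "weather", "in", "paris", "tomorrow"]

def normalise(text: str) -> str:
    # Collapse runs of 3+ identical characters (held-down keys) to one.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    words = []
    for w in text.lower().split():
        # Snap near-miss words to the closest vocabulary entry.
        match = difflib.get_close_matches(w, VOCAB, n=1, cutoff=0.8)
        words.append(match[0] if match else w)
    return " ".join(words)

print(normalise("whaat is the weatherrrr in pariss"))
```

Even this crude pass undoes repeated-key noise and snaps keyboard-adjacency substitutions back to real words, which is exactly the perturbation type the results identify as most damaging.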
More interesting is the failure mode for reasoning tasks. When a model misreads a noisy token and confidently propagates that misreading through a chain of thought, the error is structurally invisible. The chain looks coherent. Evaluators—human or model-based—rate it well right up until the final answer is checked. This suggests that noise robustness and hallucination share a deeper failure mechanism worth studying further.
Limitations
The noise model is synthetic. Real user errors have structure—they cluster around specific keys, correlate with typing speed, and vary by language. A more faithful noise model would learn error distributions from real user logs.
Model versions matter. Results are specific to the models and prompt formats tested. A different tokeniser handles character-level noise differently; the cliff points will move.