Typos Barely Hurt LLMs. Clear Prompts Matter More.
I ran a test on noisy prompts in two common workflows: answering questions about a provided document and code editing.
The main result is simple: ordinary typos had little measurable effect, clear prompts improved accuracy, and having a clean reference document (a relevant text such as documentation or a spec supplied directly in the prompt) pushed the failure point from 30% to 50% noise. A more capable model also raised the accuracy floor in most conditions.
Clean: What automatic redelivery behavior should the operator expect from GitHub?
Typo: What automatic redelivrey behavior should the opeartor expect from GitHub?
Noise: Whzt auzomrfic redelifsry brhxvior spuld the onhratxr exphxt from GitXub?
Main result
- Prompt clarity mattered more than spelling mistakes in this test.
- Having a reference document pushed the failure point from 30% to 50% noise.
- Random character corruption caused a sharp drop around 30% noise.
What I tested
I tested 45 question-answering tasks (each with a reference document) and 24 code-editing tasks across five noise levels. In the primary test, the model always received a clean reference document; only the request varied. Full agents, codebase navigation, and open-ended chat were out of scope.
| Version of the request | Typical outcome |
|---|---|
| Clean request + clean context | Correct |
| Human typo version + clean context | Usually still correct |
| 30% random-noise version + no context | Often wrong |
| 30% random-noise version + clean context | Much better than the no-context version |
I used two types of errors: natural typos (swaps, drops, insertions, keyboard-neighbor mistakes) and random scrambling (each character replaced randomly at a fixed rate). Each task was scored automatically. These results apply to this setup only.
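A minimal sketch of the two corruption types, assuming character-level edits on letters only. The function names, the partial keyboard map, and the choice to leave spaces and punctuation untouched are my assumptions for illustration, not the generator used in the test.

```python
import random
import string

# Hypothetical, partial QWERTY neighbor map; not the map used in the test.
NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "r": "edft", "s": "awedxz", "t": "rfgy", "n": "bhjm",
}


def natural_typos(text: str, n_edits: int, rng: random.Random) -> str:
    """Apply n_edits human-style edits: swap, drop, insert, or neighbor key."""
    chars = list(text)
    for _ in range(n_edits):
        # Only edit letters so spaces and punctuation keep word boundaries.
        letter_positions = [i for i, c in enumerate(chars) if c.isalpha()]
        if not letter_positions:
            break
        i = rng.choice(letter_positions)
        kind = rng.choice(["swap", "drop", "insert", "neighbor"])
        if kind == "swap" and i + 1 < len(chars) and chars[i + 1].isalpha():
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif kind == "drop":
            del chars[i]
        elif kind == "insert":
            chars.insert(i, rng.choice(string.ascii_lowercase))
        else:
            # Neighbor-key substitution; also the fallback when a swap is not possible.
            options = NEIGHBORS.get(chars[i].lower(), string.ascii_lowercase)
            chars[i] = rng.choice(options)
    return "".join(chars)


def random_noise(text: str, rate: float, rng: random.Random) -> str:
    """Replace each letter with a random letter with probability `rate`."""
    out = []
    for c in text:
        if c.isalpha() and rng.random() < rate:
            r = rng.choice(string.ascii_lowercase)
            # Keep the original case so capitalized words stay capitalized.
            out.append(r.upper() if c.isupper() else r)
        else:
            out.append(c)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run is repeatable
    clean = "What automatic redelivery behavior should the operator expect from GitHub?"
    print(natural_typos(clean, n_edits=3, rng=rng))
    print(random_noise(clean, rate=0.30, rng=rng))
```

Note that a random replacement can pick the original letter again, so the share of visibly changed characters in this sketch is slightly below `rate`.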
Ordinary typos had little measurable effect. On QA, typo corruption reached 32 edits on a roughly 100-word request. Accuracy fell from 84% to 69% at the highest typo level (p = 0.078, not significant at 0.05). On code, GPT-5.4-mini stayed at 96% across all typo levels; GPT-5.4-nano drifted to 75% at the highest level, also not statistically significant.
Small spelling errors leave enough of each word intact for the model to understand the question (Sennrich, Haddow, and Birch, 2016). Deliberately crafted misspellings cause larger drops; that is a different problem (Pruthi, Dhingra, and Lipton, 2019). These results are expected and consistent with current literature on neural model robustness.
Random noise caused a sharp drop. Without a context document, accuracy declined at each noise level: 64% at 15% noise, 20% at 30%, and just 4% at 50%. Code editing followed the same pattern: GPT-5.4-mini scored 96% at 5% noise, then 92%, 63%, and 25% as noise reached 50%. Stronger models had a higher floor at extreme noise, and the same pattern held across both task types. Sharp degradation under character-level noise has also been reported for neural machine translation (Belinkov and Bisk, 2018).
Having a reference document pushed the failure point from 30% to 50% noise. With a reference document, the model scored 69% at 30% noise. Without a reference document, the same questions scored 20%. A clean spec, docs excerpt, or code block gave the model a reliable reference to work from (Karpukhin et al., 2020).
At 50% noise, supplementary context determined what the model could still get right. In the table below, "structure" means any additional information provided alongside the noisy prompt: answer options, a code block, or a reference text. Even when the prompt itself was largely unreadable, these gave the model reliable anchors to work from. I compared four conditions at this noise level:
| Structure available | GPT-5.4-nano | GPT-5.4-mini | GPT-5.2 |
|---|---|---|---|
| None (no context) | 0% | 4% | 4% |
| Multiple-choice answers | 8% | 8% | 58% |
| Code context | 29% | 25% | 42% |
| Text context (QA) | 44% | 42% | 58% |
Prompt clarity had a larger effect than typos. I wrote verbose and concise versions of the same 45 QA tasks. Verbose prompts named the full scenario and specified exactly what to answer; concise prompts asked the same thing in a shorter form:
Concise: What happens after a failed webhook delivery?
Verbose: If a GitHub webhook delivery fails and nobody clicks Redeliver, what
automatic retry behavior should the operator expect from GitHub?
On clean prompts, verbose scored 91.1% against 75.6% for concise, a gap of 15.5 percentage points (p = 0.039) that narrowed to 11.1 percentage points under light noise.
| Condition | Verbose | Concise | Gap (percentage points) |
|---|---|---|---|
| Clean | 91.1% | 75.6% | 15.5 |
| Light uniform noise | 82.2% | 71.1% | 11.1 |
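The write-up does not say which test produced the p-values. Because both prompt styles were run on the same 45 tasks, one common choice for this kind of paired comparison is an exact McNemar test, which looks only at tasks where the two versions disagree. The sketch below assumes that test; the function name and the example counts are hypothetical and are not data from the experiment.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: tasks answered correctly only under condition A.
    c: tasks answered correctly only under condition B.
    Under the null hypothesis each discordant task is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical split: 8 tasks only the verbose prompt got right,
    # 1 task only the concise prompt got right (a net gap of 7 tasks).
    p = mcnemar_exact(8, 1)
    print(f"p = {p:.3f}")  # prints "p = 0.039"
```

The point of the paired test is that tasks both versions get right or both get wrong carry no information about the difference, so only the disagreements enter the calculation.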
Practical takeaway
If you use an LLM with a document, log, or code snippet as context, prompt clarity deserves more attention than spelling mistakes. This matches the robustness literature cited above.
Focus on three things:
- provide the right context
- state the task clearly
- specify what counts as a correct output
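The three points above can be sketched as a small prompt template. This is an illustration only: the function name, the section labels, and the example strings are hypothetical, and the test in this post did not necessarily use this layout.

```python
def build_prompt(context: str, task: str, success_criterion: str) -> str:
    """Assemble a prompt from context, a clear task, and a success criterion."""
    parts = {
        "Reference": context,                   # provide the right context
        "Task": task,                           # state the task clearly
        "Expected output": success_criterion,   # say what counts as correct
    }
    for label, value in parts.items():
        if not value.strip():
            raise ValueError(f"missing section: {label}")
    # Fixed order: reference material first, then the request, then the criterion.
    return "\n\n".join(f"{label}:\n{value.strip()}" for label, value in parts.items())


if __name__ == "__main__":
    prompt = build_prompt(
        context="(paste the relevant docs excerpt, spec, log, or code block here)",
        task=(
            "If a GitHub webhook delivery fails and nobody clicks Redeliver, "
            "what automatic retry behavior should the operator expect from GitHub?"
        ),
        success_criterion="One sentence, based only on the reference text.",
    )
    print(prompt)
```

Raising an error on an empty section is a design choice for the sketch: it makes a missing reference document or a missing success criterion visible before the prompt is sent.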
References
- Belinkov, Y., & Bisk, Y. (2018). Synthetic and Natural Noise Both Break Neural Machine Translation. ICLR 2018.
- Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP 2020.
- Pruthi, D., Dhingra, B., & Lipton, Z. C. (2019). Combating Adversarial Misspellings with Robust Word Recognition. ACL 2019.
- Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016.