When an x86 Emulator Rewrote 256KB of Code Into a Loop

Binary translation is one of the quiet workhorses of computing: it lets a machine run programs compiled for a completely different processor by converting those instructions into native code. Windows has shipped x86 emulators like this several times, so that x86-32 applications could run on systems whose native processor was something else entirely. Raymond Chen recently shared a story from one of those emulator teams on The Old New Thing, and it is a small classic of pragmatic engineering: the day the translator was taught to rewrite a genuinely absurd piece of compiler output.

Why Windows emulated x86 with binary translation

When Windows ran on a non-x86 processor, it still needed to run the enormous existing library of x86-32 software. The emulator handled that. Rather than interpreting each x86 instruction one at a time, it used binary translation: it generated native code that performed the equivalent operations of the original x86-32 code. As Chen puts it, this “offered a significant performance improvement over emulation via interpreter.” The translator is, in effect, a just-in-time compiler from one instruction set to another.

Chen is careful about the details he does not have. He notes he does not know which native processor this particular story applied to, only that Windows “included a processor emulator for x86-32 on systems that natively ran some other processor.” So this is not an ARM-specific tale or a story about any one modern product. It is about the general craft of binary translation.

A function that used 256KB of code to initialize 64KB of memory

The story turns on a single pathological function. A program needed to allocate roughly 64KB of memory on the stack and initialize it. The normal, sane way to do this is well understood: perform a stack probe, subtract 65,536 from the stack pointer, and then clear the memory with a small, tight loop. A few instructions, executed many times.

The compiler that built this particular program had other ideas. Chasing an aggressive optimization, it unrolled that initialization loop completely. Instead of a loop, it emitted 65,536 individual “write byte to memory” instructions, one per byte, each four bytes long. The arithmetic is brutal:

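; What a tight loop should be (a handful of instructions):
    ; (stack probe for the 64 KB allocation omitted)
    sub  esp, 65536        ; reserve 64 KB on the stack
    mov  edi, esp          ; edi -> start of the new buffer
    xor  eax, eax          ; al = 0, the fill value
    mov  ecx, 65536        ; number of bytes to clear
fill:
    mov  byte [edi], al    ; store one byte
    inc  edi
    dec  ecx
    jnz  fill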

; What the compiler actually emitted: 65,536 unrolled stores
    mov  byte [edi+0],     al
    mov  byte [edi+1],     al
    mov  byte [edi+2],     al
    ; ...65,533 more, each instruction 4 bytes wide...
    mov  byte [edi+65535], al

; 65,536 instructions x 4 bytes = 262,144 bytes
; => 256 KB of code to initialize 64 KB of data

That is four bytes of code for every byte of data. The program was correct and it worked, but it was grotesquely bloated, and every byte of it had to pass through the emulator’s translator.

The translator learned to recognize the monster

This is the part engineers tell as a war story. In Chen’s words, the pattern “offended the team so much that they added special code to the translator to detect this horrible function and replace it with the equivalent tight loop.” The translator would spot the tell-tale run of tens of thousands of identical byte stores, recognize what the function was really trying to do, and emit the compact loop the compiler should have produced in the first place.

It is worth being precise about what this is and is not. The emulator did not “heal” broken code, and there was no runtime monitor watching for anomalies. The code was not broken at all; it was correct but wasteful. What the team added was a deliberate, hand-written special case in the translator: a pattern-matching rule, closer to compiler idiom recognition than to a classic small-window peephole optimization, that recognized one specific code sequence and substituted a better native implementation. That is ordinary, if satisfying, compiler engineering, applied at translation time instead of at compile time.
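Chen does not describe how the detection was implemented, so the following is only a sketch of the general technique under stated assumptions. It assumes the translator has already decoded the guest code into a list of operations. The `ByteStore` and `FillLoop` types, the `collapse_store_runs` function, and the `MIN_RUN` threshold are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class ByteStore:
    """Decoded guest instruction: mov byte [base+disp], src."""
    base: str
    disp: int
    src: str


@dataclass(frozen=True)
class FillLoop:
    """Translator pseudo-op: store src to count bytes starting at base+start."""
    base: str
    start: int
    count: int
    src: str


# A plain string stands for any other instruction the sketch does not model.
Op = Union[ByteStore, FillLoop, str]

MIN_RUN = 16  # hypothetical threshold below which a run is left alone


def collapse_store_runs(ops: List[Op]) -> List[Op]:
    """Replace long runs of consecutive byte stores with one FillLoop."""
    out: List[Op] = []
    i = 0
    while i < len(ops):
        first = ops[i]
        if not isinstance(first, ByteStore):
            out.append(first)
            i += 1
            continue
        # Extend the run while base and source match and the
        # displacement increases by exactly one per instruction.
        j = i + 1
        while (j < len(ops)
               and isinstance(ops[j], ByteStore)
               and ops[j].base == first.base
               and ops[j].src == first.src
               and ops[j].disp == first.disp + (j - i)):
            j += 1
        run = j - i
        if run >= MIN_RUN:
            out.append(FillLoop(first.base, first.disp, run, first.src))
        else:
            out.extend(ops[i:j])
        i = j
    return out
```

Fed the 65,536 unrolled stores from the story, this pass would leave a single `FillLoop` for the code generator to emit as a tight native loop. A real translator would also have to preserve details this sketch ignores, such as the order in which page faults are raised.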

Why the story lasts

This anecdote resonates because it captures how binary translators can earn their performance. A translator does not have to reproduce bad code faithfully. It is free to recognize patterns and emit something better, much as an optimizing compiler does. Often that can mean routine work such as strength reduction and dead-code elimination. Occasionally it means a targeted special case for a piece of output so egregious that a human decided it was worth hard-coding a fix. The 256KB stack initializer was one of those cases, and it remains a good reminder that the layer translating your instructions can be smarter than the code it is translating.

Sources and references

Raymond Chen, “The time the x86 emulator team found code so bad that they fixed it during emulation,” The Old New Thing, June 15, 2026