X86 Emulator Team Fixes Catastrophically Bad Code During Translation

Original: The time the x86 emulator team found code so bad they fixed it during emulation

Why This Matters

The story shows how inefficient real-world compiler output can be, and how an emulation team chose to handle one especially bad piece of legacy code.

Microsoft's x86-32 emulator team discovered a program that used 256KB of code to initialize 64KB of memory by unrolling a loop into 65,536 individual write instructions instead of using a tight loop, prompting them to add special detection and replacement logic.

During a discussion of technical war stories, Raymond Chen recounted an incident from the days when Windows ran an x86-32 processor emulator on non-x86 systems. The emulator used binary translation: it converted x86-32 code into native code to improve performance, much like a JIT compiler.

A particular program needed to allocate and initialize approximately 64KB of stack memory. The standard approach is to perform stack probes, adjust the stack pointer by 65,536 bytes, and then initialize the memory in a loop. The compiler used to build this program chose a different strategy. Instead of generating a loop to initialize each byte, it unrolled the loop completely into 65,536 individual write-byte-to-memory instructions, each 4 bytes long. The result was 256 kilobytes of generated code to handle 64 kilobytes of initialization, a 4:1 code-to-data ratio.

The emulator team considered this code so egregiously bad that they added special detection logic to the binary translator, which identified this specific function and replaced it with a proper tight loop during emulation.
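To make the arithmetic and the idea of the fix concrete, here is a minimal Python sketch. It is not the emulator's actual mechanism: the source says only that the translator detected this specific function and substituted a tight loop, not how. The toy instruction tuples (`store8`, `fill8`), the helper names, and the `min_run` threshold are all hypothetical, chosen purely for illustration.

```python
# Toy model only: these tuples are hypothetical stand-ins, not real x86 encodings.
#   ("store8", offset, value)        write one byte at a stack offset
#   ("fill8", start, count, value)   loop-style fill standing in for the tight loop

STORE_SIZE_BYTES = 4       # per the story: each byte-store instruction is 4 bytes
BUFFER_BYTES = 64 * 1024   # 65,536 bytes of stack memory to initialize


def unrolled_init(n, value=0):
    """Model the compiler's output: one byte-store instruction per byte."""
    return [("store8", offset, value) for offset in range(n)]


def collapse_store_runs(code, min_run=16):
    """Replace long runs of same-value stores to consecutive offsets with one fill.

    Assumption: a generic run detector. The real translator reportedly
    recognized one specific function instead.
    """
    out = []
    i = 0
    while i < len(code):
        ins = code[i]
        if ins[0] == "store8":
            _, start, value = ins
            j = i + 1
            while (j < len(code)
                   and code[j][0] == "store8"
                   and code[j][1] == start + (j - i)
                   and code[j][2] == value):
                j += 1
            if j - i >= min_run:
                out.append(("fill8", start, j - i, value))
                i = j
                continue
        out.append(ins)
        i += 1
    return out


def execute(code, size):
    """Interpret the toy instructions against a scratch buffer."""
    mem = bytearray(b"\xff" * size)
    for ins in code:
        if ins[0] == "store8":
            _, offset, value = ins
            mem[offset] = value
        elif ins[0] == "fill8":
            _, start, count, value = ins
            mem[start:start + count] = bytes([value]) * count
        else:
            raise ValueError(f"unknown instruction: {ins[0]}")
    return mem


original = unrolled_init(BUFFER_BYTES)
translated = collapse_store_runs(original)
code_bytes = len(original) * STORE_SIZE_BYTES

print(code_bytes // 1024, "KB of code for", BUFFER_BYTES // 1024, "KB of data")
# prints: 256 KB of code for 64 KB of data
```

In this model the 65,536 stores collapse to a single `fill8`, and `execute(original, BUFFER_BYTES) == execute(translated, BUFFER_BYTES)` confirms that both versions leave memory in the same state.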

Source

devblogs.microsoft.com