in this thread we're making yaxpeax-x86 faster.
i have a super bad microbenchmark that lives in the repo, it's my spot check for "did i make it obviously worse". it involves disassembling like 500 bytes of instructions or something, so the whole disassembler and dataset pretty quickly end up in cache. it gets really good ipc numbers and makes me feel good about acing synthetic workloads :)
perf looks like this, today:
Samples: 862K of event 'cycles', Event count (approx.): 36077639848
Overhead  Command          Shared Object           Symbol
  98.06%  bench-bda9664bd  bench-bda9664bdc9d6027  [.] bench::do_decode_swathe
   1.78%  bench-bda9664bd  bench-bda9664bdc9d6027  [.] yaxpeax_x86::long_mode::read_E
which is to say, the whole decoder gets inlined into the `do_decode_swathe` function, with `read_E` being left out. it's a small part of the overall time, but let's give it a look - it's much smaller than trying to eyeball the entire disassembler..
first off, the source:

    #[inline]
    fn width_to_gp_reg_bank(width: u8, rex: bool) -> RegisterBank {
        match width {
            1 => return if rex { RegisterBank::rB } else { RegisterBank::B },
            2 => return RegisterBank::W,
            4 => return RegisterBank::D,
            8 => return RegisterBank::Q,
            _ => unsafe { unreachable_unchecked(); }
        }
    }
now from perf this thing is almost 600 bytes of instructions. check out this annotated trace:
Samples: 862K of event 'cycles', 100000 Hz, Event count (approx.): 36077639848
Percent│
       │
       │    Disassembly of section .text:
       │
       │    0000000000020d90 <yaxpeax_x86::long_mode::read_E>:
       │    _ZN11yaxpeax_x869long_mode6read_E17h102120264f0d061fE():
  0.49 │      push   %rbp
  0.55 │      push   %r15
  1.28 │      push   %r14
  5.54 │      push   %rbx
  0.59 │      mov    $0x4,%al
  2.33 │      add    $0xff,%cl
  0.01 │      movzbl %cl,%ecx
  0.56 │      lea    some_misleading_symbol+0x3bf8,%rbp
       │      movslq 0x0(%rbp,%rcx,4),%rcx
 31.39 │      add    %rbp,%rcx     <--- wow!
  0.01 │    → jmpq   *%rcx         <--- wowow!
  8.76 │      mov    0x15(%rsi),%al
  0.01 │      and    $0xc0,%al
       │      cmp    $0x40,%al
  0.01 │      sete   %al
  0.11 │      add    %al,%al
  0.01 │      add    $0x6,%al
  0.57 │      cmp    $0xbf,%dl
       │    ↓ ja     ec
       │36:   mov    0x8(%rdi),%r8
  6.38 │      mov    %edx,%eax
       │      and    $0x7,%al
  0.01 │      cmp    $0x5,%al
       │    ↓ je     11b
       │      cmp    $0x4,%al
  0.01 │    ↓ jne    155
[snip]
so what rust seems to have done is make this little lookup child into an indirect branch!! and this function is used *all over the place* in the decoder to translate widths and prefix combinations into a reported RegisterBank. this is way more complex than i'd thought this function would compile to, and it's woven through the whole dang thing. let's see if we can simplify this or delete the function entirely..
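to make "simplify" a bit more concrete, here's a rough sketch of one direction: swap the `match` for a plain table lookup, so there's no branch for the compiler to turn into a jump table. this is only an illustration, not necessarily what yaxpeax-x86 ends up doing. the `RegisterBank` enum here is a cut-down stand-in with just the variants from the function above, and `GP_BANKS` and `width_to_gp_reg_bank_table` are made-up names.

```rust
// stand-in for yaxpeax-x86's RegisterBank. assumption: only the variants the
// original function mentions are listed here.
#[allow(non_camel_case_types, dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RegisterBank {
    B,
    rB,
    W,
    D,
    Q,
}

// hypothetical table: the row is picked by `rex`, the column by log2(width)
// (1 -> 0, 2 -> 1, 4 -> 2, 8 -> 3). only the byte-width entry differs
// between the two rows.
static GP_BANKS: [[RegisterBank; 4]; 2] = [
    [RegisterBank::B, RegisterBank::W, RegisterBank::D, RegisterBank::Q],
    [RegisterBank::rB, RegisterBank::W, RegisterBank::D, RegisterBank::Q],
];

// hypothetical replacement for width_to_gp_reg_bank.
#[allow(dead_code)]
#[inline(always)]
fn width_to_gp_reg_bank_table(width: u8, rex: bool) -> RegisterBank {
    // trailing_zeros of 1, 2, 4, 8 is 0, 1, 2, 3. masking with 3 keeps the
    // index in range for any other input, so no unsafe is needed.
    let idx = (width.trailing_zeros() & 3) as usize;
    GP_BANKS[rex as usize][idx]
}
```

one behavior difference to keep in mind: the original treats any width other than 1, 2, 4, or 8 as unreachable, while this sketch quietly returns some bank for those inputs. whether the lookup actually compiles to something nicer is a thing to check in the disassembly rather than assume.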