
o p t i m i z e

Posted by iximeow


Forum: Da Slop Pit Group

in this thread we're making yaxpeax-x86 faster.


i have a super bad microbenchmark that lives in the repo, it's my spot check for "did i make it obviously worse". it involves disassembling like 500 bytes of instructions or something, so the whole disassembler and dataset pretty quickly end up in cache. it gets really good ipc numbers and makes me feel good about acing synthetic workloads :)

perf looks like this, today:
Samples: 862K of event 'cycles', Event count (approx.): 36077639848
Overhead  Command          Shared Object           Symbol
  98.06%  bench-bda9664bd  bench-bda9664bdc9d6027  [.] bench::do_decode_swathe
   1.78%  bench-bda9664bd  bench-bda9664bdc9d6027  [.] yaxpeax_x86::long_mode::read_E

which is to say, the whole decoder gets inlined into the `do_decode_swathe` function, with `read_E` being left out. it's a small part of the overall time, but let's give it a look - that's much easier than trying to eyeball the entire disassembler..

first off, the source:
    #[inline]
    fn width_to_gp_reg_bank(width: u8, rex: bool) -> RegisterBank {
        match width {
            1 => return if rex { RegisterBank::rB } else { RegisterBank::B },
            2 => return RegisterBank::W,
            4 => return RegisterBank::D,
            8 => return RegisterBank::Q,
            _ => unsafe { unreachable_unchecked(); }
        }
    }

now from perf this thing is almost 600 bytes of instructions. check out this annotated trace:

Samples: 862K of event 'cycles', 100000 Hz, Event count (approx.): 36077639848
Percent│
       │     Disassembly of section .text:
       │
       │     0000000000020d90 <yaxpeax_x86::long_mode::read_E>:
       │     _ZN11yaxpeax_x869long_mode6read_E17h102120264f0d061fE():
  0.49 │       push   %rbp
  0.55 │       push   %r15
  1.28 │       push   %r14
  5.54 │       push   %rbx
  0.59 │       mov    $0x4,%al
  2.33 │       add    $0xff,%cl
  0.01 │       movzbl %cl,%ecx
  0.56 │       lea    some_misleading_symbol+0x3bf8,%rbp
       │       movslq 0x0(%rbp,%rcx,4),%rcx
 31.39 │       add    %rbp,%rcx                              <--- wow!
  0.01 │     → jmpq   *%rcx                                  <--- wowow!
  8.76 │       mov    0x15(%rsi),%al
  0.01 │       and    $0xc0,%al
       │       cmp    $0x40,%al
  0.01 │       sete   %al
  0.11 │       add    %al,%al
  0.01 │       add    $0x6,%al
  0.57 │       cmp    $0xbf,%dl
       │     ↓ ja     ec
       │ 36:   mov    0x8(%rdi),%r8
  6.38 │       mov    %edx,%eax
       │       and    $0x7,%al
  0.01 │       cmp    $0x5,%al
       │     ↓ je     11b
       │       cmp    $0x4,%al
  0.01 │     ↓ jne    155
[snip]

so what rust seems to have done is made this little lookup child into an indirect branch!! and this function is used *all over the place* in the decoder to translate widths and prefix combinations into a reported RegisterBank. this is way more complex than i'd thought this function would compile to, and it's woven through the whole dang thing. let's see if we can simplify this or delete the function entirely..
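
one way to dodge the jump table, as a rough sketch and not what yaxpeax-x86 actually does: index a small constant array by log2 of the width, so the lookup becomes a data load instead of an indirect branch. the `RegisterBank` enum here is a stand-in with only the variants from the snippet earlier in the post, and `BANKS` and `width_to_gp_reg_bank_table` are made-up names. unlike the original, an out-of-range width quietly maps to some bank instead of being undefined behavior, and whether any of this is actually faster would need measuring.

```rust
// Stand-in for yaxpeax-x86's RegisterBank (assumption): only the variants
// named in the post, not the crate's real definition.
#[allow(non_camel_case_types, dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RegisterBank { Q, D, W, B, rB }

// Hypothetical table indexed by [log2(width)][rex as usize].
const BANKS: [[RegisterBank; 2]; 4] = [
    [RegisterBank::B, RegisterBank::rB], // width 1: REX selects the rB bank
    [RegisterBank::W, RegisterBank::W],  // width 2
    [RegisterBank::D, RegisterBank::D],  // width 4
    [RegisterBank::Q, RegisterBank::Q],  // width 8
];

#[inline]
pub fn width_to_gp_reg_bank_table(width: u8, rex: bool) -> RegisterBank {
    // 1, 2, 4, 8 have 0, 1, 2, 3 trailing zero bits; masking with 3 keeps
    // the index inside the table for any input value.
    let idx = (width.trailing_zeros() & 3) as usize;
    BANKS[idx][rex as usize]
}
```

the two-column layout folds the `rex` check into the index, so only the width-1 row differs between columns and there's no compare-and-branch on `rex` in the source.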



1 Reply

Reply by yuu

posted

You don't need that toxic energy, good on you for getting rid of that and standing up for yourself and your instruction decoder.

