Reverse Engineering WeeperVM 2 by Ben_Lolo

Challenge Link 1 and Challenge Link 2

Foreword

Thanks Ben_Lolo once again for putting together a fantastic reverse engineering challenge - it was a great way to showcase a whole slew of interesting subjects:

  • Anti-debug evasion on linux
  • pwndbg enhanced GDB scripting
  • Reverse engineering a custom ISA
  • Navigating embedded machine code obfuscation
  • Crypto cracking

Goals

  1. Reverse engineer the VM instruction set and architecture
  2. Recover a flag

Lab Environment

  • Ubuntu 24.04.3 LTS
  • IDA Pro 9.1
  • Pwndbg 2025.05.30
  • Python 3.12.9
  • GDB 16.2

Initial Dynamic Analysis

I first chose to analyze the application using strace -if ./WeeperVM--Level_2 to get a sense for the early behaviors the binary exhibits. Note -f to ensure child processes / threads are followed, and -i to display the program counter at time of syscall.

strace Insight 1: Embedded Bytecode

The application performs a self-read near a high entropy block that was correctly assumed to be its bytecode (0x61e0). It’s likely the application performs its anti-patching via /proc/self/exe as well, but I didn’t ever need to explicitly verify.

[pid 25473] [0000000000011cf8] open("/proc/self/exe", O_RDONLY) = 3
[pid 25473] [0000000000011cf8] read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\2004\1\0\0\0\0\0"..., 64) = 64
[pid 25473] [0000000000011cf8] lseek(3, 24672, SEEK_SET) = 24672

Hex dump showing high entropy section

strace Insight 2: Time-based Thread Stop Detection

The target repeatedly reads the current time, which is often a behavior of applications that rely on time-based anti-tracing.

[pid 25474] [0000000000011cf8] gettimeofday({tv_sec=1757866270, tv_usec=950499}, NULL) = 0

strace Insight 3: TracerPid Detection

The target reads /proc/self/status, which in proximity to receiving a kill signal is a strong anti-debug indicator.

[pid 25474] [0000000000011cf8] open("/proc/self/status", O_RDONLY) = 4

strace Insight 4: Memory Dumping / Manipulation Detection

The target uses the inotify subsystem to monitor /proc/self/mem, which indicates protection from memory based introspection and tampering.

[pid 25475] [0000000000011cf8] inotify_add_watch(5, "/proc/self/mem", IN_ACCESS <unfinished ...>

Evading Anti-debug, Anti-trace, and Anti-patch

Note that most anti-* checks the target performs originate from a single offset 0x11cf8, which indicates the virtualized code executes its operating system supported operations through a single opcode. The virtualized syscall opcode is a powerful single point of failure for anti-analysis features.

Fake procfs and no-op timestamps

Through the single point in which syscalls flow, I scripted a hook that evades the applications anti-* checks. It does so by redirecting syscalls that rely on procfs instead to the local filesystem, as well as no-op’ing gettimeofday invocations.

Relative to the running target the physical filesystem must have the following structure:

./WeeperVM--Level_2
./zproc/self/exe     *1
./zproc/self/status  *2
./zproc/self/mem     *3

🔔 *1) exe is an unmodified copy of the target

🔔 *2) status is a copy of /proc/<Weeper2_PID>/status with no TracerPid

🔔 *3) mem is an empty file

By doing this: exe always contains expected content, status never reveals TracerPid, and mem never flags inotify callbacks. Additionally with gettimeofday being no-op’d, time Δ after thread stops will always be zero.

# Usage (within pwndbg): 
#    * starti
#    * source evade.py

import pwndbg.dbg
from pwnlib.asm import asm
from pwndbg.aglib import memory
from pwndbg.dbg.gdb import GDB
from pwnlib.context import context


pdgb: GDB = pwndbg.dbg
context.arch = 'amd64'


def bypass_antidebug():
    # hook VM syscall opcode -- 6 byte (two opcode) clobber
    hook_b = asm("jmp 0x00000000000134E0; nop", 0x0000000000011CF0)

    # relocate the clobbered opcodes to the hook body prologue
    existing_ops = memory.read(0x0000000000011CF0, len(hook_b))
    memory.write(0x0000000000011CF0, hook_b)

    # write hook body in `.text` slack space
    # the hook redirects procfs to local fs by clobbering leading slashes with 'z'
    # it also no-ops gettimeofday to avoid time tamper checks
    body_b = existing_ops + asm("""
// check syscall number for open, inotify, gettimeofday
cmp rax, 2
je _do_open_hook
cmp rax, 254
je _do_inotify_hook
cmp rax, 96
je _do_noop
jmp _resume
    _do_open_hook:
// redirect procfs to local fs by clobbering leading slash with 'z'
cmp dword ptr [rdi], 0x6f72702f
jne _resume
mov byte ptr [rdi], 0x7a
jmp _resume
                            
    _do_inotify_hook:
// redirect procfs to local fs by clobbering leading slash with 'z'
cmp dword ptr [rsi], 0x6f72702f
jne _resume
mov byte ptr [rsi], 0x7a
jmp _resume
// restore original control flow
    _resume:
jmp 0x0000000000011CF6
    _do_noop:
mov rax, 0
jmp 0x0000000000011CF8
    """, 0x00000000000134E0 + len(existing_ops))
    memory.write(0x00000000000134E0, body_b)


bypass_antidebug()

Virtual Architecture

With both dynamic and static analysis available at this point, I walked the stack frame up from our known “syscall”-esque VM opcode handler to the VM’s main loop. With some renaming and cleanup the main loop (0x102EC) can be understood as follows:

__int64 __fastcall vm_main(unsigned int pc)
{
...

  _BYTE decoded_lval[9]; // [rsp+1Eh] [rbp-49Ah] BYREF
  _BYTE decoded_rval[9]; // [rsp+27h] [rbp-491h] BYREF
  unsigned __int8 *context[3]; // [rsp+30h] [rbp-488h] BYREF
  _BYTE regs[64]; // [rsp+48h] [rbp-470h] BYREF
  _BYTE stack[1072]; // [rsp+88h] [rbp-430h] BYREF

...

  context[0] = regs;
  context[1] = stack;
  LODWORD(context[2]) = 0; // Flags
  
...

  do
  {
    ...

    result = execute_single_op(pc, decoded_opcode, decoded_lval, decoded_rval, context, next_instruction);

    ...
  }
  while ( (_DWORD)result != 0xFFFFFF );

The VM components consist of a single readwrite memory range, 16 32-bit registers, a 32-bit callstack, and a flags register. Instructions are variable width, with the first three bytes containing the operation, lvalue flags, and rvalue flags. Large constants and memory config overflow into additional bytes. Instructions are unordered, being followed by an obfuscated (⊕ 0xDC2606) offset to their next instruction. The VM allows for threaded execution, making tracing more difficult at times. Setting the scheduler locking mode of GDB can be helpful, and thread scoped breakpoints can help as well.

Instruction decoding was mostly a cut-and-paste job from IDA with some minor cleanups. This could be simplified, but it wasn’t necessary.

op_b0, op_b1, op_b2 = ROM[pc:pc+3]

lflag = ((op_b1 >> 5 & 3) + 1) * 0x10 | (op_b0 * 0x02 & 2 | op_b1 >> 7) << 2 | op_b0 >> 1 & 3
lval = ROM[pc+3:pc+8]

rflag = (16 * (((op_b2 >> 5) & 3) + 1)) | (op_b1 >> 1) & 3 | (4 * ((op_b2 >> 7) | (2 * op_b1) & 2))
if rflag & 3:
    offs = 0
    if lflag & 3 == 1:
        offs = 1
    elif lflag & 2:
        offs = lflag >> 4
    rval = ROM[pc+3+offs:pc+3+offs+6]
else:
    rval = bytearray()

op = (((op_b1 & 0x18) << 0x17) >> 0x18)
op_decoded = (-((op == 8) != (op == 4)) ^ (op_b0 >> 3) ^ op_b2) & 0x1f

# Use decoded opcode
__(pc, op_decoded, (lflag, lval), (rflag, rval), ROM)

# Next opcode
ins_size = lflag >> 4 if (lflag & 0x3) else 0
ins_size += rflag >> 4 if (rflag & 0x3) else 0
encr_offs = int.from_bytes(ROM[pc+ins_size+3:pc+ins_size+6], 'little')
pc = encr_offs ^ 0xDC2606

Control flow is achieved using a flags and dedicated jump and comparison opcodes. Notably certain opcodes also contain conditional forms (i.e. conditional call’s and ret’s)

By convention register state is preserved across subroutines using a specialized mov opcode that derives save addresses by hashing the current stack frame return.

Reverse Engineering Opcodes

The IDA decompiler output for execute_single_op did not spark joy when I read it.

Meme

I instead opted to understand opcodes primarily through diffing the VM’s execution state. By capturing the state of the VM before and after unknown opcode(s) execute, their behavior can be understood observationally. Optionally fuzzing the state of the VM beforehand allows for an understanding of how opcodes operate on varied inputs. The following is a script demonstrates slicing state around a single opcode that moves a constant to a register.

Note the addition of a hook to better allow stopping the VM at a specific address, since conditional breakpoints in the host debugger are prohibitively slow (or deadlock-y).

# Usage (within pwndbg): 
#    * starti
#    * source slice_diff.py
# Edit SLICE_START/END and fuzz params to taste

from pathlib import Path
import pwndbg.dbg
from pwnlib.asm import asm
from pwndbg.aglib import memory
from pwndbg.dbg.gdb import GDB
from pwndbg.dbg import BreakpointLocation
import gdb
from pwnlib.context import context


pdgb: GDB = pwndbg.dbg
SLICE_START = 0x00013030
SLICE_END = 0x000027BC
FIRST_BP = 0
SECOND_BP = 0
context.arch = 'amd64'


with (Path(__file__).parent / 'WeeperVM--Level_2').open('rb') as f:
    weeper_vm = f.read()
    ROM = weeper_vm[0x000061E0:]


def vm_regs(ctx_addr: int) -> list[int]:
    reg_addr = int.from_bytes(memory.read(ctx_addr, 8), 'little')
    regs = []
    for i in range(16):
        r = int.from_bytes(memory.read(reg_addr + i * 4, 4), 'little')
        regs.append(r)
    return regs


def vm_stack(ctx_addr: int) -> list[int]:
    stack_addr = int.from_bytes(memory.read(ctx_addr + 8, 8), 'little')
    stack = []
    for i in range(270):
        m = int.from_bytes(memory.read(stack_addr + i * 4, 4), 'little')
        stack.append(m)
    return stack


def vm_mem() -> bytes:
    mem_addr = int.from_bytes(memory.read(0x0000000000015040, 8), 'little')
    return memory.read(mem_addr, 0x100000)


def vm_flags(ctx_addr: int) -> int:
    return int.from_bytes(memory.read(ctx_addr + 16, 8), 'little')


def fuzz_vm_state(ctx_addr: int):
    # reg_addr = int.from_bytes(memory.read(ctx_addr, 8), 'little')
    # for i in range(16):
    #     memory.write(reg_addr + i * 4, (0x12345678 + i).to_bytes(4, 'little'))
    # stack_addr = int.from_bytes(memory.read(ctx_addr + 8, 8), 'little')
    # for i in range(270):
    #     memory.write(stack_addr + i * 4, (0x66666660 + i).to_bytes(4, 'little'))
    # mem_addr = int.from_bytes(memory.read(0x0000000000015040, 8), 'little')
    # buf = bytearray(0x100000)
    # for i in range(0x100000):
    #     buf[i] = i % 256
    # memory.write(mem_addr, buf)
    pass


def print_state_diff(
        before_regs: list[int], before_stack: list[int], before_mem: bytes, before_flags: int,
        after_regs: list[int], after_stack: list[int], after_mem: bytes, after_flags: int):
    for i in range(16):
        if before_regs[i] != after_regs[i]:
            print(f'[+] Modified register $r{i}: {before_regs[i]:08x} -> {after_regs[i]:08x}')
    for i in range(270):
        if before_stack[i] != after_stack[i]:
            print(f'[+] Modified stack [0x{i * 4:08x}]: {before_stack[i]:08x} -> {after_stack[i]:08x}')
    for i in range(len(before_mem)):
        if before_mem[i] != after_mem[i]:
            print(f'[+] Modified memory [0x{i:08x}]: {before_mem[i]:02x} -> {after_mem[i]:02x}')
    if before_flags != after_flags:
        print(f'[+] Modified flags: {before_flags:08x} -> {after_flags:08x}')
    print('Diff complete')


async def slice_diff(ec: pwndbg.dbg_mod.ExecutionController):
    first = True
    with pdgb.selected_inferior().break_at(BreakpointLocation(FIRST_BP)) as first_bp:
        with pdgb.selected_inferior().break_at(BreakpointLocation(SECOND_BP)) as second_bp:
            while True:
                await ec.cont(first_bp)
                regs = pdgb.selected_frame().regs()
                ctx_addr = int(regs.by_name('r8'))
                print(f'[1] Slicing at {int(regs.by_name("rdi")):08x}')

                if first:
                    fuzz_vm_state(ctx_addr)
                    first = False

                before_regs, before_stack, before_mem, before_flags = vm_regs(ctx_addr), vm_stack(ctx_addr), vm_mem(), vm_flags(ctx_addr)
                await ec.cont(second_bp)
                print(f'[2] Slicing at {int(regs.by_name("rdi")):08x}')
                after_regs, after_stack, after_mem, after_flags = vm_regs(ctx_addr), vm_stack(ctx_addr), vm_mem(), vm_flags(ctx_addr)
                print_state_diff(before_regs, before_stack, before_mem, before_flags, after_regs, after_stack, after_mem, after_flags)



def bypass_antidebug():
    # hook VM syscall opcode -- 6 byte (two opcode) clobber
    hook_b = asm("jmp 0x00000000000134E0; nop", 0x0000000000011CF0)

    # relocate the clobbered opcodes to the hook body prologue
    existing_ops = memory.read(0x0000000000011CF0, len(hook_b))
    memory.write(0x0000000000011CF0, hook_b)

    # write hook body in `.text` slack space
    # the hook redirects procfs to local fs by clobbering leading slashes with 'z'
    # it also no-ops gettimeofday to avoid time tamper checks
    body_b = existing_ops + asm("""
// check syscall number for open, inotify, gettimeofday
cmp rax, 2
je _do_open_hook
cmp rax, 254
je _do_inotify_hook
cmp rax, 96
je _do_noop
jmp _resume
    _do_open_hook:
// redirect procfs to local fs by clobbering leading slash with 'z'
cmp dword ptr [rdi], 0x6f72702f
jne _resume
mov byte ptr [rdi], 0x7a
jmp _resume
                            
    _do_inotify_hook:
// redirect procfs to local fs by clobbering leading slash with 'z'
cmp dword ptr [rsi], 0x6f72702f
jne _resume
mov byte ptr [rsi], 0x7a
jmp _resume
// restore original control flow
    _resume:
jmp 0x0000000000011CF6
    _do_noop:
mov rax, 0
jmp 0x0000000000011CF8
    """, 0x00000000000134E0 + len(existing_ops))
    memory.write(0x00000000000134E0, body_b)


def hook_ins():
    global FIRST_BP, SECOND_BP

    # hook VM instructions, call redirection
    hook_b = asm("call 0x00000000000136E0", 0x000000000001046A)
    memory.write(0x000000000001046A, hook_b)

    # write hook body in `.text` slack space
    body_b: bytes = asm(f"""
cmp rdi, 0x{SLICE_START:X}
je _do_slice_start
cmp rdi, 0x{SLICE_END:X}
je _do_slice_end
jmp _resume
    _do_slice_start:
    nop
jmp _resume
    _do_slice_end:
    nop
_resume:
jmp 0x0000000000010491
    """, 0x00000000000136E0)
    memory.write(0x00000000000136E0, body_b)
    first_idx = body_b.find(b'\x90')
    second_idx = body_b.find(b'\x90', first_idx + 1)
    FIRST_BP = 0x00000000000136E0 + first_idx
    SECOND_BP = 0x00000000000136E0 + second_idx


def main():
    gdb.execute("starti")
    gdb.execute("bc")

    bypass_antidebug()
    hook_ins()

    pdgb.selected_inferior().dispatch_execution_controller(slice_diff)

main()

diff example

Disassembling

With opcodes understood and instructions decoded I opted to write a quick-and-dirty x86 inspired disassembler for the VM. It maintains a list of visited addresses to avoid unbounded disassembly looping and makes no attempt at semantically correct control flow graph presentation.

from pathlib import Path
import sys


HIWORD = lambda x: (x >> 16) & 0xFFFF
HIBYTE = lambda x: (x >> 8) & 0xFF
PROC_STACK: list[int] = [0]
FRAME_RETS: dict[int, int] = {0: 0}
FRAME_RET = 0
SEEN_ADDRESSES: set[int] = set()


def decode_val(param: tuple[int, bytes], lparam: tuple[int, bytes] | None = None) -> str:
    nval = param[1]
    rmc = param[0] & 3
    width = (param[0] >> 2) & 3
    csiz = param[0] >> 4
    decoded = ''
    post = {
        1: 'b',
        2: 'w',
        3: 'd',
    }

    if rmc == 0x1:
        # Reg
        r = nval[0]
        if r > 15:
            raise ValueError(f'Invalid register: {r}')
        decoded += f'r{r}.{post[width]}'
    elif rmc == 0x2:
        # Memory
        if nval[0] & 0x40:
            width = (nval[0] >> 2) & 3
            r1 = (nval[1] >> 2) & 0xF
            r2 = (nval[1] >> 6) | ((nval[0] * 4) & 0xC)
            decoded += f'm[r{r1}+r{r2}].{post[width]}'
        else:
            c = ''
            r = -1
            if nval[0] & 0x80:
                r = nval[0] & 0xF

            csiz -= 1
            if csiz > 0:
                const = int.from_bytes(nval[1:1+csiz], 'big')
                if const != 0:
                    if r != -1:
                        c += f'+'
                    c += f'{const:02X}'

            if r != -1:
                r = f'r{r}'
            else:
                r = ''
            decoded += f'm[{r}{c}].{post[width]}'
    elif rmc == 3:
        # Constant
        assert csiz > 0
        const = int.from_bytes(nval[0:csiz], 'big')
        decoded += f'0x{const:X}'
    return decoded


def dasm_op(pc: int, op_decoded: int, lparam: tuple[int, bytes], rparam: tuple[int, bytes], ROM: bytes) -> tuple[bool, int]:
    print(f'{pc:08X}: ', end='')
    if pc == 0x0000FDF1:
        print(f'\t[DEBUG] {ROM[pc:pc+12].hex()}')
        pass

    decoded_lval = decode_val(lparam)
    decoded_rval = decode_val(rparam, lparam)

    if op_decoded == 0x0:
        print(f'mov ', end='')
    elif op_decoded == 0x2:
        v305 = int(decoded_lval, 16)
        decoded_addr = ((FRAME_RET >> 8) ^ (FRAME_RET ^ 0x4571cbe2) * 0x3466875) & 0xFFFFFFFF
        decoded_addr = ((FRAME_RET >> 0x10) ^ (decoded_addr * 0x3466875)) & 0xFFFFFFFF
        decoded_addr = ((FRAME_RET >> 0x18) ^ (decoded_addr * 0x3466875)) & 0xFFFFFFFF
        decoded_addr = ((v305 ^ (decoded_addr * 0x3466875)) * 0x857c4b2) & 0xfffe
        print(f'mov_enc [{decoded_addr:08X}], {decoded_rval}')
        return True
    elif op_decoded == 0x3:
        v305 = int(decoded_rval, 16)
        decoded_addr = ((FRAME_RET >> 8) ^ (FRAME_RET ^ 0x4571cbe2) * 0x3466875) & 0xFFFFFFFF
        decoded_addr = ((FRAME_RET >> 0x10) ^ (decoded_addr * 0x3466875)) & 0xFFFFFFFF
        decoded_addr = ((FRAME_RET >> 0x18) ^ (decoded_addr * 0x3466875)) & 0xFFFFFFFF
        decoded_addr = ((v305 ^ (decoded_addr * 0x3466875)) * 0x857c4b2) & 0xfffe
        print(f'mov_enc {decoded_lval}, [{decoded_addr:08X}]')
        return True
    elif op_decoded == 0x4:
        print(f'add ', end='')
    elif op_decoded == 0x5:
        print(f'sub ', end='')
    elif op_decoded == 0x6:
        print(f'imul ', end='')
    elif op_decoded == 0x7:
        print(f'idiv ', end='')
    elif op_decoded == 0x8:
        print(f'mod ', end='')
    elif op_decoded == 0x9:
        print(f'cmp ', end='')
    elif op_decoded == 0xA:
        suffix = ''
        if decoded_rval == '':
            suffix = 'mp'
        else:
            if int(decoded_rval, 16) == 0x6:
                suffix += 'ne'
            elif int(decoded_rval, 16) & 0x2:
                suffix += 'l'
            elif int(decoded_rval, 16) & 0x4:
                suffix += 'g'
            elif int(decoded_rval, 16) & 0x1:
                suffix += 'e'
        addr = int(decoded_lval, 16)
        PROC_STACK.append(addr)
        FRAME_RETS[addr] = FRAME_RET
        print(f'j{suffix} {decoded_lval}')
        return True
    elif op_decoded == 0xb:
        subroutine = int.from_bytes(ROM[pc+3:pc+6], 'big')
        ret = int.from_bytes(ROM[pc+6:pc+9], 'little') ^ 0xDC2606
        PROC_STACK.append(subroutine)
        FRAME_RETS[subroutine] = ret

        suffix = ''
        if decoded_rval != '':
            if int(decoded_rval, 16) == 0x6:
                suffix += 'ne'
            elif int(decoded_rval, 16) & 0x2:
                suffix += 'l'
            elif int(decoded_rval, 16) & 0x4:
                suffix += 'g'
            elif int(decoded_rval, 16) & 0x1:
                suffix += 'e'

        print(f'call{suffix} {subroutine:08X} -> {ret:08X}\n', end='')
        return True
    elif op_decoded == 0xc:
        suffix = ''
        if decoded_lval != '':
            if int(decoded_lval, 16) == 0x6:
                suffix += 'ne'
            elif int(decoded_lval, 16) & 0x2:
                suffix += 'l'
            elif int(decoded_lval, 16) & 0x4:
                suffix += 'g'
            elif int(decoded_lval, 16) & 0x1:
                suffix += 'e'
            print(f'ret{suffix}')
            return True
        else:
            print(f'ret ', end='')
    elif op_decoded == 0xd:
        print(f'abort ', end='')
    elif op_decoded == 0xe:
        print(f'and ', end='')
    elif op_decoded == 0xf:
        print(f'or ', end='')
    elif op_decoded == 0x10:
        print(f'xor ', end='')
    elif op_decoded == 0x11:
        print(f'shl ', end='')
    elif op_decoded == 0x12:
        print(f'shr ', end='')
    elif op_decoded == 0x14:
        if decoded_lval == '0x20':
            decoded_lval += ' ; sys_open'
        elif decoded_lval == '0x21':
            decoded_lval += ' ; sys_read'
        elif decoded_lval == '0x11':
            decoded_lval += ' ; sys_getpid'
        elif decoded_lval == '0x13':
            decoded_lval += ' ; sys_kill'
        elif decoded_lval == '0x14':
            decoded_lval += ' ; sys_gettimeofday'
        elif decoded_lval == '0x33':
            decoded_lval += ' ; sys_futex'
        elif decoded_lval == '0x24':
            decoded_lval += ' ; sys_seek'
        elif decoded_lval == '0x22':
            decoded_lval += ' ; sys_write'
        elif decoded_lval == '0x30':
            decoded_lval += ' ; sys_inotify_init'
        elif decoded_lval == '0x31':
            decoded_lval += ' ; sys_inotify_add_watch'
        elif decoded_lval == '0x9D':
            decoded_lval += ' ; sys_prctl'
        elif decoded_lval == '0x10':
            decoded_lval += ' ; sys_nanosleep'
        elif decoded_lval == '0x23':
            decoded_lval += ' ; sys_close'
        else:
            raise NotImplementedError()
        print(f'syscall ', end='')
    elif op_decoded == 0x15:
        print(f'bswap ', end='')
    elif op_decoded == 0x16:
        print(f'bit_mirror ', end='')
    elif op_decoded == 0x17:
        print(f'merge_srchi_destlo ', end='')
    elif op_decoded == 0x18:
        print(f'merge_srclo_desthi ', end='')
    elif op_decoded == 0x19:
        print(f'rol ', end='')
    elif op_decoded == 0x1a:
        print(f'ror ', end='')
    elif op_decoded == 0x1b:
        print(f'thread ', end='')
        subroutine = int(decoded_lval, 16)
        ret = int.from_bytes(ROM[pc+3:pc+6], 'little') ^ 0xDC2606
        PROC_STACK.append(subroutine)
        FRAME_RETS[subroutine] = ret
    else:
        print(f'unk_{op_decoded:02x} ', end='')
        decode_val(lparam)
        decode_val(rparam, lparam)
        return False
    if len(decoded_lval) > 0:
        print(f'{decoded_lval}', end='')
    if len(decoded_rval) > 0:
        print(f', {decoded_rval}', end='')
    print('')
    return True


def main():
    global FRAME_RET

    with (Path(__file__).parent / 'WeeperVM--Level_2').open('rb') as f:
        weeper_vm = f.read()

    ROM = weeper_vm[0x000061E0:]
    while len(PROC_STACK) > 0:
        pc = PROC_STACK.pop(0)
        FRAME_RET = FRAME_RETS.get(pc)
        if pc in SEEN_ADDRESSES:
            print(f'[DEBUG] Already seen address: {pc:08X}', file=sys.stderr)
            continue
        if pc != 0:
            print('\n' + ('-' * 40) + '\n')
        while pc != 0xffffff:
            SEEN_ADDRESSES.add(pc)
            op_b0, op_b1, op_b2 = ROM[pc:pc+3]

            lflag = ((op_b1 >> 5 & 3) + 1) * 0x10 | (op_b0 * 0x02 & 2 | op_b1 >> 7) << 2 | op_b0 >> 1 & 3
            lval = ROM[pc+3:pc+8]

            rflag = (16 * (((op_b2 >> 5) & 3) + 1)) | (op_b1 >> 1) & 3 | (4 * ((op_b2 >> 7) | (2 * op_b1) & 2))
            if rflag & 3:
                offs = 0
                if lflag & 3 == 1:
                    offs = 1
                elif lflag & 2:
                    offs = lflag >> 4
                rval = ROM[pc+3+offs:pc+3+offs+6]
            else:
                rval = bytearray()

            op = (((op_b1 & 0x18) << 0x17) >> 0x18)
            op_decoded = (-((op == 8) != (op == 4)) ^ (op_b0 >> 3) ^ op_b2) & 0x1f
            dasm_op(
                pc,
                op_decoded,
                (lflag, lval),
                (rflag, rval),
                ROM
            )
            if op_decoded == 0xc and lflag == 0x10:
                break

            ins_size = lflag >> 4 if (lflag & 0x3) else 0
            ins_size += rflag >> 4 if (rflag & 0x3) else 0
            encr_offs = int.from_bytes(ROM[pc+ins_size+3:pc+ins_size+6], 'little')
            pc = encr_offs ^ 0xDC2606

if __name__ == '__main__':
    main()

Selective “Lifting”

Rather than lifting to IR I opted to use claude to “vibe-lift” certain routines in the disassembled target to python. While not nearly as powerful as translating the VM bytecode to something like LLVM IR or BNIL, this approach was fast and most importantly “good-enough” to solve the challenge.

Example CLAUDE.md

# Instructions

This is a source code recovery project, you help convert x86-esque assembly snippets to python in two steps. 

1. First naively replicate the behavior of the snippet in python.
2. Perform a simplication pass to make the snippet feel less like a code emulator, and more like native python. 

# Notes

Call and ret instructions appended with 'l' 'g', and 'ne' are conditional. 
Look in `examples.py` for how syscall, byte_swap, and bit_mirror opcodes should be implemented. 

DWORD constants are big-endian.

claude paying rent

This provided me with trivial decryption of stack strings, and reasonably readable implementations of algorithmic subroutines (with some minor fixups required). Example output for the subroutine 00000ECE, which prints the target’s encrypted “success” message:

# outputs decrypted stack string: `Very wise :)`

def converted_clean():
    """Clean, readable version showing what the algorithm actually does"""
    # The algorithm processes initial data and generates a transformed sequence
    initial_data = bytes([0x56, 0x33, 0x13, 0xCE, 0x18, 0x27, 0xD4, 0x4B, 
                         0x10, 0xA0, 0x78, 0x65, 0xE9, 0x3B, 0x00, 0x00])
    
    result = [initial_data[0]]  # Keep first byte
    
    # Transform bytes 1-13 using: (index-1)*0x52 XOR current_dword XOR previous_dword
    for i in range(1, 14):
        if i + 3 < len(initial_data):
            curr_dword = struct.unpack('<I', initial_data[i:i+4])[0]
            prev_dword = struct.unpack('<I', initial_data[i-1:i+3])[0]
            transformed = ((i-1) * 0x52 ^ curr_dword ^ prev_dword) & 0xFF
            result.append(transformed)
    
    # Create output with ANSI escape codes for colored text
    result_str = ''.join(chr(b) for b in result)
    print(f"\x1b[1;37m{result_str}\x1b[0m", end='')

Flag Finding

After disassembling and using claude to decrypt stack strings I worked back from the success conditions to understand the intended solution. The flag check is implemented in four steps:

  1. Process input with a custom xxhash-based hasher
  2. Seed a Mersenne Twister with 32-bit hasher output
  3. Twist for each DWORD in its state vector (0x270)
  4. Compare each DWORD to an obfuscated (hardcoded) PRNG state
0000AFEF: call 0000AE50 -> 00010806  ; xxhash(ish) input
00010806: mov r0.d, r4.d
00013804: call 00007632 -> 000061AE
000061AE: mov r10.d, 0xE2000
0000137A: mov r11.d, r0.d
000103EB: call 000029EE -> 0000B8E8 ; seed PRNG
0000B8E8: call 00007632 -> 0000A376
0000A376: mov r14.d, 0x0
00016856: mov r10.d, 0xE2000
00005CC6: call 00019936 -> 000133CD ; generate PRNG long
000133CD: mov r10.d, r4.d
0001644C: mov r11.d, r14.d
000025A0: call 00004059 -> 00015954 ; compare long to hardcoded

Identification of the algorithms was largely done by comparing hardcoded magic values to reference implementations:

  1. Identifying xxhash: https://grep.app/search?q=0x85EBCA77
  2. Identifying Mersenne Twister: https://grep.app/search?q=0x9908B0DF

Recovering PRNG State

To be able to work towards our target state output from xxhash(ish), we need to know what the target state seed for the PRNG is. I started by deobfuscating one of the vector entry checks that had a single finite solution.

00004059: mov r4.d, 0x0
000015B4: cmp r11.d, 0x0
0000F719: je 0xCC18      ; <---- non-finite solution set
00017590: cmp r11.d, 0x1
0000D38C: je 0xDD6C      ; <---- finite solution set
...
0000DD6C: bit_mirror r5.d, r10.d
00004EEA: mov r5.d, r5.b
000113DC: bit_mirror r5.b, r5.b
0000071D: mov r6.d, r10.d
00017A6A: add r5.d, r6.d
0000C75B: mov r6.d, r10.w
00000E6E: shr r6.d, 0x8
00008213: sub r5.d, r6.d
0000C788: cmp r5.d, 0xDFD6220C
0000BD46: retne

The following is a lazy brute force over all 32-bit integers that recovers the constant in a few minutes.

# pypy3 brute.py
# solution: 0xdfd6214e
import multiprocessing as mp


def bit_mirror(int_: int, width: int) -> int:
    int_ &= (1 << width) - 1
    return int(('{:0' + str(width) + 'b}').format(int_, width=width)[::-1], 2)

def hash_ladder_0000DD6C(r10: int) -> int:
    r5 = bit_mirror(r10, 32)
    r5 = bit_mirror(r5, 8)
    r6 = r10
    r5 = (r5 + r6) & 0xffffffff
    r6 = r10 & 0xFFFF
    r6 >>= 8
    r5 = (r5 - r6) & 0xffffffff
    return r5 == 0xDFD6220C

def check_range(start, end): 
    results = []
    for i in range(start, end):
        if hash_ladder_0000DD6C(i):
            results.append(i)
    return results

if __name__ == '__main__':
    chunk_size = (0xFFFFFFFF + 1) // 8
    ranges = [(i * chunk_size, (i + 1) * chunk_size) for i in range(8)]
    ranges[-1] = (ranges[-1][0], 0xFFFFFFFF + 1)  # Adjust last range
    
    with mp.Pool(8) as pool:
        results = pool.starmap(check_range, ranges)
    
    for result_list in results:
        for i in result_list:
            print(f'solution: 0x{i:08x}')

Now to recover the PRNG seed I iterated over all 32-bit integers once again to check for a second state vector entry of 0xdfd6214e. The Mersenne Twister (MTRand) implementation was a claude translation of Evan Sultanik’s mtwister. This implementation lines up extremely well with the implementation used by the VM.

# pypy3 brute_mersenne.py
# solution: 0xb48db866
from mt import MTRand
import multiprocessing as mp

def mt_check(r10: int) -> int:
    rng = MTRand(r10)
    rng.gen_rand_long()
    r5 = rng.gen_rand_long()
    return r5 == 0xdfd6214e

def check_range(start, end): 
    results = []
    for i in range(start, end):
        if mt_check(i):
            results.append(i)
            print(f'crack: 0x{i:08x}')
    return results

if __name__ == '__main__':
    chunk_size = (0xFFFFFFFF + 1) // 8
    ranges = [(i * chunk_size, (i + 1) * chunk_size) for i in range(8)]
    ranges[-1] = (ranges[-1][0], 0xFFFFFFFF + 1)  
    
    with mp.Pool(8) as pool:
        results = pool.starmap(check_range, ranges)
    
    for result_list in results:
        for i in result_list:
            print(f'solution: 0x{i:08x}')

Note the use of pypy3 in both cases for significantly faster iteration over the potential inputs. The scripts also partition their work against 8 CPU threads, which should be tweaked depending on target hardware and impatience level.

Finding Input Hash Solutions

The xxhash(ish) system used on the user input functions at a fundamental level like the following:

          Input Lo     Input Hi     Hash Out

S1 --------> ⊕ --------> ⊕ --------> ⊕
                                       V
S2 --------> ⊕ --------> ⊕ --------> ⊕
                                       V
S3 --------> ⊕ --------> ⊕ --------> ⊕
                                       V
S4 --------> ⊕ --------> ⊕ --------> ⊕
                                       V
C1 ----------------------------------> ⊕

S1-4 Represent constant state inputs hashed from a hardcoded string in the application: “Look at you, hacker. A pathetic creature of meat and bone, panting and sweating as you run through my corridor. How can you challenge a perfect immortal machine?”

C1 is a constant based on the amount of input into the system. Input hi and lo are 32 bytes of in-order user input split into DWORD sized chunks.

Even when constrained to limited character sets, the precision loss and input flexibility mean there are simply too many solutions to find the original ‘wisdom’ the author intended - but better to have too many solutions than too few 🙂

I chose to solve for a valid hash using z3, the flexibility of the inputs meant I could impart a bit of my own wisdom in the solution. The script replicates the hashing algorithm and targets an output state of our desired PRNG seed.

from z3 import *


def bit_mirror(bv, width=32):
    bits = []
    if width == 32:
        for i in range(width):
            bits.append(Extract(i, i, bv))
        return Concat(*bits)
    elif width == 16:
        for i in range(16):
            bits.append(Extract(i, i, bv))
        return Concat(Extract(31, 16, bv), *bits)
    elif width == 8:
        for i in range(8):
            bits.append(Extract(i, i, bv))
        return Concat(Extract(31, 8, bv), *bits)


def byte_swap(bv, width=32):
    if width == 32:
        return Concat(Extract(15, 0, bv), Extract(31, 16, bv))
    elif width == 16:
        return Concat(Extract(31, 16, bv), Extract(7, 0, bv), Extract(15, 8, bv))
    elif width == 8:
        return Concat(Extract(31, 8, bv), Extract(3, 0, bv), Extract(7, 4, bv))
    else:
        raise ValueError('Invalid width for byte_swap')
    

def xx_round(r0, state, rotate: int):
    r4 = byte_swap(byte_swap(byte_swap(state), 8), 16)
    r4 = bit_mirror(r4)
    r4 = byte_swap(byte_swap(byte_swap(r4, 16), 8))
    if rotate < 0:
        return r0 ^ RotateLeft(r4, -rotate)
    else:
        return r0 ^ RotateRight(r4, rotate)
    

def xx_update(states, inputs):
    new_states = []
    for i in range(4):
        r4 = states[i]
        r8 = byte_swap(inputs[i]) ^ 0x85EBCA77
        r8 = byte_swap(r8, 16)
        r8 = bit_mirror(r8, 16)
        r8 = byte_swap(r8, 8)
        r8 = bit_mirror(r8, 8)
        r8 = byte_swap(r8)
        r8 = bit_mirror(r8)
        r4 = RotateLeft(r4 ^ r8, 0xD) ^ 0x9E3779B1
        r4 = bit_mirror(r4)
        r4 = byte_swap(r4)
        r4 = byte_swap(r4, 16)
        r4 = bit_mirror(r4, 16)
        r4 = bit_mirror(r4, 8)
        r4 = byte_swap(r4, 8)
        new_states.append(r4)
    return new_states


s = Solver()
zero_vec = BitVec('zero_vec', 32)
s.add(zero_vec == 0)

in_vecs = BitVecs('in_vec_1 in_vec_2 in_vec_3 in_vec_4 in_vec_5 in_vec_6 in_vec_7 in_vec_8', 32)

for in_vec in in_vecs:
    for i in range(4):
        cond = (in_vec >> (i * 8)) & 0xFF
        s.add(And(cond > 0x20, cond < 0x7A))

        s.add((in_vecs[0] >> 24) & 0xFF == ord('n'))
        s.add((in_vecs[0] >> 16) & 0xFF == ord('4'))
        s.add((in_vecs[0] >> 8) & 0xFF == ord('k'))
        s.add((in_vecs[0] >> 0) & 0xFF == ord('3'))

        s.add((in_vecs[1] >> 24) & 0xFF == ord('d'))
        s.add((in_vecs[1] >> 16) & 0xFF == ord('_'))
        s.add((in_vecs[1] >> 8) & 0xFF == ord('t'))
        s.add((in_vecs[1] >> 0) & 0xFF == ord('4'))

        s.add((in_vecs[2] >> 24) & 0xFF == ord('X'))
        s.add((in_vecs[2] >> 16) & 0xFF == ord('_'))
        s.add((in_vecs[2] >> 8) & 0xFF == ord('f'))
        s.add((in_vecs[2] >> 0) & 0xFF == ord('R'))

        s.add((in_vecs[3] >> 24) & 0xFF == ord('a'))
        s.add((in_vecs[3] >> 16) & 0xFF == ord('u'))
        s.add((in_vecs[3] >> 8) & 0xFF == ord('D'))
        s.add((in_vecs[3] >> 0) & 0xFF == ord('!'))


# Initial xx state
E1004 = 0x6123f3e9
E1008 = 0xfffb8a27
E100C = 0xca1590f0
E1010 = 0xcaf09150

E1004, E1008, E100C, E1010 = xx_update(
    [E1004, E1008, E100C, E1010], 
    [in_vecs[0], in_vecs[1], in_vecs[2], in_vecs[3]])

E1004, E1008, E100C, E1010 = xx_update(
    [E1004, E1008, E100C, E1010], 
    [in_vecs[4], in_vecs[5], in_vecs[6], in_vecs[7]])

# xxfinish
r0 = xx_round(zero_vec, E1004, 0x1F)
r0 = xx_round(r0, E1008, -0x7)
r0 = xx_round(r0, E100C, 0x14)
r0 = xx_round(r0, E1010, -0x12)
r0 = r0 ^ 0x00400000 # constant counter state

r0 = bit_mirror(r0 ^ RotateRight(r0, 0xF)) ^ 0x85EBCA77
r0 = bit_mirror(r0 ^ RotateRight(r0, 0xD)) ^ 0xC2B2AE3D
r0 = r0 ^ RotateRight(r0, 0xD)

s.add(r0 == 0xb48db866) # Target PRNG state

while s.check() == sat:
    m = s.model()
    secret = b''
    for i, in_vec in enumerate(in_vecs):
        secret += m[in_vec].as_long().to_bytes(4, 'big')
    print(f"secret = {secret.decode()}")

Wrap-up

I asked Ben_Lolo what the intended flag was after finding my solutions, and he was kind enough to share ~~-_B-l()yAl-2_W@T_mAtT3R5_-~~, but I have to admit I am slightly partial to the ones I found.

win

🎉🎉 happy hacking, until next time 🎉🎉

darbonzo