Introduction

Hello all. In this blog we’ll once again be talking about instrumentation callbacks, this time through some real-life use cases such as Hyperion, a very good security product written by the developers at the now-acquired Byfron. We’ll also be reverse engineering what appears to be a Roblox cheating tool that seems to use instrumentation callbacks as a form of code execution.

What an instrumentation callback is

For an in-depth explanation of what an instrumentation callback is, you can refer to my previous blog titled “Nirvana Debugging”. As a shallow explanation, it’s a callback that the kernel redirects execution to whenever a syscall returns to user mode. In other words, any call to a syscall will be routed to a registered instrumentation callback if one is installed. With this knowledge, we know that the syscall WILL be executed; there is no way for it to be paused through an instrumentation callback, and we can only act after it has already run. The callback is entered with the thread’s register state as it was right after the call returned. From here, we can either pick specific values out of the registers or capture the entire thread context ourselves. Some key registers to focus on will be RAX, RCX, and R10. I’ll talk in more detail soon about what these three registers hold, as the callbacks we reverse engineer later make use of these three in particular. I do recommend you come into this blog knowing what this form of callback is and how it works. The instrumentation callback can also pick up other callbacks: kernel-to-user transitions for thread start-up, exception handling, APC delivery and similar all pass through it as well.

Important Registers to note

  • Rax -> Return value of the call that was just trapped by the IC
  • R10 -> Origin of that call (the address execution would normally have returned to)
  • Rcx -> Address of the IC

Kernel Callbacks: Watchlist

Here, we need to discuss some of the kernel callbacks that we should know about, including why they’re there and what they do. This information is important as it will help us understand exactly how code execution can be tracked.

Kernel Callbacks: LdrInitializeThunk

First on the list is LdrInitializeThunk. As you may already know, LdrInitializeThunk is crucial in thread creation. It initializes the thread in the context of the process and is mandatory. Any created thread’s true start address is not the one the user specifies, but actually LdrInitializeThunk. With this knowledge, let’s look at what C-like pseudocode for this function might look like.

VOID
LdrInitializeThunk(
	__in PCONTEXT Context,
	__in PVOID NtDllBaseAddress
	)
{
	NTSTATUS Status;

	//
	// Call LdrpInitialize to perform process or thread initialization tasks.
	//

	LdrpInitialize(
		Context,
		NtDllBaseAddress
		);

	//
	// Resume execution at the user-supplied thread context.
	//

	Status = NtContinue(
		Context,
		TRUE
		);

	//
	// NtContinue should never return.
	//

	RtlRaiseStatus( Status );

	__debugbreak();
}

Now that we know the thread’s true start address is here and not where the user specified, we can apply some security analysis to explore how we can use this to prevent thread creation in a hook-less scenario. If the malicious actor has no idea we have an IC (Instrumentation Callback) installed, they’d be completely oblivious to this form of analysis.

Lockdown: LdrInitializeThunk

Assuming we’ve already got an IC installed, we can apply what we know about which registers hold which values. R10 will be the syscall origin. If the origin is LdrInitializeThunk, we can call a custom handler and validate whatever we need. Something like this:

cmp qword [LdrInitializeThunk], r10
je some_ldr_init_thunk_checker

This will jump to our own custom handler before control returns to the end-user. In the handler, we can perform sanity checks on any newly created thread. These checks can be as trivial or as complex as you want; at a minimum, we’ll want to check the start address, which we can get with NtQueryInformationThread. If it fails the sanity check, we can terminate the thread before it even executes.
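To make the start-address check concrete, here is a minimal sketch of the decision logic on its own. The Region type, the whitelist and check_new_thread are hypothetical names for illustration; in a real handler the start address would be fetched with NtQueryInformationThread rather than passed in as a parameter.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of a memory range we consider trusted.
struct Region {
    std::uintptr_t base;
    std::size_t size;
};

// True when addr falls inside one of the trusted ranges.
bool is_whitelisted(std::uintptr_t addr, const std::vector<Region>& trusted) {
    for (const Region& r : trusted) {
        if (addr >= r.base && addr - r.base < r.size) return true;
    }
    return false;
}

enum class ThreadVerdict { Allow, Terminate };

// Assumption: start_address was already queried for the new thread.
// Only the allow/terminate decision is modelled here.
ThreadVerdict check_new_thread(std::uintptr_t start_address,
                               const std::vector<Region>& trusted) {
    return is_whitelisted(start_address, trusted) ? ThreadVerdict::Allow
                                                  : ThreadVerdict::Terminate;
}
```

A thread whose start address lands in a known module would be allowed, while one starting in an unknown allocation would be flagged for termination before it runs.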

Kernel Callbacks: KiUserApcDispatcher

To understand this callback, we first need to understand APCs. Essentially, APCs are added to a FIFO (First In, First Out) queue, and when a thread enters an alertable state, any APCs on the queue are executed one by one, with their respective APC routines being called. When this happens, underneath the hood, it’s all handled within KiUserApcDispatcher. This function runs the APC and then eventually calls NtContinue in order to restore the original thread context. It sounds like we can lock down this function, doesn’t it? Knowing that malicious actors can queue APCs in order to get code execution instead of creating their own thread, we need to secure this too.
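The queueing behaviour described above can be modelled with a toy class. This is only a sketch of the FIFO semantics; ToyThread, queue_apc and enter_alertable_state are made-up names, not Windows APIs, and nothing here touches the real KiUserApcDispatcher.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Toy model of a per-thread APC queue (illustrative names only).
class ToyThread {
public:
    // Queue a routine; it does not run until the thread becomes alertable.
    void queue_apc(std::function<void()> routine) {
        pending_.push(std::move(routine));
    }

    // Models the thread entering an alertable state: every pending APC
    // runs, oldest first. Returns how many routines were executed.
    std::size_t enter_alertable_state() {
        std::size_t ran = 0;
        while (!pending_.empty()) {
            std::function<void()> routine = std::move(pending_.front());
            pending_.pop();
            routine();
            ++ran;
        }
        return ran;
    }

private:
    std::queue<std::function<void()>> pending_;
};
```

The point of the model is that queued code runs on a thread that already exists, which is why an attacker can use it instead of creating a thread of their own.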

Lockdown: KiUserApcDispatcher

cmp qword [KiUserApcDispatcher], r10 
je some_apc_dispatcher_handler 

This is where it might get a little tricky. If we get this hit, it means that the APC has already been dispatched. To attempt to find whatever APC was just executed, we could use a method like stack tracing. Since we know that the APC has been dispatched to an alertable thread, we can iterate through all threads and create a stack backtrace for each. Then, we can check if anything within the stack leads to an area of memory that it shouldn’t be within.

Kernel Callbacks: KiUserExceptionDispatcher

This one is a bit tricky. What this callback does is forward exceptions to a registered handler, if one exists. In the context of our instrumentation callback, it can be seen as a middle ground between forwarding the exception and handling it. Imagine a scenario where we encrypt our pages and only decrypt them if they’re accessed from a region that’s created and managed by us, in other words, whitelisted. What we can do is only decrypt the page if the access is valid, although this is a little difficult without any parameters.

Lockdown: KiUserExceptionDispatcher

cmp qword [KiUserExceptionDispatcher], r10
je some_exception_handler

Once again, this is where it becomes a little tedious. We cannot access the original parameters, and we’re in the scenario of “we encrypt pages, so let’s make sure a whitelisted page that is currently encrypted is only decrypted if it’s called from another whitelisted page”. We do know that an exception was raised. Let’s focus only on access violations, so we can check the rax register and, if it holds an access violation code, perform another stack trace. If the call into the whitelisted page, or really anything within the stack trace, comes from outside a whitelisted page, we can simply not let the exception be forwarded to the exception handler for decryption (or however we set up our decryption sequence). This will leave the page encrypted and the exception unhandled. Here’s an example of what the stack would look like (Top -> Bottom)

  • Instrumentation Callback <—— us
  • KiUserExceptionDispatcher <—— the callback
  • Exception Handling Frame
  • Encrypted page <—— where the exception was generated
  • Malicious or non malicious page <—— where the exception was generated from
  • ….
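The decision described for the exception path can be sketched as a small function. This is only an illustration under assumptions: the frames are assumed to have been collected already by some stack walk, Region and decide are hypothetical names, and the only hard fact used is that an access violation is reported as 0xC0000005.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kStatusAccessViolation = 0xC0000005u;

// Hypothetical trusted range, as in "whitelisted page" above.
struct Region {
    std::uintptr_t base;
    std::size_t size;
};

enum class Decision { NotOurs, ForwardForDecryption, Block };

bool in_any(std::uintptr_t addr, const std::vector<Region>& trusted) {
    for (const Region& r : trusted) {
        if (addr >= r.base && addr - r.base < r.size) return true;
    }
    return false;
}

// code:   the exception code we observed
// frames: return addresses from a stack trace (assumed already captured)
Decision decide(std::uint32_t code,
                const std::vector<std::uintptr_t>& frames,
                const std::vector<Region>& trusted) {
    if (code != kStatusAccessViolation) return Decision::NotOurs;
    for (std::uintptr_t frame : frames) {
        // A single frame outside the whitelist is enough to refuse decryption.
        if (!in_any(frame, trusted)) return Decision::Block;
    }
    return Decision::ForwardForDecryption;
}
```

Blocking here means the exception is never forwarded, so the page stays encrypted.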

Instrumentation Callback: Wave

So it begins. Here’s a little debrief: Wave is a cheating tool used on Roblox. I found interest in it as it appears to be the largest paid cheating tool for Roblox at this moment in time. Although it’s detected and the developers don’t seem to care, I figured we might as well take a look at it. The instrumentation callback is quite large, so we’ll split it into chunks in order to see what exactly it’s doing and attempt to figure out what the actual core of the code execution derives from.

Section 1: Handling Callbacks

    pushfq
    push rdx

    lea rdx, [some_structure]
    cmp rcx, QWORD PTR [rdx+8]
    cmove rcx, QWORD PTR [some_structure]

    cmp r10, QWORD PTR [rdx+018h]
    je skip_path

    cmp r10, QWORD PTR [rdx+020h]
    je skip_path

    cmp r10, QWORD PTR [rdx+028h]
    je skip_path

    cmp r10, QWORD PTR [rdx+010h]
    je handle_exception

In this section of assembly, we can see quite a few things happening. Just to note, this is the start of the callback, and it is where we begin analysing the foreign structure that they seem to initialise before installing it. We can tell that rdx + 8 is their instrumentation callback address. I analysed the next few offsets dynamically; here is what the structure looks like thus far.

; +0 -> roblox instrumentation callback address 
; +8 -> wave instrumentation callback address 
; +10 -> KiUserExceptionDispatcher address 
; +18 -> LdrInitializeThunk address 
; +20 -> KiUserApcDispatcher address 
; +28 -> KiUserCallbackDispatcher address (?)

The code so far is pretty simple: it looks like anything that isn’t an exception is just ignored for now. This is typical, because we know that Roblox will strip any allocation that has execute protection if it’s not whitelisted. Once an exception occurs (attempted execution in a non-whitelisted page), Wave’s handler takes over. What’s interesting is that LdrInitializeThunk and KiUserApcDispatcher are both ignored, which suggests that Wave is not creating its own thread or queuing an APC.

Another thing to note is this group of instructions:

    lea rdx, [some_structure]
    cmp rcx, QWORD PTR [rdx+8]
    cmove rcx, QWORD PTR [some_structure]

This is actually interesting. It lets us realise that what they’re doing is installing their own instrumentation callback and then jumping to Roblox’s original callback immediately after they’re done; that’s where skip_path comes into play.

The reason they’re setting rcx to Roblox’s instrumentation callback is that, as noted earlier, RCX contains the instrumentation callback address. Roblox (as shown later on) verifies whether the RCX register holds Roblox’s instrumentation callback or not, and these instructions in Wave help pass that check (if you could call it one).

A pseudocode-like representation of these instructions so far is something along the lines of this

void ic() {
    if (rcx == some_structure->wave_ic)
        rcx = some_structure->roblox_ic;
    /* 
    ignored_functions_map = {
        some_structure->LdrInitializeThunk, 
        some_structure->KiUserCallbackDispatcher, 
        some_structure->KiUserApcDispatcher
    }
    */

    if (ignored_functions_map.find(r10) != ignored_functions_map.end())
        goto skip_path; 
    else if (r10 == some_structure->KiUserExceptionDispatcher)
        goto handle_exception;
    //....
}

Section 2: Initialisation


    cmp eax, 0C000001Ch
    je set_rax_to_zero

    cmp eax, 0124h
    je spoof_job

    push rax
    mov rax, rsp
    mov rdx, gs:[030h]
    mov rdx, [rdx+010h]
    sub rax, 4D0h
    and rax, -16
    sub rax, 060h
    cmp rax, rdx
    jb skip_path_copy

    mov rdx, gs:[030h]
    mov edx, DWORD PTR [rdx+016B4h]
    cmp edx, DWORD PTR [some_structure+0560h]
    jne set_debug_registers

    lea rdx, [some_structure]
    xor rax, rax
    mov eax, DWORD PTR [rdx+038h]
    cmp eax, 1
    jne skip_path_copy

    xor rax, rax
    lock xchg DWORD PTR [rdx+038h], eax
    cmp eax, 1
    jne skip_path_copy

    pop rax
    pop rdx
    popfq

A lot to unpack here. Let’s begin by splitting it into even more chunks; the first chunk we’ll be looking at is the first two comparisons.

    cmp eax, 0C000001Ch
    je set_rax_to_zero

    cmp eax, 0124h
    je spoof_job

Looks like we’re dealing with some NTSTATUS returns. The first, 0xC000001C (STATUS_INVALID_SYSTEM_SERVICE), sets the return of any invalid call to zero (hence the name). The second, 0x124 (STATUS_PROCESS_IN_JOB), appears to replace the value of eax to spoof the job check, so that the process looks as though it’s not in a job. This is what makes me believe something important is going on here.
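As a sketch of that first filter, the two comparisons can be written as a small classifier. The enum and function names are mine; the two constants are the standard NTSTATUS values for those codes. What value the spoof_job branch actually writes is not shown in the disassembly, so it is deliberately left out.

```cpp
#include <cstdint>

constexpr std::uint32_t kStatusInvalidSystemService = 0xC000001Cu;
constexpr std::uint32_t kStatusProcessInJob = 0x00000124u;

enum class StatusAction { ZeroRax, SpoofJob, Continue };

// Mirrors `cmp eax, ...`: only the low 32 bits of rax take part.
StatusAction classify_status(std::uint64_t rax) {
    const std::uint32_t eax = static_cast<std::uint32_t>(rax);
    if (eax == kStatusInvalidSystemService) return StatusAction::ZeroRax;
    if (eax == kStatusProcessInJob) return StatusAction::SpoofJob;
    return StatusAction::Continue;
}
```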

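    push rax
    mov rax, rsp
    mov rdx, gs:[030h]
    mov rdx, [rdx+010h]
    sub rax, 4D0h
    and rax, -16
    sub rax, 060h
    cmp rax, rdx
    jb skip_path_copy

This appears to be a stack limit check against the stack pointer. gs:[30h] is the TEB and +10h is its stack limit; the code takes rsp, subtracts 0x4D0 (which happens to match the size of an x64 CONTEXT), aligns down to 16 bytes, subtracts a further 0x60, and bails out if the result falls below the limit. I’m not entirely sure what it’s for, but it seems to decide whether there’s enough stack left to continue; if not, the call is ignored.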
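The arithmetic in that chunk is easy to reproduce step by step. The sketch below is a direct transliteration of the instructions; the function names are mine, and reading 0x4D0 as "room for a CONTEXT" is an assumption rather than something the disassembly states.

```cpp
#include <cstdint>

// sub rax, 4D0h / and rax, -16 / sub rax, 060h
std::uint64_t probe_address(std::uint64_t rsp) {
    std::uint64_t probe = rsp - 0x4D0;   // assumed: space for a CONTEXT
    probe &= ~std::uint64_t{0xF};        // align down to 16 bytes
    probe -= 0x60;                       // extra scratch space
    return probe;
}

// cmp rax, rdx / jb skip_path_copy
// Returns false when the callback would take the skip path.
bool has_stack_room(std::uint64_t rsp, std::uint64_t stack_limit) {
    return probe_address(rsp) >= stack_limit;
}
```

With a stack limit of 0x1000, an rsp of 0x1530 is the lowest 16-byte-aligned value that still passes.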

    mov rdx, gs:[030h]
    mov edx, DWORD PTR [rdx+016B4h]
    cmp edx, DWORD PTR [some_structure+0560h]
    jne set_debug_registers

This is an important part. It reads a per-thread value from the TEB (offset 16B4h, which I take to be tied to the thread’s error mode) and compares it with a field in the self-made structure. This appears to be an indicator for whether the debug registers need to be set, and the branch is named accordingly. The new structure is posted further down.

    lea rdx, [some_structure]
    xor rax, rax
    mov eax, DWORD PTR [rdx+038h]
    cmp eax, 1
    jne skip_path_copy

    xor rax, rax
    lock xchg DWORD PTR [rdx+038h], eax
    cmp eax, 1
    jne skip_path_copy

    pop rax
    pop rdx
    popfq

This part is pretty straightforward. It checks a flag at +38 and only continues while it still reads 1, then atomically swaps the flag with 0 and checks the old value again. Remember this runs on every single thread, so there is most definitely a scenario where two threads reach this point at the same time; the swap ensures only one of them carries on. lock xchg is an atomic instruction, so the swap cannot be interleaved with another thread’s.
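The same check-then-swap pattern can be expressed with std::atomic. This is a sketch of the technique rather than Wave's code: claim_once is a made-up name, and the flag plays the role of the DWORD at +38.

```cpp
#include <atomic>
#include <cstdint>

// Returns true for exactly one caller while the flag holds 1.
bool claim_once(std::atomic<std::uint32_t>& flag) {
    // mov eax, [flag] / cmp eax, 1 / jne: cheap early-out without a locked op
    if (flag.load() != 1) return false;
    // lock xchg [flag], eax with eax = 0: swap in 0 and look at the old value
    return flag.exchange(0) == 1;
}
```

If two threads pass the first check together, only the one whose exchange returns 1 proceeds; the other sees 0 and takes the skip path.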

The updated structure appears as follows

; +0 -> roblox instrumentation callback address 
; +8 -> wave instrumentation callback address 
; +10 -> KiUserExceptionDispatcher address 
; +18 -> LdrInitializeThunk address 
; +20 -> KiUserApcDispatcher address 
; +28 -> KiUserCallbackDispatcher address (?)
; +30 -> ?
; +38 -> initialised flag address
; +60 -> thread context
...
; +560 -> debug registers code to reset

The updated pseudocode is as follows

void ic() {
    if (rcx == some_structure->wave_ic)
        rcx = some_structure->roblox_ic;
    /* 
    ignored_functions_map = {
        some_structure->LdrInitializeThunk, 
        some_structure->KiUserCallbackDispatcher, 
        some_structure->KiUserApcDispatcher
    }
    */

    if (ignored_functions_map.find(r10) != ignored_functions_map.end())
        goto skip_path; 
    else if (r10 == some_structure->KiUserExceptionDispatcher)
        goto handle_exception;


    uint32_t eax = (uint32_t)rax;
    if (eax == 0xC000001C) 
        rax = 0; // set_rax_to_zero
    else if (eax == 0x124)
        goto spoof_job;
    
    // stack_pointer here is (rsp - 0x4D0) aligned down to 16, minus 0x60
    if (stack_pointer < thread_stack_limit)
        goto skip_path_copy;

    if (thread_error_mode != some_structure->reset_hwbp_code)
        goto set_debug_registers;

    // only continue while the flag still reads 1
    if (some_structure->initialised != 1)
        goto skip_path_copy;

    // lock xchg: swap the flag with 0, only the thread that saw 1 continues
    if (atomic_exchange(&some_structure->initialised, 0) != 1)
        goto skip_path_copy;
}

Section 3: Code execution


    sub rsp, 8
    mov QWORD PTR [rsp], 0
    mov rcx, rsp
    mov rdx, qword ptr [some_structure+0530h]
    mov r8, qword ptr [some_structure+0548h]
    xor r9, r9
    mov rax, qword ptr [some_structure+0538h]
    sub rsp, 20h
    call rax ; TpAllocWork
    add rsp, 20h

    test eax, eax
    js push_rsp_by_8

    pop rcx
    mov rax, qword ptr [some_structure+0540h]
    sub rsp, 28h
    call rax ; TpPostWork
    add rsp, 28h

    jmp restore_context

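Finally, something interesting. We have two function calls; I traced them and they appear to be TpAllocWork and TpPostWork. Interesting, isn’t it? It makes sense given the earlier checks. Let’s break down the arguments and add them to the memory structure properly. One caveat: going by the commonly cited prototype of TpAllocWork (out pointer for the work item, callback, context, callback environment), rdx would carry the work callback and r8 its context, so my labels for +530 and +548 below are best guesses.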

The updated structure appears as follows

; +0 -> roblox instrumentation callback address 
; +8 -> wave instrumentation callback address 
; +10 -> KiUserExceptionDispatcher address 
; +18 -> LdrInitializeThunk address 
; +20 -> KiUserApcDispatcher address 
; +28 -> KiUserCallbackDispatcher address (?)
; +30 -> ?
; +38 -> initialised flag address
; +60 -> thread context
...
; +530 -> work thread address 
; +538 -> TpAllocWork address 
; +540 -> TpPostWork address
; +548 -> callback address (?)
; +560 -> debug registers code to reset
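One way to keep the recovered offsets honest is to write the layout down as a struct and let the compiler check it. This is a sketch under assumptions: the field names are mine, the gaps are filled with padding because their contents are unknown, and the 0x4D0-byte blob at +60 is assumed to be an x64 CONTEXT simply because it ends exactly at +530.

```cpp
#include <cstddef>
#include <cstdint>

struct WaveState {
    std::uint64_t roblox_ic;                    // +0x000
    std::uint64_t wave_ic;                      // +0x008
    std::uint64_t ki_user_exception_dispatcher; // +0x010
    std::uint64_t ldr_initialize_thunk;         // +0x018
    std::uint64_t ki_user_apc_dispatcher;       // +0x020
    std::uint64_t ki_user_callback_dispatcher;  // +0x028 (uncertain)
    std::uint64_t unknown_30;                   // +0x030
    std::uint32_t initialised;                  // +0x038
    std::uint8_t  pad_3c[0x24];                 // +0x03C, unknown
    std::uint8_t  thread_context[0x4D0];        // +0x060, assumed CONTEXT
    std::uint64_t work_thread;                  // +0x530
    std::uint64_t tp_alloc_work;                // +0x538
    std::uint64_t tp_post_work;                 // +0x540
    std::uint64_t callback;                     // +0x548 (uncertain)
    std::uint8_t  pad_550[0x10];                // +0x550, unknown
    std::uint32_t debug_register_code;          // +0x560
};

// Compile-time checks against the offsets seen in the disassembly.
static_assert(offsetof(WaveState, initialised) == 0x38, "flag offset");
static_assert(offsetof(WaveState, thread_context) == 0x60, "context offset");
static_assert(offsetof(WaveState, work_thread) == 0x530, "work offset");
static_assert(offsetof(WaveState, tp_alloc_work) == 0x538, "alloc offset");
static_assert(offsetof(WaveState, tp_post_work) == 0x540, "post offset");
static_assert(offsetof(WaveState, callback) == 0x548, "callback offset");
static_assert(offsetof(WaveState, debug_register_code) == 0x560, "code offset");
```

If a later finding moves a field, the static_assert for it fails at compile time instead of silently drifting.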

The updated pseudocode is as follows

void ic() {
    if (rcx == some_structure->wave_ic)
        rcx = some_structure->roblox_ic;
    /* 
    ignored_functions_map = {
        some_structure->LdrInitializeThunk, 
        some_structure->KiUserCallbackDispatcher, 
        some_structure->KiUserApcDispatcher
    }
    */

    if (ignored_functions_map.find(r10) != ignored_functions_map.end())
        goto skip_path; 
    else if (r10 == some_structure->KiUserExceptionDispatcher)
        goto handle_exception;


    uint32_t eax = (uint32_t)rax;
    if (eax == 0xC000001C) 
        rax = 0; // set_rax_to_zero
    else if (eax == 0x124)
        goto spoof_job;
    
    // stack_pointer here is (rsp - 0x4D0) aligned down to 16, minus 0x60
    if (stack_pointer < thread_stack_limit)
        goto skip_path_copy;

    if (thread_error_mode != some_structure->reset_hwbp_code)
        goto set_debug_registers;

    // only continue while the flag still reads 1
    if (some_structure->initialised != 1)
        goto skip_path_copy;

    // lock xchg: swap the flag with 0, only the thread that saw 1 continues
    if (atomic_exchange(&some_structure->initialised, 0) != 1)
        goto skip_path_copy;
    
    T work = T();
    auto result = TpAllocWork(&work, some_structure->work_thread, some_structure->callback, 0);

    if (result < 0) {
        // add rsp by 8 and return 
        return;
    }
    
    TpPostWork(work);
    // restore context and return 
}
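The allocate-then-post sequence can be isolated into a few lines. This is a hedged sketch: the function-pointer shapes follow the commonly cited, undocumented ntdll prototypes, all type and function names below are mine, and the pointers are passed in so that nothing here depends on a real ntdll.

```cpp
#include <cstdint>

using NtStatus = std::int32_t;

// Assumed shapes of the ntdll thread pool exports (undocumented).
using WorkCallback = void (*)(void* instance, void* context, void* work);
using TpAllocWorkFn = NtStatus (*)(void** work_out, WorkCallback callback,
                                   void* context, void* callback_env);
using TpPostWorkFn = void (*)(void* work);

// Mirrors the call sequence in the disassembly:
//   rcx = &work (zeroed slot on the stack), rdx = [+530], r8 = [+548], r9 = 0
//   test eax, eax / js -> bail out on a negative NTSTATUS
//   pop rcx / call TpPostWork
bool post_once(TpAllocWorkFn alloc_work, TpPostWorkFn post_work,
               WorkCallback callback, void* context) {
    void* work = nullptr;
    const NtStatus status = alloc_work(&work, callback, context, nullptr);
    if (status < 0) return false;  // allocation failed, nothing is posted
    post_work(work);               // a pool worker will run the callback
    return true;
}
```

The interesting consequence is that the payload ends up running on an existing thread pool worker, so no new thread has to pass through LdrInitializeThunk under Wave's own name.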

Section 4: Page decryption

No assembly here as it’s extremely trivial. After encountering an exception, Wave checks whether it’s an access violation or a single-step code. If it’s an access violation, it just calls Roblox’s default page decryptor function. If it’s a single step, which is how a hardware breakpoint shows up, it checks which debug register was triggered. If the value is 2, it sets rax to zero. If it’s 4, it calls a separate function which tampers with the virtual memory query in order to hide their allocation by modifying their allocation entry within the Hyperion blacklist. A value of 8 skips HWBP detections.
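Since no assembly is shown for this part, here is a sketch of the triage as described in the paragraph above. The enum and function names are mine. The two exception codes are the standard values for access violation and single step; reading 2, 4 and 8 as DR6-style status bits for individual debug registers is my assumption.

```cpp
#include <cstdint>

constexpr std::uint32_t kStatusAccessViolation = 0xC0000005u;
constexpr std::uint32_t kStatusSingleStep = 0x80000004u;

enum class WaveAction {
    CallRobloxDecryptor,  // access violation: defer to the page decryptor
    ZeroRax,              // breakpoint value 2
    HideAllocation,       // breakpoint value 4
    SkipHwbpDetection,    // breakpoint value 8
    PassThrough           // anything else
};

// code:  exception code
// fired: which debug register triggered (assumed to be a DR6-style bit)
WaveAction triage(std::uint32_t code, std::uint32_t fired) {
    if (code == kStatusAccessViolation) return WaveAction::CallRobloxDecryptor;
    if (code != kStatusSingleStep) return WaveAction::PassThrough;
    switch (fired) {
        case 2: return WaveAction::ZeroRax;
        case 4: return WaveAction::HideAllocation;
        case 8: return WaveAction::SkipHwbpDetection;
        default: return WaveAction::PassThrough;
    }
}
```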

Section 5: Recap

Well, that’s pretty much all there is for Wave. It seems they’re just using cheap tricks to get by. Honestly, it’s somewhat impressive, but still disappointing that it’s almost a duct-taped method, and cheap tricks never last. We’ve now analysed exactly how they keep their allocation “safe”(?) and how they’re executing code.

Instrumentation Callback: Hyperion

Onto Hyperion. Hyperion has a really strong, well-written and smart implementation of an instrumentation callback. It only appears to target three callbacks, and all three are ones already discussed in this blog; I didn’t ramble about those callbacks for no reason. I apologise ahead of time that this area will be a little bare compared with Wave, which is down to the heavy obfuscation within Hyperion. Luckily, I’ve already annotated the main premise of the instrumentation callback, so let’s check it out.

Section 1: The Instrumentation Callback

; preserve_registers
push r10 ; syscall origin
push rax ; syscall return
pushfq 
push rbx

mov rbx,rsp ; store original stack pointer

; check if rcx contains address of ic (it always should (?))
lea rax,[RobloxPlayerBeta.dll+C4A0A0] { (-1672457663) } ; set rax to start of IC
cmp rcx,rax ; Check if shellcode is the hardcoded start of what it should be ?
cmove rcx,r10 ; set rcx to the syscall origin (where it was invoked)

; Allocate 0xC0 bytes onto stack & align to 16 byte boundary
lea r10,[rsp-000000C0]
and r10,-10 { 240 }
mov rsp,r10
cld 

; Store registers onto allocated memory (some sort of stack struct)
mov [rsp+000000B0],rcx
mov [rsp+000000A8],rdx
mov [rsp+000000A0],r8
mov [rsp+00000098],r9
mov [rsp+00000090],r11
movaps [rsp+00000080],xmm0
movaps [rsp+70],xmm1
movaps [rsp+60],xmm2
movaps [rsp+50],xmm3
movaps [rsp+40],xmm4
movaps [rsp+30],xmm5
mov rcx,[rbx+18] ; rcx = rip
; rcx = first param, don't see anything regarding a rdx for the second param
call RobloxPlayerBeta.dll+14AF110 ; from what i can see, this will verify the RIP

; set rip to new return from call & restore all registers
mov [rbx+18],rax
mov rcx,[rsp+000000B0]
mov rdx,[rsp+000000A8]
mov r8,[rsp+000000A0]
mov r9,[rsp+00000098]
mov r11,[rsp+00000090]
movaps xmm0,[rsp+00000080]
movaps xmm1,[rsp+70]
movaps xmm2,[rsp+60]
movaps xmm3,[rsp+50]
movaps xmm4,[rsp+40]
movaps xmm5,[rsp+30]
mov rsp,rbx
pop rbx
popfq 
pop rax
pop r10
jmp r10 ; return to code (modified or original, depending on verifier call)

Seems pretty straightforward and well written compared to the previous one; it’s actually interesting to see. The only parameter that’s passed to the function is rip (the saved r10), and that function is where all the main handlers and checks are.

Section 2: The Verifier

push    rbp 
push    r15 
push    r14 
push    r13 
push    r12 
push    rsi 
push    rdi 
push    rbx 

sub     rsp, 0x5f8
lea     rbp, [rsp+0x80]
mov     qword [rbp+0x570], 0xfffffffffffffffe
mov     rsi, rcx  ; rsi = syscall origin
mov     r8, qword [gs:0x30]
movzx   eax, byte [r8+0x2ec]
test    al, 0x1
jne     InstrumentationDisabled

mov     rdi, qword [r8 {_TEB::NtTib.ExceptionList}]

cmp     qword [rel KiUserExceptionDispatcher], rsi
je      Return$sub_7ffd2af6a060

cmp     qword [rel LdrInitializeThunk], rsi
je      Return$sub_7ffd2af6a180

cmp     qword [rel KiUserApcDispatcher], rsi
je      Return$sub_7ffd2af6a170

jmp     Return$DisableIC

As we can see, there are three handlers set up, one for each of those three important callbacks. It’s here that they perform all the sanity checks on what should and should not happen. Unfortunately I cannot show in detail what type of sanity checks they do, but they are very thorough, especially for the creation of threads. Each of those handlers appears to be reached through a hooked tail call.
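The control flow of the verifier's entry can be summarised in a few lines. This is a sketch of the dispatch only; the enum, struct and function names are mine, the addresses are plain parameters rather than values read from ntdll, and none of the actual sanity checks behind each handler are modelled.

```cpp
#include <cstdint>

enum class Handler {
    InstrumentationDisabled,  // TEB flag bit was set
    Exception,                // origin == KiUserExceptionDispatcher
    ThreadInit,               // origin == LdrInitializeThunk
    Apc,                      // origin == KiUserApcDispatcher
    DisableIc                 // nothing matched (Return$DisableIC)
};

struct KnownOrigins {
    std::uintptr_t ki_user_exception_dispatcher;
    std::uintptr_t ldr_initialize_thunk;
    std::uintptr_t ki_user_apc_dispatcher;
};

// origin:       the value passed in rcx (syscall origin)
// disable_flag: the byte read from the TEB at +0x2ec
Handler pick_handler(std::uintptr_t origin, std::uint8_t disable_flag,
                     const KnownOrigins& known) {
    if (disable_flag & 0x1) return Handler::InstrumentationDisabled;
    if (origin == known.ki_user_exception_dispatcher) return Handler::Exception;
    if (origin == known.ldr_initialize_thunk) return Handler::ThreadInit;
    if (origin == known.ki_user_apc_dispatcher) return Handler::Apc;
    return Handler::DisableIc;
}
```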

Section 3: Post-verifier

After the verification is finished, a modified (or original) r10 is returned. That value decides whether the current path of execution continues as it was or gets redirected.

Section 4: Recap

So, it appears that Hyperion’s implementation of an instrumentation callback is strictly for preventing code execution, as well as handling any exceptions raised in whatever allocation there may be.

Conclusion

Thank you very much for reading. I mean no disrespect to either of the parties involved in this blog. I’m a very curious person and decided to try and figure something new out. I do apologise for any incorrect information that may be present here; everything I’ve written is hypothetical in a sense. Please reach out to me if you’ve got any issues with this blog or want to correct a mistake.

References

C-Like representation of LdrInitializeThunk