Introduction
Hello all. In this blog we’ll be talking, once again, about instrumentation callbacks, but this time through some real-life use cases like Hyperion, a very good security product written by the developers at the now-acquired Byfron. We’ll also be reverse engineering what appears to be a Roblox cheating tool that seems to use instrumentation callbacks as a form of code execution.
What an instrumentation callback is
For an in-depth explanation of what an instrumentation callback is, you can refer to my previous blog titled “Nirvana Debugging”. As a shallow explanation, it’s a callback, registered with the kernel, to which execution is relocated whenever a syscall returns. In other words, any syscall will be routed to the registered instrumentation callback if one is installed. With this knowledge, we know that the syscall WILL be executed; there is no way for it to be paused through an instrumentation callback, and we can only act after it has already been executed. The kernel sends back a PCONTEXT which describes the current thread’s state after the call has been invoked. From here, we can either parse out specific data from the registers or take the entire PCONTEXT itself. Some key registers to focus on will be RAX, RCX, and R10 (R10 being the RIP that execution would otherwise have returned to). I’ll talk in more detail soon about what these three registers hold, as the callbacks we reverse engineer later make use of these three in particular. I do recommend you come into this blog knowing what this form of callback is and how it works. The instrumentation callback can also pick up other callbacks: kernel-to-user callbacks for thread start-up, exception handling, and similar events all pass through it too.
Important Registers to note
- RAX -> Return value of the call that was last made and is now trapped within the IC
- R10 -> Origin of the last call
- RCX -> Address of the IC
Kernel Callbacks: Watchlist
Here, we need to discuss some of the kernel callbacks that we should know about, including why they’re there and what they do. This information is important as it will help us understand exactly how code execution can be tracked.
Kernel Callbacks: LdrInitializeThunk
First on the list is LdrInitializeThunk. As you may already know, LdrInitializeThunk is crucial in thread creation. It initializes the thread in the context of the process and is mandatory. Any created thread’s true start address is not the one the user specifies, but actually LdrInitializeThunk. With this knowledge, let’s look at what C-like pseudocode for this function might look like.
VOID
LdrInitializeThunk(
    __in PCONTEXT Context,
    __in PVOID NtDllBaseAddress
    )
{
    NTSTATUS Status;
    //
    // Call LdrpInitialize to perform process or thread initialization tasks.
    //
    LdrpInitialize(
        Context,
        NtDllBaseAddress
        );
    //
    // Resume execution at the user-supplied thread context.
    //
    Status = NtContinue(
        Context,
        TRUE
        );
    //
    // NtContinue should never return.
    //
    RtlRaiseStatus( Status );
    __debugbreak();
}
Now that we know the thread’s true start address is here and not where the user specified, we can apply some security analysis to explore how we can use this to prevent thread creation in a hook-less scenario. If the malicious actor has no idea we have an IC (Instrumentation Callback) installed, they’d be completely oblivious to this form of analysis.
Lockdown: LdrInitializeThunk
Assuming we’ve already got an IC installed, we can apply the knowledge we have of which registers hold which values. R10 will be the syscall origin. If the origin is LdrInitializeThunk, we can call a custom handler and validate whatever we need. Something like this:
cmp qword [LdrInitializeThunk], r10
je some_ldr_init_thunk_checker
This will jump to our own custom handler before control returns to the end user. In the handler, we can perform sanity checks on any newly created thread. These checks can be as trivial or as complex as you want them to be; at a minimum, we’ll want to check the start address, which we can get with NtQueryInformationThread. If it fails the sanity check, we can terminate the thread before it even executes.
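To make the sanity check a little more concrete, here’s a minimal C++ sketch of the decision the handler has to make. It’s only a model: Range, Verdict, in_whitelist and check_new_thread are names I’ve made up, and start_address stands in for whatever NtQueryInformationThread reports for the new thread.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Half-open address range [begin, end) that we consider trusted (hypothetical).
struct Range {
    std::uintptr_t begin;
    std::uintptr_t end;
};

enum class Verdict { Allow, Terminate };

// Returns true when addr falls inside any whitelisted range.
inline bool in_whitelist(std::uintptr_t addr, const std::vector<Range>& whitelist) {
    for (const Range& r : whitelist) {
        if (addr >= r.begin && addr < r.end) return true;
    }
    return false;
}

// Called once R10 has matched LdrInitializeThunk: decide the fate of the new thread.
// start_address stands in for the value NtQueryInformationThread would report.
inline Verdict check_new_thread(std::uintptr_t start_address,
                                const std::vector<Range>& whitelist) {
    return in_whitelist(start_address, whitelist) ? Verdict::Allow
                                                  : Verdict::Terminate;
}
```

A real handler would build the whitelist from the modules it trusts and act on Terminate before the thread ever reaches its start routine.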
Kernel Callbacks: KiUserApcDispatcher
To understand this callback, we first need to understand APCs. Essentially, APCs are added to a FIFO (First In, First Out) queue, and when a thread enters an alertable state, any APCs on the queue are executed one by one, with their respective APC routines being called. When this happens, underneath the hood, it’s all handled within KiUserApcDispatcher. This function runs the APC and then eventually calls NtContinue in order to restore the original thread context. It sounds like we can lock down this function, doesn’t it? Knowing that malicious actors can queue APCs in order to get code execution instead of creating their own thread, we need to secure this too.
Lockdown: KiUserApcDispatcher
cmp qword [KiUserApcDispatcher], r10
je some_apc_dispatcher_handler
This is where it might get a little tricky. If we get this hit, it means that the APC has already been dispatched. To attempt to find whatever APC was just executed, we could use a method like stack tracing. Since we know that the APC has been dispatched to an alertable thread, we can iterate through all threads and create a stack backtrace for each. Then, we can check if anything within the stack leads to an area of memory that it shouldn’t be within.
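As a rough sketch of that last step, the check over one thread’s backtrace could look something like this in C++. The frames are assumed to have been captured already (how you walk the stack is up to you), and Range, in_any and first_untrusted_frame are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Half-open address range [begin, end) of memory we trust (hypothetical).
struct Range {
    std::uintptr_t begin;
    std::uintptr_t end;
};

inline bool in_any(std::uintptr_t addr, const std::vector<Range>& ranges) {
    for (const Range& r : ranges) {
        if (addr >= r.begin && addr < r.end) return true;
    }
    return false;
}

// Returns the index of the first return address that lies outside every
// trusted range, or -1 when the whole backtrace looks clean.
inline std::ptrdiff_t first_untrusted_frame(const std::vector<std::uintptr_t>& frames,
                                            const std::vector<Range>& trusted) {
    for (std::size_t i = 0; i < frames.size(); ++i) {
        if (!in_any(frames[i], trusted)) return static_cast<std::ptrdiff_t>(i);
    }
    return -1;
}
```

Running this over every thread’s backtrace gives a list of candidate frames to inspect further.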
Kernel Callbacks: KiUserExceptionDispatcher
This one is a bit tricky. What this callback does is forward exceptions to a handler, if one exists. In the context of our instrumentation callback, this could be seen as a middle ground between forwarding the exception and handling the exception. Imagine a scenario where we encrypt our pages and only decrypt them if they’re accessed from a region that’s created and managed by us, in other words, whitelisted. What we can do is only decrypt a page if the access is valid, although this is a little difficult without any parameters.
Lockdown: KiUserExceptionDispatcher
cmp qword [KiUserExceptionDispatcher], r10
je some_exception_handler
Once again, this is where it becomes a little tedious, since we cannot access the original parameters. Our scenario is: we encrypt pages, and a whitelisted page that is currently encrypted should only be decrypted if it’s called from another whitelisted page. We know that an exception was raised, and we only care about access violations, so we can check the rax register; if it holds an access violation code, we can perform another stack trace. If the call into the whitelisted page, or really anything within the stack trace, comes from outside a whitelisted page, we simply don’t let the exception be forwarded to the exception handler for decryption (or however we set up our decryption sequence). This will leave the page encrypted and the exception unhandled. Here’s an example of what the stack would look like (Top -> Bottom)
- Instrumentation Callback <—— us
- KiUserExceptionDispatcher <—— the callback
- Exception Handling Frame
- Encrypted page <—— where the exception was generated
- Malicious or non malicious page <—— where the exception was generated from
- ….
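Here’s a minimal sketch of that decision in C++, assuming the exception code and the caller addresses from the stack trace are already in hand. Action, on_exception and the whitelist layout are invented for illustration; 0xC0000005 is STATUS_ACCESS_VIOLATION.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Half-open address range [begin, end) of a whitelisted region (hypothetical).
struct Range {
    std::uintptr_t begin;
    std::uintptr_t end;
};

constexpr std::uint32_t kStatusAccessViolation = 0xC0000005u;

enum class Action { Ignore, ForwardForDecryption, LeaveEncrypted };

inline bool whitelisted(std::uintptr_t addr, const std::vector<Range>& wl) {
    for (const Range& r : wl) {
        if (addr >= r.begin && addr < r.end) return true;
    }
    return false;
}

// code    : the exception code read in the callback
// callers : addresses from the stack trace below the faulting (encrypted) page
inline Action on_exception(std::uint32_t code,
                           const std::vector<std::uintptr_t>& callers,
                           const std::vector<Range>& wl) {
    if (code != kStatusAccessViolation) return Action::Ignore;
    for (std::uintptr_t caller : callers) {
        // A single frame from outside the whitelist is enough to refuse decryption.
        if (!whitelisted(caller, wl)) return Action::LeaveEncrypted;
    }
    return Action::ForwardForDecryption;
}
```

LeaveEncrypted corresponds to not forwarding the exception at all, so the page stays encrypted and the exception goes unhandled.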
Instrumentation Callback: Wave
So it begins. Here’s a little debrief: Wave is a cheating tool used on Roblox. I found interest in it as it appears to be the largest paid cheating tool for Roblox at this moment in time. Although it’s detected and the developers don’t seem to care, I figured we might as well take a look at it. The instrumentation callback is quite large, so we’ll split it into chunks in order to see what exactly it’s doing and attempt to figure out what the actual core of the code execution derives from.
Section 1: Handling Callbacks
pushfq
push rdx
lea rdx, [some_structure]
cmp rcx, QWORD PTR [rdx+8]
cmove rcx, QWORD PTR [some_structure]
cmp r10, QWORD PTR [rdx+018h]
je skip_path
cmp r10, QWORD PTR [rdx+020h]
je skip_path
cmp r10, QWORD PTR [rdx+028h]
je skip_path
cmp r10, QWORD PTR [rdx+010h]
je handle_exception
In this section of assembly, we can see quite a few things happening. Just to note, this is the start of the callback, and it’s where we begin analysing the foreign structure that they seem to initialise before creating this callback. We can tell that rdx + 8 holds their instrumentation callback address. I analysed the next few offsets dynamically; here is what the structure looks like thus far.
; +0 -> roblox instrumentation callback address
; +8 -> wave instrumentation callback address
; +10 -> KiUserExceptionDispatcher address
; +18 -> LdrInitializeThunk address
; +20 -> KiUserApcDispatcher address
; +28 -> KiUserCallbackDispatcher address (?)
The code so far is pretty simple: it looks like anything that isn’t an exception is just ignored for now. This is typical, because we know that Roblox will strip any allocation that has the execution protection if it’s not whitelisted. Once an exception occurs (attempted execution in a non-whitelisted page), Wave’s handler will take over. What’s interesting to see is that LdrInitializeThunk and KiUserApcDispatcher are both ignored, which suggests that Wave is not creating their own thread or queuing an APC.
Another thing to note is these instructions:
lea rdx, [some_structure]
cmp rcx, QWORD PTR [rdx+8]
cmove rcx, QWORD PTR [some_structure]
This is actually interesting: it lets us realise that they install their own instrumentation callback and simply jmp to Roblox’s original callback once they’re done, which is where skip_path comes into play.
The reason they’re setting rcx to Roblox’s instrumentation callback is that, as noted earlier, RCX will contain the instrumentation callback address. Roblox (as shown later on) verifies whether the RCX register holds Roblox’s instrumentation callback or not, and these instructions in Wave help pass that check (if you could call it one).
The pseudocode representation of these instructions so far is something along the lines of this
void ic() {
    if (rcx == some_structure->wave_ic)
        rcx = some_structure->roblox_ic;
    /*
    ignored_functions_map = {
        some_structure->LdrInitializeThunk,
        some_structure->KiUserCallbackDispatcher,
        some_structure->KiUserApcDispatcher
    }
    */
    if (ignored_functions_map.find(r10) != ignored_functions_map.end())
        goto skip_path;
    else if (r10 == some_structure->KiUserExceptionDispatcher)
        goto handle_exception;
    //....
}
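For anyone who prefers something compilable, here’s a small C++ model of the structure and the routing so far. The field names, WaveState, Route and route are my own labels; only the offsets and the order of the comparisons come from the listing, and the +0x28 entry is still a guess.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Model of the first 0x30 bytes of the structure (field names are my labels).
struct WaveState {
    std::uint64_t roblox_ic;                     // +0x00
    std::uint64_t wave_ic;                       // +0x08
    std::uint64_t ki_user_exception_dispatcher;  // +0x10
    std::uint64_t ldr_initialize_thunk;          // +0x18
    std::uint64_t ki_user_apc_dispatcher;        // +0x20
    std::uint64_t ki_user_callback_dispatcher;   // +0x28 (unconfirmed)
};

static_assert(offsetof(WaveState, wave_ic) == 0x08, "offset mismatch");
static_assert(offsetof(WaveState, ki_user_exception_dispatcher) == 0x10, "offset mismatch");
static_assert(offsetof(WaveState, ldr_initialize_thunk) == 0x18, "offset mismatch");
static_assert(offsetof(WaveState, ki_user_apc_dispatcher) == 0x20, "offset mismatch");
static_assert(offsetof(WaveState, ki_user_callback_dispatcher) == 0x28, "offset mismatch");

enum class Route { SkipPath, HandleException, Continue };

// Mirrors the first block of the callback: patch RCX, then route on R10.
inline Route route(const WaveState& s, std::uint64_t& rcx, std::uint64_t r10) {
    if (rcx == s.wave_ic) rcx = s.roblox_ic;  // cmove rcx, [some_structure]
    if (r10 == s.ldr_initialize_thunk ||
        r10 == s.ki_user_apc_dispatcher ||
        r10 == s.ki_user_callback_dispatcher) {
        return Route::SkipPath;
    }
    if (r10 == s.ki_user_exception_dispatcher) return Route::HandleException;
    return Route::Continue;  // falls through to the rest of the callback
}
```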
Section 2: Initialisation
cmp eax, 0C000001Ch
je set_rax_to_zero
cmp eax, 0124h
je spoof_job
push rax
mov rax, rsp
mov rdx, gs:[030h]
mov rdx, [rdx+010h]
sub rax, 4D0h
and rax, -16
sub rax, 060h
cmp rax, rdx
jb skip_path_copy
mov rdx, gs:[030h]
mov edx, DWORD PTR [rdx+016B4h]
cmp edx, DWORD PTR [some_structure+0560h]
jne set_debug_registers
lea rdx, [some_structure]
xor rax, rax
mov eax, DWORD PTR [rdx+038h]
cmp eax, 1
jne skip_path_copy
xor rax, rax
lock xchg DWORD PTR [rdx+038h], eax
cmp eax, 1
jne skip_path_copy
pop rax
pop rdx
popfq
There’s a lot to unpack here, so let’s split it into even more chunks. The first chunk we’ll look at is the first two comparisons.
cmp eax, 0C000001Ch
je set_rax_to_zero
cmp eax, 0124h
je spoof_job
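Looks like we’re dealing with some NTSTATUS returns. The first value, 0xC000001C, is STATUS_INVALID_SYSTEM_SERVICE, and its branch sets the return of any such invalid call to zero (hence the name). The second value, 0x124, is STATUS_PROCESS_IN_JOB, and its branch appears to replace the value of eax to spoof the job status, presumably so the process doesn’t appear to be running inside a job. This is what makes me believe there is something important going on here.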
push rax
mov rax, rsp
mov rdx, gs:[030h]
mov rdx, [rdx+010h]
sub rax, 4D0h
and rax, -16
sub rax, 060h
cmp rax, rdx
jb skip_path_copy
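This appears to be a stack limit check against the stack pointer. It takes rsp, subtracts 0x4D0 (the size of an x64 CONTEXT), aligns the result down to 16 bytes, subtracts another 0x60, and compares that against the stack limit stored in the TEB (gs:[030h] + 010h). I’m not entirely sure what the extra 0x60 is for, but if the result falls below the limit, the callback takes skip_path_copy and ignores this call.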
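Written out in C++, the arithmetic in that chunk comes to the following. This is just my reading of the instructions: has_room_for_context and the constant names are invented, and 0x4D0 happens to match the size of a CONTEXT on x64.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kContextSize = 0x4D0;  // sub rax, 4D0h
constexpr std::uint64_t kExtraBytes  = 0x60;   // sub rax, 060h

// rsp_after_push is the stack pointer right after the `push rax` in the listing.
// Returns false when the callback would take the skip_path_copy branch.
inline bool has_room_for_context(std::uint64_t rsp_after_push,
                                 std::uint64_t stack_limit) {
    std::uint64_t p = rsp_after_push - kContextSize;
    p &= ~std::uint64_t{0xF};   // and rax, -16 (align down to 16 bytes)
    p -= kExtraBytes;
    return p >= stack_limit;    // jb skip_path_copy is an unsigned "below"
}
```

For example, with a stack limit of 0x10000, an rsp of 0x10530 passes exactly, while 0x1052F fails because of the alignment step.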
mov rdx, gs:[030h]
mov edx, DWORD PTR [rdx+016B4h]
cmp edx, DWORD PTR [some_structure+0560h]
jne set_debug_registers
This is an important part. It reads a per-thread value from the TEB (offset 016B4h, which I’ve been treating as the thread’s error mode) and compares it with a field in the self-made structure. A mismatch appears to be the indicator that the debug registers need to be set, and the branch is named accordingly. The new structure will be posted at the bottom.
lea rdx, [some_structure]
xor rax, rax
mov eax, DWORD PTR [rdx+038h]
cmp eax, 1
jne skip_path_copy
xor rax, rax
lock xchg DWORD PTR [rdx+038h], eax
cmp eax, 1
jne skip_path_copy
pop rax
pop rdx
popfq
This part is pretty straightforward: it’s a one-shot guard around the initialisation. The flag is read first and, unless it’s 1, we skip. It’s then atomically swapped with 0, and only the thread that gets 1 back carries on. Remember this is run on every single thread, so there is most definitely a scenario where several threads reach this point at once. xchg with a memory operand is always atomic (the lock prefix is implied), so exactly one thread can win the swap.
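Read literally, the listing only lets through the thread that swaps the flag from 1 to 0. Here’s a rough C++ illustration of that guard using std::atomic in place of the raw xchg; try_claim is a made-up name and the flag variable stands in for the DWORD at +038h.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Returns true for exactly one caller: the one that swaps the flag from 1 to 0.
inline bool try_claim(std::atomic<std::uint32_t>& flag) {
    if (flag.load() != 1) return false;  // mov eax, [rdx+038h] / cmp eax, 1 / jne
    return flag.exchange(0) == 1;        // lock xchg [rdx+038h], eax / cmp eax, 1 / jne
}
```

The plain load is only a cheap early-out; the exchange is what actually decides the winner when several threads arrive together.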
The updated structure appears as follows
; +0 -> roblox instrumentation callback address
; +8 -> wave instrumentation callback address
; +10 -> KiUserExceptionDispatcher address
; +18 -> LdrInitializeThunk address
; +20 -> KiUserApcDispatcher address
; +28 -> KiUserCallbackDispatcher address (?)
; +30 -> ?
; +38 -> initialised flag address
; +60 -> thread context
...
; +560 -> debug registers code to reset
The updated pseudocode is as follows
void ic() {
    if (rcx == some_structure->wave_ic)
        rcx = some_structure->roblox_ic;
    /*
    ignored_functions_map = {
        some_structure->LdrInitializeThunk,
        some_structure->KiUserCallbackDispatcher,
        some_structure->KiUserApcDispatcher
    }
    */
    if (ignored_functions_map.find(r10) != ignored_functions_map.end())
        goto skip_path;
    else if (r10 == some_structure->KiUserExceptionDispatcher)
        goto handle_exception;
    uint32_t eax = (uint32_t)rax;
    if (eax == 0xC000001C)
        rax = 0; // set_rax_to_zero
    else if (eax == 0x124)
        goto spoof_job;
    // is there room below rsp for 0x4D0 bytes (16-byte aligned) plus 0x60 more?
    if (((stack_pointer - 0x4D0) & ~0xFull) - 0x60 < thread_stack_limit)
        goto skip_path_copy;
    if (thread_error_mode != some_structure->reset_hwbp_code)
        goto set_debug_registers;
    // one-shot: only the thread that swaps the flag from 1 to 0 carries on
    if (some_structure->initialised != 1)
        goto skip_path_copy;
    if (atomic_exchange(&some_structure->initialised, 0) != 1) // lock xchg
        goto skip_path_copy;
}
Section 3: Code execution
sub rsp, 8
mov QWORD PTR [rsp], 0
mov rcx, rsp
mov rdx, qword ptr [some_structure+0530h]
mov r8, qword ptr [some_structure+0548h]
xor r9, r9
mov rax, qword ptr [some_structure+0538h]
sub rsp, 20h
call rax
add rsp, 20h
test eax, eax
js push_rsp_by_8
pop rcx
mov rax, qword ptr [some_structure+0540h]
sub rsp, 28h
call rax ; TpPostWork
add rsp, 28h
jmp restore_context
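Finally, something interesting. It appears we have two function calls. I traced them to see what the functions are, and they appear to be TpAllocWork and TpPostWork. Interesting, isn’t it? It makes sense given the earlier checks. Let’s break down the arguments and add them to the memory structure properly. One caveat: going by TpAllocWork’s usual prototype (out work pointer, callback, context, callback environment), rdx (+530) lands in the callback slot and r8 (+548) in the context slot, so treat my names for those two fields as tentative.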
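The call sequence itself can be modelled without touching ntdll. Below is a hedged C++ sketch where the two functions are plain function pointers; Status, WorkHandle, AllocWorkFn, PostWorkFn, WorkFields and queue_work are stand-ins I’ve invented, and only the argument order and the sign check on the result are taken from the listing.

```cpp
#include <cassert>
#include <cstdint>

using Status = std::int32_t;   // stands in for NTSTATUS
using WorkHandle = void*;      // stands in for the work object pointer

// Hypothetical function-pointer shapes mirroring how the listing passes arguments.
using AllocWorkFn = Status (*)(WorkHandle* out, void* arg2, void* arg3, void* arg4);
using PostWorkFn  = void (*)(WorkHandle work);

struct WorkFields {
    void*       field_530;  // +0x530, passed in rdx
    AllocWorkFn alloc;      // +0x538, first call target
    PostWorkFn  post;       // +0x540, second call target
    void*       field_548;  // +0x548, passed in r8
};

// Returns false when the allocation reports a negative status (the js branch).
inline bool queue_work(const WorkFields& f) {
    WorkHandle work = nullptr;  // mov QWORD PTR [rsp], 0
    Status st = f.alloc(&work, f.field_530, f.field_548, nullptr);  // xor r9, r9
    if (st < 0) return false;   // test eax, eax / js
    f.post(work);               // pop rcx / call rax
    return true;
}
```

Swapping the function pointers for the real exports is what turns this from a model into what the callback does, which is why those two addresses live in the structure.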
The updated structure appears as follows
; +0 -> roblox instrumentation callback address
; +8 -> wave instrumentation callback address
; +10 -> KiUserExceptionDispatcher address
; +18 -> LdrInitializeThunk address
; +20 -> KiUserApcDispatcher address
; +28 -> KiUserCallbackDispatcher address (?)
; +30 -> ?
; +38 -> initialised flag address
; +60 -> thread context
...
; +530 -> work thread address
; +538 -> TpAllocWork address
; +540 -> TpPostWork address
; +548 -> callback address (?)
; +560 -> debug registers code to reset
The updated pseudocode is as follows
void ic() {
    if (rcx == some_structure->wave_ic)
        rcx = some_structure->roblox_ic;
    /*
    ignored_functions_map = {
        some_structure->LdrInitializeThunk,
        some_structure->KiUserCallbackDispatcher,
        some_structure->KiUserApcDispatcher
    }
    */
    if (ignored_functions_map.find(r10) != ignored_functions_map.end())
        goto skip_path;
    else if (r10 == some_structure->KiUserExceptionDispatcher)
        goto handle_exception;
    uint32_t eax = (uint32_t)rax;
    if (eax == 0xC000001C)
        rax = 0; // set_rax_to_zero
    else if (eax == 0x124)
        goto spoof_job;
    // is there room below rsp for 0x4D0 bytes (16-byte aligned) plus 0x60 more?
    if (((stack_pointer - 0x4D0) & ~0xFull) - 0x60 < thread_stack_limit)
        goto skip_path_copy;
    if (thread_error_mode != some_structure->reset_hwbp_code)
        goto set_debug_registers;
    // one-shot: only the thread that swaps the flag from 1 to 0 carries on
    if (some_structure->initialised != 1)
        goto skip_path_copy;
    if (atomic_exchange(&some_structure->initialised, 0) != 1) // lock xchg
        goto skip_path_copy;
    PTP_WORK work = NULL;
    auto result = TpAllocWork(&work, some_structure->work_thread, some_structure->callback, 0);
    if (result < 0)
        return; // js push_rsp_by_8: add rsp by 8 and return
    TpPostWork(work);
    // jmp restore_context
}
Section 4: Page decryption
No assembly here, as it’s extremely trivial. After encountering an exception, Wave checks whether it’s an access violation or a single-step code. If it’s an access violation, it just calls Roblox’s default page decryptor function. If it’s a single step (a hardware breakpoint), it checks which index of the debug registers was triggered: if it’s 2, it sets rax to zero; if it’s 4, it calls a separate function which will tamper with query virtual memory in an attempt to hide their allocation by modifying its entry within the Hyperion blacklist. Index 8 will skip HWBP detections.
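Since there’s no listing for this part, here’s a small C++ sketch of the dispatch as I’ve described it. Step and dispatch are names I’ve made up, PassThrough is my own assumption for any case not covered above, and the two constants are the standard STATUS_ACCESS_VIOLATION and STATUS_SINGLE_STEP codes.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kAccessViolation = 0xC0000005u;  // STATUS_ACCESS_VIOLATION
constexpr std::uint32_t kSingleStep      = 0x80000004u;  // STATUS_SINGLE_STEP

enum class Step {
    CallRobloxDecryptor,  // access violation: hand over to the default page decryptor
    ZeroRax,              // breakpoint value 2
    HideAllocation,       // breakpoint value 4
    SkipHwbpDetection,    // breakpoint value 8
    PassThrough           // anything else (assumption, not taken from the binary)
};

// code: exception code; hit: the debug register value Wave switches on.
inline Step dispatch(std::uint32_t code, std::uint32_t hit) {
    if (code == kAccessViolation) return Step::CallRobloxDecryptor;
    if (code != kSingleStep) return Step::PassThrough;
    switch (hit) {
        case 2:  return Step::ZeroRax;
        case 4:  return Step::HideAllocation;
        case 8:  return Step::SkipHwbpDetection;
        default: return Step::PassThrough;
    }
}
```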
Section 5: Recap
Well, that’s pretty much all there is for Wave. It seems they’re just using cheap tricks to get by and, honestly, it’s somewhat impressive but still disappointing that it’s almost a duct-taped method; cheap tricks never last. We’ve now analysed exactly how they keep their allocation “safe”(?) and how they’re executing code.
Instrumentation Callback: Hyperion
Onto Hyperion. Hyperion has a really strong, well-written and smart implementation of an instrumentation callback. It only appears to target three callbacks, and all three are ones already discussed in this blog; I didn’t just ramble about those callbacks for no reason. I apologise ahead of time that this area will be a little bare compared with Wave; this is because of the heavy obfuscation within Hyperion. Luckily, I’ve already annotated the main premise of the instrumentation callback, so let’s check it out.
Section 1: The Instrumentation Callback
; preserve_registers
push r10 ; syscall origin
push rax ; syscall return
pushfq
push rbx
mov rbx,rsp ; store original stack pointer
; check if rcx contains address of ic (it always should (?))
lea rax,[RobloxPlayerBeta.dll+C4A0A0] { (-1672457663) } ; set rax to start of IC
cmp rcx,rax ; Check if shellcode is the hardcoded start of what it should be ?
cmove rcx,r10 ; set rcx to the syscall origin (where it was invoked)
; Allocate 0xC0 bytes onto stack & align to 16 byte boundary
lea r10,[rsp-000000C0]
and r10,-10 { 240 }
mov rsp,r10
cld
; Store registers onto allocated memory (some sort of stack struct)
mov [rsp+000000B0],rcx
mov [rsp+000000A8],rdx
mov [rsp+000000A0],r8
mov [rsp+00000098],r9
mov [rsp+00000090],r11
movaps [rsp+00000080],xmm0
movaps [rsp+70],xmm1
movaps [rsp+60],xmm2
movaps [rsp+50],xmm3
movaps [rsp+40],xmm4
movaps [rsp+30],xmm5
mov rcx,[rbx+18] ; rcx = rip
; rcx = first param, don't see anything regarding a rdx for the second param
call RobloxPlayerBeta.dll+14AF110 ; from what i can see, this will verify the RIP
; set rip to new return from call & restore all registers
mov [rbx+18],rax
mov rcx,[rsp+000000B0]
mov rdx,[rsp+000000A8]
mov r8,[rsp+000000A0]
mov r9,[rsp+00000098]
mov r11,[rsp+00000090]
movaps xmm0,[rsp+00000080]
movaps xmm1,[rsp+70]
movaps xmm2,[rsp+60]
movaps xmm3,[rsp+50]
movaps xmm4,[rsp+40]
movaps xmm5,[rsp+30]
mov rsp,rbx
pop rbx
popfq
pop rax
pop r10
jmp r10 ; return to code (modified or original, depending on verifier call)
Seems pretty straightforward and well written compared to the previous one; it’s actually interesting to see. The only parameter that’s passed to the function is the RIP (the saved r10, read back from [rbx+18]), and that function is where all the main handlers and checks are.
Section 2: The Verifier
push rbp
push r15
push r14
push r13
push r12
push rsi
push rdi
push rbx
sub rsp, 0x5f8
lea rbp, [rsp+0x80]
mov qword [rbp+0x570], 0xfffffffffffffffe
mov rsi, rcx ; rsi = syscall origin
mov r8, qword [gs:0x30]
movzx eax, byte [r8+0x2ec]
test al, 0x1
jne InstrumentationDisabled
mov rdi, qword [r8 {_TEB::NtTib.ExceptionList}]
cmp qword [rel KiUserExceptionDispatcher], rsi
je Return$sub_7ffd2af6a060
cmp qword [rel LdrInitializeThunk], rsi
je Return$sub_7ffd2af6a180
cmp qword [rel KiUserApcDispatcher], rsi
je Return$sub_7ffd2af6a170
jmp Return$DisableIC
As we can see, there are three handlers set up, one for each of those three important callbacks. It’s here that they perform all the sanity checks on what should and should not happen. Unfortunately, I cannot show in detail what type of sanity checks they do, but they are very thorough, especially for the creation of threads. Each of those handlers is reached through what appears to be a hooked tail call.
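The top of the verifier boils down to a small dispatch, which can be sketched in C++ like so. KnownEntries, Handler and pick_handler are hypothetical names; the byte test and the three comparisons follow the listing, and what each handler does afterwards is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>

// Addresses the verifier compares the syscall origin against (resolved elsewhere).
struct KnownEntries {
    std::uint64_t ki_user_exception_dispatcher;
    std::uint64_t ldr_initialize_thunk;
    std::uint64_t ki_user_apc_dispatcher;
};

enum class Handler { InstrumentationDisabled, Exception, ThreadInit, Apc, DisableIC };

// origin        : the value passed in rcx (the saved r10)
// disabled_byte : the byte read from the TEB at offset 0x2ec
inline Handler pick_handler(std::uint64_t origin, std::uint8_t disabled_byte,
                            const KnownEntries& k) {
    if (disabled_byte & 0x1) return Handler::InstrumentationDisabled;  // test al, 0x1 / jne
    if (origin == k.ki_user_exception_dispatcher) return Handler::Exception;
    if (origin == k.ldr_initialize_thunk) return Handler::ThreadInit;
    if (origin == k.ki_user_apc_dispatcher) return Handler::Apc;
    return Handler::DisableIC;  // jmp Return$DisableIC
}
```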
Section 3: Post-verifier
After the verification is finished, a modified (or original) r10 is returned, and that value decides whether the current path of execution continues or not.
Section 4: Recap
So, it appears that Hyperion’s implementation of an instrumentation callback is strictly for preventing code execution, as well as handling any exceptions raised in whatever allocation there may be.
Conclusion
Thank you very much for reading. I mean no disrespect to any of the parties involved in this blog. I’m a very curious person and decided to try and figure something new out. I do apologise for any incorrect information that may be present here; everything I’ve written is hypothetical in a sense. Please reach out to me if you’ve found any issues within this blog or want to correct a mistake.