Notes

Do not read codes but patch binary.

Dig into LdrInitializeThunk

In my previous post, I published a piece of code that was supposed to perform DLL injection at the stage where kernel32.dll has not been mapped yet. It appeared to work, but in fact it missed a point.

The point is that kernel32.dll is loaded into a process whose subsystem is Windows GUI/CUI even when neither the executable nor any DLL it depends on imports kernel32.dll. This is true at least on Windows 10 17763.737.

The natural question is when and how it gets loaded. To answer it properly, I need to cover a little of the overall architecture by which every process starts off in user land, which happens at the last stage of process creation (I am referring to Windows Internals 6th edition, Chapter 5, Flow of CreateProcess, Stage 7).


1. First execution on user land of a process

When you have just started exploring this area, you might think the first code to run is the main/wmain that the application defines. After learning more, you know that a set of runtime routines (often called the CRT) runs before main, just as libc does, depending on linker options.
But strictly speaking, these are not the first footprint in user land. At the last stage of process creation, the kernel queues an APC (asynchronous procedure call) to the launching thread with KeUserApc. This APC starts at the address of one of ntdll's exported functions: LdrInitializeThunk. If you often use windbg, you already know it indirectly, because the first breakpoint is automatically set in one of its callers (PspCreateInitialize) when the debugger creates a process. To confirm its presence in another way, set a TLS callback, as in this simple example.

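#include <windows.h>

// Keep the TLS directory in the image (x64 symbol name; x86 uses __tls_used).
#pragma comment(linker, "/INCLUDE:_tls_used")

void NTAPI TlsCallBack(PVOID h, DWORD dwReason, PVOID pv);

// Put the callback pointer into the CRT's TLS callback array.
#pragma data_seg(".CRT$XLB")
PIMAGE_TLS_CALLBACK p_thread_callback = TlsCallBack;
#pragma data_seg()

void NTAPI TlsCallBack(PVOID h, DWORD dwReason, PVOID pv)
{
    // Runs before main; dwReason tells process and thread notifications apart.
    MessageBoxA(NULL, "In TLS", "In TLS", MB_OK);
}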

This makes the MS linker place the function address in the callback array of the TLS directory (IMAGE_TLS_DIRECTORY), which is pointed to by one of the data directory entries (IMAGE_DIRECTORY_ENTRY_TLS) in the PE header.

When you run it, you will find that this TLS callback is called multiple times before main is executed (once for process creation, the others for thread creation, as indicated by the 2nd parameter dwReason).
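
As a rough illustration of the dwReason values mentioned above, the sketch below tallies notifications by reason. The numeric values are those of the DLL_* constants in winnt.h; the names tls_tally, tls_tally_add and tls_reason_name are made up for this example, and the code is portable C rather than the callback itself.

```c
#include <assert.h>
#include <stddef.h>

/* Reason codes passed as the 2nd callback parameter (same values as winnt.h). */
enum {
    REASON_PROCESS_DETACH = 0,
    REASON_PROCESS_ATTACH = 1,
    REASON_THREAD_ATTACH  = 2,
    REASON_THREAD_DETACH  = 3
};

/* Hypothetical tally of how often each notification was seen. */
typedef struct {
    unsigned process_attach;
    unsigned thread_attach;
    unsigned thread_detach;
    unsigned process_detach;
} tls_tally;

/* Hypothetical helper: printable name for a reason code. */
const char *tls_reason_name(unsigned long reason)
{
    switch (reason) {
    case REASON_PROCESS_DETACH: return "DLL_PROCESS_DETACH";
    case REASON_PROCESS_ATTACH: return "DLL_PROCESS_ATTACH";
    case REASON_THREAD_ATTACH:  return "DLL_THREAD_ATTACH";
    case REASON_THREAD_DETACH:  return "DLL_THREAD_DETACH";
    default:                    return "unknown";
    }
}

/* Count one notification; unknown reason codes are ignored. */
void tls_tally_add(tls_tally *t, unsigned long reason)
{
    assert(t != NULL);
    switch (reason) {
    case REASON_PROCESS_ATTACH: t->process_attach++; break;
    case REASON_THREAD_ATTACH:  t->thread_attach++;  break;
    case REASON_THREAD_DETACH:  t->thread_detach++;  break;
    case REASON_PROCESS_DETACH: t->process_detach++; break;
    default: break;
    }
}
```

Inside a real callback you would call something like tls_tally_add with dwReason instead of showing a message box, then inspect the tally once main is reached.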

Where exactly are these called from? If you start this process under a debugger, the initial automatic breakpoint comes before the first execution of the TLS callback. Then press p (step over) repeatedly (20-30 times?) and the TLS callback for the process creation notification will be called at some point.
This happens while stepping over LdrpTlsInitialize (the stack is LdrInitializeThunk/LdrpInitialize/_LdrpInitialize/LdrpInitializeProcess/).
The rest of the calls fire after ZwTestAlert (the stack is LdrInitializeThunk/LdrpInitialize/_LdrpInitialize/LdrpInitializeProcess/), after a few dozen more step-overs.

Not only the initial breakpoint of a debugger and TLS callback execution, but a lot of other crucial work happens in this LdrpInitializeProcess.

The following are examples.

  1. Kernel32.dll loading
  2. SysWow64 configuration
  3. .Net metadata configuration
  4. Shim engine initialisation
  5. Statically mapped library loading

Each of them is substantial enough to deserve an article of its own, so I cannot cover them all in detail now.

What I want to emphasise here is that this code is executed as an APC not only for process initialisation but also for thread initialisation.

Roughly, the call graph is as follows.

* LdrInitializeThunk
 * LdrpInitialize
  * _LdrpInitialize
    * LdrpInitializeProcess
      * ..
  * LdrpInitializeThread
    * ..
  * TestAlert
 * NtContinue

You may have noticed that the TLS callback above was called multiple times. This is because on Windows 10 DLL loading (perhaps mainly import resolution) is done by multiple threads, and no matter when and which threads are launched, each of them passes through this APC routine.

Once you realise that examining this APC routine, especially for process creation, may be worthwhile, it is useful to know how to analyse the initial process routine dynamically, because a debugger like windbg steps in partway through rather than at its very first instruction. In other words, you can then deal with the case where malware tampers with the routine and prevents the initial debugger attach.


2. Abusing APC routine

If you start a process with the ordinary CreateProcess and pass CREATE_SUSPENDED in its creation flags, the kernel stops before launching the thread, and luckily that is before its APC starts.

You can confirm this: starting the process that has the TLS callback with the CREATE_SUSPENDED flag does not execute the TLS callback until the thread is resumed or another thread is run.

From another point of view, rewriting the flow of this APC could be a very strong defence evasion technique for an attacker. You might be able to achieve functionality similar to heaven's gate without relying on it, for instance.

Now, stepping back to the start of this article, my question is whether I could have a process without kernel32.dll by rewriting this APC routine before the process starts running. That requires me to become a little more familiar with it.

But for now, let us start with something much simpler.

To confirm that this routine is called and can be hooked, set a small detour at the head of LdrInitializeThunk.

First, check the text area of LdrInitializeThunk in a child process while keeping its main thread suspended.

STARTUPINFOA info = {sizeof(info)};
PROCESS_INFORMATION processInfo;
char cmd[] = "notepad.exe";

if (CreateProcessA(NULL, cmd, NULL, NULL, TRUE, CREATE_SUSPENDED, NULL, NULL, &info, &processInfo)) {
    HMODULE moduleNtDll = GetModuleHandleA("ntdll");
    uint8_t* ldrInitThunk = (uint8_t*)GetProcAddress(moduleNtDll, "LdrInitializeThunk");

    // Make the function head and the 4 bytes above it writable for later patching.
    DWORD old = 0;
    VirtualProtectEx(processInfo.hProcess, ldrInitThunk - 4, 6, PAGE_EXECUTE_READWRITE, &old);

    // Read the first 2 bytes of LdrInitializeThunk from the child.
    uint16_t dd = 0;
    if (ReadProcessMemory(processInfo.hProcess, ldrInitThunk, &dd, 2, NULL))
        printf("LdrInitializeThunk:(address)%p,(value)%x\n", (void*)ldrInitThunk, dd);
}

This code assumes that ntdll is mapped at the same virtual address in the child process as in the parent. The first 2 bytes of LdrInitializeThunk should be 0x40 0x53 (push rbx); read as a little-endian 16-bit value they print as 5340.
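
If you prefer not to rely on that assumption, one option is to carry the function's offset from the module base (its RVA) over to the child's base. The sketch below only shows the arithmetic; rebase_address is a hypothetical helper, and how you obtain the child's ntdll base is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: translate an address inside a module loaded at
 * local_base into the same module loaded at remote_base. */
uint64_t rebase_address(uint64_t local_base, uint64_t local_addr,
                        uint64_t remote_base)
{
    /* The address must lie at or above the local module base. */
    assert(local_addr >= local_base);
    return remote_base + (local_addr - local_base);
}
```

When both bases are equal, as assumed in this article, the function simply returns local_addr unchanged.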

Then add a tiny detour on the way:

uint8_t dd[] = {0xeb,0xfa};            // jmp -6: lands 4 bytes above LdrInitializeThunk
uint8_t ddd[] = {0x40,0x53,0xeb,0x02}; // push rbx; jmp +2: back to LdrInitializeThunk + 2

DWORD old = 0;
VirtualProtectEx(processInfo.hProcess, (uint8_t*)ldrInitThunk - 4, 6, PAGE_EXECUTE_READWRITE, &old);
WriteProcessMemory(processInfo.hProcess, ldrInitThunk, dd, 2, NULL);
WriteProcessMemory(processInfo.hProcess, (uint8_t*)ldrInitThunk - 4, ddd, 4, NULL);

In this ntdll, a few bytes are left as padding just above LdrInitializeThunk. 4 of those bytes are used as the jump target. The code now reads:

    push rbx
    jmp +0x02        ; back to LdrInitializeThunk + 2
LdrInitializeThunk:
    jmp -0x06        ; 0xeb 0xfa, lands 4 bytes above LdrInitializeThunk
    ....

After that, you can confirm that

ResumeThread(processInfo.hThread);

or CreateRemoteThread still works.
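
The two displacement bytes used in the detour above (0xfa and 0x02) follow from how a short jump is encoded: the rel8 operand is counted from the end of the 2-byte instruction. A minimal sketch of that calculation, with offsets taken relative to LdrInitializeThunk (short_jmp_rel8 is a name made up for this example):

```c
#include <assert.h>
#include <stdint.h>

/* Displacement byte for a 2-byte short jump (0xEB rel8) placed at
 * jmp_offset that should land at target_offset. Both offsets are
 * relative to the same reference point, e.g. LdrInitializeThunk. */
uint8_t short_jmp_rel8(int jmp_offset, int target_offset)
{
    /* rel8 is relative to the address of the next instruction. */
    int rel = target_offset - (jmp_offset + 2);
    assert(rel >= -128 && rel <= 127); /* must fit in a signed byte */
    return (uint8_t)rel;
}
```

For the jump at LdrInitializeThunk (offset 0) going to the detour at offset -4 this gives 0xfa, and for the jump at offset -2 going back to offset +2 it gives 0x02.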


Now that you have made sure the APC routine is called, let us add a counter that is incremented each time it is called.

uint8_t ddd[] = {
       0x40,0x50, // push rax
       0x65,0x48,0x8b,0x04,0x25,0x60,0x00,0x00,0x00, // mov rax, gs:[0x60] (PEB)
       0x80,0x40,0x04,0x01,  // add byte ptr [rax+0x04], 1
       0x40,0x58, // pop rax
       0x40,0x53, // push rbx
       0xeb,0x02  // jmp rip + 0x02 (back to LdrInitializeThunk + 2)
};
// jump to the head of the detour; rel8 counts from the end of this 2-byte jmp
uint8_t dd[] = {0xeb, (uint8_t)(0x100 - sizeof(ddd) - 2)};

DWORD old = 0;
VirtualProtectEx(processInfo.hProcess, (uint8_t*)ldrInitThunk - sizeof(ddd), sizeof(ddd) + sizeof(dd), PAGE_EXECUTE_READWRITE, &old);
WriteProcessMemory(processInfo.hProcess, ldrInitThunk, dd, sizeof(dd), NULL);
WriteProcessMemory(processInfo.hProcess, (uint8_t*)ldrInitThunk - sizeof(ddd), ddd, sizeof(ddd), NULL);

This assumes that both the parent and the child process are x64. In the x64 PEB there are 4 bytes of padding at offset 0x4, right after its first 4 bytes. This code uses the first of those bytes to record how many times the APC routine was called.
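
To make the offsets concrete, here is a small portable model of the first 16 bytes of the x64 PEB and of what the injected add instruction does to it. The field names follow publicly available symbol listings and should be treated as an assumption; peb_head64, peb_count_apc and apc_counter_from_peb_bytes are names invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the head of the x64 PEB (field names are an assumption). */
typedef struct {
    uint8_t  InheritedAddressSpace;    /* +0x00 */
    uint8_t  ReadImageFileExecOptions; /* +0x01 */
    uint8_t  BeingDebugged;            /* +0x02 */
    uint8_t  BitField;                 /* +0x03 */
    uint8_t  Padding0[4];              /* +0x04, used as the counter */
    uint64_t Mutant;                   /* +0x08 */
} peb_head64;

/* Mirrors "add byte ptr [rax+0x04], 1": a one-byte counter,
 * so it wraps around to 0 after 255. */
void peb_count_apc(peb_head64 *peb)
{
    assert(peb != NULL);
    peb->Padding0[0] = (uint8_t)(peb->Padding0[0] + 1);
}

/* Extract the counter from the first 8 bytes read out of the child's PEB. */
uint8_t apc_counter_from_peb_bytes(const uint8_t *first8)
{
    assert(first8 != NULL);
    return first8[4];
}
```

Because the counter is a single byte, it is only meaningful for fewer than 256 passes through the routine.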

After launching a thread with CreateRemoteThread, check the counter with

// Needs <winternl.h> for PROCESS_BASIC_INFORMATION and NtQueryInformationProcess.
PROCESS_BASIC_INFORMATION pbi = {0};
ULONG retLen = 0;
NtQueryInformationProcess(processInfo.hProcess, ProcessBasicInformation, &pbi, sizeof(pbi), &retLen);
uint8_t head[8] = {0}; // first 8 bytes of the child's PEB
ReadProcessMemory(processInfo.hProcess, pbi.PebBaseAddress, head, sizeof(head), NULL);
printf("PEB padding byte %p,%x\n", (void*)pbi.PebBaseAddress, head[4]);

(Link statically against ntdll to run this code.) In my environment, it reports that a thread created by CreateRemoteThread went through this APC routine 7 times.

I will leave it to you to investigate why it was called 7 times, and to dig further into this APC routine for a better understanding.


In the previous example, you can observe that kernel32.dll is loaded when you run CreateRemoteThread while the main thread stays suspended. This means that no matter where a thread comes from, the first thread to run in user land is in charge of process initialisation, which consequently loads kernel32.dll.

When you dig a little more into where exactly kernel32.dll is loaded, you find a static variable named Kernel32InitThreadFunction in ntdll. It holds the address of an exported function of kernel32.dll, Kernel32InitThreadFunction. In fact, this address is jumped to from RtlUserThreadStart, which is the entry point of every thread. Do not confuse it with the thread start address you specify when you call NtCreateThreadEx: your value is passed as an argument to RtlUserThreadStart and is called from Kernel32InitThreadFunction. Anyway, if you do not want to map kernel32.dll, your executable image must be modified to meet 2 requirements.

  1. Do not call the APC routine that loads kernel32.dll.
  2. Do not call Kernel32InitThreadFunction, since kernel32.dll has not been loaded.

To meet 1, you can boldly skip the whole of _LdrpInitialize in LdrInitializeThunk and go directly to the main thread through ZwContinue. To meet 2, just rewrite the binary of RtlUserThreadStart.

The shellcode is shown next.

uint8_t ddd[] = {

       0x40,0x50, // push rax
       0x65,0x48,0x8b,0x04,0x25,0x60,0x00,0x00,0x00, // mov rax, gs:[0x60] (PEB)
       0x80,0x40,0x04,0x01,  // add byte ptr [rax+0x04], 1
       0x40,0x58, // pop rax
       0x40,0x53, // push rbx
+     0xb2,0x01, // mov dl, 1 (added!!)
-     0xeb,0x02  // jmp rip + 0x02
+     0xeb,0x13  // jmp rip + 0x13 (modified!!)
};

With this you jump directly to the call to ZwContinue, with its 2nd argument set to 1 (as in the original code), without stepping into LdrpInitialize.

The kernel is then notified that the APC routine has finished, and starts the thread from RtlUserThreadStart.

uint8_t dddd[] = {

       0x65,0x48,0x8b,0x04,0x25,0x60,0x00,0x00,0x00, // mov rax, gs:[0x60] (PEB)
       0x80,0x40,0x04,0x01,  // add byte ptr [rax+0x04], 1
       0xb9,0x00,0x00,0x00,0x00, // mov ecx, 0 (thread handle, which can be 0)
       0xba,0x00,0x00,0x00,0x00, // mov edx, 0 (exit status)
       0x4c,0x8b,0xd1, // mov r10, rcx
       0xb8,0x53,0x00,0x00,0x00, // mov eax, 0x53 (syscall number of NtTerminateThread)
       0x0f,0x05, // syscall
       0xc3       // ret
};

void* rtlUserThreadStart = GetProcAddress(moduleNtDll, "RtlUserThreadStart");
DWORD old = 0;
VirtualProtectEx(processInfo.hProcess, rtlUserThreadStart, sizeof(dddd), PAGE_EXECUTE_READWRITE, &old);
WriteProcessMemory(processInfo.hProcess, rtlUserThreadStart, dddd, sizeof(dddd), NULL);

What this does is increment the value at PEB + 0x4, which is unused, and call NtTerminateThread. If you check the behaviour of the child process with Process Monitor or something equivalent, you will see that it does not map kernel32.dll.
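
The system call part of that patch can also be generated instead of typed by hand, which makes the instruction lengths harder to get wrong. The sketch below is portable C that only fills a buffer; build_syscall_stub is a hypothetical helper, and the syscall number is a parameter because 0x53 is the value used in this article's environment and is not guaranteed on other Windows builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: emit
 *   mov ecx, 0 / mov edx, 0 / mov r10, rcx / mov eax, nr / syscall / ret
 * into out and return the number of bytes written. */
size_t build_syscall_stub(uint8_t *out, size_t cap, uint32_t syscall_nr)
{
    static const uint8_t tmpl[] = {
        0xb9, 0x00, 0x00, 0x00, 0x00, /* mov ecx, 0 (1st argument)  */
        0xba, 0x00, 0x00, 0x00, 0x00, /* mov edx, 0 (2nd argument)  */
        0x4c, 0x8b, 0xd1,             /* mov r10, rcx               */
        0xb8, 0x00, 0x00, 0x00, 0x00, /* mov eax, imm32 (patched)   */
        0x0f, 0x05,                   /* syscall                    */
        0xc3                          /* ret                        */
    };
    const size_t imm_off = 14; /* offset of the imm32 of "mov eax" */

    assert(out != NULL && cap >= sizeof(tmpl));
    memcpy(out, tmpl, sizeof(tmpl));

    /* Store the syscall number as a little-endian 32-bit immediate. */
    out[imm_off + 0] = (uint8_t)(syscall_nr & 0xff);
    out[imm_off + 1] = (uint8_t)((syscall_nr >> 8) & 0xff);
    out[imm_off + 2] = (uint8_t)((syscall_nr >> 16) & 0xff);
    out[imm_off + 3] = (uint8_t)((syscall_nr >> 24) & 0xff);
    return sizeof(tmpl);
}
```

Calling it with 0x53 produces 21 bytes that would follow the PEB counter instructions at the start of the patch.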


Summary

There are many cases in which a process is created with its main thread suspended and its memory is rewritten from the parent. At this stage, only the executable image and ntdll are mapped. Replacing the contents of the PE image is often called PE injection. But you should also pay close attention to the replacement of ntdll sections, as it can potentially bypass anti-virus DLLs and alter how the executable behaves.