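I: Background
1. The story
A few days ago a friend came to me saying that his program crashes every time it is closed. He had not been able to find the cause and asked me to take a look. I believe this is the second time he has come to me. After analyzing the dump I found it to be a fairly classic case, so I am sharing it here.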
II: WinDbg analysis
1. Why does it crash
Finding the crash reason is fairly simple: run the !analyze -v command and read the result.
0:040> !analyze -v
CONTEXT:  (.ecxr)
eax=0afdf5dc ebx=0698ade8 ecx=00000001 edx=00000000 esi=0698ade8 edi=7eec0000
eip=7753c5af esp=0afdf5dc ebp=0afdf62c iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
KERNELBASE!RaiseException+0x58:
7753c5af c9              leave
Resetting default scope

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 7753c5af (KERNELBASE!RaiseException+0x00000058)
   ExceptionCode: c0020001
  ExceptionFlags: 00000001
NumberParameters: 1
   Parameter[0]: 8007042b

PROCESS_NAME:  xxx.exe
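According to the output, the crash code is c0020001. A lookup in the error code table says it means "the string binding is invalid":
[Screenshot: error code lookup for c0020001]
That is rather baffling for a crash on exit, so next let's look at the thread stack.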
0:040> .ecxr
eax=0afdf5dc ebx=0698ade8 ecx=00000001 edx=00000000 esi=0698ade8 edi=7eec0000
eip=7753c5af esp=0afdf5dc ebp=0afdf62c iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
KERNELBASE!RaiseException+0x58:
7753c5af c9              leave
0:040> k
  *** Stack trace for last set context - .thread/.cxr resets it
 # ChildEBP RetAddr
00 0afdf62c 70e75e0b KERNELBASE!RaiseException+0x58
01 0afdf648 70f63bf5 clr!COMPlusThrowBoot+0x1a
02 0afdf654 70b6f1da clr!UMThunkStubRareDisableWorker+0x25
03 0afdf67c 77a9571e clr!UMThunkStubRareDisable+0x9
04 0afdf6bc 77a80f0b ntdll!RtlpTpTimerCallback+0x7a
05 0afdf6e0 77a809b1 ntdll!TppTimerpExecuteCallback+0x10f
06 0afdf830 75c4344d ntdll!TppWorkerThread+0x562
07 0afdf83c 77a69802 kernel32!BaseThreadInitThunk+0xe
08 0afdf87c 77a697d5 ntdll!__RtlUserThreadStart+0x70
09 0afdf894 00000000 ntdll!_RtlUserThreadStart+0x1b
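Judging from the thread stack, this is a timer callback running on the Windows thread pool. When the callback entered clr, the runtime deliberately raised an exception.
2. Why the exception is raised on purpose
Answering this requires reading the clr source code. Simplified, it looks like this: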
// Disable from a place that is calling into managed code via a UMEntryThunk.
extern "C" VOID __stdcall UMThunkStubRareDisableWorker(Thread * pThread, UMEntryThunk * pUMEntryThunk, Frame * pFrame)
{
    // Check for ShutDown scenario. This happens only when we have initiated shutdown
    // and someone is trying to call in after the CLR is suspended. In that case, we
    // must either raise an unmanaged exception or return an HRESULT, depending on the
    // expectations of our caller.
    if (!CanRunManagedCode())
    {
        pThread->m_fPreemptiveGCDisabled = 0;
        COMPlusThrowBoot(E_PROCESS_SHUTDOWN_REENTRY);
    }
}

BOOL CanRunManagedCode(BOOL fCannotRunIsUserError, HINSTANCE hInst)
{
    // If we are shutting down the runtime, then we cannot run code.
    if (g_fForbidEnterEE == TRUE)
        return FALSE;

    // If we are finaling live objects or processing ExitProcess event,
    // we can not allow managed method to run unless the current thread
    // is the finalizer thread
    if ((g_fEEShutDown & ShutDown_Finalize2) && !GCHeap::GetGCHeap()->IsCurrentThreadFinalizer())
        return FALSE;

    // If pre-loaded objects are not present, then no way.
    if (g_pPreallocatedOutOfMemoryException == NULL)
        return FALSE;

    return TRUE;
}
According to the source above, the exception should be the result of CanRunManagedCode() returning false. But does it really return false? We can check the g_fForbidEnterEE variable in WinDbg.
0:040> dp clr!g_fForbidEnterEE L1
712a2684  00000001
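Sure enough, the variable is true, which means the CLR is in its shutdown state. Presumably the main thread called the Exit method, which is easy to confirm in WinDbg by looking at the stack of thread 0.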
0:000> k
00 0028d3b0 77549cd4 ntdll!NtQueryAttributesFile+0x12
01 0028d3b0 70bf560b KERNELBASE!GetFileAttributesW+0x71
02 0028d3c8 710602a5 clr!CheckFileExistence+0x1a
...
39 0028ebc0 70d2684b clr!WaitForEndOfShutdown_OneIteration+0x81
3a 0028ebc8 70d300e2 clr!WaitForEndOfShutdown+0x1b
3b 0028ec08 70d1329e clr!EEShutDown+0xad
3c 0028ec14 70d132fb clr!HandleExitProcessHelper+0x4d
3d 0028ec70 70d2ff99 clr!EEPolicy::HandleExitProcess+0x50
3e 0028ec70 7115af3b clr!ForceEEShutdown+0x31
3f 0028ec70 702a9faf clr!SystemNative::Exit+0x4f
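The two stacks together describe a race: thread 0 has started shutdown and set the flag, while thread 40 is a native timer callback that still tries to enter managed code. The following is a minimal C++ model of that gate, not CLR code. The names g_forbid_enter, enter_from_native and begin_shutdown are hypothetical stand-ins for g_fForbidEnterEE, the UMThunk stub and the Exit path, and a C++ exception stands in for the RaiseException call that brings down the real process.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for clr!g_fForbidEnterEE.
static std::atomic<bool> g_forbid_enter{false};

// The value the dump shows as Parameter[0] of the raised exception.
constexpr std::uint32_t kShutdownReentry = 0x8007042Bu;

// Stand-in for the unmanaged exception raised by COMPlusThrowBoot.
struct ShutdownReentry : std::runtime_error {
    std::uint32_t hr;
    explicit ShutdownReentry(std::uint32_t h)
        : std::runtime_error("call into the runtime after shutdown began"), hr(h) {}
};

// Models only the first check in CanRunManagedCode().
inline bool can_run_managed_code() { return !g_forbid_enter.load(); }

// Models the native-to-managed thunk: the gate is checked before the target runs.
template <class Fn>
void enter_from_native(Fn&& managed_target) {
    if (!can_run_managed_code()) {
        throw ShutdownReentry(kShutdownReentry);
    }
    managed_target();
}

// Models the exit path flipping the flag.
inline void begin_shutdown() { g_forbid_enter.store(true); }
```

In this model a callback that arrives before begin_shutdown() runs normally, and one that arrives afterwards never reaches its target. In the real process nobody catches the exception on the thread-pool thread, which is why the process crashes instead of exiting cleanly.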
Next, let's find out which managed method the callback was trying to enter. The answer is in the UMEntryThunk.m_pManagedTarget field; the relevant source is:
class UMEntryThunk
{
private:
    // The start of the managed code
    const BYTE* m_pManagedTarget;

    // This is used for profiling.
    PTR_MethodDesc m_pMD;
};
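With this background, digging the method out in WinDbg is straightforward: take the second argument of UMThunkStubRareDisableWorker (the UMEntryThunk pointer), dump its first two fields, and resolve the target address.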
0:040> kb 5
 # ChildEBP RetAddr  Args to Child
00 0afdf62c 70e75e0b c0020001 00000001 00000001 KERNELBASE!RaiseException+0x58
01 0afdf648 70f63bf5 006e0fe0 0afdf67c 70b6f1da clr!COMPlusThrowBoot+0x1a
02 0afdf654 70b6f1da 0698ade8 00580a38 0698ade8 clr!UMThunkStubRareDisableWorker+0x25
03 0afdf67c 77a9571e 00000000 00000001 7d723ac9 clr!UMThunkStubRareDisable+0x9
04 0afdf6bc 77a80f0b 0afdf71c 006e0fe0 006f6c10 ntdll!RtlpTpTimerCallback+0x7a
0:040> dp 00580a38 L2
00580a38  00386580 008f2eb8
0:040> !U 00386580
Unmanaged code
00386580 e9ab390000      jmp     00389f30
...
0:040> !ip2md 00389f30
MethodDesc:   0018af94
Method Name:  xxx._checkInput1(IntPtr, Boolean)
Class:        00435a7c
MethodTable:  0018afd8
mdToken:      06000034
Module:       0018a6a8
IsJitted:     yes
CodeAddr:     00389f30
Transparency: Critical
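The step from 00386580 to 00389f30 is just the x86 near-jump encoding: the byte E9 is followed by a 32-bit little-endian displacement measured from the end of the 5-byte instruction. WinDbg already prints the target, so the helper below is only a sketch for checking the arithmetic by hand; jmp_rel32_target is a name made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Compute the destination of an x86 `E9 rel32` jump located at `address`.
// rel32 is stored little-endian and is relative to the next instruction,
// which starts 5 bytes after `address`.
inline std::uint32_t jmp_rel32_target(std::uint32_t address,
                                      const std::uint8_t (&bytes)[5]) {
    assert(bytes[0] == 0xE9 && "not a jmp rel32 instruction");
    const std::uint32_t rel = static_cast<std::uint32_t>(bytes[1]) |
                              (static_cast<std::uint32_t>(bytes[2]) << 8) |
                              (static_cast<std::uint32_t>(bytes[3]) << 16) |
                              (static_cast<std::uint32_t>(bytes[4]) << 24);
    // Unsigned wraparound makes negative (backward) displacements work too.
    return address + 5u + rel;
}
```

For the bytes in the dump, e9 ab 39 00 00 at 00386580, the displacement is 0x39ab, and 0x00386580 + 5 + 0x39ab gives 0x00389f30, the address passed to !ip2md.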
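After some decoding it does turn out to be a managed callback function. At this point I was genuinely happy, because the end was in sight. A careful look through the code confirmed that it creates a timed event through the Windows thread pool:
[Screenshot: the code that creates the timer]
With that the whole picture is clear. Before the process exits, be sure to call the C# Dispose() method first so that the unmanaged Timer is shut down; otherwise the timer can fire after the CLR has begun shutting down, and you get this kind of intermittent crash.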
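3. A digression
The error codes in this dump are very misleading: the outer one is c0020001 and the inner one is 8007042Bh. Searching for the inner 8007042Bh in particular will lead you astray, with advice to repair system files and the like. In fact it is just a hard-coded constant that says nothing about the state of the system, as the assembly shows.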
0:000> ub 70f63bf5
clr!UMThunkStubRareDisableWorker+0x7:
70f63bd7 c9              leave
70f63bd8 e8d47fc3ff      call    clr!CanRunManagedCode (70b9bbb1)
70f63bdd 8b7508          mov     esi,dword ptr [ebp+8]
70f63be0 85c0            test    eax,eax
70f63be2 7511            jne     clr!UMThunkStubRareDisableWorker+0x25 (70f63bf5)
70f63be4 b92b040780      mov     ecx,8007042Bh
70f63be9 c7460800000000  mov     dword ptr [esi+8],0
70f63bf0 e8f721f1ff      call    clr!COMPlusThrowBoot (70e75dec)
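The constant loaded into ecx can be taken apart as an ordinary HRESULT: a failure bit, a facility and a 16-bit code. The sketch below does that split; split_hresult and hresult_from_win32 are names invented here, with the second one written to mirror what the HRESULT_FROM_WIN32 macro does.

```cpp
#include <cassert>
#include <cstdint>

struct HResultParts {
    bool failure;            // bit 31
    std::uint16_t facility;  // bits 16..26
    std::uint16_t code;      // bits 0..15
};

// Split a raw HRESULT value into its three main fields.
constexpr HResultParts split_hresult(std::uint32_t hr) {
    return HResultParts{(hr >> 31) != 0,
                        static_cast<std::uint16_t>((hr >> 16) & 0x7FFu),
                        static_cast<std::uint16_t>(hr & 0xFFFFu)};
}

// Wrap a Win32 error code as an HRESULT with facility 7 (FACILITY_WIN32).
// Zero and values that already have the top bit set pass through unchanged.
constexpr std::uint32_t hresult_from_win32(std::uint32_t err) {
    return (static_cast<std::int32_t>(err) <= 0)
               ? err
               : ((err & 0xFFFFu) | (7u << 16) | 0x80000000u);
}

static_assert(split_hresult(0x8007042Bu).failure, "failure bit is set");
static_assert(split_hresult(0x8007042Bu).facility == 7, "FACILITY_WIN32");
static_assert(split_hresult(0x8007042Bu).code == 1067, "Win32 error 0x42B");
static_assert(hresult_from_win32(1067) == 0x8007042Bu, "round trip");
```

So 8007042Bh is simply Win32 error 1067 (ERROR_PROCESS_ABORTED in winerror.h) wrapped as an HRESULT. That generic error is what search results about repairing the system are reacting to, while here it is only the value the CLR source passes as E_PROCESS_SHUTDOWN_REENTRY.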
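So let the code do the talking, and rely less on hearsay that sends you down the wrong path.
III: Summary
Honestly, this dump is fairly hard to analyze. You need a basic understanding of the Windows thread pool and of the clr source code; otherwise it is very hard to build a complete chain of evidence.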