记一次 .NET 某WMS仓储打单系统 内存暴涨分析

  • A+
所属分类:.NET技术
摘要

七月中旬有一位朋友加wx求助,他的程序在生产上跑着跑着内存就飙起来了,貌似没有回头的趋势,询问如何解决,截图如下:


一:背景

1. 讲故事

七月中旬有一位朋友加wx求助,他的程序在生产上跑着跑着内存就飙起来了,貌似没有回头的趋势,询问如何解决,截图如下:

记一次 .NET 某WMS仓储打单系统 内存暴涨分析

和这位朋友聊下来,感觉像是自己在小县城当了个小老板,规律的生活,有当地资源,各种小关系,有一股财务自由的味道,这也是我一直向往的生活方式 ???。

既然朋友找到我了,我得想办法给他解决问题,既然是内存暴涨,我就赌一把在托管层面吧,嘿嘿,上windbg说话。

二: windbg 分析

1. 托管还是非托管

一直在追这个系列的朋友应该知道,我无数次的用 !address -summary!eeheap -gc 这两个命令来判断当前内存属于托管层还是非托管层。

 0:000> !address -summary  --- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal MEM_FREE                                393     7dfe`f2105000 ( 125.996 TB)           98.43% MEM_RESERVE                            1691      200`0f1e4000 (   2.000 TB)  99.81%    1.56% MEM_COMMIT                             6191        0`fed07000 (   3.981 GB)   0.19%    0.00%   0:000> !eeheap -gc Number of GC Heaps: 1 generation 0 starts at 0x000001D2E572BBC8 generation 1 starts at 0x000001D2E54F70E0 generation 2 starts at 0x000001D252051000 ephemeral segment allocation context: none          segment             begin         allocated              size 000001D252050000  000001D252051000  000001D26204FFE0  0xfffefe0(268431328) Large object heap starts at 0x000001D262051000          segment             begin         allocated              size 000001D262050000  000001D262051000  000001D2655F3F80  0x35a2f80(56242048) Total Size:              Size: 0xbf4dbf80 (3209543552) bytes. ------------------------------ GC Heap Size:    Size: 0xbf4dbf80 (3209543552) bytes.  

卦象上进程指标为 3.98G ,GC堆指标为 3209543552 = 3G ,很显然,本次事故属于 托管层面

2. 寻找托管层上的大对象

我们都知道C#是托管语言,所以甭管有用没用的对象都逃不出GC堆,言外之意就是看GC堆准没错,挑几个大对象看看。

 0:000> !dumpheap -stat Statistics:               MT    Count    TotalSize Class Name 00007ff98a68f090   391475     43869284 System.Int32[] 00007ff98b6adfa0  1902760     45666240 System.Collections.ObjectModel.ReadOnlyCollection`1[[System.Linq.Expressions.Expression, System.Linq.Expressions]] 00007ff98b6ac3c0  1951470     46835280 System.Linq.Expressions.ConstantExpression 00007ff98bc452e0  1681178     53797696 System.Linq.Expressions.TypedConstantExpression 00007ff98eacb6b8  1902708     60886656 System.Dynamic.Utils.ListArgumentProvider 00007ff98f236518  1774982     70999280 Microsoft.EntityFrameworkCore.Query.Expressions.ColumnExpression 00007ff98c650c58  1681142     80694816 System.Linq.Expressions.MethodCallExpression3 00007ff98a82bc38  3414094     81938256 System.RuntimeMethodHandle 00007ff98fd96fc0    17750     83936016 System.Collections.Generic.Dictionary`2+Entry[[System.Reflection.MemberInfo, System.Private.CoreLib],[System.Linq.Expressions.Expression, System.Linq.Expressions]][] 00007ff98e5ed5d8    35493    101740504 System.Collections.Generic.Dictionary`2+Entry[[Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceCacheKey, Microsoft.Extensions.DependencyInjection],[System.Object, System.Private.CoreLib]][] 00007ff98bcff6a8  3639389    116460448 System.Linq.Expressions.PropertyExpression 00007ff98b85cf00  5028347    160907104 System.Reflection.Emit.GenericFieldInfo 00007ff98a671e18  2178117    168395994 System.String 00007ff98a5b6610   160565    171498416 System.Object[] 00007ff98eaa8ab0  4981589    199263560 System.Linq.Expressions.MemberAssignment 00007ff98a672360   398740    391928469 System.Byte[] 00007ff98a746d68   181886    486150592 System.Char[]   

从托管堆上看,System.Linq.Expressions.MemberAssignment 对象高达 498w ,很明显有问题,从类名看可能和 ExpressionTree 有关,那就抽几个对象看看它的引用链上是否有过大的对象。

 0:000> !gcroot 000001d25399f690 HandleTable:     000001D251B715A8 (pinned handle)     -> 000001D262068CF0 System.Object[]     -> 000001D2531C3B78 Microsoft.EntityFrameworkCore.Internal.ServiceProviderCache     -> 000001D25399E3D0 Remotion.Linq.QueryModel     -> 000001D25399E3B8 Remotion.Linq.Clauses.SelectClause     -> 000001D25442C068 System.Linq.Expressions.MemberInitExpression     -> 000001D25442C050 System.Runtime.CompilerServices.TrueReadOnlyCollection`1[[System.Linq.Expressions.MemberBinding, System.Linq.Expressions]]     -> 000001D2539A0290 System.Linq.Expressions.MemberBinding[]     -> 000001D25399F690 System.Linq.Expressions.MemberAssignment  

引用链特别长,这里我就截取一下,经过一顿排查,我发现大对象居然是 Remotion.Linq.Clauses.SelectClause,objsize 这个对象直接爆掉了,真的很奇葩,如下代码所示:

 0:000> !objsize 000001D25399E3B8 sizeof(000001D25399E3B8) = -1187378032 (0xb93a0c90) bytes (Remotion.Linq.Clauses.SelectClause)  

有点懵,这个对象居然是罪魁祸首,从引用链看它是 EF 下的一个构建表达式树的小部件,可以肯定的是,朋友在用EF的时候出了什么问题,不过还得硬着头皮继续挖 SelectClause,经过深挖,我发现这个类有大量这样的 char[] 数组,导出来后大概是下面这样。

 Logistics.Text30),  |           Text31 = string TryReadValue(t1.Outer.Outer, 42, WmsOutboundConfirmLogistics.Text31),  |           Text32 = string TryReadValue(t1.Outer.Outer, 43, WmsOutboundConfirmLogistics.Text32),  |           Text33 = string TryReadValue(t1.Outer.Outer, 44, WmsOutboundConfirmLogistics.Text33),  |           Text34 = string TryReadValue(t1.Outer.Outer, 45, WmsOutboundConfirmLogistics.Text34),  |           Text35 = string TryReadValue(t1.Outer.Outer, 46, WmsOutboundConfirmLogistics.Text35),  |           IsQueue = Nullable<bool> TryReadValue(t1.Outer.Outer, 47, WmsOutboundConfirmLogistics.IsQueue),  |           IsStop = Nullable<bool> TryReadValue(t1.Outer.Outer, 48, WmsOutboundConfirmLogistics.IsStop),  |           CheckCode = string TryReadValue(t1.Outer.Outer, 49, WmsOutboundConfirmLogistics.CheckCode),  |           ClientCode = string TryReadValue(t1.Outer.Inner, 50, WmsOutboundOrder.ClientCode),  |           WarehouseCode = string TryReadValue(t1.Outer.Inner, 51, WmsOutboundOrder.WarehouseCode),  |           ErpNumber = string TryReadValue(t1.Outer.Inner, 52, WmsOutboundOrder.ErpNumber),  |           OrderCategory = string TryReadValue(t1.Outer.Inner, 53, WmsOutboundOrder.OrderCategory),  |           OrderStatus = string TryReadValue(t1.Outer.Inner, 54, WmsOutboundOrder.OrderStatus),  |           OrderType = string TryReadValue(t1.Outer.Inner, 55, WmsOutboundOrder.OrderType),  |           SendCompany = string TryReadValue(t1.Outer.Inner, 56, WmsOutboundOrder.SendCompany),  |           SendName = string TryReadValue(t1.Outer.Inner, 57, WmsOutboundOrder.SendName),  |           SendTel = string TryReadValue(t1.Outer.Inner, 58, WmsOutboundOrder.SendTel),  |           SendMobile = string TryReadValue(t1.Outer.Inner, 59, WmsOutboundOrder.SendMobile),  |           SendProvince = string TryReadValue(t1.Outer.Inner, 60, WmsOutboundOrder.SendProvince),  |           SendCity = string TryReadValue(t1.Outer.Inner, 61, WmsOutboundOrder.SendCity),  |           SendArea = string TryReadValue(t1.Outer.Inner, 62, WmsOutboundOrder.SendArea),  |           ... |           CategoryName = string TryReadValue(t1.Outer.Inner, 88, WmsOutboundOrder.CategoryName),  |           SourcePlatformCode = string TryReadValue(t1.Outer.Inner, 89, WmsOutboundOrder.SourcePlatformCode),  |           PayMode = (string)string TryReadValue(t1.Outer.Outer, 90, null),  |           List = List<WmsOutboundConfirmLogisticsLinesDTO> WmsOutboundConfirmLogisticsBusiness.GetOrderLines(string TryReadValue(t1.Outer.Outer, 5, WmsOutboundConfirmLogistics.OrderNumber)),  |           ConfirmTime = DateTime TryReadValue(t1.Inner, 91, WmsOutboundOrderConfirmation.CreateTime),  |           ReturnUrl = (string)string TryReadValue(t1.Outer.Outer, 92, null)  |       } |__ ),  |__ contextType: Core.DataRepository.BaseDbContext,  |__ logger: DiagnosticsLogger<Query>,  |__ queryContext: Unhandled parameter: queryContext)                                                        

从内容看,应该是 select 语句的ExpressionTree表示,问了下朋友,说大概是报表业务,不过这些信息给他貌似也没有多大帮助,说实话到这里我其实也不知道怎么继续往下排查了,陷入了绝望。

3. 从绝望中寻找希望

我在想,既然EF构建了大量这样的 ExpressionTree,肯定有问题,但也想不出是什么问题,隔了半天,我突然灵光一现,EF既然构建了树,有可能sql也出来了,对,我何不直接在 heap 上搜索 select 的sql语句。。。。

 0:000> !strings /m:*select* Address            Gen    Length   Value 000001d2e4de64e0    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",... 000001d2e4e11e78    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",... 000001d2e4e3d1f0    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",... 000001d2e4e673c8    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",... 000001d2e4e91760    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",... 000001d2e4ebb2e8    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",... 000001d2e4ee54f8    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",... 000001d2e4f10758    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",... 000001d2e4f398d0    2       1964   SELECT a."Id", a."CreateTime" AS "CreateTime0", a."CreatorId", a."CreatorRealName", a."Deleted", a."OrderNumber", a."CarrierId",...  --------------------------------------- 18128 matching strings  

果然发现了大量重复的 select 语句,而且从最左边的内存地址看都是非常接近的,也就说明他们是在某一个操作中同时生成的,然后我们导出几个sql语句。

 SELECT a."Id", .... FROM "WmsOutboundConfirmLogistics" AS a INNER JOIN "WmsOutboundOrder" AS b ON a."OrderNumber" = b."OrderNumber" INNER JOIN "WmsOutboundOrderConfirmation" AS c ON a."OrderNumber" = c."OrderNumber" WHERE (a."OrderNumber" = @__pagination_OrderNumber_0) AND (b."FreezeStatus" = FALSE) ORDER BY a."Id"";  SELECT a."Id", .... FROM "WmsOutboundConfirmLogistics" AS a INNER JOIN "WmsOutboundOrder" AS b ON a."OrderNumber" = b."OrderNumber" INNER JOIN "WmsOutboundOrderConfirmation" AS c ON a."OrderNumber" = c."OrderNumber" WHERE (a."OrderNumber" = @__pagination_OrderNumber_0) AND (b."FreezeStatus" = FALSE) ORDER BY a."Id""  

拿到这 1.8w 重复的sql 给朋友看,朋友说这是查询报表的sql。

4. 所有线索整合打通

那这里就存在着很大问题,既然是查询报表,为什么会有 1.8w 相同的sql,唯一不同的就是 a."OrderNumber" = @__pagination_OrderNumber_0 中的订单号,难道不应该是 a.OrderNumber in (xxxx) 或者是表关联查询吗??? 整理一下就是下面这样的猜想:

 -- 理想 select * from a where a.id in (1,2,3)  -- 现实 select * from a where a.id=1; select * from a where a.id=2; select * from a where a.id=3;  

加上每个sql内存地址相近,再结合爆表的 Remotion.Linq.Clauses.SelectClause 对象,整个流程大概就是:本该表关联或者in操作,结果变成了无数个单条sql语句查询,导致EF底层出现内存爆炸式增长。

三:总结

看了下朋友查询ef的写法,猜测大多都是人肉构建 ExpressionTree 去查询数据库,大写的??,比如下面的这张图:

记一次 .NET 某WMS仓储打单系统 内存暴涨分析

解决方案就是让朋友检查下表示式树的写法问题,或者直接灌写好的sql得了,说实话这个dump还是费了九牛二虎之力,本以为很简单,实操起来还是碰到了一点小困难,就当历练成长吧!

更多高质量干货:参见我的 GitHub: dotnetfly