Runtime CPU Spike Detection Using Manual and Compiler-Automated Instrumentation. Adisak Pochanayon, Principal Software Engineer, Netherrealm Studios, adisak@wbgames.com. Short introduction about self and qualifications during this slide: 27 years programming games (20 years full time); 11 years at Netherrealm Studios; 8 years on MK; worked on two of the top titles of 2011, MK and Batman: Arkham City. This is a very fast talk with many slides, so please hold questions until the end. I hope to have some time to answer questions, or feel free to hang out and talk to me after the presentation. Don't worry if you miss a detail on a slide: they will all be available online very shortly in the free section of the GDC Vault. I have two handouts for this talk. If you didn't get one, they will be available on the GDC site after the conference. Neither one is required to follow the talk, especially not the MIPS detours handout, which is more for offline enjoyment. The PENTER_PEXIT handout covers some code (especially asm) that may be difficult to read off the slides, and shows it all at once where the slides feed you a couple of lines at a time. If you really love assembly language, you might want to follow along on that handout during part of the talk (or look at your neighbor's).
Topics Covered. This talk is about runtime profiling based on instrumentation: methods of code instrumentation, and the Mortal Kombat (MK) PET Profiler (a spike detector). This is not going to be a "tips and tricks with your favorite profiler" talk; rather, it is about techniques to implement your own instrumented profiler, followed by some details on the API of a spike-detection profiler. This is going to be a very advanced talk. I will mention compiler-specific options, machine-code trampolining, naked functions with platform-specific inline assembler, and linker function wrapping. Some parts of the talk will be easy to follow, and I will hopefully cover some useful information for people who are not low-level system engineers. However, the bulk of the talk targets engine programmers who are interested in implementing runtime profiling systems. Also, while I'm going to talk about methods to implement your own profiling library, this doesn't mean you *NEED* to do this. Try the available tools first. It's what we normally do. We use these types of tools for finding issues that are stubbornly persistent and hard to find after FIRST trying VTune, Tuner, Razor, PIX, Perfview, etc.
Profilers. Common types of profilers: hardware trace; hardware event-based; software event-based; sampling; instrumented. Hardware trace: records low-level hardware states continuously, which is expensive and produces large sample sets (example: PS2 PA). Hardware event-based: for example, GPU hardware counters or CPU performance counters. Software event-based: incrementing counters or adjusting a value during software events. Sampling: a high-resolution timer interrupt fires N times a second, then samples the stack and records a sample. Instrumented: examples are fastcap and callcap. NOTE: anything with exact hierarchical information is most likely from an instrumented profiler. Most commercial profilers are sampling and/or instrumented, or hybrids. What this talk is interested in is instrumentation.
Manual Instrumentation: explicit instrumentation / code markup; wrapper functions; detours and trampolines. Manual instrumentation summary.
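Explicit Instrumentation. Requires code markup (modifying the source code). Begin/end markers: StartMarker(INFO) / StopMarker(INFO). Scoped marker: ScopedMarker(INFO).
class CScopedMarker {
public:
    explicit CScopedMarker(ProfDesc &info) : m_info(info) { StartMarker(m_info); }
    ~CScopedMarker() { StopMarker(m_info); }   // a destructor takes no arguments
private:
    ProfDesc &m_info;
};
// Two-level paste so that __LINE__ expands before ## is applied
// (PROF_CONCAT / PROF_CONCAT2 are helper names added for this fix).
#define PROF_CONCAT2(a, b) a##b
#define PROF_CONCAT(a, b) PROF_CONCAT2(a, b)
#define ScopedMarker(INFO) CScopedMarker PROF_CONCAT(ProfInfo, __LINE__)(INFO)
FIRST METHOD OF MI: explicit manual instrumentation. There are many types: CPU perf markers (SN Razor), the Unreal stat system, the MK PET Profiler. Most simple user profiling systems use explicit manual instrumentation: manually mark the beginning and end (prolog and epilog) of profiled sections, or instrument a scope with a "scoped" marker. It is a bit hacky because it requires code markup, but we shouldn't dismiss it: it is useful for many things and easy to implement. If done with #defines, make sure it compiles out in release builds.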
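The note about compiling markers out of release builds can be sketched as follows. This is a minimal illustration, not the PET implementation: `PROFILING_ENABLED`, `g_events`, `Update()` and the bodies of `StartMarker` / `StopMarker` are hypothetical stand-ins, chosen only so the example runs on its own.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the slide's ProfDesc / StartMarker / StopMarker.
struct ProfDesc { const char *name; };
static std::vector<std::string> g_events;  // records marker activity for inspection
inline void StartMarker(ProfDesc &d) { g_events.push_back(std::string("start:") + d.name); }
inline void StopMarker(ProfDesc &d)  { g_events.push_back(std::string("stop:") + d.name); }

#ifndef PROFILING_ENABLED
#define PROFILING_ENABLED 1  // assumption: a release build would define this as 0
#endif

#if PROFILING_ENABLED
class CScopedMarker {
public:
    explicit CScopedMarker(ProfDesc &info) : m_info(info) { StartMarker(m_info); }
    ~CScopedMarker() { StopMarker(m_info); }
private:
    ProfDesc &m_info;
};
// Two-level paste so __LINE__ is expanded before token pasting.
#define PROF_CONCAT2(a, b) a##b
#define PROF_CONCAT(a, b) PROF_CONCAT2(a, b)
#define ScopedMarker(INFO) CScopedMarker PROF_CONCAT(ProfInfo, __LINE__)(INFO)
#else
// Release: the marker expands to an expression with no effect.
#define ScopedMarker(INFO) ((void)0)
#endif

static ProfDesc g_updateDesc = { "Update" };

// Example instrumented function: the marker covers the whole scope.
void Update() {
    ScopedMarker(g_updateDesc);
    g_events.push_back("work");
}
```

With profiling enabled, a call to `Update()` records a start event, the work, and a stop event in that order; with `PROFILING_ENABLED` set to 0 only the work remains.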
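Wrapper Functions. Compile time: #define function(…) wrapper_function. Advantage: independent of the compiler implementation. Drawback: only works when you have the source code. Link-time replacement / wrapping: GCC option -Wl,--wrap,function_name, which gives you __wrap_function_name() and __real_function_name(). SECOND METHOD OF MI: the GCC link-time wrapper version works at link time on libraries and binary objects without source (it works on functions to which you have source as well).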
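The compile-time variant can be sketched like this. `LoadLevel`, `Profiled_LoadLevel` and the call counter are hypothetical names used only for illustration; a real wrapper would start and stop a timer rather than count calls.

```cpp
#include <cassert>

// Hypothetical function we want to instrument (we have its source).
int LoadLevel(int id) { return id * 2; }

static int g_loadLevelCalls = 0;

// The wrapper is defined BEFORE the macro, so the call inside it
// still reaches the real LoadLevel().
inline int Profiled_LoadLevel(int id) {
    ++g_loadLevelCalls;      // instrumentation would go here
    return LoadLevel(id);    // forward to the original function
}

// From this point on, every call written as LoadLevel(...) is redirected.
#define LoadLevel(...) Profiled_LoadLevel(__VA_ARGS__)

// Call site compiled after the #define: both calls go through the wrapper.
int LoadTwoLevels() { return LoadLevel(1) + LoadLevel(2); }
```

The redirect only affects code compiled after the `#define`, which is why this approach needs source access; the link-time `--wrap` option has no such restriction.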
Wrapper Functions. [Diagram: calling function and target function.] Basic function call.
Wrapper Functions. [Diagram: calling function, "wrapper" function and "real" function, with the call and return steps numbered 1 to 4.] Again, it's completely up to the wrapper function what to do. It can call the original target function as shown above. It can also skip the original function (replacing the implementation); in that case steps 2-3 are skipped and the wrapper replaces the original function. Note that we used this functionality to wrap malloc / free and other memory calls on PS3 for the MK memory system that I talked about last year.
Wrapper Functions. Example of wrapping malloc() with GCC / SNC. Add the linker flag: -Wl,--wrap,malloc
#include <cstddef>
extern "C" void* __real_malloc(size_t);
extern "C" void* __wrap_malloc(size_t Size)
{
    // Call the original malloc() function
    return __real_malloc(Size);
}
This is actually how we "took over" malloc() in the memory system on PS3. The sample here is a null wrapper that does nothing other than call the original function, but you can do whatever you want (including calling your own function instead).
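Detours and Trampolines. A method of modifying code for instrumentation. Can be done by the profiler on object code / binaries. Can be done at runtime through calls to an instrumentation library. See Microsoft Detours. MIPS sample code (handout). THIRD METHOD OF MI: this is another form of manual instrumentation, but it does not require source-code markup of the target function.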
Detours and Trampolines. [Diagram: calling function and target function.] Basic function call.
Detours. [Diagram: calling function, detour function and target function; a jump is written into the target function; steps numbered 1 to 3.] A "pure" detour is when we hijack a function by writing a jump into it. We then lose the original function, though. So how do we allow execution of the original function? Trampolines.
Trampolines. [Diagram: calling function, target function and trampoline buffer; the target's prolog is copied into the trampoline buffer.] Copy the first part of the instrumentation target function.
Trampolines. [Diagram: calling function, target function and trampoline buffer; the trampoline buffer holds the copied target prolog followed by a jump.] Add a "jump" from the trampoline buffer to the continuation point in the instrumented target function. At this point, you could call the trampoline buffer and it would execute the same as calling the target function.
Detours and Trampolines. [Diagram: calling function, detour function, trampoline buffer (copied target prolog plus a jump) and target function (patched with a jump); steps numbered 1 to 6.] Now let's combine the two and see what we can do… The detour function acts like a wrapper, and the trampoline allows calling the original function. Now it's completely up to the detour function what to do. It can call the original target function as shown above. It can also skip the original function (replacing the implementation); in that case steps 3-5 are skipped and it's back to a "pure" detour.
Detours and Trampolines: Summary (Drawbacks). You have to roll your own: trivial on RISC, harder on CISC. You have to deal with page protection / NX (No-Execute). Commercial use must be paid for. Microsoft Detours: http://research.microsoft.com/en-us/projects/detours/ Microsoft's 1999 research paper on detours and trampolines: http://research.microsoft.com/pubs/68568/huntusenixnt99.pdf Detours is $9,999.95 for commercial use and free for non-commercial use. It is trivial to implement on RISC; see the HANDOUT for the method we used 10 years ago. Note that we implemented it independently before reading the MS paper. So why pay $10K if it is trivial? It is only trivial A) on RISC, because varying instruction sizes on X86 make copying the target prolog to the trampoline more "difficult", and B) if you don't have to worry about the NX bit. If you do use the NX bit, you need to be able to control memory page attributes for the target functions and the trampoline functions.
Manual Instrumentation. Summary of manual instrumentation methods: explicit markup; wrapper functions; detours and trampolines. All of these methods require identifying the functions and user intervention (code markup, library calls, or linker parameters).
Automatic Instrumentation. You may already be using automatic instrumentation, since many profilers support it: Metrowerks CATS, VTune Call Graph, Visual Studio Profiler and Visual C++ /callcap and /fastcap, GNU gprof (with gcc -pg). Compiler-Assisted Instrumentation (CAI) lets the user implement the profiler while the compiler does the markup for you. GCC: -finstrument-functions / SNC: -Xhooktrace. Visual C++: _penter() and _pexit() using /Gh and /GH. The two terms mean the same thing: Compiler Assisted == Compiler Automated.
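Automatic Instrumentation. [Diagram: prolog, function body, epilog.] When a compiler generates machine code for a function, it generates not only the function body but also a prolog (saving registers, setting up the stack frame, and so on) and an epilog (restoring the saved registers and state, then returning). This is a normally compiled function.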
Automatic Instrumentation: compiler-automated instrumentation. [Diagram: prolog; log entry { _penter() / __cyg_profile_func_enter() }; function body; log exit { _pexit() / __cyg_profile_func_exit() }; epilog.] With CAI, the compiler inserts function calls to log the entry and exit. You supply the functions shown in green. On VITA, PS3, and PC these functions directly call PET Profiler logging. On XBOX 360 there are a couple of extra steps involved.
GCC & SNC CAI. Compiler option: -finstrument-functions. Generates instrumentation calls for entry and exit to functions. Just after function entry and just before function exit, the following profiling functions are called with the address of the current function and its call site.
void __cyg_profile_func_enter (void *this_fn, void *call_site);
void __cyg_profile_func_exit (void *this_fn, void *call_site);
This is pretty simple to use. You just have to implement these functions and they get called on function entry and exit. The implementation can be in C / C++, and the calls save registers according to the platform ABI.
SNC CAI (PS3 / VITA).
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    if (0 == tls_PET_bIsInProcessing) {
        tls_PET_bIsInProcessing = true;
        _internal_PET_LogEntry(0);
        tls_PET_bIsInProcessing = false;
    }
}
Note: this will work for most GCC platforms too. SN just got this CAI working in the most recent compilers on PS3 and VITA. I am the first developer to use these features in a large-scale project, and I have been working with a couple of their engineers to sort out a couple of minor issues. By the time the next compiler updates come out, no ASM should be necessary. PS3: search SCEDEV.NET for "PPU ABI Specifications" in the "SDK Docs" sections.
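A self-contained sketch of both hooks with the thread-local re-entrancy guard is shown below, assuming a GCC-compatible compiler and a build with `-finstrument-functions`. The names `tls_inHook`, `tls_depth`, `LogEntry` and `LogExit` are hypothetical stand-ins for the PET internals; only the two `__cyg_profile_func_*` signatures come from the compiler. The hooks themselves are marked `no_instrument_function` so that they are not instrumented in turn.

```cpp
#include <cassert>

#if defined(__GNUC__) || defined(__clang__)
#define NO_INSTRUMENT __attribute__((no_instrument_function))
#else
#define NO_INSTRUMENT
#endif

static thread_local bool tls_inHook = false;  // re-entrancy guard, one per thread
static thread_local int  tls_depth  = 0;      // current instrumented call depth

// Hypothetical work functions; a real profiler would record timestamps here.
static NO_INSTRUMENT void LogEntry(void *fn) { (void)fn; ++tls_depth; }
static NO_INSTRUMENT void LogExit(void *fn) {
    (void)fn;
    assert(tls_depth > 0 && "exit without a matching entry");
    --tls_depth;
}

extern "C" NO_INSTRUMENT void __cyg_profile_func_enter(void *this_fn, void *call_site) {
    (void)call_site;
    if (tls_inHook) return;   // already inside the profiler on this thread
    tls_inHook = true;
    LogEntry(this_fn);
    tls_inHook = false;
}

extern "C" NO_INSTRUMENT void __cyg_profile_func_exit(void *this_fn, void *call_site) {
    (void)call_site;
    if (tls_inHook) return;
    tls_inHook = true;
    LogExit(this_fn);
    tls_inHook = false;
}
```

The guard matters as soon as the logging code calls any function that is itself instrumented: without it, the hook would recurse into itself.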
Visual C++ CAI. Using _penter() and _pexit() in Visual C++ is a bit more difficult. At a minimum, you need to write assembler that saves registers according to the platform ABI. Additional checks are needed to make it "usable". If you have the PENTER / PEXIT handout, you might want to reference it to follow along for the next couple of slides. Example of an additional check: a re-entrant thread check using thread-local storage, __declspec(thread).
Visual C++ CAI: X86.
extern "C" void __declspec(naked) _cdecl _penter( void )
{
    _asm {
        // Save registers
        push eax
        push ebx
        push ecx
        push edx
        push ebp
        push edi
        push esi
    }
    // TLS re-entrancy check
    if (0 == tls_PET_bIsInProcessing) {
        tls_PET_bIsInProcessing = true;
        // Call C work function
        _internal_PET_LogEntry(0);
        tls_PET_bIsInProcessing = false;
    }
    _asm {
        // Restore registers
        pop esi
        pop edi
        pop ebp
        pop edx
        pop ecx
        pop ebx
        pop eax
        ret
    }
}
See the handout; this code is duplicated there. The steps are: save registers, TLS re-entrancy check, call the actual work function, restore registers. NOTE: _pexit() is exactly the same as _penter() except that it calls a different work routine.
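Visual C++ CAI: XBOX 360. CAI is supported on the XBOX 360 platform (PowerPC). Almost the same as on PC: save registers; new step: check for DPC (Deferred Procedure Call); TLS re-entrancy check; call the actual work function; restore registers. NOTE: XBOX 360 CAI is not very well documented.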
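Visual C++ CAI: XBOX 360. The PowerPC version is more complicated. More ABI registers need to be saved and restored. If any floating-point (FP) work is done, the FP registers must be saved and restored. Optimizations and early outs. TLS access must be done in ASM. Mixing asm and C in naked functions does not work as well as it does on X86. See the handout for the rest… The feature is undocumented but works well on XBOX, with some caveats. One of our senior programmers was talking to an MS rep at an XBOX 360 DevCon about our profiler, and according to him this was the first and only time he had heard of someone actively using this feature.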
Visual C++ CAI: XBOX 360.
void __declspec(naked) _cdecl __penter( void )
{
    __asm {
        // Tiny prolog
        // - Set link register (r12) & return address (two steps)
        std r12,-20h(r1)     // Saving LR here is an extra step!
        mflr r12
        stw r12,-8h(r1)      // Return address
        bl PET_prolog
        bl _internal_PET_LogEntry
        b PET_epilog
    }
}
First of all, _penter is just like _pexit… R12 is the link register according to the ABI, so we are using it here as such. NOTE: no "return" (blr) in the function? Naked asm really is asm; it is not "C". We can actually return from PET_epilog, so a return (blr) is not necessary (and in fact the early out can return to __penter's caller from PET_prolog). Oh look… two extra functions… our _penter has its own prolog and epilog (which we share with _pexit).
XBOX 360 CAI Flow: _penter(). [Diagram, step 1: the "C++" instrumented function and _penter() {asm}.] Let's take a look at the flow of _penter.
XBOX 360 CAI Flow: _penter(). [Diagram, steps 1 to 3: the "C++" instrumented function, _penter() {asm} and PET_Prolog {asm}.] PET prolog helper.
XBOX 360 CAI Flow: _penter(). [Diagram, steps 1 to 4: the instrumented function, _penter() {asm}, PET_Prolog {asm} and the PET_Prolog early out {asm}.] Early out: the red path is the early out. Special note: asm allows a direct return to the grandparent. The TLS re-entrancy check and the DPC check both early out.
XBOX 360 CAI Flow: _penter(). [Diagram, steps 1 to 5: the "C++" instrumented function, _penter() {asm}, PET_Prolog {asm} and the "C++" profile routine that logs function entry.] Logging step: this is where the PET Profiler code is actually called.
XBOX 360 CAI Flow: _penter(). [Diagram, steps 1 to 7: the "C++" instrumented function, _penter() {asm}, PET_Prolog {asm}, the "C++" profile routine that logs function entry, and PET_Epilog {asm}.] Epilog. NOTE: direct return from PET_Epilog() to the grandparent.
XBOX 360 CAI Flow: _penter(). [Diagram, steps 1 to 7 plus the early-out path: the "C++" instrumented function, _penter() {asm}, PET_Prolog {asm}, the PET_Prolog early out {asm}, the "C++" profile routine that logs function entry, and PET_Epilog {asm}.] This is the flow for _penter… note the mix of C++ (or C) and ASM and the flow of the functions. This is for PowerPC / XBOX 360. However, PC is exactly the same, except that the purple boxes are inline in the _penter function and the early out (in red) is a simple C conditional in the _penter() function. NOTE: the red path is the early out. Special note: asm allows a direct return to the grandparent, both from the PET_Prolog() early out and from PET_Epilog(). The parts in purple and red are shared with _pexit(). This allows for better utilization and less pollution of the I-cache.
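Visual C++ CAI: XBOX 360. PET_Prolog has five parts: a tiny prolog that saves a minimal set of registers; a check for DPC and possible early out; a check for recursion (TLS var) and possible early out; saving the temporaries (including r2) and returning to the parent; and the early out, which returns all the way to the grandparent function.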
Visual C++ CAI: XBOX 360. Tiny prolog to save registers.
// Tiny prolog
// - Save registers (r11, r14)
// - Set up the stack frame (r1)
std r11,-30h(r1)
std r14,-28h(r1)
// The old stack pointer (r1) is at 0(r1) after this instruction
stwu r1,-100h(r1)
Visual C++ CAI: XBOX 360. Check for DPC and possible early out.
// Get the thread-specific TLS base
lwz r11,0(r13)
// Do not try to run in a DPC!
// In a DPC { 0(r13) == 0 }
cmplwi cr6,r11,0
beq cr6,label__early_exit_prolog
NOTE: XBOX 360 does not document checking for DPC. I discovered that checking for a TLS base of 0 is a way to avoid executing in a DPC.
Visual C++ CAI: XBOX 360. Check for recursion (TLS var) and possible early out.
lau r14,_tls_start                  // Get the global TLS base
lau r12,tls_PET_bIsInProcessing
lal r14,r14,_tls_start
lal r12,r12,tls_PET_bIsInProcessing
sub r11,r11,r14                     // TLS base offset (r11)
add r14,r11,r12                     // r14 == &tls_PET_bIsInProcessing
// Avoid recursion using the thread variable tls_PET_bIsInProcessing
lwzx r12,r11,r12
cmplwi cr6,r12,0
bne cr6,label__early_exit_prolog
li r12,1
stw r12,0(r14)                      // Set tls_PET_bIsInProcessing
NOTE: XBOX 360 does not document TLS access in naked assembler. I discovered that TLS variables can be accessed using (variable - _tls_start + TLS-base{r13}). Just using the variable name gives you the storage for the global copy of the initial RO-DATA for the TLS variables.
Visual C++ CAI: XBOX 360. Check for recursion (TLS var) and possible early out.
if (tls_PET_bIsInProcessing) goto label__early_exit_prolog;
tls_PET_bIsInProcessing = true;
What made it complicated was addressing the TLS variable in assembler.
Visual C++ CAI: XBOX 360. Save the temporaries (including r2) and return to the parent.
// Save r0/r2-r10 (temporaries)
std r0,8h(r1)
std r2,10h(r1)    // (r2 is reserved on XBOX 360)
std r3,18h(r1)
std r4,20h(r1)
std r5,28h(r1)
std r6,30h(r1)
std r7,38h(r1)
std r8,40h(r1)
std r9,48h(r1)
std r10,50h(r1)
blr               // Return to the caller
NOTE: XBOX 360 does not document its ABI. The registers saved are according to the PowerPC ABI, but I discovered that "r2" is also reserved on XBOX 360, so this is an additional requirement.
Visual C++ CAI: XBOX 360. The early out returns all the way to the grandparent function.
label__early_exit_prolog:
// Tiny epilog: adjust the stack (r1) & restore r12/r14/r11
addi r1,r1,100h
lwz r12,-8h(r1)
mtlr r12
ld r12,-20h(r1)
ld r14,-28h(r1)
ld r11,-30h(r1)
blr
Whew… first slide without gotchas (other than the non-obvious exit to the grandparent).
Visual C++ CAI: XBOX 360. PET_Epilog is simpler (see the handout): clear the TLS recursion-prevention variable; restore the temporaries; restore the registers used in the tiny prolog. Final note: the worker function must save / restore the FP registers (fr0-fr13) if it does any floating-point work. As an optimization, we do not save and restore the FP registers unless it is required. If we detect that we need to do FP work (currently only during a detected spike in the PET Profiler), then we save and restore fr0-fr13 around the FPU work. This is done with a very simple naked ASM function.
XBOX 360 CAI Flow: _pexit(). [Diagram, steps 1 to 7 plus the early-out path: the "C++" instrumented function, _pexit() {asm}, PET_Prolog {asm}, the PET_Prolog early out {asm}, the "C++" profile routine that logs function exit, and PET_Epilog {asm}.] The only differences here are: I used slightly different colors to show that the _pexit and log-exit routines differ from the previous slide; _pexit gets called at the end of the instrumented function, instead of _penter at the beginning; and the profile routine logs the function exit instead of the entry.
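How to Use Instrumentation. Now that we have all these methods for instrumenting code, how do we use them? Hook them up to your profiling code. On the Mortal Kombat team, one example of using instrumentation is an in-house runtime spike detector that we call the PET Profiler. It was used on MK and on Batman: Arkham City.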
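PET Profiler. PET = Prolog Epilog Timing. Times function entry and exit and detects spikes. A global threshold can be set, and any instrumented function that exceeds it is logged, with the call stack and potentially extra information from markup. Works with compiler-automated instrumentation: spike detection on every compiled function in the game. Using CAI costs roughly 15-30% in performance overhead.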
PET Profiler. No code markup is needed to detect spikes: turn on CAI and it automatically finds any function whose execution time exceeds the global threshold you set. PET tags / code markup are still useful: a simple PET_FUNCTION() (scoped marker) gives the PET Profiler additional information; markup allows spike detection on platforms without CAI; alternate implementations can enhance other profilers. Note: PET works in manual and compiler-instrumented modes with its own internal spike-detection profiler. But once you have set up markers, there is an easy mechanism to override the PET tags so that they insert markers for any profiler that uses either entry/exit or scoped markers. This allows for "alternate implementations" that enhance other profilers with additional CPU markers or tags, which is great for SN Razor and similar tools. So PET_FUNCTION() can provide any extra info you want, depending on the implementation. The current implementation passes strings for __FUNCTION__ and __FILE__ and an int for __LINE__ so that the runtime can emit them into the log (otherwise it emits raw addresses in the logged stack trace, which then need to be resolved to names).
PET Implementation. PET is implemented using a TLS information stack. The stack is incremented on CAI entry and decremented on CAI exit. When CAI is not present, the PET stack is incremented and decremented by markup (scoped markers). Scoped markers and API markup provide additional information and functionality at the global, function, subsystem, or thread level. Go over the stack implementation details (child / parent propagation, ignore, pause, etc.). We also have things like pause, ignore, and timing thresholds that can propagate up and down the stack to parents or children. There is a small TLS memory cost per thread for the stacks. We currently limit stacks to a depth of 256, so spikes in very deeply nested calls will not get fully targeted or reported until we get back to a parent that is less than 256 levels deep. This is a #define that we can easily change in code, though.
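PET Implementation: stack data. An entry can be as small as 4 words, depending on the #define options selected: entry time; optional threshold (overrides the global threshold); optional threshold for children; description (most of it optional), consisting of the function address, function name (pointer to a static string), line number, source file name (pointer to a static string), and a user-generated description (dynamic string).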
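The per-thread stack described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PET source: the `pet_sketch` namespace, the field layout, the `spikes` vector (a real profiler would write to a log) and the explicit `nowMs` timestamp parameter are all hypothetical, with the timestamp passed in only so that the behavior is easy to check by hand.

```cpp
#include <cassert>
#include <vector>

namespace pet_sketch {

constexpr int kMaxDepth = 256;  // depth limit mentioned in the talk

// One entry per active instrumented call (hypothetical layout).
struct StackEntry {
    const void *function;  // address of the instrumented function
    double entryMs;        // timestamp at entry, in milliseconds
    double thresholdMs;    // negative means "use the global threshold"
};

struct Spike {
    const void *function;
    double elapsedMs;
};

struct ThreadState {
    StackEntry stack[kMaxDepth];
    int depth = 0;              // may exceed kMaxDepth; deeper calls are not recorded
    std::vector<Spike> spikes;  // stand-in for the profiler log
};

static double g_globalThresholdMs = 1.0;
static thread_local ThreadState tls_state;

// Function entry: push an entry (increment the stack).
inline void LogEntry(const void *function, double nowMs) {
    ThreadState &s = tls_state;
    if (s.depth < kMaxDepth) {
        s.stack[s.depth] = StackEntry{function, nowMs, -1.0};
    }
    ++s.depth;
}

// Function exit: pop the entry and compare the elapsed time to its threshold.
inline void LogExit(double nowMs) {
    ThreadState &s = tls_state;
    assert(s.depth > 0);
    --s.depth;
    if (s.depth >= kMaxDepth) return;  // too deep, was never recorded
    const StackEntry &e = s.stack[s.depth];
    const double limit = (e.thresholdMs >= 0.0) ? e.thresholdMs : g_globalThresholdMs;
    const double elapsed = nowMs - e.entryMs;
    if (elapsed > limit) s.spikes.push_back(Spike{e.function, elapsed});
}

}  // namespace pet_sketch
```

Because the state is `thread_local`, no locking is needed in the hooks; the cost is one fixed-size stack per thread, which matches the small per-thread TLS cost mentioned in the notes.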
PET API / Basic Markup. PET_FUNCTION() PET_SECTION_MANUAL(name) PET_SetGlobalThresholdMSec(msecs) PET_SetFunctionThresholdMSec(msecs) PET_SetChildrenThresholdMSec(msecs) PET_SetAllParentsThresholdMSec(msecs) PET_FUNCTION is used for functions only and should be one of the first lines (if not the very first) in a function. Why thresholds? Let's say you set the global threshold to 1 ms to find any function that takes over 1 ms. Your Game::Tick() function for a game at 60 fps will take about 16.67 ms, so to avoid kicking a spike warning, you can insert a call that overrides the threshold for that function.
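PET API / Thresholds. Why thresholds? Let's say you set a global threshold of 1 ms to find any function that runs for more than 1 ms. For a game running at 60 fps, your Game::Tick() function will take about 16.67 ms, so to avoid triggering a spike warning you can insert the following markup into Game::Tick():
PET_FUNCTION();
PET_SetFunctionThresholdMSec(1000.0/MIN_FPS);
The PET Profiler will then not log a spike for your Game::Tick().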
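One way the function, children and global thresholds could be combined is sketched below. The lookup order (the function's own override first, then the nearest ancestor's children threshold, then the global default) is an assumption made for illustration; the talk does not spell out PET's exact rule. `Frame`, `FrameBudgetMs` and `EffectiveThresholdMs` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

constexpr double kUnset = -1.0;

// Hypothetical per-call overrides, as they might sit in a stack entry.
struct Frame {
    double functionThresholdMs = kUnset;  // e.g. set via PET_SetFunctionThresholdMSec
    double childrenThresholdMs = kUnset;  // e.g. set via PET_SetChildrenThresholdMSec
};

// Frame budget in milliseconds for a given frame rate (1000 / 60 is about 16.67).
constexpr double FrameBudgetMs(double fps) { return 1000.0 / fps; }

// Resolve the threshold for stack[depth - 1]: its own override first, then the
// nearest ancestor's children threshold, then the global default.
inline double EffectiveThresholdMs(const Frame *stack, int depth, double globalMs) {
    assert(depth > 0);
    if (stack[depth - 1].functionThresholdMs >= 0.0) {
        return stack[depth - 1].functionThresholdMs;
    }
    for (int i = depth - 2; i >= 0; --i) {
        if (stack[i].childrenThresholdMs >= 0.0) return stack[i].childrenThresholdMs;
    }
    return globalMs;
}
```

With a 1 ms global threshold, giving the Tick frame a function threshold of `FrameBudgetMs(60.0)` leaves its children on the 1 ms default, which is the behavior the slide describes.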
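PET Threshold Examples. Script functions (0.1 ms); global threshold for most game functions (1 ms); Tick function (1 frame = 16 ms); file I/O functions (2 seconds: imagine this bar stretching about the length of a football field). Bars are not to scale!!! PET can easily handle spikes occurring over a range of a factor of 20,000!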
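PET API / Messages. Set function description: PET_SetFunctionDescf(FMT,...) allocates a temporary (shared) string from a string pool and formats the description into it. If a spike occurs, the description is emitted to the log. Over-time (OT) message: PET_OTMessagef(FMT,...) logs an additional message if the function exceeds its threshold time (OT). It has lower overhead than PET_SetFunctionDescf(FMT,...) because the output string is only generated when the message is needed. Logging functions: NOTE: PET_SetFunctionDescf() costs about the same as sprintf() but is more flexible.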
PET API / Messages. Example:
void ExecuteScript(ScriptContext *sc)
{
    PET_FUNCTION();
    sc->ExecuteStep();
    PET_OTMessagef("Script function: %s", sc->GetCurrentFunctionName());
}
This will kick out a message with the script function only when the function takes too long. What do you do if GetCurrentFunctionName() is not valid after the ExecuteStep() call? Use PET_SetFunctionDescf() prior to the call. This example uses the example function name "ExecuteScript", but we actually used this at Netherrealm for script spikes in MK, and for locating Unreal Script and Kismet spikes when we helped on Batman, where we could show the script / function or object / action.
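The cost difference between the two message calls can be sketched as eager versus lazy formatting. This is a hypothetical illustration, not the PET API: `SetDescriptionf`, `OverTimeMessagef`, `g_formatCalls` and `g_log` are invented names, and the elapsed time and threshold are passed explicitly here, whereas PET_OTMessagef presumably reads them from its own stack.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

static int g_formatCalls = 0;  // counts how often the formatting cost is paid
static std::string g_log;      // stand-in for the profiler log

static std::string FormatV(const char *fmt, va_list args) {
    char buffer[256];
    std::vsnprintf(buffer, sizeof(buffer), fmt, args);
    ++g_formatCalls;
    return std::string(buffer);
}

// Eager: always formats (roughly the cost of sprintf), like a description set up front.
std::string SetDescriptionf(const char *fmt, ...) {
    va_list args;
    va_start(args, fmt);
    std::string text = FormatV(fmt, args);
    va_end(args);
    return text;
}

// Lazy: formats and logs only when the elapsed time exceeds the threshold.
void OverTimeMessagef(double elapsedMs, double thresholdMs, const char *fmt, ...) {
    if (elapsedMs <= thresholdMs) return;  // common case: no formatting at all
    va_list args;
    va_start(args, fmt);
    g_log += FormatV(fmt, args);
    g_log += '\n';
    va_end(args);
}
```

Note that in this function-based sketch the caller still evaluates the arguments (for example a call that fetches the script function name); only the string formatting is skipped on the fast path.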
PET API / Messages. Example:
void ExecuteScript(ScriptContext *sc)
{
    PET_FUNCTION();
    PET_SetFunctionThresholdMSec(0.1);
    PET_SetFunctionDescf("Script function: %s", sc->GetCurrentFunctionName());
    sc->ExecuteStep();
}
This will set an extra note for the function description at the cost of a sprintf(). It will work even if the value returned by GetCurrentFunctionName() changes during the call to ExecuteStep(). This markup description will be emitted on the logged stack if the function takes too long. It will also be emitted on the logged stack if any child function exceeds this threshold (or whatever threshold the child function sets). So it costs as much as a sprintf(), but in some cases it is more powerful than PET_OTMessagef(). Let's also say we want the entire script system to take well under 1 ms and that we want to flag any script call that takes over 0.1 ms. We can do that with one additional line of code. Obviously it is better to use a #define in the PET config file than a literal 0.1 for the threshold, too.
PET API / Conditional Control. Control whether profiling is active: PET_SetThreadActive() PET_FUNCTION_IGNORE() PET_FUNCTION_IGNORE_CHILDREN() PET_Pause() PET_Unpause() Conditional profiling (can allow "channels"): PET_FUNCTION_CONDITIONAL(cond) PET_FUNCTION_CONDITIONAL_PAUSED(cond) PET_FUNCTION_CONDITIONAL_IGNORE_CHILDREN(cond) PET_FUNCTION_CONDITIONAL_IGNORE(cond) NOTE: the all-caps PET_FUNCTION_XXX() macros are used in place of PET_FUNCTION(), while the mixed-case versions are additional calls.
PET Log Output Sample: XBOX 360 CAI (Class::Function)
PET: Function took too long: 11.302 MSec Thread: MainThread PEAK CHILD @ frame 2323
Function Name Trace
0) 0x82449eb4 - main
1) 0x824499c8 - GuardedMain
2) 0x82444c80 - FEngineLoop::Tick
3) 0x82440ddc - UMK9GameEngine::Tick
4) 0x82d6d39c - MKScriptVM::Tick
5) 0x82d6f2dc - MKListNoDestroy::ForEach
6) 0x82d6cd54 - MKScriptVM::Step
7) 0x82d690ac - _call_c_function
8) 0x82440ddc - CreateEnduranceOpponentPhase1
9) 0x82440ddc - CreatePlayerPhase1
10) 0x82440ddc - SpawnMKFGCharacterObj
11) 0x82440ddc - CreateMeshes
PET: Thread (MainThread) Function took too long: 14.499 MSec
PET: Thread (MainThread) Function took too long: 14.599 MSec
PET: Thread (MainThread) Function took too long: 15.097 MSec
!!!!!-----------------------------------------------------------!!!!!
Slow Script->C Call: _CreateEnduranceOpponentPhase1 takes 16.576 ms (0.995 frames) to execute
PET: Thread (MainThread) Function took too long: 15.377 MSec
7) 0x82d690ac - _call_c_function
PET: Thread (MainThread) Function took too long: 15.401 MSec
PET: Thread (MainThread) Function took too long: 15.490 MSec
PET: Thread (MainThread) Function took too long: 15.578 MSec
PET Log Output Sample: PS3 manual instrumentation. Note the file and line info; function names come from __FUNCTION__, so there is no Class:: prefix.
PET: Function took too long: 8.586 MSec Thread: MainThread PEAK CHILD @ frame 2422
Function Name Trace
0) Tick (LaunchEngineLoop.cpp - LINE: 1981)
1) Tick (MK9Game.cpp - LINE: 499)
2) Tick (ScriptCore.cpp - LINE: 908)
3) Step (ScriptCore.cpp - LINE: 440)
4) _call_c_function (ScriptCore.cpp - LINE: 4194)
5) CreateEnduranceOpponentPhase1 (FGPlayer.cpp - LINE: 881)
6) CreatePlayerPhase1 (FGPlayer.cpp - LINE: 527)
7) SpawnMKFGCharacterObj (FGPlayer.cpp - LINE: 243)
8) CreateMeshes (FGPlayer.cpp - LINE: 1088)
9) SetSkeletalMesh (UnSkeletalComponent.cpp - LINE: 3465)
PET: Thread (MainThread) Function took too long: 12.851 MSec
PET: Thread (MainThread) Function took too long: 15.429 MSec
PET: Thread (MainThread) Function took too long: 15.522 MSec
PET: Thread (MainThread) Function took too long: 15.869 MSec
!!!!!-----------------------------------------------------------!!!!!
Slow Script->C Call: _CreateEnduranceOpponentPhase1 takes 16.446 ms (0.987 frames) to execute
PET: Thread (MainThread) Function took too long: 16.015 MSec
PET: Thread (MainThread) Function took too long: 16.035 MSec
PET: Thread (MainThread) Function took too long: 16.200 MSec
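PET Profiler Notes. The API is optional (CAI works without the API). The API compiles to no-ops in release builds, so there is zero runtime code overhead when it is not in use. Manual instrumentation / alternate implementations have very low overhead: it depends on the amount of markup but is usually barely noticeable (compared with the 15-30% overhead of CAI).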
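PET Profiler Notes. Markup is not required for PET, but it gives you finer-grained control… especially in specific systems. Typical game markup for CAI involves adding only 30 to 50 lines of markup in total for best results. In CAI mode, the PET Profiler detects spikes and "suggests" lines for you to mark up. In manual mode, more markup gives better detail.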
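Practical Examples. Code optimization pass on Mortal Kombat 9 (MK9): just turn it on in the game and log to a file; generate a list of spiking functions; narrow the list down to "real" and "false" spikes; mark up the few "false" spikes with PET_Pause, PET_Ignore, or appropriate thresholds; take the list of "real" spikes and distribute it to different programmers for optimization. We did a big code optimization pass over a couple of weeks after CAI was first implemented, and it generated a list of slow functions.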
Practical Examples. Sometimes a run finds unusual and unexpected spikes. Loading routines had not been profiled as heavily as the in-game steady state at 60 fps (MK generally does not hit 60 fps while loading because of loading screens or movie decoding). Example: because of bad code, selecting the characters for a ladder could take nearly 1 second!!! We found a case of very poorly written code where selecting the characters for a ladder could take an extremely long time. In fact, the code was written so poorly that we jokingly calculated that we could make little model lottery balls, assign a character to each ball, then run a physics simulation that would pick the lottery balls, and it would run faster than the ladder selection code.
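Practical Examples. Tracking down specific problems. Batman: Arkham City uses streaming loads, and we were trying to locate the loads in several places. One of my colleagues had spent two days looking for one without success. I went to his office, turned on the PET Profiler, and found the exact code location within a few minutes on the first try. Alexander Barrentine went on to extol the virtues of our PET Profiler to some MS techies at an Xfest 360 devcon, and the people he talked to were not even aware that the _penter / _pexit feature worked on XBOX 360.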
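Practical Examples. …Tracking down specific problems. A "bog" caused by a fight move on VITA: first we tried manual instrumentation, then automatic instrumentation. Manual instrumentation was slow but effective, and was verified with RAZOR: divide and conquer, instrumenting the children of slow functions. Then automatic instrumentation: turn on CAI and perform the slow move; a set of functions is logged to the file (and can be marked up if desired); this locates the problem much faster (automatically). CAI for VITA came in fairly late in our project, when many of our optimizations had already been made. But the issues caught automatically covered all the cases found by careful profiling. ½ second = RAZOR + trigger to capture.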
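Practical Examples. Targeting script calls (*move to native code?): Unreal Script, MKScript, Kismet, and so on. Instrument the script execute function. Use PET_OTMessagef() or PET_SetFunctionDescf() for extra information. Why is this better than timing script calls? Automatic instrumentation mode pinpoints the address inside the child function and gives the script context.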
Contact: Adisak Pochanayon, Principal Software Engineer, Netherrealm Studios, adisak@wbgames.com. Any questions???
"Automatic" Profiling for Mortal Kombat. Code instrumentation: manual and automatic*. Creating a profiler. Mortal Kombat @ 60 fps. Detecting spikes.