Profiling Node.js Applications
Profiling a Node.js application involves measuring its performance by analyzing the CPU, memory, and other runtime metrics while the application is running. This helps in identifying bottlenecks, high CPU usage, memory leaks, or slow function calls that may impact the application's efficiency, responsiveness, and scalability.
There are many third-party tools available for profiling Node.js applications but, in many cases, the easiest option is to use the Node.js built-in profiler. The built-in profiler uses the profiler inside V8, which samples the stack at regular intervals during program execution. It records the results of these samples, along with important optimization events such as jit compiles, as a series of ticks:
code-creation,LazyCompile,0,0x2d5000a337a0,396,"bp native array.js:1153:16",0x289f644df68,~
code-creation,LazyCompile,0,0x2d5000a33940,716,"hasOwnProperty native v8natives.js:198:30",0x289f64438d0,~
code-creation,LazyCompile,0,0x2d5000a33c20,284,"ToName native runtime.js:549:16",0x289f643bb28,~
code-creation,Stub,2,0x2d5000a33d40,182,"DoubleToIStub"
code-creation,Stub,2,0x2d5000a33e00,507,"NumberToStringStub"
In the past, you needed the V8 source code to be able to interpret the ticks. Luckily, tools have been introduced since Node.js 4.4.0 that facilitate the consumption of this information without separately building V8 from source. Let's see how the built-in profiler can help provide insight into application performance.
To illustrate the use of the tick profiler, we will work with a simple Express application. Our application will have two handlers, one for adding new users to our system:
app.get('/newUser', (req, res) => {
  let username = req.query.username || '';
  const password = req.query.password || '';

  username = username.replace(/[^a-zA-Z0-9]/g, '');

  if (!username || !password || users[username]) {
    return res.sendStatus(400);
  }

  const salt = crypto.randomBytes(128).toString('base64');
  const hash = crypto.pbkdf2Sync(password, salt, 10000, 512, 'sha512');

  users[username] = { salt, hash };

  res.sendStatus(200);
});
and another for validating user authentication attempts:
app.get('/auth', (req, res) => {
  let username = req.query.username || '';
  const password = req.query.password || '';

  username = username.replace(/[^a-zA-Z0-9]/g, '');

  if (!username || !password || !users[username]) {
    return res.sendStatus(400);
  }

  const { salt, hash } = users[username];
  const encryptHash = crypto.pbkdf2Sync(password, salt, 10000, 512, 'sha512');

  if (crypto.timingSafeEqual(hash, encryptHash)) {
    res.sendStatus(200);
  } else {
    res.sendStatus(401);
  }
});
_Please note that these are NOT recommended handlers for authenticating users in your Node.js applications and are used purely for illustration purposes. You should not be trying to design your own cryptographic authentication mechanisms in general. It is much better to use existing, proven authentication solutions._
Now assume that we've deployed our application and users are complaining about high latency on requests. We can easily run the app with the built-in profiler:
NODE_ENV=production node --prof app.js
and put some load on the server using ab (ApacheBench):
curl -X GET "http://localhost:8080/newUser?username=matt&password=password"
ab -k -c 20 -n 250 "http://localhost:8080/auth?username=matt&password=password"
and get an ab output of:
Concurrency Level: 20
Time taken for tests: 46.932 seconds
Complete requests: 250
Failed requests: 0
Keep-Alive requests: 250
Total transferred: 50250 bytes
HTML transferred: 500 bytes
Requests per second: 5.33 [#/sec] (mean)
Time per request: 3754.556 [ms] (mean)
Time per request: 187.728 [ms] (mean, across all concurrent requests)
Transfer rate: 1.05 [Kbytes/sec] received
...
Percentage of the requests served within a certain time (ms)
50% 3755
66% 3804
75% 3818
80% 3825
90% 3845
95% 3858
98% 3874
99% 3875
100% 4225 (longest request)
From this output, we see that we're only managing to serve about 5 requests per second and that the average request takes just under 4 seconds round trip. In a real-world example, we could be doing lots of work in many functions on behalf of a user request but even in our simple example, time could be lost compiling regular expressions, generating random salts, generating unique hashes from user passwords, or inside the Express framework itself.
Since we ran our application using the --prof option, a tick file was generated in the same directory as your local run of the application. It should have the form isolate-0xnnnnnnnnnnnn-v8.log (where n is a digit).
In order to make sense of this file, we need to use the tick processor bundled with the Node.js binary. To run the processor, use the --prof-process flag:
node --prof-process isolate-0xnnnnnnnnnnnn-v8.log > processed.txt
Opening processed.txt in your favorite text editor will give you a few different types of information. The file is broken up into sections which are again broken up by language. First, we look at the summary section and see:
[Summary]:
ticks total nonlib name
79 0.2% 0.2% JavaScript
36703 97.2% 99.2% C++
7 0.0% 0.0% GC
767 2.0% Shared libraries
215 0.6% Unaccounted
This tells us that 97% of all samples gathered occurred in C++ code and that when viewing other sections of the processed output we should pay most attention to work being done in C++ (as opposed to JavaScript). With this in mind, we next find the [C++] section, which contains information about which C++ functions are taking the most CPU time, and see:
[C++]:
ticks total nonlib name
19557 51.8% 52.9% node::crypto::PBKDF2(v8::FunctionCallbackInfo<v8::Value> const&)
4510 11.9% 12.2% _sha1_block_data_order
3165 8.4% 8.6% _malloc_zone_malloc
We see that the top 3 entries account for 72.1% of CPU time taken by the program. From this output, we immediately see that at least 51.8% of CPU time is taken up by a function called PBKDF2, which corresponds to our hash generation from a user's password. However, it may not be immediately obvious how the lower two entries factor into our application (or if it is, we will pretend otherwise for the sake of example). To better understand the relationship between these functions, we will next look at the [Bottom up (heavy) profile] section, which provides information about the primary callers of each function. Examining this section, we find:
ticks parent name
19557 51.8% node::crypto::PBKDF2(v8::FunctionCallbackInfo<v8::Value> const&)
19557 100.0% v8::internal::Builtins::~Builtins()
19557 100.0% LazyCompile: ~pbkdf2 crypto.js:557:16
4510 11.9% _sha1_block_data_order
4510 100.0% LazyCompile: *pbkdf2 crypto.js:557:16
4510 100.0% LazyCompile: *exports.pbkdf2Sync crypto.js:552:30
3165 8.4% _malloc_zone_malloc
3161 99.9% LazyCompile: *pbkdf2 crypto.js:557:16
3161 100.0% LazyCompile: *exports.pbkdf2Sync crypto.js:552:30
Parsing this section takes a little more work than the raw tick counts above. Within each of the "call stacks" above, the percentage in the parent column tells you the percentage of samples for which the function in the row above was called by the function in the current row. For example, in the middle "call stack" above for _sha1_block_data_order, we see that _sha1_block_data_order occurred in 11.9% of samples, which we knew from the raw counts above. However, here, we can also tell that it was always called by the pbkdf2 function inside the Node.js crypto module. We see that, similarly, _malloc_zone_malloc was called almost exclusively by the same pbkdf2 function. Thus, using the information in this view, we can tell that our hash computation from the user's password accounts not only for the 51.8% from above but also for all CPU time in the top 3 most sampled functions, since the calls to _sha1_block_data_order and _malloc_zone_malloc were made on behalf of the pbkdf2 function.
At this point, it is very clear that the password-based hash generation should be the target of our optimization. Thankfully, you've fully internalized the benefits of asynchronous programming and you realize that the work to generate a hash from the user's password is being done in a synchronous way, thus tying down the event loop. This prevents us from working on other incoming requests while computing a hash.
To remedy this issue, you make a small modification to the above handlers to use the asynchronous version of the pbkdf2 function:
app.get('/auth', (req, res) => {
  let username = req.query.username || '';
  const password = req.query.password || '';

  username = username.replace(/[^a-zA-Z0-9]/g, '');

  if (!username || !password || !users[username]) {
    return res.sendStatus(400);
  }

  crypto.pbkdf2(
    password,
    users[username].salt,
    10000,
    512,
    'sha512',
    (err, hash) => {
      if (users[username].hash.toString() === hash.toString()) {
        res.sendStatus(200);
      } else {
        res.sendStatus(401);
      }
    }
  );
});
A new run of the ab benchmark above with the asynchronous version of your app yields:
Concurrency Level: 20
Time taken for tests: 12.846 seconds
Complete requests: 250
Failed requests: 0
Keep-Alive requests: 250
Total transferred: 50250 bytes
HTML transferred: 500 bytes
Requests per second: 19.46 [#/sec] (mean)
Time per request: 1027.689 [ms] (mean)
Time per request: 51.384 [ms] (mean, across all concurrent requests)
Transfer rate: 3.82 [Kbytes/sec] received
...
Percentage of the requests served within a certain time (ms)
50% 1018
66% 1035
75% 1041
80% 1043
90% 1049
95% 1063
98% 1070
99% 1071
100% 1079 (longest request)
Yay! Your app is now serving about 20 requests per second, roughly 4 times more than it was with the synchronous hash generation. Additionally, the average latency is down from the 4 seconds before to just over 1 second.
Hopefully, through the performance investigation of this (admittedly contrived) example, you've seen how the V8 tick processor can help you gain a better understanding of the performance of your Node.js applications.
You may also find it helpful to learn how to create a flame graph.