Don't Block the Event Loop (or the Worker Pool)
Should you read this guide?
If you're writing anything more complicated than a brief command-line script, reading this should help you write higher-performance, more-secure applications.
This document is written with Node.js servers in mind, but the concepts apply to complex Node.js applications as well. Where OS-specific details vary, this document is Linux-centric.
Summary
Node.js runs JavaScript code in the Event Loop (initialization and callbacks), and offers a Worker Pool to handle expensive tasks like file I/O. Node.js scales well, sometimes better than more heavyweight approaches like Apache. The secret to the scalability of Node.js is that it uses a small number of threads to handle many clients. If Node.js can make do with fewer threads, then it can spend more of your system's time and memory working on clients rather than on paying space and time overheads for threads (memory, context-switching). But because Node.js has only a few threads, you must structure your application to use them wisely.
Here's a good rule of thumb for keeping your Node.js server speedy: Node.js is fast when the work associated with each client at any given time is "small".
This applies to callbacks on the Event Loop and tasks on the Worker Pool.
Why should I avoid blocking the Event Loop and the Worker Pool?
Node.js uses a small number of threads to handle many clients. In Node.js there are two types of threads: one Event Loop (aka the main loop, main thread, event thread, etc.), and a pool of k Workers in a Worker Pool (aka the threadpool).
If a thread is taking a long time to execute a callback (Event Loop) or a task (Worker), we call it "blocked". While a thread is blocked working on behalf of one client, it cannot handle requests from any other clients. This provides two motivations for blocking neither the Event Loop nor the Worker Pool:
- Performance: If you regularly perform heavyweight activity on either type of thread, the throughput (requests/second) of your server will suffer.
- Security: If it is possible that for certain input one of your threads might block, a malicious client could submit this "evil input", make your threads block, and keep them from working on other clients. This would be a Denial of Service attack.
A quick review of Node
Node.js uses the Event-Driven Architecture: it has an Event Loop for orchestration and a Worker Pool for expensive tasks.
What code runs on the Event Loop?
When they begin, Node.js applications first complete an initialization phase, require'ing modules and registering callbacks for events. Node.js applications then enter the Event Loop, responding to incoming client requests by executing the appropriate callback. This callback executes synchronously, and may register asynchronous requests to continue processing after it completes. The callbacks for these asynchronous requests will also be executed on the Event Loop.
The Event Loop will also fulfill the non-blocking asynchronous requests made by its callbacks, e.g., network I/O.
In summary, the Event Loop executes the JavaScript callbacks registered for events, and is also responsible for fulfilling non-blocking asynchronous requests like network I/O.
What code runs on the Worker Pool?
The Worker Pool of Node.js is implemented in libuv (docs), which exposes a general task submission API.
Node.js uses the Worker Pool to handle "expensive" tasks. This includes I/O for which an operating system does not provide a non-blocking version, as well as particularly CPU-intensive tasks.
These are the Node.js module APIs that make use of this Worker Pool:
- I/O-intensive
- CPU-intensive
In many Node.js applications, these APIs are the only sources of tasks for the Worker Pool. Applications and modules that use a C++ add-on can submit other tasks to the Worker Pool.
For the sake of completeness, we note that when you call one of these APIs from a callback on the Event Loop, the Event Loop pays some minor setup costs as it enters the Node.js C++ bindings for that API and submits a task to the Worker Pool. These costs are negligible compared to the overall cost of the task, which is why the Event Loop is offloading it. When submitting one of these tasks to the Worker Pool, Node.js provides a pointer to the corresponding C++ function in the Node.js C++ bindings.
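As a rough illustration of a CPU-intensive API that is handed to the Worker Pool, the sketch below wraps the asynchronous `crypto.pbkdf2()` in a Promise. The helper name `deriveKey` and the parameter values (100,000 iterations, a 32-byte key, SHA-256) are assumptions chosen for the example, not recommendations from this guide.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: derive a key without doing the hashing on the Event Loop.
function deriveKey(password, salt) {
  return new Promise((resolve, reject) => {
    // The asynchronous variant submits the hashing work as a Worker Pool task;
    // the callback runs on the Event Loop once that task has finished.
    crypto.pbkdf2(password, salt, 100000, 32, 'sha256', (err, key) => {
      if (err) reject(err);
      else resolve(key);
    });
  });
}

// While the task runs on a Worker, the Event Loop is free for other callbacks.
const order = [];
setImmediate(() => order.push('other callback'));
deriveKey('secret', 'salt').then((key) => {
  order.push('key ready');
  console.log(order, key.length);
});
```

The Event Loop only pays for submitting the task and for running the short completion callback; the expensive part happens on a Worker.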
How does Node.js decide what code to run next?
Abstractly, the Event Loop and the Worker Pool maintain queues for pending events and pending tasks, respectively.
In truth, the Event Loop does not actually maintain a queue. Instead, it has a collection of file descriptors that it asks the operating system to monitor, using a mechanism like epoll (Linux), kqueue (macOS), event ports (Solaris), or IOCP (Windows). These file descriptors correspond to network sockets, any files it is watching, and so on. When the operating system says that one of these file descriptors is ready, the Event Loop translates it to the appropriate event and invokes the callback(s) associated with that event. You can learn more about this process here.
In contrast, the Worker Pool uses a real queue whose entries are tasks to be processed. A Worker pops a task from this queue and works on it, and when finished the Worker raises an "At least one task is finished" event for the Event Loop.
What does this mean for application design?
In a one-thread-per-client system like Apache, each pending client is assigned its own thread. If a thread handling one client blocks, the operating system will interrupt it and give another client a turn. The operating system thus ensures that clients that require a small amount of work are not penalized by clients that require more work.
Because Node.js handles many clients with few threads, if a thread blocks handling one client's request, then pending client requests may not get a turn until the thread finishes its callback or task. The fair treatment of clients is thus the responsibility of your application. This means that you shouldn't do too much work for any client in any single callback or task.
This is part of why Node.js can scale well, but it also means that you are responsible for ensuring fair scheduling. The next sections talk about how to ensure fair scheduling for the Event Loop and for the Worker Pool.
Don't block the Event Loop
The Event Loop notices each new client connection and orchestrates the generation of a response. All incoming requests and outgoing responses pass through the Event Loop. This means that if the Event Loop spends too long at any point, all current and new clients will not get a turn.
You should make sure you never block the Event Loop. In other words, each of your JavaScript callbacks should complete quickly. This of course also applies to your await's, your Promise.then's, and so on.
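To make the effect of a slow callback concrete, here is a minimal sketch that blocks the Event Loop with a busy-wait and reports how late a 0 ms timer fires as a result. The helper name `measureTimerDelay` and the 100 ms figure are assumptions for illustration.

```javascript
// Hypothetical helper: block the Event Loop for `blockMs` milliseconds and
// resolve with the time that passed before a 0 ms timer was allowed to fire.
function measureTimerDelay(blockMs) {
  return new Promise((resolve) => {
    const scheduledAt = Date.now();
    setTimeout(() => resolve(Date.now() - scheduledAt), 0);

    // Synchronous busy-wait: no other callback can run until this loop ends.
    const end = Date.now() + blockMs;
    while (Date.now() < end) {
      // spin
    }
  });
}

measureTimerDelay(100).then((delayMs) => {
  console.log(`0 ms timer fired after about ${delayMs} ms`);
});
```

Every pending client is in the position of that timer: its callback cannot start until the current one returns.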
A good way to ensure this is to reason about the "computational complexity" of your callbacks. If your callback takes a constant number of steps no matter what its arguments are, then you'll always give every pending client a fair turn. If your callback takes a different number of steps depending on its arguments, then you should think about how long the arguments might be.
Example 1: A constant-time callback.
app.get('/constant-time', (req, res) => {
  res.sendStatus(200);
});
Example 2: An O(n) callback. This callback will run quickly for small n and more slowly for large n.
app.get('/countToN', (req, res) => {
  const n = req.query.n;
  // n iterations before giving someone else a turn
  for (let i = 0; i < n; i++) {
    console.log(`Iter ${i}`);
  }
  res.sendStatus(200);
});
Example 3: An O(n^2) callback. This callback will still run quickly for small n, but for large n it will run much more slowly than the previous O(n) example.
app.get('/countToN2', (req, res) => {
  const n = req.query.n;
  // n^2 iterations before giving someone else a turn
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      console.log(`Iter ${i}.${j}`);
    }
  }
  res.sendStatus(200);
});
How careful should you be?
Node.js uses the Google V8 engine for JavaScript, which is quite fast for many common operations. Exceptions to this rule are regexps and JSON operations, discussed below.
However, for complex tasks you should consider bounding the input and rejecting inputs that are too long. That way, even if your callback has large complexity, by bounding the input you ensure the callback cannot take more than the worst-case time on the longest acceptable input. You can then evaluate the worst-case cost of this callback and determine whether its running time is acceptable in your context.
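One way to bound the input is to validate it before any expensive work starts. The sketch below does this for a numeric query parameter such as the n used in the earlier examples; the helper name `parseBoundedCount` and the limit `MAX_N` are assumptions, and a real limit should come from measuring your own worst case.

```javascript
// Assumed limit for illustration; pick one based on your own measurements.
const MAX_N = 10000;

// Hypothetical helper: parse a query-string value and reject anything that is
// not an integer in [0, MAX_N]. Returns null for rejected input.
function parseBoundedCount(raw) {
  // At most five digits, so the check itself is cheap and linear.
  if (typeof raw !== 'string' || !/^\d{1,5}$/.test(raw)) {
    return null;
  }
  const n = Number(raw);
  return n <= MAX_N ? n : null;
}

console.log(parseBoundedCount('42'));      // 42
console.log(parseBoundedCount('1000000')); // null
console.log(parseBoundedCount('abc'));     // null
```

A handler would respond with an error status when the helper returns null, so the worst-case cost of the callback is the cost at `MAX_N`.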
Blocking the Event Loop: REDOS
One common way to block the Event Loop disastrously is by using a "vulnerable" regular expression.
Avoiding vulnerable regular expressions
A regular expression (regexp) matches an input string against a pattern. We usually think of a regexp match as requiring a single pass through the input string: O(n) time, where n is the length of the input string. In many cases, a single pass is indeed all it takes. Unfortunately, in some cases the regexp match might require an exponential number of trips through the input string: O(2^n) time. An exponential number of trips means that if the engine requires x trips to determine a match, it will need 2*x trips if we add only one more character to the input string. Since the number of trips is linearly related to the time required, the effect of this evaluation will be to block the Event Loop.
A vulnerable regular expression is one on which your regular expression engine might take exponential time, exposing you to REDOS (Regular expression Denial of Service) on "evil input". Whether or not your regular expression pattern is vulnerable (i.e. the regexp engine might take exponential time on it) is actually a difficult question to answer, and varies depending on whether you're using Perl, Python, Ruby, Java, JavaScript, etc., but here are some rules of thumb that apply across all of these languages:
- Avoid nested quantifiers like (a+)*. V8's regexp engine can handle some of these quickly, but others are vulnerable.
- Avoid OR's with overlapping clauses, like (a|a)*. Again, these are sometimes fast.
- Avoid using backreferences, like (a.*) \1. No regexp engine can guarantee evaluating these in linear time.
- If you're doing a simple string match, use indexOf or the local equivalent. It will be cheaper and will never take more than O(n).
If you aren't sure whether your regular expression is vulnerable, remember that Node.js generally doesn't have trouble reporting a match even for a vulnerable regexp and a long input string. The exponential behavior is triggered when there is a mismatch but Node.js can't be certain until it tries many paths through the input string.
A REDOS example
Here is an example vulnerable regexp exposing its server to REDOS:
app.get('/redos-me', (req, res) => {
  const filePath = req.query.filePath;
  // REDOS
  if (filePath.match(/(\/.+)+$/)) {
    console.log('valid path');
  } else {
    console.log('invalid path');
  }
  res.sendStatus(200);
});
The vulnerable regexp in this example is a (bad!) way to check for a valid path on Linux. It matches strings that are a sequence of "/"-delimited names, like "/a/b/c". It is dangerous because it violates the first rule above: it has a doubly-nested quantifier.
If a client queries with filePath ///.../\n (100 /'s followed by a newline character that the regexp's "." won't match), then the regexp match will take effectively forever, blocking the Event Loop. This client's REDOS attack causes all other clients not to get a turn until the regexp match finishes.
For this reason, you should be leery of using complex regular expressions to validate user input.
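As one possible alternative to the vulnerable regexp, the sketch below validates a "/"-delimited path with plain string operations, each of which makes a single pass over the input. The function name `isValidPath` and the limit `MAX_PATH_LENGTH` are assumptions, and the check is deliberately stricter than the regexp above (for example, it rejects empty names such as "//").

```javascript
// Assumed upper bound on path length, for illustration only.
const MAX_PATH_LENGTH = 4096;

// Hypothetical validator: accepts "/"-delimited names such as "/a/b/c".
// Each step is a single pass over the string, so the total cost is O(n).
function isValidPath(filePath) {
  if (typeof filePath !== 'string') return false;
  if (filePath.length === 0 || filePath.length > MAX_PATH_LENGTH) return false;
  if (filePath[0] !== '/') return false;

  // Drop the leading "/" and require every name to be non-empty
  // and free of newline characters.
  const names = filePath.slice(1).split('/');
  return names.every((name) => name.length > 0 && !name.includes('\n'));
}

console.log(isValidPath('/a/b/c'));               // true
console.log(isValidPath('/'.repeat(100) + '\n')); // false
```

The "evil input" from the example is rejected immediately, because there is no backtracking for it to trigger.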
Anti-REDOS Resources
There are some tools to check your regexps for safety. However, none of them will catch all vulnerable regexps.
Another approach is to use a different regexp engine. You could use the node-re2 module, which uses Google's blazing-fast RE2 regexp engine. But be warned, RE2 is not 100% compatible with V8's regexps, so check for regressions if you swap in the node-re2 module to handle your regexps. And particularly complicated regexps are not supported by node-re2.
If you're trying to match something "obvious", like a URL or a file path, find an example in a regexp library or use an npm module, e.g. ip-regex.
Blocking the Event Loop: Node.js core modules
Several Node.js core modules have synchronous expensive APIs, including the encryption, compression, file system, and child process modules.
These APIs are expensive, because they involve significant computation (encryption, compression), require I/O (file I/O), or potentially both (child process). These APIs are intended for scripting convenience, but are not intended for use in the server context. If you execute them on the Event Loop, they will take far longer to complete than a typical JavaScript instruction, blocking the Event Loop.
In a server, you should not use the following synchronous APIs from these modules:
- Encryption:
  - crypto.randomBytes (synchronous version)
  - crypto.randomFillSync
  - crypto.pbkdf2Sync
  - You should also be careful about providing large input to the encryption and decryption routines.
- Compression:
  - zlib.inflateSync
  - zlib.deflateSync
- File system:
  - Do not use the synchronous file system APIs. For example, if the file you access is in a distributed file system like NFS, access times can vary widely.
- Child process:
  - child_process.spawnSync
  - child_process.execSync
  - child_process.execFileSync
This list is reasonably complete as of Node.js v9.
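For most of the synchronous APIs listed above there is an asynchronous counterpart. As a small sketch, the code below uses the callback-based `zlib.deflate()` and `zlib.inflate()` (wrapped with `util.promisify`) in place of `zlib.deflateSync` and `zlib.inflateSync`. The helper name `roundTrip` is an assumption for the example.

```javascript
const zlib = require('node:zlib');
const { promisify } = require('node:util');

// Promise-returning wrappers around the asynchronous (callback) zlib APIs.
const deflate = promisify(zlib.deflate);
const inflate = promisify(zlib.inflate);

// Hypothetical helper: compress and then decompress a string without
// running the compression work on the Event Loop.
async function roundTrip(text) {
  const compressed = await deflate(Buffer.from(text, 'utf8'));
  const restored = await inflate(compressed);
  return restored.toString('utf8');
}

roundTrip('hello '.repeat(1000)).then((text) => {
  console.log(text.length); // 6000
});
```

The same substitution applies to the other modules in the list, for example `fs.promises.readFile()` instead of `fs.readFileSync()`.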
Blocking the Event Loop: JSON DOS
JSON.parse and JSON.stringify are other potentially expensive operations. While these are O(n) in the length of the input, for large n they can take surprisingly long.
If your server manipulates JSON objects, particularly those from a client, you should be cautious about the size of the objects or strings you work with on the Event Loop.
Example: JSON blocking. We create an object obj of size 2^21 and JSON.stringify it, run indexOf on the string, and then JSON.parse it. The JSON.stringify'd string is 50MB. It takes 0.7 seconds to stringify the object, 0.03 seconds to indexOf on the 50MB string, and 1.3 seconds to parse the string.
let obj = { a: 1 };
const iterations = 20;
// Expand the object exponentially by nesting it
for (let i = 0; i < iterations; i++) {
  obj = { obj1: obj, obj2: obj };
}
// Measure time to stringify the object
let start = process.hrtime();
const jsonString = JSON.stringify(obj);
let duration = process.hrtime(start);
console.log('JSON.stringify took', duration);
// Measure time to search a string within the JSON
start = process.hrtime();
const index = jsonString.indexOf('nomatch'); // Always -1
duration = process.hrtime(start);
console.log('String.indexOf took', duration);
// Measure time to parse the JSON back to an object
start = process.hrtime();
const parsedObj = JSON.parse(jsonString);
duration = process.hrtime(start);
console.log('JSON.parse took', duration);
There are npm modules that offer asynchronous JSON APIs. See for example:
- JSONStream, which has stream APIs.
- Big-Friendly JSON, which has stream APIs as well as asynchronous versions of the standard JSON APIs using the partitioning-on-the-Event-Loop paradigm outlined below.
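Independently of those modules, a simple defence is to cap the size of any JSON text before handing it to `JSON.parse`. The sketch below shows one way to do that; the helper name `parseBoundedJson`, the result shape, and the 1 MiB default are assumptions for illustration.

```javascript
// Assumed limit for illustration: 1 MiB of UTF-8 encoded JSON.
const MAX_JSON_BYTES = 1024 * 1024;

// Hypothetical helper: refuse to parse oversized JSON text.
// Returns { ok: true, value } or { ok: false, reason }.
function parseBoundedJson(text, maxBytes = MAX_JSON_BYTES) {
  if (Buffer.byteLength(text, 'utf8') > maxBytes) {
    return { ok: false, reason: 'too large' };
  }
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    return { ok: false, reason: 'invalid JSON' };
  }
}

console.log(parseBoundedJson('{"a":1}'));       // { ok: true, value: { a: 1 } }
console.log(parseBoundedJson('{"a":1}', 3).ok); // false
```

With a cap in place, the worst-case parse time is the time for the largest accepted input, which you can measure once and decide whether it is acceptable.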
Complex calculations without blocking the Event Loop
Suppose you want to do complex calculations in JavaScript without blocking the Event Loop. You have two options: partitioning or offloading.
Partitioning
You could partition your calculations so that each runs on the Event Loop but regularly yields (gives turns to) other pending events. In JavaScript it's easy to save the state of an ongoing task in a closure, as shown in example 2 below.
For a simple example, suppose you want to compute the average of the numbers 1 to n.
Example 1: Un-partitioned average, costs O(n)
let sum = 0;
// Sum the numbers 1 to n in a single, uninterrupted pass
for (let i = 1; i <= n; i++) {
  sum += i;
}
const avg = sum / n;
console.log('avg: ' + avg);
Example 2: Partitioned average, each of the n asynchronous steps costs O(1).
function asyncAvg(n, avgCB) {
  // Save ongoing sum in JS closure.
  let sum = 0;
  function help(i, cb) {
    sum += i;
    if (i == n) {
      cb(sum);
      return;
    }
    // "Asynchronous recursion".
    // Schedule next operation asynchronously.
    setImmediate(help.bind(null, i + 1, cb));
  }
  // Start the helper, with CB to call avgCB.
  help(1, function (sum) {
    const avg = sum / n;
    avgCB(avg);
  });
}
asyncAvg(n, function (avg) {
  console.log('avg of 1-n: ' + avg);
});
You can apply this principle to array iterations and so forth.
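For array iterations, yielding after every single element is usually more scheduling than needed. A common variation, sketched below, processes a fixed-size slice per turn and yields with `setImmediate()` between slices. The helper name `sumInChunks` and the default chunk size of 1000 are assumptions; the chunk size must be at least 1.

```javascript
// Hypothetical helper: sum an array in slices of `chunkSize` elements,
// yielding to the Event Loop between slices via setImmediate().
function sumInChunks(values, chunkSize = 1000) {
  return new Promise((resolve) => {
    let sum = 0;
    let index = 0;

    function processChunk() {
      const end = Math.min(index + chunkSize, values.length);
      for (; index < end; index++) {
        sum += values[index];
      }
      if (index < values.length) {
        setImmediate(processChunk); // let other pending events take a turn
      } else {
        resolve(sum);
      }
    }

    processChunk();
  });
}

const numbers = Array.from({ length: 10000 }, (_, i) => i + 1);
sumInChunks(numbers).then((sum) => {
  console.log('avg of 1-10000: ' + sum / numbers.length); // avg of 1-10000: 5000.5
});
```

Each turn now costs O(chunkSize) rather than O(1), so the chunk size is the knob that trades fairness to other clients against scheduling overhead.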
Offloading
If you need to do something more complex, partitioning is not a good option. This is because partitioning uses only the Event Loop, and you won't benefit from the multiple cores almost certainly available on your machine. Remember, the Event Loop should orchestrate client requests, not fulfill them itself. For a complicated task, move the work off of the Event Loop onto a Worker Pool.
How to offload
You have two options for a destination Worker Pool to which to offload work.
- You can use the built-in Node.js Worker Pool by developing a C++ add-on. On older versions of Node, build your C++ add-on using NAN, and on newer versions use N-API. node-webworker-threads offers a JavaScript-only way to access the Node.js Worker Pool.
- You can create and manage your own Worker Pool dedicated to computation rather than the Node.js I/O-themed Worker Pool. The most straightforward way to do this is using Child Process or Cluster.
You should not simply create a Child Process for every client. You can receive client requests more quickly than you can create and manage children, and your server might become a fork bomb.
Downside of offloading
The downside of the offloading approach is that it incurs overhead in the form of communication costs. Only the Event Loop is allowed to see the "namespace" (JavaScript state) of your application. From a Worker, you cannot manipulate a JavaScript object in the Event Loop's namespace. Instead, you have to serialize and deserialize any objects you wish to share. Then the Worker can operate on its own copy of these object(s) and return the modified object (or a "patch") to the Event Loop.
For serialization concerns, see the section on JSON DOS.
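The sketch below illustrates the serialization round trip with a single child Node.js process: the input travels as a command-line argument and the result comes back as JSON on stdout. The names `offloadSum` and `CHILD_SCRIPT` are assumptions, and spawning one child per call is done only to keep the example short; as noted above, a real server would reuse a bounded set of children instead.

```javascript
const { execFile } = require('node:child_process');

// Script run by the child: reads n from its last argument, sums 1..n,
// and prints the result as JSON on stdout.
const CHILD_SCRIPT = `
  const n = Number(process.argv[process.argv.length - 1]);
  let sum = 0;
  for (let i = 1; i <= n; i++) sum += i;
  console.log(JSON.stringify({ sum }));
`;

// Hypothetical helper: run the computation in a separate Node.js process.
// Input and output cross the process boundary as serialized text.
function offloadSum(n) {
  return new Promise((resolve, reject) => {
    execFile(process.execPath, ['-e', CHILD_SCRIPT, String(n)], (err, stdout) => {
      if (err) return reject(err);
      resolve(JSON.parse(stdout).sum);
    });
  });
}

offloadSum(1000000).then((sum) => {
  console.log(sum); // 500000500000
});
```

The loop runs entirely in the child, so the parent's Event Loop only pays for starting the process and parsing a small JSON reply.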
Some suggestions for offloading
You may wish to distinguish between CPU-intensive and I/O-intensive tasks because they have markedly different characteristics.
A CPU-intensive task only makes progress when its Worker is scheduled, and the Worker must be scheduled onto one of your machine's logical cores. If you have 4 logical cores and 5 Workers, one of these Workers cannot make progress. As a result, you are paying overhead (memory and scheduling costs) for this Worker and getting no return for it.
I/O-intensive tasks involve querying an external service provider (DNS, file system, etc.) and waiting for its response. While a Worker with an I/O-intensive task is waiting for its response, it has nothing else to do and can be de-scheduled by the operating system, giving another Worker a chance to submit its request. Thus, I/O-intensive tasks will be making progress even while the associated thread is not running. External service providers like databases and file systems have been highly optimized to handle many pending requests concurrently. For example, a file system will examine a large set of pending write and read requests to merge conflicting updates and to retrieve files in an optimal order.
If you rely on only one Worker Pool, e.g. the Node.js Worker Pool, then the differing characteristics of CPU-bound and I/O-bound work may harm your application's performance.
For this reason, you might wish to maintain a separate Computation Worker Pool.
Offloading: conclusions
For simple tasks, like iterating over the elements of an arbitrarily long array, partitioning might be a good option. If your computation is more complex, offloading is a better approach: the communication costs, i.e. the overhead of passing serialized objects between the Event Loop and the Worker Pool, are offset by the benefit of using multiple cores.
However, if your server relies heavily on complex calculations, you should think about whether Node.js is really a good fit. Node.js excels for I/O-bound work, but for expensive computation it might not be the best option.
If you take the offloading approach, see the section on not blocking the Worker Pool.
Don't block the Worker Pool
Node.js has a Worker Pool composed of k Workers. If you are using the Offloading paradigm discussed above, you might have a separate Computation Worker Pool, to which the same principles apply. In either case, let us assume that k is much smaller than the number of clients you might be handling concurrently. This is in keeping with the "one thread for many clients" philosophy of Node.js, the secret to its scalability.
As discussed above, each Worker completes its current Task before proceeding to the next one on the Worker Pool queue.
Now, there will be variation in the cost of the Tasks required to handle your clients' requests. Some Tasks can be completed quickly (e.g. reading short or cached files, or producing a small number of random bytes), and others will take longer (e.g. reading larger or uncached files, or generating more random bytes). Your goal should be to minimize the variation in Task times, and you should use Task partitioning to accomplish this.
Minimizing the variation in Task times
If a Worker's current Task is much more expensive than other Tasks, then it will be unavailable to work on other pending Tasks. In other words, each relatively long Task effectively decreases the size of the Worker Pool by one until it is completed. This is undesirable because, up to a point, the more Workers in the Worker Pool, the greater the Worker Pool throughput (tasks/second) and thus the greater the server throughput (client requests/second). One client with a relatively expensive Task will decrease the throughput of the Worker Pool, in turn decreasing the throughput of the server.
To avoid this, you should try to minimize variation in the length of Tasks you submit to the Worker Pool. While it is appropriate to treat the external systems accessed by your I/O requests (DB, FS, etc.) as black boxes, you should be aware of the relative cost of these I/O requests, and should avoid submitting requests you can expect to be particularly long.
Two examples should illustrate the possible variation in task times.
Variation example: Long-running file system reads
Suppose your server must read files in order to handle some client requests. After consulting the Node.js File system APIs, you opted to use fs.readFile() for simplicity. However, fs.readFile() before v10 was not partitioned: it submitted a single fs.read() Task spanning the entire file. If you read shorter files for some users and longer files for others, fs.readFile() may introduce significant variation in Task lengths, to the detriment of Worker Pool throughput.
For a worst-case scenario, suppose an attacker can convince your server to read an arbitrary file (this is a directory traversal vulnerability). If your server is running Linux, the attacker can name an extremely slow file: /dev/random. For all practical purposes, /dev/random is infinitely slow, and every Worker asked to read from /dev/random will never finish that Task. An attacker then submits k requests, one for each Worker, and no other client requests that use the Worker Pool will make progress.
Variation example: Long-running crypto operations
Suppose your server generates cryptographically secure random bytes using crypto.randomBytes(). crypto.randomBytes() is not partitioned: it creates a single randomBytes() Task to generate as many bytes as you requested. If you create fewer bytes for some users and more bytes for others, crypto.randomBytes() is another source of variation in Task lengths.
Task partitioning
Tasks with variable time costs can harm the throughput of the Worker Pool. To minimize variation in Task times, as far as possible you should partition each Task into comparable-cost sub-Tasks. When each sub-Task completes it should submit the next sub-Task, and when the final sub-Task completes it should notify the submitter.
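Applied to the crypto.randomBytes() example, partitioning could look like the sketch below: a large request is split into sub-Tasks of bounded size, each submitted only after the previous one completes. The helper name `randomBytesPartitioned` and the 4096-byte default are assumptions for illustration.

```javascript
const crypto = require('node:crypto');
const { promisify } = require('node:util');

const randomBytes = promisify(crypto.randomBytes);

// Hypothetical helper: produce `total` random bytes as a series of
// sub-Tasks of at most `chunkSize` bytes each.
async function randomBytesPartitioned(total, chunkSize = 4096) {
  const chunks = [];
  for (let remaining = total; remaining > 0; remaining -= chunkSize) {
    // Each await submits one bounded sub-Task; the next one is submitted
    // only after the previous one has completed.
    chunks.push(await randomBytes(Math.min(chunkSize, remaining)));
  }
  // The resolved Promise plays the role of notifying the submitter.
  return Buffer.concat(chunks, total);
}

randomBytesPartitioned(10000).then((bytes) => {
  console.log(bytes.length); // 10000
});
```

Between two sub-Tasks of a large request, a Worker is free to pick up Tasks belonging to other clients.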
To continue the fs.readFile() example, you should instead use fs.read() (manual partitioning) or ReadStream (automatically partitioned).
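A minimal sketch of the ReadStream option follows. It reads a file as a sequence of bounded chunks and additionally stops once a size limit is exceeded, so one oversized file cannot occupy the pool indefinitely. The helper name `readFileBounded`, the 64 KiB chunk size, and the 1 MiB limit in the demo are assumptions for the example.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: read a file through a ReadStream, which issues a
// series of bounded reads, and give up once `maxBytes` has been exceeded.
function readFileBounded(filePath, maxBytes) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    let received = 0;
    const stream = fs.createReadStream(filePath, { highWaterMark: 64 * 1024 });

    stream.on('data', (chunk) => {
      received += chunk.length;
      if (received > maxBytes) {
        // Destroying with an error emits 'error', which rejects the Promise.
        stream.destroy(new Error('file exceeds size limit'));
        return;
      }
      chunks.push(chunk);
    });
    stream.on('error', reject);
    stream.on('end', () => resolve(Buffer.concat(chunks)));
  });
}

// Demo: write a temporary file, read it back in chunks, then clean up.
async function demo() {
  const demoFile = path.join(os.tmpdir(), `bounded-read-demo-${process.pid}.txt`);
  await fs.promises.writeFile(demoFile, 'x'.repeat(200000));
  try {
    const contents = await readFileBounded(demoFile, 1024 * 1024);
    console.log(contents.length); // 200000
  } finally {
    await fs.promises.unlink(demoFile);
  }
}

demo();
```

The size limit is a complement to partitioning, not a replacement for validating which paths a client is allowed to read.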
The same principle applies to CPU-bound tasks; the asyncAvg example might be inappropriate for the Event Loop, but it is well suited to the Worker Pool.
When you partition a Task into sub-Tasks, shorter Tasks expand into a small number of sub-Tasks, and longer Tasks expand into a larger number of sub-Tasks. Between each sub-Task of a longer Task, the Worker to which it was assigned can work on a sub-Task from another, shorter, Task, thus improving the overall Task throughput of the Worker Pool.
Note that the number of sub-Tasks completed is not a useful metric for the throughput of the Worker Pool. Instead, concern yourself with the number of Tasks completed.
Avoiding Task partitioning
Recall that the purpose of Task partitioning is to minimize the variation in Task times. If you can distinguish between shorter Tasks and longer Tasks (e.g. summing an array vs. sorting an array), you could create one Worker Pool for each class of Task. Routing shorter Tasks and longer Tasks to separate Worker Pools is another way to minimize Task time variation.
In favor of this approach, partitioning Tasks incurs overhead (the costs of creating a Worker Pool Task representation and of manipulating the Worker Pool queue), and avoiding partitioning saves you the costs of additional trips to the Worker Pool. It also keeps you from making mistakes in partitioning your Tasks.
The downside of this approach is that Workers in all of these Worker Pools will incur space and time overheads and will compete with each other for CPU time. Remember that each CPU-bound Task makes progress only while it is scheduled. As a result, you should only consider this approach after careful analysis.
Worker Pool: conclusions
Whether you use only the Node.js Worker Pool or maintain separate Worker Pool(s), you should optimize the Task throughput of your Pool(s).
To do this, minimize the variation in Task times by using Task partitioning.
The risks of npm modules
While the Node.js core modules offer building blocks for a wide variety of applications, sometimes something more is needed. Node.js developers benefit tremendously from the npm ecosystem, with hundreds of thousands of modules offering functionality to accelerate your development process.
Remember, however, that the majority of these modules are written by third-party developers and are generally released with only best-effort guarantees. A developer using an npm module should be concerned about two things, though the latter is frequently forgotten.
- Does it honor its APIs?
- Might its APIs block the Event Loop or a Worker? Many modules make no effort to indicate the cost of their APIs, to the detriment of the community.
For simple APIs you can estimate the cost of the APIs; the cost of string manipulation isn't hard to fathom. But in many cases it's unclear how much an API might cost.
If you are calling an API that might do something expensive, double-check the cost. Ask the developers to document it, or examine the source code yourself (and submit a PR documenting the cost).
Remember, even if the API is asynchronous, you don't know how much time it might spend on a Worker or on the Event Loop in each of its partitions. For example, suppose in the asyncAvg example given above, each call to the helper function summed half of the numbers rather than one of them. Then this function would still be asynchronous, but the cost of each partition would be O(n), not O(1), making it much less safe to use for arbitrary values of n.
Conclusion
Node.js has two types of threads: one Event Loop and k Workers. The Event Loop is responsible for JavaScript callbacks and non-blocking I/O, and a Worker executes tasks corresponding to C++ code that completes an asynchronous request, including blocking I/O and CPU-intensive work. Both types of threads work on no more than one activity at a time. If any callback or task takes a long time, the thread running it becomes blocked. If your application makes blocking callbacks or tasks, this can lead to degraded throughput (clients/second) at best, and complete denial of service at worst.
To write a high-throughput, more DoS-proof web server, you must ensure that on benign and on malicious input, neither your Event Loop nor your Workers will block.