跳到内容

追踪垃圾回收

🌐 Tracing garbage collection

本指南将介绍垃圾收集跟踪的基础知识。

🌐 This guide will go through the fundamentals of garbage collection traces.

在本指南结束时,你将能够:

🌐 By the end of this guide, you'll be able to:

  • 在你的 Node.js 应用中启用跟踪
  • 解释痕迹
  • 识别你 Node.js 应用中的潜在内存问题

关于垃圾回收器的工作原理有很多需要了解的地方,但如果你只学一件事,那就是当垃圾回收器运行时,你的代码是不会执行的。

🌐 There's a lot to learn about how the garbage collector works, but if you learn one thing it's that when GC is running, your code is not.

你可能想知道垃圾回收运行的频率和时长,以及结果如何。

🌐 You may want to know how often and long the garbage collection runs, and what is the outcome.

安装

🌐 Setup

对于本指南的提议,我们将使用以下脚本:

🌐 For the proposal of this guide, we'll use this script:

// script.mjs

import  from 'node:os';

let  = 1_000_000;
const  = new ();

function () {
  const  = {
    : .(),
    : .(),
    : .(),
    : .(),
  };

  .();
}

function () {
  .(`Total: ${.} entries`);
}

// execution
(() => {
  while ( > 0) {
    ();
    ..(`~~> ${} entries to record\r`);
    --;
  }

  ();
})();

即使泄漏在这里很明显,在现实应用中,找到泄漏源可能也很繁琐。

运行时带垃圾回收跟踪

🌐 Running with garbage collection traces

你可以使用 --trace-gc 标志在进程的控制台输出中查看垃圾回收的痕迹。

🌐 You can see traces for garbage collection in console output of your process using the --trace-gc flag.

$ node --trace-gc script.mjs

注意:你可以在 Node.js 诊断仓库中找到此 锻炼 的源代码。

它应该输出类似以下内容:

🌐 It should output something like:

[39067:0x158008000]     2297 ms: Scavenge 117.5 (135.8) -> 102.2 (135.8) MB, 0.8 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2375 ms: Scavenge 120.0 (138.3) -> 104.7 (138.3) MB, 0.9 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2453 ms: Scavenge 122.4 (140.8) -> 107.1 (140.8) MB, 0.7 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2531 ms: Scavenge 124.9 (143.3) -> 109.6 (143.3) MB, 0.7 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2610 ms: Scavenge 127.1 (145.5) -> 111.8 (145.5) MB, 0.7 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2688 ms: Scavenge 129.6 (148.0) -> 114.2 (148.0) MB, 0.8 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
[39067:0x158008000]     2766 ms: Scavenge 132.0 (150.5) -> 116.7 (150.5) MB, 1.1 / 0.0 ms  (average mu = 0.994, current mu = 0.994) allocation failure
Total: 1000000 entries

难以理解?也许我们应该回顾几个概念,并解释 --trace-gc 标志的输出。

🌐 Hard to read? Maybe we should pass in review a few concepts and explain the outputs of the --trace-gc flag.

使用 --trace-gc 检查跟踪

🌐 Examining a trace with --trace-gc

--trace-gc(或 --trace_gc,两者都可以)标志会在控制台输出所有垃圾回收事件。每行的组成可以描述如下:

🌐 The --trace-gc (or --trace_gc, either is fine) flag outputs all garbage collection events in the console. The composition of each line can be described as:

[13973:0x110008000]       44 ms: Scavenge 2.4 (3.2) -> 2.0 (4.2) MB, 0.5 / 0.0 ms  (average mu = 1.000, current mu = 1.000) allocation failure
Token valueInterpretation
13973PID of the running process
0x110008000Isolate (JS heap instance)
44 msThe time since the process started in ms
ScavengeType / Phase of GC
2.4Heap used before GC in MB
(3.2)Total heap before GC in MB
2.0Heap used after GC in MB
(4.2)Total heap after GC in MB
0.5 / 0.0 ms (average mu = 1.000, current mu = 1.000)Time spent in GC in ms
allocation failureReason for GC

我们只关注这里的两个事件:

🌐 We'll only focus on two events here:

  • 搜寻
  • 标记-清除

堆被划分为若干“空间”。其中,我们有一个称为“新生代”的空间,另一个称为“老年代”的空间。

🌐 The heap is divided into spaces. Amongst these, we have a space called the "new" space and another one called the "old" space.

👉 实际上,堆结构略有不同,但我们在本文中将坚持使用一个更简单的版本。如果你想了解更多细节,我们建议你查看关于 Orinoco 的 谈彼得·马歇尔

搜寻

🌐 Scavenge

Scavenge 是一种算法的名称,它将执行垃圾回收到新生代空间。新生代空间是对象创建的地方。新生代空间被设计得很小,以便快速进行垃圾回收。

🌐 Scavenge is the name of an algorithm that will perform garbage collection into new space. The new space is where objects are created. The new space is designed to be small and fast for garbage collection.

让我们想象一个 Scavenge 场景:

🌐 Let's imagine a Scavenge scenario:

  • 我们分配了 ABCD

    | A | B | C | D | <unallocated> |
    
  • 我们想分配 E

  • 空间不足,内存已耗尽

  • 然后,会触发一次(垃圾)回收

  • 无用对象被回收

  • 有生命的对象会停留

  • 假设 BD 已经死了

    | A | C | <unallocated> |
    
  • 现在我们可以分配 E

    | A | C | E | <unallocated> |
    

v8 会将对象从新生代提升到老生代,如果它们在经过两次 Scavenge 操作后仍未被垃圾回收。

🌐 v8 will promote objects, not garbage collected after two Scavenge operations to the old space.

👉 完整拾荒情景

标记-清除

🌐 Mark-sweep

标记-清除用于回收老年代的对象。老年代是那些在新生代中存活下来的对象所在的地方。

🌐 Mark-sweep is used to collect objects from old space. The old space is where objects that survived the new space are living.

该算法由两个阶段组成:

🌐 This algorithm is composed of two phases:

  • 标记:将仍然存活的对象标记为黑色,其他对象标记为白色。
  • 扫掠:扫描白色对象并将其转换为可通行的空间。

👉 事实上,标记和清除的步骤要复杂一些。 请阅读这个 文件 了解更多详细信息。

--trace-gc 的使用示例

🌐 --trace-gc in action

内存泄漏

🌐 Memory leak

现在,如果你快速返回到之前的终端窗口:你会在控制台中看到许多 Mark-sweep 事件。我们还可以看到,事件发生后回收的内存量微不足道。

🌐 Now, if you return quickly to the previous terminal window: you will see many Mark-sweep events in the console. We also see that the amount of memory collected after the event is insignificant.

既然我们现在是垃圾收集方面的专家了!我们能推断出什么呢?

🌐 Now that we are experts in garbage collection! What could we deduce?

我们可能有内存泄漏!但我们如何才能确定呢?(提醒:在这个例子中很明显,但在现实应用中又会怎样呢?)

🌐 We probably have a memory leak! But how could we be sure of that? (Reminder: it is pretty apparent in this example, but what about a real-world application?)

但我们如何发现上下文?

🌐 But how could we spot the context?

如何获取不良分配的上下文

🌐 How to get the context of bad allocations

  1. 假设我们观察到旧空间在不断增加。
  2. 减少 --max-old-space-size,使总堆积量更接近限制
  3. 运行程序直到内存耗尽。
  4. 生成的日志显示了失败的上下文。
  5. 如果出现内存耗尽(OOM),将堆大小增加约10%,并重复几次。如果观察到相同的模式,这表明存在内存泄漏。
  6. 如果没有发生内存溢出(OOM),则将堆大小固定为该值——压缩堆可以减少内存占用和计算延迟。

例如,尝试使用以下命令运行 script.mjs

🌐 For example, try to run script.mjs with the following command:

node --trace-gc --max-old-space-size=50 script.mjs

你应该会遇到 OOM:

🌐 You should experience an OOM:

[...]
<--- Last few GCs --->
[40928:0x148008000]      509 ms: Mark-sweep 46.8 (65.8) -> 40.6 (77.3) MB, 6.4 / 0.0 ms  (+ 1.4 ms in 11 steps since start of marking, biggest step 0.2 ms, walltime since start of marking 24 ms) (average mu = 0.977, current mu = 0.977) finalize incrementa[40928:0x148008000]      768 ms: Mark-sweep 56.3 (77.3) -> 47.1 (83.0) MB, 35.9 / 0.0 ms  (average mu = 0.927, current mu = 0.861) allocation failure scavenge might not succeed
<--- JS stacktrace --->
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory [...]

现在,尝试以 100mb 为单位:

🌐 Now, try to it for 100mb:

node --trace-gc --max-old-space-size=100 script.mjs

你应该会经历类似的情况,唯一的区别是最后一次 GC 跟踪中包含的堆大小会更大。

🌐 You should experience something similar, the only difference should be that the last GC trace will contain a bigger heap size.

<--- Last few GCs --->
[40977:0x128008000]     2066 ms: Mark-sweep (reduce) 99.6 (102.5) -> 99.6 (102.5) MB, 46.7 / 0.0 ms  (+ 0.0 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 47 ms) (average mu = 0.154, current mu = 0.155) allocati[40977:0x128008000]     2123 ms: Mark-sweep (reduce) 99.6 (102.5) -> 99.6 (102.5) MB, 47.7 / 0.0 ms  (+ 0.0 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 48 ms) (average mu = 0.165, current mu = 0.175) allocati

注意:在实际应用的上下文中,可能很难在代码中找到泄漏的对象。堆快照可以帮助你找到它。访问 专门针对堆快照的指南

缓慢

🌐 Slowness

你如何判断是否发生了过多的垃圾回收或导致了性能开销?

🌐 How do you assert whether too many garbage collections are happening or causing an overhead?

  1. 审查跟踪数据,特别是连续收集之间的时间。
  2. 审查跟踪数据,特别是与垃圾回收时间相关的部分。
  3. 如果两次垃圾回收之间的时间少于垃圾回收所花费的时间,应用将严重处于资源匮乏状态。
  4. 如果两次垃圾回收(GCS)之间的时间以及在垃圾回收中花费的时间都非常长,可能应用可以使用更小的堆。
  5. 如果两次垃圾回收(GC)之间的时间远大于进行垃圾回收所花的时间,则说明应用相对健康。

修复漏水

🌐 Fix the leak

现在让我们修复泄漏。与其用对象来存储条目,我们可以使用文件。

🌐 Now let's fix the leak. Instead of using an object to store our entries, we could use a file.

让我们稍微修改一下我们的脚本:

🌐 Let's modify our script a bit:

// script-fix.mjs
import  from 'node:fs/promises';
import  from 'node:os';

let  = 1_000_000;
const  = `entries-${.()}`;

async function () {
  const  = {
    : .(),
    : .(),
    : .(),
    : .(),
  };
  await .(, .() + '\n');
}

async function () {
  const  = await .();
  .(`File size ${.} bytes`);
}

// execution
(async () => {
  await .(, '----START---\n');
  while ( > 0) {
    await ();
    ..(`~~> ${} entries to record\r`);
    --;
  }

  await ();
})();

使用 Set 存储数据完全不是不好的做法;你只需要关注程序的内存占用情况。

🌐 Using a Set to store data is not a bad practice at all; you should just care about the memory footprint of your program.

注意:你可以在 Node.js 诊断仓库中找到此 锻炼 的源代码。

现在,让我们执行此脚本。

🌐 Now, let's execute this script.

node --trace-gc script-fix.mjs

你应该注意两件事:

🌐 You should observe two things:

  • 标记-清除事件出现得不那么频繁
  • 内存占用不超过25MB,而第一个脚本则超过130MB。

这是非常合理的,因为新版本对内存的压力比第一个版本小。

🌐 It makes a lot of sense as the new version puts less pressure on the memory than the first one.

要点:你觉得如何改进这个脚本? 你可能已经注意到,脚本的新版本运行得很慢。 如果我们再次使用 Set,并且只有在内存达到特定大小时才将其内容写入文件,会怎么样?

getheapstatistics API 可以帮助你。

额外内容:以编程方式跟踪垃圾回收

🌐 Bonus: Trace garbage collection programmatically

使用 v8 模块

🌐 Using v8 module

你可能想避免获取整个进程生命周期的跟踪记录。 在这种情况下,可以在进程内部设置标志。 v8 模块提供了一个用于即时设置标志的 API。

🌐 You might want to avoid getting traces from the entire lifetime of your process. In that case, set the flag from within the process. The v8 module exposes an API to put flags on the fly.

import  from 'v8';

// enabling trace-gc
.('--trace-gc');

// disabling trace-gc
.('--notrace-gc');

使用性能钩子

🌐 Using performance hooks

在 Node.js 中,你可以使用 性能钩子 来跟踪垃圾回收。

🌐 In Node.js, you can use performance hooks to trace garbage collection.

const {  } = ('node:perf_hooks');

// Create a performance observer
const  = new ( => {
  const  = .()[0];
  /*
  The entry is an instance of PerformanceEntry containing
  metrics of a single garbage collection event.
  For example:
  PerformanceEntry {
    name: 'gc',
    entryType: 'gc',
    startTime: 2820.567669,
    duration: 1.315709,
    kind: 1
  }
  */
});

// Subscribe to notifications of GCs
.({ : ['gc'] });

// Stop subscription
.();

使用性能钩子检查跟踪

🌐 Examining a trace with performance hooks

你可以从 PerformanceObserver 的回调中以 PerformanceEntry 的形式获取 GC 统计信息。

🌐 You can get GC statistics as PerformanceEntry from the callback in PerformanceObserver.

例如:

🌐 For example:

{
  "name": "gc",
  "entryType": "gc",
  "startTime": 2820.567669,
  "duration": 1.315709,
  "kind": 1
}
属性解释
name性能条目的名称。
entryType性能条目的类型。
startTime高分辨率毫秒时间戳,标记性能条目的开始时间。
duration该条目耗时的总毫秒数。
kind发生的垃圾回收操作类型。
flags关于垃圾回收的其他信息。

欲了解更多信息,你可以参考关于性能钩子的文档

🌐 For more information, you can refer to the documentation about performance hooks.