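A production service suddenly hit OOM (OutOfMemoryError): the heap still had free space, but the process's memory kept growing until it was killed. This is a typical off-heap memory leak. This post records the full investigation and the fix.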
OOM categories
OOM does not only mean the heap is full. JVM memory is divided into several regions:
| Region | Cause of OOM | Symptom |
| --- | --- | --- |
| Java Heap | Object allocation exceeds -Xmx | java.lang.OutOfMemoryError: Java heap space |
| Metaspace | Too many classes loaded | java.lang.OutOfMemoryError: Metaspace |
| Direct Memory | NIO direct memory | java.lang.OutOfMemoryError: Direct buffer memory |
| Stack | Thread stack too deep | java.lang.StackOverflowError |
| Native Heap | JNI/native code | System-level OOM |
This time the problem was direct buffer memory.
Symptoms
```text
# Monitoring alert
[ALERT] Service memory usage above 90%
Process: java (pid 12345)
Memory: RSS 8GB / 8GB

# k8s event
Warning Evicted Pod was evicted due to node memory pressure

# dmesg
Out of memory: Kill process 12345 (java) score 861 or sacrifice child
```
After startup, memory grows gradually. GC behaves normally, but the process RSS keeps rising until the OOM Killer kills the process.
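The gap between heap usage and RSS can also be observed from inside the JVM. The following is a minimal sketch, assuming a Linux host where `/proc/self/status` is readable; the class `RssProbe` and its method names are illustrative and not part of the original service.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

/** Hypothetical helper: compares JVM heap usage with process RSS (Linux only). */
final class RssProbe {

    /** Extracts the VmRSS value in kB from /proc/<pid>/status lines, or -1 if absent. */
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // The line looks like "VmRSS:     123456 kB".
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    /** Reads the current process RSS in kB; returns -1 where /proc is unavailable. */
    static long currentRssKb() throws IOException {
        Path status = Paths.get("/proc/self/status");
        if (!Files.isReadable(status)) {
            return -1;
        }
        return parseVmRssKb(Files.readAllLines(status, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        long heapUsedKb =
            ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed() / 1024;
        long rssKb = currentRssKb();
        // A large and growing gap between RSS and heap points at off-heap usage.
        System.out.println("heap used (kB): " + heapUsedKb);
        System.out.println("process RSS (kB): " + rssKb);
    }
}
```

Logging both numbers periodically makes the pattern described above visible: a flat heap line next to a steadily rising RSS line.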
Getting a heap dump
Case 1: the service is still running
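```bash
# Option 1: dump the heap directly with jmap
jmap -dump:format=b,file=heap.hprof <pid>

# Option 2: take a core dump with gcore, then extract the heap from it
# (JDK 8 jmap syntax: pass the java executable that produced the core, then the core file)
gcore <pid>
jmap -dump:format=b,file=heap.hprof /path/to/java core.<pid>
```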
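When attaching `jmap` from outside is inconvenient, a HotSpot JVM can also write its own heap dump through `com.sun.management.HotSpotDiagnosticMXBean`. The sketch below assumes a HotSpot-based JDK; the class `HeapDumper` and the temp-directory location are illustrative choices, not something the original service used.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: triggers a heap dump from inside a HotSpot JVM. */
final class HeapDumper {

    /**
     * Writes a heap dump into a fresh temp directory and returns its path.
     * liveOnly = true dumps only reachable objects.
     */
    static Path dumpToTempFile(boolean liveOnly) {
        try {
            // dumpHeap refuses to overwrite, so use a new directory each time.
            Path dir = Files.createTempDirectory("heapdump");
            Path target = dir.resolve("heap.hprof");
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(target.toAbsolutePath().toString(), liveOnly);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dump = dumpToTempFile(true);
        System.out.println("wrote " + dump + " (" + dump.toFile().length() + " bytes)");
    }
}
```

The resulting `.hprof` file opens in MAT like one produced by `jmap`. As with `jmap`, it only covers the Java heap, so it helps rule heap leaks in or out rather than showing off-heap usage directly.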
Case 2: automatic dump on OOM
```text
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/data/logs/heap.hprof
```
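Case 3: the service has already crashed
If the process was killed before a dump could be taken, look for JVM crash logs (the JVM writes hs_err files only on its own fatal errors; a SIGKILL from the OOM Killer leaves none, so an empty result is expected in that case):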
```bash
ls -la /data/logs/hs_err_*.log
```
Analysis with MAT
Installation and launch
```bash
./MemoryAnalyzer -vm /path/to/java
```
Key views
1. Histogram
Shows object counts and memory usage per class:
```text
Class Name              | Objects   | Shallow Heap
------------------------|-----------|-------------
java.lang.String        | 1,234,567 | 49,382,680
java.util.HashMap$Node  | 2,345,678 | 74,261,000
java.util.ArrayList     | 500,000   | 12,000,000
...
```
2. Dominator Tree
Shows which objects retain the most memory, and through which reference path:
```text
Path to GC Roots: 524,288,000 (42.5%)
|
+--- com.example.CacheManager
|    +--- Map<CacheKey, CacheEntry> map = 500MB
|    |    +--- 1,000,000 CacheEntry objects
```
3. Leak Suspects
MAT automatically reports likely leak points:
```text
One instance of "com.example.CacheManager" loaded by
"sun.misc.Launcher$AppClassLoader" occupies 524,288,000 (42.5%) bytes.

Keywords:
com.example.CacheManager
sun.misc.Launcher$AppClassLoader
```
Common leak scenarios
Scenario 1: unbounded cache growth
```java
@Service
public class UserService {
    // Plain HashMap: entries are added but never evicted.
    // (The userMapper field is not shown in the original snippet.)
    private Map<Long, User> userCache = new HashMap<>();

    public User getUser(Long id) {
        if (!userCache.containsKey(id)) {
            userCache.put(id, userMapper.selectById(id));
        }
        return userCache.get(id);
    }
}
```
Problem: userCache only grows and never evicts, so memory keeps increasing.
Fix: use a size-bounded cache with expiry:
```java
@Service
public class UserService {
    private LoadingCache<Long, User> userCache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .expireAfterWrite(10, TimeUnit.MINUTES)
        .build(id -> userMapper.selectById(id));
}
```
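Where adding Caffeine is not an option, the size bound alone can be approximated with the standard library. The sketch below is a different, simpler technique (an LRU map built on `LinkedHashMap.removeEldestEntry`) and has no time-based expiry; the class `BoundedCache` and its loader parameter are illustrative names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical size-bounded LRU cache; methods are synchronized for thread safety. */
final class BoundedCache<K, V> {
    private final int maxEntries;
    private final Function<K, V> loader;
    private final Map<K, V> entries;

    BoundedCache(int maxEntries, Function<K, V> loader) {
        this.maxEntries = maxEntries;
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Called after each put: evict once the bound is exceeded.
                return size() > BoundedCache.this.maxEntries;
            }
        };
    }

    /** Returns the cached value, loading and caching it on a miss. */
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, value);
        }
        return value;
    }

    /** Checks presence without touching the access order. */
    synchronized boolean contains(K key) {
        return entries.containsKey(key);
    }

    synchronized int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        BoundedCache<Long, String> cache = new BoundedCache<>(2, id -> "user-" + id);
        cache.get(1L);
        cache.get(2L);
        cache.get(1L); // touch 1 so that 2 becomes the eldest
        cache.get(3L); // exceeds the bound and evicts 2
        System.out.println("size: " + cache.size());
        System.out.println("contains 2: " + cache.contains(2L));
    }
}
```

The key point is the same as with Caffeine: the cache has an upper bound, so its memory footprint stops growing regardless of how many distinct keys are requested.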
Scenario 2: static collection holding objects
```java
public class StaticHolder {
    public static List<Object> list = new ArrayList<>();

    public static void add(Object obj) {
        list.add(obj);
    }
}
```
Fix: static, singleton-style holders need a cleanup mechanism, or should hold their entries through weak references:
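```java
public class StaticHolder {
    // WeakHashMap holds its keys weakly: an entry disappears once the key
    // is no longer strongly referenced elsewhere.
    public static Map<Object, Object> cache = Collections.synchronizedMap(
        new WeakHashMap<>()
    );
}
```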
Scenario 3: listener never unregistered
```java
@Service
public class EventService {
    @PostConstruct
    public void init() {
        eventBus.register(this);
    }

    // No matching unregister: the bus keeps this object reachable.
}
```
Fix: unregister in @PreDestroy:
```java
@Service
public class EventService {
    @PostConstruct
    public void init() {
        eventBus.register(this);
    }

    @PreDestroy
    public void destroy() {
        eventBus.unregister(this);
    }
}
```
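The reason this matters is that a bus holds strong references to everything registered with it. A minimal sketch, using a hypothetical `SimpleEventBus` rather than the bus from the original service:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical minimal event bus, only to show why unregister matters. */
final class SimpleEventBus {
    // Strong references: a registered listener cannot be collected.
    private final List<Object> listeners = new CopyOnWriteArrayList<>();

    void register(Object listener) {
        listeners.add(listener);
    }

    void unregister(Object listener) {
        listeners.remove(listener);
    }

    int listenerCount() {
        return listeners.size();
    }

    public static void main(String[] args) {
        SimpleEventBus bus = new SimpleEventBus();
        for (int i = 0; i < 1000; i++) {
            Object shortLived = new Object();
            bus.register(shortLived);
            // Without this call the bus would keep all 1000 objects reachable.
            bus.unregister(shortLived);
        }
        System.out.println("listeners still held: " + bus.listenerCount());
    }
}
```

Removing the `unregister` call in the loop leaves 1000 listeners in the list, which is the shape of leak that shows up in MAT as a large collection retained by the bus.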
NIO direct memory leak
This was the real cause of the incident.
Problem code
```java
@Service
public class FileUploadService {

    @Autowired
    private FileMapper fileMapper;

    public String upload(MultipartFile file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.getOriginalFilename(), "rw");
             FileChannel channel = raf.getChannel()) {

            // Maps the whole file into off-heap memory.
            MappedByteBuffer buffer = channel.map(
                FileChannel.MapMode.READ_WRITE, 0, file.getSize()
            );

            // ...
        }
    }
}
```
Analysis
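MappedByteBuffer uses off-heap memory: it maps an operating-system file into the process's address space, so the mapped pages count toward RSS rather than toward the Java heap. The problems:
- The files are large (several GB)
- Repeated uploads accumulate many off-heap buffers
- The GC does not manage off-heap memory directly; it is released only after the owning buffer object has been collected
- FileChannel.close() does not release the mapping immediately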
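The last point can be checked with a small experiment: a mapping stays valid after the channel that created it has been closed. The sketch below uses a temp file and a hypothetical class name `MappingOutlivesChannel`; it is a demonstration of the JDK behaviour, not code from the service.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical demo: a mapped buffer remains usable after its channel is closed. */
final class MappingOutlivesChannel {

    /** Maps a temp file, closes the file and channel, and returns the buffer. */
    static MappedByteBuffer mapThenClose(int sizeBytes) {
        try {
            Path tmp = Files.createTempFile("mapped", ".bin");
            tmp.toFile().deleteOnExit();
            MappedByteBuffer buffer;
            try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "rw");
                 FileChannel channel = raf.getChannel()) {
                buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, sizeBytes);
                buffer.put(0, (byte) 42);
            }
            // File and channel are closed here, but the mapping is still in place.
            return buffer;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MappedByteBuffer buffer = mapThenClose(4096);
        System.out.println("capacity after close: " + buffer.capacity());
        System.out.println("first byte after close: " + buffer.get(0));
    }
}
```

Because the buffer is still readable after the try-with-resources block has closed the channel, the mapped region is evidently not returned to the operating system at that point. With multi-GB uploads, each such region stays in the process until its buffer object is collected.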
Solution
```java
@Service
public class FileUploadService {

    public String upload(MultipartFile file) throws IOException {
        try (InputStream is = file.getInputStream();
             OutputStream os = new FileOutputStream(targetPath)) {

            // Stream through a small, fixed-size heap buffer instead of mapping the file.
            byte[] buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = is.read(buffer)) != -1) {
                os.write(buffer, 0, bytesRead);
            }
        }

        // ...
    }
}
```
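The same streaming copy can also be delegated to `java.nio.file.Files.copy`, which reads the stream in chunks internally. This is a sketch of an equivalent standard-library call, not the code that was deployed; the class `StreamingCopy` and the temp-file target are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: saves an upload stream to disk without mapping the file. */
final class StreamingCopy {

    /** Copies the stream to target and returns the number of bytes written. */
    static long save(InputStream in, Path target) {
        try {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("upload", ".bin");
        byte[] payload = new byte[1024 * 1024]; // stand-in for an uploaded file
        try (InputStream in = new ByteArrayInputStream(payload)) {
            long written = save(in, target);
            System.out.println("bytes written: " + written);
        } finally {
            Files.deleteIfExists(target);
        }
    }
}
```

In a Spring controller the stream would come from `file.getInputStream()`, as in the fixed service above. Either way, memory use per upload is bounded by a small buffer rather than by the file size.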
Monitoring direct memory
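```text
-XX:MaxDirectMemorySize=512m
-XX:+PrintGCDetails
-XX:+PrintGCApplicationStoppedTime
```

In code, through JMX:

```java
// Heap usage comes from MemoryMXBean. Direct and mapped buffers are reported
// by BufferPoolMXBean; getNonHeapMemoryUsage() covers Metaspace and the code
// cache, not direct buffers.
MemoryUsage heapUsage = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
System.out.println("Heap: " + heapUsage.getUsed());

for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
    System.out.println(pool.getName() + " memory: " + pool.getMemoryUsed());
}
```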
In practice: fixing a third-party SDK leak
Locating the problem
```java
@Service
public class ThirdPartyService {

    public void callApi(String param) {
        // A new client is created on every call and never closed.
        ThirdPartyClient client = new ThirdPartyClient();
        client.connect();
        // ...
    }
}
```
ThirdPartyClient uses NIO internally:
```java
public class ThirdPartyClient {
    // Every instance allocates 1 MB of direct (off-heap) memory.
    private ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 1024);

    public void connect() {
        // ...
    }
}
```
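The effect of one direct buffer per client instance can be measured with `BufferPoolMXBean`. The sketch below assumes a HotSpot JVM, where the relevant pool is named `direct`; `FakeClient` is a stand-in for the SDK class, and the clients are deliberately kept reachable to mimic buffers that have not been collected yet.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo: direct memory grows by 1 MB per client instance. */
final class PerCallAllocationDemo {

    /** Stand-in for the SDK client: each instance owns a 1 MB direct buffer. */
    static final class FakeClient {
        final ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 1024);
    }

    /** Bytes used by the "direct" buffer pool, or -1 if the pool is not exposed. */
    static long directPoolUsedBytes() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    /** Creates clients and returns them so that their buffers stay reachable. */
    static List<FakeClient> createClients(int count) {
        List<FakeClient> clients = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            clients.add(new FakeClient());
        }
        return clients;
    }

    public static void main(String[] args) {
        long before = directPoolUsedBytes();
        List<FakeClient> clients = createClients(10);
        long after = directPoolUsedBytes();
        System.out.println("clients created: " + clients.size());
        System.out.println("direct pool growth (bytes): " + (after - before));
    }
}
```

In the real service the clients become garbage after each call, but their direct buffers are only freed once the GC collects the small on-heap wrapper objects. With a lightly used heap that can take a long time, so the off-heap total keeps climbing.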
Solution
- Reuse the client:
```java
@Service
public class ThirdPartyService {

    // Not final: the field is assigned in @PostConstruct, not in a constructor.
    private ThirdPartyClient client;

    @PostConstruct
    public void init() {
        client = new ThirdPartyClient();
        client.connect();
    }

    public void callApi(String param) {
        client.invoke(param);
    }

    @PreDestroy
    public void destroy() {
        if (client != null) {
            client.close();
        }
    }
}
```
- If the SDK does not provide close, use reflection or a wrapper:
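```java
// Assumption: the original does not show its imports. Cleaner.create(Object, Runnable)
// matches sun.misc.Cleaner, an internal JDK 8 API.
public class ThirdPartyClientWrapper implements AutoCloseable {

    private final ThirdPartyClient client;
    private final Cleaner cleaner;

    public ThirdPartyClientWrapper() {
        this.client = new ThirdPartyClient();
        // Note: this::cleanup references the wrapper, which references the client,
        // so cleanup effectively runs only through an explicit close().
        this.cleaner = Cleaner.create(client, this::cleanup);
    }

    public void invoke(String param) {
        client.call(param);
    }

    private void cleanup() {
        // Release the SDK's resources here.
    }

    @Override
    public void close() {
        cleaner.clean();
    }
}
```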
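On JDK 9 and later, the supported way to express this pattern is `java.lang.ref.Cleaner`. The sketch below is an assumption about how the wrapper could look on a newer JDK, not the original implementation; `ManagedResource` and the counter standing in for "release the SDK's buffers" are hypothetical.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical wrapper showing the JDK 9+ java.lang.ref.Cleaner pattern. */
final class ManagedResource implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    /** Cleanup state must not reference the wrapper, or it could never become unreachable. */
    private static final class State implements Runnable {
        private final AtomicInteger releaseCount;

        State(AtomicInteger releaseCount) {
            this.releaseCount = releaseCount;
        }

        @Override
        public void run() {
            // Stand-in for releasing the SDK's native resources.
            releaseCount.incrementAndGet();
        }
    }

    private final Cleaner.Cleanable cleanable;

    ManagedResource(AtomicInteger releaseCount) {
        this.cleanable = CLEANER.register(this, new State(releaseCount));
    }

    @Override
    public void close() {
        // Runs State.run() at most once, even if close() is called repeatedly.
        cleanable.clean();
    }

    public static void main(String[] args) {
        AtomicInteger released = new AtomicInteger();
        try (ManagedResource resource = new ManagedResource(released)) {
            System.out.println("using " + resource.getClass().getSimpleName());
        }
        System.out.println("release count: " + released.get());
    }
}
```

Keeping the cleanup logic in a static nested class is the design choice that matters here: the action holds only what it needs to release, so the wrapper itself can become unreachable and the cleaner can still act as a safety net when `close()` is forgotten.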
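Summary
Key points when investigating off-heap memory leaks:
- Confirm the type: heap OOM or direct memory OOM
- Heap dump: analyze the dominator tree with MAT
- Common scenarios: caches, static collections, unregistered listeners
- NIO issues: MappedByteBuffer, direct ByteBuffer
- Monitoring: -XX:MaxDirectMemorySize, JMX MBeans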
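Priority of remedies:
- Fix the code (the root-cause fix)
- Limit memory size (stops the bleeding)
- Restart the service (temporary measure)
- Add nodes (mitigation)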