A thread pool causing java.lang.OutOfMemoryError

Jul 19, 2021 java, threads, thread pools

Problem description

A service in production kept throwing out-of-memory errors. The logs showed the following error:

java.lang.OutOfMemoryError: unable to create new native thread
    at java.lang.Thread.start0(Native Method) ~[?:1.8.0_112]
    at java.lang.Thread.start(Thread.java:714) ~[?:1.8.0_112]
    at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:950) ~[?:1.8.0_112]
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1357) ~[?:1.8.0_112]
    at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134) ~[?:1.8.0_112]

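This error usually has one of two causes:

There is not enough free system memory (native memory outside the Java heap, mostly needed for the new thread's stack) to create another thread.
The total number of threads has reached the limit allowed by the operating system.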

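After reading the code, I found that the developer had made the thread pool a local variable of a method. Because that method is called by a scheduled task, a new pool is instantiated every time the task runs. If those pools are never reclaimed, the total number of threads eventually hits the operating system's limit.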
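The pattern just described can be reproduced with a minimal sketch. Everything here is hypothetical: `LocalPoolLeak` and `scheduledJob` stand in for the real class and the method the scheduled task calls, and the pool size of 2 is arbitrary. The threads are made daemon threads only so that the demo JVM can exit; with the default thread factory the leak is the same. Thread counts come from `ThreadMXBean.getThreadCount()`.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

class LocalPoolLeak {

    // Daemon threads only so this demo JVM can exit; the leak is identical
    // with the default (non-daemon) thread factory.
    private static final ThreadFactory DAEMON_FACTORY = r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    };

    // Hypothetical stand-in for the method the scheduled task calls:
    // the pool is a local variable and is never shut down.
    static void scheduledJob() {
        ExecutorService pool = Executors.newFixedThreadPool(2, DAEMON_FACTORY);
        pool.submit(() -> { /* business logic */ });
        pool.submit(() -> { /* business logic */ });
        // When the method returns, 'pool' goes out of scope, but its 2 core
        // threads stay alive, blocked in workQueue.take(), and keep the pool reachable.
    }

    // Number of live threads in this JVM
    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    public static void main(String[] args) {
        int before = liveThreads();
        for (int run = 1; run <= 10; run++) {
            scheduledJob();
            System.out.println("after run " + run + ": +" + (liveThreads() - before) + " threads");
        }
    }
}
```

Each call leaves two more parked threads behind, so the count only ever grows; in the real service the scheduler keeps calling the method until thread creation fails.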

Analysis

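When an object instance is no longer used, or the method that created the pool has returned, when are the pool's threads released and the pool itself closed? This can differ between pool implementations, but it mainly depends on whether the pool has core threads.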

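If there are no core threads, as with newCachedThreadPool, each thread exits after it has been idle for 60 s. Once all threads have exited, nothing references the pool any more, so it is closed and garbage-collected as well.
If there are core threads, as with newSingleThreadExecutor and newFixedThreadPool, the core threads live forever unless the pool is shut down explicitly or the core threads are given a timeout, so the pool is never released.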
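The difference between the two kinds of pool is visible in their configuration. The sketch below casts the factory results to `ThreadPoolExecutor`, which relies on an OpenJDK implementation detail (the `Executors` Javadoc does not promise the concrete type); newSingleThreadExecutor is left out because it returns a wrapper that cannot be cast this way. `FactoryPoolSettings` is a hypothetical name.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class FactoryPoolSettings {

    // Prints the settings that decide whether idle threads can ever exit.
    // The cast assumes the OpenJDK implementation, where both factories
    // used below return a plain ThreadPoolExecutor.
    static ThreadPoolExecutor describe(String name, ExecutorService service) {
        ThreadPoolExecutor tpe = (ThreadPoolExecutor) service;
        System.out.printf("%s: core=%d, max=%d, keepAlive=%ds, coreTimeout=%b%n",
                name, tpe.getCorePoolSize(), tpe.getMaximumPoolSize(),
                tpe.getKeepAliveTime(TimeUnit.SECONDS), tpe.allowsCoreThreadTimeOut());
        return tpe;
    }

    public static void main(String[] args) {
        ExecutorService cached = Executors.newCachedThreadPool();
        ExecutorService fixed = Executors.newFixedThreadPool(4);
        describe("newCachedThreadPool", cached);  // no core threads, 60 s idle timeout
        describe("newFixedThreadPool(4)", fixed); // 4 core threads that never time out
        cached.shutdown();
        fixed.shutdown();
    }
}
```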

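The runWorker method of ThreadPoolExecutor, shown below, leaves its loop and reaches processWorkerExit (where the worker thread exits) in the following cases:

The pool has been shut down. shutdown moves the pool to SHUTDOWN and shutdownNow moves it to STOP; at STOP, runWorker also interrupts the worker. In both cases the worker's next getTask call returns null (at SHUTDOWN, once the queue is empty).
getTask returns a null task.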

final void runWorker(Worker w) {
    Thread wt = Thread.currentThread();
    Runnable task = w.firstTask;
    w.firstTask = null;
    w.unlock(); // allow interrupts
    boolean completedAbruptly = true;
    try {
        while (task != null || (task = getTask()) != null) {
            w.lock();
            // If pool is stopping, ensure thread is interrupted;
            // if not, ensure thread is not interrupted.  This
            // requires a recheck in second case to deal with
            // shutdownNow race while clearing interrupt
            if ((runStateAtLeast(ctl.get(), STOP) ||
                  (Thread.interrupted() &&
                  runStateAtLeast(ctl.get(), STOP))) &&
                !wt.isInterrupted())
                wt.interrupt();
            try {
                beforeExecute(wt, task);
                Throwable thrown = null;
                try {
                    task.run();
                } catch (RuntimeException x) {
                    thrown = x; throw x;
                } catch (Error x) {
                    thrown = x; throw x;
                } catch (Throwable x) {
                    thrown = x; throw new Error(x);
                } finally {
                    afterExecute(task, thrown);
                }
            } finally {
                task = null;
                w.completedTasks++;
                w.unlock();
            }
        }
        completedAbruptly = false;
    } finally {
        processWorkerExit(w, completedAbruptly);
    }
}
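The shutdown case can be checked from outside the pool. In this sketch (class and method names are hypothetical, and the cast assumes OpenJDK's newFixedThreadPool returns a `ThreadPoolExecutor`), a fixed pool whose two core threads have gone idle only reaches TERMINATED with zero workers after shutdown is called; without it, `awaitTermination` simply times out.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ShutdownExit {

    // Creates a fixed pool whose 2 core threads end up idle in getTask(),
    // optionally shuts it down, then reports whether the pool terminated
    // with no workers left.
    static boolean terminates(boolean callShutdown, long waitMillis) {
        // Cast assumes OpenJDK, where newFixedThreadPool returns a ThreadPoolExecutor
        ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(2);
        pool.execute(() -> { });
        pool.execute(() -> { });
        if (callShutdown) {
            // SHUTDOWN: idle workers are interrupted out of take(); every
            // worker's next getTask() returns null because the queue is empty,
            // so each one runs processWorkerExit
            pool.shutdown();
        }
        try {
            return pool.awaitTermination(waitMillis, TimeUnit.MILLISECONDS)
                    && pool.getPoolSize() == 0;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow(); // clean up the no-shutdown case so the JVM can exit
        }
    }

    public static void main(String[] args) {
        System.out.println("with shutdown:    " + terminates(true, 5000));
        System.out.println("without shutdown: " + terminates(false, 500));
    }
}
```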

For the second case, let's first look at the source of getTask.

private Runnable getTask() {
    boolean timedOut = false; // Did the last poll() time out?

    for (;;) {
        int c = ctl.get();
        int rs = runStateOf(c);

        // Check if queue empty only if necessary.
        if (rs >= SHUTDOWN && (rs >= STOP || workQueue.isEmpty())) {
            decrementWorkerCount();
            return null;
        }

        int wc = workerCountOf(c);

        // Are workers subject to culling?
        boolean timed = allowCoreThreadTimeOut || wc > corePoolSize;

        if ((wc > maximumPoolSize || (timed && timedOut))
            && (wc > 1 || workQueue.isEmpty())) {
            if (compareAndDecrementWorkerCount(c))
                return null;
            continue;
        }

        try {
            Runnable r = timed ?
                workQueue.poll(keepAliveTime, TimeUnit.NANOSECONDS) :
                workQueue.take();
            if (r != null)
                return r;
            timedOut = true;
        } catch (InterruptedException retry) {
            timedOut = false;
        }
    }
}

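From this method we can see:

If the current number of threads exceeds corePoolSize, the worker waits with poll and gets a null task once poll times out.
If the number of threads is at or below corePoolSize but allowCoreThreadTimeOut(true) has been called to let core threads time out, the worker also uses poll and gets a null task on timeout.
In all other cases it calls take and blocks.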
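These branches can be observed with a small `ThreadPoolExecutor`. The sizing (1 core thread, at most 3 threads, a 100 ms keepAliveTime, a queue of capacity 1) is an arbitrary choice that forces the pool to create 3 workers for 4 blocked tasks, and `GetTaskTimeout` is a hypothetical name. After the tasks finish and the pool sits idle past keepAliveTime, the extra workers have timed out in poll, while the core worker survives in take unless allowCoreThreadTimeOut(true) was called.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class GetTaskTimeout {

    // Grows a pool to 3 workers, lets it go idle past keepAliveTime,
    // and returns how many workers are still alive.
    static int poolSizeAfterIdle(boolean allowCoreTimeout) {
        // Arbitrary sizing: core 1, max 3, keepAlive 100 ms, queue capacity 1
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 3, 100, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(1));
        pool.allowCoreThreadTimeOut(allowCoreTimeout);
        CountDownLatch release = new CountDownLatch(1);
        Runnable blocked = () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        try {
            // Task 1 starts the core thread, task 2 fills the queue,
            // tasks 3 and 4 start two non-core threads
            for (int i = 0; i < 4; i++) {
                pool.execute(blocked);
            }
            release.countDown(); // all tasks finish; workers go back to getTask()
            Thread.sleep(1000);  // stay idle well past keepAliveTime
            return pool.getPoolSize();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Non-core workers time out in poll(); the core worker stays in take()
        System.out.println("default:                " + poolSizeAfterIdle(false));
        // With allowCoreThreadTimeOut(true) the core worker uses poll() too
        System.out.println("allowCoreThreadTimeOut: " + poolSizeAfterIdle(true));
    }
}
```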

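Taking a BlockingQueue as the task queue, it offers two ways to fetch a task:

poll, which accepts a timeout and returns null if no task arrives before it expires.
take, which blocks until a task is available.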
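Both methods can be tried on a `LinkedBlockingQueue` directly. In this sketch the class name, the timeouts and the producer thread are illustrative assumptions: poll gives up and returns null on an empty queue, while take keeps waiting until another thread puts an element.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class PollVsTake {

    // poll(timeout): waits at most the timeout, then returns null on an empty queue
    static Runnable pollEmpty(long timeoutMillis) {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        try {
            return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    // take(): no timeout; blocks until another thread supplies a task.
    // Returns roughly how many milliseconds the caller was blocked.
    static long takeBlockedMillis(long producerDelayMillis) {
        BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
        long start = System.nanoTime();
        Thread producer = new Thread(() -> {
            try {
                Thread.sleep(producerDelayMillis);
                queue.put(() -> { });
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        try {
            queue.take(); // would wait forever if nothing were ever put
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("poll on empty queue returned: " + pollEmpty(100));
        System.out.println("take blocked for about " + takeBlockedMillis(300) + " ms");
    }
}
```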

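So when there are no tasks, the core threads sit in getTask, blocked in the BlockingQueue's take method waiting for a task. A live thread is a GC root, and each worker holds a reference to its pool, so neither the core threads nor the pool itself are ever shut down or garbage-collected.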
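This idle state can be observed directly. The sketch below uses hypothetical names, and its custom `ThreadFactory` exists only to capture a reference to the worker: it runs one task on newFixedThreadPool(1) and then inspects the core thread, which is still alive and in the WAITING state, parked inside take.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

class IdleCoreThread {

    // Runs one task on newFixedThreadPool(1) and returns the state of the
    // core thread once the pool has nothing left to do.
    static Thread.State idleWorkerState() {
        AtomicReference<Thread> worker = new AtomicReference<>();
        ExecutorService pool = Executors.newFixedThreadPool(1, r -> {
            Thread t = new Thread(r);
            worker.set(t); // capture the core thread so it can be inspected
            return t;
        });
        try {
            pool.submit(() -> { }).get(); // the only task has finished
            Thread.sleep(200);            // let the worker loop back into getTask()
            // Still alive and parked in workQueue.take(); the live thread keeps
            // its Worker, and therefore the pool, reachable
            return worker.get().getState();
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            pool.shutdown(); // demo cleanup; without it the thread waits forever
        }
    }

    public static void main(String[] args) {
        System.out.println("idle core thread state: " + idleWorkerState());
    }
}
```

The finally block calls shutdown only so the demo can exit; without it, the thread stays WAITING for as long as the JVM runs.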

Summary

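Don't use a thread pool as a local variable; make it global instead. A pool is normally a static field of a class or a member of a singleton, since its lifecycle then matches that of the process.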
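A minimal sketch of that shape, assuming a hypothetical `ReportJob` class whose `runOnce` method is what the scheduler calls. The pool is a static final field built once (the same settings as newFixedThreadPool(4), written as a `ThreadPoolExecutor` so its size can be inspected), so repeated runs reuse the same four threads. The `shutdown` method is an assumed hook: a process-lifetime pool still needs to be shut down when the application stops.

```java
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class ReportJob {

    // Created once per class and shared by every run of the job.
    // Same settings as Executors.newFixedThreadPool(4); the size is arbitrary.
    private static final ThreadPoolExecutor POOL = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());

    // Hypothetical entry point called by the scheduler on every run
    static Future<?> runOnce(Runnable work) {
        return POOL.submit(work);
    }

    static int poolSize() {
        return POOL.getPoolSize();
    }

    // Hypothetical hook for application shutdown; the pool lives until then
    static void shutdown() {
        POOL.shutdown();
    }

    public static void main(String[] args) {
        for (int run = 0; run < 100; run++) {
            runOnce(() -> { /* business logic */ });
        }
        System.out.println("threads used for 100 runs: " + poolSize());
        shutdown();
    }
}
```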

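If the business scenario really requires a local pool and the pool has core threads, do two things to avoid leaking it:

Set a timeout for the core threads (allowCoreThreadTimeOut).
Shut the pool down explicitly with shutdown or shutdownNow.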
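Both measures can be combined when a local pool is unavoidable. In this sketch, `LocalPoolDoneRight`, `runJob` and the 30-second keepAliveTime are hypothetical. allowCoreThreadTimeOut(true) acts as a backstop that lets idle core threads exit even if some code path forgets to shut down, and shutdown in a finally block is the primary fix.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class LocalPoolDoneRight {

    // Hypothetical job that really does need its own short-lived pool
    static ThreadPoolExecutor runJob(Runnable... tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 30, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        // Measure 1: idle core threads exit after keepAliveTime (30 s here),
        // so even a forgotten shutdown cannot keep threads alive forever
        pool.allowCoreThreadTimeOut(true);
        try {
            for (Runnable task : tasks) {
                pool.execute(task);
            }
        } finally {
            // Measure 2: queued tasks still run, then every worker's getTask()
            // returns null and the pool terminates
            pool.shutdown();
        }
        return pool; // returned only so callers can inspect or await it
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = runJob(
                () -> System.out.println("task 1"),
                () -> System.out.println("task 2"));
        System.out.println("terminated: " + pool.awaitTermination(5, TimeUnit.SECONDS));
    }
}
```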

Example:

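import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class TestThread {

    public static void main(String[] args) {
        while (true) {
            // Simulates a scheduled job that creates a new pool on every run
            ExecutorService service = Executors.newFixedThreadPool(1);
            try {
                service.submit(() -> {
                    try {
                        Thread.sleep(2000); // simulated work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } finally {
                // Call shutdown to close the pool: the queued task still runs,
                // then the core thread exits instead of blocking in take() forever
                service.shutdown();
            }
            try {
                Thread.sleep(2000); // wait for the next "scheduled" run
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}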

References: