ablog

Notes from a clumsy, restless engineer

Reading the OpenJDK HotSpot source code

Just a quick memo.

  • Download OpenJDK

Download openjdk-7-fcs-src-b147-27_jun_2011.zip from http://download.java.net/openjdk/jdk7/.

  • Extract it
$ unzip openjdk-7-fcs-src-b147-27_jun_2011.zip
  • Take a look inside
$ cd openjdk
$ tree -d -L 2
.
├── corba
│   ├── make
│   └── src
├── hotspot
│   ├── agent
│   ├── make
│   ├── src
│   └── test
├── jaxp
│   ├── make
│   ├── nbproject
│   └── patches
├── jaxws
│   ├── make
│   ├── nbproject
│   └── patches
├── jdk
│   ├── make
│   ├── src
│   └── test
├── langtools
│   ├── make
│   ├── src
│   └── test
├── make
│   ├── scripts
│   └── templates
└── test

28 directories

$ ack pthread_cond_wait *
hotspot/src/os/linux/vm/os_linux.cpp
4852://      a timeout request to the chron thread and then blocking via pthread_cond_wait().
4933:        status = pthread_cond_wait(_cond, _mutex);
5194:    status = pthread_cond_wait (_cond, _mutex) ;

hotspot/src/os/linux/vm/jsig.c
71:      pthread_cond_wait(&cond, &mutex);

hotspot/src/os/solaris/vm/os_solaris.cpp
4810:      os::Solaris::set_cond_wait(CAST_TO_FN_PTR(int_fnP_cond_tP_mutex_tP, resolve_symbol("pthread_cond_wait")));
$ ack pthread_cond_timedwait *             
hotspot/src/os/linux/vm/os_linux.cpp
4303:      return pthread_cond_timedwait(_cond, _mutex, _abstime);
4306:      // 6292965: LinuxThreads pthread_cond_timedwait() resets FPU control
4311:      int status = pthread_cond_timedwait(_cond, _mutex, _abstime);
4823:// Beware -- Some versions of NPTL embody a flaw where pthread_cond_timedwait() can
4827:// Briefly, pthread_cond_timedwait() calls with an expiry time that's not in the future
4832:// calls to pthread_cond_timedwait().  Using LD_ASSUME_KERNEL to use an older version
4839://      between the call to compute_abstime() and pthread_cond_timedwait(), more
4850:// 3.   Embargo pthread_cond_timedwait() and implement a native "chron" thread
4851://      that manages timeouts.  We'd emulate pthread_cond_timedwait() by enqueuing
4854://      on certain platforms that don't handle lots of active pthread_cond_timedwait()
4858://      correctly from pthread_cond_timedwait(), but the condvar is left corrupt.
4861://      followed by cond_init() -- after all calls to pthread_cond_timedwait().
4863://      value from pthread_cond_timedwait().  In addition to reinitializing the

hotspot/src/os/solaris/vm/os_solaris.cpp
4809:      os::Solaris::set_cond_timedwait(CAST_TO_FN_PTR(int_fnP_cond_tP_mutex_tP_timestruc_tP, resolve_symbol("pthread_cond_timedwait")));

hotspot/src/share/vm/runtime/globals.hpp
1073:         " avoid NPTL-FUTEX hang pthread_cond_timedwait" )                  \

References

On the Linux implementation

According to the comments, there was a period when NPTL's pthread_cond_timedwait() was buggy (apparently it could hang forever).

The comment lists four possible remedies; the one currently in use is (4) (see the WorkAroundNPTLTimedWaitHang flag). A minimal sketch of that work-around follows the quoted comment below.
(Note that WorkAroundNTPLTimedWaitHang as spelled in the comment is a typo: the T and P are swapped.)

It also says that a further refinement would be to check the kernel and glibc (NPTL) versions in use and enable the work-around only for vulnerable versions.

    ((cite: hotspot/src/os/linux/vm/os_linux.cpp))
    // Refer to the comments in os_solaris.cpp park-unpark.
    //
    // Beware -- Some versions of NPTL embody a flaw where pthread_cond_timedwait() can
    // hang indefinitely.  For instance NPTL 0.60 on 2.4.21-4ELsmp is vulnerable.
    // For specifics regarding the bug see GLIBC BUGID 261237 :
    //    http://www.mail-archive.com/debian-glibc@lists.debian.org/msg10837.html.
    // Briefly, pthread_cond_timedwait() calls with an expiry time that's not in the future
    // will either hang or corrupt the condvar, resulting in subsequent hangs if the condvar
    // is used.  (The simple C test-case provided in the GLIBC bug report manifests the
    // hang).  The JVM is vulernable via sleep(), Object.wait(timo), LockSupport.parkNanos()
    // and monitorenter when we're using 1-0 locking.  All those operations may result in
    // calls to pthread_cond_timedwait().  Using LD_ASSUME_KERNEL to use an older version
    // of libpthread avoids the problem, but isn't practical.
    //
    // Possible remedies:
    //
    // 1.   Establish a minimum relative wait time.  50 to 100 msecs seems to work.
    //      This is palliative and probabilistic, however.  If the thread is preempted
    //      between the call to compute_abstime() and pthread_cond_timedwait(), more
    //      than the minimum period may have passed, and the abstime may be stale (in the
    //      past) resultin in a hang.   Using this technique reduces the odds of a hang
    //      but the JVM is still vulnerable, particularly on heavily loaded systems.
    //
    // 2.   Modify park-unpark to use per-thread (per ParkEvent) pipe-pairs instead
    //      of the usual flag-condvar-mutex idiom.  The write side of the pipe is set
    //      NDELAY. unpark() reduces to write(), park() reduces to read() and park(timo)
    //      reduces to poll()+read().  This works well, but consumes 2 FDs per extant
    //      thread.
    //
    // 3.   Embargo pthread_cond_timedwait() and implement a native "chron" thread
    //      that manages timeouts.  We'd emulate pthread_cond_timedwait() by enqueuing
    //      a timeout request to the chron thread and then blocking via pthread_cond_wait().
    //      This also works well.  In fact it avoids kernel-level scalability impediments
    //      on certain platforms that don't handle lots of active pthread_cond_timedwait()
    //      timers in a graceful fashion.
    //
    // 4.   When the abstime value is in the past it appears that control returns
    //      correctly from pthread_cond_timedwait(), but the condvar is left corrupt.
    //      Subsequent timedwait/wait calls may hang indefinitely.  Given that, we
    //      can avoid the problem by reinitializing the condvar -- by cond_destroy()
    //      followed by cond_init() -- after all calls to pthread_cond_timedwait().
    //      It may be possible to avoid reinitialization by checking the return
    //      value from pthread_cond_timedwait().  In addition to reinitializing the
    //      condvar we must establish the invariant that cond_signal() is only called
    //      within critical sections protected by the adjunct mutex.  This prevents
    //      cond_signal() from "seeing" a condvar that's in the midst of being
    //      reinitialized or that is corrupt.  Sadly, this invariant obviates the
    //      desirable signal-after-unlock optimization that avoids futile context switching.
    //
    //      I'm also concerned that some versions of NTPL might allocate an auxilliary
    //      structure when a condvar is used or initialized.  cond_destroy()  would
    //      release the helper structure.  Our reinitialize-after-timedwait fix
    //      put excessive stress on malloc/free and locks protecting the c-heap.
    //
    // We currently use (4).  See the WorkAroundNTPLTimedWaitHang flag.
    // It may be possible to refine (4) by checking the kernel and NTPL verisons
    // and only enabling the work-around for vulnerable environments.
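
So the adopted fix boils down to: after every pthread_cond_timedwait(), throw the condvar away and re-create it, so a corrupted condvar can never hang a later waiter. A minimal sketch of that idea in C (this is my own simplified wrapper, not the actual os::PlatformEvent code; timed_park, _cond and _mutex are illustrative names):

#include <pthread.h>
#include <time.h>

static pthread_mutex_t _mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  _cond  = PTHREAD_COND_INITIALIZER;

// Remedy (4): reinitialize the condvar after every pthread_cond_timedwait().
// For this to be safe, pthread_cond_signal() must only ever be called while
// _mutex is held, so no signaller can see a condvar mid-reinitialization.
int timed_park(const struct timespec *abstime)
{
    pthread_mutex_lock(&_mutex);
    int status = pthread_cond_timedwait(&_cond, &_mutex, abstime);
    pthread_cond_destroy(&_cond);    // discard the possibly corrupt condvar
    pthread_cond_init(&_cond, NULL); // start over with a fresh one
    pthread_mutex_unlock(&_mutex);
    return status;
}

As the comment notes, the signal-while-holding-the-mutex invariant gives up the signal-after-unlock optimization, and the destroy/init churn can stress malloc/free if the implementation allocates helper structures, which is why gating it on vulnerable versions is attractive.
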
The framework for Thread wait handling: implemented via ParkEvent (os::PlatformEvent)

Interesting.

Package: libc6
Version: 2.3.2.ds1-13
Severity: normal

pthread_cond_timedwait with a timespec in the past sometimes destroys the pthread_cond_t variable. The following pthread_cond_(timed)wait calls ignore all signals on the condition variable and block.

There is no Workaround, because I cannot guarantee that the timespec is really in the future (because the thread may be interrupted between gettimeofday and pthread_cond_timedwait).

The problem only occurs with NPTL (i.e. I can't reproduce the problem with LD_ASSUME_KERNEL=2.4.19).

The following program deadlocks after some iterations (10-1000).

*** /tmp/current/bug.c
// gcc -pthread -Wall -O2 -std=c99    bug.c   -o bug

#include <stdio.h>
#include <pthread.h>

pthread_mutex_t mutex;
pthread_cond_t empty;
pthread_cond_t nonempty;
unsigned count = 0;

void* run(void* env)
{
    pthread_mutex_lock(&mutex);
    struct timespec ts = { 0, 0 };
    for (;;) {
        count = 0;
        pthread_cond_signal(&empty);
        pthread_cond_timedwait(&nonempty, &mutex, &ts);
        count = 0;
        pthread_cond_signal(&empty);
        pthread_cond_wait(&nonempty, &mutex);
    }
}

int main()
{
    pthread_mutex_init(&mutex, 0);
    pthread_cond_init(&empty, 0);
    pthread_cond_init(&nonempty, 0);

    pthread_t thread;
    pthread_create(&thread, 0, &run, 0);

    pthread_mutex_lock(&mutex);
    for (unsigned i = 0;; ++i) {
        while (count > 0) {
            pthread_cond_wait(&empty, &mutex);
        }
        ++count;
        pthread_cond_signal(&nonempty);
        printf("%u\n", i);
    }
}
    • System Information:

Debian Release: testing/unstable
APT prefers unstable
APT policy: (500, 'unstable')
Architecture: i386 (i686)
Kernel: Linux 2.6.7-1-686-smp
Locale: LANG=C, [EMAIL PROTECTED]

Versions of packages libc6 depends on:
ii libdb1-compat 2.1.3-7 The Berkeley database routines [gl

Bug#261237: NPTL: pthread_cond_timedwait with timespec in the past destroys cond variable
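
The reporter's point about there being "no Workaround" is this race: the absolute deadline is computed from the current clock, and if the thread is preempted between reading the clock and calling pthread_cond_timedwait(), the deadline can already be in the past when the wait starts. A rough illustration (wait_with_timeout_ms is my own name; the bug report uses gettimeofday(), here I use clock_gettime() on the same CLOCK_REALTIME clock that a default condvar waits against):

#include <pthread.h>
#include <time.h>

// The caller must already hold *m, as pthread_cond_timedwait() requires.
int wait_with_timeout_ms(pthread_cond_t *cv, pthread_mutex_t *m, long ms)
{
    struct timespec abstime;
    clock_gettime(CLOCK_REALTIME, &abstime);   // read the clock ...
    abstime.tv_sec  += ms / 1000;
    abstime.tv_nsec += (ms % 1000) * 1000000L;
    if (abstime.tv_nsec >= 1000000000L) {
        abstime.tv_sec  += 1;
        abstime.tv_nsec -= 1000000000L;
    }
    // ... if the thread is preempted right here for longer than 'ms',
    // abstime is already in the past by the time the kernel sees it,
    // which is exactly the condition that triggers bug 261237.
    return pthread_cond_timedwait(cv, m, &abstime);
}
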

Hmm. Here is that test case again, saved as glibc_bug_261237.c:

#include <stdio.h>
#include <pthread.h>

pthread_mutex_t mutex;
pthread_cond_t empty;
pthread_cond_t nonempty;
unsigned count = 0;

void* run(void* env)
{
    pthread_mutex_lock(&mutex);
    struct timespec ts = { 0, 0 };
    for (;;) {
        count = 0;
        pthread_cond_signal(&empty);
        pthread_cond_timedwait(&nonempty, &mutex, &ts);
        count = 0;
        pthread_cond_signal(&empty);
        pthread_cond_wait(&nonempty, &mutex);
    }
}

int main()
{
    pthread_mutex_init(&mutex, 0);
    pthread_cond_init(&empty, 0);
    pthread_cond_init(&nonempty, 0);

    pthread_t thread;
    pthread_create(&thread, 0, &run, 0);

    pthread_mutex_lock(&mutex);
    for (unsigned i = 0;; ++i) {
        while (count > 0) {
            pthread_cond_wait(&empty, &mutex);
        }
        ++count;
        pthread_cond_signal(&nonempty);
        printf("%u\n", i);
    }
}

Check the glibc version:

$ /lib/libc.so.6
GNU C Library stable release version 2.12, by Roland McGrath et al.
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.4.7 20120313 (Red Hat 4.4.7-4).
Compiled on a Linux 2.6.32 system on 2013-11-20.
Available extensions:
	The C stubs add-on version 2.1.2.
	crypt add-on version 2.1 by Michael Glad and others
	GNU Libidn by Simon Josefsson
	Native POSIX Threads Library by Ulrich Drepper et al
	BIND-8.2.3-T5B
	RT using linux kernel aio
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.
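
The same information can also be read programmatically, which is roughly what the "check the kernel and NTPL versions" refinement in the HotSpot comment would need. A small sketch using glibc's gnu_get_libc_version() and confstr(_CS_GNU_LIBPTHREAD_VERSION) (both are real glibc facilities; the idea of printing them as a version gate is mine):

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/utsname.h>
#include <gnu/libc-version.h>

// Print the glibc release, the threading implementation (e.g. "NPTL 2.12")
// and the kernel release: the pieces a version-based work-around would check.
int main(void)
{
    char threads[64] = "unknown";
    confstr(_CS_GNU_LIBPTHREAD_VERSION, threads, sizeof(threads));

    struct utsname un;
    uname(&un);

    printf("glibc   : %s\n", gnu_get_libc_version());
    printf("threads : %s\n", threads);
    printf("kernel  : %s\n", un.release);
    return 0;
}

On this machine that should report glibc 2.12 with NPTL, well past the NPTL 0.60 era the HotSpot comment calls out as vulnerable.
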

Compile the test case:

$ gcc -pthread -Wall -O2 -std=c99 glibc_bug_261237.c -o glibc_bug_261237

Run it:

$ ./glibc_bug_261237 > /dev/null &
[1] 24235

First, with dstat:

$ dstat -tcf --top-cpu 5 
----system---- -------cpu0-usage--------------cpu1-usage--------------cpu2-usage--------------cpu3-usage------ -most-expensive-
  date/time   |usr sys idl wai hiq siq:usr sys idl wai hiq siq:usr sys idl wai hiq siq:usr sys idl wai hiq siq|  cpu process   
12-01 13:33:00| 18  21  60   0   0   0: 18  20  62   0   0   0: 15  27  58   0   0   0: 16  27  56   0   0   0|glibc_bug_2614.4
12-01 13:33:05|  5  12  83   0   0   0:  5   9  86   0   0   0: 10  50  39   0   0   1: 14  44  42   0   0   0|glibc_bug_261 30
12-01 13:33:10| 11  45  44   0   0   0: 13  39  48   0   0   0:  5  12  83   0   0   0:  4  12  84   0   0   0|glibc_bug_261 29
12-01 13:33:15| 12  52  36   0   0   1: 15  43  42   0   0   0:  4   9  86   0   0   0:  4   8  88   0   0   0|glibc_bug_261 29
12-01 13:33:20|  8  35  56   0   0   0: 10  29  61   0   0   0:  7  20  72   0   0   0:  6  22  72   0   0   0|glibc_bug_261 29

Watch it with top:

$ top -Hc -b -d 5 -p 24235
top - 13:14:50 up  3:05,  3 users,  load average: 1.42, 1.16, 1.06
Tasks:   2 total,   1 running,   1 sleeping,   0 stopped,   0 zombie
Cpu(s): 19.6%us, 21.7%sy,  0.3%ni, 57.8%id,  0.4%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:  16158544k total, 15997144k used,   161400k free,   225588k buffers
Swap:  8085500k total,        0k used,  8085500k free,  5224448k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                            
24235 yazekats  20   0 14464  316  248 R 63.9  0.0   7:24.87 ./glibc_bug_261237
24236 yazekats  20   0 14464  316  248 S 59.9  0.0   7:12.00 ./glibc_bug_261237                 

[...]

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                            
24235 yazekats  20   0 14464  316  248 R 60.6  0.0   7:27.90 ./glibc_bug_261237
24236 yazekats  20   0 14464  316  248 S 59.0  0.0   7:14.95 ./glibc_bug_261237

[...]

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND   
24235 yazekats  20   0 14464  316  248 R 61.9  0.0   7:31.00 ./glibc_bug_261237
24236 yazekats  20   0 14464  316  248 R 59.7  0.0   7:17.94 ./glibc_bug_261237            

Try strace -ttT:

$ strace -ttT -p 24235
13:17:59.132997 futex(0x600d24, FUTEX_WAIT_PRIVATE, 257305127, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000021>
13:17:59.133086 futex(0x600d64, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x600d60, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 <0.000025>
13:17:59.133180 futex(0x600ce0, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000025>
13:17:59.133270 futex(0x600d20, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000031>
13:17:59.133370 futex(0x600d24, FUTEX_WAIT_PRIVATE, 257305129, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000021>
13:17:59.133459 futex(0x600d64, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x600d60, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 <0.000024>
13:17:59.133552 futex(0x600ce0, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000024>
13:17:59.133642 futex(0x600d20, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000031>
13:17:59.133743 futex(0x600d24, FUTEX_WAIT_PRIVATE, 257305131, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000021>
13:17:59.133841 futex(0x600d64, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x600d60, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 <0.000025>
13:17:59.133936 futex(0x600ce0, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000026>
13:17:59.134027 futex(0x600d20, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000031>
13:17:59.134127 futex(0x600d24, FUTEX_WAIT_PRIVATE, 257305133, NULL) = -1 EAGAIN (Resource temporarily unavailable) <0.000020>
^CProcess 24235 detached

Try strace -c:

$ strace -c -p 24235
Process 24235 attached - interrupt to quit
^CProcess 24235 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.010504           0     49197     12280 futex
  0.00    0.000000           0        30           write
------ ----------- ----------- --------- --------- ----------------
100.00    0.010504                 49227     12280 total

Try ltrace:

$ ltrace -ttTSC -n 4 -p 4454
[pid 4455] 16:06:08.972344 pthread_cond_wait(0x600d60, 0x600ce0, 1, -1, 0x600d20 <unfinished ...>
[pid 4454] 16:06:08.972523     <... SYS_futex resumed> )                                               = -512 <0.000157>
[pid 4455] 16:06:08.972573     SYS_futex(0x600ce0, 129, 1, 1, 0x600d20 <unfinished ...>
[pid 4454] 16:06:08.972642     SYS_futex(0x600ce0, 128, 2, 0, 0x600ce0 <unfinished ...>
[pid 4455] 16:06:08.972738     <... SYS_futex resumed> )                                               = 0 <0.000077>
[pid 4454] 16:06:08.972794     <... SYS_futex resumed> )                                               = -11 <0.000063>
[pid 4455] 16:06:08.972829     SYS_futex(0x600d64, 128, 0x3dbc1ef, 0, 0x600c00 <unfinished ...>
[pid 4454] 16:06:08.972931 <... pthread_cond_wait resumed> )                                           = 0 <0.001707>
[pid 4454] 16:06:08.972969 pthread_cond_signal(0x600d60, 128, 0, -1, 0x600ce0 <unfinished ...>
[pid 4455] 16:06:08.973149     <... SYS_futex resumed> )                                               = -512 <0.000226>
[pid 4455] 16:06:08.973199     SYS_futex(0x600d64, 128, 0x3dbc1ef, 0, 0x600c00 <unfinished ...>
[pid 4454] 16:06:08.973267     SYS_futex(0x600d64, 133, 1, 1, 0x600d60 <unfinished ...>
[pid 4455] 16:06:08.973355     <... SYS_futex resumed> )                                               = -11 <0.000070>
[pid 4454] 16:06:08.973412     <... SYS_futex resumed> )                                               = 0 <0.000063>
[pid 4455] 16:06:08.973445     SYS_futex(0x600ce0, 128, 2, 0, 0x600ce0 <unfinished ...>
[pid 4454] 16:06:08.973541 <... pthread_cond_signal resumed> )                                         = 0 <0.000471>
[pid 4454] 16:06:08.973579 printf("%u\n", 32350362 <unfinished ...>
[pid 4455] 16:06:08.973748     <... SYS_futex resumed> )                                               = ^C-512 <0.000214>
$ ltrace -c -p 24235
^C% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
 50.53    6.287200        2779      2262 pthread_cond_wait
 22.54    2.803753         929      3017 pthread_cond_signal
 19.67    2.447685        1623      1508 printf
  7.26    0.902921        1197       754 pthread_cond_timedwait
------ ----------- ----------- --------- --------------------
100.00   12.441559                  7541 total

Try perf:

$ sudo sh -c 'echo 0 > /proc/sys/kernel/perf_event_paranoid'
$ perf record -F 99 -p 24235
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.021 MB perf.data (~922 samples) ]
$ perf report
Samples: 305  of event 'cycles:HG', Event count (approx.): 2873553649
 10.73%  glibc_bug_26123  [kernel.kallsyms]   [k] _raw_spin_unlock_irqrestore ★ selected this entry
  8.22%  glibc_bug_26123  [kernel.kallsyms]   [k] try_to_wake_up
  7.83%  glibc_bug_26123  libc-2.12.so        [.] vfprintf
  3.89%  glibc_bug_26123  [kernel.kallsyms]   [k] native_write_msr_safe
  3.88%  glibc_bug_26123  [kernel.kallsyms]   [k] __ticket_spin_lock
  3.08%  glibc_bug_26123  libpthread-2.12.so  [.] pthread_cond_wait@@GLIBC_2.3.2
  2.97%  glibc_bug_26123  [kernel.kallsyms]   [k] futex_wake_op
  2.83%  glibc_bug_26123  libc-2.12.so        [.] free
  2.79%  glibc_bug_26123  [kernel.kallsyms]   [k] system_call
  2.68%  glibc_bug_26123  [kernel.kallsyms]   [k] system_call_after_swapgs
  2.57%  glibc_bug_26123  [kernel.kallsyms]   [k] get_futex_key
  2.37%  glibc_bug_26123  [kernel.kallsyms]   [k] __schedule
  2.31%  glibc_bug_26123  libc-2.12.so        [.] _itoa_word
  2.12%  glibc_bug_26123  libpthread-2.12.so  [.] pthread_cond_signal@@GLIBC_2.3.2
  1.88%  glibc_bug_26123  [kernel.kallsyms]   [k] __x2apic_send_IPI_mask
  1.84%  glibc_bug_26123  [kernel.kallsyms]   [k] futex_wait
  1.77%  glibc_bug_26123  libpthread-2.12.so  [.] __lll_unlock_wake
  1.76%  glibc_bug_26123  [kernel.kallsyms]   [k] select_task_rq_fair
  1.58%  glibc_bug_26123  libc-2.12.so        [.] _IO_file_xsputn@@GLIBC_2.2.5
  1.57%  glibc_bug_26123  [kernel.kallsyms]   [k] hash_futex
  1.47%  glibc_bug_26123  [kernel.kallsyms]   [k] native_read_tsc
  1.39%  glibc_bug_26123  libc-2.12.so        [.] __strchrnul
  1.38%  glibc_bug_26123  [kernel.kallsyms]   [k] find_next_bit
  1.28%  glibc_bug_26123  [kernel.kallsyms]   [k] wake_futex
  1.25%  glibc_bug_26123  [kernel.kallsyms]   [k] perf_event_context_sched_out
  1.25%  glibc_bug_26123  [kernel.kallsyms]   [k] dequeue_task
  1.19%  glibc_bug_26123  [kernel.kallsyms]   [k] effective_load
  1.15%  glibc_bug_26123  [kernel.kallsyms]   [k] wake_affine
  1.11%  glibc_bug_26123  [kernel.kallsyms]   [k] dequeue_entity
  1.03%  glibc_bug_26123  [kernel.kallsyms]   [k] cpumask_next_and
  0.92%  glibc_bug_26123  [kernel.kallsyms]   [k] intel_pmu_enable_all
  0.90%  glibc_bug_26123  [kernel.kallsyms]   [k] sys_futex
  0.78%  glibc_bug_26123  [kernel.kallsyms]   [k] pick_next_task_idle
  0.76%  glibc_bug_26123  [kernel.kallsyms]   [k] copy_user_generic_string
  0.75%  glibc_bug_26123  [kernel.kallsyms]   [k] pick_next_task_fair
Press '?' for help on key bindings

A digression