Linux の libaio による非同期I/Oでは aio コンテキスト数は /proc/sys/fs/aio-nr で確認でき、/proc/sys/fs/aio-max-nr に達すると、io_setup システムコールの戻り値に EAGAIN が返される。このことをI/Oベンチマークツール flexible IO tester (fio) で検証してみた。
参考情報
aio-nr & aio-max-nr:
aio-nr is the running total of the number of events specified on the io_setup system call for all currently active aio contexts. If aio-nr reaches aio-max-nr then io_setup will fail with EAGAIN. Note that raising aio-max-nr does not result in the pre-allocation or re-sizing of any kernel data structures.
https://www.kernel.org/doc/Documentation/sysctl/fs.txt
EAGAIN The specified nr_events exceeds the user's limit of available events, as defined in /proc/sys/fs/aio-max-nr.
io_setup(2) - Linux manual page
- 詳解 Linuxカーネル 第3版 P.719
ユーザーモードプロセスが io_submit() システムコールを利用して非同期I/O操作を開始するためには、あらかじめ非同期I/Oコンテキスト(Asynchronous I/O context、以下 AIOコンテキスト)を生成する必要があります。
(中略)
AIOリングはプロセスのページユーザーモードのアドレスにマッピングされており、アプリケーションは未終了の非同期I/O操作の経過を直接確認できます。AIOリングにより、比較的遅いシステムコールを使わずに確認できます。
(中略)
io_setup()システムコーリは、呼び出したプロセス用に新しいAIOコンテキストを生成します。io_setup()システムコールは2つの引数を受け取ります。1つは、未終了の非同期I/O操作の最大数、つまりAIOリングの大きさを決定します。もう1つは、AIOコンテキストを示すハンドルを格納するための変数へのポインタであり、このハンドルはAIOリングのベースアドレスでもあります。
事前準備
- 非同期I/Oリクエスト数を上限(aio-max-nr)に到達させるため敢えて小さくする(5に変更)
[root@yazekats-linux ~]# cat /proc/sys/fs/aio-max-nr 65536 [root@yazekats-linux ~]# sysctl -w fs.aio-max-nr=5 [root@yazekats-linux ~]# cat /proc/sys/fs/aio-max-nr 5 ★
検証
- 非同期I/Oコンテキスト数(aio-nr)を1秒間隔で監視する
yazekats% while : do echo -n `date '+%Y-%m-%d %H:%M:%S:'` cat /proc/sys/fs/aio-nr sleep 1 done ... 2014-04-08 02:41:39:0 2014-04-08 02:41:40:0 2014-04-08 02:41:41:0 2014-04-08 02:41:42:5 ★ 2014-04-08 02:41:43:5 ★ 2014-04-08 02:41:44:5 ★ 2014-04-08 02:41:45:5 ★ 2014-04-08 02:41:46:5 ★ 2014-04-08 02:41:47:5 ★ 2014-04-08 02:41:48:5 ★ 2014-04-08 02:41:49:2 ★ 2014-04-08 02:41:50:0 2014-04-08 02:41:51:0
非同期I/Oコンテキスト数(aio-nr)が最大 5 に達している。
- I/Oベンチマークツール flexible IO tester (fio) で非同期I/Oを発行する。
yazekats% strace -fttt -o strace_fio.log fio -rw=randwrite -bs=4k -size=50m -directory=/tmp -direct=1 -numjobs=10 -name=file1 -ioengine=libaio file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1 ... file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1 fio-2.0.13 Starting 10 processes fio: check /proc/sys/fs/aio-max-nr ★ fio: check /proc/sys/fs/aio-max-nr ★ fio: check /proc/sys/fs/aio-max-nr ★ fio: pid=4797, err=11/file:engines/libaio.c:271, func=io_queue_init, error=Resource temporarily unavailable ★ fio: pid=4795, err=11/file:engines/libaio.c:271, func=io_queue_init, error=Resource temporarily unavailable ★ fio: pid=4794, err=11/file:engines/libaio.c:271, func=io_queue_init, error=Resource temporarily unavailable ★ fio: check /proc/sys/fs/aio-max-nr ★ fio: pid=4798, err=11/file:engines/libaio.c:271, func=io_queue_init, error=Resource temporarily unavailable ★ fio: check /proc/sys/fs/aio-max-nr ★ fio: pid=4796, err=11/file:engines/libaio.c:271, func=io_queue_init, error=Resource temporarily unavailable ★ Jobs: 2 (f=2): [ww___XXXXX] [100.0% done] [0K/27380K/0K /s] [0 /6845 /0 iops] [eta 00m:00s] file1: (groupid=0, jobs=1): err= 0: pid=4789: Tue Apr 8 02:41:50 2014 write: io=51200KB, bw=6385.7KB/s, iops=1596 , runt= 8018msec slat (usec): min=34 , max=1404 , avg=196.54, stdev=106.99 clat (usec): min=62 , max=5817 , avg=424.62, stdev=396.55 lat (usec): min=193 , max=6001 , avg=622.42, stdev=412.09 clat percentiles (usec): | 1.00th=[ 89], 5.00th=[ 123], 10.00th=[ 177], 20.00th=[ 239], | 30.00th=[ 274], 40.00th=[ 298], 50.00th=[ 322], 60.00th=[ 346], | 70.00th=[ 378], 80.00th=[ 438], 90.00th=[ 716], 95.00th=[ 1320], | 99.00th=[ 2160], 99.50th=[ 2352], 99.90th=[ 3344], 99.95th=[ 3952], | 99.99th=[ 4256] bw (KB/s) : min= 5024, max=10936, per=20.03%, avg=6379.12, stdev=1347.65 lat (usec) : 100=1.45%, 250=21.31%, 500=61.87%, 750=5.85%, 1000=2.64% lat (msec) : 2=5.22%, 4=1.62%, 10=0.05% cpu : usr=1.93%, sys=5.39%, ctx=63665, majf=0, minf=26 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=12800/d=0, short=r=0/w=0/d=0 file1: (groupid=0, jobs=1): err= 0: pid=4790: Tue Apr 8 02:41:50 2014 write: io=51200KB, bw=6513.2KB/s, iops=1628 , runt= 7861msec slat (usec): min=33 , max=1855 , avg=194.53, stdev=98.58 clat (usec): min=50 , max=4267 , avg=414.18, stdev=371.95 lat (usec): min=198 , max=4522 , avg=609.93, stdev=380.96 clat percentiles (usec): | 1.00th=[ 102], 5.00th=[ 155], 10.00th=[ 197], 20.00th=[ 239], | 30.00th=[ 270], 40.00th=[ 294], 50.00th=[ 314], 60.00th=[ 338], | 70.00th=[ 370], 80.00th=[ 422], 90.00th=[ 676], 95.00th=[ 1272], | 99.00th=[ 2096], 99.50th=[ 2288], 99.90th=[ 3024], 99.95th=[ 3536], | 99.99th=[ 4192] bw (KB/s) : min= 5288, max= 7400, per=19.94%, avg=6348.80, stdev=528.81 lat (usec) : 100=0.88%, 250=22.46%, 500=62.45%, 750=5.35%, 1000=2.39% lat (msec) : 2=5.18%, 4=1.27%, 10=0.02% cpu : usr=2.19%, sys=5.41%, ctx=64013, majf=0, minf=23 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=12800/d=0, short=r=0/w=0/d=0 file1: (groupid=0, jobs=1): err= 0: pid=4791: Tue Apr 8 02:41:50 2014 write: io=51200KB, bw=6749.3KB/s, iops=1687 , runt= 7586msec slat (usec): min=32 , max=1539 , avg=172.25, stdev=83.40 clat (usec): min=50 , max=4118 , avg=414.69, stdev=358.46 lat (usec): min=132 , max=4308 , avg=588.14, stdev=363.63 clat percentiles (usec): | 1.00th=[ 125], 5.00th=[ 181], 10.00th=[ 213], 20.00th=[ 251], | 30.00th=[ 278], 40.00th=[ 298], 50.00th=[ 318], 60.00th=[ 342], | 70.00th=[ 366], 80.00th=[ 414], 90.00th=[ 660], 95.00th=[ 1224], | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2832], 99.95th=[ 3376], | 99.99th=[ 3888] bw (KB/s) : min= 5368, max= 8000, per=21.13%, avg=6729.13, stdev=610.57 lat (usec) : 100=0.36%, 250=19.52%, 500=66.64%, 750=5.02%, 1000=2.19% lat (msec) : 2=5.02%, 4=1.25%, 10=0.01% cpu : usr=2.85%, sys=5.27%, ctx=63722, majf=0, minf=26 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=12800/d=0, short=r=0/w=0/d=0 file1: (groupid=0, jobs=1): err= 0: pid=4792: Tue Apr 8 02:41:50 2014 write: io=50552KB, bw=6780.1KB/s, iops=1695 , runt= 7455msec slat (usec): min=32 , max=2269 , avg=157.39, stdev=82.15 clat (usec): min=45 , max=4407 , avg=426.60, stdev=364.62 lat (usec): min=129 , max=4699 , avg=585.23, stdev=369.91 clat percentiles (usec): | 1.00th=[ 131], 5.00th=[ 185], 10.00th=[ 217], 20.00th=[ 258], | 30.00th=[ 286], 40.00th=[ 310], 50.00th=[ 330], 60.00th=[ 350], | 70.00th=[ 382], 80.00th=[ 430], 90.00th=[ 684], 95.00th=[ 1256], | 99.00th=[ 2128], 99.50th=[ 2256], 99.90th=[ 2736], 99.95th=[ 3696], | 99.99th=[ 4080] bw (KB/s) : min= 5696, max= 7232, per=21.00%, avg=6686.57, stdev=480.73 lat (usec) : 50=0.01%, 100=0.30%, 250=17.11%, 500=68.37%, 750=5.37% lat (usec) : 1000=2.50% lat (msec) : 2=4.90%, 4=1.42%, 10=0.02% cpu : usr=2.49%, sys=5.93%, ctx=63366, majf=0, minf=26 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=12638/d=0, short=r=0/w=0/d=0 file1: (groupid=0, jobs=1): err= 0: pid=4793: Tue Apr 8 02:41:50 2014 write: io=51200KB, bw=7295.6KB/s, iops=1823 , runt= 7018msec slat (usec): min=31 , max=1259 , avg=144.58, stdev=74.88 clat (usec): min=38 , max=6284 , avg=397.84, stdev=332.23 lat (usec): min=131 , max=6450 , avg=543.66, stdev=335.38 clat percentiles (usec): | 1.00th=[ 129], 5.00th=[ 179], 10.00th=[ 209], 20.00th=[ 253], | 30.00th=[ 282], 40.00th=[ 306], 50.00th=[ 326], 60.00th=[ 346], | 70.00th=[ 366], 80.00th=[ 398], 90.00th=[ 532], 95.00th=[ 956], | 99.00th=[ 1992], 99.50th=[ 2224], 99.90th=[ 3056], 99.95th=[ 3568], | 99.99th=[ 4512] bw (KB/s) : min= 6128, max= 8024, per=22.93%, avg=7301.71, stdev=488.22 lat (usec) : 50=0.02%, 100=0.26%, 250=18.98%, 500=69.81%, 750=4.13% lat (usec) : 1000=1.94% lat (msec) : 2=3.86%, 4=0.97%, 10=0.03% cpu : usr=2.59%, sys=6.47%, ctx=64512, majf=0, minf=27 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=12800/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): WRITE: io=255352KB, aggrb=31847KB/s, minb=6385KB/s, maxb=7295KB/s, mint=7018msec, maxt=8018msec Disk stats (read/write): dm-1: ios=173/62797, merge=0/0, ticks=67/24410, in_queue=24481, util=89.99%, aggrios=173/63898, aggrmerge=0/0, aggrticks=67/24511, aggrin_queue=24583, aggrutil=88.68% dm-0: ios=173/63898, merge=0/0, ticks=67/24511, in_queue=24583, util=88.68%, aggrios=173/63898, aggrmerge=0/0, aggrticks=45/22169, aggrin_queue=22190, aggrutil=84.26% sda: ios=173/63898, merge=0/0, ticks=45/22169, in_queue=22190, util=84.26%
fio の -numjobs オプションで 10 並行を指定したが、aio-max-nr の上限に達し、5 に変更して実行されている。
- strace で取得した fio のシステムコールのトレースを確認する。
yazekats% grep io_setup strace_fio.log 4790 1396892502.061498 io_setup(2147483647, <unfinished ...> 4790 1396892502.061657 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument) 4789 1396892502.061683 io_setup(2147483647, <unfinished ...> 4790 1396892502.061780 io_setup(1, <unfinished ...> 4789 1396892502.061797 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument) 4790 1396892502.061906 <... io_setup resumed> {139825214746624}) = 0 4789 1396892502.061931 io_setup(1, <unfinished ...> 4789 1396892502.062153 <... io_setup resumed> {139825214746624}) = 0 4791 1396892502.062288 io_setup(2147483647, <unfinished ...> 4791 1396892502.062426 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument) 4791 1396892502.062557 io_setup(1, <unfinished ...> 4791 1396892502.062759 <... io_setup resumed> {139825214746624}) = 0 4792 1396892502.063390 io_setup(2147483647, <unfinished ...> 4792 1396892502.063568 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument) 4792 1396892502.063765 io_setup(1, <unfinished ...> 4792 1396892502.064023 <... io_setup resumed> {139825214746624}) = 0 4793 1396892502.064344 io_setup(2147483647, <unfinished ...> 4793 1396892502.064503 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument) 4793 1396892502.064730 io_setup(1, <unfinished ...> 4793 1396892502.064869 <... io_setup resumed> {139825214746624}) = 0 4794 1396892502.065185 io_setup(2147483647, <unfinished ...> 4794 1396892502.065373 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument) 4794 1396892502.065580 io_setup(1, <unfinished ...> 4795 1396892502.066413 io_setup(2147483647, <unfinished ...> 4795 1396892502.066571 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument) 4796 1396892502.066769 io_setup(2147483647, <unfinished ...> 4795 1396892502.066789 io_setup(1, <unfinished ...> 4796 1396892502.066918 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument) 4796 1396892502.067031 io_setup(1, <unfinished ...> 4797 1396892502.067655 io_setup(2147483647, <unfinished ...> 4797 1396892502.067767 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument) 4797 1396892502.067907 io_setup(1, <unfinished ...> 4798 1396892502.071044 io_setup(2147483647, <unfinished ...> 4798 1396892502.071062 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument) 4798 1396892502.071082 io_setup(1, <unfinished ...> 4797 1396892502.073724 <... io_setup resumed> 0x223e210) = -1 EAGAIN (Resource temporarily unavailable) ★ 4795 1396892502.073738 <... io_setup resumed> 0x223e210) = -1 EAGAIN (Resource temporarily unavailable) ★ 4794 1396892502.073788 <... io_setup resumed> 0x223e210) = -1 EAGAIN (Resource temporarily unavailable) ★ 4798 1396892502.080743 <... io_setup resumed> 0x223e210) = -1 EAGAIN (Resource temporarily unavailable) ★ 4796 1396892502.081637 <... io_setup resumed> 0x223e210) = -1 EAGAIN (Resource temporarily unavailable) ★
io_setup システムコールの戻り値に "EAGAIN (Resource temporarily unavailable)" が返されている。
補足
fio のソースコードを追ってみる
static int fio_libaio_init(struct thread_data *td) { struct libaio_data *ld = malloc(sizeof(*ld)); struct libaio_options *o = td->eo; int err = 0; memset(ld, 0, sizeof(*ld)); /* * First try passing in 0 for queue depth, since we don't * care about the user ring. If that fails, the kernel is too old * and we need the right depth. */ if (!o->userspace_reap) err = io_queue_init(INT_MAX, &ld->aio_ctx); if (o->userspace_reap || err == -EINVAL) err = io_queue_init(td->o.iodepth, &ld->aio_ctx); if (err) { td_verror(td, -err, "io_queue_init"); log_err("fio: check /proc/sys/fs/aio-max-nr\n"); ★ free(ld); return 1; } ld->aio_events = malloc(td->o.iodepth * sizeof(struct io_event)); memset(ld->aio_events, 0, td->o.iodepth * sizeof(struct io_event)); ld->iocbs = malloc(td->o.iodepth * sizeof(struct iocb *)); memset(ld->iocbs, 0, sizeof(struct iocb *)); ld->io_us = malloc(td->o.iodepth * sizeof(struct io_u *)); memset(ld->io_us, 0, td->o.iodepth * sizeof(struct io_u *)); ld->iocbs_nr = 0; td->io_ops->data = ld; return 0; }
参考
- asynchronous - Difference between POSIX AIO and libaio on Linux? - Stack Overflow
- sockets - Linux Asynch IO - difference between aio.h and libaio.h - Stack Overflow
- バッファキャッシュとAIO(3) - O'Reilly Japan Community Blog
- Linuxでaio - (ひ)メモ
- Aio
- Troubleshooting ORA-27090 async I/O errors with systemtap
- 非同期I/Oの謎 | Insight Technology, Inc.
追記(2014/04/13):
dstat のソースを見てたら --aio オプションを指定すると /proc/sys/fs/aio-nr を監視できることに気づいた。
284 --aio enable aio stats ... 521 class dstat_aio(dstat): 522 def __init__(self): 523 self.name = 'async' 524 self.type = 'd' 525 self.width = 5; 526 self.open('/proc/sys/fs/aio-nr') 527 self.nick = ('#aio',) 528 self.vars = ('aio',) 529
実行してみると、
yazekats% dstat -t --aio ----system---- async date/time | #aio 14-04 00:31:55| 0 14-04 00:31:56| 0 14-04 00:31:57| 10B 14-04 00:31:58| 10B 14-04 00:31:59| 10B