ablog

不器用で落着きのない技術者のメモ

libaio の aio コンテキスト数を調べる

Linux の libaio による非同期I/Oでは aio コンテキスト数は /proc/sys/fs/aio-nr で確認でき、/proc/sys/fs/aio-max-nr に達すると、io_setup システムコールの戻り値に EAGAIN が返される。このことをI/Oベンチマークツール flexible IO tester (fio) で検証してみた。

おさらい

Linux の非同期I/Oは2種類ある。

  • POSIX AIO
    • aio_read(), aio_write() ライブラリ関数がカレントプロセスの子スレッドを作る。
    • その子スレッドが同期な read()、write() システムコールを呼出す。
    • 子スレッドが発行した同期I/O操作を待たずに aio_read(), aio_write() ライブラリ関数は終了する。
    • カーネルレベルで実現する非同期I/Oに比べて遅い。
  • libaio
    • カーネルレベルの非同期I/O。
    • io_setup、io_submit、io_getevents、io_cancel、io_destroy システムコールを提供する。
    • O_DIRECT フラグを指定してオープンしたファイルのみ非同期で処理できる。

参考情報

aio-nr & aio-max-nr:

aio-nr is the running total of the number of events specified on the io_setup system call for all currently active aio contexts. If aio-nr reaches aio-max-nr then io_setup will fail with EAGAIN. Note that raising aio-max-nr does not result in the pre-allocation or re-sizing of any kernel data structures.

https://www.kernel.org/doc/Documentation/sysctl/fs.txt

EAGAIN The specified nr_events exceeds the user's limit of available events, as defined in /proc/sys/fs/aio-max-nr.

io_setup(2) - Linux manual page

ユーザーモードプロセスが io_submit() システムコールを利用して非同期I/O操作を開始するためには、あらかじめ非同期I/Oコンテキスト(Asynchronous I/O context、以下 AIOコンテキスト)を生成する必要があります。
(中略)
AIOリングはプロセスのページユーザーモードのアドレスにマッピングされており、アプリケーションは未終了の非同期I/O操作の経過を直接確認できます。AIOリングにより、比較的遅いシステムコールを使わずに確認できます。
(中略)
io_setup()システムコーリは、呼び出したプロセス用に新しいAIOコンテキストを生成します。io_setup()システムコールは2つの引数を受け取ります。1つは、未終了の非同期I/O操作の最大数、つまりAIOリングの大きさを決定します。もう1つは、AIOコンテキストを示すハンドルを格納するための変数へのポインタであり、このハンドルはAIOリングのベースアドレスでもあります。

事前準備

  • 非同期I/Oリクエスト数を上限(aio-max-nr)に到達させるため敢えて小さくする(5に変更)
[root@yazekats-linux ~]# cat /proc/sys/fs/aio-max-nr 
65536
[root@yazekats-linux ~]# sysctl -w fs.aio-max-nr=5
[root@yazekats-linux ~]# cat /proc/sys/fs/aio-max-nr 
5

検証

  • 非同期I/Oコンテキスト数(aio-nr)を1秒間隔で監視する
yazekats% while :
do
echo -n `date '+%Y-%m-%d %H:%M:%S:'`
cat /proc/sys/fs/aio-nr
sleep 1
done

...

2014-04-08 02:41:39:0
2014-04-08 02:41:40:0
2014-04-08 02:41:41:0
2014-04-08 02:41:42:52014-04-08 02:41:43:52014-04-08 02:41:44:52014-04-08 02:41:45:52014-04-08 02:41:46:52014-04-08 02:41:47:52014-04-08 02:41:48:52014-04-08 02:41:49:22014-04-08 02:41:50:0
2014-04-08 02:41:51:0

非同期I/Oコンテキスト数(aio-nr)が最大 5 に達している。

  • I/Oベンチマークツール flexible IO tester (fio) で非同期I/Oを発行する。
yazekats% strace -fttt -o strace_fio.log fio -rw=randwrite -bs=4k -size=50m -directory=/tmp -direct=1 -numjobs=10 -name=file1 -ioengine=libaio
file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
...
file1: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
fio-2.0.13
Starting 10 processes
fio: check /proc/sys/fs/aio-max-nr ★
fio: check /proc/sys/fs/aio-max-nr ★
fio: check /proc/sys/fs/aio-max-nr ★
fio: pid=4797, err=11/file:engines/libaio.c:271, func=io_queue_init, error=Resource temporarily unavailable ★
fio: pid=4795, err=11/file:engines/libaio.c:271, func=io_queue_init, error=Resource temporarily unavailable ★
fio: pid=4794, err=11/file:engines/libaio.c:271, func=io_queue_init, error=Resource temporarily unavailable ★
fio: check /proc/sys/fs/aio-max-nr ★
fio: pid=4798, err=11/file:engines/libaio.c:271, func=io_queue_init, error=Resource temporarily unavailable ★
fio: check /proc/sys/fs/aio-max-nr ★
fio: pid=4796, err=11/file:engines/libaio.c:271, func=io_queue_init, error=Resource temporarily unavailable ★
Jobs: 2 (f=2): [ww___XXXXX] [100.0% done] [0K/27380K/0K /s] [0 /6845 /0  iops] [eta 00m:00s]
file1: (groupid=0, jobs=1): err= 0: pid=4789: Tue Apr  8 02:41:50 2014
  write: io=51200KB, bw=6385.7KB/s, iops=1596 , runt=  8018msec
    slat (usec): min=34 , max=1404 , avg=196.54, stdev=106.99
    clat (usec): min=62 , max=5817 , avg=424.62, stdev=396.55
     lat (usec): min=193 , max=6001 , avg=622.42, stdev=412.09
    clat percentiles (usec):
     |  1.00th=[   89],  5.00th=[  123], 10.00th=[  177], 20.00th=[  239],
     | 30.00th=[  274], 40.00th=[  298], 50.00th=[  322], 60.00th=[  346],
     | 70.00th=[  378], 80.00th=[  438], 90.00th=[  716], 95.00th=[ 1320],
     | 99.00th=[ 2160], 99.50th=[ 2352], 99.90th=[ 3344], 99.95th=[ 3952],
     | 99.99th=[ 4256]
    bw (KB/s)  : min= 5024, max=10936, per=20.03%, avg=6379.12, stdev=1347.65
    lat (usec) : 100=1.45%, 250=21.31%, 500=61.87%, 750=5.85%, 1000=2.64%
    lat (msec) : 2=5.22%, 4=1.62%, 10=0.05%
  cpu          : usr=1.93%, sys=5.39%, ctx=63665, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=12800/d=0, short=r=0/w=0/d=0
file1: (groupid=0, jobs=1): err= 0: pid=4790: Tue Apr  8 02:41:50 2014
  write: io=51200KB, bw=6513.2KB/s, iops=1628 , runt=  7861msec
    slat (usec): min=33 , max=1855 , avg=194.53, stdev=98.58
    clat (usec): min=50 , max=4267 , avg=414.18, stdev=371.95
     lat (usec): min=198 , max=4522 , avg=609.93, stdev=380.96
    clat percentiles (usec):
     |  1.00th=[  102],  5.00th=[  155], 10.00th=[  197], 20.00th=[  239],
     | 30.00th=[  270], 40.00th=[  294], 50.00th=[  314], 60.00th=[  338],
     | 70.00th=[  370], 80.00th=[  422], 90.00th=[  676], 95.00th=[ 1272],
     | 99.00th=[ 2096], 99.50th=[ 2288], 99.90th=[ 3024], 99.95th=[ 3536],
     | 99.99th=[ 4192]
    bw (KB/s)  : min= 5288, max= 7400, per=19.94%, avg=6348.80, stdev=528.81
    lat (usec) : 100=0.88%, 250=22.46%, 500=62.45%, 750=5.35%, 1000=2.39%
    lat (msec) : 2=5.18%, 4=1.27%, 10=0.02%
  cpu          : usr=2.19%, sys=5.41%, ctx=64013, majf=0, minf=23
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=12800/d=0, short=r=0/w=0/d=0
file1: (groupid=0, jobs=1): err= 0: pid=4791: Tue Apr  8 02:41:50 2014
  write: io=51200KB, bw=6749.3KB/s, iops=1687 , runt=  7586msec
    slat (usec): min=32 , max=1539 , avg=172.25, stdev=83.40
    clat (usec): min=50 , max=4118 , avg=414.69, stdev=358.46
     lat (usec): min=132 , max=4308 , avg=588.14, stdev=363.63
    clat percentiles (usec):
     |  1.00th=[  125],  5.00th=[  181], 10.00th=[  213], 20.00th=[  251],
     | 30.00th=[  278], 40.00th=[  298], 50.00th=[  318], 60.00th=[  342],
     | 70.00th=[  366], 80.00th=[  414], 90.00th=[  660], 95.00th=[ 1224],
     | 99.00th=[ 2096], 99.50th=[ 2224], 99.90th=[ 2832], 99.95th=[ 3376],
     | 99.99th=[ 3888]
    bw (KB/s)  : min= 5368, max= 8000, per=21.13%, avg=6729.13, stdev=610.57
    lat (usec) : 100=0.36%, 250=19.52%, 500=66.64%, 750=5.02%, 1000=2.19%
    lat (msec) : 2=5.02%, 4=1.25%, 10=0.01%
  cpu          : usr=2.85%, sys=5.27%, ctx=63722, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=12800/d=0, short=r=0/w=0/d=0
file1: (groupid=0, jobs=1): err= 0: pid=4792: Tue Apr  8 02:41:50 2014
  write: io=50552KB, bw=6780.1KB/s, iops=1695 , runt=  7455msec
    slat (usec): min=32 , max=2269 , avg=157.39, stdev=82.15
    clat (usec): min=45 , max=4407 , avg=426.60, stdev=364.62
     lat (usec): min=129 , max=4699 , avg=585.23, stdev=369.91
    clat percentiles (usec):
     |  1.00th=[  131],  5.00th=[  185], 10.00th=[  217], 20.00th=[  258],
     | 30.00th=[  286], 40.00th=[  310], 50.00th=[  330], 60.00th=[  350],
     | 70.00th=[  382], 80.00th=[  430], 90.00th=[  684], 95.00th=[ 1256],
     | 99.00th=[ 2128], 99.50th=[ 2256], 99.90th=[ 2736], 99.95th=[ 3696],
     | 99.99th=[ 4080]
    bw (KB/s)  : min= 5696, max= 7232, per=21.00%, avg=6686.57, stdev=480.73
    lat (usec) : 50=0.01%, 100=0.30%, 250=17.11%, 500=68.37%, 750=5.37%
    lat (usec) : 1000=2.50%
    lat (msec) : 2=4.90%, 4=1.42%, 10=0.02%
  cpu          : usr=2.49%, sys=5.93%, ctx=63366, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=12638/d=0, short=r=0/w=0/d=0
file1: (groupid=0, jobs=1): err= 0: pid=4793: Tue Apr  8 02:41:50 2014
  write: io=51200KB, bw=7295.6KB/s, iops=1823 , runt=  7018msec
    slat (usec): min=31 , max=1259 , avg=144.58, stdev=74.88
    clat (usec): min=38 , max=6284 , avg=397.84, stdev=332.23
     lat (usec): min=131 , max=6450 , avg=543.66, stdev=335.38
    clat percentiles (usec):
     |  1.00th=[  129],  5.00th=[  179], 10.00th=[  209], 20.00th=[  253],
     | 30.00th=[  282], 40.00th=[  306], 50.00th=[  326], 60.00th=[  346],
     | 70.00th=[  366], 80.00th=[  398], 90.00th=[  532], 95.00th=[  956],
     | 99.00th=[ 1992], 99.50th=[ 2224], 99.90th=[ 3056], 99.95th=[ 3568],
     | 99.99th=[ 4512]
    bw (KB/s)  : min= 6128, max= 8024, per=22.93%, avg=7301.71, stdev=488.22
    lat (usec) : 50=0.02%, 100=0.26%, 250=18.98%, 500=69.81%, 750=4.13%
    lat (usec) : 1000=1.94%
    lat (msec) : 2=3.86%, 4=0.97%, 10=0.03%
  cpu          : usr=2.59%, sys=6.47%, ctx=64512, majf=0, minf=27
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=12800/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=255352KB, aggrb=31847KB/s, minb=6385KB/s, maxb=7295KB/s, mint=7018msec, maxt=8018msec

Disk stats (read/write):
    dm-1: ios=173/62797, merge=0/0, ticks=67/24410, in_queue=24481, util=89.99%, aggrios=173/63898, aggrmerge=0/0, aggrticks=67/24511, aggrin_queue=24583, aggrutil=88.68%
    dm-0: ios=173/63898, merge=0/0, ticks=67/24511, in_queue=24583, util=88.68%, aggrios=173/63898, aggrmerge=0/0, aggrticks=45/22169, aggrin_queue=22190, aggrutil=84.26%
  sda: ios=173/63898, merge=0/0, ticks=45/22169, in_queue=22190, util=84.26%

fio の -numjobs オプションで 10 並行を指定したが、aio-max-nr の上限に達し、5 に変更して実行されている。

yazekats% grep io_setup strace_fio.log
4790  1396892502.061498 io_setup(2147483647,  <unfinished ...>
4790  1396892502.061657 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument)
4789  1396892502.061683 io_setup(2147483647,  <unfinished ...>
4790  1396892502.061780 io_setup(1,  <unfinished ...>
4789  1396892502.061797 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument)
4790  1396892502.061906 <... io_setup resumed> {139825214746624}) = 0
4789  1396892502.061931 io_setup(1,  <unfinished ...>
4789  1396892502.062153 <... io_setup resumed> {139825214746624}) = 0
4791  1396892502.062288 io_setup(2147483647,  <unfinished ...>
4791  1396892502.062426 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument)
4791  1396892502.062557 io_setup(1,  <unfinished ...>
4791  1396892502.062759 <... io_setup resumed> {139825214746624}) = 0
4792  1396892502.063390 io_setup(2147483647,  <unfinished ...>
4792  1396892502.063568 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument)
4792  1396892502.063765 io_setup(1,  <unfinished ...>
4792  1396892502.064023 <... io_setup resumed> {139825214746624}) = 0
4793  1396892502.064344 io_setup(2147483647,  <unfinished ...>
4793  1396892502.064503 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument)
4793  1396892502.064730 io_setup(1,  <unfinished ...>
4793  1396892502.064869 <... io_setup resumed> {139825214746624}) = 0
4794  1396892502.065185 io_setup(2147483647,  <unfinished ...>
4794  1396892502.065373 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument)
4794  1396892502.065580 io_setup(1,  <unfinished ...>
4795  1396892502.066413 io_setup(2147483647,  <unfinished ...>
4795  1396892502.066571 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument)
4796  1396892502.066769 io_setup(2147483647,  <unfinished ...>
4795  1396892502.066789 io_setup(1,  <unfinished ...>
4796  1396892502.066918 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument)
4796  1396892502.067031 io_setup(1,  <unfinished ...>
4797  1396892502.067655 io_setup(2147483647,  <unfinished ...>
4797  1396892502.067767 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument)
4797  1396892502.067907 io_setup(1,  <unfinished ...>
4798  1396892502.071044 io_setup(2147483647,  <unfinished ...>
4798  1396892502.071062 <... io_setup resumed> 0x223e210) = -1 EINVAL (Invalid argument)
4798  1396892502.071082 io_setup(1,  <unfinished ...>
4797  1396892502.073724 <... io_setup resumed> 0x223e210) = -1 EAGAIN (Resource temporarily unavailable)
4795  1396892502.073738 <... io_setup resumed> 0x223e210) = -1 EAGAIN (Resource temporarily unavailable)
4794  1396892502.073788 <... io_setup resumed> 0x223e210) = -1 EAGAIN (Resource temporarily unavailable)
4798  1396892502.080743 <... io_setup resumed> 0x223e210) = -1 EAGAIN (Resource temporarily unavailable)
4796  1396892502.081637 <... io_setup resumed> 0x223e210) = -1 EAGAIN (Resource temporarily unavailable)

io_setup システムコールの戻り値に "EAGAIN (Resource temporarily unavailable)" が返されている。

補足

fio のソースコードを追ってみる

static int fio_libaio_init(struct thread_data *td)
{
	struct libaio_data *ld = malloc(sizeof(*ld));
	struct libaio_options *o = td->eo;
	int err = 0;

	memset(ld, 0, sizeof(*ld));

	/*
	 * First try passing in 0 for queue depth, since we don't
	 * care about the user ring. If that fails, the kernel is too old
	 * and we need the right depth.
	 */
	if (!o->userspace_reap)
		err = io_queue_init(INT_MAX, &ld->aio_ctx);
	if (o->userspace_reap || err == -EINVAL)
		err = io_queue_init(td->o.iodepth, &ld->aio_ctx);
	if (err) {
		td_verror(td, -err, "io_queue_init");
		log_err("fio: check /proc/sys/fs/aio-max-nr\n"); ★
		free(ld);
		return 1;
	}

	ld->aio_events = malloc(td->o.iodepth * sizeof(struct io_event));
	memset(ld->aio_events, 0, td->o.iodepth * sizeof(struct io_event));
	ld->iocbs = malloc(td->o.iodepth * sizeof(struct iocb *));
	memset(ld->iocbs, 0, sizeof(struct iocb *));
	ld->io_us = malloc(td->o.iodepth * sizeof(struct io_u *));
	memset(ld->io_us, 0, td->o.iodepth * sizeof(struct io_u *));
	ld->iocbs_nr = 0;

	td->io_ops->data = ld;
	return 0;
}

参考


追記(2014/04/13):
dstat のソースを見てたら --aio オプションを指定すると /proc/sys/fs/aio-nr を監視できることに気づいた。

284   --aio                  enable aio stats

...

521 class dstat_aio(dstat):
522     def __init__(self):
523         self.name = 'async'
524         self.type = 'd'
525         self.width = 5;
526         self.open('/proc/sys/fs/aio-nr')
527         self.nick = ('#aio',)
528         self.vars = ('aio',)
529 

実行してみると、

yazekats% dstat -t --aio
----system---- async
  date/time   | #aio
14-04 00:31:55|   0 
14-04 00:31:56|   0 
14-04 00:31:57|  10B
14-04 00:31:58|  10B
14-04 00:31:59|  10B