大きなテキストファイルをawkで処理するときにcatで投げ込むと速い理由

以下、長々と書いてますが、結論は
”私の環境で”大きなテキストファイルをawkで処理するときにcatで投げ込むと速い理由 - ablogに書いてます

はじめに

awk file より cat file|awk のほうが速いという以下のブログエントリが興味深いので調べてみました。
ちなみに cat file|awk が速くなる環境は稀で、巷では awk file のほうが速い環境のほうが多いようで、どちらが速いかは環境依存です。

[root@localhost sample]# cat command1
awk '$5 ~ /((26|27|28|29|30)\/Jun|(01|02|03)\/Jul)/{ print }' sample.txt > result1.txt
（中略）
[root@localhost sample]# time sh command1

real    0m53.926s
user    0m0.926s
sys 0m33.158s
[root@localhost sample]# cat command2
cat sample.txt | awk '$5 ~ /((26|27|28|29|30)\/Jun|(01|02|03)\/Jul)/{ print }' > result2.txt
（中略）
[root@localhost sample]# time sh command2

real    0m30.789s
user    0m2.644s
sys 0m22.277s
結論

条件によるとは思うけど、catしたほうがはやいっぽい。

大きなテキストファイルをawkで処理するときにcatで投げ込むのと、ファイル読み込みするのどっちが速いか比較 - カメニッキ

仮説

マルチプロセッサ環境では awk file に比べて cat file|awk は cat がI/O待ちの間に awk は待たされず ON CPU で処理され、処理時間が短くなると想定

@yoku0825 マルチプロセッサで並列処理されるのと、awk の I/O待ちが少なくなるという2点で速くなるのではないかと想定しています。その場合、strace -Ttt で cat がI/Oしている間に awk が処理しているかを見るのが良さそうな気がします。
— Yohei Azekatsu (@yoheia) 2015, 7月 23

@yoku0825 回数より並列度が気になります。システムコールの回数も処理時間も同じでも、cat がI/O待ちの間に awk は待たされず ON CPU で処理すると、結果、処理時間が短くなります。
— Yohei Azekatsu (@yoheia) 2015, 7月 23

結論（仮）

パイプにはバッファがあるため、マルチプロセッサ環境では cat file|awk では cat と awk が多少は並列実行され、処理時間が短くなる。

検証結果

仮説を立てた上で、"Don’t guess, measure!（推測するな、計測せよ！）”ということで、検証してみました。

テストデータ作成

テストデータを作成する（5000万行で約2.7GB）

$ perl -MTime::Piece -e 'printf(qq/%08d %08d %08d %08d %s\n/,$_,2..4,$t=localtime->datetime) for 1..50000000' > sample.txt
$ ls -lh sample.txt 
-rw-rw-r-- 1 yazekats yazekats 2.7G Jul 28 14:40 sample.txt
$ wc -l sample.txt 
50000000 sample.txt
$ head -3 sample.txt 
00000001 00000002 00000003 00000004 2015-07-28T14:23:18
00000002 00000002 00000003 00000004 2015-07-28T14:23:18
00000003 00000002 00000003 00000004 2015-07-28T14:23:18

実行時間測定

ページキャッシュを解放する

# echo 3 > /proc/sys/vm/drop_caches

awk file の場合

$ time { awk '$5 ~ /00$/ {print}' sample.txt > result1.txt; }
real	1m9.755s
user	0m56.078s
sys	0m1.593s

ページキャッシュを解放する

# echo 3 > /proc/sys/vm/drop_caches

cat file|awk の場合

$ time { cat sample.txt | awk '$5 ~ /00$/ {print}' > result2.txt; }
real	0m59.299s
user	0m56.994s
sys	0m4.005s

と、awk file より cat file|awk のほうが速い。cat file|awk は実行時間をCPU時間が上回っているため、cat のプロセスと awk のプロセスが並列実行されていると想定できる。

実行時間(59.299s) < CPU時間(60.999) = user(56.994s) + sys(4.005s)

なぜこうなるかというとパイプはバッファを持っているからだと思います。

$  strace -Ttt cat sample.txt |while read LINE
> do
> sleep 10
> echo $LINE
> done
00:36:53.673164 execve("/bin/cat", ["cat", "sample.txt"], [/* 89 vars */]) = 0 <0.000332>
...
00:36:53.677511 read(3, "00000001 00000002 00000003 00000"..., 32768) = 32768 <0.000049>
00:36:53.677627 write(1, "00000001 00000002 00000003 00000"..., 32768) = 32768 <0.000622>
00:36:53.678330 read(3, " 00000002 00000003 00000004 2015"..., 32768) = 32768 <0.000039>
00:36:53.678439 write(1, " 00000002 00000003 00000004 2015"..., 32768) = 32768 <0.000031>
00:36:53.678536 read(3, "2 00000003 00000004 2015-07-28T1"..., 32768) = 32768 <0.000028>
# ↑まだ read していないのに　96KB まで read して、64KB くらい write している
00:36:53.678629 write(1, "2 00000003 00000004 2015-07-28T1"..., 3276800000001 00000002 00000003 00000004 2015-07-28T14:23:18
00000002 00000002 00000003 00000004 2015-07-28T14:23:18
00000003 00000002 00000003 00000004 2015-07-28T14:23:18
00000004 00000002 00000003 00000004 2015-07-28T14:23:18

書籍で調べてみると、パイプを使うとプロセス空間とパイプ用バッファの間でデータがやり取りされ、バッファ（2.6.11以降は16個。4KB * 16個 = 64KB）が一杯になるまで書込みできると書かれています。後述しますが、cat のI/Oサイズは 32KB なので、awk で read されなくても、cat はパイプ用バッファに2回 write することができます。

Linuxカーネル2.6解読室

作者: 高橋浩和,小田逸郎,山幡為佐久
出版社/メーカー: ソフトバンククリエイティブ
発売日: 2006/11/18
メディア: 単行本
購入: 14人クリック: 197回
この商品を含むブログ (118件) を見る

P.327

18.3.4 パイプのリード・ライト
パイプに対してリード・ライトを行うと、パイプ用バッファとプロセス空間の間でデータがやり取りされます。
（中略）
2.6.10 までは、バッファは1つでしたが、2.6.11 以降複数（16個）になっています。（中略）ライトでは書くスペースが全くない場合、書くスペースができるまでスリープします。

Pipe capacity
A pipe has a limited capacity. If the pipe is full, then a write(2) will block or fail, depending on whether the O_NONBLOCK flag is set (see below). Different implementations have different limits for the pipe capacity. Applications should not rely on a particular capacity: an application should be designed so that a reading process consumes data as soon as it is available, so that a writing process does not remain blocked.
In Linux versions before 2.6.11, the capacity of a pipe was the same as the system page size (e.g., 4096 bytes on i386). Since Linux 2.6.11, the pipe capacity is 65536 bytes. Since Linux 2.6.35, the default pipe capacity is 65536 bytes, but the capacity can be queried and set using the fcntl(2) F_GETPIPE_SZ and F_SETPIPE_SZ operations. See fcntl(2) for more information.
PIPE_BUF
POSIX.1-2001 says that write(2)s of less than PIPE_BUF bytes must be atomic: the output data is written to the pipe as a contiguous sequence. Writes of more than PIPE_BUF bytes may be nonatomic: the kernel may interleave the data with data written by other processes. POSIX.1-2001 requires PIPE_BUF to be at least 512 bytes. (On Linux, PIPE_BUF is 4096 bytes.) The precise semantics depend on whether the file descriptor is nonblocking (O_NONBLOCK), whether there are multiple writers to the pipe, and on n, the number of bytes to be written:
pipe(7) - Linux manual page

今回の検証に使った Kernel 2.6.39 でも16個でした。

linux-2.6.39.x86_64/include/linux/pipe_fs_i.h

#define PIPE_DEF_BUFFERS	16

従って、awk file|awk すると速くなります。

# echo 3 > /proc/sys/vm/drop_caches
$ time { awk '{print}' sample.txt | awk '$5 ~ /00$/ {print}' > result3.txt; }
real	0m57.671s
user	1m5.399s
sys	0m2.495s

また、パイプ用バッファの恩恵を享受できないような大きなI/Oサイズにすると直列化して遅くなり、I/Oサイズを小さくすると並列処理できる時間が長くなり、速くなります。

# echo 3 > /proc/sys/vm/drop_caches
$ time { dd if=sample.txt bs=32k| awk '$5 ~ /00$/ {print}' > result4.txt; }
85449+1 records in
85449+1 records out
2800000000 bytes (2.8 GB) copied, 59.405 s, 47.1 MB/s

real	0m59.549s
user	0m56.090s
sys	0m4.124s
# echo 3 > /proc/sys/vm/drop_caches
$ time { dd if=sample.txt bs=1024k| awk '$5 ~ /00$/ {print}' > result4.txt; }
2670+1 records in
2670+1 records out
2800000000 bytes (2.8 GB) copied, 74.642 s, 37.5 MB/s

real	1m14.762s <-- I/Oサイズを1MBにすると遅くなる
user	0m58.936s
sys	0m4.493s
# echo 3 > /proc/sys/vm/drop_caches
$ time { dd if=sample.txt bs=4k| awk '$5 ~ /00$/ {print}' > result4.txt; }
683593+1 records in
683593+1 records out
2800000000 bytes (2.8 GB) copied, 52.7871 s, 53.0 MB/s

real	0m52.900s <--I/Oサイズを小さくして、並列処理効率を上げると速くなる
user	0m51.536s
sys	0m3.348s
$ strace dd if=sample.txt bs=1024k|while read LINE
> do
> sleep 10
> echo $LINE
> done
execve("/bin/dd", ["dd", "if=sample.txt", "bs=1024k"], [/* 65 vars */]) = 0
（中略）
read(0, "00000001 00000002 00000003 00000"..., 1048576) = 1048576
write(1, "00000001 00000002 00000003 00000"..., 1048576^C <unfinished ...> <--I/Oサイズが1MBの場合、処理が直列化

補足

プロセスのCPU使用率とI/O量の分析

@hasegaw さんの prociostat.pl と visualize_prociostat.Rで分析してみると、awk file(約45MB/s) より cat file|awk(55MB/s) のほうがI/Oのスループットが高いことがわかります。*1
awk file だと I/O と ON CPU での処理が完全に直列になりますが、cat file|awk だとパイプ用バッファの恩恵を享受して awk が ON CPU で処理中にも cat がささやかながらもI/O処理を並列実行してくれるからだと思います。

- awk file (delta_read_bytes は約 45MB/s)

- cat file| (delta_read_bytes は約 55MB/s)

- |awk

上記グラフを作成した手順は以下の通りです。

$ perl prociostat.pl -C -p 7000-9999 # prociostat.pl を実行後、計測したい処理（awk や cat|awk）を実行 
^C # 計測したい処理が終わったら Ctrl + C で終了
$ ls *.txt|xargs -n1 Rscript visualize_prociostat.R # R でグラフ化

I/Oサイズ

I/Oサイズは関係ありませんでした。cat のI/Oサイズが 32KB、awk のI/Oサイズが 4KB だったので、awk のI/Oサイズを 32KB に大きくしても実行時間はほぼ同じです。

awk の I/O サイズは 4KB

$ strace awk '$5 ~ /00$/ {print}' sample.txt > result1.txt
...
read(3, "00299009 00000002 00000003 00000"..., 4096) = 4096 <- I/Oサイズが 4KB
read(3, " 00000002 00000003 00000004 2015"..., 4096) = 4096

cat のI/Oサイズは 32 KB

$ strace cat sample.txt | awk '$5 ~ /00$/ {print}' > result2.txt
...
read(3, "004 2015-07-28T14:23:25\n00362790"..., 32768) = 32768 <- I/Oサイズが 32KB
write(1, "004 2015-07-28T14:23:25\n00362790"..., 32768) = 32768

AWKBUFSIZE 環境変数で awk のI/Oサイズを大きくしても、実行時間はほぼ変わらない。

$ export AWKBUFSIZE=32768
$ strace awk '$5 ~ /00$/ {print}' sample.txt > result1.txt
...
read(3, "00688129 00000002 00000003 00000"..., 32768) = 32768 <- I/Oサイズが 32KB になっていることを確認
read(3, " 00000002 00000003 00000004 2015"..., 32768) = 32768
...
$ time { awk '$5 ~ /00$/ {print}' sample.txt > result1.txt; } 
real	1m10.359s <-- I/Oサイズを32KBにしても実行時間はほぼ同じ
user	0m56.891s
sys	0m1.491s

awk のソース

今回使った gawk 3.1.7 のソースを見ると、AWKBUFSIZE 環境変数を設定すると read(2) システムコール発行時の I/O サイズを調整できることがわかります。

http://git.savannah.gnu.org/cgit/gawk.git/tree/io.c?id=gawk-3.1.7

/* iop_alloc --- allocate an IOBUF structure for an open fd */

static IOBUF *
iop_alloc(int fd, const char *name, IOBUF *iop)
{
	struct stat sbuf;
	struct open_hook *oh;
	int iop_malloced = FALSE;

	if (iop == NULL) {
		emalloc(iop, IOBUF *, sizeof(IOBUF), "iop_alloc");
		iop_malloced = TRUE;
	}
	memset(iop, '\0', sizeof(IOBUF));
	iop->flag = 0;
	iop->fd = fd;
	iop->name = name;

	/* walk through open hooks, stop at first one that responds */
	for (oh = open_hooks; oh != NULL; oh = oh->next) {
		if ((iop->opaque = (*oh->open_func)(iop)) != NULL)
			break;
	}

	if (iop->fd == INTERNAL_HANDLE)
		return iop;

	if (iop->fd == INVALID_HANDLE) {
		if (iop_malloced)
			free(iop);
		return NULL;
	}
	if (isatty(iop->fd))
		iop->flag |= IOP_IS_TTY;
	iop->readsize = iop->size = optimal_bufsize(iop->fd, & sbuf);
	iop->sbuf = sbuf;
	if (do_lint && S_ISREG(sbuf.st_mode) && sbuf.st_size == 0)
		lintwarn(_("data file `%s' is empty"), name);
	errno = 0;
	iop->count = iop->scanoff = 0;
	emalloc(iop->buf, char *, iop->size += 2, "iop_alloc");
	iop->off = iop->buf;
	iop->dataend = NULL;
	iop->end = iop->buf + iop->size;
	iop->flag |= IOP_AT_START;
	return iop;
}

...

/* <getarecord>=                                                            */
/* get_a_record --- read a record from IOP into out, return length of EOF, set RT */

static int
get_a_record(char **out,        /* pointer to pointer to data */
        IOBUF *iop,             /* input IOP */
        int *errcode)           /* pointer to error variable */
{
        struct recmatch recm;
        SCANSTATE state;
        RECVALUE ret;
        int retval;
        NODE *rtval = NULL;
	static RECVALUE (*lastmatchrec)P((IOBUF *iop, struct recmatch *recm, SCANSTATE *state)) = NULL;

        if (at_eof(iop) && no_data_left(iop))
                return EOF;

	if (iop->get_record != NULL)
		return (*iop->get_record)(out, iop, errcode);

        /* <fill initial buffer>=                                                   */
        if (has_no_data(iop) || no_data_left(iop)) {
		if (is_internal(iop)) {
			iop->flag |= IOP_AT_EOF;
			return EOF;
		}
                iop->count = read(iop->fd, iop->buf, iop->readsize);
                if (iop->count == 0) {
                        iop->flag |= IOP_AT_EOF;
                        return EOF;
                } else if (iop->count == -1) {
                        if (! do_traditional && errcode != NULL) {
                                *errcode = errno;
                                iop->flag |= IOP_AT_EOF;
                                return EOF;
                        } else
                                fatal(_("error reading input file `%s': %s"),
                                        iop->name, strerror(errno));
                } else {
                        iop->dataend = iop->buf + iop->count;
                        iop->off = iop->buf;
                }
        }

http://git.savannah.gnu.org/cgit/gawk.git/tree/posix/gawkmisc.c?id=gawk-3.1.7

/* optimal_bufsize --- determine optimal buffer size */

/*
 * Enhance this for debugging purposes, as follows:
 *
 * Always stat the file, stat buffer is used by higher-level code.
 *
 * if (AWKBUFSIZE == "exact")
 * 	return the file size
 * else if (AWKBUFSIZE == a number)
 * 	always return that number
 * else
 * 	if the size is < default_blocksize
 * 		return the size
 *	else
 *		return default_blocksize
 *	end if
 * endif
 *
 * Hair comes in an effort to only deal with AWKBUFSIZE
 * once, the first time this routine is called, instead of
 * every time.  Performance, dontyaknow.
 */

size_t
optimal_bufsize(fd, stb)
int fd;
struct stat *stb;
{
	char *val;
	static size_t env_val = 0;
	static short first = TRUE;
	static short exact = FALSE;

	/* force all members to zero in case OS doesn't use all of them. */
	memset(stb, '\0', sizeof(struct stat));

	/* always stat, in case stb is used by higher level code. */
	if (fstat(fd, stb) == -1)
		fatal("can't stat fd %d (%s)", fd, strerror(errno));

	if (first) {
		first = FALSE;

		if ((val = getenv("AWKBUFSIZE")) != NULL) {
			if (strcmp(val, "exact") == 0)
				exact = TRUE;
			else if (ISDIGIT(*val)) {
				for (; *val && ISDIGIT(*val); val++)
					env_val = (env_val * 10) + *val - '0';

				return env_val;
			}
		}
	} else if (! exact && env_val > 0)
		return env_val;
	/* else
	  	fall through */

	/*
	 * System V.n, n < 4, doesn't have the file system block size in the
	 * stat structure. So we have to make some sort of reasonable
	 * guess. We use stdio's BUFSIZ, since that is what it was
	 * meant for in the first place.
	 */
#ifdef HAVE_ST_BLKSIZE
#define DEFBLKSIZE	(stb->st_blksize > 0 ? stb->st_blksize : BUFSIZ)
#else
#define DEFBLKSIZE	BUFSIZ
#endif

	if (S_ISREG(stb->st_mode)		/* regular file */
	    && 0 < stb->st_size			/* non-zero size */
	    && (stb->st_size < DEFBLKSIZE	/* small file */
		|| exact))			/* or debugging */
		return stb->st_size;		/* use file size */

	return DEFBLKSIZE;
}

補足

$ ls -hs result*
45M result1.txt  45M result2.txt
$ md5sum result1.txt 
5c54faf7dc1e0cf93cbf474a921f05d8  result1.txt
$ md5sum result2.txt 
5c54faf7dc1e0cf93cbf474a921f05d8  result2.txt

検証環境

Lenovo Thinkpad X230
CPU: 2 Core * 2 Thread (Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz)
Memory: 16GB
ストレージ: SSD(Intel製)
Oracle Linux 6.6 (2.6.39-400.17.1.el6uek.x86_64)
ファイルシステム: ext4
awk 3.1.7
cat 8.4

参考

追記(2015/07/30):

「catはposix_fadivse(SEQUENTIAL)してるからOSの先読み量が倍に」が正しいんじゃないの？/id:nippondanji そうですね。ただ「マルチプロセスだと速い」は理由の説明として不十分。正解は多分→https://gist.github.com/kazuho/074e2c82a2c09b9255c5
「catはposix_fadivse(SEQUENTIAL)してるからOSの先読み量が倍に」が正しいんじゃないの？/id:nippondanji そうですね。ただ「マルチプロセスだと速い」は理由の説明として不十分。正解は多分→https://gist.github.com/kazuho/074e2c82a2c09b9255c5 - kazuhookuのコメント / はてなブックマーク

「catはposix_fadivse(SEQUENTIAL)してるからOSの先読み量が倍に」は違いました。
本検証に使った cat は"posix_fadivse(SEQUENTIAL)" が入る前のものでした。

$ cat --version|head -1                                                     
cat (GNU coreutils) 8.4
$ strace cat sample.txt 2>&1|grep -C2 fadvise                               
$ /usr/src/gnu/coreutils-8.24/src/cat --version|head -1                     
cat (GNU coreutils) 8.24
$ strace /usr/src/gnu/coreutils-8.24/src/cat sample.txt 2>&1|grep -C2 fadvise
open("sample.txt", O_RDONLY)            = 3
fstat(3, {st_mode=S_IFREG|0664, st_size=2800000000, ...}) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa10fc25000
read(3, "00000001 00000002 00000003 00000"..., 131072) = 131072

8.4 より 8.24 のほうが少し速くなっていますね。

# echo 3 > /proc/sys/vm/drop_caches
$ time cat sample.txt > /dev/null

real	0m10.597s
user	0m0.014s
sys	0m1.188s
# echo 3 > /proc/sys/vm/drop_caches
$ time /usr/src/gnu/coreutils-8.24/src/cat sample.txt > /dev/null

real	0m10.374s
user	0m0.003s
sys	0m1.028s

2010年7月に cat に posix_fadivse(SEQUENTIAL) が入っていますが、検証に使っていたのはこの変更が入る前のものでした。古っ。。。

Savannah Git Hosting - coreutils.git/commitdiff

provide POSIX_FADV_SEQUENTIAL hint to appropriate utils
author	P&#225;draig Brady <P@draigBrady.com>	
Tue, 20 Jul 2010 17:51:01 +0000 (18:51 +0100)
committer	P&#225;draig Brady <P@draigBrady.com>	
Thu, 22 Jul 2010 00:04:47 +0000 (01:04 +0100)
Following on from commit dae35bac, 01-03-2010,
"sort: inform the system about our input access pattern"
apply the same hint to all appropriate utils.
This currently gives around a 5% speedup for reading
large files from fast flash devices on GNU/Linux.

* src/base64.c: Call fadvise (..., FADVISE_SEQUENTIAL);
* src/cat.c: Likewise.
* src/cksum.c: Likewise.
* src/comm.c: Likewise.
* src/cut.c: Likewise.
* src/expand.c: Likewise.
* src/fmt.c: Likewise.
* src/fold.c: Likewise.
* src/join.c: Likewise.
* src/md5sum.c: Likewise.
* src/nl.c: Likewise.
* src/paste.c: Likewise.
* src/pr.c: Likewise.
* src/ptx.c: Likewise.
* src/shuf.c: Likewise.
* src/sum.c: Likewise.
* src/tee.c: Likewise.
* src/tr.c: Likewise.
* src/tsort.c: Likewise.
* src/unexpand.c: Likewise.
* src/uniq.c: Likewise.
* src/wc.c: Likewise, unless we don't actually read().

...

diff --git a/src/cat.c b/src/cat.c
index c4a2a9e..47b5053 100644 (file)
--- a/src/cat.c
+++ b/src/cat.c
@@ -34,6 +34,7 @@
 
 #include "system.h"
 #include "error.h"
+#include "fadvise.h"
 #include "full-write.h"
 #include "quote.h"
 #include "safe-read.h"
@@ -700,6 +701,8 @@ main (int argc, char **argv)
         }
       insize = io_blksize (stat_buf);
 
+      fdadvise (input_desc, 0, 0, FADVISE_SEQUENTIAL);
+
       /* Compare the device and i-node numbers of this input file with
          the corresponding values of the (output file associated with)
          stdout, and skip this input file if they coincide.  Input

追記(2015/08/01):

cat file|awk の場合
$ time { cat sample.txt | awk '$5 ~ /00$/ {print}' > result2.txt; }
real	0m59.299s
user	0m56.994s
sys	0m4.005s
と、awk file より cat file|awk のほうが速い。cat file|awk は実行時間をCPU時間が上回っているため、cat のプロセスと awk のプロセスが並列実行されていると想定できる。
実行時間(59.299s) < CPU時間(60.999) = user(56.994s) + sys(4.005s)

cat file|awk では実行時間 < CPU時間となっていますが、cat が I/O wait していないとは限りません。実行時間は単純に終了時間 - 開始時間で算出しますが、CPU時間はプロセスのCPU時間を getrusage システムコールで取得します。catのプロセスと awk のプロセスが並列実行されている期間があるため、実行時間 < CPU時間となっています。例えば、CPUバウンドな2プロセスがほぼ完全に並列実行されると、実行時間 * 2 ≒ CPU時間となったりします。

*1:Perl スクリプトで /proc//stat の delta_utime、delta_stime、delta_delayacct_blkio_ticks、/proc//io の read_bytes、write_bytes を1秒間隔でCSVに出力して、Rでグラフにしています。