"Linus Torvalds, apparently unhappy with the cost of page faults on recent CPUs" was an interesting read

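The 本の虫 (Hon no Mushi) article "Linus Torvalds, apparently unhappy with the cost of page faults on recent CPUs"
is based on a Google+ post by Linus Torvalds. That post, and the comments on it from Brendan Gregg and others, were interesting enough that I am noting them here.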

Linus Torvalds
...
I wrote a small test-program to pinpoint this more exactly, and it's interesting. On my Haswell CPU, the cost of a single page fault seems to be about 715 cycles. The "iret" to return is 330 cycles. So just the page fault and return is about 1050 cycles. That cost might be off by some small amount, but it's close. On another test case, I got a number that was in the 1150 cycle range, but that had more noise, so 1050 seems to be the minimum cost.
Why is that interesting? It's interesting, because the kernel software overhead for looking up the page and putting it into the page tables is actually much lower. In my worst-case situation (admittedly a pretty made up case where we just end up mapping the fixed zero-page), those 1050 cycles is actually 80.7% of all the CPU time. That's the extreme case where neither kernel nor user space does much anything else that fault pages, but on my actual kernel build, it's still 5% of all CPU time.
On an older 32-bit Core Duo, my test program says that the page fault overhead is "just" 58% instead of 80%, and it does seem to be because page faults have gotten slower (the cost on Core Duo seems to be "just" 700 + 240 cycles).
Another part of it is probably because Haswell is better at normal code (so the fault overhead is relatively more noticeable), but it was sad to see how this cost is going in the wrong direction.
I'm talking to some Intel engineers, trying to see if this can be improved.
https://plus.google.com/+LinusTorvalds/posts/YDKRFDwHwr6
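
Linus's test program itself is not shown in the post. As a rough illustration of the kind of workload he describes (a process that does little except take page faults), here is a minimal sketch that first-touches every page of an anonymous mapping and reads the minor-fault counter before and after. The function name `touch_pages` and the 16 MiB size are my own assumptions, the code is Unix-only, and Python interpreter overhead dominates the timing, so it cannot reproduce cycle counts like the ones quoted above.

```python
import mmap
import resource
import time


def touch_pages(size_bytes=16 * 1024 * 1024):
    """First-touch every page of an anonymous mapping and count minor faults.

    Hypothetical helper for illustration only; not Linus's test program.
    """
    page = mmap.PAGESIZE
    # Private anonymous mapping: physical pages are only provided on first touch.
    buf = mmap.mmap(-1, size_bytes, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
        start = time.perf_counter()
        for offset in range(0, size_bytes, page):
            buf[offset] = 1  # the first write to a page normally triggers a fault
        elapsed = time.perf_counter() - start
        after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    finally:
        buf.close()
    return {
        "pages": size_bytes // page,
        "minor_faults": after - before,
        "seconds": elapsed,
    }


if __name__ == "__main__":
    result = touch_pages()
    print(result)
```

The fault count does not have to match the page count: with transparent huge pages, for example, the kernel may satisfy many small pages with a single fault.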

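The cycles spent on page fault handling (page fault + iret) are about 1050 on Haswell and 940 on Core Duo, which is not a big difference, yet the share of CPU time is reportedly 80% on Haswell versus 58% on Core Duo. Haswell has more CPU processing power than Core Duo, so if memory access latency is about the same on both, this may not mean that Haswell performs badly. It may rather mean that slow memory access has become more conspicuous relative to the gains in CPU performance.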
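
To make the arithmetic concrete, here is a small sketch that works only from the figures quoted in Linus's post. It assumes (my assumption, not something stated in the post) that the quoted percentage is simply `total / (total + other)`, where `other` is every cycle spent per fault outside the trap and the `iret`. The function names are made up for this note.

```python
# Figures quoted in Linus's post:
# (fault cycles, iret cycles, quoted round-trip total, quoted share of CPU time)
QUOTED = {
    "Haswell": (715, 330, 1050, 0.807),
    "Core Duo": (700, 240, 940, 0.58),
}


def other_cycles_per_fault(total, share):
    """Cycles per fault spent outside the trap and iret.

    ASSUMPTION: share = total / (total + other), solved for `other`.
    """
    return total * (1.0 - share) / share


def summarize():
    rows = {}
    for cpu, (fault, iret, total, share) in QUOTED.items():
        rows[cpu] = {
            "fault_plus_iret": fault + iret,
            "other_cycles": other_cycles_per_fault(total, share),
        }
    return rows


if __name__ == "__main__":
    for cpu, row in summarize().items():
        print(cpu, row["fault_plus_iret"], round(row["other_cycles"]))
```

Under that assumption, the work outside the trap shrinks from roughly 680 cycles per fault on Core Duo to roughly 250 on Haswell, while the trap and return grow by only about 110 cycles. That fits the reading that the share rose mainly because everything else got faster.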

Brendan Gregg
Wow, excellent analysis, and 80% is huge. And this is a good example of why %CPU alone can be misleading, until you study the type of cycles. Did you use perf for profiling? It would be interesting to make a flame graph from the perf data (eg, http://www.brendangregg.com/perf.html#FlameGraphs).
https://plus.google.com/+LinusTorvalds/posts/YDKRFDwHwr6

I take Brendan Gregg's "type of cycles" to mean that one should check whether the cycles are stalled cycles, for example cycles spent waiting on memory I/O.

Alex Solomatnikov
+Linus Torvalds page fault is a precise exception which probably requires memory fence (and synchronization of all processor state) for page fault to be restartable. Memory fences have long latency on modern CPUs because of complex memory systems, and, moreover, implementations are usually conservative for simplicity. It would be interesting to compare page fault latency to memory fence latency on the same CPU.
https://plus.google.com/+LinusTorvalds/posts/YDKRFDwHwr6

As for the following article, which is cited with the remark "This story reminded me of an article on The Old New Thing":

The performance of the syscall trap gets a lot of attention.
I was reminded of a meeting that took place between Intel and Microsoft over fifteen years ago. (Sadly, I was not myself at this meeting, so the story is second-hand.)
Since Microsoft is one of Intel’s biggest customers, their representatives often visit Microsoft to show off what their latest processor can do, lobby the kernel development team to support a new processor feature, and solicit feedback on what sort of features would be most useful to add.
At this meeting, the Intel representatives asked, “So if you could ask for only one thing to be made faster, what would it be?”
Without hesitation, one of the lead kernel developers replied, “Speed up faulting on an invalid instruction.”
The Intel half of the room burst out laughing. “Oh, you Microsoft engineers are so funny!” And so the meeting ended with a cute little joke.
After returning to their labs, the Intel engineers ran profiles against the Windows kernel and lo and behold, they discovered that Windows spent a lot of its time dispatching invalid instruction exceptions. How absurd! Was the Microsoft engineer not kidding around after all?
No he wasn’t.
It so happens that on the 80386 chip of that era, the fastest way to get from V86-mode into kernel mode was to execute an invalid instruction! Consequently, Windows/386 used an invalid instruction as its syscall trap.
What’s the moral of this story? I’m not sure. Perhaps it’s that when you create something, you may find people using it in ways you had never considered.
The hunt for a faster syscall trap – The Old New Thing

If "invalid instruction" here refers to Windows once having implemented system calls with software interrupts (the INT instruction), I find it hard to believe that Intel did not know about it...

Dave
December 15, 2004 at 12:59 pm
Kristoffer, the reason they wanted the "hack" to be made faster was that all the code up to that point used INT instructions; MS-DOS and the BIOS calls used that convention. When you do an INT instruction inside a virtual-86 machine, it naturally needs to somehow invoke the protected mode operating system. In the 386 calling out of a V86 box was an expensive operation, and it happened a lot since nearly all the code that users ran was DOS apps.
Virtual-86 mode was really Intel’s answer to a much uglier problem. Intel had made it easy to get into protected mode (just flip a status bit) but there was no way to get OUT of protected mode, which was important in the 286 era (circa 1987 or so) because there was no V86 support in the CPU and all the existing apps were real mode MS-DOS. No protected-mode OS (say, OS/2 or Windows/286) was going to launch without some sort of support for existing apps.
The solution to get from protected mode back to real mode was to create a triple-fault condition that would cause the processor to reset itself and head back to the BIOS reset vector, where it would eventually make it to some OS code that would start running the real mode apps. I had understood that Gordon Letwin figured that out, but there are some other credits for it here:
http://www.x86.org/productivity/triplefault.htm
The hunt for a faster syscall trap – The Old New Thing

Looking more closely, a commenter named Dave writes that the "INT instruction" (software interrupt) was used for system calls. This Dave couldn't be David Cutler, could he...

ablog
