ablog

Notes from a clumsy, restless engineer

True CPU Utilization by Brendan Gregg@Netflix

Two of the people my "simple, systematic performance analysis of X" approach traces back to*1,
Brendan Gregg@Netflix and Tanel Poder@Gluent, had a nice exchange, so I am taking notes here.

Roughly speaking, Brendan Gregg says the following.

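  • Memory access is slow compared with a single CPU cycle*2, so even while a thread is on-CPU it is often stalled waiting on memory I/O*3.
  • IPC (Instructions Per Cycle) tells you how many instructions were executed per cycle. For example, when IPC < 1 the CPU is not completing even one instruction per cycle, so a memory I/O bottleneck is likely.
  • On Linux, the tiptop(1) command shows IPC per process.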
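As a minimal sketch of the IPC rule of thumb above: IPC is just retired instructions divided by CPU cycles. The function names and the counter values below are hypothetical and chosen only for illustration; real numbers would come from hardware performance counters (PMCs), and the 1.0 threshold is the heuristic quoted later in this post, not a hard limit.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Return instructions per cycle from two raw counter values."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def memory_stall_suspected(ipc_value: float, threshold: float = 1.0) -> bool:
    """Heuristic only: an IPC below the threshold hints at memory stalls."""
    return ipc_value < threshold


# Hypothetical counter readings, not real measurements.
sample_ipc = ipc(instructions=4_000_000_000, cycles=10_000_000_000)
verdict = "likely memory-stalled" if memory_stall_suspected(sample_ipc) else "not obviously memory-stalled"
print(f"IPC = {sample_ipc:.2f} ({verdict})")
```

With these made-up values the CPU retires 0.4 instructions per cycle, so most of the "busy" cycles would be spent waiting rather than executing.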

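This reminded me of a question I was asked long ago: after a DB upgrade the DB server's CPU clock frequency had gone up, yet the CPU time of a particular job had not dropped in proportion. I gave an explanation along these lines.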
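A rough model of that situation, as a sketch rather than a measurement: assume CPU time splits into compute cycles, which scale with the clock, and memory stall time, which is fixed in wall-clock terms because DRAM latency does not improve with the core clock. That assumption is a simplification, and all the numbers below are hypothetical.

```python
def cpu_time_seconds(compute_cycles: float, clock_hz: float, stall_seconds: float) -> float:
    """CPU time = clock-dependent compute time + clock-independent memory stall time."""
    return compute_cycles / clock_hz + stall_seconds


# Hypothetical workload: 2e9 compute cycles plus 1 second stalled on memory.
COMPUTE_CYCLES = 2e9
STALL_SECONDS = 1.0

before = cpu_time_seconds(COMPUTE_CYCLES, 2.0e9, STALL_SECONDS)  # old server, 2.0 GHz
after = cpu_time_seconds(COMPUTE_CYCLES, 3.0e9, STALL_SECONDS)   # new server, 3.0 GHz

clock_ratio = 3.0e9 / 2.0e9
speedup = before / after
print(f"clock is {clock_ratio:.2f}x faster, CPU time improves only {speedup:.2f}x")
```

In this model the clock rises 1.5x but the CPU time improves only 1.2x, because the stalled second is untouched by the faster clock.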

The metric we all use for CPU utilization is deeply misleading, and getting worse every year. What is CPU utilization? How busy your processors are? No, that's not what it measures. Yes, I'm talking about the "%CPU" metric used everywhere, by everyone. In every performance monitoring product. In top(1).

(snip)

CPU utilization has become a deeply misleading metric: it includes cycles waiting on main memory, which can dominate modern workloads.

(snip)

If your IPC is < 1.0, you are likely memory stalled, and software tuning strategies include reducing memory I/O, and improving CPU caching and memory locality, especially on NUMA systems. Hardware tuning includes using processors with larger CPU caches, and faster memory, busses, and interconnects.

(snip)

As for top(1), there is tiptop(1) for Linux, which shows IPC by process:

(snip)

In the cloud
If you are in a virtual environment, you might not have access to PMCs, depending on whether the hypervisor supports them for guests. I recently posted about The PMCs of EC2: Measuring IPC, showing how PMCs are now available for dedicated host types on the AWS EC2 Xen-based cloud.

(snip)

Other reasons CPU Utilization is misleading
It's not just memory stall cycles that makes CPU utilization misleading. Other factors include:

  • Temperature trips stalling the processor.
  • Turboboost varying the clockrate.
  • The kernel varying the clock rate with speed step.
  • The problem with averages: 80% utilized over 1 minute hiding bursts of 100%.
  • Spin locks: the CPU is utilized, and has high IPC, but the app is not making logical forward progress.
CPU Utilization is Wrong
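The point about averages in the list quoted above can be made concrete with a few lines of arithmetic. This is only an illustration: the per-second samples are hypothetical, picked so that the one-minute average comes out at the 80% mentioned in the quote.

```python
from statistics import mean

# Hypothetical per-second CPU utilization samples (percent) over one minute:
# 40 seconds fully saturated, then 20 seconds at 40%.
samples = [100.0] * 40 + [40.0] * 20

one_minute_average = mean(samples)
saturated_seconds = sum(1 for s in samples if s >= 100.0)

print(f"1-minute average: {one_minute_average:.0f}%")
print(f"seconds at 100%: {saturated_seconds} of {len(samples)}")
```

A dashboard that only shows the one-minute figure would report 80% here, even though the CPU was saturated for two thirds of the interval.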

In a Twitter reply to Brendan, Tanel pointed to RAM is the new disk – and how to measure its performance – Part 2 – Tools | Tanel Poder: SQL Performance Tuning, System Troubleshooting and Training. Its Basic CPU Performance Counter Reference is an excellent reference for the performance counters you can collect with perf.


Column-oriented databases such as MonetDB/X100 appear to be designed with optimized use of the CPU cache in mind.

The X100 engine is designed for in-cache execution, which means that the only “randomly” accessible memory is the CPU cache, and main memory is already considered part of secondary storage, used in the buffer manager for buffering I/O operations and large intermediate results.

MonetDB/X100 - A DBMS In The CPU Cache

These slides by Tanel Poder explain it visually and intuitively.


It reminded me of slides by Shinkubo-san that I saw long ago.


Below is a paper that id:kumagi told me about. Among other things, it says that the IPC of databases (OLTP workloads) is low and leaves room for improvement.

9. CONCLUSION
In this paper, we perform a detailed micro-architectural analysis of the in-memory OLTP systems contrasting them to the disk-based OLTP systems. Our study demonstrates that in-memory OLTP system behave very similarly to the disk-based OLTP systems despite all the design differences and lighter storage manager components of the memory-optimized systems. The lighter storage manager components reduce the instruction footprint at the storage manager layer, but the overall instruction footprint of an in-memory OLTP system is still large, which leads to a poor L1-I locality and high number of L1 instruction misses. Even though optimized compilation techniques help in minimizing the L1 instruction misses, in the absence of the instruction misses the impact of long-latency data misses surfaces resulting in low IPC values.

Micro-architectural Analysis of In-memory OLTP

*1: Others include Craig@OraPub and Kyle@AWS.

*2: On the order of several hundred times slower.

*3: On NUMA systems, remote memory access is even slower than local memory access.