ablog

不器用で落着きのない技術者のメモ

vm.min_free_kbytes からの wmark_{min|low|high} 算出式

Linux のページ回収の閾値である wmark_min、wmark_low、wmark_high の算出式を調べたメモ。

算出式

正確には NUMA ノードの ZONE 毎に計算されるが、合計の概算は下記の式で計算できる。

min_free_kbytes = sqrt(物理メモリサイズ(KB) * 16)
wmark_min = min_free_kbytes
wmark_low = wmark_min + (wmark_min / 4)
wmark_high = wmark_min + (wmark_min / 2) 

具体例

例えば、x86_64 で物理メモリサイズが 16GB の場合、以下のようになる。

min_free_kbytes = sqrt(16,777,216KB * 16) = 16,384 KB
wmark_min = 16,384 KB
wmark_low = 16,384 + (16,384 / 4) = 20,480 KB
wmark_high = 16,384 + (16,384 / 2) = 24,576 KB 

実機確認結果

手元の環境で確認してみた → How to check usage of different parts of memory?

$ uname -r
2.6.39-400.17.1.el6uek.x86_64 ★ Kernel 2.6.39-400 では managed_pages ではなく present_pages を算出に使っている
$ cat /proc/meminfo|head -1
MemTotal:       16158544 kB ★ 物理メモリサイズ
$ sysctl -a|grep min_free_kbytes
vm.min_free_kbytes = 16114 ★ min_free_kbytes = 16,114 KB
$ cat /proc/zoneinfo
Node 0, zone      DMA
  pages free     3976
        min      3 ★ min_pages = 3 * 4KB = 12 KB
        low      3 ★ low_pages = 3 * 4KB = 12 KB 
        high     4 ★ high_pages = 4 * 4KB = 16 KB 
        scanned  0
        spanned  4080
        present  3920 ★
...
Node 0, zone    DMA32
  pages free     327938
        min      822 ★ min_pages = 822 * 4KB = 3,288 KB
        low      1027 ★ low_pages = 1027 * 4KB = 4,108 KB
        high     1233 ★ high_pages = 1027 * 4KB = 4,932 KB
        scanned  0
        spanned  1044480
        present  828008 ★
...
Node 0, zone   Normal
  pages free     3994
        min      3202 ★ min_pages = 3202 * 4KB = 12,808 KB
        low      4002 ★ low_pages = 4002 * 4KB = 16,008 KB
        high     4803 ★ high_pages = 4803 * 4KB = 19,212 KB
        scanned  0
        spanned  3270144
        present  3225435

DMA、DMA32、Noraml の3つの ZONE の min、low、high を合計すると以下の通り。

min:  16,108 KB (12 + 3,288 + 12,808)
low:  20,128 KB (12 + 4,108 + 16,008)
high: 24,160 KB (16 + 4,932 + 19,212)

上記のように実際にはシステム全体ではなく NUMA ノード毎、さらに ZONE(x86_64 の場合、DMA、DMA32、Normal) 毎に閾値があり、各 ZONE 単位で閾値を下回るとページ回収が行われるようです。
正確にNUMAノードのZONE毎の閾値を見積もるにはがちゃぴん先生のツッコミ通り、もう少し複雑な計算式になります。

前提知識

Linux*1 は空きメモリを有効活用し、ファイル*2を読み書きするとメモリにキャッシュ*3するが、空きが少なくなるとページを回収して空きを確保する。LRU リストを基に使用頻度が低いページが解放される*4

ページ回収(reclaim)の種類と動作
  • Background reclaim
    • 空きメモリが wmark_low(low pages) 未満になるとカーネルスレッド kswapd がバックラウンドでページ回収を初める。
    • 空きメモリが wmark_high(high pages) を超えると kswapd はページ回収をやめる。
  • Direct reclaim
    • 空きメモリが wmark_min(min pages) 未満になるとプロセスがメモリを割当を要求するとページ回収して空きを作ってから割当られる。


Systems Performance: Enterprise and the Cloud Figure - 7.8 kswapd wake-ups and modes より

Reducing Memory Access Latency より

参考情報

Professional Linux Kernel Architecture (Wrox Programmer to Programmer)

Professional Linux Kernel Architecture (Wrox Programmer to Programmer)

3.2.2.3. Calculation of Zone Watermarks

Before calculating the various watermarks, the kernel first determines the minimum memory space that must remain free for critical allocations. This value scales nonlinearly with the size of the available RAM. It is stored in the global variable min_free_kbytes. Figure 3.4 provides an overview of the scaling behavior, and the inset — which does not use a logarithmic scale for the main memory size in contrast to the main graph — shows a magnification of the region up to 4 GiB. Some exemplary values to provide a feeling for the situation on systems with modest memory that are common in desktop environments are collected in Table 3.1. An invariant is that not less than 128 KiB but not more than 64 MiB may be used. Note, however, that the upper bound is only necessary on machines equipped with a really satisfactory amount of main memory. The file /proc/sys/vm/min_free_kbytes allows reading and adapting the value from userland.

Filling the watermarks in the data structure is handled by init_per_zone_pages_min, which is invoked during kernel boot and need not be started explicitly.

setup_per_zone_pages_min sets the pages_min, pages_low, and pages_high elements of struct zone. After the total number of pages outside the highmem zone has been calculated (and stored in lowmem_ pages), the kernel iterates over all zones in the system and performs the following calculation:

  • mm/page_alloc.c
void setup_per_zone_pages_min(void)
{
        unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
        unsigned long lowmem_pages = 0;
        struct zone *zone;
        unsigned long flags;

...
        for_each_zone(zone) {
                u64 tmp;

                tmp = (u64)pages_min * zone->present_pages;
                do_div(tmp,lowmem_pages);
                if (is_highmem(zone)) {
                        int min_pages;

                        min_pages = zone->present_pages / 1024;
                        if (min_pages < SWAP_CLUSTER_MAX)
                                min_pages = SWAP_CLUSTER_MAX;
                        if (min_pages > 128)
                                min_pages = 128;
                        zone->pages_min = min_pages;
                } else {
                        zone->pages_min = tmp;
                }

                zone->pages_low = zone->pages_min + (tmp >> 2);
                zone->pages_high = zone->pages_min + (tmp >> 1);
        }
}



static void __setup_per_zone_wmarks(void)
{
	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
	unsigned long lowmem_pages = 0;
	struct zone *zone;
	unsigned long flags;

	/* Calculate total number of !ZONE_HIGHMEM pages */
	for_each_zone(zone) {
		if (!is_highmem(zone))
			lowmem_pages += zone->managed_pages;
	}

	for_each_zone(zone) {
		u64 tmp;

		spin_lock_irqsave(&zone->lock, flags);
		tmp = (u64)pages_min * zone->managed_pages;
		do_div(tmp, lowmem_pages);
		if (is_highmem(zone)) {
			/*
			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
			 * need highmem pages, so cap pages_min to a small
			 * value here.
			 *
			 * The WMARK_HIGH-WMARK_LOW and (WMARK_LOW-WMARK_MIN)
			 * deltas controls asynch page reclaim, and so should
			 * not be capped for highmem.
			 */
			unsigned long min_pages;

			min_pages = zone->managed_pages / 1024;
			min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
			zone->watermark[WMARK_MIN] = min_pages;
		} else {
			/*
			 * If it's a lowmem zone, reserve a number of pages
			 * proportionate to the zone's size.
			 */
			zone->watermark[WMARK_MIN] = tmp;
		}

		zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
		zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);

		setup_zone_migrate_reserve(zone);
		spin_unlock_irqrestore(&zone->lock, flags);
	}

	/* update totalreserve_pages */
	calculate_totalreserve_pages();
}

...

/*
 * Initialise min_free_kbytes.
 *
 * For small machines we want it small (128k min).  For large machines
 * we want it large (64MB max).  But it is not linear, because network
 * bandwidth does not increase linearly with machine size.  We use
 *
 * 	min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy: ★
 *	min_free_kbytes = sqrt(lowmem_kbytes * 16) ★
 *
 * which yields
 *
 * 16MB:	512k
 * 32MB:	724k
 * 64MB:	1024k
 * 128MB:	1448k
 * 256MB:	2048k
 * 512MB:	2896k
 * 1024MB:	4096k
 * 2048MB:	5792k
 * 4096MB:	8192k
 * 8192MB:	11584k
 * 16384MB:	16384k
 */
int __meminit init_per_zone_wmark_min(void)
{
	unsigned long lowmem_kbytes;

	lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10);

	min_free_kbytes = int_sqrt(lowmem_kbytes * 16);
	if (min_free_kbytes < 128)
		min_free_kbytes = 128;
	if (min_free_kbytes > 65536)
		min_free_kbytes = 65536;
	setup_per_zone_wmarks();
	refresh_zone_stat_thresholds();
	setup_per_zone_lowmem_reserve();
	setup_per_zone_inactive_ratio();
	return 0;
}
module_init(init_per_zone_wmark_min)

This is an example of the problem and solution above.

  • System Memory: 2GB
  • High memory pressure

In this case, min_free_kbytes and watermarks are automatically
set as follows.
(Here, watermark shows sum of the each zone's watermark.)

min_free_kbytes: 5752
watermark[min] : 5752
watermark[low] : 7190
watermark[high]: 8628

Tunable watermark [LWN.net]

min_free_kbytes:

This is used to force the Linux VM to keep a minimum number of kilobytes free. The VM uses this number to compute a watermark[WMARK_MIN] value for each lowmem zone in the system. Each lowmem zone gets a number of reserved free pages based proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC allocations; if you set this to lower than 1024KB, your system will become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.

vm.txt\sysctl\Documentation - kernel/git/stable/linux.git - Linux kernel stable tree

*1:Linuxに限らず大体のOSは似たような動作をします

*2:ブロックデバイスのブロック

*3:ページキャッシュ

*4:1.ページキャッシュ解放、2.プロセスのメモリ(Anon pages)のページアウトのいずれかを行う。1と2のバランスはカーネルパラメータ vm.swapiness で調整できる。