
Does zone_reclaim_mode=1 prevent pages from being allocated from a remote NUMA node's zones?

My understanding is that the wmark_{min|low|high} watermarks are computed for each zone on each NUMA node, and that with zone_reclaim_mode=1 page reclaim within a zone kicks in once the zone's free memory drops below wmark_low.
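
Those per-zone watermarks are visible in /proc/zoneinfo. Here is a minimal sketch (mine, not from the original post), assuming the usual /proc/zoneinfo layout where each zone block starts with a "Node N, zone ..." header followed by min/low/high lines:

/* zonewmark.c (illustrative): print each zone's header and its min/low/high watermarks */
#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *fp = fopen("/proc/zoneinfo", "r");
	char line[256];

	if (!fp) {
		perror("fopen /proc/zoneinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), fp)) {
		/* keep "Node N, zone ..." headers and the watermark lines */
		if (strncmp(line, "Node", 4) == 0 ||
		    strstr(line, " min ") ||
		    strstr(line, " low ") ||
		    strstr(line, " high "))
			fputs(line, stdout);
	}
	fclose(fp);
	return 0;
}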
I heard a claim that with zone_reclaim_mode=1, if page reclaim cannot keep up, memory is never allocated from a remote node's zones and a "page allocation failure" occurs, so I looked into it. As the excerpt below shows, it seems that when memory cannot be allocated from the local node's zones, it can still be allocated from a remote node's zones.

When vm.zone_reclaim_mode is enabled, the kernel attempts to free up or reclaim pages from the target node's memory before going off-node for the allocation. For example, these pages might be cached file pages or other applications' pages that have not been referenced for a relatively long time. The allocation overflows only if the attempt to reclaim local pages fails.

So, why does Linux provide this option? What is the benefit?

For some long running applications—for example, high performance technical computing applications—the overall runtime can vary dramatically, based on the locality of their memory references. When such an application is started, even with well-thought out memory policies, a given node's memory could be filled up with page cache pages from previous jobs or previous phases of the application. Enabling vm.zone_reclaim_mode allows the application to reclaim those cached file pages for its own use, rather than going off-node. This most likely benefits the application’s performance over its remaining lifetime.

The default setting for vm.zone_reclaim_mode is enabled if any of the distances in the SLIT are greater than a fixed threshold, and disabled otherwise. Currently, the threshold in most shipping distros is a SLIT value of 20, and this is the case for both Red Hat Enterprise Linux 5 and Red Hat Enterprise Linux 6. (The upstream kernel now uses a value of 30 as the threshold; this will appear in newer distro versions.) For a server that does not supply a populated SLIT, vm.zone_reclaim_mode defaults to disabled, because the remote distances in the kernel's default SLIT are all 20. If the default setting is not appropriate for your workload, you can change it with the following command:

sysctl -w vm.zone_reclaim_mode={0|1}

http://h20564.www2.hpe.com/hpsc/doc/public/display?docId=emr_na-c03261871
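
The default rule described in the quote can be checked by hand: each node's SLIT row is exported at /sys/devices/system/node/nodeN/distance. Below is a minimal sketch (mine, not from the HPE article) that mimics that decision; the threshold of 30 and the assumption of contiguous node IDs are simplifications for illustration.

/* reclaim_default.c (illustrative): would zone_reclaim_mode default to 1 on this box? */
#include <stdio.h>

#define RECLAIM_DISTANCE 30	/* upstream threshold; RHEL 5/6 era kernels used 20 */

int main(void)
{
	int node, exceeds = 0;

	for (node = 0; ; node++) {	/* assumes node IDs are contiguous */
		char path[64];
		FILE *fp;
		int d;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/distance", node);
		fp = fopen(path, "r");
		if (!fp)
			break;		/* no more nodes */
		while (fscanf(fp, "%d", &d) == 1)
			if (d > RECLAIM_DISTANCE)
				exceeds = 1;	/* some node is "far away" */
		fclose(fp);
	}
	printf("zone_reclaim_mode would default to %d\n", exceeds);
	return 0;
}

On many two-socket servers the SLIT reports a remote distance of around 21, which is why the older threshold of 20 enabled zone_reclaim_mode by default while the upstream threshold of 30 leaves it off; check your own machine's distance files rather than taking that value as given.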


Update (2016/08/18):
I also checked the Linux kernel*1 source. Reading get_page_from_freelist in mm/page_alloc.c, it looks like the zone_reclaim() call is skipped entirely when zone_reclaim_mode is 0: the allocation jumps straight to this_zone_full and the loop simply moves on to the next zone in the zonelist. With zone_reclaim_mode=1 the kernel first tries to reclaim within the zone, but in either case the remaining (possibly remote) zones are still scanned, so the allocation can fall back to a remote node's zones.

/*
 * get_page_from_freelist goes through the zonelist trying to allocate
 * a page.
 */
static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
		struct zone *preferred_zone, int migratetype)
{
	struct zoneref *z;
	struct page *page = NULL;
	int classzone_idx;
	struct zone *zone;
	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
	int zlc_active = 0;		/* set if using zonelist_cache */
	int did_zlc_setup = 0;		/* just call zlc_setup() one time */

	classzone_idx = zone_idx(preferred_zone);
zonelist_scan:
	/*
	 * Scan zonelist, looking for a zone with enough free.
	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
	 */
	for_each_zone_zonelist_nodemask(zone, z, zonelist,
						high_zoneidx, nodemask) {
		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
			!zlc_zone_worth_trying(zonelist, z, allowednodes))
				continue;
		if ((alloc_flags & ALLOC_CPUSET) &&
			!cpuset_zone_allowed_softwall(zone, gfp_mask))
				continue;
		/*
		 * When allocating a page cache page for writing, we
		 * want to get it from a zone that is within its dirty
		 * limit, such that no single zone holds more than its
		 * proportional share of globally allowed dirty pages.
		 * The dirty limits take into account the zone's
		 * lowmem reserves and high watermark so that kswapd
		 * should be able to balance it without having to
		 * write pages from its LRU list.
		 *
		 * This may look like it could increase pressure on
		 * lower zones by failing allocations in higher zones
		 * before they are full.  But the pages that do spill
		 * over are limited as the lower zones are protected
		 * by this very same mechanism.  It should not become
		 * a practical burden to them.
		 *
		 * XXX: For now, allow allocations to potentially
		 * exceed the per-zone dirty limit in the slowpath
		 * (ALLOC_WMARK_LOW unset) before going into reclaim,
		 * which is important when on a NUMA setup the allowed
		 * zones are together not big enough to reach the
		 * global limit.  The proper fix for these situations
		 * will require awareness of zones in the
		 * dirty-throttling and the flusher threads.
		 */
		if ((alloc_flags & ALLOC_WMARK_LOW) &&
		    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
			goto this_zone_full;

		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
			unsigned long mark;
			int ret;

			mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
			if (zone_watermark_ok(zone, order, mark,
				    classzone_idx, alloc_flags))
				goto try_this_zone;

			if (IS_ENABLED(CONFIG_NUMA) &&
					!did_zlc_setup && nr_online_nodes > 1) {
				/*
				 * we do zlc_setup if there are multiple nodes
				 * and before considering the first zone allowed
				 * by the cpuset.
				 */
				allowednodes = zlc_setup(zonelist, alloc_flags);
				zlc_active = 1;
				did_zlc_setup = 1;
			}

			if (zone_reclaim_mode == 0 ||	/* ★ zone_reclaim_mode is 0: skip page reclaim */
			    !zone_allows_reclaim(preferred_zone, zone))
				goto this_zone_full;

			/*
			 * As we may have just activated ZLC, check if the first
			 * eligible zone has failed zone_reclaim recently.
			 */
			if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
				!zlc_zone_worth_trying(zonelist, z, allowednodes))
				continue;

			ret = zone_reclaim(zone, gfp_mask, order);
			switch (ret) {
			case ZONE_RECLAIM_NOSCAN:
				/* did not scan */
				continue;
			case ZONE_RECLAIM_FULL:
				/* scanned but unreclaimable */
				continue;
			default:
				/* did we reclaim enough */
				if (zone_watermark_ok(zone, order, mark,
						classzone_idx, alloc_flags))
					goto try_this_zone;

				/*
				 * Failed to reclaim enough to meet watermark.
				 * Only mark the zone full if checking the min
				 * watermark or if we failed to reclaim just
				 * 1<<order pages or else the page allocator
				 * fastpath will prematurely mark zones full
				 * when the watermark is between the low and
				 * min watermarks.
				 */
				if (((alloc_flags & ALLOC_WMARK_MASK) == ALLOC_WMARK_MIN) ||
				    ret == ZONE_RECLAIM_SOME)
					goto this_zone_full;

				continue;
			}
		}

try_this_zone:
		page = buffered_rmqueue(preferred_zone, zone, order,
						gfp_mask, migratetype);
		if (page)
			break;	/* ★ break out of the loop once a page has been allocated */
this_zone_full:
		if (IS_ENABLED(CONFIG_NUMA))
			zlc_mark_zone_full(zonelist, z);
	}

	if (unlikely(IS_ENABLED(CONFIG_NUMA) && page == NULL && zlc_active)) {
		/* Disable zlc cache for second zonelist scan */
		zlc_active = 0;
		goto zonelist_scan;
	}

	if (page)
		/*
		 * page->pfmemalloc is set when ALLOC_NO_WATERMARKS was
		 * necessary to allocate the page. The expectation is
		 * that the caller is taking steps that will free more
		 * memory. The caller should avoid the page being used
		 * for !PFMEMALLOC purposes.
		 */
		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);

	return page;
}

*1:Confirmed against kernel 3.10, the version used in the RHEL 7 series.