Hi, this is the third revision of the per-zone dirty limits. Changes
from the second version have been mostly documentation, changelog,
and naming fixes based on review feedback:

o add new dirty_balance_reserve instead of abusing totalreserve_pages
  for undirtyable (per-zone) reserves and document the variable and
  its calculation (Mel)
o use !ALLOC_WMARK_LOW instead of adding new ALLOC_SLOWPATH (Mel)
o rename determine_dirtyable_memory -> global_dirtyable_memory (Andrew)
o better explain behaviour on NUMA in changelog (Andrew)
o extend changelogs and code comments on how per-zone dirty limits are
  calculated, and why, and their proportions to the global limit
  (Mel, Andrew)
o kernel-doc zone_dirty_ok() (Andrew)
o extend changelogs and code comments on how per-zone dirty limits are
  used to protect zones from dirty pages (Mel, Andrew)
o revert back to a separate set of zone_dirtyable_memory() and
  zone_dirty_limit() for easier reading (Andrew)

Based on v3.1-rc3-mmotm-2011-08-24-14-08.

 fs/btrfs/file.c           |    2 +-
 include/linux/gfp.h       |    4 +-
 include/linux/mmzone.h    |    6 ++
 include/linux/swap.h      |    1 +
 include/linux/writeback.h |    1 +
 mm/filemap.c              |    5 +-
 mm/page-writeback.c       |  181 +++++++++++++++++++++++++++++++++------------
 mm/page_alloc.c           |   48 ++++++++++++
 8 files changed, 197 insertions(+), 51 deletions(-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Johannes Weiner
2011-Sep-30 07:17 UTC
[patch 1/5] mm: exclude reserved pages from dirtyable memory
The amount of dirtyable pages should not include the full number of
free pages: there is a number of reserved pages that the page
allocator and kswapd always try to keep free.

The closer (reclaimable pages - dirty pages) is to the number of
reserved pages, the more likely it becomes for reclaim to run into
dirty pages:

	+----------+ ---
	|   anon   |  |
	+----------+  |
	|          |  |
	|          |  -- dirty limit new    -- flusher new
	|   file   |  |                     |
	|          |  |                     |
	|          |  -- dirty limit old    -- flusher old
	|          |                        |
	+----------+ ---                    reclaim
	| reserved |
	+----------+
	|  kernel  |
	+----------+

This patch introduces a per-zone dirty reserve that takes both the
lowmem reserve as well as the high watermark of the zone into account,
and a global sum of those per-zone values that is subtracted from the
global amount of dirtyable pages.  The lowmem reserve is unavailable
to page cache allocations and kswapd tries to keep the high watermark
free.  We don't want to end up in a situation where reclaim has to
clean pages in order to balance zones.

Not treating reserved pages as dirtyable on a global level is only a
conceptual fix.  In reality, dirty pages are not distributed equally
across zones and reclaim runs into dirty pages on a regular basis.

But it is important to get this right before tackling the problem on a
per-zone level, where the distance between reclaim and the dirty pages
is mostly much smaller in absolute numbers.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mmzone.h |    6 ++++++
 include/linux/swap.h   |    1 +
 mm/page-writeback.c    |    6 ++++--
 mm/page_alloc.c        |   19 +++++++++++++++++++
 4 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1ed4116..37a61e7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -317,6 +317,12 @@ struct zone {
 	 */
 	unsigned long		lowmem_reserve[MAX_NR_ZONES];
 
+	/*
+	 * This is a per-zone reserve of pages that should not be
+	 * considered dirtyable memory.
+	 */
+	unsigned long		dirty_balance_reserve;
+
 #ifdef CONFIG_NUMA
 	int node;
 	/*
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3808f10..5e70f65 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -209,6 +209,7 @@ struct swap_list_t {
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
+extern unsigned long dirty_balance_reserve;
 extern unsigned int nr_free_buffer_pages(void);
 extern unsigned int nr_free_pagecache_pages(void);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index da6d263..c8acf8a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -170,7 +170,8 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
 			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
 
 		x += zone_page_state(z, NR_FREE_PAGES) +
-		     zone_reclaimable_pages(z);
+		     zone_reclaimable_pages(z) -
+		     zone->dirty_balance_reserve;
 	}
 	/*
 	 * Make sure that the number of highmem pages is never larger
@@ -194,7 +195,8 @@ static unsigned long determine_dirtyable_memory(void)
 {
 	unsigned long x;
 
-	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages() -
+	    dirty_balance_reserve;
 
 	if (!vm_highmem_is_dirtyable)
 		x -= highmem_dirtyable_memory(x);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1dba05e..f8cba89 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -96,6 +96,14 @@ EXPORT_SYMBOL(node_states);
 
 unsigned long totalram_pages __read_mostly;
 unsigned long totalreserve_pages __read_mostly;
+/*
+ * When calculating the number of globally allowed dirty pages, there
+ * is a certain number of per-zone reserves that should not be
+ * considered dirtyable memory.  This is the sum of those reserves
+ * over all existing zones that contribute dirtyable memory.
+ */
+unsigned long dirty_balance_reserve __read_mostly;
+
 int percpu_pagelist_fraction;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 
@@ -5076,8 +5084,19 @@ static void calculate_totalreserve_pages(void)
 			if (max > zone->present_pages)
 				max = zone->present_pages;
 			reserve_pages += max;
+			/*
+			 * Lowmem reserves are not available to
+			 * GFP_HIGHUSER page cache allocations and
+			 * kswapd tries to balance zones to their high
+			 * watermark.  As a result, neither should be
+			 * regarded as dirtyable memory, to prevent a
+			 * situation where reclaim has to clean pages
+			 * in order to balance the zones.
+			 */
+			zone->dirty_balance_reserve = max;
 		}
 	}
+	dirty_balance_reserve = reserve_pages;
 	totalreserve_pages = reserve_pages;
 }
 
-- 
1.7.6.2
Johannes Weiner
2011-Sep-30 07:17 UTC
[patch 2/5] mm: writeback: cleanups in preparation for per-zone dirty limits
The next patch will introduce per-zone dirty limiting functions in
addition to the traditional global dirty limiting.

Rename determine_dirtyable_memory() to global_dirtyable_memory()
before adding the zone-specific version, and fix up its documentation.

Also, move the functions to determine the dirtyable memory and the
function to calculate the dirty limit based on that together so that
their relationship is more apparent and that they can be commented on
as a group.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mel@suse.de>
---
 mm/page-writeback.c |   92 +++++++++++++++++++++++++-------------------------
 1 files changed, 46 insertions(+), 46 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index c8acf8a..78604a6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -186,12 +186,12 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
 }
 
 /**
- * determine_dirtyable_memory - amount of memory that may be used
+ * global_dirtyable_memory - number of globally dirtyable pages
  *
- * Returns the numebr of pages that can currently be freed and used
- * by the kernel for direct mappings.
+ * Returns the global number of pages potentially available for dirty
+ * page cache.  This is the base value for the global dirty limits.
  */
-static unsigned long determine_dirtyable_memory(void)
+static unsigned long global_dirtyable_memory(void)
 {
 	unsigned long x;
 
@@ -205,6 +205,47 @@ static unsigned long determine_dirtyable_memory(void)
 }
 
 /*
+ * global_dirty_limits - background-writeback and dirty-throttling thresholds
+ *
+ * Calculate the dirty thresholds based on sysctl parameters
+ * - vm.dirty_background_ratio  or  vm.dirty_background_bytes
+ * - vm.dirty_ratio             or  vm.dirty_bytes
+ * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
+ * real-time tasks.
+ */
+void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
+{
+	unsigned long background;
+	unsigned long dirty;
+	unsigned long uninitialized_var(available_memory);
+	struct task_struct *tsk;
+
+	if (!vm_dirty_bytes || !dirty_background_bytes)
+		available_memory = global_dirtyable_memory();
+
+	if (vm_dirty_bytes)
+		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+	else
+		dirty = (vm_dirty_ratio * available_memory) / 100;
+
+	if (dirty_background_bytes)
+		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+	else
+		background = (dirty_background_ratio * available_memory) / 100;
+
+	if (background >= dirty)
+		background = dirty / 2;
+	tsk = current;
+	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+		background += background / 4;
+		dirty += dirty / 4;
+	}
+	*pbackground = background;
+	*pdirty = dirty;
+	trace_global_dirty_state(background, dirty);
+}
+
+/*
  * couple the period to the dirty_ratio:
  *
  * period/2 ~ roundup_pow_of_two(dirty limit)
@@ -216,7 +257,7 @@ static int calc_period_shift(void)
 	if (vm_dirty_bytes)
 		dirty_total = vm_dirty_bytes / PAGE_SIZE;
 	else
-		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
+		dirty_total = (vm_dirty_ratio * global_dirtyable_memory()) /
 				100;
 	return 2 + ilog2(dirty_total - 1);
 }
@@ -416,47 +457,6 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
 	return max(thresh, global_dirty_limit);
 }
 
-/*
- * global_dirty_limits - background-writeback and dirty-throttling thresholds
- *
- * Calculate the dirty thresholds based on sysctl parameters
- * - vm.dirty_background_ratio  or  vm.dirty_background_bytes
- * - vm.dirty_ratio             or  vm.dirty_bytes
- * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
- * real-time tasks.
- */
-void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
-{
-	unsigned long background;
-	unsigned long dirty;
-	unsigned long uninitialized_var(available_memory);
-	struct task_struct *tsk;
-
-	if (!vm_dirty_bytes || !dirty_background_bytes)
-		available_memory = determine_dirtyable_memory();
-
-	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
-	else
-		dirty = (vm_dirty_ratio * available_memory) / 100;
-
-	if (dirty_background_bytes)
-		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
-	else
-		background = (dirty_background_ratio * available_memory) / 100;
-
-	if (background >= dirty)
-		background = dirty / 2;
-	tsk = current;
-	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
-		background += background / 4;
-		dirty += dirty / 4;
-	}
-	*pbackground = background;
-	*pdirty = dirty;
-	trace_global_dirty_state(background, dirty);
-}
-
 /**
  * bdi_dirty_limit - @bdi's share of dirty throttling threshold
  * @bdi: the backing_dev_info to query
-- 
1.7.6.2
Johannes Weiner
2011-Sep-30 07:17 UTC
[patch 3/5] mm: try to distribute dirty pages fairly across zones
The maximum number of dirty pages that exist in the system at any time
is determined by a number of pages considered dirtyable and a
user-configured percentage of those, or an absolute number in bytes.

This number of dirtyable pages is the sum of memory provided by all
the zones in the system minus their lowmem reserves and high
watermarks, so that the system can retain a healthy number of free
pages without having to reclaim dirty pages.

But there is a flaw in that we have a zoned page allocator which does
not care about the global state but rather the state of individual
memory zones.  And right now there is nothing that prevents one zone
from filling up with dirty pages while other zones are spared, which
frequently leads to situations where kswapd, in order to restore the
watermark of free pages, does indeed have to write pages from that
zone's LRU list.  This can interfere so badly with IO from the flusher
threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
requests from reclaim already, taking away the VM's only possibility
to keep such a zone balanced, aside from hoping the flushers will soon
clean pages from that zone.

Enter per-zone dirty limits.  They are to a zone's dirtyable memory
what the global limit is to the global amount of dirtyable memory, and
try to make sure that no single zone receives more than its fair share
of the globally allowed dirty pages in the first place.  As the number
of pages considered dirtyable excludes the zones' lowmem reserves and
high watermarks, the maximum number of dirty pages in a zone is such
that the zone can always be balanced without requiring page cleaning.

As this is a placement decision in the page allocator and pages are
dirtied only after the allocation, this patch allows allocators to
pass __GFP_WRITE when they know in advance that the page will be
written to and become dirty soon.
The page allocator will then attempt to allocate from the first zone
of the zonelist - which on NUMA is determined by the task's NUMA
memory policy - that has not exceeded its dirty limit.

At first glance, it would appear that the diversion to lower zones can
increase pressure on them, but this is not the case.  With a full high
zone, allocations will be diverted to lower zones eventually, so it is
more of a shift in timing of the lower zone allocations.  Workloads
that previously could fit their dirty pages completely in the higher
zone may be forced to allocate from lower zones, but the amount of
pages that 'spill over' are limited themselves by the lower zones'
dirty constraints, and thus unlikely to become a problem.

For now, the problem of unfair dirty page distribution remains for
NUMA configurations where the zones allowed for allocation are in sum
not big enough to trigger the global dirty limits, wake up the flusher
threads and remedy the situation.  Because of this, an allocation that
could not succeed on any of the considered zones is allowed to ignore
the dirty limits before going into direct reclaim or even failing the
allocation, until a future patch changes the global dirty throttling
and flusher thread activation so that they take individual zone states
into account.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/gfp.h       |    4 ++-
 include/linux/writeback.h |    1 +
 mm/page-writeback.c       |   83 +++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c           |   29 ++++++++++++++++
 4 files changed, 116 insertions(+), 1 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..50efc7e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -36,6 +36,7 @@ struct vm_area_struct;
 #endif
 #define ___GFP_NO_KSWAPD	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
+#define ___GFP_WRITE		0x1000000u
 
 /*
  * GFP bitmasks..
@@ -85,6 +86,7 @@ struct vm_area_struct;
 #define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
+#define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
 
 /*
  * This may seem redundant, but it's a way of annotating false positives vs.
@@ -92,7 +94,7 @@ struct vm_area_struct;
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
-#define __GFP_BITS_SHIFT 24	/* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a5f495f..c96ee0c 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -104,6 +104,7 @@ void laptop_mode_timer_fn(unsigned long data);
 static inline void laptop_sync_completion(void) { }
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
+bool zone_dirty_ok(struct zone *zone);
 
 extern unsigned long global_dirty_limit;
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 78604a6..f60fd57 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -159,6 +159,25 @@ static struct prop_descriptor vm_dirties;
  * We make sure that the background writeout level is below the adjusted
  * clamping level.
  */
+
+/*
+ * In a memory zone, there is a certain amount of pages we consider
+ * available for the page cache, which is essentially the number of
+ * free and reclaimable pages, minus some zone reserves to protect
+ * lowmem and the ability to uphold the zone's watermarks without
+ * requiring writeback.
+ *
+ * This number of dirtyable pages is the base value of which the
+ * user-configurable dirty ratio is the effective number of pages that
+ * are allowed to be actually dirtied.  Per individual zone, or
+ * globally by using the sum of dirtyable pages over all zones.
+ *
+ * Because the user is allowed to specify the dirty limit globally as
+ * absolute number of bytes, calculating the per-zone dirty limit can
+ * require translating the configured limit into a percentage of
+ * global dirtyable memory first.
+ */
+
 static unsigned long highmem_dirtyable_memory(unsigned long total)
 {
 #ifdef CONFIG_HIGHMEM
@@ -245,6 +264,70 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 	trace_global_dirty_state(background, dirty);
 }
 
+/**
+ * zone_dirtyable_memory - number of dirtyable pages in a zone
+ * @zone: the zone
+ *
+ * Returns the zone's number of pages potentially available for dirty
+ * page cache.  This is the base value for the per-zone dirty limits.
+ */
+static unsigned long zone_dirtyable_memory(struct zone *zone)
+{
+	/*
+	 * The effective global number of dirtyable pages may exclude
+	 * highmem as a big-picture measure to keep the ratio between
+	 * dirty memory and lowmem reasonable.
+	 *
+	 * But this function is purely about the individual zone and a
+	 * highmem zone can hold its share of dirty pages, so we don't
+	 * care about vm_highmem_is_dirtyable here.
+	 */
+	return zone_page_state(zone, NR_FREE_PAGES) +
+		zone_reclaimable_pages(zone) -
+		zone->dirty_balance_reserve;
+}
+
+/**
+ * zone_dirty_limit - maximum number of dirty pages allowed in a zone
+ * @zone: the zone
+ *
+ * Returns the maximum number of dirty pages allowed in a zone, based
+ * on the zone's dirtyable memory.
+ */
+static unsigned long zone_dirty_limit(struct zone *zone)
+{
+	unsigned long zone_memory = zone_dirtyable_memory(zone);
+	struct task_struct *tsk = current;
+	unsigned long dirty;
+
+	if (vm_dirty_bytes)
+		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE) *
+			zone_memory / global_dirtyable_memory();
+	else
+		dirty = vm_dirty_ratio * zone_memory / 100;
+
+	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
+		dirty += dirty / 4;
+
+	return dirty;
+}
+
+/**
+ * zone_dirty_ok - tells whether a zone is within its dirty limits
+ * @zone: the zone to check
+ *
+ * Returns %true when the dirty pages in @zone are within the zone's
+ * dirty limit, %false if the limit is exceeded.
+ */
+bool zone_dirty_ok(struct zone *zone)
+{
+	unsigned long limit = zone_dirty_limit(zone);
+
+	return zone_page_state(zone, NR_FILE_DIRTY) +
+		zone_page_state(zone, NR_UNSTABLE_NFS) +
+		zone_page_state(zone, NR_WRITEBACK) <= limit;
+}
+
 /*
  * couple the period to the dirty_ratio:
  *
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f8cba89..afaf59e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1675,6 +1675,35 @@ zonelist_scan:
 		if ((alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				continue;
+		/*
+		 * When allocating a page cache page for writing, we
+		 * want to get it from a zone that is within its dirty
+		 * limit, such that no single zone holds more than its
+		 * proportional share of globally allowed dirty pages.
+		 * The dirty limits take into account the zone's
+		 * lowmem reserves and high watermark so that kswapd
+		 * should be able to balance it without having to
+		 * write pages from its LRU list.
+		 *
+		 * This may look like it could increase pressure on
+		 * lower zones by failing allocations in higher zones
+		 * before they are full.  But the pages that do spill
+		 * over are limited as the lower zones are protected
+		 * by this very same mechanism.  It should not become
+		 * a practical burden to them.
+		 *
+		 * XXX: For now, allow allocations to potentially
+		 * exceed the per-zone dirty limit in the slowpath
+		 * (ALLOC_WMARK_LOW unset) before going into reclaim,
+		 * which is important when on a NUMA setup the allowed
+		 * zones are together not big enough to reach the
+		 * global limit.  The proper fix for these situations
+		 * will require awareness of zones in the
+		 * dirty-throttling and the flusher threads.
+		 */
+		if ((alloc_flags & ALLOC_WMARK_LOW) &&
+		    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
+			goto this_zone_full;
 
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
-- 
1.7.6.2
Johannes Weiner
2011-Sep-30 07:17 UTC
[patch 4/5] mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin()
Tell the page allocator that pages allocated through
grab_cache_page_write_begin() are expected to become dirty soon.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 mm/filemap.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 645a080..cf0352d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2349,8 +2349,11 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 					pgoff_t index, unsigned flags)
 {
 	int status;
+	gfp_t gfp_mask;
 	struct page *page;
 	gfp_t gfp_notmask = 0;
+
+	gfp_mask = mapping_gfp_mask(mapping) | __GFP_WRITE;
 	if (flags & AOP_FLAG_NOFS)
 		gfp_notmask = __GFP_FS;
 repeat:
@@ -2358,7 +2361,7 @@ repeat:
 	if (page)
 		goto found;
 
-	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask);
+	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
 	if (!page)
 		return NULL;
 	status = add_to_page_cache_lru(page, mapping, index,
-- 
1.7.6.2
Johannes Weiner
2011-Sep-30 07:17 UTC
[patch 5/5] Btrfs: pass __GFP_WRITE for buffered write page allocations
Tell the page allocator that pages allocated for a buffered write are
expected to become dirty soon.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 fs/btrfs/file.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index e7872e4..ea1b892 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1084,7 +1084,7 @@ static noinline int prepare_pages(struct btrfs_root *root, struct file *file,
 again:
 	for (i = 0; i < num_pages; i++) {
 		pages[i] = find_or_create_page(inode->i_mapping, index + i,
-					       GFP_NOFS);
+					       GFP_NOFS | __GFP_WRITE);
 		if (!pages[i]) {
 			faili = i - 1;
 			err = -ENOMEM;
-- 
1.7.6.2
Pekka Enberg
2011-Sep-30 07:35 UTC
Re: [patch 3/5] mm: try to distribute dirty pages fairly across zones
Hi Johannes!

On Fri, Sep 30, 2011 at 10:17 AM, Johannes Weiner <jweiner@redhat.com> wrote:
> But there is a flaw in that we have a zoned page allocator which does
> not care about the global state but rather the state of individual
> memory zones.  And right now there is nothing that prevents one zone
> from filling up with dirty pages while other zones are spared, which
> frequently leads to situations where kswapd, in order to restore the
> watermark of free pages, does indeed have to write pages from that
> zone's LRU list.  This can interfere so badly with IO from the flusher
> threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
> requests from reclaim already, taking away the VM's only possibility
> to keep such a zone balanced, aside from hoping the flushers will soon
> clean pages from that zone.

The obvious question is: how did you test this? Can you share the results?

			Pekka
Johannes Weiner
2011-Sep-30 08:55 UTC
Re: [patch 3/5] mm: try to distribute dirty pages fairly across zones
On Fri, Sep 30, 2011 at 10:35:25AM +0300, Pekka Enberg wrote:
> Hi Johannes!
>
> On Fri, Sep 30, 2011 at 10:17 AM, Johannes Weiner <jweiner@redhat.com> wrote:
> > But there is a flaw in that we have a zoned page allocator which does
> > not care about the global state but rather the state of individual
> > memory zones.  And right now there is nothing that prevents one zone
> > from filling up with dirty pages while other zones are spared, which
> > frequently leads to situations where kswapd, in order to restore the
> > watermark of free pages, does indeed have to write pages from that
> > zone's LRU list.  This can interfere so badly with IO from the flusher
> > threads that major filesystems (btrfs, xfs, ext4) mostly ignore write
> > requests from reclaim already, taking away the VM's only possibility
> > to keep such a zone balanced, aside from hoping the flushers will soon
> > clean pages from that zone.
>
> The obvious question is: how did you test this? Can you share the results?

Meh, sorry about that, they were in the series introduction the last
time and I forgot to copy them over.

I did single-threaded, linear writing to a USB stick as the effect is
most pronounced with slow backing devices.

[ The write deferring on ext4 because of delalloc is so extreme that I
  could trigger it even with simple linear writers on a mediocre
  rotating disk, though.  I can not access the logfiles right now, but
  the nr_vmscan_writes went practically away here as well and runtime
  was unaffected with the patched kernel. ]

Test results:

15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
40% dirty ratio, 10% background ratio
16G USB thumb drive
10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

		seconds		nr_vmscan_write
		(stddev)	      min|	median|	        max
xfs
vanilla:	 549.747( 3.492)	0.000|	     0.000|	      0.000
patched:	 550.996( 3.802)	0.000|	     0.000|	      0.000

fuse-ntfs
vanilla:	1183.094(53.178)    54349.000|	 59341.000|	  65163.000
patched:	 558.049(17.914)	0.000|	     0.000|	     43.000

btrfs
vanilla:	 573.679(14.015)   156657.000|	460178.000|	 606926.000
patched:	 563.365(11.368)	0.000|	     0.000|	   1362.000

ext4
vanilla:	 561.197(15.782)	0.000|	2725438.000|	4143837.000
patched:	 568.806(17.496)	0.000|	     0.000|	      0.000
Michal Hocko
2011-Sep-30 13:53 UTC
Re: [patch 1/5] mm: exclude reserved pages from dirtyable memory
On Fri 30-09-11 09:17:20, Johannes Weiner wrote:
> The amount of dirtyable pages should not include the full number of
> free pages: there is a number of reserved pages that the page
> allocator and kswapd always try to keep free.
>
> The closer (reclaimable pages - dirty pages) is to the number of
> reserved pages, the more likely it becomes for reclaim to run into
> dirty pages:
>
> 	+----------+ ---
> 	|   anon   |  |
> 	+----------+  |
> 	|          |  |
> 	|          |  -- dirty limit new    -- flusher new
> 	|   file   |  |                     |
> 	|          |  |                     |
> 	|          |  -- dirty limit old    -- flusher old
> 	|          |                        |
> 	+----------+ ---                    reclaim
> 	| reserved |
> 	+----------+
> 	|  kernel  |
> 	+----------+
>
> This patch introduces a per-zone dirty reserve that takes both the
> lowmem reserve as well as the high watermark of the zone into account,
> and a global sum of those per-zone values that is subtracted from the
> global amount of dirtyable pages.  The lowmem reserve is unavailable
> to page cache allocations and kswapd tries to keep the high watermark
> free.  We don't want to end up in a situation where reclaim has to
> clean pages in order to balance zones.
>
> Not treating reserved pages as dirtyable on a global level is only a
> conceptual fix.  In reality, dirty pages are not distributed equally
> across zones and reclaim runs into dirty pages on a regular basis.
>
> But it is important to get this right before tackling the problem on a
> per-zone level, where the distance between reclaim and the dirty pages
> is mostly much smaller in absolute numbers.
>
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> Reviewed-by: Rik van Riel <riel@redhat.com>

Makes sense.
Reviewed-by: Michal Hocko <mhocko@suse.cz>> --- > include/linux/mmzone.h | 6 ++++++ > include/linux/swap.h | 1 + > mm/page-writeback.c | 6 ++++-- > mm/page_alloc.c | 19 +++++++++++++++++++ > 4 files changed, 30 insertions(+), 2 deletions(-) > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 1ed4116..37a61e7 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -317,6 +317,12 @@ struct zone { > */ > unsigned long lowmem_reserve[MAX_NR_ZONES]; > > + /* > + * This is a per-zone reserve of pages that should not be > + * considered dirtyable memory. > + */ > + unsigned long dirty_balance_reserve; > + > #ifdef CONFIG_NUMA > int node; > /* > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 3808f10..5e70f65 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -209,6 +209,7 @@ struct swap_list_t { > /* linux/mm/page_alloc.c */ > extern unsigned long totalram_pages; > extern unsigned long totalreserve_pages; > +extern unsigned long dirty_balance_reserve; > extern unsigned int nr_free_buffer_pages(void); > extern unsigned int nr_free_pagecache_pages(void); > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index da6d263..c8acf8a 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -170,7 +170,8 @@ static unsigned long highmem_dirtyable_memory(unsigned long total) > &NODE_DATA(node)->node_zones[ZONE_HIGHMEM]; > > x += zone_page_state(z, NR_FREE_PAGES) + > - zone_reclaimable_pages(z); > + zone_reclaimable_pages(z) - > + zone->dirty_balance_reserve; > } > /* > * Make sure that the number of highmem pages is never larger > @@ -194,7 +195,8 @@ static unsigned long determine_dirtyable_memory(void) > { > unsigned long x; > > - x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages(); > + x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages() - > + dirty_balance_reserve; > > if (!vm_highmem_is_dirtyable) > x -= highmem_dirtyable_memory(x); > diff --git 
a/mm/page_alloc.c b/mm/page_alloc.c > index 1dba05e..f8cba89 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -96,6 +96,14 @@ EXPORT_SYMBOL(node_states); > > unsigned long totalram_pages __read_mostly; > unsigned long totalreserve_pages __read_mostly; > +/* > + * When calculating the number of globally allowed dirty pages, there > + * is a certain number of per-zone reserves that should not be > + * considered dirtyable memory. This is the sum of those reserves > + * over all existing zones that contribute dirtyable memory. > + */ > +unsigned long dirty_balance_reserve __read_mostly; > + > int percpu_pagelist_fraction; > gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK; > > @@ -5076,8 +5084,19 @@ static void calculate_totalreserve_pages(void) > if (max > zone->present_pages) > max = zone->present_pages; > reserve_pages += max; > + /* > + * Lowmem reserves are not available to > + * GFP_HIGHUSER page cache allocations and > + * kswapd tries to balance zones to their high > + * watermark. As a result, neither should be > + * regarded as dirtyable memory, to prevent a > + * situation where reclaim has to clean pages > + * in order to balance the zones. > + */ > + zone->dirty_balance_reserve = max; > } > } > + dirty_balance_reserve = reserve_pages; > totalreserve_pages = reserve_pages; > } > > -- > 1.7.6.2 > > -- > To unsubscribe, send a message with ''unsubscribe linux-mm'' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ > Don''t email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>-- Michal Hocko SUSE Labs SUSE LINUX s.r.o. Lihovarska 1060/12 190 00 Praha 9 Czech Republic -- To unsubscribe, send a message with ''unsubscribe linux-mm'' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
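The reserve arithmetic that patch 1/5 adds to calculate_totalreserve_pages() can be sketched in plain userspace C. This is only an illustration of the changelog's description (largest lowmem reserve plus high watermark, clamped to the zone's present pages, summed into dirty_balance_reserve); the struct and field names here are made up, not the kernel's.

```c
#include <assert.h>

/* Toy zone snapshot -- not the kernel's struct zone, just the fields
 * the reserve calculation needs (names are illustrative). */
struct zone_snap {
	unsigned long lowmem_reserve_max;	/* largest lowmem_reserve[] entry */
	unsigned long high_wmark;		/* high watermark, in pages */
	unsigned long present_pages;
};

/*
 * Per-zone dirty reserve as the patch derives it: the biggest lowmem
 * reserve plus the high watermark, clamped to the pages the zone
 * actually has.
 */
static unsigned long zone_dirty_reserve(const struct zone_snap *z)
{
	unsigned long max = z->lowmem_reserve_max + z->high_wmark;

	if (max > z->present_pages)
		max = z->present_pages;
	return max;
}

/* Global reserve: the sum of the per-zone values, which the patch
 * then subtracts from the globally dirtyable pages. */
static unsigned long total_dirty_reserve(const struct zone_snap *zones, int n)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i < n; i++)
		sum += zone_dirty_reserve(&zones[i]);
	return sum;
}
```

The clamp matters for small zones: a zone whose reserve would exceed its present pages contributes at most everything it has, so the global sum never overstates what is actually unavailable for dirty page cache.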
Michal Hocko
2011-Sep-30 13:56 UTC
Re: [patch 2/5] mm: writeback: cleanups in preparation for per-zone dirty limits
On Fri 30-09-11 09:17:21, Johannes Weiner wrote:> The next patch will introduce per-zone dirty limiting functions in > addition to the traditional global dirty limiting. > > Rename determine_dirtyable_memory() to global_dirtyable_memory() > before adding the zone-specific version, and fix up its documentation. > > Also, move the functions to determine the dirtyable memory and the > function to calculate the dirty limit based on that together so that > their relationship is more apparent and that they can be commented on > as a group. > > Signed-off-by: Johannes Weiner <jweiner@redhat.com> > Reviewed-by: Minchan Kim <minchan.kim@gmail.com> > Acked-by: Mel Gorman <mel@suse.de>Reviewed-by: Michal Hocko <mhocko@suse.cz>> --- > mm/page-writeback.c | 92 +++++++++++++++++++++++++------------------------- > 1 files changed, 46 insertions(+), 46 deletions(-) > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index c8acf8a..78604a6 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -186,12 +186,12 @@ static unsigned long highmem_dirtyable_memory(unsigned long total) > } > > /** > - * determine_dirtyable_memory - amount of memory that may be used > + * global_dirtyable_memory - number of globally dirtyable pages > * > - * Returns the numebr of pages that can currently be freed and used > - * by the kernel for direct mappings. > + * Returns the global number of pages potentially available for dirty > + * page cache. This is the base value for the global dirty limits. 
> */ > -static unsigned long determine_dirtyable_memory(void) > +static unsigned long global_dirtyable_memory(void) > { > unsigned long x; > > @@ -205,6 +205,47 @@ static unsigned long determine_dirtyable_memory(void) > } > > /* > + * global_dirty_limits - background-writeback and dirty-throttling thresholds > + * > + * Calculate the dirty thresholds based on sysctl parameters > + * - vm.dirty_background_ratio or vm.dirty_background_bytes > + * - vm.dirty_ratio or vm.dirty_bytes > + * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and > + * real-time tasks. > + */ > +void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty) > +{ > + unsigned long background; > + unsigned long dirty; > + unsigned long uninitialized_var(available_memory); > + struct task_struct *tsk; > + > + if (!vm_dirty_bytes || !dirty_background_bytes) > + available_memory = global_dirtyable_memory(); > + > + if (vm_dirty_bytes) > + dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE); > + else > + dirty = (vm_dirty_ratio * available_memory) / 100; > + > + if (dirty_background_bytes) > + background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE); > + else > + background = (dirty_background_ratio * available_memory) / 100; > + > + if (background >= dirty) > + background = dirty / 2; > + tsk = current; > + if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { > + background += background / 4; > + dirty += dirty / 4; > + } > + *pbackground = background; > + *pdirty = dirty; > + trace_global_dirty_state(background, dirty); > +} > + > +/* > * couple the period to the dirty_ratio: > * > * period/2 ~ roundup_pow_of_two(dirty limit) > @@ -216,7 +257,7 @@ static int calc_period_shift(void) > if (vm_dirty_bytes) > dirty_total = vm_dirty_bytes / PAGE_SIZE; > else > - dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / > + dirty_total = (vm_dirty_ratio * global_dirtyable_memory()) / > 100; > return 2 + ilog2(dirty_total - 1); > } > @@ -416,47 +457,6 @@ static 
unsigned long hard_dirty_limit(unsigned long thresh) > return max(thresh, global_dirty_limit); > } > > -/* > - * global_dirty_limits - background-writeback and dirty-throttling thresholds > - * > - * Calculate the dirty thresholds based on sysctl parameters > - * - vm.dirty_background_ratio or vm.dirty_background_bytes > - * - vm.dirty_ratio or vm.dirty_bytes > - * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and > - * real-time tasks. > - */ > -void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty) > -{ > - unsigned long background; > - unsigned long dirty; > - unsigned long uninitialized_var(available_memory); > - struct task_struct *tsk; > - > - if (!vm_dirty_bytes || !dirty_background_bytes) > - available_memory = determine_dirtyable_memory(); > - > - if (vm_dirty_bytes) > - dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE); > - else > - dirty = (vm_dirty_ratio * available_memory) / 100; > - > - if (dirty_background_bytes) > - background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE); > - else > - background = (dirty_background_ratio * available_memory) / 100; > - > - if (background >= dirty) > - background = dirty / 2; > - tsk = current; > - if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { > - background += background / 4; > - dirty += dirty / 4; > - } > - *pbackground = background; > - *pdirty = dirty; > - trace_global_dirty_state(background, dirty); > -} > - > /** > * bdi_dirty_limit - @bdi''s share of dirty throttling threshold > * @bdi: the backing_dev_info to query > -- > 1.7.6.2 > > -- > To unsubscribe, send a message with ''unsubscribe linux-mm'' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ > Don''t email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>-- Michal Hocko SUSE Labs SUSE LINUX s.r.o. 
Lihovarska 1060/12 190 00 Praha 9 Czech Republic
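The threshold arithmetic in global_dirty_limits(), which patch 2/5 moves next to global_dirtyable_memory(), can be sketched in userspace C. A sketch under stated assumptions: the sysctls become plain parameters, a `*_bytes` value of 0 means "fall back to the ratio", and the 1/4 lift for PF_LESS_THROTTLE/rt tasks is reduced to a flag.

```c
#include <assert.h>

#define PAGE_SIZE 4096UL
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Userspace sketch of global_dirty_limits(); all sizes are in pages
 * except the *_bytes inputs, as with the real sysctls. */
static void dirty_limits(unsigned long available_memory,	/* pages */
			 unsigned long vm_dirty_bytes, int vm_dirty_ratio,
			 unsigned long dirty_background_bytes,
			 int dirty_background_ratio, int less_throttle,
			 unsigned long *pbackground, unsigned long *pdirty)
{
	unsigned long background, dirty;

	if (vm_dirty_bytes)
		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
	else
		dirty = (unsigned long)vm_dirty_ratio * available_memory / 100;

	if (dirty_background_bytes)
		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
	else
		background = (unsigned long)dirty_background_ratio *
			     available_memory / 100;

	/* Background writeback must kick in before throttling does. */
	if (background >= dirty)
		background = dirty / 2;
	if (less_throttle) {		/* PF_LESS_THROTTLE or rt task */
		background += background / 4;
		dirty += dirty / 4;
	}
	*pbackground = background;
	*pdirty = dirty;
}
```

Note that the background clamp runs even when the two limits come from different sources (one in bytes, one as a ratio), which is why mixing the sysctls cannot invert the thresholds.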
Michal Hocko
2011-Sep-30 14:28 UTC
Re: [patch 3/5] mm: try to distribute dirty pages fairly across zones
On Fri 30-09-11 09:17:22, Johannes Weiner wrote:> The maximum number of dirty pages that exist in the system at any time > is determined by a number of pages considered dirtyable and a > user-configured percentage of those, or an absolute number in bytes. > > This number of dirtyable pages is the sum of memory provided by all > the zones in the system minus their lowmem reserves and high > watermarks, so that the system can retain a healthy number of free > pages without having to reclaim dirty pages. > > But there is a flaw in that we have a zoned page allocator which does > not care about the global state but rather the state of individual > memory zones. And right now there is nothing that prevents one zone > from filling up with dirty pages while other zones are spared, which > frequently leads to situations where kswapd, in order to restore the > watermark of free pages, does indeed have to write pages from that > zone''s LRU list. This can interfere so badly with IO from the flusher > threads that major filesystems (btrfs, xfs, ext4) mostly ignore write > requests from reclaim already, taking away the VM''s only possibility > to keep such a zone balanced, aside from hoping the flushers will soon > clean pages from that zone. > > Enter per-zone dirty limits. They are to a zone''s dirtyable memory > what the global limit is to the global amount of dirtyable memory, and > try to make sure that no single zone receives more than its fair share > of the globally allowed dirty pages in the first place. As the number > of pages considered dirtyable exclude the zones'' lowmem reserves and > high watermarks, the maximum number of dirty pages in a zone is such > that the zone can always be balanced without requiring page cleaning. > > As this is a placement decision in the page allocator and pages are > dirtied only after the allocation, this patch allows allocators to > pass __GFP_WRITE when they know in advance that the page will be > written to and become dirty soon. 
The page allocator will then > attempt to allocate from the first zone of the zonelist - which on > NUMA is determined by the task''s NUMA memory policy - that has not > exceeded its dirty limit. > > At first glance, it would appear that the diversion to lower zones can > increase pressure on them, but this is not the case. With a full high > zone, allocations will be diverted to lower zones eventually, so it is > more of a shift in timing of the lower zone allocations. Workloads > that previously could fit their dirty pages completely in the higher > zone may be forced to allocate from lower zones, but the amount of > pages that ''spill over'' are limited themselves by the lower zones'' > dirty constraints, and thus unlikely to become a problem. > > For now, the problem of unfair dirty page distribution remains for > NUMA configurations where the zones allowed for allocation are in sum > not big enough to trigger the global dirty limits, wake up the flusher > threads and remedy the situation. Because of this, an allocation that > could not succeed on any of the considered zones is allowed to ignore > the dirty limits before going into direct reclaim or even failing the > allocation, until a future patch changes the global dirty throttling > and flusher thread activation so that they take individual zone states > into account. 
> > Signed-off-by: Johannes Weiner <jweiner@redhat.com> > Reviewed-by: Minchan Kim <minchan.kim@gmail.com> > Acked-by: Mel Gorman <mgorman@suse.de>Nice Reviewed-by: Michal Hocko <mhocko@suse.cz>> --- > include/linux/gfp.h | 4 ++- > include/linux/writeback.h | 1 + > mm/page-writeback.c | 83 +++++++++++++++++++++++++++++++++++++++++++++ > mm/page_alloc.c | 29 ++++++++++++++++ > 4 files changed, 116 insertions(+), 1 deletions(-) > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 3a76faf..50efc7e 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -36,6 +36,7 @@ struct vm_area_struct; > #endif > #define ___GFP_NO_KSWAPD 0x400000u > #define ___GFP_OTHER_NODE 0x800000u > +#define ___GFP_WRITE 0x1000000u > > /* > * GFP bitmasks.. > @@ -85,6 +86,7 @@ struct vm_area_struct; > > #define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD) > #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */ > +#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */ > > /* > * This may seem redundant, but it''s a way of annotating false positives vs. 
> @@ -92,7 +94,7 @@ struct vm_area_struct; > */ > #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) > > -#define __GFP_BITS_SHIFT 24 /* Room for N __GFP_FOO bits */ > +#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */ > #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) > > /* This equals 0, but use constants in case they ever change */ > diff --git a/include/linux/writeback.h b/include/linux/writeback.h > index a5f495f..c96ee0c 100644 > --- a/include/linux/writeback.h > +++ b/include/linux/writeback.h > @@ -104,6 +104,7 @@ void laptop_mode_timer_fn(unsigned long data); > static inline void laptop_sync_completion(void) { } > #endif > void throttle_vm_writeout(gfp_t gfp_mask); > +bool zone_dirty_ok(struct zone *zone); > > extern unsigned long global_dirty_limit; > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 78604a6..f60fd57 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -159,6 +159,25 @@ static struct prop_descriptor vm_dirties; > * We make sure that the background writeout level is below the adjusted > * clamping level. > */ > + > +/* > + * In a memory zone, there is a certain amount of pages we consider > + * available for the page cache, which is essentially the number of > + * free and reclaimable pages, minus some zone reserves to protect > + * lowmem and the ability to uphold the zone''s watermarks without > + * requiring writeback. > + * > + * This number of dirtyable pages is the base value of which the > + * user-configurable dirty ratio is the effictive number of pages that > + * are allowed to be actually dirtied. Per individual zone, or > + * globally by using the sum of dirtyable pages over all zones. > + * > + * Because the user is allowed to specify the dirty limit globally as > + * absolute number of bytes, calculating the per-zone dirty limit can > + * require translating the configured limit into a percentage of > + * global dirtyable memory first. 
> + */ > + > static unsigned long highmem_dirtyable_memory(unsigned long total) > { > #ifdef CONFIG_HIGHMEM > @@ -245,6 +264,70 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty) > trace_global_dirty_state(background, dirty); > } > > +/** > + * zone_dirtyable_memory - number of dirtyable pages in a zone > + * @zone: the zone > + * > + * Returns the zone''s number of pages potentially available for dirty > + * page cache. This is the base value for the per-zone dirty limits. > + */ > +static unsigned long zone_dirtyable_memory(struct zone *zone) > +{ > + /* > + * The effective global number of dirtyable pages may exclude > + * highmem as a big-picture measure to keep the ratio between > + * dirty memory and lowmem reasonable. > + * > + * But this function is purely about the individual zone and a > + * highmem zone can hold its share of dirty pages, so we don''t > + * care about vm_highmem_is_dirtyable here. > + */ > + return zone_page_state(zone, NR_FREE_PAGES) + > + zone_reclaimable_pages(zone) - > + zone->dirty_balance_reserve; > +} > + > +/** > + * zone_dirty_limit - maximum number of dirty pages allowed in a zone > + * @zone: the zone > + * > + * Returns the maximum number of dirty pages allowed in a zone, based > + * on the zone''s dirtyable memory. 
> + */ > +static unsigned long zone_dirty_limit(struct zone *zone) > +{ > + unsigned long zone_memory = zone_dirtyable_memory(zone); > + struct task_struct *tsk = current; > + unsigned long dirty; > + > + if (vm_dirty_bytes) > + dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE) * > + zone_memory / global_dirtyable_memory(); > + else > + dirty = vm_dirty_ratio * zone_memory / 100; > + > + if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) > + dirty += dirty / 4; > + > + return dirty; > +} > + > +/** > + * zone_dirty_ok - tells whether a zone is within its dirty limits > + * @zone: the zone to check > + * > + * Returns %true when the dirty pages in @zone are within the zone''s > + * dirty limit, %false if the limit is exceeded. > + */ > +bool zone_dirty_ok(struct zone *zone) > +{ > + unsigned long limit = zone_dirty_limit(zone); > + > + return zone_page_state(zone, NR_FILE_DIRTY) + > + zone_page_state(zone, NR_UNSTABLE_NFS) + > + zone_page_state(zone, NR_WRITEBACK) <= limit; > +} > + > /* > * couple the period to the dirty_ratio: > * > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index f8cba89..afaf59e 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1675,6 +1675,35 @@ zonelist_scan: > if ((alloc_flags & ALLOC_CPUSET) && > !cpuset_zone_allowed_softwall(zone, gfp_mask)) > continue; > + /* > + * When allocating a page cache page for writing, we > + * want to get it from a zone that is within its dirty > + * limit, such that no single zone holds more than its > + * proportional share of globally allowed dirty pages. > + * The dirty limits take into account the zone''s > + * lowmem reserves and high watermark so that kswapd > + * should be able to balance it without having to > + * write pages from its LRU list. > + * > + * This may look like it could increase pressure on > + * lower zones by failing allocations in higher zones > + * before they are full. 
But the pages that do spill > + * over are limited as the lower zones are protected > + * by this very same mechanism. It should not become > + * a practical burden to them. > + * > + * XXX: For now, allow allocations to potentially > + * exceed the per-zone dirty limit in the slowpath > + * (ALLOC_WMARK_LOW unset) before going into reclaim, > + * which is important when on a NUMA setup the allowed > + * zones are together not big enough to reach the > + * global limit. The proper fix for these situations > + * will require awareness of zones in the > + * dirty-throttling and the flusher threads. > + */ > + if ((alloc_flags & ALLOC_WMARK_LOW) && > + (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone)) > + goto this_zone_full; > > BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK); > if (!(alloc_flags & ALLOC_NO_WATERMARKS)) { > -- > 1.7.6.2 > > -- > To unsubscribe, send a message with ''unsubscribe linux-mm'' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ > Don''t email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>-- Michal Hocko SUSE Labs SUSE LINUX s.r.o. Lihovarska 1060/12 190 00 Praha 9 Czech Republic -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Michal Hocko
2011-Sep-30 14:41 UTC
Re: [patch 4/5] mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin()
On Fri 30-09-11 09:17:23, Johannes Weiner wrote:> Tell the page allocator that pages allocated through > grab_cache_page_write_begin() are expected to become dirty soon. > > Signed-off-by: Johannes Weiner <jweiner@redhat.com> > Reviewed-by: Rik van Riel <riel@redhat.com> > Acked-by: Mel Gorman <mgorman@suse.de> > Reviewed-by: Minchan Kim <minchan.kim@gmail.com>Reviewed-by: Michal Hocko <mhocko@suse.cz>> --- > mm/filemap.c | 5 ++++- > 1 files changed, 4 insertions(+), 1 deletions(-) > > diff --git a/mm/filemap.c b/mm/filemap.c > index 645a080..cf0352d 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -2349,8 +2349,11 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping, > pgoff_t index, unsigned flags) > { > int status; > + gfp_t gfp_mask; > struct page *page; > gfp_t gfp_notmask = 0; > + > + gfp_mask = mapping_gfp_mask(mapping) | __GFP_WRITE; > if (flags & AOP_FLAG_NOFS) > gfp_notmask = __GFP_FS; > repeat: > @@ -2358,7 +2361,7 @@ repeat: > if (page) > goto found; > > - page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask); > + page = __page_cache_alloc(gfp_mask & ~gfp_notmask); > if (!page) > return NULL; > status = add_to_page_cache_lru(page, mapping, index, > -- > 1.7.6.2 > > -- > To unsubscribe, send a message with ''unsubscribe linux-mm'' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ > Don''t email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>-- Michal Hocko SUSE Labs SUSE LINUX s.r.o. Lihovarska 1060/12 190 00 Praha 9 Czech Republic -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
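The flag handling in patch 4/5's grab_cache_page_write_begin() hunk is a two-step mask operation that can be shown in isolation: __GFP_WRITE is ORed into the mapping's mask up front, and AOP_FLAG_NOFS strips __GFP_FS via the notmask at allocation time. The numeric values below mirror the 3.1-era headers and the patch; treat this as illustrative, not the kernel's definitions.

```c
#include <assert.h>

#define __GFP_FS	0x80u		/* allocation may call into the fs */
#define __GFP_WRITE	0x1000000u	/* allocator intends to dirty page */

/* Sketch of the mask computed before __page_cache_alloc(): nofs
 * stands in for (flags & AOP_FLAG_NOFS). */
static unsigned int write_begin_gfp(unsigned int mapping_gfp, int nofs)
{
	unsigned int gfp_mask = mapping_gfp | __GFP_WRITE;
	unsigned int gfp_notmask = nofs ? __GFP_FS : 0;

	return gfp_mask & ~gfp_notmask;
}
```

Because __GFP_WRITE is added before the notmask is applied, the dirty hint survives the NOFS case while all other mapping bits pass through untouched.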
Minchan Kim
2011-Oct-01 07:10 UTC
Re: [patch 1/5] mm: exclude reserved pages from dirtyable memory
On Fri, Sep 30, 2011 at 09:17:20AM +0200, Johannes Weiner wrote:> The amount of dirtyable pages should not include the full number of > free pages: there is a number of reserved pages that the page > allocator and kswapd always try to keep free. > > The closer (reclaimable pages - dirty pages) is to the number of > reserved pages, the more likely it becomes for reclaim to run into > dirty pages: > > +----------+ --- > | anon | | > +----------+ | > | | | > | | -- dirty limit new -- flusher new > | file | | | > | | | | > | | -- dirty limit old -- flusher old > | | | > +----------+ --- reclaim > | reserved | > +----------+ > | kernel | > +----------+ > > This patch introduces a per-zone dirty reserve that takes both the > lowmem reserve as well as the high watermark of the zone into account, > and a global sum of those per-zone values that is subtracted from the > global amount of dirtyable pages. The lowmem reserve is unavailable > to page cache allocations and kswapd tries to keep the high watermark > free. We don''t want to end up in a situation where reclaim has to > clean pages in order to balance zones. > > Not treating reserved pages as dirtyable on a global level is only a > conceptual fix. In reality, dirty pages are not distributed equally > across zones and reclaim runs into dirty pages on a regular basis. > > But it is important to get this right before tackling the problem on a > per-zone level, where the distance between reclaim and the dirty pages > is mostly much smaller in absolute numbers. > > Signed-off-by: Johannes Weiner <jweiner@redhat.com> > Reviewed-by: Rik van Riel <riel@redhat.com>Reviewed-by: Minchan Kim <minchan.kim@gmail.com> -- Kinds regards, Minchan Kim
Mel Gorman
2011-Oct-03 11:22 UTC
Re: [patch 1/5] mm: exclude reserved pages from dirtyable memory
On Fri, Sep 30, 2011 at 09:17:20AM +0200, Johannes Weiner wrote:> The amount of dirtyable pages should not include the full number of > free pages: there is a number of reserved pages that the page > allocator and kswapd always try to keep free. > > The closer (reclaimable pages - dirty pages) is to the number of > reserved pages, the more likely it becomes for reclaim to run into > dirty pages: > > +----------+ --- > | anon | | > +----------+ | > | | | > | | -- dirty limit new -- flusher new > | file | | | > | | | | > | | -- dirty limit old -- flusher old > | | | > +----------+ --- reclaim > | reserved | > +----------+ > | kernel | > +----------+ > > This patch introduces a per-zone dirty reserve that takes both the > lowmem reserve as well as the high watermark of the zone into account, > and a global sum of those per-zone values that is subtracted from the > global amount of dirtyable pages. The lowmem reserve is unavailable > to page cache allocations and kswapd tries to keep the high watermark > free. We don''t want to end up in a situation where reclaim has to > clean pages in order to balance zones. > > Not treating reserved pages as dirtyable on a global level is only a > conceptual fix. In reality, dirty pages are not distributed equally > across zones and reclaim runs into dirty pages on a regular basis. > > But it is important to get this right before tackling the problem on a > per-zone level, where the distance between reclaim and the dirty pages > is mostly much smaller in absolute numbers. > > Signed-off-by: Johannes Weiner <jweiner@redhat.com> > Reviewed-by: Rik van Riel <riel@redhat.com>Acked-by: Mel Gorman <mgorman@suse.de> -- Mel Gorman SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Mel Gorman
2011-Oct-03 11:25 UTC
Re: [patch 5/5] Btrfs: pass __GFP_WRITE for buffered write page allocations
On Fri, Sep 30, 2011 at 09:17:24AM +0200, Johannes Weiner wrote:> Tell the page allocator that pages allocated for a buffered write are > expected to become dirty soon. > > Signed-off-by: Johannes Weiner <jweiner@redhat.com> > Reviewed-by: Rik van Riel <riel@redhat.com>Acked-by: Mel Gorman <mgorman@suse.de> -- Mel Gorman SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Wu Fengguang
2011-Oct-28 20:18 UTC
Re: [patch 3/5] mm: try to distribute dirty pages fairly across zones
Hi Johannes, I tested this patchset over the IO-less dirty throttling one. The below numbers show that //improvements 1) write bandwidth increased by 1% in general 2) greatly reduced nr_vmscan_immediate_reclaim //regression 3) much increased cpu %user and %system for btrfs Thanks, Fengguang --- kernel before this patchset: 3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+ kernel after this patchset: 3.1.0-rc9-ioless-full-per-zone-dirty-next-20111014+ 3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+ 3.1.0-rc9-ioless-full-per-zone-dirty-next-20111014+ ------------------------ ------------------------ 2056.51 +1.0% 2076.29 TOTAL write_bw 32260625.00 -86.0% 4532517.00 TOTAL nr_vmscan_immediate_reclaim 90.44 +25.7% 113.67 TOTAL cpu_user 113.05 +9.9% 124.25 TOTAL cpu_system 3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+ 3.1.0-rc9-ioless-full-per-zone-dirty-next-20111014+ ------------------------ ------------------------ 52.43 +1.3% 53.12 thresh=1000M/btrfs-100dd-4k-8p-4096M-1000M:10-X 52.72 +0.8% 53.16 thresh=1000M/btrfs-10dd-4k-8p-4096M-1000M:10-X 52.24 +2.7% 53.67 thresh=1000M/btrfs-1dd-4k-8p-4096M-1000M:10-X 35.52 +1.2% 35.94 thresh=1000M/ext3-100dd-4k-8p-4096M-1000M:10-X 39.37 +1.6% 39.98 thresh=1000M/ext3-10dd-4k-8p-4096M-1000M:10-X 47.52 +0.5% 47.75 thresh=1000M/ext3-1dd-4k-8p-4096M-1000M:10-X 47.13 +1.1% 47.64 thresh=1000M/ext4-100dd-4k-8p-4096M-1000M:10-X 52.28 +3.0% 53.86 thresh=1000M/ext4-10dd-4k-8p-4096M-1000M:10-X 54.34 +1.0% 54.87 thresh=1000M/ext4-1dd-4k-8p-4096M-1000M:10-X 47.63 +0.3% 47.78 thresh=1000M/xfs-100dd-4k-8p-4096M-1000M:10-X 51.25 +2.1% 52.34 thresh=1000M/xfs-10dd-4k-8p-4096M-1000M:10-X 52.66 +2.5% 54.00 thresh=1000M/xfs-1dd-4k-8p-4096M-1000M:10-X 54.63 -0.0% 54.63 thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X 53.75 +1.0% 54.29 thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X 54.14 +0.4% 54.35 thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X 36.87 -0.0% 36.86 thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X 45.20 -0.3% 45.07 thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X 
40.75 -0.6% 40.51 thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X 44.14 +0.3% 44.29 thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X 52.91 +0.1% 52.99 thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X 50.30 +0.8% 50.72 thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X 44.55 +2.8% 45.80 thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X 52.75 +4.3% 55.03 thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X 50.99 +1.7% 51.87 thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X 37.35 +2.0% 38.11 thresh=10M/btrfs-10dd-4k-8p-4096M-10M:10-X 53.32 +2.3% 54.55 thresh=10M/btrfs-1dd-4k-8p-4096M-10M:10-X 50.72 +3.9% 52.70 thresh=10M/btrfs-2dd-4k-8p-4096M-10M:10-X 32.05 +0.7% 32.27 thresh=10M/ext3-10dd-4k-8p-4096M-10M:10-X 43.91 -1.2% 43.39 thresh=10M/ext3-1dd-4k-8p-4096M-10M:10-X 42.37 +0.3% 42.50 thresh=10M/ext3-2dd-4k-8p-4096M-10M:10-X 35.04 -1.9% 34.36 thresh=10M/ext4-10dd-4k-8p-4096M-10M:10-X 52.93 -0.4% 52.73 thresh=10M/ext4-1dd-4k-8p-4096M-10M:10-X 49.24 -0.0% 49.22 thresh=10M/ext4-2dd-4k-8p-4096M-10M:10-X 30.96 -0.8% 30.73 thresh=10M/xfs-10dd-4k-8p-4096M-10M:10-X 54.30 -0.8% 53.89 thresh=10M/xfs-1dd-4k-8p-4096M-10M:10-X 45.63 +1.2% 46.17 thresh=10M/xfs-2dd-4k-8p-4096M-10M:10-X 1.92 -1.5% 1.89 thresh=1M/btrfs-10dd-4k-8p-4096M-1M:10-X 2.28 +5.9% 2.42 thresh=1M/btrfs-1dd-4k-8p-4096M-1M:10-X 2.07 -1.4% 2.04 thresh=1M/btrfs-2dd-4k-8p-4096M-1M:10-X 25.31 +10.2% 27.88 thresh=1M/ext3-10dd-4k-8p-4096M-1M:10-X 42.95 -0.9% 42.56 thresh=1M/ext3-1dd-4k-8p-4096M-1M:10-X 38.62 -0.9% 38.26 thresh=1M/ext3-2dd-4k-8p-4096M-1M:10-X 30.81 -1.0% 30.51 thresh=1M/ext4-10dd-4k-8p-4096M-1M:10-X 49.72 +0.2% 49.80 thresh=1M/ext4-1dd-4k-8p-4096M-1M:10-X 44.75 -0.3% 44.61 thresh=1M/ext4-2dd-4k-8p-4096M-1M:10-X 27.87 +1.3% 28.23 thresh=1M/xfs-10dd-4k-8p-4096M-1M:10-X 51.05 +1.0% 51.54 thresh=1M/xfs-1dd-4k-8p-4096M-1M:10-X 45.25 +0.3% 45.39 thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X 2056.51 +1.0% 2076.29 TOTAL write_bw 3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+ 3.1.0-rc9-ioless-full-per-zone-dirty-next-20111014+ ------------------------ 
------------------------
560289.00  -98.5%  8145.00  thresh=1000M/btrfs-100dd-4k-8p-4096M-1000M:10-X
576882.00  -98.4%  9511.00  thresh=1000M/btrfs-10dd-4k-8p-4096M-1000M:10-X
651258.00  -98.8%  7963.00  thresh=1000M/btrfs-1dd-4k-8p-4096M-1000M:10-X
1963294.00  -85.4%  286815.00  thresh=1000M/ext3-100dd-4k-8p-4096M-1000M:10-X
2108028.00  -10.6%  1885114.00  thresh=1000M/ext3-10dd-4k-8p-4096M-1000M:10-X
2499456.00  -99.9%  2061.00  thresh=1000M/ext3-1dd-4k-8p-4096M-1000M:10-X
2534868.00  -78.5%  545815.00  thresh=1000M/ext4-100dd-4k-8p-4096M-1000M:10-X
2921668.00  -76.8%  677177.00  thresh=1000M/ext4-10dd-4k-8p-4096M-1000M:10-X
2841049.00  -100.0%  779.00  thresh=1000M/ext4-1dd-4k-8p-4096M-1000M:10-X
2481823.00  -86.3%  339342.00  thresh=1000M/xfs-100dd-4k-8p-4096M-1000M:10-X
2508629.00  -87.4%  316614.00  thresh=1000M/xfs-10dd-4k-8p-4096M-1000M:10-X
2656628.00  -100.0%  678.00  thresh=1000M/xfs-1dd-4k-8p-4096M-1000M:10-X
466024.00  -98.9%  5263.00  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
460626.00  -99.6%  1853.00  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
454364.00  -99.3%  2959.00  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
682975.00  -89.3%  73185.00  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
787717.00  -99.7%  2648.00  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
611101.00  -99.2%  4629.00  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
555841.00  -87.9%  67433.00  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
475452.00  -99.9%  311.00  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
501009.00  -97.9%  10608.00  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
362202.00  -82.4%  63873.00  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
716571.00  0.00  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
621495.00  -93.9%  38030.00  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
4463.00  -81.2%  839.00  thresh=10M/btrfs-10dd-4k-8p-4096M-10M:10-X
18824.00  -97.4%  490.00  thresh=10M/btrfs-1dd-4k-8p-4096M-10M:10-X
12486.00  -94.1%  736.00  thresh=10M/btrfs-2dd-4k-8p-4096M-10M:10-X
43396.00  -70.2%  12945.00  thresh=10M/ext3-10dd-4k-8p-4096M-10M:10-X
109247.00  -100.0%  42.00  thresh=10M/ext3-1dd-4k-8p-4096M-10M:10-X
92196.00  -100.0%  15.00  thresh=10M/ext3-2dd-4k-8p-4096M-10M:10-X
44717.00  -52.9%  21078.00  thresh=10M/ext4-10dd-4k-8p-4096M-10M:10-X
87977.00  0.00  thresh=10M/ext4-1dd-4k-8p-4096M-10M:10-X
130864.00  -98.9%  1381.00  thresh=10M/ext4-2dd-4k-8p-4096M-10M:10-X
35133.00  -99.9%  52.00  thresh=10M/xfs-10dd-4k-8p-4096M-10M:10-X
117181.00  -100.0%  10.00  thresh=10M/xfs-1dd-4k-8p-4096M-10M:10-X
133795.00  -79.3%  27705.00  thresh=10M/xfs-2dd-4k-8p-4096M-10M:10-X
0.00  0.00  thresh=1M/btrfs-10dd-4k-8p-4096M-1M:10-X
5.00  0.00  thresh=1M/btrfs-1dd-4k-8p-4096M-1M:10-X
0.00  0.00  thresh=1M/btrfs-2dd-4k-8p-4096M-1M:10-X
34914.00  -62.8%  12983.00  thresh=1M/ext3-10dd-4k-8p-4096M-1M:10-X
73497.00  0.00  thresh=1M/ext3-1dd-4k-8p-4096M-1M:10-X
52923.00  -68.0%  16922.00  thresh=1M/ext3-2dd-4k-8p-4096M-1M:10-X
40172.00  -65.8%  13750.00  thresh=1M/ext4-10dd-4k-8p-4096M-1M:10-X
60073.00  -79.0%  12601.00  thresh=1M/ext4-1dd-4k-8p-4096M-1M:10-X
58565.00  -69.8%  17690.00  thresh=1M/ext4-2dd-4k-8p-4096M-1M:10-X
21840.00  -50.8%  10744.00  thresh=1M/xfs-10dd-4k-8p-4096M-1M:10-X
46227.00  -65.2%  16103.00  thresh=1M/xfs-1dd-4k-8p-4096M-1M:10-X
42881.00  -63.6%  15625.00  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
32260625.00  -86.0%  4532517.00  TOTAL nr_vmscan_immediate_reclaim

3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+  3.1.0-rc9-ioless-full-per-zone-dirty-next-20111014+
------------------------  ------------------------
4.46  +48.6%  6.62  thresh=1000M/btrfs-100dd-4k-8p-4096M-1000M:10-X
0.92  +261.7%  3.34  thresh=1000M/btrfs-10dd-4k-8p-4096M-1000M:10-X
1.12  +222.2%  3.61  thresh=1000M/btrfs-1dd-4k-8p-4096M-1000M:10-X
2.59  -14.3%  2.22  thresh=1000M/ext3-100dd-4k-8p-4096M-1000M:10-X
0.68  -0.6%  0.67  thresh=1000M/ext3-10dd-4k-8p-4096M-1000M:10-X
0.67  -3.2%  0.64  thresh=1000M/ext3-1dd-4k-8p-4096M-1000M:10-X
2.84  +1.9%  2.89  thresh=1000M/ext4-100dd-4k-8p-4096M-1000M:10-X
0.70  +1.7%  0.71  thresh=1000M/ext4-10dd-4k-8p-4096M-1000M:10-X
0.70  -6.3%  0.66  thresh=1000M/ext4-1dd-4k-8p-4096M-1000M:10-X
2.86  +1.5%  2.91  thresh=1000M/xfs-100dd-4k-8p-4096M-1000M:10-X
0.75  -0.5%  0.75  thresh=1000M/xfs-10dd-4k-8p-4096M-1000M:10-X
0.96  -4.0%  0.92  thresh=1000M/xfs-1dd-4k-8p-4096M-1000M:10-X
1.15  +229.7%  3.79  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
0.95  +269.8%  3.50  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
0.84  +309.1%  3.45  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
0.76  -0.8%  0.76  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
0.73  -5.5%  0.69  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
0.66  -5.3%  0.62  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
0.89  +2.0%  0.91  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
0.75  -7.0%  0.70  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
0.74  -4.5%  0.71  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
0.92  +1.1%  0.93  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
0.99  -4.4%  0.95  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
0.91  -2.2%  0.89  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
2.51  +107.7%  5.21  thresh=10M/btrfs-10dd-4k-8p-4096M-10M:10-X
2.46  +103.1%  4.99  thresh=10M/btrfs-1dd-4k-8p-4096M-10M:10-X
2.33  +113.0%  4.97  thresh=10M/btrfs-2dd-4k-8p-4096M-10M:10-X
1.52  +0.2%  1.53  thresh=10M/ext3-10dd-4k-8p-4096M-10M:10-X
2.07  -1.4%  2.04  thresh=10M/ext3-1dd-4k-8p-4096M-10M:10-X
1.92  -0.1%  1.92  thresh=10M/ext3-2dd-4k-8p-4096M-10M:10-X
1.66  -3.2%  1.61  thresh=10M/ext4-10dd-4k-8p-4096M-10M:10-X
2.48  -0.8%  2.46  thresh=10M/ext4-1dd-4k-8p-4096M-10M:10-X
2.22  -1.2%  2.19  thresh=10M/ext4-2dd-4k-8p-4096M-10M:10-X
1.51  -1.4%  1.49  thresh=10M/xfs-10dd-4k-8p-4096M-10M:10-X
2.04  -1.8%  2.00  thresh=10M/xfs-1dd-4k-8p-4096M-10M:10-X
1.80  +1.5%  1.82  thresh=10M/xfs-2dd-4k-8p-4096M-10M:10-X
2.72  +13.0%  3.08  thresh=1M/btrfs-10dd-4k-8p-4096M-1M:10-X
1.05  +15.4%  1.21  thresh=1M/btrfs-1dd-4k-8p-4096M-1M:10-X
1.07  +16.5%  1.25  thresh=1M/btrfs-2dd-4k-8p-4096M-1M:10-X
4.58  +7.6%  4.93  thresh=1M/ext3-10dd-4k-8p-4096M-1M:10-X
2.49  -0.3%  2.49  thresh=1M/ext3-1dd-4k-8p-4096M-1M:10-X
2.81  +0.8%  2.83  thresh=1M/ext3-2dd-4k-8p-4096M-1M:10-X
5.25  +1.7%  5.34  thresh=1M/ext4-10dd-4k-8p-4096M-1M:10-X
2.52  +1.4%  2.56  thresh=1M/ext4-1dd-4k-8p-4096M-1M:10-X
2.83  -0.4%  2.82  thresh=1M/ext4-2dd-4k-8p-4096M-1M:10-X
5.11  +1.5%  5.19  thresh=1M/xfs-10dd-4k-8p-4096M-1M:10-X
2.81  -0.1%  2.81  thresh=1M/xfs-1dd-4k-8p-4096M-1M:10-X
3.11  -0.6%  3.09  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
90.44  +25.7%  113.67  TOTAL cpu_user

3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+  3.1.0-rc9-ioless-full-per-zone-dirty-next-20111014+
------------------------  ------------------------
6.49  +20.1%  7.79  thresh=1000M/btrfs-100dd-4k-8p-4096M-1000M:10-X
5.24  +26.9%  6.65  thresh=1000M/btrfs-10dd-4k-8p-4096M-1000M:10-X
6.16  +22.0%  7.51  thresh=1000M/btrfs-1dd-4k-8p-4096M-1000M:10-X
1.15  -12.3%  1.01  thresh=1000M/ext3-100dd-4k-8p-4096M-1000M:10-X
0.71  +1.5%  0.72  thresh=1000M/ext3-10dd-4k-8p-4096M-1000M:10-X
2.15  -3.2%  2.08  thresh=1000M/ext3-1dd-4k-8p-4096M-1000M:10-X
1.29  +1.1%  1.31  thresh=1000M/ext4-100dd-4k-8p-4096M-1000M:10-X
0.84  +0.1%  0.84  thresh=1000M/ext4-10dd-4k-8p-4096M-1000M:10-X
2.10  -1.9%  2.06  thresh=1000M/ext4-1dd-4k-8p-4096M-1000M:10-X
1.24  -0.5%  1.23  thresh=1000M/xfs-100dd-4k-8p-4096M-1000M:10-X
0.65  +1.6%  0.66  thresh=1000M/xfs-10dd-4k-8p-4096M-1000M:10-X
1.77  +3.5%  1.83  thresh=1000M/xfs-1dd-4k-8p-4096M-1000M:10-X
5.38  +22.5%  6.59  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
6.05  +19.7%  7.25  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
5.99  +18.9%  7.13  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
0.71  +2.8%  0.73  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
2.28  -1.3%  2.25  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
1.88  -2.0%  1.85  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
0.68  -1.1%  0.67  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
1.65  +0.4%  1.66  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
1.51  -2.9%  1.47  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
0.63  +2.9%  0.65  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
1.87  +1.7%  1.90  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
1.70  -1.4%  1.68  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
5.31  +25.7%  6.67  thresh=10M/btrfs-10dd-4k-8p-4096M-10M:10-X
5.50  +21.3%  6.67  thresh=10M/btrfs-1dd-4k-8p-4096M-10M:10-X
5.74  +20.8%  6.94  thresh=10M/btrfs-2dd-4k-8p-4096M-10M:10-X
0.85  -0.6%  0.84  thresh=10M/ext3-10dd-4k-8p-4096M-10M:10-X
1.41  -4.4%  1.35  thresh=10M/ext3-1dd-4k-8p-4096M-10M:10-X
1.43  -2.7%  1.40  thresh=10M/ext3-2dd-4k-8p-4096M-10M:10-X
0.77  -3.0%  0.75  thresh=10M/ext4-10dd-4k-8p-4096M-10M:10-X
1.39  -3.3%  1.35  thresh=10M/ext4-1dd-4k-8p-4096M-10M:10-X
1.37  -5.1%  1.30  thresh=10M/ext4-2dd-4k-8p-4096M-10M:10-X
0.70  -2.5%  0.68  thresh=10M/xfs-10dd-4k-8p-4096M-10M:10-X
2.11  -3.7%  2.03  thresh=10M/xfs-1dd-4k-8p-4096M-10M:10-X
1.77  -1.0%  1.75  thresh=10M/xfs-2dd-4k-8p-4096M-10M:10-X
0.86  +10.3%  0.94  thresh=1M/btrfs-10dd-4k-8p-4096M-1M:10-X
0.66  +14.0%  0.76  thresh=1M/btrfs-1dd-4k-8p-4096M-1M:10-X
0.57  +11.8%  0.63  thresh=1M/btrfs-2dd-4k-8p-4096M-1M:10-X
1.89  +8.8%  2.06  thresh=1M/ext3-10dd-4k-8p-4096M-1M:10-X
3.20  -0.5%  3.19  thresh=1M/ext3-1dd-4k-8p-4096M-1M:10-X
2.77  +0.1%  2.77  thresh=1M/ext3-2dd-4k-8p-4096M-1M:10-X
2.00  +0.3%  2.01  thresh=1M/ext4-10dd-4k-8p-4096M-1M:10-X
3.16  +1.5%  3.21  thresh=1M/ext4-1dd-4k-8p-4096M-1M:10-X
2.52  -0.9%  2.50  thresh=1M/ext4-2dd-4k-8p-4096M-1M:10-X
1.84  +0.7%  1.85  thresh=1M/xfs-10dd-4k-8p-4096M-1M:10-X
2.82  +0.3%  2.82  thresh=1M/xfs-1dd-4k-8p-4096M-1M:10-X
2.27  -0.3%  2.26  thresh=1M/xfs-2dd-4k-8p-4096M-1M:10-X
113.05  +9.9%  124.25  TOTAL cpu_system

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Wu Fengguang
2011-Oct-28 20:39 UTC
Re: [patch 3/5] mm: try to distribute dirty pages fairly across zones
[restore CC list]

> > I'm trying to understand where the performance gain comes from.
> >
> > I noticed that in all cases, before/after patchset, nr_vmscan_write are all zero.
> >
> > nr_vmscan_immediate_reclaim is significantly reduced though:
>
> That's a good thing, it means we burn less CPU time on skipping
> through dirty pages on the LRU.
>
> Until a certain priority level, the dirty pages encountered on the LRU
> list are marked PageReclaim and put back on the list, this is the
> nr_vmscan_immediate_reclaim number. And only below that priority, we
> actually ask the FS to write them, which is nr_vmscan_write.

Yes, it is.

> I suspect this is where the performance improvement comes from: we
> find clean pages for reclaim much faster.

That explains how it could reduce CPU overheads. However the dd's are
throttled anyway, so I still don't understand how the speedup of dd page
allocations improves the _IO_ performance.

> > $ ./compare.rb -g 1000M -e nr_vmscan_immediate_reclaim thresh*/*-ioless-full-nfs-wq5-next-20111014+ thresh*/*-ioless-full-per-zone-dirty-next-20111014+
> > 3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+  3.1.0-rc9-ioless-full-per-zone-dirty-next-20111014+
> > ------------------------  ------------------------
> > 560289.00  -98.5%  8145.00  thresh=1000M/btrfs-100dd-4k-8p-4096M-1000M:10-X
> > 576882.00  -98.4%  9511.00  thresh=1000M/btrfs-10dd-4k-8p-4096M-1000M:10-X
> > 651258.00  -98.8%  7963.00  thresh=1000M/btrfs-1dd-4k-8p-4096M-1000M:10-X
> > 1963294.00  -85.4%  286815.00  thresh=1000M/ext3-100dd-4k-8p-4096M-1000M:10-X
> > 2108028.00  -10.6%  1885114.00  thresh=1000M/ext3-10dd-4k-8p-4096M-1000M:10-X
> > 2499456.00  -99.9%  2061.00  thresh=1000M/ext3-1dd-4k-8p-4096M-1000M:10-X
> > 2534868.00  -78.5%  545815.00  thresh=1000M/ext4-100dd-4k-8p-4096M-1000M:10-X
> > 2921668.00  -76.8%  677177.00  thresh=1000M/ext4-10dd-4k-8p-4096M-1000M:10-X
> > 2841049.00  -100.0%  779.00  thresh=1000M/ext4-1dd-4k-8p-4096M-1000M:10-X
> > 2481823.00  -86.3%  339342.00  thresh=1000M/xfs-100dd-4k-8p-4096M-1000M:10-X
> > 2508629.00  -87.4%  316614.00  thresh=1000M/xfs-10dd-4k-8p-4096M-1000M:10-X
> > 2656628.00  -100.0%  678.00  thresh=1000M/xfs-1dd-4k-8p-4096M-1000M:10-X
> > 24303872.00  -83.2%  4080014.00  TOTAL nr_vmscan_immediate_reclaim
> >
> > If you'd like to compare any other vmstat items before/after patch,
> > let me know and I'll run the compare script to find them out.
>
> I will come back to you on this, so tired right now. But I find your
> scripts interesting ;-) Are those released and available for download
> somewhere? I suspect every kernel hacker has their own collection of
> scripts to process data like this, maybe we should pull them all
> together and put them into a git tree!

Thank you for the interest :-) I used to upload my writeback test
scripts to kernel.org. However its file service is not restored yet.
So I attach the compare script here. It's a bit hacky for now, which I
hope can be improved over time to be useful to other projects as well.

Thanks,
Fengguang
Wu Fengguang
2011-Oct-31 11:33 UTC
Re: [patch 3/5] mm: try to distribute dirty pages fairly across zones
> //regression
> 3) much increased cpu %user and %system for btrfs

Sorry, I found out that the CPU time regressions for btrfs are caused
by some additional trace events enabled on btrfs (for debugging an
unrelated btrfs hang bug), which results in 7 times more trace event
lines:

 2701238 /export/writeback/thresh=1000M/btrfs-1dd-4k-8p-2941M-1000M:10-3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+
19054054 /export/writeback/thresh=1000M/btrfs-1dd-4k-8p-2941M-1000M:10-3.1.0-rc9-ioless-full-per-zone-dirty-next-20111014+

So no real regressions.

Besides, the patchset also performs well on random writes:

3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+  3.1.0-rc9-ioless-full-per-zone-dirty-next-20111014+
------------------------  ------------------------
1.65  -5.1%  1.57  MMAP-RANDWRITE-4K/btrfs-fio_fat_mmap_randwrite_4k-4k-8p-4096M-20:10-X
18.65  -6.4%  17.46  MMAP-RANDWRITE-4K/ext3-fio_fat_mmap_randwrite_4k-4k-8p-4096M-20:10-X
2.09  +1.2%  2.12  MMAP-RANDWRITE-4K/ext4-fio_fat_mmap_randwrite_4k-4k-8p-4096M-20:10-X
2.49  -0.3%  2.48  MMAP-RANDWRITE-4K/xfs-fio_fat_mmap_randwrite_4k-4k-8p-4096M-20:10-X
51.35  +0.0%  51.36  MMAP-RANDWRITE-64K/btrfs-fio_fat_mmap_randwrite_64k-64k-8p-4096M-20:10-X
45.20  +0.5%  45.43  MMAP-RANDWRITE-64K/ext3-fio_fat_mmap_randwrite_64k-64k-8p-4096M-20:10-X
44.77  +0.7%  45.10  MMAP-RANDWRITE-64K/ext4-fio_fat_mmap_randwrite_64k-64k-8p-4096M-20:10-X
45.11  +2.5%  46.23  MMAP-RANDWRITE-64K/xfs-fio_fat_mmap_randwrite_64k-64k-8p-4096M-20:10-X
211.31  +0.2%  211.74  TOTAL write_bw

And writes to a USB key:

3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+  3.1.0-rc9-ioless-full-per-zone-dirty-next-20111014+
------------------------  ------------------------
5.94  +0.8%  5.99  UKEY-thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
2.64  -0.8%  2.62  UKEY-thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
5.10  +0.3%  5.12  UKEY-thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
3.26  -0.8%  3.24  UKEY-thresh=1G/ext3-2dd-4k-8p-4096M-1024M:10-X
5.63  -0.5%  5.60  UKEY-thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
6.04  -0.1%  6.04  UKEY-thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
5.90  -0.2%  5.88  UKEY-thresh=1G/ext4-2dd-4k-8p-4096M-1024M:10-X
2.45  +22.6%  3.00  UKEY-thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
6.18  -0.4%  6.16  UKEY-thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
4.81  +0.0%  4.81  UKEY-thresh=1G/xfs-2dd-4k-8p-4096M-1024M:10-X
47.94  +1.1%  48.45  TOTAL write_bw

In summary, I see no problem at all in these trivial writeback tests.

Tested-by: Wu Fengguang <fengguang.wu@intel.com>

Thanks,
Fengguang
Johannes Weiner
2011-Nov-01 10:52 UTC
Re: [patch 3/5] mm: try to distribute dirty pages fairly across zones
On Sat, Oct 29, 2011 at 04:39:44AM +0800, Wu Fengguang wrote:
> [restore CC list]
>
> > > I'm trying to understand where the performance gain comes from.
> > >
> > > I noticed that in all cases, before/after patchset, nr_vmscan_write are all zero.
> > >
> > > nr_vmscan_immediate_reclaim is significantly reduced though:
> >
> > That's a good thing, it means we burn less CPU time on skipping
> > through dirty pages on the LRU.
> >
> > Until a certain priority level, the dirty pages encountered on the LRU
> > list are marked PageReclaim and put back on the list, this is the
> > nr_vmscan_immediate_reclaim number. And only below that priority, we
> > actually ask the FS to write them, which is nr_vmscan_write.
>
> Yes, it is.
>
> > I suspect this is where the performance improvement comes from: we
> > find clean pages for reclaim much faster.
>
> That explains how it could reduce CPU overheads. However the dd's are
> throttled anyway, so I still don't understand how the speedup of dd page
> allocations improves the _IO_ performance.

They are throttled in balance_dirty_pages() when there are too many
dirty pages. But they are also 'throttled' in direct reclaim when
there are too many clean + dirty pages.

Wild guess: speeding up direct reclaim allows dirty pages to be
generated faster and the writer can better saturate the BDI?

Not all filesystems ignore all VM writepage requests, either. xfs
e.g. ignores only direct reclaim but honors requests from kswapd.
ext4 honors writepage whenever it pleases. On those, I can imagine
the reduced writepage interference to help. But that can not be the
only reason, as btrfs ignores writepage from reclaim in general and
still sees improvement.
Johannes Weiner
2011-Nov-01 10:55 UTC
Re: [patch 3/5] mm: try to distribute dirty pages fairly across zones
On Mon, Oct 31, 2011 at 07:33:21PM +0800, Wu Fengguang wrote:
> > //regression
> > 3) much increased cpu %user and %system for btrfs
>
> Sorry, I found out that the CPU time regressions for btrfs are caused
> by some additional trace events enabled on btrfs (for debugging an
> unrelated btrfs hang bug), which results in 7 times more trace event
> lines:
>
>  2701238 /export/writeback/thresh=1000M/btrfs-1dd-4k-8p-2941M-1000M:10-3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+
> 19054054 /export/writeback/thresh=1000M/btrfs-1dd-4k-8p-2941M-1000M:10-3.1.0-rc9-ioless-full-per-zone-dirty-next-20111014+
>
> So no real regressions.

Phew :-)

> Besides, the patchset also performs well on random writes:
>
> 3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+  3.1.0-rc9-ioless-full-per-zone-dirty-next-20111014+
> ------------------------  ------------------------
> 1.65  -5.1%  1.57  MMAP-RANDWRITE-4K/btrfs-fio_fat_mmap_randwrite_4k-4k-8p-4096M-20:10-X
> 18.65  -6.4%  17.46  MMAP-RANDWRITE-4K/ext3-fio_fat_mmap_randwrite_4k-4k-8p-4096M-20:10-X
> 2.09  +1.2%  2.12  MMAP-RANDWRITE-4K/ext4-fio_fat_mmap_randwrite_4k-4k-8p-4096M-20:10-X
> 2.49  -0.3%  2.48  MMAP-RANDWRITE-4K/xfs-fio_fat_mmap_randwrite_4k-4k-8p-4096M-20:10-X
> 51.35  +0.0%  51.36  MMAP-RANDWRITE-64K/btrfs-fio_fat_mmap_randwrite_64k-64k-8p-4096M-20:10-X
> 45.20  +0.5%  45.43  MMAP-RANDWRITE-64K/ext3-fio_fat_mmap_randwrite_64k-64k-8p-4096M-20:10-X
> 44.77  +0.7%  45.10  MMAP-RANDWRITE-64K/ext4-fio_fat_mmap_randwrite_64k-64k-8p-4096M-20:10-X
> 45.11  +2.5%  46.23  MMAP-RANDWRITE-64K/xfs-fio_fat_mmap_randwrite_64k-64k-8p-4096M-20:10-X
> 211.31  +0.2%  211.74  TOTAL write_bw

Hmm, mmapped IO page allocations are not annotated yet, so I expect
this to be just runtime variation?

> And writes to a USB key:
>
> 3.1.0-rc9-ioless-full-nfs-wq5-next-20111014+  3.1.0-rc9-ioless-full-per-zone-dirty-next-20111014+
> ------------------------  ------------------------
> 5.94  +0.8%  5.99  UKEY-thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
> 2.64  -0.8%  2.62  UKEY-thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
> 5.10  +0.3%  5.12  UKEY-thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
> 3.26  -0.8%  3.24  UKEY-thresh=1G/ext3-2dd-4k-8p-4096M-1024M:10-X
> 5.63  -0.5%  5.60  UKEY-thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
> 6.04  -0.1%  6.04  UKEY-thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
> 5.90  -0.2%  5.88  UKEY-thresh=1G/ext4-2dd-4k-8p-4096M-1024M:10-X
> 2.45  +22.6%  3.00  UKEY-thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
> 6.18  -0.4%  6.16  UKEY-thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
> 4.81  +0.0%  4.81  UKEY-thresh=1G/xfs-2dd-4k-8p-4096M-1024M:10-X
> 47.94  +1.1%  48.45  TOTAL write_bw
>
> In summary, I see no problem at all in these trivial writeback tests.
>
> Tested-by: Wu Fengguang <fengguang.wu@intel.com>

Thanks!