Alexander Oltu
2010-Jun-09 14:44 UTC
[Lustre-discuss] Slow read performance after ~700MB-2GB
We are experiencing strange behavior on our Lustre setup. The problem appears only while reading files. It is a Cray XT4 machine, and the problem is reproducible only on login nodes; compute nodes are fine.

To reproduce it we run:

cd /lustrefs
dd if=/dev/zero of=test.file bs=1024k count=5000
# drop cache
echo 1 > /proc/sys/vm/drop_caches
dd if=test.file of=/dev/null bs=1024k &
# check dd status:
kill -USR1 %1

After 700MB-2GB of reading, the speed drops to 1-2 MB/s and iowait grows to 50% (one full CPU):

hexgrid:~ # vmstat 3
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd    free  buff   cache  si  so  bi  bo  in   cs us sy id wa st
 0  1      0 4016432     0 3349580   0   0   0   0 253  109  0  0 99  0  0
 0  1      0 4016788     0 3349240   0   0   0   0 515 2322  9  6 38 47  0
 0  1      0 4016836     0 3349512   0   0   0   0 505   90  0  0 50 50  0
 0  1      0 4016860     0 3349716   0   0   0   0 505   82  0  0 50 50  0
 0  1      0 4016860     0 3349784   0   0   0   0 505   85  0  0 50 50  0
 0  1      0 4016856     0 3349920   0   0   0   0 505   85  0  0 50 50  0
 0  1      0 4017048     0 3349852   0   0   0   0 505   86  0  0 50 50  0

hexgrid:~ # mpstat -P ALL 3
Linux 2.6.16.60-0.39_1.0102.4784.2.2.48B-ss (hexgrid.bccs.uib.no)   05/11/2010

03:35:12 PM  CPU  %user %nice  %sys %iowait  %irq %soft %steal  %idle  intr/s
03:35:15 PM  all   0.00  0.00  0.00   49.92  0.00  0.00   0.00  50.08  504.67
03:35:15 PM    0   0.00  0.00  0.00    0.00  0.00  0.00   0.00 100.00    0.00
03:35:15 PM    1   0.00  0.00  0.00   99.67  0.00  0.00   0.00   0.00  504.67

03:35:15 PM  CPU  %user %nice  %sys %iowait  %irq %soft %steal  %idle  intr/s
03:35:18 PM  all   0.00  0.00  0.00   50.00  0.00  0.00   0.00  50.00  503.99
03:35:18 PM    0   0.00  0.00  0.00    0.00  0.00  0.00   0.00 100.00    0.00
03:35:18 PM    1   0.00  0.00  0.00  100.00  0.00  0.00   0.00   0.00  503.99

I am not sure where to look, or whether it is a Lustre or a hardware problem. There are no messages in dmesg.

Thanks,
Alex.
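For reference, a few quick checks can help rule out striping or a single slow OST before digging into the client. This is only a sketch and assumes the lfs utility is available on the login node; adjust the path as needed:

# Show how test.file is striped across OSTs; a single slow or full OST
# can also cause read stalls, so it is worth ruling out.
lfs getstripe /lustrefs/test.file

# Check per-OST free space and availability.
lfs df -h /lustrefs

# Print dd's transfer statistics every 5 seconds while it runs in the
# background, instead of sending SIGUSR1 by hand.
while kill -USR1 %1 2>/dev/null; do sleep 5; done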
Alexander Oltu
2010-Jun-09 14:52 UTC
[Lustre-discuss] Slow read performance after ~700MB-2GB
Sorry, I left out the Lustre version information. It is:

lustre: 1.6.5
kernel: patchless
build:  1.6.5-2.2.48B-rsrel-REL_2_2_48B-19851-2009-12-23-16:31

On Wed, 9 Jun 2010 16:44:57 +0200 Alexander Oltu wrote:

> We are experiencing strange behavior on our Lustre setup. The problem
> appears only while reading files. It is a Cray XT4 machine and this
> problem is reproducible only on login nodes, while compute nodes are
> fine.
>
> To reproduce it we run:
> cd /lustrefs
> dd if=/dev/zero of=test.file bs=1024k count=5000
> # drop cache
> echo 1 > /proc/sys/vm/drop_caches
> dd if=test.file of=/dev/null bs=1024k &
> # check dd status:
> kill -USR1 %1
>
> After 700MB-2GB of reading speed drops to 1-2 MB/s, iowait will grow
> to 50% (one full CPU):
> hexgrid:~ # vmstat 3
> procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
>  r b swpd free buff cache si so bi bo in cs us sy id wa st
>  0 1 0 4016432 0 3349580 0 0 0 0 253 109 0 0 99 0 0
>  0 1 0 4016788 0 3349240 0 0 0 0 515 2322 9 6 38 47 0
>  0 1 0 4016836 0 3349512 0 0 0 0 505 90 0 0 50 50 0
>  0 1 0 4016860 0 3349716 0 0 0 0 505 82 0 0 50 50 0
>  0 1 0 4016860 0 3349784 0 0 0 0 505 85 0 0 50 50 0
>  0 1 0 4016856 0 3349920 0 0 0 0 505 85 0 0 50 50 0
>  0 1 0 4017048 0 3349852 0 0 0 0 505 86 0 0 50 50 0
>
> hexgrid:~ # mpstat -P ALL 3
> Linux 2.6.16.60-0.39_1.0102.4784.2.2.48B-ss (hexgrid.bccs.uib.no) 05/11/2010
>
> 03:35:12 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
> 03:35:15 PM all 0.00 0.00 0.00 49.92 0.00 0.00 0.00 50.08 504.67
> 03:35:15 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
> 03:35:15 PM 1 0.00 0.00 0.00 99.67 0.00 0.00 0.00 0.00 504.67
>
> 03:35:15 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
> 03:35:18 PM all 0.00 0.00 0.00 50.00 0.00 0.00 0.00 50.00 503.99
> 03:35:18 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
> 03:35:18 PM 1 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00 503.99
>
> I am not sure where to look into and if it is lustre or HW problem? No
> messages in dmesg.
>
> Thanks,
> Alex.
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

--
Alexander Oltu
System Engineer
Parallab, Uni BCCS
Thormøhlensgt. 55
N-5008 Bergen, Norway
phone: +47 55584144
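For completeness, the running client version can also be read directly on a node; a quick check, assuming a 1.6.x/1.8.x client where these interfaces exist:

# Version string of the loaded Lustre modules.
cat /proc/fs/lustre/version
# or, on clients where lctl get_param is available:
lctl get_param version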
Jason Rappleye
2010-Jun-09 17:29 UTC
[Lustre-discuss] Slow read performance after ~700MB-2GB
Is vm.zone_reclaim_mode set to a value other than zero?

This sounds a lot like a problem we recently experienced when a BIOS upgrade changed the ACPI SLIT table, which specifies the distances between NUMA nodes in a system. It put the remote node distance over the threshold the kernel uses to decide whether or not to enable the inline zone reclaim path. At least on SLES, Lustre doesn't seem to be able to free up pages in the page cache in this path, and performance dropped to 2-4MB/s. In my test case I was issuing 2MB I/Os and the kernel only let 1-2 I/Os trickle out per second, so that matches up with what you're seeing.

j

On Jun 9, 2010, at 7:44 AM, Alexander Oltu wrote:

> We are experiencing strange behavior on our Lustre setup. The problem
> appears only while reading files. It is a Cray XT4 machine and this
> problem is reproducible only on login nodes, while compute nodes are
> fine.
>
> To reproduce it we run:
> cd /lustrefs
> dd if=/dev/zero of=test.file bs=1024k count=5000
> # drop cache
> echo 1 > /proc/sys/vm/drop_caches
> dd if=test.file of=/dev/null bs=1024k &
> # check dd status:
> kill -USR1 %1
>
> After 700MB-2GB of reading speed drops to 1-2 MB/s, iowait will grow
> to 50% (one full CPU):
> hexgrid:~ # vmstat 3
> procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
>  r b swpd free buff cache si so bi bo in cs us sy id wa st
>  0 1 0 4016432 0 3349580 0 0 0 0 253 109 0 0 99 0 0
>  0 1 0 4016788 0 3349240 0 0 0 0 515 2322 9 6 38 47 0
>  0 1 0 4016836 0 3349512 0 0 0 0 505 90 0 0 50 50 0
>  0 1 0 4016860 0 3349716 0 0 0 0 505 82 0 0 50 50 0
>  0 1 0 4016860 0 3349784 0 0 0 0 505 85 0 0 50 50 0
>  0 1 0 4016856 0 3349920 0 0 0 0 505 85 0 0 50 50 0
>  0 1 0 4017048 0 3349852 0 0 0 0 505 86 0 0 50 50 0
>
> hexgrid:~ # mpstat -P ALL 3
> Linux 2.6.16.60-0.39_1.0102.4784.2.2.48B-ss (hexgrid.bccs.uib.no) 05/11/2010
>
> 03:35:12 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
> 03:35:15 PM all 0.00 0.00 0.00 49.92 0.00 0.00 0.00 50.08 504.67
> 03:35:15 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
> 03:35:15 PM 1 0.00 0.00 0.00 99.67 0.00 0.00 0.00 0.00 504.67
>
> 03:35:15 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
> 03:35:18 PM all 0.00 0.00 0.00 50.00 0.00 0.00 0.00 50.00 503.99
> 03:35:18 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
> 03:35:18 PM 1 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00 503.99
>
> I am not sure where to look into and if it is lustre or HW problem? No
> messages in dmesg.
>
> Thanks,
> Alex.
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
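Checking this on the login node is a one-liner; a minimal sketch, assuming a standard 2.6 kernel with the sysctl in its usual place:

# 0 means inline zone reclaim is disabled; a non-zero value means the
# kernel tries to reclaim pages on the local NUMA node before falling
# back to allocating from a remote node.
cat /proc/sys/vm/zone_reclaim_mode
# equivalently:
sysctl vm.zone_reclaim_mode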
Alexander Oltu
2010-Jun-09 19:41 UTC
[Lustre-discuss] Slow read performance after ~700MB-2GB
On Wed, 9 Jun 2010 10:29:36 -0700 Jason Rappleye wrote:

> Is vm.zone_reclaim_mode set to a value other than zero?

Yes, it was 1, and as soon as I set it to 0 the problem disappears. Thank you Jason! This had been a long-standing issue for us.

> This sounds a lot like a problem we recently experienced when a BIOS
> upgrade changed the ACPI SLIT table, which specifies the
> distances between NUMA nodes in a system. It put the remote node
> distance over the threshold the kernel uses to decide whether or not
> to enable the inline zone reclaim path. At least on SLES, Lustre
> doesn't seem to be able to free up pages in the page cache in this
> path, and performance dropped to 2-4MB/s. In my test case I was
> issuing 2MB I/Os and the kernel only let 1-2 I/Os trickle out per
> second, so that matches up with what you're seeing.
>
> j
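For anyone else who hits this, the workaround can be applied immediately and made persistent across reboots; a sketch, assuming a standard sysctl setup on the login nodes:

# Apply immediately.
sysctl -w vm.zone_reclaim_mode=0
# or: echo 0 > /proc/sys/vm/zone_reclaim_mode

# Make it persistent.
echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf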
Andreas Dilger
2010-Jun-09 19:45 UTC
[Lustre-discuss] Slow read performance after ~700MB-2GB
On 2010-06-09, at 13:41, Alexander Oltu wrote:

> On Wed, 9 Jun 2010 10:29:36 -0700 Jason Rappleye wrote:
>> Is vm.zone_reclaim_mode set to a value other than zero?
>
> Yes, it was 1, as soon as I set it to 0 the problem disappears.

Interesting, I have never heard of this problem before. Is this a client-side parameter, or on the server?

>> This sounds a lot like a problem we recently experienced when a BIOS
>> upgrade changed the ACPI SLIT table, which specifies the
>> distances between NUMA nodes in a system. It put the remote node
>> distance over the threshold the kernel uses to decide whether or not
>> to enable the inline zone reclaim path. At least on SLES, Lustre
>> doesn't seem to be able to free up pages in the page cache in this
>> path, and performance dropped to 2-4MB/s. In my test case I was
>> issuing 2MB I/Os and the kernel only let 1-2 I/Os trickle out per
>> second, so that matches up with what you're seeing.

Jason, do you have enough of an understanding of this codepath to know why Lustre is not freeing pages in this case? Is it because Lustre just doesn't have a VM callback that frees pages at all, or is it somehow ignoring the requests from the kernel to free up the pages?

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
Jason Rappleye
2010-Jun-09 20:51 UTC
[Lustre-discuss] Slow read performance after ~700MB-2GB
On Jun 9, 2010, at 12:45 PM, Andreas Dilger wrote:

> On 2010-06-09, at 13:41, Alexander Oltu wrote:
>> On Wed, 9 Jun 2010 10:29:36 -0700 Jason Rappleye wrote:
>>> Is vm.zone_reclaim_mode set to a value other than zero?
>>
>> Yes, it was 1, as soon as I set it to 0 the problem disappears.
>
> Interesting, I have never heard of this problem before. Is this a
> client-side parameter, or on the server?

Client. In 1.6.x Lustre doesn't use the page cache on the server (right?), so I don't think this will cause a problem. At least, not this particular problem.

When we first saw this problem a few weeks ago it appeared that client processes were stuck in uninterruptible sleep in blk_congestion_wait, but upon further examination we saw they were still issuing 1-2 I/Os per second. The kernel stack trace looked like this:

<ffffffff8013d6ec>{internal_add_timer+21}
<ffffffff8030fbc4>{schedule_timeout+138}
<ffffffff8013def0>{process_timeout+0}
<ffffffff8030f3ec>{io_schedule_timeout+88}
<ffffffff801f1d74>{blk_congestion_wait+102}
<ffffffff80148d46>{autoremove_wake_function+0}
<ffffffff8016851d>{throttle_vm_writeout+33}
<ffffffff8016aa0e>{remove_mapping+133}
<ffffffff8016b8e8>{shrink_zone+3367}
<ffffffff80218799>{find_next_bit+96}
<ffffffff8016c435>{zone_reclaim+430}
<ffffffff8843c3ba>{:ptlrpc:ldlm_lock_decref+154}
<ffffffff8852df5a>{:osc:cache_add_extent+1178}
<ffffffff8860f838>{:lustre:ll_removepage+488}
<ffffffff8852152a>{:osc:osc_prep_async_page+426}
<ffffffff8860c953>{:lustre:llap_shrink_cache+1715}
<ffffffff88524224>{:osc:osc_queue_group_io+644}
<ffffffff801671a2>{get_page_from_freelist+222}
<ffffffff8016756d>{__alloc_pages+113}
<ffffffff80162416>{add_to_page_cache+57}
<ffffffff80162c49>{grab_cache_page_nowait+53}
<ffffffff8860e368>{:lustre:ll_readahead+2584}
<ffffffff8851db55>{:osc:osc_check_rpcs+773}
<ffffffff8012c52c>{__wake_up+56}
<ffffffff88515db1>{:osc:loi_list_maint+225}
<ffffffff88330288>{:libcfs:cfs_alloc+40}
<ffffffff88615557>{:lustre:ll_readpage+4775}
<ffffffff885b3109>{:lov:lov_fini_enqueue_set+585}
<ffffffff88438cc7>{:ptlrpc:ldlm_lock_add_to_lru+119}
<ffffffff8843719e>{:ptlrpc:lock_res_and_lock+190}
<ffffffff883d792f>{:obdclass:class_handle_unhash_nolock+207}
<ffffffff8843bb1c>{:ptlrpc:ldlm_lock_decref_internal+1356}
<ffffffff885b235f>{:lov:lov_finish_set+1695}
<ffffffff801629bd>{do_generic_mapping_read+525}
<ffffffff8016476e>{file_read_actor+0}
<ffffffff8016328b>{__generic_file_aio_read+324}
<ffffffff80164576>{generic_file_readv+143}
<ffffffff885b07c9>{:lov:lov_merge_lvb+281}
<ffffffff80148d46>{autoremove_wake_function+0}
<ffffffff8019f156>{__touch_atime+118}
<ffffffff885ef821>{:lustre:ll_file_readv+6385}
<ffffffff80216f4f>{__up_read+16}
<ffffffff885efada>{:lustre:ll_file_read+26}
<ffffffff801878f0>{vfs_read+212}
<ffffffff80187cd0>{sys_read+69}
<ffffffff8010ae5e>{system_call+126}

zone_reclaim_mode is set in mm/page_alloc.c:build_zonelists by examining the distance between nodes. The remote node distance of 20 in the old BIOS was fine; it is set to 21 in the new BIOS, and that put it over the edge. We're working with the vendor to understand why that change was made. In any case, setting zone_reclaim_mode back to zero works around the issue.

Just to be clear, the remote node distance in and of itself doesn't have anything to do with the problem - setting zone_reclaim_mode to one on a host with the original distances is sufficient to reproduce the problem.
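The distances the kernel derived from the SLIT table can be inspected at runtime, which makes it easy to compare an old and a new BIOS. This is only a sketch; it assumes NUMA sysfs support is enabled, and numactl is optional:

# Per-node distance vectors as the kernel sees them (from the ACPI SLIT table).
for n in /sys/devices/system/node/node*; do
    echo "$n: $(cat $n/distance)"
done

# The same information, if numactl is installed.
numactl --hardware

# What the kernel chose at boot based on those distances.
cat /proc/sys/vm/zone_reclaim_mode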
When the problem occurs, echoing 3 into drop_caches provides some temporary relief, but the kernel will eventually re-enter zone_reclaim.

>>> This sounds a lot like a problem we recently experienced when a BIOS
>>> upgrade changed the ACPI SLIT table, which specifies the
>>> distances between NUMA nodes in a system. It put the remote node
>>> distance over the threshold the kernel uses to decide whether or not
>>> to enable the inline zone reclaim path. At least on SLES, Lustre
>>> doesn't seem to be able to free up pages in the page cache in this
>>> path, and performance dropped to 2-4MB/s. In my test case I was
>>> issuing 2MB I/Os and the kernel only let 1-2 I/Os trickle out per
>>> second, so that matches up with what you're seeing.
>
> Jason, do you have enough of an understanding of this codepath to
> know why Lustre is not freeing pages in this case? Is it because
> Lustre just doesn't have a VM callback that frees pages at all, or
> is it somehow ignoring the requests from the kernel to free up the
> pages?

Kind of. The stack trace above definitely shows that the process enters the inline zone reclaim path. zone_reclaim itself calls shrink_slabs, which asks each cache that has been registered via set_shrinker to free up memory. At least, that's how it works in the SLES 10 kernel.

Looking at our 1.6.7 client sources (no, we don't run 1.6.7 on our servers!), llite registers the cache using ll_register_cache, which calls the kernel register_cache function. It ends up being a no-op if register_cache isn't found by autoconf. My understanding is that the interface changed at some point, and some versions of Lustre don't support the set_shrinker interface.

It *looks* like 1.8.2 does the right thing, at least in terms of registering the cache. Apparently register_shrinker is the function to call in some kernels. autoconf checks for its existence, and if it's there it wraps the call with its own version of set_shrinker (in lustre/include/linux/lustre_compat25.h). Otherwise, it calls the kernel's set_shrinker.

I haven't tried to reproduce this problem with a 1.8.2 client yet, but I should be able to sneak in a test this week and see if it works as expected.

j

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
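A rough way to see which of the two interfaces a given kernel actually exports (and therefore which branch the autoconf check would take), without reading the kernel source, is to look at the symbol table; this is only a sketch:

# set_shrinker was the older interface; register_shrinker replaced it in
# later 2.6 kernels. One or the other should show up here.
grep -wE 'set_shrinker|register_shrinker' /proc/kallsyms

# Temporary relief when a node gets stuck in zone_reclaim, as noted above:
echo 3 > /proc/sys/vm/drop_caches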
Jason Rappleye
2010-Jun-10 00:36 UTC
[Lustre-discuss] Slow read performance after ~700MB-2GB
On Jun 9, 2010, at 1:51 PM, Jason Rappleye wrote:

> I haven't tried to reproduce this problem with a 1.8.2 client yet, but
> I should be able to sneak in a test this week and see if it works as
> expected.

Same behavior with what is essentially a 1.8.2.0-24chaos client (the changes to it are inconsequential) and a SLES 10 SP2 kernel, 2.6.16.60-0.42.5, plus some local patches. Those patches *should* be inconsequential as well - mostly changes to the OOM handler, IIRC (which isn't an issue here).

Our vendor already has a case open on this - I'll go ahead and have them file a BZ.

Thanks,

j

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
Cory Spitz
2010-Jun-10
[Lustre-discuss] Slow read performance after ~700MB-2GB
Hi,

Slightly off-topic, but did anyone else notice that readahead is triggering the shrinking and page writeout? ll_read_ahead_page() clears __GFP_WAIT, but it seems sane to me that it should also drop __GFP_IO. In my opinion, Lustre shouldn't speculatively force other pages out. Only when there is an actual, demonstrated need should it force out the other pages.

Thanks,
-Cory

Jason Rappleye wrote:
[...]
> When we first saw this problem a few weeks ago it appeared that client
> processes were stuck in uninterruptible sleep in blk_congestion_wait,
> but upon further examination we saw they were still issuing 1-2 I/Os
> per second. The kernel stack trace looked like this:
>
> <ffffffff8013d6ec>{internal_add_timer+21}
> <ffffffff8030fbc4>{schedule_timeout+138}
> <ffffffff8013def0>{process_timeout+0}
> <ffffffff8030f3ec>{io_schedule_timeout+88}
> <ffffffff801f1d74>{blk_congestion_wait+102}
> <ffffffff80148d46>{autoremove_wake_function+0}
> <ffffffff8016851d>{throttle_vm_writeout+33}
> <ffffffff8016aa0e>{remove_mapping+133}
> <ffffffff8016b8e8>{shrink_zone+3367}
> <ffffffff80218799>{find_next_bit+96}
> <ffffffff8016c435>{zone_reclaim+430}
> <ffffffff8843c3ba>{:ptlrpc:ldlm_lock_decref+154}
> <ffffffff8852df5a>{:osc:cache_add_extent+1178}
> <ffffffff8860f838>{:lustre:ll_removepage+488}
> <ffffffff8852152a>{:osc:osc_prep_async_page+426}
> <ffffffff8860c953>{:lustre:llap_shrink_cache+1715}
> <ffffffff88524224>{:osc:osc_queue_group_io+644}
> <ffffffff801671a2>{get_page_from_freelist+222}
> <ffffffff8016756d>{__alloc_pages+113}
> <ffffffff80162416>{add_to_page_cache+57}
> <ffffffff80162c49>{grab_cache_page_nowait+53}
> <ffffffff8860e368>{:lustre:ll_readahead+2584}
> <ffffffff8851db55>{:osc:osc_check_rpcs+773}
> <ffffffff8012c52c>{__wake_up+56}
> <ffffffff88515db1>{:osc:loi_list_maint+225}
> <ffffffff88330288>{:libcfs:cfs_alloc+40}
> <ffffffff88615557>{:lustre:ll_readpage+4775}
> <ffffffff885b3109>{:lov:lov_fini_enqueue_set+585}
> <ffffffff88438cc7>{:ptlrpc:ldlm_lock_add_to_lru+119}
> <ffffffff8843719e>{:ptlrpc:lock_res_and_lock+190}
> <ffffffff883d792f>{:obdclass:class_handle_unhash_nolock+207}
> <ffffffff8843bb1c>{:ptlrpc:ldlm_lock_decref_internal+1356}
> <ffffffff885b235f>{:lov:lov_finish_set+1695}
> <ffffffff801629bd>{do_generic_mapping_read+525}
> <ffffffff8016476e>{file_read_actor+0}
> <ffffffff8016328b>{__generic_file_aio_read+324}
> <ffffffff80164576>{generic_file_readv+143}
> <ffffffff885b07c9>{:lov:lov_merge_lvb+281}
> <ffffffff80148d46>{autoremove_wake_function+0}
> <ffffffff8019f156>{__touch_atime+118}
> <ffffffff885ef821>{:lustre:ll_file_readv+6385}
> <ffffffff80216f4f>{__up_read+16}
> <ffffffff885efada>{:lustre:ll_file_read+26}
> <ffffffff801878f0>{vfs_read+212}
> <ffffffff80187cd0>{sys_read+69}
> <ffffffff8010ae5e>{system_call+126}
>
[...]
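One way to test whether readahead is the trigger (not a fix, just a diagnostic) is to throttle client readahead while re-running the dd read test. This is only a sketch; it assumes the standard llite readahead tunables are exposed on the client, and on 1.6.x the same files also live under /proc/fs/lustre/llite/<fsname>-<uuid>/:

# Current client readahead settings.
lctl get_param llite.*.max_read_ahead_mb llite.*.max_read_ahead_whole_mb

# Disable readahead temporarily, then repeat the dd read test. If the
# stall goes away, readahead is what is pushing the client into the
# inline zone reclaim path.
lctl set_param llite.*.max_read_ahead_mb=0

# Restore the previous value afterwards (check the get_param output
# above for what it was; 40 is a common default).
lctl set_param llite.*.max_read_ahead_mb=40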
Andreas Dilger
2010-Jun-10 22:46 UTC
[Lustre-discuss] Slow read performance after ~700MB-2GB
On 2010-06-10, at 08:48, Cory Spitz wrote:

> Slightly off-topic, but did anyone else notice that readahead is
> triggering the shrinking and page writeout? ll_read_ahead_page()
> clears __GFP_WAIT but it seems sane to me that it should also drop
> __GFP_IO. In my opinion, Lustre shouldn't speculatively force other
> pages out. Only when there is an actual, demonstrated need should it
> force out the other pages.

We used to have a kernel patch (and more recently I implemented this using generic kernel EXPORT_FUNCTION() operations) to implement grab_cache_page_nowait_gfp(), to allow specifying the GFP mask when allocating pages for readahead. Without that, the kernel uses the GFP mask from the address space, which we have no control over.

That said, disabling memory pressure from readahead has a negative side effect also. When the client memory is full (i.e. all the time) there is NO readahead generated, because the readahead grab_cache_page_nowait_gfp() calls always fail, and this degrades performance significantly, since the reads are now synchronous and a single stream instead of pipelined.

While it is true that some speculative readahead may result in evicting other useful pages from cache, it is more likely to be prefetching useful pages that the current process wants to use immediately and evicting old/useless pages.

The readahead algorithms definitely need some improvement, and it is possible that it is over-zealous in this case, but it isn't possible to say for certain here.

I'd say the core problem is that no reclaim is being triggered and/or the reclaim is deadlocked on the cache cleaning, and that is the first issue to focus on here.

> Jason Rappleye wrote:
> [...]
>> When we first saw this problem a few weeks ago it appeared that client
>> processes were stuck in uninterruptible sleep in blk_congestion_wait,
>> but upon further examination we saw they were still issuing 1-2 I/Os
>> per second. The kernel stack trace looked like this:
>>
>> <ffffffff8013d6ec>{internal_add_timer+21}
>> <ffffffff8030fbc4>{schedule_timeout+138}
>> <ffffffff8013def0>{process_timeout+0}
>> <ffffffff8030f3ec>{io_schedule_timeout+88}
>> <ffffffff801f1d74>{blk_congestion_wait+102}
>> <ffffffff80148d46>{autoremove_wake_function+0}
>> <ffffffff8016851d>{throttle_vm_writeout+33}
>> <ffffffff8016aa0e>{remove_mapping+133}
>> <ffffffff8016b8e8>{shrink_zone+3367}
>> <ffffffff80218799>{find_next_bit+96}
>> <ffffffff8016c435>{zone_reclaim+430}
>> <ffffffff8843c3ba>{:ptlrpc:ldlm_lock_decref+154}
>> <ffffffff8852df5a>{:osc:cache_add_extent+1178}
>> <ffffffff8860f838>{:lustre:ll_removepage+488}
>> <ffffffff8852152a>{:osc:osc_prep_async_page+426}
>> <ffffffff8860c953>{:lustre:llap_shrink_cache+1715}
>> <ffffffff88524224>{:osc:osc_queue_group_io+644}
>> <ffffffff801671a2>{get_page_from_freelist+222}
>> <ffffffff8016756d>{__alloc_pages+113}
>> <ffffffff80162416>{add_to_page_cache+57}
>> <ffffffff80162c49>{grab_cache_page_nowait+53}
>> <ffffffff8860e368>{:lustre:ll_readahead+2584}
>> <ffffffff8851db55>{:osc:osc_check_rpcs+773}
>> <ffffffff8012c52c>{__wake_up+56}
>> <ffffffff88515db1>{:osc:loi_list_maint+225}
>> <ffffffff88330288>{:libcfs:cfs_alloc+40}
>> <ffffffff88615557>{:lustre:ll_readpage+4775}
>> <ffffffff885b3109>{:lov:lov_fini_enqueue_set+585}
>> <ffffffff88438cc7>{:ptlrpc:ldlm_lock_add_to_lru+119}
>> <ffffffff8843719e>{:ptlrpc:lock_res_and_lock+190}
>> <ffffffff883d792f>{:obdclass:class_handle_unhash_nolock+207}
>> <ffffffff8843bb1c>{:ptlrpc:ldlm_lock_decref_internal+1356}
>> <ffffffff885b235f>{:lov:lov_finish_set+1695}
>> <ffffffff801629bd>{do_generic_mapping_read+525}
>> <ffffffff8016476e>{file_read_actor+0}
>> <ffffffff8016328b>{__generic_file_aio_read+324}
>> <ffffffff80164576>{generic_file_readv+143}
>> <ffffffff885b07c9>{:lov:lov_merge_lvb+281}
>> <ffffffff80148d46>{autoremove_wake_function+0}
>> <ffffffff8019f156>{__touch_atime+118}
>> <ffffffff885ef821>{:lustre:ll_file_readv+6385}
>> <ffffffff80216f4f>{__up_read+16}
>> <ffffffff885efada>{:lustre:ll_file_read+26}
>> <ffffffff801878f0>{vfs_read+212}
>> <ffffffff80187cd0>{sys_read+69}
>> <ffffffff8010ae5e>{system_call+126}
>>
> [...]

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
Cory Spitz
[Lustre-discuss] Slow read performance after ~700MB-2GB
Hi, Andreas.

Yes, of course the negative side effect makes perfect sense. In fact, since users prefer to use all the memory that they can, and the Linux buffer cache does the same, the full-memory case is probably the more important one.

Thanks,
-Cory

On 6/10/2010 5:46 PM, Andreas Dilger wrote:

> On 2010-06-10, at 08:48, Cory Spitz wrote:
>> Slightly off-topic, but did anyone else notice that readahead is
>> triggering the shrinking and page writeout? ll_read_ahead_page()
>> clears __GFP_WAIT but it seems sane to me that it should also drop
>> __GFP_IO. In my opinion, Lustre shouldn't speculatively force other
>> pages out. Only when there is an actual, demonstrated need should it
>> force out the other pages.
>
> We used to have a kernel patch (and more recently I implemented this
> using generic kernel EXPORT_FUNCTION() operations) to implement
> grab_cache_page_nowait_gfp(), to allow specifying the GFP mask when
> allocating pages for readahead. Without that, the kernel uses the GFP
> mask from the address space, which we have no control over.
>
> That said, disabling memory pressure from readahead has a negative
> side effect also. When the client memory is full (i.e. all the time)
> there is NO readahead generated, because the readahead
> grab_cache_page_nowait_gfp() calls always fail, and this degrades
> performance significantly, since the reads are now synchronous and a
> single stream instead of pipelined.
>
> While it is true that some speculative readahead may result in
> evicting other useful pages from cache, it is more likely to be
> prefetching useful pages that the current process wants to use
> immediately and evicting old/useless pages.
>
> The readahead algorithms definitely need some improvement, and it is
> possible that it is over-zealous in this case, but it isn't possible
> to say for certain here.
>
> I'd say the core problem is that no reclaim is being triggered and/or
> the reclaim is deadlocked on the cache cleaning, and that is the
> first issue to focus on here.
>
>> Jason Rappleye wrote:
>> [...]
>>> When we first saw this problem a few weeks ago it appeared that client
>>> processes were stuck in uninterruptible sleep in blk_congestion_wait,
>>> but upon further examination we saw they were still issuing 1-2 I/Os
>>> per second. The kernel stack trace looked like this:
>>>
>>> <ffffffff8013d6ec>{internal_add_timer+21}
>>> <ffffffff8030fbc4>{schedule_timeout+138}
>>> <ffffffff8013def0>{process_timeout+0}
>>> <ffffffff8030f3ec>{io_schedule_timeout+88}
>>> <ffffffff801f1d74>{blk_congestion_wait+102}
>>> <ffffffff80148d46>{autoremove_wake_function+0}
>>> <ffffffff8016851d>{throttle_vm_writeout+33}
>>> <ffffffff8016aa0e>{remove_mapping+133}
>>> <ffffffff8016b8e8>{shrink_zone+3367}
>>> <ffffffff80218799>{find_next_bit+96}
>>> <ffffffff8016c435>{zone_reclaim+430}
>>> <ffffffff8843c3ba>{:ptlrpc:ldlm_lock_decref+154}
>>> <ffffffff8852df5a>{:osc:cache_add_extent+1178}
>>> <ffffffff8860f838>{:lustre:ll_removepage+488}
>>> <ffffffff8852152a>{:osc:osc_prep_async_page+426}
>>> <ffffffff8860c953>{:lustre:llap_shrink_cache+1715}
>>> <ffffffff88524224>{:osc:osc_queue_group_io+644}
>>> <ffffffff801671a2>{get_page_from_freelist+222}
>>> <ffffffff8016756d>{__alloc_pages+113}
>>> <ffffffff80162416>{add_to_page_cache+57}
>>> <ffffffff80162c49>{grab_cache_page_nowait+53}
>>> <ffffffff8860e368>{:lustre:ll_readahead+2584}
>>> <ffffffff8851db55>{:osc:osc_check_rpcs+773}
>>> <ffffffff8012c52c>{__wake_up+56}
>>> <ffffffff88515db1>{:osc:loi_list_maint+225}
>>> <ffffffff88330288>{:libcfs:cfs_alloc+40}
>>> <ffffffff88615557>{:lustre:ll_readpage+4775}
>>> <ffffffff885b3109>{:lov:lov_fini_enqueue_set+585}
>>> <ffffffff88438cc7>{:ptlrpc:ldlm_lock_add_to_lru+119}
>>> <ffffffff8843719e>{:ptlrpc:lock_res_and_lock+190}
>>> <ffffffff883d792f>{:obdclass:class_handle_unhash_nolock+207}
>>> <ffffffff8843bb1c>{:ptlrpc:ldlm_lock_decref_internal+1356}
>>> <ffffffff885b235f>{:lov:lov_finish_set+1695}
>>> <ffffffff801629bd>{do_generic_mapping_read+525}
>>> <ffffffff8016476e>{file_read_actor+0}
>>> <ffffffff8016328b>{__generic_file_aio_read+324}
>>> <ffffffff80164576>{generic_file_readv+143}
>>> <ffffffff885b07c9>{:lov:lov_merge_lvb+281}
>>> <ffffffff80148d46>{autoremove_wake_function+0}
>>> <ffffffff8019f156>{__touch_atime+118}
>>> <ffffffff885ef821>{:lustre:ll_file_readv+6385}
>>> <ffffffff80216f4f>{__up_read+16}
>>> <ffffffff885efada>{:lustre:ll_file_read+26}
>>> <ffffffff801878f0>{vfs_read+212}
>>> <ffffffff80187cd0>{sys_read+69}
>>> <ffffffff8010ae5e>{system_call+126}
>>>
>> [...]
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.
>