Alexander Oltu
2010-Jun-09 14:44 UTC
[Lustre-discuss] Slow read performance after ~700MB-2GB
We are experiencing strange behavior on our Lustre setup. The problem appears only while reading files. It is a Cray XT4 machine, and the problem is reproducible only on login nodes; compute nodes are fine.

To reproduce it we run:

cd /lustrefs
dd if=/dev/zero of=test.file bs=1024k count=5000
# drop cache
echo 1 > /proc/sys/vm/drop_caches
dd if=test.file of=/dev/null bs=1024k &
# check dd status:
kill -USR1 %1

After 700MB-2GB of reading, the speed drops to 1-2 MB/s and iowait grows to 50% (one full CPU):

hexgrid:~ # vmstat 3
procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd    free  buff   cache  si  so  bi  bo  in   cs us sy id wa st
 0  1      0 4016432     0 3349580   0   0   0   0 253  109  0  0 99  0  0
 0  1      0 4016788     0 3349240   0   0   0   0 515 2322  9  6 38 47  0
 0  1      0 4016836     0 3349512   0   0   0   0 505   90  0  0 50 50  0
 0  1      0 4016860     0 3349716   0   0   0   0 505   82  0  0 50 50  0
 0  1      0 4016860     0 3349784   0   0   0   0 505   85  0  0 50 50  0
 0  1      0 4016856     0 3349920   0   0   0   0 505   85  0  0 50 50  0
 0  1      0 4017048     0 3349852   0   0   0   0 505   86  0  0 50 50  0

hexgrid:~ # mpstat -P ALL 3
Linux 2.6.16.60-0.39_1.0102.4784.2.2.48B-ss (hexgrid.bccs.uib.no)   05/11/2010

03:35:12 PM  CPU  %user %nice  %sys %iowait  %irq %soft %steal  %idle  intr/s
03:35:15 PM  all   0.00  0.00  0.00   49.92  0.00  0.00   0.00  50.08  504.67
03:35:15 PM    0   0.00  0.00  0.00    0.00  0.00  0.00   0.00 100.00    0.00
03:35:15 PM    1   0.00  0.00  0.00   99.67  0.00  0.00   0.00   0.00  504.67

03:35:15 PM  CPU  %user %nice  %sys %iowait  %irq %soft %steal  %idle  intr/s
03:35:18 PM  all   0.00  0.00  0.00   50.00  0.00  0.00   0.00  50.00  503.99
03:35:18 PM    0   0.00  0.00  0.00    0.00  0.00  0.00   0.00 100.00    0.00
03:35:18 PM    1   0.00  0.00  0.00  100.00  0.00  0.00   0.00   0.00  503.99

I am not sure where to look, or whether it is a Lustre or a hardware problem. There are no messages in dmesg.

Thanks,
Alex.
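For reference, a few quick checks can help rule out striping or a single slow OST before digging into the client. This is only a sketch and assumes the lfs utility is available on the login node; adjust the path as needed:

# Show how test.file is striped across OSTs; a single slow or full OST
# can also cause read stalls, so it is worth ruling out.
lfs getstripe /lustrefs/test.file

# Check per-OST free space and availability.
lfs df -h /lustrefs

# Print dd's transfer statistics every 5 seconds while it runs in the
# background, instead of sending SIGUSR1 by hand.
while kill -USR1 %1 2>/dev/null; do sleep 5; done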
Alexander Oltu
2010-Jun-09 14:52 UTC
[Lustre-discuss] Slow read performance after ~700MB-2GB
Sorry, I left out the Lustre version information. It is:

lustre: 1.6.5
kernel: patchless
build:  1.6.5-2.2.48B-rsrel-REL_2_2_48B-19851-2009-12-23-16:31

On Wed, 9 Jun 2010 16:44:57 +0200 Alexander Oltu wrote:

> We are experiencing strange behavior on our Lustre setup. The problem
> appears only while reading files. It is a Cray XT4 machine and this
> problem is reproducible only on login nodes, while compute nodes are
> fine.
>
> To reproduce it we run:
> cd /lustrefs
> dd if=/dev/zero of=test.file bs=1024k count=5000
> # drop cache
> echo 1 > /proc/sys/vm/drop_caches
> dd if=test.file of=/dev/null bs=1024k &
> # check dd status:
> kill -USR1 %1
>
> After 700MB-2GB of reading speed drops to 1-2 MB/s, iowait will grow
> to 50% (one full CPU):
> hexgrid:~ # vmstat 3
> procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
>  r b swpd free buff cache si so bi bo in cs us sy id wa st
>  0 1 0 4016432 0 3349580 0 0 0 0 253 109 0 0 99 0 0
>  0 1 0 4016788 0 3349240 0 0 0 0 515 2322 9 6 38 47 0
>  0 1 0 4016836 0 3349512 0 0 0 0 505 90 0 0 50 50 0
>  0 1 0 4016860 0 3349716 0 0 0 0 505 82 0 0 50 50 0
>  0 1 0 4016860 0 3349784 0 0 0 0 505 85 0 0 50 50 0
>  0 1 0 4016856 0 3349920 0 0 0 0 505 85 0 0 50 50 0
>  0 1 0 4017048 0 3349852 0 0 0 0 505 86 0 0 50 50 0
>
> hexgrid:~ # mpstat -P ALL 3
> Linux 2.6.16.60-0.39_1.0102.4784.2.2.48B-ss (hexgrid.bccs.uib.no) 05/11/2010
>
> 03:35:12 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
> 03:35:15 PM all 0.00 0.00 0.00 49.92 0.00 0.00 0.00 50.08 504.67
> 03:35:15 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
> 03:35:15 PM 1 0.00 0.00 0.00 99.67 0.00 0.00 0.00 0.00 504.67
>
> 03:35:15 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
> 03:35:18 PM all 0.00 0.00 0.00 50.00 0.00 0.00 0.00 50.00 503.99
> 03:35:18 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
> 03:35:18 PM 1 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00 503.99
>
> I am not sure where to look into and if it is lustre or HW problem? No
> messages in dmesg.
>
> Thanks,
> Alex.
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

--
Alexander Oltu
System Engineer
Parallab, Uni BCCS
Thormøhlensgt. 55
N-5008 Bergen, Norway
phone: +47 55584144
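For completeness, the running client version can also be read directly on a node; a quick check, assuming a 1.6.x/1.8.x client where these interfaces exist:

# Version string of the loaded Lustre modules.
cat /proc/fs/lustre/version
# or, on clients where lctl get_param is available:
lctl get_param version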
Jason Rappleye
2010-Jun-09 17:29 UTC
[Lustre-discuss] Slow read performance after ~700MB-2GB
Is vm.zone_reclaim_mode set to a value other than zero?

This sounds a lot like a problem we recently experienced when a BIOS upgrade changed the ACPI SLIT table, which specifies the distances between NUMA nodes in a system. It put the remote node distance over the threshold the kernel uses to decide whether or not to enable the inline zone reclaim path. At least on SLES, Lustre doesn't seem to be able to free up pages in the page cache in this path, and performance dropped to 2-4MB/s. In my test case I was issuing 2MB I/Os and the kernel only let 1-2 I/Os trickle out per second, so that matches up with what you're seeing.

j

On Jun 9, 2010, at 7:44 AM, Alexander Oltu wrote:

> We are experiencing strange behavior on our Lustre setup. The problem
> appears only while reading files. It is a Cray XT4 machine and this
> problem is reproducible only on login nodes, while compute nodes are
> fine.
>
> To reproduce it we run:
> cd /lustrefs
> dd if=/dev/zero of=test.file bs=1024k count=5000
> # drop cache
> echo 1 > /proc/sys/vm/drop_caches
> dd if=test.file of=/dev/null bs=1024k &
> # check dd status:
> kill -USR1 %1
>
> After 700MB-2GB of reading speed drops to 1-2 MB/s, iowait will grow
> to 50% (one full CPU):
> hexgrid:~ # vmstat 3
> procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
>  r b swpd free buff cache si so bi bo in cs us sy id wa st
>  0 1 0 4016432 0 3349580 0 0 0 0 253 109 0 0 99 0 0
>  0 1 0 4016788 0 3349240 0 0 0 0 515 2322 9 6 38 47 0
>  0 1 0 4016836 0 3349512 0 0 0 0 505 90 0 0 50 50 0
>  0 1 0 4016860 0 3349716 0 0 0 0 505 82 0 0 50 50 0
>  0 1 0 4016860 0 3349784 0 0 0 0 505 85 0 0 50 50 0
>  0 1 0 4016856 0 3349920 0 0 0 0 505 85 0 0 50 50 0
>  0 1 0 4017048 0 3349852 0 0 0 0 505 86 0 0 50 50 0
>
> hexgrid:~ # mpstat -P ALL 3
> Linux 2.6.16.60-0.39_1.0102.4784.2.2.48B-ss (hexgrid.bccs.uib.no) 05/11/2010
>
> 03:35:12 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
> 03:35:15 PM all 0.00 0.00 0.00 49.92 0.00 0.00 0.00 50.08 504.67
> 03:35:15 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
> 03:35:15 PM 1 0.00 0.00 0.00 99.67 0.00 0.00 0.00 0.00 504.67
>
> 03:35:15 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
> 03:35:18 PM all 0.00 0.00 0.00 50.00 0.00 0.00 0.00 50.00 503.99
> 03:35:18 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0.00
> 03:35:18 PM 1 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00 503.99
>
> I am not sure where to look into and if it is lustre or HW problem? No
> messages in dmesg.
>
> Thanks,
> Alex.
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
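Checking this on the login node is a one-liner; a minimal sketch, assuming a standard 2.6 kernel with the sysctl in its usual place:

# 0 means inline zone reclaim is disabled; a non-zero value means the
# kernel tries to reclaim pages on the local NUMA node before falling
# back to allocating from a remote node.
cat /proc/sys/vm/zone_reclaim_mode
# equivalently:
sysctl vm.zone_reclaim_mode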
Alexander Oltu
2010-Jun-09 19:41 UTC
[Lustre-discuss] Slow read performance after ~700MB-2GB
On Wed, 9 Jun 2010 10:29:36 -0700 Jason Rappleye wrote:

> Is vm.zone_reclaim_mode set to a value other than zero?

Yes, it was 1, and as soon as I set it to 0 the problem disappears. Thank you Jason! This had been a long-standing issue for us.

> This sounds a lot like a problem we recently experienced when a BIOS
> upgrade changed the ACPI SLIT table, which specifies the
> distances between NUMA nodes in a system. It put the remote node
> distance over the threshold the kernel uses to decide whether or not
> to enable the inline zone reclaim path. At least on SLES, Lustre
> doesn't seem to be able to free up pages in the page cache in this
> path, and performance dropped to 2-4MB/s. In my test case I was
> issuing 2MB I/Os and the kernel only let 1-2 I/Os trickle out per
> second, so that matches up with what you're seeing.
>
> j
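For anyone else who hits this, the workaround can be applied immediately and made persistent across reboots; a sketch, assuming a standard sysctl setup on the login nodes:

# Apply immediately.
sysctl -w vm.zone_reclaim_mode=0
# or: echo 0 > /proc/sys/vm/zone_reclaim_mode

# Make it persistent.
echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf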
Andreas Dilger
2010-Jun-09 19:45 UTC
[Lustre-discuss] Slow read performance after ~700MB-2GB
On 2010-06-09, at 13:41, Alexander Oltu wrote:

> On Wed, 9 Jun 2010 10:29:36 -0700 Jason Rappleye wrote:
>> Is vm.zone_reclaim_mode set to a value other than zero?
>
> Yes, it was 1, as soon as I set it to 0 the problem disappears.

Interesting, I have never heard of this problem before. Is this a client-side parameter, or on the server?

>> This sounds a lot like a problem we recently experienced when a BIOS
>> upgrade changed the ACPI SLIT table, which specifies the
>> distances between NUMA nodes in a system. It put the remote node
>> distance over the threshold the kernel uses to decide whether or not
>> to enable the inline zone reclaim path. At least on SLES, Lustre
>> doesn't seem to be able to free up pages in the page cache in this
>> path, and performance dropped to 2-4MB/s. In my test case I was
>> issuing 2MB I/Os and the kernel only let 1-2 I/Os trickle out per
>> second, so that matches up with what you're seeing.

Jason, do you have enough of an understanding of this codepath to know why Lustre is not freeing pages in this case? Is it because Lustre just doesn't have a VM callback that frees pages at all, or is it somehow ignoring the requests from the kernel to free up the pages?

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
Jason Rappleye
2010-Jun-09 20:51 UTC
[Lustre-discuss] Slow read performance after ~700MB-2GB
On Jun 9, 2010, at 12:45 PM, Andreas Dilger wrote:

> On 2010-06-09, at 13:41, Alexander Oltu wrote:
>> On Wed, 9 Jun 2010 10:29:36 -0700 Jason Rappleye wrote:
>>> Is vm.zone_reclaim_mode set to a value other than zero?
>>
>> Yes, it was 1, as soon as I set it to 0 the problem disappears.
>
> Interesting, I have never heard of this problem before. Is this a
> client-side parameter, or on the server?

Client. In 1.6.x Lustre doesn't use the page cache on the server (right?), so I don't think this will cause a problem. At least, not this particular problem.

When we first saw this problem a few weeks ago it appeared that client processes were stuck in uninterruptible sleep in blk_congestion_wait, but upon further examination we saw they were still issuing 1-2 I/Os per second. The kernel stack trace looked like this:

<ffffffff8013d6ec>{internal_add_timer+21}
<ffffffff8030fbc4>{schedule_timeout+138}
<ffffffff8013def0>{process_timeout+0}
<ffffffff8030f3ec>{io_schedule_timeout+88}
<ffffffff801f1d74>{blk_congestion_wait+102}
<ffffffff80148d46>{autoremove_wake_function+0}
<ffffffff8016851d>{throttle_vm_writeout+33}
<ffffffff8016aa0e>{remove_mapping+133}
<ffffffff8016b8e8>{shrink_zone+3367}
<ffffffff80218799>{find_next_bit+96}
<ffffffff8016c435>{zone_reclaim+430}
<ffffffff8843c3ba>{:ptlrpc:ldlm_lock_decref+154}
<ffffffff8852df5a>{:osc:cache_add_extent+1178}
<ffffffff8860f838>{:lustre:ll_removepage+488}
<ffffffff8852152a>{:osc:osc_prep_async_page+426}
<ffffffff8860c953>{:lustre:llap_shrink_cache+1715}
<ffffffff88524224>{:osc:osc_queue_group_io+644}
<ffffffff801671a2>{get_page_from_freelist+222}
<ffffffff8016756d>{__alloc_pages+113}
<ffffffff80162416>{add_to_page_cache+57}
<ffffffff80162c49>{grab_cache_page_nowait+53}
<ffffffff8860e368>{:lustre:ll_readahead+2584}
<ffffffff8851db55>{:osc:osc_check_rpcs+773}
<ffffffff8012c52c>{__wake_up+56}
<ffffffff88515db1>{:osc:loi_list_maint+225}
<ffffffff88330288>{:libcfs:cfs_alloc+40}
<ffffffff88615557>{:lustre:ll_readpage+4775}
<ffffffff885b3109>{:lov:lov_fini_enqueue_set+585}
<ffffffff88438cc7>{:ptlrpc:ldlm_lock_add_to_lru+119}
<ffffffff8843719e>{:ptlrpc:lock_res_and_lock+190}
<ffffffff883d792f>{:obdclass:class_handle_unhash_nolock+207}
<ffffffff8843bb1c>{:ptlrpc:ldlm_lock_decref_internal+1356}
<ffffffff885b235f>{:lov:lov_finish_set+1695}
<ffffffff801629bd>{do_generic_mapping_read+525}
<ffffffff8016476e>{file_read_actor+0}
<ffffffff8016328b>{__generic_file_aio_read+324}
<ffffffff80164576>{generic_file_readv+143}
<ffffffff885b07c9>{:lov:lov_merge_lvb+281}
<ffffffff80148d46>{autoremove_wake_function+0}
<ffffffff8019f156>{__touch_atime+118}
<ffffffff885ef821>{:lustre:ll_file_readv+6385}
<ffffffff80216f4f>{__up_read+16}
<ffffffff885efada>{:lustre:ll_file_read+26}
<ffffffff801878f0>{vfs_read+212}
<ffffffff80187cd0>{sys_read+69}
<ffffffff8010ae5e>{system_call+126}

zone_reclaim_mode is set in mm/page_alloc.c:build_zonelists by examining the distance between nodes. The remote node distance of 20 in the old BIOS was fine; it is set to 21 in the new BIOS, and that put it over the edge. We're working with the vendor to understand why that change was made. In any case, setting zone_reclaim_mode back to zero works around the issue.

Just to be clear, the remote node distance in and of itself doesn't have anything to do with the problem - setting zone_reclaim_mode to one on a host with the original distances is sufficient to reproduce the problem.
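The distances the kernel derived from the SLIT table can be inspected at runtime, which makes it easy to compare an old and a new BIOS. This is only a sketch; it assumes NUMA sysfs support is enabled, and numactl is optional:

# Per-node distance vectors as the kernel sees them (from the ACPI SLIT table).
for n in /sys/devices/system/node/node*; do
    echo "$n: $(cat $n/distance)"
done

# The same information, if numactl is installed.
numactl --hardware

# What the kernel chose at boot based on those distances.
cat /proc/sys/vm/zone_reclaim_mode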
When the problem occurs, echoing 3 into drop_caches provides some temporary relief, but the kernel will eventually re-enter zone_reclaim.

>>> This sounds a lot like a problem we recently experienced when a BIOS
>>> upgrade changed the ACPI SLIT table, which specifies the
>>> distances between NUMA nodes in a system. It put the remote node
>>> distance over the threshold the kernel uses to decide whether or not
>>> to enable the inline zone reclaim path. At least on SLES, Lustre
>>> doesn't seem to be able to free up pages in the page cache in this
>>> path, and performance dropped to 2-4MB/s. In my test case I was
>>> issuing 2MB I/Os and the kernel only let 1-2 I/Os trickle out per
>>> second, so that matches up with what you're seeing.
>
> Jason, do you have enough of an understanding of this codepath to
> know why Lustre is not freeing pages in this case? Is it because
> Lustre just doesn't have a VM callback that frees pages at all, or
> is it somehow ignoring the requests from the kernel to free up the
> pages?

Kind of. The stack trace above definitely shows that the process enters the inline zone reclaim path. zone_reclaim itself calls shrink_slabs, which asks each cache that has been registered via set_shrinker to free up memory. At least, that's how it works in the SLES 10 kernel.

Looking at our 1.6.7 client sources (no, we don't run 1.6.7 on our servers!), llite registers the cache using ll_register_cache, which calls the kernel register_cache function. It ends up being a no-op if register_cache isn't found by autoconf. My understanding is that the interface changed at some point, and some versions of Lustre don't support the set_shrinker interface.

It *looks* like 1.8.2 does the right thing, at least in terms of registering the cache. Apparently register_shrinker is the function to call in some kernels. autoconf checks for its existence, and if it's there it wraps the call with its own version of set_shrinker (in lustre/include/linux/lustre_compat25.h). Otherwise, it calls the kernel's set_shrinker.

I haven't tried to reproduce this problem with a 1.8.2 client yet, but I should be able to sneak in a test this week and see if it works as expected.

j

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
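A rough way to see which of the two interfaces a given kernel actually exports (and therefore which branch the autoconf check would take), without reading the kernel source, is to look at the symbol table; this is only a sketch:

# set_shrinker was the older interface; register_shrinker replaced it in
# later 2.6 kernels. One or the other should show up here.
grep -wE 'set_shrinker|register_shrinker' /proc/kallsyms

# Temporary relief when a node gets stuck in zone_reclaim, as noted above:
echo 3 > /proc/sys/vm/drop_caches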
Jason Rappleye
2010-Jun-10 00:36 UTC
[Lustre-discuss] Slow read performance after ~700MB-2GB
On Jun 9, 2010, at 1:51 PM, Jason Rappleye wrote:

> I haven't tried to reproduce this problem with a 1.8.2 client yet, but
> I should be able to sneak in a test this week and see if it works as
> expected.

Same behavior with what is essentially a 1.8.2.0-24chaos client (the changes to it are inconsequential) and a SLES 10 SP2 kernel, 2.6.16.60-0.42.5, plus some local patches. Those patches *should* be inconsequential as well - mostly changes to the OOM handler, IIRC (which isn't an issue here).

Our vendor already has a case open on this - I'll go ahead and have them file a BZ.

Thanks,

j

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
Cory Spitz
2010-Jun-10
[Lustre-discuss] Slow read performance after ~700MB-2GB
Hi,

Slightly off-topic, but did anyone else notice that readahead is triggering the shrinking and page writeout? ll_read_ahead_page() clears __GFP_WAIT, but it seems sane to me that it should also drop __GFP_IO. In my opinion, Lustre shouldn't speculatively force other pages out. Only when there is an actual, demonstrated need should it force out the other pages.

Thanks,
-Cory

Jason Rappleye wrote:
[...]
> When we first saw this problem a few weeks ago it appeared that client
> processes were stuck in uninterruptible sleep in blk_congestion_wait,
> but upon further examination we saw they were still issuing 1-2 I/Os
> per second. The kernel stack trace looked like this:
>
> <ffffffff8013d6ec>{internal_add_timer+21}
> <ffffffff8030fbc4>{schedule_timeout+138}
> <ffffffff8013def0>{process_timeout+0}
> <ffffffff8030f3ec>{io_schedule_timeout+88}
> <ffffffff801f1d74>{blk_congestion_wait+102}
> <ffffffff80148d46>{autoremove_wake_function+0}
> <ffffffff8016851d>{throttle_vm_writeout+33}
> <ffffffff8016aa0e>{remove_mapping+133}
> <ffffffff8016b8e8>{shrink_zone+3367}
> <ffffffff80218799>{find_next_bit+96}
> <ffffffff8016c435>{zone_reclaim+430}
> <ffffffff8843c3ba>{:ptlrpc:ldlm_lock_decref+154}
> <ffffffff8852df5a>{:osc:cache_add_extent+1178}
> <ffffffff8860f838>{:lustre:ll_removepage+488}
> <ffffffff8852152a>{:osc:osc_prep_async_page+426}
> <ffffffff8860c953>{:lustre:llap_shrink_cache+1715}
> <ffffffff88524224>{:osc:osc_queue_group_io+644}
> <ffffffff801671a2>{get_page_from_freelist+222}
> <ffffffff8016756d>{__alloc_pages+113}
> <ffffffff80162416>{add_to_page_cache+57}
> <ffffffff80162c49>{grab_cache_page_nowait+53}
> <ffffffff8860e368>{:lustre:ll_readahead+2584}
> <ffffffff8851db55>{:osc:osc_check_rpcs+773}
> <ffffffff8012c52c>{__wake_up+56}
> <ffffffff88515db1>{:osc:loi_list_maint+225}
> <ffffffff88330288>{:libcfs:cfs_alloc+40}
> <ffffffff88615557>{:lustre:ll_readpage+4775}
> <ffffffff885b3109>{:lov:lov_fini_enqueue_set+585}
> <ffffffff88438cc7>{:ptlrpc:ldlm_lock_add_to_lru+119}
> <ffffffff8843719e>{:ptlrpc:lock_res_and_lock+190}
> <ffffffff883d792f>{:obdclass:class_handle_unhash_nolock+207}
> <ffffffff8843bb1c>{:ptlrpc:ldlm_lock_decref_internal+1356}
> <ffffffff885b235f>{:lov:lov_finish_set+1695}
> <ffffffff801629bd>{do_generic_mapping_read+525}
> <ffffffff8016476e>{file_read_actor+0}
> <ffffffff8016328b>{__generic_file_aio_read+324}
> <ffffffff80164576>{generic_file_readv+143}
> <ffffffff885b07c9>{:lov:lov_merge_lvb+281}
> <ffffffff80148d46>{autoremove_wake_function+0}
> <ffffffff8019f156>{__touch_atime+118}
> <ffffffff885ef821>{:lustre:ll_file_readv+6385}
> <ffffffff80216f4f>{__up_read+16}
> <ffffffff885efada>{:lustre:ll_file_read+26}
> <ffffffff801878f0>{vfs_read+212}
> <ffffffff80187cd0>{sys_read+69}
> <ffffffff8010ae5e>{system_call+126}
>
[...]
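One way to test whether readahead is the trigger (not a fix, just a diagnostic) is to throttle client readahead while re-running the dd read test. This is only a sketch; it assumes the standard llite readahead tunables are exposed on the client, and on 1.6.x the same files also live under /proc/fs/lustre/llite/<fsname>-<uuid>/:

# Current client readahead settings.
lctl get_param llite.*.max_read_ahead_mb llite.*.max_read_ahead_whole_mb

# Disable readahead temporarily, then repeat the dd read test. If the
# stall goes away, readahead is what is pushing the client into the
# inline zone reclaim path.
lctl set_param llite.*.max_read_ahead_mb=0

# Restore the previous value afterwards (check the get_param output
# above for what it was; 40 is a common default).
lctl set_param llite.*.max_read_ahead_mb=40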
Andreas Dilger
2010-Jun-10 22:46 UTC
[Lustre-discuss] Slow read performance after ~700MB-2GB
On 2010-06-10, at 08:48, Cory Spitz wrote:

> Slightly off-topic, but did anyone else notice that readahead is
> triggering the shrinking and page writeout? ll_read_ahead_page()
> clears __GFP_WAIT but it seems sane to me that it should also drop
> __GFP_IO. In my opinion, Lustre shouldn't speculatively force other
> pages out. Only when there is an actual, demonstrated need should it
> force out the other pages.

We used to have a kernel patch (and more recently I implemented this using generic kernel EXPORT_FUNCTION() operations) to implement grab_cache_page_nowait_gfp(), to allow specifying the GFP mask when allocating pages for readahead. Without that, the kernel uses the GFP mask from the address space, which we have no control over.

That said, disabling memory pressure from readahead has a negative side effect also. When the client memory is full (i.e. all the time) there is NO readahead generated, because the readahead grab_cache_page_nowait_gfp() calls always fail, and this degrades performance significantly, since the reads are now synchronous and a single stream instead of pipelined.

While it is true that some speculative readahead may result in evicting other useful pages from cache, it is more likely to be prefetching useful pages that the current process wants to use immediately and evicting old/useless pages.

The readahead algorithms definitely need some improvement, and it is possible that it is over-zealous in this case, but it isn't possible to say for certain here.

I'd say the core problem is that no reclaim is being triggered and/or the reclaim is deadlocked on the cache cleaning, and that is the first issue to focus on here.

> Jason Rappleye wrote:
> [...]
>> When we first saw this problem a few weeks ago it appeared that client
>> processes were stuck in uninterruptible sleep in blk_congestion_wait,
>> but upon further examination we saw they were still issuing 1-2 I/Os
>> per second. The kernel stack trace looked like this:
>>
>> <ffffffff8013d6ec>{internal_add_timer+21}
>> <ffffffff8030fbc4>{schedule_timeout+138}
>> <ffffffff8013def0>{process_timeout+0}
>> <ffffffff8030f3ec>{io_schedule_timeout+88}
>> <ffffffff801f1d74>{blk_congestion_wait+102}
>> <ffffffff80148d46>{autoremove_wake_function+0}
>> <ffffffff8016851d>{throttle_vm_writeout+33}
>> <ffffffff8016aa0e>{remove_mapping+133}
>> <ffffffff8016b8e8>{shrink_zone+3367}
>> <ffffffff80218799>{find_next_bit+96}
>> <ffffffff8016c435>{zone_reclaim+430}
>> <ffffffff8843c3ba>{:ptlrpc:ldlm_lock_decref+154}
>> <ffffffff8852df5a>{:osc:cache_add_extent+1178}
>> <ffffffff8860f838>{:lustre:ll_removepage+488}
>> <ffffffff8852152a>{:osc:osc_prep_async_page+426}
>> <ffffffff8860c953>{:lustre:llap_shrink_cache+1715}
>> <ffffffff88524224>{:osc:osc_queue_group_io+644}
>> <ffffffff801671a2>{get_page_from_freelist+222}
>> <ffffffff8016756d>{__alloc_pages+113}
>> <ffffffff80162416>{add_to_page_cache+57}
>> <ffffffff80162c49>{grab_cache_page_nowait+53}
>> <ffffffff8860e368>{:lustre:ll_readahead+2584}
>> <ffffffff8851db55>{:osc:osc_check_rpcs+773}
>> <ffffffff8012c52c>{__wake_up+56}
>> <ffffffff88515db1>{:osc:loi_list_maint+225}
>> <ffffffff88330288>{:libcfs:cfs_alloc+40}
>> <ffffffff88615557>{:lustre:ll_readpage+4775}
>> <ffffffff885b3109>{:lov:lov_fini_enqueue_set+585}
>> <ffffffff88438cc7>{:ptlrpc:ldlm_lock_add_to_lru+119}
>> <ffffffff8843719e>{:ptlrpc:lock_res_and_lock+190}
>> <ffffffff883d792f>{:obdclass:class_handle_unhash_nolock+207}
>> <ffffffff8843bb1c>{:ptlrpc:ldlm_lock_decref_internal+1356}
>> <ffffffff885b235f>{:lov:lov_finish_set+1695}
>> <ffffffff801629bd>{do_generic_mapping_read+525}
>> <ffffffff8016476e>{file_read_actor+0}
>> <ffffffff8016328b>{__generic_file_aio_read+324}
>> <ffffffff80164576>{generic_file_readv+143}
>> <ffffffff885b07c9>{:lov:lov_merge_lvb+281}
>> <ffffffff80148d46>{autoremove_wake_function+0}
>> <ffffffff8019f156>{__touch_atime+118}
>> <ffffffff885ef821>{:lustre:ll_file_readv+6385}
>> <ffffffff80216f4f>{__up_read+16}
>> <ffffffff885efada>{:lustre:ll_file_read+26}
>> <ffffffff801878f0>{vfs_read+212}
>> <ffffffff80187cd0>{sys_read+69}
>> <ffffffff8010ae5e>{system_call+126}
>>
> [...]

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
Cory Spitz
[Lustre-discuss] Slow read performance after ~700MB-2GB
Hi, Andreas.

Yes, of course the negative side effect makes perfect sense. In fact, since users prefer to use all the memory that they can, and the Linux buffer cache does the same, the full-memory case is probably the more important one.

Thanks,
-Cory

On 6/10/2010 5:46 PM, Andreas Dilger wrote:

> On 2010-06-10, at 08:48, Cory Spitz wrote:
>> Slightly off-topic, but did anyone else notice that readahead is
>> triggering the shrinking and page writeout? ll_read_ahead_page()
>> clears __GFP_WAIT but it seems sane to me that it should also drop
>> __GFP_IO. In my opinion, Lustre shouldn't speculatively force other
>> pages out. Only when there is an actual, demonstrated need should it
>> force out the other pages.
>
> We used to have a kernel patch (and more recently I implemented this
> using generic kernel EXPORT_FUNCTION() operations) to implement
> grab_cache_page_nowait_gfp(), to allow specifying the GFP mask when
> allocating pages for readahead. Without that, the kernel uses the GFP
> mask from the address space, which we have no control over.
>
> That said, disabling memory pressure from readahead has a negative
> side effect also. When the client memory is full (i.e. all the time)
> there is NO readahead generated, because the readahead
> grab_cache_page_nowait_gfp() calls always fail, and this degrades
> performance significantly, since the reads are now synchronous and a
> single stream instead of pipelined.
>
> While it is true that some speculative readahead may result in
> evicting other useful pages from cache, it is more likely to be
> prefetching useful pages that the current process wants to use
> immediately and evicting old/useless pages.
>
> The readahead algorithms definitely need some improvement, and it is
> possible that it is over-zealous in this case, but it isn't possible
> to say for certain here.
>
> I'd say the core problem is that no reclaim is being triggered and/or
> the reclaim is deadlocked on the cache cleaning, and that is the
> first issue to focus on here.
>
>> Jason Rappleye wrote:
>> [...]
>>> When we first saw this problem a few weeks ago it appeared that client
>>> processes were stuck in uninterruptible sleep in blk_congestion_wait,
>>> but upon further examination we saw they were still issuing 1-2 I/Os
>>> per second. The kernel stack trace looked like this:
>>>
>>> <ffffffff8013d6ec>{internal_add_timer+21}
>>> <ffffffff8030fbc4>{schedule_timeout+138}
>>> <ffffffff8013def0>{process_timeout+0}
>>> <ffffffff8030f3ec>{io_schedule_timeout+88}
>>> <ffffffff801f1d74>{blk_congestion_wait+102}
>>> <ffffffff80148d46>{autoremove_wake_function+0}
>>> <ffffffff8016851d>{throttle_vm_writeout+33}
>>> <ffffffff8016aa0e>{remove_mapping+133}
>>> <ffffffff8016b8e8>{shrink_zone+3367}
>>> <ffffffff80218799>{find_next_bit+96}
>>> <ffffffff8016c435>{zone_reclaim+430}
>>> <ffffffff8843c3ba>{:ptlrpc:ldlm_lock_decref+154}
>>> <ffffffff8852df5a>{:osc:cache_add_extent+1178}
>>> <ffffffff8860f838>{:lustre:ll_removepage+488}
>>> <ffffffff8852152a>{:osc:osc_prep_async_page+426}
>>> <ffffffff8860c953>{:lustre:llap_shrink_cache+1715}
>>> <ffffffff88524224>{:osc:osc_queue_group_io+644}
>>> <ffffffff801671a2>{get_page_from_freelist+222}
>>> <ffffffff8016756d>{__alloc_pages+113}
>>> <ffffffff80162416>{add_to_page_cache+57}
>>> <ffffffff80162c49>{grab_cache_page_nowait+53}
>>> <ffffffff8860e368>{:lustre:ll_readahead+2584}
>>> <ffffffff8851db55>{:osc:osc_check_rpcs+773}
>>> <ffffffff8012c52c>{__wake_up+56}
>>> <ffffffff88515db1>{:osc:loi_list_maint+225}
>>> <ffffffff88330288>{:libcfs:cfs_alloc+40}
>>> <ffffffff88615557>{:lustre:ll_readpage+4775}
>>> <ffffffff885b3109>{:lov:lov_fini_enqueue_set+585}
>>> <ffffffff88438cc7>{:ptlrpc:ldlm_lock_add_to_lru+119}
>>> <ffffffff8843719e>{:ptlrpc:lock_res_and_lock+190}
>>> <ffffffff883d792f>{:obdclass:class_handle_unhash_nolock+207}
>>> <ffffffff8843bb1c>{:ptlrpc:ldlm_lock_decref_internal+1356}
>>> <ffffffff885b235f>{:lov:lov_finish_set+1695}
>>> <ffffffff801629bd>{do_generic_mapping_read+525}
>>> <ffffffff8016476e>{file_read_actor+0}
>>> <ffffffff8016328b>{__generic_file_aio_read+324}
>>> <ffffffff80164576>{generic_file_readv+143}
>>> <ffffffff885b07c9>{:lov:lov_merge_lvb+281}
>>> <ffffffff80148d46>{autoremove_wake_function+0}
>>> <ffffffff8019f156>{__touch_atime+118}
>>> <ffffffff885ef821>{:lustre:ll_file_readv+6385}
>>> <ffffffff80216f4f>{__up_read+16}
>>> <ffffffff885efada>{:lustre:ll_file_read+26}
>>> <ffffffff801878f0>{vfs_read+212}
>>> <ffffffff80187cd0>{sys_read+69}
>>> <ffffffff8010ae5e>{system_call+126}
>>>
>> [...]
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.
>