We are unable to use the combination of Lustre and the cgroup memory controller, because of intermittent hangs when trying to close the cgroup.

In a thread on LKML [1] we diagnosed the problem as a leak of page accounting or resources. Memory pages are charged to the cgroup, but the cgroup is unable to un-charge them, and so it spins. This suggests that, perhaps, at least one page gets allocated but not placed on the LRU.

Using the NFS client, via a gateway, has never shown this problem.

I'm digging in the client code, but I really need some pointers, and I'm hampered by being unable to find a reproducible test case. Any ideas?

Our system is a Lustre 1.8.6 server, with clients on Linux 2.6.32 and Lustre 1.8.5.

Thanks

[1] https://lkml.org/lkml/2010/9/9/534

-- 
Mark
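The pattern that triggers it, in code form, is roughly the following. This is only a minimal sketch: the /cgroup mount point, the group name and the Lustre path are placeholder examples, and the dd via system() just stands in for a real workload.

/* Minimal sketch of the failing pattern: run some I/O from inside a
 * memory cgroup, then try to force the group empty afterwards.
 * Assumes the memory controller is mounted at /cgroup (site-specific). */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

static void write_str(const char *path, const char *s)
{
    FILE *f = fopen(path, "w");

    if (f == NULL) {
        perror(path);
        exit(EXIT_FAILURE);
    }
    fputs(s, f);
    fclose(f);
}

int main(void)
{
    char pid[32];

    /* create a group under the memory controller and join it */
    mkdir("/cgroup/test", 0755);
    snprintf(pid, sizeof pid, "%d\n", (int)getpid());
    write_str("/cgroup/test/tasks", pid);

    /* some I/O on the Lustre mount (placeholder workload) */
    system("dd if=/dev/zero of=/net/lustre/cgtest bs=1M count=256 "
           "&& rm /net/lustre/cgtest");

    /* leave the group, then try to empty it: this write is what hangs */
    write_str("/cgroup/tasks", pid);
    write_str("/cgroup/test/memory.force_empty", "1\n");

    return 0;
}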
Two ideas come to mind. One is that the reason you are having difficulty reproducing the problem is that it only happens after some fault condition. Possibly you need the client to do recovery to an OST and resend a bulk RPC, or resend due to a checksum error?

It might also be due to the application I/O type (e.g. mmap, direct I/O, pwrite, splice, etc).

Possibly you can correlate reproducer cases with Lustre errors on the console?

Lustre also has memory debugging that can be enabled, but without a reasonably concise reproducer it would be difficult to log/analyze so much data for hours of runtime.

Cheers, Andreas

On 2011-07-27, at 10:21 AM, Mark Hills <Mark.Hills at framestore.com> wrote:

> We are unable to use the combination of Lustre and the cgroup memory
> controller, because of intermittent hangs when trying to close the cgroup.
>
> In a thread on LKML [1] we diagnosed the problem as a leak of page
> accounting or resources.
>
> Memory pages are charged to the cgroup, but the cgroup is unable to
> un-charge them, and so it spins. This suggests that, perhaps, at least one
> page gets allocated but not placed on the LRU.
>
> Using the NFS client, via a gateway, has never shown this problem.
>
> I'm digging in the client code, but I really need some pointers, and I'm
> hampered by being unable to find a reproducible test case. Any ideas?
>
> Our system is a Lustre 1.8.6 server, with clients on Linux 2.6.32 and
> Lustre 1.8.5.
>
> Thanks
>
> [1] https://lkml.org/lkml/2010/9/9/534
>
> -- 
> Mark
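To exercise several of those I/O paths in one run, a small test program along these lines could be run inside the cgroup. It is an untested sketch covering pwrite, an mmap store and an O_DIRECT write (not splice); the mount point and file name are placeholders.

/* Untested sketch: exercise several client I/O paths against one file.
 * The path /net/lustre/iotest is an example; adjust as needed. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PATH "/net/lustre/iotest"
#define SIZE (4 * 1024 * 1024)

int main(void)
{
    char *buf, *map;
    void *dbuf;
    int fd;

    /* buffered pwrite */
    fd = open(PATH, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror(PATH);
        return 1;
    }
    buf = calloc(1, SIZE);
    if (pwrite(fd, buf, SIZE, 0) != SIZE)
        perror("pwrite");

    /* store through a shared mapping */
    map = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map != MAP_FAILED) {
        memset(map, 0xaa, SIZE);
        msync(map, SIZE, MS_SYNC);
        munmap(map, SIZE);
    }
    close(fd);

    /* O_DIRECT write from a page-aligned buffer */
    fd = open(PATH, O_WRONLY | O_DIRECT);
    if (fd >= 0) {
        if (posix_memalign(&dbuf, 4096, SIZE) == 0) {
            memset(dbuf, 0x55, SIZE);
            if (write(fd, dbuf, SIZE) != SIZE)
                perror("direct write");
        }
        close(fd);
    }

    unlink(PATH);
    return 0;
}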
On Wed, 27 Jul 2011, Andreas Dilger wrote:

> Two ideas come to mind. One is that the reason you are having difficulty
> reproducing the problem is that it only happens after some fault
> condition. Possibly you need the client to do recovery to an OST and
> resend a bulk RPC, or resend due to a checksum error?

Is there an easy way to trigger some error cases like this?

> It might also be due to the application I/O type (e.g. mmap, direct I/O,
> pwrite, splice, etc).

Yes, of course. Although I didn't gather any statistics, there wasn't a clear standout application which was more affected than others.

> Possibly you can correlate reproducer cases with Lustre errors on the
> console?

Back when I tried this last year on the production system, I wasn't able to see corresponding errors. But I don't have any of this data around any more; I'd need to do some tests on the production system to capture one case.

> Lustre also has memory debugging that can be enabled, but without a
> reasonably concise reproducer it would be difficult to log/analyze so
> much data for hours of runtime.

If I am able to capture a case, is there a way to, for example, dump a list of Lustre pages still held by the client, and correlate these with the files in question?

What I am thinking is that I could stop the running processes and attempt to drain all the pages, which would hopefully leave a small number of 'bad' ones -- with the files in question I could at least help to identify the I/O type.

Thanks for your reply

-- 
Mark
On Wed, 27 Jul 2011, Andreas Dilger wrote:
[...]
> Possibly you can correlate reproducer cases with Lustre errors on the
> console?

I've managed to catch the bad state, on a clean client too -- there are no errors reported from Lustre in dmesg.

Here's the information reported by the cgroup. It seems that there's a discrepancy of two pages (the 'cache' field, and pgpgin vs. pgpgout).

The process which was in the group terminated a long time ago.

I can leave the machine in this state until tomorrow, so any suggestions for data to capture that could help trace this bug would be welcomed. Thanks.

# cd /cgroup/p25321

# echo 1 > memory.force_empty
<hangs: the bug>

# cat tasks
<none>

# cat memory.max_usage_in_bytes
1281351680

# cat memory.usage_in_bytes
8192

# cat memory.stat
cache 8192         <--- two pages
rss 0
mapped_file 0
pgpgin 396369      <--- two pages higher than pgpgout
pgpgout 396367
swap 0
inactive_anon 0
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 8388608000
hierarchical_memsw_limit 10485760000
total_cache 8192
total_rss 0
total_mapped_file 0
total_pgpgin 396369
total_pgpgout 396367
total_swap 0
total_inactive_anon 0
total_active_anon 0
total_inactive_file 0
total_active_file 0
total_unevictable 0

# echo 1 > /proc/sys/vm/drop_caches
<success>

# echo 2 > /proc/sys/vm/drop_caches
<success>

# cat memory.stat
<same as above>

-- 
Mark
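To catch this state as it develops, the 'cache' value and the pgpgin/pgpgout gap can be polled with a small helper like the rough sketch below; the cgroup path is just an example.

/* Rough sketch: poll a group's memory.stat and report the cache size
 * and the pgpgin - pgpgout gap.  The cgroup path is an example. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define STAT "/cgroup/p25321/memory.stat"

int main(void)
{
    for (;;) {
        FILE *f = fopen(STAT, "r");
        char key[64];
        long long val, cache = 0, in = 0, out = 0;

        if (f == NULL) {
            perror(STAT);
            return 1;
        }
        while (fscanf(f, "%63s %lld", key, &val) == 2) {
            if (strcmp(key, "cache") == 0)
                cache = val;
            else if (strcmp(key, "pgpgin") == 0)
                in = val;
            else if (strcmp(key, "pgpgout") == 0)
                out = val;
        }
        fclose(f);
        printf("cache=%lld bytes, pgpgin-pgpgout=%lld pages\n",
               cache, in - out);
        sleep(10);
    }
}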
On 2011-07-27, at 12:57 PM, Mark Hills wrote:

> On Wed, 27 Jul 2011, Andreas Dilger wrote:
> [...]
>> Possibly you can correlate reproducer cases with Lustre errors on the
>> console?
>
> I've managed to catch the bad state, on a clean client too -- there are no
> errors reported from Lustre in dmesg.
>
> Here's the information reported by the cgroup. It seems that there's a
> discrepancy of two pages (the 'cache' field, and pgpgin vs. pgpgout).

To dump Lustre pagecache pages use "lctl get_param llite.*.dump_page_cache", which will print the inode, page index, read/write access, and page flags.

It wouldn't hurt to dump the kernel debug log, but it is unlikely to hold anything useful.

> The process which was in the group terminated a long time ago.
>
> I can leave the machine in this state until tomorrow, so any suggestions
> for data to capture that could help trace this bug would be welcomed.
> Thanks.
>
> [...]
>
> -- 
> Mark

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
On Wed, 27 Jul 2011, Andreas Dilger wrote:

> On 2011-07-27, at 12:57 PM, Mark Hills wrote:
>> On Wed, 27 Jul 2011, Andreas Dilger wrote:
>> [...]
>>> Possibly you can correlate reproducer cases with Lustre errors on the
>>> console?
>>
>> I've managed to catch the bad state, on a clean client too -- there are no
>> errors reported from Lustre in dmesg.
>>
>> Here's the information reported by the cgroup. It seems that there's a
>> discrepancy of two pages (the 'cache' field, and pgpgin vs. pgpgout).
>
> To dump Lustre pagecache pages use "lctl get_param llite.*.dump_page_cache",
> which will print the inode, page index, read/write access, and page flags.

So I lost the previous test case, but acquired another. This time there are 147 pages of difference. But they are not listed by the lctl command, which gives an empty list.

The cgroup reports approx. 600 KiB used as 'cache' (memory.stat). Yet /proc/meminfo does not obviously account for it ('Cached' is 69540 kB in total; see below).

But what caught my attention is that the cgroup 'cache' value dropped slightly a few minutes later. The drop_caches method wasn't touching this memory, but when I put the system under memory pressure these pages were discarded and 'cache' was reduced, until eventually the cgroup un-hangs.

So what I observed is that the pages cannot be forced out of the cache -- only by memory pressure.

I did a quick test on the regular behaviour, and drop_caches normally works fine with Lustre content, both in and out of a cgroup. So these pages are 'special' in some way.

Is it possible that some pages could not be on the LRU, but would still be seen by the memory pressure codepaths?

Thanks

# cd /group/p1243

# echo 1 > memory.force_empty
<hangs>

# echo 2 > /proc/sys/vm/drop_caches

# lctl get_param llite.*.dump_page_cache
llite.beta-ffff88042b186400.dump_page_cache
gener | llap cookie origin wq du wb | page inode index count [ page flags ]

# cat memory.usage_in_bytes
602112

# cat memory.stat
cache 602112
rss 0
mapped_file 0
pgpgin 1998315
pgpgout 1998168
swap 0
inactive_anon 0
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 16777216000
hierarchical_memsw_limit 20971520000
total_cache 602112
total_rss 0
total_mapped_file 0
total_pgpgin 1998315
total_pgpgout 1998168
total_swap 0
total_inactive_anon 0
total_active_anon 0
total_inactive_file 0
total_active_file 0
total_unevictable 0

# cat /proc/meminfo
MemTotal: 16464728 kB
MemFree: 15875412 kB
Buffers: 256 kB
Cached: 69540 kB
SwapCached: 0 kB
Active: 59452 kB
Inactive: 87736 kB
Active(anon): 33072 kB
Inactive(anon): 61224 kB
Active(file): 26380 kB
Inactive(file): 26512 kB
Unevictable: 228 kB
Mlocked: 0 kB
SwapTotal: 16587072 kB
SwapFree: 16587072 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 77620 kB
Mapped: 26768 kB
Shmem: 16676 kB
Slab: 67120 kB
SReclaimable: 29136 kB
SUnreclaim: 37984 kB
KernelStack: 3336 kB
PageTables: 10292 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 24819436 kB
Committed_AS: 659876 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 320240 kB
VmallocChunk: 34359359884 kB
HardwareCorrupted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7488 kB
DirectMap2M: 16764928 kB

<some time later>

# cat memory.stat | grep cache
cache 581632

# echo 2 > /proc/sys/vm/drop_caches

# cat memory.stat | grep cache
cache 581632

<put system under memory pressure>

# cat memory.stat | grep cache
cache 118784

<keep going>

# cat memory.stat | grep cache
cache 0

<memory.force_empty un-hangs>

-- 
Mark
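Since only memory pressure seems to push these pages out, a crude way to apply a controlled amount of pressure is to dirty anonymous memory, e.g. with a sketch like this (the amount to dirty is given in MiB on the command line):

/* Crude sketch: apply memory pressure by touching anonymous memory in
 * 64 MiB steps, then sleep so the pages stay resident. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (64UL << 20)

int main(int argc, char **argv)
{
    unsigned long target, done = 0;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <MiB>\n", argv[0]);
        return 1;
    }
    target = strtoul(argv[1], NULL, 0) << 20;

    while (done < target) {
        char *p = malloc(CHUNK);

        if (p == NULL)
            break;
        memset(p, 1, CHUNK);    /* fault the pages in */
        done += CHUNK;
    }
    printf("dirtied %lu MiB; sleeping so the pages stay resident\n",
           done >> 20);
    pause();
    return 0;
}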
If you get another system in this hang there are some more things you could check:

  lctl get_param memused pagesused

This will print the count of all memory Lustre still thinks is allocated.

Check the slab cache allocations (/proc/slabinfo) for Lustre slab objects. Usually they are called ll_* or ldlm_* and are listed in sequence.

Enable memory allocation tracing before applying memory pressure:

  lctl set_param debug=+malloc

And then when the memory is freed dump the debug logs:

  lctl dk /tmp/debug

And grep out the "free" lines.

The other thing that may free Lustre memory is to remove the modules, but you need to keep libcfs loaded in order to be able to dump the debug log.

Cheers, Andreas

On 2011-07-28, at 7:53 AM, Mark Hills <Mark.Hills at framestore.com> wrote:

> On Wed, 27 Jul 2011, Andreas Dilger wrote:
>
>> To dump Lustre pagecache pages use "lctl get_param llite.*.dump_page_cache",
>> which will print the inode, page index, read/write access, and page flags.
>
> So I lost the previous test case, but acquired another. This time there
> are 147 pages of difference. But they are not listed by the lctl command,
> which gives an empty list.
>
> The cgroup reports approx. 600 KiB used as 'cache' (memory.stat). Yet
> /proc/meminfo does not obviously account for it.
>
> But what caught my attention is that the cgroup 'cache' value dropped
> slightly a few minutes later. The drop_caches method wasn't touching this
> memory, but when I put the system under memory pressure these pages were
> discarded and 'cache' was reduced, until eventually the cgroup un-hangs.
>
> So what I observed is that the pages cannot be forced out of the cache --
> only by memory pressure.
>
> I did a quick test on the regular behaviour, and drop_caches normally
> works fine with Lustre content, both in and out of a cgroup. So these
> pages are 'special' in some way.
>
> Is it possible that some pages could not be on the LRU, but would still be
> seen by the memory pressure codepaths?
> [...]
>
> -- 
> Mark
On Wed, Jul 27, 2011 at 07:57:57PM +0100, Mark Hills wrote:
> On Wed, 27 Jul 2011, Andreas Dilger wrote:
>> Possibly you can correlate reproducer cases with Lustre errors on the
>> console?
>
> I've managed to catch the bad state, on a clean client too -- there are no
> errors reported from Lustre in dmesg.
>
> Here's the information reported by the cgroup. It seems that there's a
> discrepancy of two pages (the 'cache' field, and pgpgin vs. pgpgout).
>
> The process which was in the group terminated a long time ago.
>
> I can leave the machine in this state until tomorrow, so any suggestions
> for data to capture that could help trace this bug would be welcomed.
> Thanks.

Maybe try

  vm.zone_reclaim_mode=0

With zone_reclaim_mode=1 (even without memcg) we saw ~infinite scanning for pages when doing Lustre I/O + memory pressure, which also hung up a core in 100% system time. The scanning can be seen with

  grep scan /proc/zoneinfo

That zone_reclaim_mode=0 helps our problem could be related to your memcg semi-missing pages, or perhaps it's a workaround for a core kernel problem with zones -- we only have Lustre so can't distinguish.

Secondly, and even more of a long shot -- I presume slab isn't accounted as part of memcg, but you could also try clearing the ldlm locks. Linux is reluctant to drop inode caches until the locks are cleared first:

  lctl set_param ldlm.namespaces.*.lru_size=clear

cheers,
robin
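To see whether those scan counters move while dropping caches or applying memory pressure, the relevant /proc/zoneinfo lines can be sampled with something as simple as this sketch (it only filters on the substring "scan", so it makes no assumption about the exact field names):

/* Rough sketch: periodically print the /proc/zoneinfo lines containing
 * "scan" with a timestamp, to show whether the scan counters move. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char line[256];

    for (;;) {
        FILE *f = fopen("/proc/zoneinfo", "r");

        if (f == NULL) {
            perror("/proc/zoneinfo");
            return 1;
        }
        printf("--- %ld\n", (long)time(NULL));
        while (fgets(line, sizeof line, f)) {
            if (strstr(line, "scan"))
                fputs(line, stdout);
        }
        fclose(f);
        fflush(stdout);
        sleep(5);
    }
}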
On Thu, 28 Jul 2011, Andreas Dilger wrote:

> If you get another system in this hang there are some more things you
> could check:
>
>   lctl get_param memused pagesused
>
> This will print the count of all memory Lustre still thinks is
> allocated.
>
> Check the slab cache allocations (/proc/slabinfo) for Lustre slab
> objects. Usually they are called ll_* or ldlm_* and are listed in
> sequence.
>
> Enable memory allocation tracing before applying memory pressure:
>
>   lctl set_param debug=+malloc
>
> And then when the memory is freed dump the debug logs:
>
>   lctl dk /tmp/debug
>
> And grep out the "free" lines.

I followed these steps. Neither /proc/slabinfo nor Lustre's own logs show activity at the point where the pages are forced out due to memory pressure. (There's a certain amount of periodic noise in the debug logs, but looking beyond that I was able to apply memory pressure, watch the pages go out, and Lustre logged nothing.)

As before, dumping the Lustre pagecache pages shows nothing. So it looks like these aren't Lustre pages. Furthermore...

> The other thing that may free Lustre memory is to remove the modules,
> but you need to keep libcfs loaded in order to be able to dump the debug
> log.

I then unmounted the filesystem and removed all the modules, right the way down to libcfs. On completion the cgroup still reported a certain amount of cached memory, and on memory pressure this was freed -- exactly the same as with the modules loaded.

I think this reinforces the explanation above, that they aren't Lustre pages at all (though perhaps they used to be). But they are some side effect of Lustre activity; this whole problem only happens when Lustre disks are mounted and accessed. Hosts with Lustre mounted via an NFS gateway perform flawlessly for months (and they still have the Lustre modules loaded), whereas a host with Lustre mounted directly (and no other changes) fails -- it can be made to block a cgroup in 10 minutes or so.

The kernel seems to be able to handle these pages, rather than them being an inconsistency in data structures. Is there a reasonable explanation for pages like this in the kernel? One that could hopefully trace them back to their source.
Thanks

-- 
Mark

# echo 2 > /proc/sys/vm/drop_caches

# lctl get_param llite.*.dump_page_cache
llite.beta-ffff88040bdb9800.dump_page_cache
gener | llap cookie origin wq du wb | page inode index count [ page flags ]
llite.pi-ffff88040bde6000.dump_page_cache
gener | llap cookie origin wq du wb | page inode index count [ page flags ]

# cat /cgroup/d*/memory.usage_in_bytes
61440
1069056
1892352
92405760

# lctl get_param memused pagesused
lnet.memused=925199
memused=16609140
pagesused=0

# cat /proc/slabinfo | grep ll_
ll_import_cache 0 0 1248 26 8 : tunables 0 0 0 : slabdata 0 0 0
ll_obdo_cache 312 312 208 39 2 : tunables 0 0 0 : slabdata 8 8 0
ll_obd_dev_cache 45 45 5696 5 8 : tunables 0 0 0 : slabdata 9 9 0

# cat /proc/slabinfo | grep ldlm_
ldlm_locks 361 532 576 28 4 : tunables 0 0 0 : slabdata 19 19 0

<memory pressure>

# cat /cgroup/d*/memory.usage_in_bytes
0
0
0
12288

# cat /proc/slabinfo | grep ll_
ll_import_cache 0 0 1248 26 8 : tunables 0 0 0 : slabdata 0 0 0
ll_obdo_cache 312 312 208 39 2 : tunables 0 0 0 : slabdata 8 8 0
ll_obd_dev_cache 45 45 5696 5 8 : tunables 0 0 0 : slabdata 9 9 0

# cat /proc/slabinfo | grep ldlm_
ldlm_locks 361 532 576 28 4 : tunables 0 0 0 : slabdata 19 19 0

# lctl get_param memused pagesused
lnet.memused=925199
memused=16609140
pagesused=0

# umount /net/lustre

<in dmesg>
LustreError: 20169:0:(ldlm_request.c:1034:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 20169:0:(ldlm_request.c:1592:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
LustreError: 20169:0:(ldlm_request.c:1034:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 20169:0:(ldlm_request.c:1592:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108

# rmmod mgc
# rmmod mdc
# rmmod osc
# rmmod lov
# rmmod lquota
# rmmod ptlrpc
# rmmod lvfs
# rmmod lnet
# rmmod obdclass
# rmmod ksocklnd
# rmmod libcfs

# echo 2 > /proc/sys/vm/drop_caches

# cat /cgroup/d*/memory.usage_in_bytes
0
0
0
12288

<memory pressure>

# cat /cgroup/d*/memory.usage_in_bytes
0
0
0
0
On Fri, 29 Jul 2011, Robin Humble wrote:

> On Wed, Jul 27, 2011 at 07:57:57PM +0100, Mark Hills wrote:
>> [...]
>
> Maybe try
>
>   vm.zone_reclaim_mode=0
>
> With zone_reclaim_mode=1 (even without memcg) we saw ~infinite scanning
> for pages when doing Lustre I/O + memory pressure, which also hung up a
> core in 100% system time.

0 is the default on this kernel, and is what we have been using. I tried the other possibilities, without any difference.

I think it's the reclaim that's actually working; if I understand correctly, it scans the pages looking for a good match to reclaim. But cgroup force_empty relies on the LRU, and the pages cannot be found there.

> The scanning can be seen with
>
>   grep scan /proc/zoneinfo

I don't see any incrementing of these counters when the memory is freed by memory pressure.

> That zone_reclaim_mode=0 helps our problem could be related to your
> memcg semi-missing pages, or perhaps it's a workaround for a core
> kernel problem with zones -- we only have Lustre so can't distinguish.
>
> Secondly, and even more of a long shot -- I presume slab isn't accounted
> as part of memcg, but you could also try clearing the ldlm locks. Linux
> is reluctant to drop inode caches until the locks are cleared first:
>
>   lctl set_param ldlm.namespaces.*.lru_size=clear

I tried this, and it didn't remove the cache pages, or enable them to be removed.

-- 
Mark
Mark Hills
2011-Aug-04 17:24 UTC
[Lustre-devel] Bad page state after unlink (was Re: Hangs with cgroup memory controller)
On Fri, 29 Jul 2011, Mark Hills wrote:
[...]
> Hosts with Lustre mounted via an NFS gateway perform flawlessly for months
> (and they still have the Lustre modules loaded), whereas a host with Lustre
> mounted directly (and no other changes) fails -- it can be made to block a
> cgroup in 10 minutes or so.

Following this up, I seem to have a reproducible test case of a page bug, on a kernel with more debugging features.

At first it appeared with Bonnie. I looked more closely, and the bug occurs on unlink() of a file shortly after it was written to -- presumably with pages still in the local cache (pending writes?). It seems unlink is affected, but not truncate.

$ dd if=/dev/zero of=/net/lustre/file bs=4096 count=1
$ rm /net/lustre/file

BUG: Bad page state in process rm pfn:21fe6a
page:ffffea00076fa730 flags:800000000000000c count:0 mapcount:0 mapping:(null) index:1

If there is a delay of a few seconds before the rm, all is okay. Truncate works, but a subsequent rm can still fail if it is quick enough.

The task does not need to be running in a cgroup for the "Bad page" to be reported, although the kernel is built with cgroups.

I can't be certain this is the same bug seen on the production system (which uses a packaged kernel etc.), but it seems like a good start :-) It also correlates with it. It seems the production kernel glosses over this bug, but when a cgroup is used the symptoms start to show.

$ uname -a
Linux joker 2.6.32.28-mh #27 SMP PREEMPT Thu Aug 4 17:15:46 BST 2011 x86_64 x86_64 x86_64 GNU/Linux

Lustre source: Git 9302433 (beyond v1_8_6_80)

Reproduced with a 1.8.6 server (Whamcloud release), and also 1.8.3.

Thanks

-- 
Mark

BUG: Bad page state in process rm pfn:21fe6a
page:ffffea00076fa730 flags:800000000000000c count:0 mapcount:0 mapping:(null) index:1
Pid: 24724, comm: rm Tainted: G B 2.6.32.28-mh #27
Call Trace:
[<ffffffff81097ebc>] ? bad_page+0xcc/0x130
[<ffffffffa059f119>] ? ll_page_removal_cb+0x1e9/0x4d0 [lustre]
[<ffffffffa03b17a3>] ? __ldlm_handle2lock+0x93/0x3b0 [ptlrpc]
[<ffffffffa04c6522>] ? cache_remove_lock+0x182/0x268 [osc]
[<ffffffffa04ad95d>] ? osc_extent_blocking_cb+0x29d/0x2d0 [osc]
[<ffffffff81383920>] ? _spin_unlock+0x10/0x30
[<ffffffffa03b23a5>] ? ldlm_cancel_callback+0x55/0xe0 [ptlrpc]
[<ffffffffa03cb3c7>] ? ldlm_cli_cancel_local+0x67/0x340 [ptlrpc]
[<ffffffff81383920>] ? _spin_unlock+0x10/0x30
[<ffffffffa03cd65a>] ? ldlm_cancel_list+0xea/0x230 [ptlrpc]
[<ffffffffa02e1312>] ? lnet_md_unlink+0x42/0x2d0 [lnet]
[<ffffffff81383920>] ? _spin_unlock+0x10/0x30
[<ffffffffa03cd939>] ? ldlm_cancel_resource_local+0x199/0x2b0 [ptlrpc]
[<ffffffffa029a629>] ? cfs_alloc+0x89/0xf0 [libcfs]
[<ffffffffa04b0c22>] ? osc_destroy+0x112/0x720 [osc]
[<ffffffffa05608ab>] ? lov_prep_destroy_set+0x27b/0x960 [lov]
[<ffffffff8138374e>] ? _spin_lock_irqsave+0x1e/0x50
[<ffffffffa054adc4>] ? lov_destroy+0x584/0xf40 [lov]
[<ffffffffa05575ed>] ? lov_unpackmd+0x4bd/0x8e0 [lov]
[<ffffffffa05d9e98>] ? ll_objects_destroy+0x4c8/0x1820 [lustre]
[<ffffffffa03f7cbe>] ? lustre_swab_buf+0xfe/0x180 [ptlrpc]
[<ffffffff8138374e>] ? _spin_lock_irqsave+0x1e/0x50
[<ffffffffa05db940>] ? ll_unlink_generic+0x2e0/0x3a0 [lustre]
[<ffffffff810d7309>] ? vfs_unlink+0x89/0xd0
[<ffffffff810e634c>] ? mnt_want_write+0x5c/0xb0
[<ffffffff810dac89>] ? do_unlinkat+0x199/0x1d0
[<ffffffff810cc8d5>] ? sys_faccessat+0x1a5/0x1f0
[<ffffffff8100b5ab>] ? system_call_fastpath+0x16/0x1b
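The dd/rm sequence above can also be written as a self-contained C reproducer -- a sketch of the same steps (a small buffered write followed immediately by unlink), with /net/lustre/file as a placeholder path:

/* C equivalent of the dd/rm sequence above: a small buffered write
 * followed immediately by unlink.  The path is an example. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PATH "/net/lustre/file"

int main(void)
{
    char buf[4096];
    int fd;

    memset(buf, 0, sizeof buf);

    fd = open(PATH, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror(PATH);
        return 1;
    }
    if (write(fd, buf, sizeof buf) != sizeof buf)
        perror("write");
    close(fd);

    /* no delay here: the bad page state only shows when the unlink
     * happens before the dirty pages have been flushed */
    if (unlink(PATH) < 0)
        perror(PATH);

    return 0;
}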