We are unable to use the combination of Lustre and the cgroup memory controller, because of intermittent hangs when trying to close the cgroup.

In a thread on LKML [1] we diagnosed the problem as a leak of page accounting or resources. Memory pages are charged to the cgroup, but the cgroup is unable to un-charge them, and so it spins. This suggests that, perhaps, at least one page gets allocated but not placed on the LRU.

Using the NFS client, via a gateway, has never shown this problem.

I'm digging in the client code, but I really need some pointers, and I'm hampered by being unable to find a reproducible test case. Any ideas?

Our system is a Lustre 1.8.6 server, with clients on Linux 2.6.32 and Lustre 1.8.5.

Thanks

[1] https://lkml.org/lkml/2010/9/9/534

-- 
Mark
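The pattern that triggers it, in code form, is roughly the following. This is only a minimal sketch: the /cgroup mount point, the group name and the Lustre path are placeholder examples, and the dd via system() just stands in for a real workload.

/* Minimal sketch of the failing pattern: run some I/O from inside a
 * memory cgroup, then try to force the group empty afterwards.
 * Assumes the memory controller is mounted at /cgroup (site-specific). */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

static void write_str(const char *path, const char *s)
{
    FILE *f = fopen(path, "w");

    if (f == NULL) {
        perror(path);
        exit(EXIT_FAILURE);
    }
    fputs(s, f);
    fclose(f);
}

int main(void)
{
    char pid[32];

    /* create a group under the memory controller and join it */
    mkdir("/cgroup/test", 0755);
    snprintf(pid, sizeof pid, "%d\n", (int)getpid());
    write_str("/cgroup/test/tasks", pid);

    /* some I/O on the Lustre mount (placeholder workload) */
    system("dd if=/dev/zero of=/net/lustre/cgtest bs=1M count=256 "
           "&& rm /net/lustre/cgtest");

    /* leave the group, then try to empty it: this write is what hangs */
    write_str("/cgroup/tasks", pid);
    write_str("/cgroup/test/memory.force_empty", "1\n");

    return 0;
}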
Two ideas come to mind. One is that the reason you are having difficulty reproducing the problem is that it only happens after some fault condition. Possibly you need the client to do recovery to an OST and resend a bulk RPC, or resend due to a checksum error?

It might also be due to the application I/O type (e.g. mmap, direct I/O, pwrite, splice, etc).

Possibly you can correlate reproducer cases with Lustre errors on the console?

Lustre also has memory debugging that can be enabled, but without a reasonably concise reproducer it would be difficult to log/analyze so much data for hours of runtime.

Cheers, Andreas

On 2011-07-27, at 10:21 AM, Mark Hills <Mark.Hills at framestore.com> wrote:

> We are unable to use the combination of Lustre and the cgroup memory
> controller, because of intermittent hangs when trying to close the cgroup.
>
> In a thread on LKML [1] we diagnosed the problem as a leak of page
> accounting or resources.
>
> Memory pages are charged to the cgroup, but the cgroup is unable to
> un-charge them, and so it spins. This suggests that, perhaps, at least one
> page gets allocated but not placed on the LRU.
>
> Using the NFS client, via a gateway, has never shown this problem.
>
> I'm digging in the client code, but I really need some pointers, and I'm
> hampered by being unable to find a reproducible test case. Any ideas?
>
> Our system is a Lustre 1.8.6 server, with clients on Linux 2.6.32 and
> Lustre 1.8.5.
>
> Thanks
>
> [1] https://lkml.org/lkml/2010/9/9/534
>
> -- 
> Mark
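To exercise several of those I/O paths in one run, a small test program along these lines could be run inside the cgroup. It is an untested sketch covering pwrite, an mmap store and an O_DIRECT write (not splice); the mount point and file name are placeholders.

/* Untested sketch: exercise several client I/O paths against one file.
 * The path /net/lustre/iotest is an example; adjust as needed. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PATH "/net/lustre/iotest"
#define SIZE (4 * 1024 * 1024)

int main(void)
{
    char *buf, *map;
    void *dbuf;
    int fd;

    /* buffered pwrite */
    fd = open(PATH, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror(PATH);
        return 1;
    }
    buf = calloc(1, SIZE);
    if (pwrite(fd, buf, SIZE, 0) != SIZE)
        perror("pwrite");

    /* store through a shared mapping */
    map = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map != MAP_FAILED) {
        memset(map, 0xaa, SIZE);
        msync(map, SIZE, MS_SYNC);
        munmap(map, SIZE);
    }
    close(fd);

    /* O_DIRECT write from a page-aligned buffer */
    fd = open(PATH, O_WRONLY | O_DIRECT);
    if (fd >= 0) {
        if (posix_memalign(&dbuf, 4096, SIZE) == 0) {
            memset(dbuf, 0x55, SIZE);
            if (write(fd, dbuf, SIZE) != SIZE)
                perror("direct write");
        }
        close(fd);
    }

    unlink(PATH);
    return 0;
}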
On Wed, 27 Jul 2011, Andreas Dilger wrote:

> Two ideas come to mind. One is that the reason you are having difficulty
> reproducing the problem is that it only happens after some fault
> condition. Possibly you need the client to do recovery to an OST and
> resend a bulk RPC, or resend due to a checksum error?

Is there an easy way to trigger some error cases like this?

> It might also be due to the application I/O type (e.g. mmap, direct I/O,
> pwrite, splice, etc).

Yes, of course. Although I didn't gather any statistics, there wasn't a clear standout application which was more affected than others.

> Possibly you can correlate reproducer cases with Lustre errors on the
> console?

Back when I tried this last year on the production system, I wasn't able to see corresponding errors. But I don't have any of this data around any more; I'd need to do some tests on the production system to capture one case.

> Lustre also has memory debugging that can be enabled, but without a
> reasonably concise reproducer it would be difficult to log/analyze so
> much data for hours of runtime.

If I am able to capture a case, is there a way to, for example, dump a list of Lustre pages still held by the client, and correlate these with the files in question?

What I am thinking is that I could stop the running processes and attempt to drain all the pages, which would hopefully leave a small number of 'bad' ones -- with the files in question I could at least help to identify the I/O type.

Thanks for your reply

-- 
Mark
On Wed, 27 Jul 2011, Andreas Dilger wrote:
[...]
> Possibly you can correlate reproducer cases with Lustre errors on the
> console?

I've managed to catch the bad state, on a clean client too -- there are no errors reported from Lustre in dmesg.

Here's the information reported by the cgroup. It seems that there's a discrepancy of two pages (the 'cache' field, and pgpgin vs. pgpgout).

The process which was in the group terminated a long time ago.

I can leave the machine in this state until tomorrow, so any suggestions for data to capture that could help trace this bug would be welcomed. Thanks.

# cd /cgroup/p25321

# echo 1 > memory.force_empty
<hangs: the bug>

# cat tasks
<none>

# cat memory.max_usage_in_bytes
1281351680

# cat memory.usage_in_bytes
8192

# cat memory.stat
cache 8192         <--- two pages
rss 0
mapped_file 0
pgpgin 396369      <--- two pages higher than pgpgout
pgpgout 396367
swap 0
inactive_anon 0
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 8388608000
hierarchical_memsw_limit 10485760000
total_cache 8192
total_rss 0
total_mapped_file 0
total_pgpgin 396369
total_pgpgout 396367
total_swap 0
total_inactive_anon 0
total_active_anon 0
total_inactive_file 0
total_active_file 0
total_unevictable 0

# echo 1 > /proc/sys/vm/drop_caches
<success>

# echo 2 > /proc/sys/vm/drop_caches
<success>

# cat memory.stat
<same as above>

-- 
Mark
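To catch this state as it develops, the 'cache' value and the pgpgin/pgpgout gap can be polled with a small helper like the rough sketch below; the cgroup path is just an example.

/* Rough sketch: poll a group's memory.stat and report the cache size
 * and the pgpgin - pgpgout gap.  The cgroup path is an example. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define STAT "/cgroup/p25321/memory.stat"

int main(void)
{
    for (;;) {
        FILE *f = fopen(STAT, "r");
        char key[64];
        long long val, cache = 0, in = 0, out = 0;

        if (f == NULL) {
            perror(STAT);
            return 1;
        }
        while (fscanf(f, "%63s %lld", key, &val) == 2) {
            if (strcmp(key, "cache") == 0)
                cache = val;
            else if (strcmp(key, "pgpgin") == 0)
                in = val;
            else if (strcmp(key, "pgpgout") == 0)
                out = val;
        }
        fclose(f);
        printf("cache=%lld bytes, pgpgin-pgpgout=%lld pages\n",
               cache, in - out);
        sleep(10);
    }
}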
On 2011-07-27, at 12:57 PM, Mark Hills wrote:

> On Wed, 27 Jul 2011, Andreas Dilger wrote:
> [...]
>> Possibly you can correlate reproducer cases with Lustre errors on the
>> console?
>
> I've managed to catch the bad state, on a clean client too -- there are no
> errors reported from Lustre in dmesg.
>
> Here's the information reported by the cgroup. It seems that there's a
> discrepancy of two pages (the 'cache' field, and pgpgin vs. pgpgout).

To dump Lustre pagecache pages use "lctl get_param llite.*.dump_page_cache", which will print the inode, page index, read/write access, and page flags.

It wouldn't hurt to dump the kernel debug log, but it is unlikely to hold anything useful.

> The process which was in the group terminated a long time ago.
>
> I can leave the machine in this state until tomorrow, so any suggestions
> for data to capture that could help trace this bug would be welcomed.
> Thanks.
>
> [...]
>
> -- 
> Mark

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
On Wed, 27 Jul 2011, Andreas Dilger wrote:

> On 2011-07-27, at 12:57 PM, Mark Hills wrote:
>> On Wed, 27 Jul 2011, Andreas Dilger wrote:
>> [...]
>>> Possibly you can correlate reproducer cases with Lustre errors on the
>>> console?
>>
>> I've managed to catch the bad state, on a clean client too -- there are no
>> errors reported from Lustre in dmesg.
>>
>> Here's the information reported by the cgroup. It seems that there's a
>> discrepancy of two pages (the 'cache' field, and pgpgin vs. pgpgout).
>
> To dump Lustre pagecache pages use "lctl get_param llite.*.dump_page_cache",
> which will print the inode, page index, read/write access, and page flags.

So I lost the previous test case, but acquired another. This time there are 147 pages of difference. But they are not listed by the lctl command, which gives an empty list.

The cgroup reports approx. 600 KiB used as 'cache' (memory.stat). Yet /proc/meminfo does not obviously account for it ('Cached' is 69540 kB in total; see below).

But what caught my attention is that the cgroup 'cache' value dropped slightly a few minutes later. The drop_caches method wasn't touching this memory, but when I put the system under memory pressure these pages were discarded and 'cache' was reduced, until eventually the cgroup un-hangs.

So what I observed is that the pages cannot be forced out of the cache -- only by memory pressure.

I did a quick test on the regular behaviour, and drop_caches normally works fine with Lustre content, both in and out of a cgroup. So these pages are 'special' in some way.

Is it possible that some pages could not be on the LRU, but would still be seen by the memory pressure codepaths?

Thanks

# cd /group/p1243

# echo 1 > memory.force_empty
<hangs>

# echo 2 > /proc/sys/vm/drop_caches

# lctl get_param llite.*.dump_page_cache
llite.beta-ffff88042b186400.dump_page_cache
gener | llap cookie origin wq du wb | page inode index count [ page flags ]

# cat memory.usage_in_bytes
602112

# cat memory.stat
cache 602112
rss 0
mapped_file 0
pgpgin 1998315
pgpgout 1998168
swap 0
inactive_anon 0
active_anon 0
inactive_file 0
active_file 0
unevictable 0
hierarchical_memory_limit 16777216000
hierarchical_memsw_limit 20971520000
total_cache 602112
total_rss 0
total_mapped_file 0
total_pgpgin 1998315
total_pgpgout 1998168
total_swap 0
total_inactive_anon 0
total_active_anon 0
total_inactive_file 0
total_active_file 0
total_unevictable 0

# cat /proc/meminfo
MemTotal: 16464728 kB
MemFree: 15875412 kB
Buffers: 256 kB
Cached: 69540 kB
SwapCached: 0 kB
Active: 59452 kB
Inactive: 87736 kB
Active(anon): 33072 kB
Inactive(anon): 61224 kB
Active(file): 26380 kB
Inactive(file): 26512 kB
Unevictable: 228 kB
Mlocked: 0 kB
SwapTotal: 16587072 kB
SwapFree: 16587072 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 77620 kB
Mapped: 26768 kB
Shmem: 16676 kB
Slab: 67120 kB
SReclaimable: 29136 kB
SUnreclaim: 37984 kB
KernelStack: 3336 kB
PageTables: 10292 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 24819436 kB
Committed_AS: 659876 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 320240 kB
VmallocChunk: 34359359884 kB
HardwareCorrupted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7488 kB
DirectMap2M: 16764928 kB

<some time later>

# cat memory.stat | grep cache
cache 581632

# echo 2 > /proc/sys/vm/drop_caches

# cat memory.stat | grep cache
cache 581632

<put system under memory pressure>

# cat memory.stat | grep cache
cache 118784

<keep going>

# cat memory.stat | grep cache
cache 0

<memory.force_empty un-hangs>

-- 
Mark
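Since only memory pressure seems to push these pages out, a crude way to apply a controlled amount of pressure is to dirty anonymous memory, e.g. with a sketch like this (the amount to dirty is given in MiB on the command line):

/* Crude sketch: apply memory pressure by touching anonymous memory in
 * 64 MiB steps, then sleep so the pages stay resident. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (64UL << 20)

int main(int argc, char **argv)
{
    unsigned long target, done = 0;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <MiB>\n", argv[0]);
        return 1;
    }
    target = strtoul(argv[1], NULL, 0) << 20;

    while (done < target) {
        char *p = malloc(CHUNK);

        if (p == NULL)
            break;
        memset(p, 1, CHUNK);    /* fault the pages in */
        done += CHUNK;
    }
    printf("dirtied %lu MiB; sleeping so the pages stay resident\n",
           done >> 20);
    pause();
    return 0;
}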
If you get another system in this hang there are some more things you could check:

  lctl get_param memused pagesused

This will print the count of all memory Lustre still thinks is allocated.

Check the slab cache allocations (/proc/slabinfo) for Lustre slab objects. Usually they are called ll_* or ldlm_* and are listed in sequence.

Enable memory allocation tracing before applying memory pressure:

  lctl set_param debug=+malloc

And then when the memory is freed dump the debug logs:

  lctl dk /tmp/debug

And grep out the "free" lines.

The other thing that may free Lustre memory is to remove the modules, but you need to keep libcfs loaded in order to be able to dump the debug log.

Cheers, Andreas

On 2011-07-28, at 7:53 AM, Mark Hills <Mark.Hills at framestore.com> wrote:

> On Wed, 27 Jul 2011, Andreas Dilger wrote:
>
>> To dump Lustre pagecache pages use "lctl get_param llite.*.dump_page_cache",
>> which will print the inode, page index, read/write access, and page flags.
>
> So I lost the previous test case, but acquired another. This time there
> are 147 pages of difference. But they are not listed by the lctl command,
> which gives an empty list.
>
> The cgroup reports approx. 600 KiB used as 'cache' (memory.stat). Yet
> /proc/meminfo does not obviously account for it.
>
> But what caught my attention is that the cgroup 'cache' value dropped
> slightly a few minutes later. The drop_caches method wasn't touching this
> memory, but when I put the system under memory pressure these pages were
> discarded and 'cache' was reduced, until eventually the cgroup un-hangs.
>
> So what I observed is that the pages cannot be forced out of the cache --
> only by memory pressure.
>
> I did a quick test on the regular behaviour, and drop_caches normally
> works fine with Lustre content, both in and out of a cgroup. So these
> pages are 'special' in some way.
>
> Is it possible that some pages could not be on the LRU, but would still be
> seen by the memory pressure codepaths?
> [...]
>
> -- 
> Mark
On Wed, Jul 27, 2011 at 07:57:57PM +0100, Mark Hills wrote:
> On Wed, 27 Jul 2011, Andreas Dilger wrote:
>> Possibly you can correlate reproducer cases with Lustre errors on the
>> console?
>
> I've managed to catch the bad state, on a clean client too -- there are no
> errors reported from Lustre in dmesg.
>
> Here's the information reported by the cgroup. It seems that there's a
> discrepancy of two pages (the 'cache' field, and pgpgin vs. pgpgout).
>
> The process which was in the group terminated a long time ago.
>
> I can leave the machine in this state until tomorrow, so any suggestions
> for data to capture that could help trace this bug would be welcomed.
> Thanks.

Maybe try

  vm.zone_reclaim_mode=0

With zone_reclaim_mode=1 (even without memcg) we saw ~infinite scanning for pages when doing Lustre I/O + memory pressure, which also hung up a core in 100% system time. The scanning can be seen with

  grep scan /proc/zoneinfo

That zone_reclaim_mode=0 helps our problem could be related to your memcg semi-missing pages, or perhaps it's a workaround for a core kernel problem with zones -- we only have Lustre so can't distinguish.

Secondly, and even more of a long shot -- I presume slab isn't accounted as part of memcg, but you could also try clearing the ldlm locks. Linux is reluctant to drop inode caches until the locks are cleared first:

  lctl set_param ldlm.namespaces.*.lru_size=clear

cheers,
robin
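To see whether those scan counters move while dropping caches or applying memory pressure, the relevant /proc/zoneinfo lines can be sampled with something as simple as this sketch (it only filters on the substring "scan", so it makes no assumption about the exact field names):

/* Rough sketch: periodically print the /proc/zoneinfo lines containing
 * "scan" with a timestamp, to show whether the scan counters move. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char line[256];

    for (;;) {
        FILE *f = fopen("/proc/zoneinfo", "r");

        if (f == NULL) {
            perror("/proc/zoneinfo");
            return 1;
        }
        printf("--- %ld\n", (long)time(NULL));
        while (fgets(line, sizeof line, f)) {
            if (strstr(line, "scan"))
                fputs(line, stdout);
        }
        fclose(f);
        fflush(stdout);
        sleep(5);
    }
}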
On Thu, 28 Jul 2011, Andreas Dilger wrote:

> If you get another system in this hang there are some more things you
> could check:
>
>   lctl get_param memused pagesused
>
> This will print the count of all memory Lustre still thinks is
> allocated.
>
> Check the slab cache allocations (/proc/slabinfo) for Lustre slab
> objects. Usually they are called ll_* or ldlm_* and are listed in
> sequence.
>
> Enable memory allocation tracing before applying memory pressure:
>
>   lctl set_param debug=+malloc
>
> And then when the memory is freed dump the debug logs:
>
>   lctl dk /tmp/debug
>
> And grep out the "free" lines.

I followed these steps. Neither /proc/slabinfo nor Lustre's own logs show activity at the point where the pages are forced out due to memory pressure. (There's a certain amount of periodic noise in the debug logs, but looking beyond that I was able to apply memory pressure, watch the pages go out, and Lustre logged nothing.)

As before, dumping the Lustre pagecache pages shows nothing. So it looks like these aren't Lustre pages. Furthermore...

> The other thing that may free Lustre memory is to remove the modules,
> but you need to keep libcfs loaded in order to be able to dump the debug
> log.

I then unmounted the filesystem and removed all the modules, right the way down to libcfs. On completion the cgroup still reported a certain amount of cached memory, and on memory pressure this was freed -- exactly the same as with the modules loaded.

I think this reinforces the explanation above, that they aren't Lustre pages at all (though perhaps they used to be). But they are some side effect of Lustre activity; this whole problem only happens when Lustre disks are mounted and accessed. Hosts with Lustre mounted via an NFS gateway perform flawlessly for months (and they still have the Lustre modules loaded), whereas a host with Lustre mounted directly (and no other changes) fails -- it can be made to block a cgroup in 10 minutes or so.

The kernel seems to be able to handle these pages, rather than them being an inconsistency in data structures. Is there a reasonable explanation for pages like this in the kernel? One that could hopefully trace them back to their source.
Thanks

-- 
Mark

# echo 2 > /proc/sys/vm/drop_caches

# lctl get_param llite.*.dump_page_cache
llite.beta-ffff88040bdb9800.dump_page_cache
gener | llap cookie origin wq du wb | page inode index count [ page flags ]
llite.pi-ffff88040bde6000.dump_page_cache
gener | llap cookie origin wq du wb | page inode index count [ page flags ]

# cat /cgroup/d*/memory.usage_in_bytes
61440
1069056
1892352
92405760

# lctl get_param memused pagesused
lnet.memused=925199
memused=16609140
pagesused=0

# cat /proc/slabinfo | grep ll_
ll_import_cache 0 0 1248 26 8 : tunables 0 0 0 : slabdata 0 0 0
ll_obdo_cache 312 312 208 39 2 : tunables 0 0 0 : slabdata 8 8 0
ll_obd_dev_cache 45 45 5696 5 8 : tunables 0 0 0 : slabdata 9 9 0

# cat /proc/slabinfo | grep ldlm_
ldlm_locks 361 532 576 28 4 : tunables 0 0 0 : slabdata 19 19 0

<memory pressure>

# cat /cgroup/d*/memory.usage_in_bytes
0
0
0
12288

# cat /proc/slabinfo | grep ll_
ll_import_cache 0 0 1248 26 8 : tunables 0 0 0 : slabdata 0 0 0
ll_obdo_cache 312 312 208 39 2 : tunables 0 0 0 : slabdata 8 8 0
ll_obd_dev_cache 45 45 5696 5 8 : tunables 0 0 0 : slabdata 9 9 0

# cat /proc/slabinfo | grep ldlm_
ldlm_locks 361 532 576 28 4 : tunables 0 0 0 : slabdata 19 19 0

# lctl get_param memused pagesused
lnet.memused=925199
memused=16609140
pagesused=0

# umount /net/lustre

<in dmesg>
LustreError: 20169:0:(ldlm_request.c:1034:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 20169:0:(ldlm_request.c:1592:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
LustreError: 20169:0:(ldlm_request.c:1034:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 20169:0:(ldlm_request.c:1592:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108

# rmmod mgc
# rmmod mdc
# rmmod osc
# rmmod lov
# rmmod lquota
# rmmod ptlrpc
# rmmod lvfs
# rmmod lnet
# rmmod obdclass
# rmmod ksocklnd
# rmmod libcfs

# echo 2 > /proc/sys/vm/drop_caches

# cat /cgroup/d*/memory.usage_in_bytes
0
0
0
12288

<memory pressure>

# cat /cgroup/d*/memory.usage_in_bytes
0
0
0
0
On Fri, 29 Jul 2011, Robin Humble wrote:

> On Wed, Jul 27, 2011 at 07:57:57PM +0100, Mark Hills wrote:
>> [...]
>
> Maybe try
>
>   vm.zone_reclaim_mode=0
>
> With zone_reclaim_mode=1 (even without memcg) we saw ~infinite scanning
> for pages when doing Lustre I/O + memory pressure, which also hung up a
> core in 100% system time.

0 is the default on this kernel, and is what we have been using. I tried the other possibilities, without any difference.

I think it's the reclaim that's actually working; if I understand correctly, it scans the pages looking for a good match to reclaim. But cgroup force_empty relies on the LRU, and the pages cannot be found there.

> The scanning can be seen with
>
>   grep scan /proc/zoneinfo

I don't see any incrementing of these counters when the memory is freed by memory pressure.

> That zone_reclaim_mode=0 helps our problem could be related to your
> memcg semi-missing pages, or perhaps it's a workaround for a core
> kernel problem with zones -- we only have Lustre so can't distinguish.
>
> Secondly, and even more of a long shot -- I presume slab isn't accounted
> as part of memcg, but you could also try clearing the ldlm locks. Linux
> is reluctant to drop inode caches until the locks are cleared first:
>
>   lctl set_param ldlm.namespaces.*.lru_size=clear

I tried this, and it didn't remove the cache pages, or enable them to be removed.

-- 
Mark
Mark Hills
2011-Aug-04 17:24 UTC
[Lustre-devel] Bad page state after unlink (was Re: Hangs with cgroup memory controller)
On Fri, 29 Jul 2011, Mark Hills wrote:
[...]
> Hosts with Lustre mounted via an NFS gateway perform flawlessly for months
> (and they still have the Lustre modules loaded), whereas a host with Lustre
> mounted directly (and no other changes) fails -- it can be made to block a
> cgroup in 10 minutes or so.

Following this up, I seem to have a reproducible test case of a page bug, on a kernel with more debugging features.

At first it appeared with Bonnie. I looked more closely, and the bug occurs on unlink() of a file shortly after it was written to -- presumably with pages still in the local cache (pending writes?). It seems unlink is affected, but not truncate.

$ dd if=/dev/zero of=/net/lustre/file bs=4096 count=1
$ rm /net/lustre/file

BUG: Bad page state in process rm pfn:21fe6a
page:ffffea00076fa730 flags:800000000000000c count:0 mapcount:0 mapping:(null) index:1

If there is a delay of a few seconds before the rm, all is okay. Truncate works, but a subsequent rm can still fail if it is quick enough.

The task does not need to be running in a cgroup for the "Bad page" to be reported, although the kernel is built with cgroups.

I can't be certain this is the same bug seen on the production system (which uses a packaged kernel etc.), but it seems like a good start :-) It also correlates with it. It seems the production kernel glosses over this bug, but when a cgroup is used the symptoms start to show.

$ uname -a
Linux joker 2.6.32.28-mh #27 SMP PREEMPT Thu Aug 4 17:15:46 BST 2011 x86_64 x86_64 x86_64 GNU/Linux

Lustre source: Git 9302433 (beyond v1_8_6_80)

Reproduced with a 1.8.6 server (Whamcloud release), and also 1.8.3.

Thanks

-- 
Mark

BUG: Bad page state in process rm pfn:21fe6a
page:ffffea00076fa730 flags:800000000000000c count:0 mapcount:0 mapping:(null) index:1
Pid: 24724, comm: rm Tainted: G B 2.6.32.28-mh #27
Call Trace:
[<ffffffff81097ebc>] ? bad_page+0xcc/0x130
[<ffffffffa059f119>] ? ll_page_removal_cb+0x1e9/0x4d0 [lustre]
[<ffffffffa03b17a3>] ? __ldlm_handle2lock+0x93/0x3b0 [ptlrpc]
[<ffffffffa04c6522>] ? cache_remove_lock+0x182/0x268 [osc]
[<ffffffffa04ad95d>] ? osc_extent_blocking_cb+0x29d/0x2d0 [osc]
[<ffffffff81383920>] ? _spin_unlock+0x10/0x30
[<ffffffffa03b23a5>] ? ldlm_cancel_callback+0x55/0xe0 [ptlrpc]
[<ffffffffa03cb3c7>] ? ldlm_cli_cancel_local+0x67/0x340 [ptlrpc]
[<ffffffff81383920>] ? _spin_unlock+0x10/0x30
[<ffffffffa03cd65a>] ? ldlm_cancel_list+0xea/0x230 [ptlrpc]
[<ffffffffa02e1312>] ? lnet_md_unlink+0x42/0x2d0 [lnet]
[<ffffffff81383920>] ? _spin_unlock+0x10/0x30
[<ffffffffa03cd939>] ? ldlm_cancel_resource_local+0x199/0x2b0 [ptlrpc]
[<ffffffffa029a629>] ? cfs_alloc+0x89/0xf0 [libcfs]
[<ffffffffa04b0c22>] ? osc_destroy+0x112/0x720 [osc]
[<ffffffffa05608ab>] ? lov_prep_destroy_set+0x27b/0x960 [lov]
[<ffffffff8138374e>] ? _spin_lock_irqsave+0x1e/0x50
[<ffffffffa054adc4>] ? lov_destroy+0x584/0xf40 [lov]
[<ffffffffa05575ed>] ? lov_unpackmd+0x4bd/0x8e0 [lov]
[<ffffffffa05d9e98>] ? ll_objects_destroy+0x4c8/0x1820 [lustre]
[<ffffffffa03f7cbe>] ? lustre_swab_buf+0xfe/0x180 [ptlrpc]
[<ffffffff8138374e>] ? _spin_lock_irqsave+0x1e/0x50
[<ffffffffa05db940>] ? ll_unlink_generic+0x2e0/0x3a0 [lustre]
[<ffffffff810d7309>] ? vfs_unlink+0x89/0xd0
[<ffffffff810e634c>] ? mnt_want_write+0x5c/0xb0
[<ffffffff810dac89>] ? do_unlinkat+0x199/0x1d0
[<ffffffff810cc8d5>] ? sys_faccessat+0x1a5/0x1f0
[<ffffffff8100b5ab>] ? system_call_fastpath+0x16/0x1b
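The dd/rm sequence above can also be written as a self-contained C reproducer -- a sketch of the same steps (a small buffered write followed immediately by unlink), with /net/lustre/file as a placeholder path:

/* C equivalent of the dd/rm sequence above: a small buffered write
 * followed immediately by unlink.  The path is an example. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define PATH "/net/lustre/file"

int main(void)
{
    char buf[4096];
    int fd;

    memset(buf, 0, sizeof buf);

    fd = open(PATH, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror(PATH);
        return 1;
    }
    if (write(fd, buf, sizeof buf) != sizeof buf)
        perror("write");
    close(fd);

    /* no delay here: the bad page state only shows when the unlink
     * happens before the dirty pages have been flushed */
    if (unlink(PATH) < 0)
        perror(PATH);

    return 0;
}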