Charlie Smurthwaite
2019-Sep-25 18:01 UTC
[Ocfs2-users] OCFS2 filesystem hangs with "dirty" locks on internal files
Hi,

I have been trying for some time to get to the bottom of a problem that is causing an OCFS2 filesystem to hang (increasing numbers of file operations hang until the filesystem becomes unusable) seemingly at random, approximately once per day.

I have got as far as dumping the busy locks and DLM lock state on all nodes when this occurs.

In summary, it appears that all nodes are waiting on locks for shared internal data files, specifically:

debugfs: encode //global_bitmap
M000000000000000000000baa25b2b2
debugfs: encode //aquota.user
M000000000000000000000caa25b2b2
debugfs: encode //aquota.group
M000000000000000000000daa25b2b2

The DLM status of these 3 files is pasted below. It seems that all nodes are waiting for access to the global bitmap (the bottom entry in the DLM output below) but nobody is able to obtain this lock. Is there an obvious cause of this situation?

I'd be happy to provide any further information that may help. Sorry if I'm not understanding the situation very well yet.

Thanks!
Charlie

Lockres: M000000000000000000000caa25b2b2   Owner: 3   State: 0x8 Dirty
Last Used: 0   ASTs Reserved: 0   Inflight: 0   Migration Pending: No
Refs: 12   Locks: 9   On Lists: Dirty
Reference Map: 0 1 2 4 5 6 7 8
Lock-Queue  Node  Level  Conv  Cookie  Refs  AST  BAST  Pending-Action
Granted     0     NL     -1    0:5     2     No   No    None
Converting  1     NL     EX    1:10    2     No   No    None
Converting  5     NL     EX    5:8     2     No   No    None
Converting  6     NL     EX    6:4     2     No   No    None
Converting  2     NL     EX    2:7     2     No   No    None
Converting  7     NL     EX    7:9     2     No   No    None
Converting  8     NL     EX    8:11    2     No   No    None
Converting  4     NL     EX    4:6     2     No   No    None
Converting  3     NL     EX    3:27    2     No   No    None
--
Lockres: M000000000000000000000daa25b2b2   Owner: 3   State: 0x8 Dirty
Last Used: 0   ASTs Reserved: 0   Inflight: 0   Migration Pending: No
Refs: 12   Locks: 9   On Lists: Dirty
Reference Map: 0 1 2 4 5 6 7 8
Lock-Queue  Node  Level  Conv  Cookie  Refs  AST  BAST  Pending-Action
Granted     0     NL     -1    0:8     2     No   No    None
Converting  1     NL     EX    1:13    2     No   No    None
Converting  5     NL     EX    5:11    2     No   No    None
Converting  6     NL     EX    6:7     2     No   No    None
Converting  2     NL     EX    2:10    2     No   No    None
Converting  7     NL     EX    7:12    2     No   No    None
Converting  8     NL     EX    8:14    2     No   No    None
Converting  4     NL     EX    4:9     2     No   No    None
Converting  3     NL     EX    3:30    2     No   No    None
--
Lockres: M000000000000000000000baa25b2b2   Owner: 3   State: 0x8 Dirty
Last Used: 0   ASTs Reserved: 0   Inflight: 0   Migration Pending: No
Refs: 12   Locks: 9   On Lists: Dirty
Reference Map: 0 1 2 4 5 6 7 8
Lock-Queue  Node  Level  Conv  Cookie  Refs  AST  BAST  Pending-Action
Converting  4     NL     EX    4:39    2     No   No    None
Converting  8     NL     PR    8:39    2     No   No    None
Converting  0     NL     PR    0:30    2     No   No    None
Converting  6     NL     PR    6:39    2     No   No    None
Converting  1     NL     PR    1:39    2     No   No    None
Converting  3     NL     EX    3:33    2     No   No    None
Converting  7     NL     EX    7:39    2     No   No    None
Converting  2     NL     EX    2:39    2     No   No    None
Converting  5     NL     PR    5:39    2     No   No    None
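A dump like the one above can be summarized mechanically when it has to be collected from many nodes. The following is only a minimal sketch, not part of any OCFS2 tooling: the function names `summarize` and `looks_stuck` are hypothetical, the sample is an excerpt of the global bitmap resource from this message, and the "stuck" test is a rough heuristic (pending conversions while no node holds the lock above NL), not a diagnosis.

```python
import re

# Excerpt of the global bitmap lock resource from the dump in this thread.
SAMPLE = """
Lockres: M000000000000000000000baa25b2b2 Owner: 3 State: 0x8 Dirty
Lock-Queue Node Level Conv Cookie Refs AST BAST Pending-Action
Converting 4 NL EX 4:39 2 No No None
Converting 8 NL PR 8:39 2 No No None
Converting 3 NL EX 3:33 2 No No None
"""

# Matches either a "Lockres: ... Owner: N" header or one lock row.
# Token-based, so it does not depend on how the lines are wrapped.
TOKEN = re.compile(
    r"Lockres:\s+(?P<name>\S+)\s+Owner:\s+(?P<owner>\d+)"
    r"|(?P<queue>Granted|Converting|Blocked)\s+(?P<node>\d+)"
    r"\s+(?P<level>\S+)\s+(?P<conv>\S+)\s+\d+:\d+"
)


def summarize(dump):
    """Group lock rows by lock resource name and by queue."""
    resources = {}
    current = None
    for m in TOKEN.finditer(dump):
        if m.group("name"):
            current = {"owner": int(m.group("owner")),
                       "Granted": [], "Converting": [], "Blocked": []}
            resources[m.group("name")] = current
        elif current is not None:
            entry = (int(m.group("node")), m.group("level"), m.group("conv"))
            current[m.group("queue")].append(entry)
    return resources


def looks_stuck(res):
    """Heuristic: requests are queued but nothing is granted above NL."""
    held = [e for e in res["Granted"] if e[1] != "NL"]
    waiting = res["Converting"] + res["Blocked"]
    return not held and bool(waiting)


if __name__ == "__main__":
    for name, res in summarize(SAMPLE).items():
        nodes = [node for node, _, _ in res["Converting"]]
        print(name, "owner", res["owner"], "converting nodes", nodes,
              "stuck?", looks_stuck(res))
```

Run against the full dump, this would flag all three resources, since each has only NL grants (or none) and a full queue of pending conversions.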
Charlie Smurthwaite
2019-Sep-28 13:58 UTC
[Ocfs2-users] OCFS2 filesystem hangs with "dirty" locks on internal files
On 25/09/2019 19:01, Charlie Smurthwaite wrote:
> Hi,
>
> I have been trying for some time to get to the bottom of a problem that
> is causing an OCFS2 filesystem to hang (increasing numbers of file
> operations hang until the filesystem becomes unusable) seemingly at
> random, approximately once per day.
>

Hi,

Just wanted to follow up my own question as I believe I have found the solution.

After reading some threads about similar issues, I noticed that my LocalAlloc size was rather large:

LocalAlloc => State: 1 Descriptor: 0 Size: 27136 bits Default: 27136 bits

I compared this to the contiguous free blocks on my filesystem and determined that there were almost no such blocks available, despite the disk being at only 50% space utilization. I did not set this manually, but it seems to be the default for my filesystem / kernel.

Adding localalloc=16 to my mount options appears to have set a localalloc size of 4096 bits (I don't fully understand what this number means) and has seemingly resolved my problem, significantly reducing disk IO utilization, process IO wait time, and (so far) recurrences of the crash.

LocalAlloc => State: 1 Descriptor: 0 Size: 4096 bits Default: 4096 bits

I have some follow-up questions that are hopefully a bit simpler than my original question:

1) Why is the default localalloc size so large? This seems much larger than any default I have seen documented.
2) Does this mean that my filesystem is seriously fragmented? If so, is there any online tool that can fix this?
3) Is the setting I have chosen reasonable, and is this likely to prevent a recurrence of the problem for the foreseeable future?

Thanks!
Charlie
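On what the "bits" figure may mean: the arithmetic below is a sketch under two assumptions that are not confirmed in this thread. The first is that the localalloc mount option is given in MB (as the kernel's OCFS2 documentation describes it). The second is that each bit of the local alloc window tracks one cluster. If both hold, the two numbers reported above imply a 4 KB cluster size and a default window of roughly 106 MB. The helper names are hypothetical.

```python
MIB = 1024 * 1024


def implied_cluster_bytes(localalloc_mb, window_bits):
    """Cluster size implied by a window of `window_bits` clusters
    covering `localalloc_mb` MB (assumes one bit per cluster)."""
    return localalloc_mb * MIB // window_bits


def window_mb(window_bits, cluster_bytes):
    """Size in MB of a local alloc window of `window_bits` clusters."""
    return window_bits * cluster_bytes / MIB


# Values reported in the message above.
cluster = implied_cluster_bytes(16, 4096)   # localalloc=16 gave 4096 bits
default_window = window_mb(27136, cluster)  # default was 27136 bits

print("implied cluster size:", cluster, "bytes")
print("default window:", default_window, "MB")
print("new window:", window_mb(4096, cluster), "MB")
```

Under those assumptions, each node was previously trying to reserve a contiguous run of about 106 MB from the global bitmap, which would be consistent with the contention on //global_bitmap on a fragmented volume.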
Gang He
2019-Oct-14 05:34 UTC
[Ocfs2-users] OCFS2 filesystem hangs with "dirty" locks on internal files
Hi Charlie,

Which Linux kernel version and distribution are you using?
Do you have the stacks of the hung processes?
Can you reproduce this hang reliably? If yes, please provide the detailed steps.
There is a DLM lock hang detection tool that you can use while the file system is stuck:
https://github.com/ganghe/o2locktop

Thanks
Gang