Mark Fasheh
2017-Aug-07 20:19 UTC
[Ocfs2-devel] [PATCH] ocfs2: re-queue AST or BAST if sending is failed to improve the reliability
On Mon, Aug 7, 2017 at 2:13 AM, Changwei Ge <ge.changwei at h3c.com> wrote:
> Hi,
>
> In current code, while flushing AST, we don't handle an exception that
> sending AST or BAST is failed.
> But it is indeed possible that AST or BAST is lost due to some kind of
> networks fault.
>
> If above exception happens, the requesting node will never obtain an AST
> back, hence, it will never acquire the lock or abort current locking.
>
> With this patch, I'd like to fix this issue by re-queuing the AST or
> BAST if sending is failed due to networks fault.
>
> And the re-queuing AST or BAST will be dropped if the requesting node is
> dead!
>
> It will improve the reliability a lot.

Can you detail your testing? Code-wise this looks fine to me but as
you note, this is a pretty hard to hit corner case so it'd be nice to
hear that you were able to exercise it.

Thanks,
--Mark
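The behaviour described in the quoted patch summary (re-queue an AST or BAST whose send failed, drop it if the requesting node is dead) can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions, not the kernel code from the patch: `flush_asts`, `send` and `is_node_alive` are hypothetical names, and a message is modelled as a plain `(node, kind)` tuple.

```python
from collections import deque


def flush_asts(pending, send, is_node_alive):
    """Try to deliver every queued (node, kind) message once.

    send(node, kind) is assumed to return True on success and False on a
    network failure. A failed message is kept for the next flush unless
    its target node is dead, in which case it is dropped.
    Returns the queue of messages that still need to be sent.
    """
    retry = deque()
    while pending:
        node, kind = pending.popleft()
        if send(node, kind):
            continue                    # delivered, nothing more to do
        if is_node_alive(node):
            retry.append((node, kind))  # transient fault: re-queue
        # target node is dead: drop the message instead of re-queuing
    return retry
```

Calling `flush_asts` again with the returned queue retries only the messages whose targets are still alive, so a dead node cannot keep the queue from draining.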
Changwei Ge
2017-Aug-08 10:56 UTC
[Ocfs2-devel] [PATCH] ocfs2: re-queue AST or BAST if sending is failed to improve the reliability
On 2017/8/8 4:20, Mark Fasheh wrote:
> On Mon, Aug 7, 2017 at 2:13 AM, Changwei Ge <ge.changwei at h3c.com> wrote:
>> Hi,
>>
>> In current code, while flushing AST, we don't handle an exception that
>> sending AST or BAST is failed.
>> But it is indeed possible that AST or BAST is lost due to some kind of
>> networks fault.
>>
>> If above exception happens, the requesting node will never obtain an AST
>> back, hence, it will never acquire the lock or abort current locking.
>>
>> With this patch, I'd like to fix this issue by re-queuing the AST or
>> BAST if sending is failed due to networks fault.
>>
>> And the re-queuing AST or BAST will be dropped if the requesting node is
>> dead!
>>
>> It will improve the reliability a lot.
> Can you detail your testing? Code-wise this looks fine to me but as
> you note, this is a pretty hard to hit corner case so it'd be nice to
> hear that you were able to exercise it.
>
> Thanks,
> --Mark

Hi Mark,

My test is quite simple to perform.
The test environment consists of 7 hosts. The Ethernet devices on 6 of
them are repeatedly brought down and up.
After several rounds of this, some file operations hang.

Running the debugfs.ocfs2 tool on NODE 2, which was the owner of lock
resource 'O000000000000000011150300000000', showed:

debugfs: dlm_locks O000000000000000011150300000000
Lockres: O000000000000000011150300000000   Owner: 2   State: 0x0
Last Used: 0   ASTs Reserved: 0   Inflight: 0   Migration Pending: No
Refs: 4   Locks: 2   On Lists: None
Reference Map: 3
 Lock-Queue  Node  Level  Conv  Cookie  Refs  AST  BAST  Pending-Action
 Granted     2     PR     -1    2:53    2     No   No    None
 Granted     3     PR     -1    3:48    2     No   No    None

That meant NODE 2 had granted the lock to NODE 3 and, as far as NODE 2
was concerned, the AST had been sent to NODE 3.

Meanwhile, running debugfs.ocfs2 on NODE 3 showed:

debugfs: dlm_locks O000000000000000011150300000000
Lockres: O000000000000000011150300000000   Owner: 2   State: 0x0
Last Used: 0   ASTs Reserved: 0   Inflight: 0   Migration Pending: No
Refs: 3   Locks: 1   On Lists: None
Reference Map:
 Lock-Queue  Node  Level  Conv  Cookie  Refs  AST  BAST  Pending-Action
 Blocked     3     PR     -1    3:48    2     No   No    None

That meant NODE 3 never received the AST that would have moved its local
lock from the blocked list to the granted list.

This result makes sense, since sending the AST failed, which can be seen
in the kernel log.

For BAST, it is more or less the same.

Thanks,
Changwei
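
The symptom reported above (the owner lists a lock as Granted while the requesting node still lists the same cookie as Blocked) can be checked mechanically when comparing dumps from two nodes. The following is a minimal sketch that assumes the `dlm_locks` column layout shown in this message, with one lock per row and the cookie in the fifth column. `lock_queues` and `lost_ast_candidates` are hypothetical helper names, not part of debugfs.ocfs2.

```python
def lock_queues(dlm_locks_output):
    """Map lock cookie -> queue name from the lock table of a dlm_locks dump."""
    queues = {}
    in_table = False
    for line in dlm_locks_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Lock-Queue":
            in_table = True              # header row: lock rows follow
            continue
        if in_table and len(fields) >= 5:
            # Columns: Lock-Queue Node Level Conv Cookie ...
            queues[fields[4]] = fields[0]
    return queues


def lost_ast_candidates(owner_output, requester_output):
    """Cookies that the owner shows as Granted but the requester as Blocked."""
    owner = lock_queues(owner_output)
    requester = lock_queues(requester_output)
    return sorted(
        cookie
        for cookie, queue in requester.items()
        if queue == "Blocked" and owner.get(cookie) == "Granted"
    )
```

Applied to the NODE 2 and NODE 3 dumps in this message, the check reports cookie 3:48, the lock whose AST was lost.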