Michael Ulbrich
2017-Sep-13 07:54 UTC
[Ocfs2-users] Node 8 doesn't mount / Wrong slot map assignment?
Hi all,

we have a small (?) problem with a 2-node cluster on Debian 8:

Linux h1b 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u2 (2017-06-26) x86_64 GNU/Linux

ocfs2-tools 1.6.4-3

Two ocfs2 filesystems (drbd0, 600 GB with 8 slots, and drbd1, 6 TB with 6 slots) are created on top of drbd with 4k block and cluster size and the 'max-features' feature level enabled.

cluster.conf assigns sequential node numbers 1 - 8 (a sketch of the layout is in the PS at the end of this mail). Nodes 1 and 2 are the hypervisors. Nodes 3, 4, 5 are VMs on node 1; nodes 6, 7, 8 are the corresponding VMs on node 2. All VMs run Debian 8 as well:

Linux srv2 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux

When mounting drbd0 in order of increasing node numbers and concurrently watching the 'hb' output from debugfs.ocfs2 we get a clean slot map (?):

hb
	node: node  seq               generation        checksum
	   1:    1  0000000059b8d94a  fa60f0d8423590d9  edec9643
	   2:    2  0000000059b8d94c  aca059df4670f467  994e3458
	   3:    3  0000000059b8d949  f03dc9ba8f27582c  d4473fc2
	   4:    4  0000000059b8d94b  df5bbdb756e757f8  12a198eb
	   5:    5  0000000059b8d94a  1af81d94a7cb681b  91fba906
	   6:    6  0000000059b8d94b  104538f30cdb35fa  8713e798
	   7:    7  0000000059b8d94b  195658c9fb8ca7f9  5e54edf6
	   8:    8  0000000059b8d949  dc6bfb46b9cf1ac3  de7a8757

Device drbd1, in contrast, yields the following table after mounting on nodes 1 and 2:

hb
	node: node  seq               generation        checksum
	   8:    1  0000000059b8d9ba  73a63eb550a33095  f4e074d1
	  16:    2  0000000059b8d9b9  5c7504c05637983e  07d696ec

Proceeding with the drbd1 mounts on nodes 3, 5, 6 leads us to:

hb
	node: node  seq               generation        checksum
	   3:    3  0000000059b8da3b  9443b4b209b16175  f2cc87ec
	   5:    5  0000000059b8da3c  4b742f709377466f  3ac41cf3
	   6:    6  0000000059b8da3b  d96e2de0a55514f6  335a4d90
	   8:    1  0000000059b8da3c  73a63eb550a33095  2312c1c4
	  16:    2  0000000059b8da3d  5c7504c05637983e  659571a1

The problem arises when trying to mount on node 8, since its slot is already occupied by node 1:

kern.log node 1:

(o2hb-0AEE381A14,50990,4):o2hb_check_own_slot:582 ERROR: Another node is heartbeating on device (drbd1): expected(1:0x18acf7b0b3e5544c, 0x59b8445c), ondisk(8:0xb91302db72a65364, 0x59b8445b)

kern.log node 8:

ocfs2: Mounting device (254,16) on (node 8, slot 7) with ordered data mode.
(o2hb-0AEE381A14,518,1):o2hb_check_own_slot:582 ERROR: Another node is heartbeating on device (vdc): expected(8:0x18acf7b0b3e5544c, 0x59b8445c), ondisk(1:0x18acf7b0b3e5544c, 0x59b8445c)

This can be "fixed" by exchanging node numbers 1 and 8 in cluster.conf. Node 8 is then assigned slot 8, node 2 stays in slot 16, and nodes 3 to 7 come up as expected. There is no node 16 configured, so there is no conflict. But since we have experienced some other, so far unexplained, instabilities with this ocfs2 device / system further down the road, we decided to track down and fix this issue first.

Somehow the failure is reminiscent of a bit-shift or masking problem:

1 << 3 = 8
2 << 3 = 16

But then again - what do I know ...

Tried so far:

A. Create the offending file system with 8 slots instead of 6 -> same issue.
B. Set features to 'default' (disables the 'extended-slotmap' feature) -> same issue.

We would very much appreciate any comments on this. Has anything similar been experienced before? Are we completely missing something important here?

If there is already a fix out for this, any pointers (source files / commits) to where to look would be greatly appreciated.

Thanks in advance + Best regards ... Michael U.
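PS: In case the exact cluster.conf layout matters: the node stanzas follow the usual o2cb format, roughly as sketched below. The IP addresses, port and cluster name are placeholders here, not our real values, and nodes 3 - 8 continue in the same pattern.

cluster:
	node_count = 8
	name = ocfs2cluster

node:
	ip_port = 7777
	ip_address = 10.0.0.1
	number = 1
	name = h1a
	cluster = ocfs2cluster

node:
	ip_port = 7777
	ip_address = 10.0.0.2
	number = 2
	name = h1b
	cluster = ocfs2cluster

[... node stanzas 3 - 8 analogous ...]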
Michael Ulbrich
2017-Sep-14 18:58 UTC
[Ocfs2-users] Node 8 doesn't mount / Wrong slot map assignment?
Hi again,

I made some progress with debugging the situation. To recap, there are 2 ocfs2 file systems:

/dev/drbd0 -> lvm -> RAID1 from 2 x 600 GB SAS disks
/dev/drbd1 -> lvm -> RAID1 from 2 x 6 TB NL (Near-Line) SAS disks

This is configured identically on 2 Dell R530 servers (node 1 + 2 as hypervisors). Disks are connected via a PERC H730 Mini (Linux kernel driver: megaraid_sas ver. 06.811.02.00-rc1). drbd has a private GigE link for replication traffic. Both hypervisors run 3 virtual machines each.

/dev/drbd0 works as expected as long as it is allocated on the 600 GB RAID1. If it is moved to the large 6 TB RAID1 device, its behaviour becomes identical to that of /dev/drbd1.

As described in my previous post, there is an unusual slot (?) numbering which prevents the mount of the ocfs2 file system /dev/drbd1 on node 8. As a quick fix we could swap node numbers 1 <-> 8 in cluster.conf, but this does not address the underlying problem, as we will soon see. In deliberately formatted form the list of nodes looks as follows:

node (number = 8, name = h1a)  - Hypervisor
node (number = 2, name = h1b)  - Hypervisor
node (number = 3, name = web1) - Guest 1 on h1a
node (number = 4, name = db1)  - Guest 2 on h1a
node (number = 5, name = srv1) - Guest 3 on h1a
node (number = 6, name = web2) - Guest 4 on h1b
node (number = 7, name = db2)  - Guest 5 on h1b
node (number = 1, name = srv2) - Guest 6 on h1b

Now node 8 is the first (hypervisor) node to mount /dev/drbd1, which leads to ('watch -d -n 1 "echo \"hb\" | debugfs.ocfs2 -n /dev/drbd1"'):

hb
	node: node  seq               generation        checksum
	  64:    8  0000000059b8d9ba  73a63eb550a33095  f4e074d1

Node 2 is the second (hypervisor) node to mount:

hb
	node: node  seq               generation        checksum
	  16:    2  0000000059b8d9b9  5c7504c05637983e  07d696ec
	  64:    8  0000000059b8d9ba  73a63eb550a33095  f4e074d1

Again we see the strange "* 8" or "shift left 3" relationship between the "node:" and "node" columns. Now the guests are brought up and mount the file system in the order 3, 5, 6, 1 (I don't have the actual seq / gen values, so from memory):

hb
	node: node  seq               generation        checksum
	   1:    1  xxxxxxxxxxxxxxxx  xxxxxxxxxxxxxxxx  xxxxxxxx
	   3:    3  xxxxxxxxxxxxxxxx  xxxxxxxxxxxxxxxx  xxxxxxxx
	   5:    5  xxxxxxxxxxxxxxxx  xxxxxxxxxxxxxxxx  xxxxxxxx
	   6:    6  xxxxxxxxxxxxxxxx  xxxxxxxxxxxxxxxx  xxxxxxxx
	  16:    2  0000000059b8d9b9  5c7504c05637983e  07d696ec
	  64:    8  0000000059b8d9ba  73a63eb550a33095  f4e074d1

Please note that the virtual machines get assigned the corresponding "node:" = "node" values as expected.

Now we went a step further and enabled tracing: "debugfs.ocfs2 -l HEARTBEAT allow". This periodically logs messages from the heartbeat threads of the individual file systems.
For the file system /dev/drbd1 we get on the hypervisors:

(o2hb-3B0327532D,32784,3):o2hb_check_slot:849 Slot 1 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 1544
(o2hb-3B0327532D,32784,3):o2hb_check_slot:849 Slot 2 gen 0x98be08e71122efed cksum 0x33a84ac0 seq 1505346907 last 1505346907 changed 1 equal 0
(o2hb-3B0327532D,32784,3):o2hb_check_slot:849 Slot 3 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 1544
(o2hb-3B0327532D,32784,3):o2hb_check_slot:849 Slot 4 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 1544
(o2hb-3B0327532D,32784,3):o2hb_check_slot:849 Slot 5 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 1544
(o2hb-3B0327532D,32784,3):o2hb_check_slot:849 Slot 6 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 1544
(o2hb-3B0327532D,32784,3):o2hb_check_slot:849 Slot 7 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 1544
(o2hb-3B0327532D,32784,3):o2hb_check_slot:849 Slot 8 gen 0x551934cc4ba0b1bf cksum 0xf606e2be seq 1505346907 last 1505346907 changed 1 equal 0

We only see the hypervisors heartbeating in slots 2 and 8, although 4 additional guests have also mounted the same file system.

Tracing the ocfs2 heartbeat on one of the guests (web1) gives the following:

(o2hb-3B0327532D,514,0):o2hb_check_slot:849 Slot 1 gen 0xd1f96dee2509bc73 cksum 0x1dc10931 seq 1505371587 last 1505371587 changed 1 equal 0
(o2hb-3B0327532D,514,0):o2hb_check_slot:849 Slot 2 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 13674
(o2hb-3B0327532D,514,0):o2hb_check_slot:849 Slot 3 gen 0x5d8c200c0113510f cksum 0xbfc95a14 seq 1505371590 last 1505371590 changed 1 equal 0
(o2hb-3B0327532D,514,0):o2hb_check_slot:849 Slot 4 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 13674
(o2hb-3B0327532D,514,0):o2hb_check_slot:849 Slot 5 gen 0x39a8da3bae49161b cksum 0x49b4a110 seq 1505371588 last 1505371588 changed 1 equal 0
(o2hb-3B0327532D,514,0):o2hb_check_slot:849 Slot 6 gen 0xc00a0ba3931ad15 cksum 0x92625e99 seq 1505371587 last 1505371587 changed 1 equal 0
(o2hb-3B0327532D,514,0):o2hb_check_slot:849 Slot 7 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 13674
(o2hb-3B0327532D,514,0):o2hb_check_slot:849 Slot 8 gen 0x0 cksum 0x0 seq 0 last 0 changed 0 equal 13674

Here we have the mirrored situation in that only the guests are seen heartbeating, in slots 1, 3, 5 and 6. No trace of the hypervisors in slots 2 (16) and 8 (64) ... Uhm, well ... trying to wrap my head around this ... :-)

Ok, so let's dig a little deeper. Watching the init of a mount operation with HEARTBEAT tracing enabled gives different results depending on whether the same file system is mounted on a hypervisor:

(mount.ocfs2,28264,0):o2hb_init_region_params:1729 hr_start_block = 273, hr_blocks = 255
(mount.ocfs2,28264,0):o2hb_init_region_params:1731 hr_block_bytes = 4096, *hr_block_bits = 12*
(mount.ocfs2,28264,0):o2hb_init_region_params:1732 hr_timeout_ms = 2000
(mount.ocfs2,28264,0):o2hb_init_region_params:1733 dead threshold = 31
(mount.ocfs2,28264,0):o2hb_map_slot_data:1764 Going to require 255 pages to cover 255 blocks at 1 blocks per page
(o2hb-C27AC49D2B,28265,10):o2hb_thread:1221 hb thread running ...
or on a guest:

(mount.ocfs2,3505,1):o2hb_init_region_params:1729 hr_start_block = 2184, hr_blocks = 255
(mount.ocfs2,3505,1):o2hb_init_region_params:1731 hr_block_bytes = 512, *hr_block_bits = 9*
(mount.ocfs2,3505,1):o2hb_init_region_params:1732 hr_timeout_ms = 2000
(mount.ocfs2,3505,1):o2hb_init_region_params:1733 dead threshold = 31
(mount.ocfs2,3505,1):o2hb_map_slot_data:1764 Going to require 32 pages to cover 255 blocks at 8 blocks per page
(o2hb-C27AC49D2B,3506,0):o2hb_thread:1221 hb thread running

So on the hypervisors we have     -> hr_block_bytes = 4096, hr_block_bits = 12
and on the virtual machines       -> hr_block_bytes = 512,  hr_block_bits = 9

There, by the way, we find the factor of 8 in the different block sizes and the 3-bit difference in hr_block_bits (see the PS below for a quick back-of-the-envelope check). Is this ok?

From my limited understanding I would have expected that the nodes mounting the shared file system share the heartbeat system file and, inside of that, a common data structure (heartbeat region). But then the nodes should have a common understanding of the size of this structure, right? Here it looks as if the hypervisors are interacting with a "large" hr structure while the guests do the same on a "small" heartbeat region. That would explain why the heartbeat threads of hypervisors and guests do not "see" each other, as described above.

Or are the heartbeat regions in-memory structures of different size which get translated into a common "disk" structure when being written to the hb system file?

It would be great if you could give me a little guidance here. If it actually is a bug I'm willing to work on this further. If I'm heading in the wrong direction, just give me a short note and perhaps a hint about what's wrong with my setup or which ocfs2 version might fix this problem.

Thanks a lot + Regards from Berlin ... Michael U.
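PS: Playing with the numbers from the traces above, the 4096- vs. 512-byte block sizes reproduce the "<< 3" pattern from my first mail exactly. Below is a minimal sketch in plain C (not the actual o2hb code): it simply assumes each slot's heartbeat block sits at hr_start_block + slot, in units of hr_block_bytes, which is what the traced values suggest - note that hr_start_block also differs by the same factor of 8 (273 vs. 2184).

#include <stdio.h>

int main(void)
{
	/* values taken from the HEARTBEAT traces above */
	unsigned long long hyp_start = 273,    hyp_block_bytes = 4096;  /* hypervisor view */
	unsigned long long guest_start = 2184, guest_block_bytes = 512; /* guest view      */

	for (unsigned int node = 1; node <= 8; node++) {
		/* byte offset at which the hypervisor would read/write this
		 * node's heartbeat block, under the assumption above */
		unsigned long long byte_off = (hyp_start + node) * hyp_block_bytes;

		/* the same byte offset expressed as a slot index in the guest's
		 * 512-byte view of the region */
		unsigned long long guest_slot = byte_off / guest_block_bytes - guest_start;

		printf("node %u -> shows up as slot %llu from the guest side (node << 3)\n",
		       node, guest_slot);
	}
	return 0;
}

This prints 8, 16, 24, ..., 64 for nodes 1 - 8, which matches the shifted "node:" values in the hb output, so a mismatch in hr_block_bytes between hypervisors and guests would fully explain what we are seeing.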