Three bugfixes that can be pushed during 2.6.39. If ack-ed, I will cc the first patch to stable. Affects 2.6.37+. Sunil
Sunil Mushran
2011-Apr-05 22:21 UTC
[Ocfs2-devel] [PATCH 1/3] ocfs2/dlm: Use negotiated o2dlm protocol version
Patch fixes a bug in the o2dlm protocol negotiation in that it is using the builtin version rather than the negotiated version during the domain join. This causes join errors when a node having kernel >= 2.6.37 joins a cluster with nodes having kernels < 2.6.37. This only affects the o2cb cluster stack. Signed-off-by: Sunil Mushran <sunil.mushran at oracle.com> Reported-by: Jacek Stepniewski <Jacek.Stepniewski at agora.pl> --- fs/ocfs2/dlm/dlmdomain.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c index 7540a49..3b179d6 100644 --- a/fs/ocfs2/dlm/dlmdomain.c +++ b/fs/ocfs2/dlm/dlmdomain.c @@ -1614,7 +1614,8 @@ static int dlm_try_to_join_domain(struct dlm_ctxt *dlm) spin_unlock(&dlm->spinlock); /* Support for global heartbeat and node info was added in 1.1 */ - if (dlm_protocol.pv_major > 1 || dlm_protocol.pv_minor > 0) { + if (dlm->dlm_locking_proto.pv_major > 1 || + dlm->dlm_locking_proto.pv_minor > 0) { status = dlm_send_nodeinfo(dlm, ctxt->yes_resp_map); if (status) { mlog_errno(status); -- 1.7.1
Sunil Mushran
2011-Apr-05 22:21 UTC
[Ocfs2-devel] [PATCH 2/3] ocfs2/cluster: Increase the live threshold
Double the live threshold for the first region in the global heartbeat mode. The default, 2, is bare minimum. Increasing it will affect all mounts in the local heartbeat mode. Instead we increase it only for the global heartbeat mode and that too only the first region. This is only to increase the margin of safety. Addresses internal Oracle bug#10635585. Signed-off-by: Sunil Mushran <sunil.mushran at oracle.com> --- fs/ocfs2/cluster/heartbeat.c | 13 ++++++++++++- 1 files changed, 12 insertions(+), 1 deletions(-) diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c index 2461eb3..ec50121 100644 --- a/fs/ocfs2/cluster/heartbeat.c +++ b/fs/ocfs2/cluster/heartbeat.c @@ -1690,6 +1690,7 @@ static ssize_t o2hb_region_dev_write(struct o2hb_region *reg, struct file *filp = NULL; struct inode *inode = NULL; ssize_t ret = -EINVAL; + int live_threshold; if (reg->hr_bdev) goto out; @@ -1766,8 +1767,18 @@ static ssize_t o2hb_region_dev_write(struct o2hb_region *reg, * A node is considered live after it has beat LIVE_THRESHOLD * times. We're not steady until we've given them a chance * _after_ our first read. + * The default threshold is bare minimum so as to limit the delay + * during mounts. For global heartbeat, the threshold doubled for the + * first region. */ - atomic_set(®->hr_steady_iterations, O2HB_LIVE_THRESHOLD + 1); + live_threshold = O2HB_LIVE_THRESHOLD; + if (o2hb_global_heartbeat_active()) { + spin_lock(&o2hb_live_lock); + if (o2hb_pop_count(&o2hb_region_bitmap, O2NM_MAX_REGIONS) == 1) + live_threshold <<= 1; + spin_unlock(&o2hb_live_lock); + } + atomic_set(®->hr_steady_iterations, live_threshold + 1); hb_task = kthread_run(o2hb_thread, reg, "o2hb-%s", reg->hr_item.ci_name); -- 1.7.1
Sunil Mushran
2011-Apr-05 22:21 UTC
[Ocfs2-devel] [PATCH 3/3] ocfs2/cluster: Heartbeat mismatch message improved
If o2hb finds unexpected values in the heartbeat slot, it prints a message "ERROR: Device "dm-6": another node is heartbeating in our slot!" This message could be misleading. This patch adds two more messages to help users better diagnose the problem. Signed-off-by: Sunil Mushran <sunil.mushran at oracle.com> --- fs/ocfs2/cluster/heartbeat.c | 48 +++++++++++++++++++++++++++-------------- 1 files changed, 31 insertions(+), 17 deletions(-) diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c index ec50121..1dd2716 100644 --- a/fs/ocfs2/cluster/heartbeat.c +++ b/fs/ocfs2/cluster/heartbeat.c @@ -539,25 +539,41 @@ static int o2hb_verify_crc(struct o2hb_region *reg, /* We want to make sure that nobody is heartbeating on top of us -- * this will help detect an invalid configuration. */ -static int o2hb_check_last_timestamp(struct o2hb_region *reg) +static void o2hb_check_last_timestamp(struct o2hb_region *reg) { - int node_num, ret; struct o2hb_disk_slot *slot; struct o2hb_disk_heartbeat_block *hb_block; + char *errstr; - node_num = o2nm_this_node(); - - ret = 1; - slot = ®->hr_slots[node_num]; + slot = ®->hr_slots[o2nm_this_node()]; /* Don't check on our 1st timestamp */ - if (slot->ds_last_time) { - hb_block = slot->ds_raw_block; + if (!slot->ds_last_time) + return; - if (le64_to_cpu(hb_block->hb_seq) != slot->ds_last_time) - ret = 0; - } + hb_block = slot->ds_raw_block; + if (le64_to_cpu(hb_block->hb_seq) == slot->ds_last_time && + le64_to_cpu(hb_block->hb_generation) == slot->ds_last_generation && + hb_block->hb_node == slot->ds_node_num) + return; - return ret; +#define ERRSTR1 "Another node is heartbeating on device" +#define ERRSTR2 "Heartbeat generation mismatch on device" +#define ERRSTR3 "Heartbeat sequence mismatch on device" + + if (hb_block->hb_node != slot->ds_node_num) + errstr = ERRSTR1; + else if (le64_to_cpu(hb_block->hb_generation) !+ slot->ds_last_generation) + errstr = ERRSTR2; + else + errstr = ERRSTR3; + + mlog(ML_ERROR, "%s (%s): expected(%u:0x%llx, 0x%llx), " + "ondisk(%u:0x%llx, 0x%llx)\n", errstr, reg->hr_dev_name, + slot->ds_node_num, (unsigned long long)slot->ds_last_generation, + (unsigned long long)slot->ds_last_time, hb_block->hb_node, + (unsigned long long)le64_to_cpu(hb_block->hb_generation), + (unsigned long long)le64_to_cpu(hb_block->hb_seq)); } static inline void o2hb_prepare_block(struct o2hb_region *reg, @@ -983,9 +999,7 @@ static int o2hb_do_disk_heartbeat(struct o2hb_region *reg) /* With an up to date view of the slots, we can check that no * other node has been improperly configured to heartbeat in * our slot. */ - if (!o2hb_check_last_timestamp(reg)) - mlog(ML_ERROR, "Device \"%s\": another node is heartbeating " - "in our slot!\n", reg->hr_dev_name); + o2hb_check_last_timestamp(reg); /* fill in the proper info for our next heartbeat */ o2hb_prepare_block(reg, reg->hr_generation); @@ -999,8 +1013,8 @@ static int o2hb_do_disk_heartbeat(struct o2hb_region *reg) } i = -1; - while((i = find_next_bit(configured_nodes, O2NM_MAX_NODES, i + 1)) < O2NM_MAX_NODES) { - + while((i = find_next_bit(configured_nodes, + O2NM_MAX_NODES, i + 1)) < O2NM_MAX_NODES) { change |= o2hb_check_slot(reg, ®->hr_slots[i]); } -- 1.7.1
Mark Fasheh
2011-Apr-19 17:44 UTC
[Ocfs2-devel] [PATCH 1/3] ocfs2/dlm: Use negotiated o2dlm protocol version
On Tue, Apr 05, 2011 at 03:21:07PM -0700, Sunil Mushran wrote:> Patch fixes a bug in the o2dlm protocol negotiation in that it is using > the builtin version rather than the negotiated version during the domain > join. This causes join errors when a node having kernel >= 2.6.37 joins > a cluster with nodes having kernels < 2.6.37. > > This only affects the o2cb cluster stack. > > Signed-off-by: Sunil Mushran <sunil.mushran at oracle.com> > Reported-by: Jacek Stepniewski <Jacek.Stepniewski at agora.pl>Signed-off-by: Mark Fasheh <mfasheh at suse.com> -- Mark Fasheh
Mark Fasheh
2011-Apr-21 21:50 UTC
[Ocfs2-devel] [PATCH 3/3] ocfs2/cluster: Heartbeat mismatch message improved
On Tue, Apr 05, 2011 at 03:21:09PM -0700, Sunil Mushran wrote:> If o2hb finds unexpected values in the heartbeat slot, it prints a message > "ERROR: Device "dm-6": another node is heartbeating in our slot!" > > This message could be misleading. This patch adds two more messages to > help users better diagnose the problem. > > Signed-off-by: Sunil Mushran <sunil.mushran at oracle.com>Signed-off-by: Mark Fasheh <mfasheh at suse.com> -- Mark Fasheh