This has been fixed for some time now; the two commits below address it.
===========================================
commit 14741472a05245ed5778aa0aec055e1f920b6ef8
Author: Srinivas Eeda <srinivas.eeda at oracle.com>
Date: Mon Mar 22 16:50:47 2010 -0700
ocfs2: Fix a race in o2dlm lockres mastery
In o2dlm, the master of a lock resource keeps a map of all interested
nodes. This prevents the master from purging the resource before an
interested node can create a lock.
A race between the mastery thread and the mastery handler allowed an
interested node to discover who the master is without informing the
master directly. This is easily fixed by holding the dlm spinlock a
little longer in the mastery handler.
Signed-off-by: Srinivas Eeda <srinivas.eeda at oracle.com>
Signed-off-by: Joel Becker <joel.becker at oracle.com>
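
To make the race concrete: the mastery handler has to look up the master
and record the asking node in the master's node map as a single atomic
step; dropping the dlm spinlock between the two opens exactly the window
the commit describes. Here is a minimal userspace sketch of the pattern
(a pthread spinlock stands in for the dlm spinlock; the structure and
function names are invented for illustration, not taken from the kernel):

/* race_sketch.c -- illustrative only; build: gcc race_sketch.c -lpthread */
#include <pthread.h>
#include <stdio.h>

/* Toy lock resource: the master keeps a bitmap of interested nodes
 * so it will not purge the resource out from under them. */
struct resource {
    unsigned long interested_nodes;   /* one bit per node */
    int master;
};

static pthread_spinlock_t res_lock;   /* stand-in for the dlm spinlock */
static struct resource res = { 0, 0 };

/* Buggy shape: the lock is dropped between learning who the master is
 * and recording our interest, so the master can purge in the window. */
static int query_master_racy(int mynode)
{
    int master;

    pthread_spin_lock(&res_lock);
    master = res.master;
    pthread_spin_unlock(&res_lock);   /* <-- race window opens here */

    pthread_spin_lock(&res_lock);
    res.interested_nodes |= 1UL << mynode;
    pthread_spin_unlock(&res_lock);
    return master;
}

/* Fixed shape: hold the lock "a little longer" so the lookup and the
 * registration happen as one atomic step. */
static int query_master_fixed(int mynode)
{
    int master;

    pthread_spin_lock(&res_lock);
    master = res.master;
    res.interested_nodes |= 1UL << mynode;   /* recorded before unlock */
    pthread_spin_unlock(&res_lock);
    return master;
}

int main(void)
{
    pthread_spin_init(&res_lock, PTHREAD_PROCESS_PRIVATE);
    (void)query_master_racy(2);        /* records node 2 ... */
    int m = query_master_fixed(3);     /* ... and node 3 */
    printf("master=%d interested map=%#lx\n", m, res.interested_nodes);
    return 0;
}
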
commit a524812b7eaa7783d7811198921100f079034e61
Author: Wengang Wang <wen.gang.wang at oracle.com>
Date: Fri Jul 30 16:14:44 2010 +0800
ocfs2/dlm: avoid incorrect bit set in refmap on recovery master
In the following situation, an incorrect bit remains set in the
refmap on the recovery master. The recovery master will later fail to
purge the lockres because of that stale refmap bit.
1) Node A no longer has any interest in lockres A, so it starts
   purging it.
2) The owner of lockres A is node B, so node A sends a deref message
   to node B.
3) At this point node B crashes. Node C becomes the recovery master
   and recovers lockres A (because its master was the dead node B).
4) Node A migrates lockres A to node C with its refbit set.
5) Node A's deref message fails because node B crashed. The failure
   is ignored, and nothing further is done for lockres A.
Normally, re-sending the deref message, this time to the recovery
master, would fix this. However, ignoring the failed deref to the
original master and simply not recovering that lockres onto the
recovery master has the same effect, and the latter is simpler.
Signed-off-by: Wengang Wang <wen.gang.wang at oracle.com>
Acked-by: Srinivas Eeda <srinivas.eeda at oracle.com>
Cc: stable at kernel.org
Signed-off-by: Joel Becker <joel.becker at oracle.com>
===========================================
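
The second commit's scenario is easiest to see with a toy model of the
refmap bookkeeping. The sketch below is a userspace analogue, not the
dlm code; refmap, dropping_ref, recover() and the node constants are
all made-up illustrative names:

/* refmap_sketch.c -- illustrative only; build: gcc refmap_sketch.c */
#include <assert.h>
#include <stdio.h>

enum { NODE_A, NODE_B, NODE_C };

/* Toy lockres: the master tracks which nodes still hold a reference.
 * dropping_ref plays the role of "a deref is already in flight". */
struct lockres {
    unsigned long refmap;   /* bit n set => node n holds a ref */
    int owner;
    int dropping_ref;
};

/* Recovery: the new master adopts the lockres and, unless it skips
 * half-dereffed resources, re-sets the migrating node's refbit. */
static void recover(struct lockres *r, int new_master, int from_node,
                    int skip_if_dropping)
{
    if (skip_if_dropping && r->dropping_ref)
        return;             /* the fix: leave no stale bit behind */
    r->owner = new_master;
    r->refmap |= 1UL << from_node;
}

int main(void)
{
    /* 1-2) node A is purging, so it sends a deref to owner node B */
    struct lockres res = { 1UL << NODE_A, NODE_B, 1 };

    /* 3-4) B dies mid-deref; C recovers, and A migrates the lockres
     * with its refbit still set (pre-fix behaviour) */
    recover(&res, NODE_C, NODE_A, 0);

    /* 5) A's deref failure is ignored, so nothing ever clears the
     * bit and C can never purge the lockres */
    printf("without fix: refmap=%#lx (stale)\n", res.refmap);

    /* with the fix, the half-dereffed lockres is simply not
     * recovered, so the recovery master never sees the bit */
    struct lockres res2 = { 0, NODE_B, 1 };
    recover(&res2, NODE_C, NODE_A, 1);
    assert(res2.refmap == 0);
    printf("with fix:    refmap=%#lx\n", res2.refmap);
    return 0;
}

The point of the fix is visible in recover(): once a deref is already in
flight, recreating the refbit on the recovery master can never be undone,
so the simplest correct behaviour is to not recover that lockres at all.
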
On 09/29/2010 06:47 AM, Charlie Sharkey wrote:
> I got the following crash on a SLES10 SP2 system; info below.
>
> Is this a known problem? It looks similar to bug #912:
>
> http://oss.oracle.com/bugzilla/show_bug.cgi?id=912
>
> version info
> ------------
>
> OCFS2 Node Manager 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build
> f922955d99ef972235bd0c1fc236c5ddbb368611)
> OCFS2 DLM 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build
> f922955d99ef972235bd0c1fc236c5ddbb368611)
> OCFS2 DLMFS 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build
> f922955d99ef972235bd0c1fc236c5ddbb368611)
>
> crash info
> ----------
>
>       KERNEL: ./vmlinux-2.6.16.60-0.42.10
>     DUMPFILE: ../n2_vmcore_20100925
>         CPUS: 8
>         DATE: Sat Sep 25 12:48:00 2010
>       UPTIME: 10 days, 04:08:44
> LOAD AVERAGE: 9.39, 9.11, 8.67
>        TASKS: 484
>     NODENAME: n2
>      RELEASE: 2.6.16.60-0.42.10-smp
>      VERSION: #1 SMP Tue Apr 27 05:11:27 UTC 2010
>      MACHINE: x86_64 (2926 Mhz)
>       MEMORY: 2.9 GB
>        PANIC: ""
>          PID: 6557
>      COMMAND: "dlm_thread"
>         TASK: ffff81012ac89860  [THREAD_INFO: ffff81010532e000]
>          CPU: 4
>        STATE: TASK_RUNNING (PANIC)
>
> crash> bt
> PID: 6557   TASK: ffff81012ac89860  CPU: 4   COMMAND: "dlm_thread"
>  #0 [ffff81010532fa50] machine_kexec at ffffffff8011c0b6
>  #1 [ffff81010532fb20] crash_kexec at ffffffff80154022
>  #2 [ffff81010532fbe0] __die at ffffffff802ec658
>  #3 [ffff81010532fc20] die at ffffffff8010c7e6
>  #4 [ffff81010532fc50] do_invalid_op at ffffffff8010cd97
>  #5 [ffff81010532fd10] error_exit at ffffffff8010bced
>     [exception RIP: dlm_drop_lockres_ref+480]
>     RIP: ffffffff88511d2a  RSP: ffff81010532fdc8  RFLAGS: 00010286
>     RAX: ffff81006181cc08  RBX: 0000000000000000  RCX: 000000000001109c
>     RDX: 000000000000001f  RSI: 0000000000000296  RDI: ffffffff8035ba1c
>     RBP: ffff81006181cbc0   R8: ffffffff8045a260   R9: 000000000000001f
>     R10: 0000000000000000  R11: 0000000000000000  R12: ffff810129b05c00
>     R13: 000000000000001f  R14: ffff81004ada2320  R15: 000000000000026d
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #6 [ffff81010532fdc0] dlm_drop_lockres_ref at ffffffff88511d2a
>  #7 [ffff81010532fe40] dlm_run_purge_list at ffffffff8852035c
>  #8 [ffff81010532fe90] dlm_thread at ffffffff88520718
>  #9 [ffff81010532ff10] kthread at ffffffff801480cd
> #10 [ffff81010532ff50] kernel_thread at ffffffff8010bea6
> crash>
>
> text extracted from the core file:
> ----------------------------------
>
> <3>(6345,7):dlm_deref_lockres_handler:2302 ERROR:
> 27870DB34A7241CC8EBDD43647ABE1FB:M0000000000000078b4305e00000000: node
> 0 trying to drop ref but it is already dropped!
>
> <3>(6557,4):dlm_drop_lockres_ref:2234 ERROR: while dropping ref on
> 130ADCC7DE934141AF05DA025CCD14A4:O0000000000000079a3bfbc00000000
> (master=0) got -22.
>
> <1>Kernel BUG at fs/ocfs2/dlm/dlmmaster.c:2236
>
> <4>Modules linked in: af_packet ocfs2 ocfs2_dlmfs ocfs2_dlm
> ocfs2_nodemanager configfs btipbsa4 ipmi_devintf ipmi_si
> ipmi_msghandler bonding ipv6 bticomp_aha363 dock smi button battery
> btismc ac st loop dm_round_robin dm_multipath dm_mod usbhid
> usb_storage ide_core i2c_i801 igb e1000 hw_random i2c_core uhci_hcd
> ehci_hcd usbcore ext3 jbd qla2xxx firmware_class qla2xxx_conf
> intermodule edd fan thermal processor sg megaraid_sas ata_piix libata
> sd_mod scsi_mod
>
> <4>Pid: 6557, comm: dlm_thread Tainted: P U 2.6.16.60-0.42.10-smp #1
>
> <4>RIP: 0010:[<ffffffff88511d2a>]
>     <ffffffff88511d2a>{:ocfs2_dlm:dlm_drop_lockres_ref+480}
>
> <4>Process dlm_thread (pid: 6557, threadinfo ffff81010532e000, task
> ffff81012ac89860)
>
> <4>Call Trace: <ffffffff8852035c>{:ocfs2_dlm:dlm_run_purge_list+771}
> <4>  <ffffffff88520718>{:ocfs2_dlm:dlm_thread+131}
>      <ffffffff8014820e>{autoremove_wake_function+0}
> <4>  <ffffffff88520695>{:ocfs2_dlm:dlm_thread+0}
>      <ffffffff80147e05>{keventd_create_kthread+0}
>
> <1>RIP <ffffffff88511d2a>{:ocfs2_dlm:dlm_drop_lockres_ref+480} RSP
> <ffff81010532fdc8>