I have seen a problem on a two-node system where some (but not all) of the
ocfs2 volumes on node 2 became unmounted. Correcting the problem only required
issuing the mount command, but I am curious if anyone has an explanation of
what may have happened. One thing I should mention is that the hostname of
node 1 was changed about a week ago. At that time corrections were made to the
cluster.conf and hosts files of both nodes, and both machines were rebooted.

Any ideas?

System Info:
-------------

SuSE SLES10 SP1  2.6.16.46-0.12-smp

/proc/fs/ocfs2/version
OCFS2 1.2.5-SLES-r2997 Tue Mar 27 16:33:19 EDT 2007 (build sles)

rpm -qa | grep ocfs2
ocfs2-tools-1.2.3-0.7
ocfs2console-1.2.3-0.7

/etc/sysconfig/o2cb
#
# This is a configuration file for automatic startup of the O2CB
# driver. It is generated by running /etc/init.d/o2cb configure.
# Please use that method to modify this file
#
# O2CB_ENABELED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
#O2CB_HEARTBEAT_THRESHOLD=240    bti changed, 240 was default
O2CB_HEARTBEAT_THRESHOLD=90

# O2CB_HEARTBEAT_MODE: Whether to use the native "kernel" or the "user"
# driven heartbeat (for example, for integration with heartbeat 2.0.x)
O2CB_HEARTBEAT_MODE="kernel"

# O2CB_IDLE_TIMEOUT_MS: The time frame in which all cluster memberships
# for a given ocfs2 filesystem must be configured on all nodes.
O2CB_IDLE_TIMEOUT_MS=120000

# O2CB_RECONNECT_DELAY_MS: How long to wait for the other nodes to recognise us
O2CB_RECONNECT_DELAY_MS=2000

# O2CB_KEEPALIVE_DELAY_MS: How often to remind our peers that we are alive
O2CB_KEEPALIVE_DELAY_MS=5000


Messages from CN2: (this is the node that had the unmounted volumes)

Sep 25 05:45:01 CN2 run-crons[22005]: time.cron returned 1
Sep 25 06:41:58 CN2 syslog-ng[3969]: STATS: dropped 0
Sep 25 06:45:01 CN2 run-crons[23956]: time.cron returned 1
Sep 25 07:41:58 CN2 syslog-ng[3969]: STATS: dropped 0
Sep 25 07:45:01 CN2 run-crons[25893]: time.cron returned 1
Sep 25 08:41:59 CN2 syslog-ng[3969]: STATS: dropped 0
Sep 25 08:45:01 CN2 run-crons[27833]: time.cron returned 1
Sep 25 09:42:00 CN2 syslog-ng[3969]: STATS: dropped 0
Sep 25 09:45:01 CN2 run-crons[29770]: time.cron returned 1
Sep 25 09:58:13 CN2 zmd: NetworkManagerModule (WARN): Failed to connect to NetworkManager
Sep 25 09:58:19 CN2 zmd: Daemon (WARN): Not starting remote web server
Sep 25 10:40:33 CN2 sshd[31688]: Accepted publickey for root from 192.168.100.3 port 48567 ssh2
Sep 25 10:42:00 CN2 syslog-ng[3969]: STATS: dropped 0
Sep 25 10:43:25 CN2 sshd[31798]: Accepted publickey for root from 192.168.100.3 port 48569 ssh2
Sep 25 10:43:54 CN2 kernel: ocfs2_dlm: Nodes in domain ("967C5C3174B341A399FAF031B4D544FE"): 0 1
Sep 25 10:43:54 CN2 kernel: (31849,1):ocfs2_find_slot:261 slot 1 is already allocated to this node!
Sep 25 10:43:54 CN2 kernel: (31849,1):ocfs2_check_volume:1654 File system was not unmounted cleanly, recovering volume.
Sep 25 10:43:54 CN2 kernel: kjournald starting. Commit interval 5 seconds
Sep 25 10:43:54 CN2 kernel: ocfs2: Mounting device (253,5) on (node 1, slot 1)
Sep 25 10:43:58 CN2 kernel: ocfs2_dlm: Nodes in domain ("655FEE13B3604FCE8E780BA2F525EB6A"): 0 1
Sep 25 10:43:58 CN2 kernel: (31860,1):ocfs2_find_slot:261 slot 1 is already allocated to this node!
Sep 25 10:43:58 CN2 kernel: (31860,1):ocfs2_check_volume:1654 File system was not unmounted cleanly, recovering volume.
Sep 25 10:43:59 CN2 kernel: kjournald starting. Commit interval 5 seconds
Sep 25 10:43:59 CN2 kernel: ocfs2: Mounting device (253,9) on (node 1, slot 1)
Sep 25 10:45:01 CN2 run-crons[31894]: time.cron returned 1


Messages from CN1: (this is the node whose hostname was changed, from CN1 to bustech-bu)

Sep 25 05:45:10 bustech-bu su: (to nobody) root on none
Sep 25 06:30:01 bustech-bu run-crons[31122]: time.cron returned 1
Sep 25 06:42:38 bustech-bu syslog-ng[3901]: STATS: dropped 0
Sep 25 07:30:01 bustech-bu run-crons[597]: time.cron returned 1
Sep 25 07:42:38 bustech-bu syslog-ng[3901]: STATS: dropped 0
Sep 25 07:55:00 bustech-bu kernel: (6823,3):ocfs2_broadcast_vote:725 ERROR: status = -92
Sep 25 07:55:00 bustech-bu kernel: (6823,3):ocfs2_do_request_vote:798 ERROR: status = -92
Sep 25 07:55:00 bustech-bu kernel: (6823,3):ocfs2_rename:1099 ERROR: status = -92
The above three messages continue over and over again.
Sep 25 07:55:10 bustech-bu kernel: (6823,0):ocfs2_broadcast_vote:725 ERROR: status = -92
Sep 25 07:55:10 bustech-bu kernel: (6823,0):ocfs2_do_request_vote:798 ERROR: status = -92
Sep 25 07:55:10 bustech-bu kernel: (6823,0):ocfs2_rename:1099 ERROR: status = -92
Sep 25 08:30:01 bustech-bu run-crons[2562]: time.cron returned 1
Sep 25 08:42:38 bustech-bu syslog-ng[3901]: STATS: dropped 7647
Sep 25 09:30:01 bustech-bu run-crons[4528]: time.cron returned 1
Sep 25 09:42:39 bustech-bu syslog-ng[3901]: STATS: dropped 0
Sep 25 10:30:01 bustech-bu run-crons[6600]: time.cron returned 1
Sep 25 10:40:55 bustech-bu sshd[7010]: Accepted publickey for root from 192.168.100.3 port 42585 ssh2
Sep 25 10:42:39 bustech-bu syslog-ng[3901]: STATS: dropped 0
Sep 25 10:43:54 bustech-bu kernel: ocfs2_dlm: Node 1 joins domain 967C5C3174B341A399FAF031B4D544FE
Sep 25 10:43:54 bustech-bu kernel: ocfs2_dlm: Nodes in domain ("967C5C3174B341A399FAF031B4D544FE"): 0 1
Sep 25 10:43:58 bustech-bu kernel: ocfs2_dlm: Node 1 joins domain 655FEE13B3604FCE8E780BA2F525EB6A
Sep 25 10:43:58 bustech-bu kernel: ocfs2_dlm: Nodes in domain ("655FEE13B3604FCE8E780BA2F525EB6A"): 0 1
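For reference, after the rename the /etc/ocfs2/cluster.conf on both nodes
would look something like the sketch below; the o2cb stack generally expects
each node's "name" entry to match that node's hostname, which is why the
hostname change required editing this file. The node numbers (0 for
bustech-bu, 1 for CN2) are taken from the mount/join messages above; the IP
addresses and port are placeholders, not values from this report:

node:
        ip_port = 7777
        ip_address = 192.168.1.1
        number = 0
        name = bustech-bu
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.1.2
        number = 1
        name = CN2
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2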
Automatic umount? The messages do not indicate a umount. How did you detect
that the volumes were umounted? As in, did you only look at "mount -t ocfs2",
or did you also check "cat /proc/mounts"? When ocfs2 umounts a volume, it
prints an umount message in syslog. I don't see that message here.

Charlie Sharkey wrote:
> I have seen a problem on a two node system where some (but not all) of
> the ocfs2 volumes on node 2 became unmounted. Correcting the problem only
> required issuing the mount command, but I am curious if anyone has an
> explanation of what may have happened.
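To compare the two views asked about above: mount(8) with no device argument
reads /etc/mtab, which can drift out of sync with reality, while /proc/mounts
is the kernel's own mount table. A quick check looks like this (both commands
are standard, nothing here is specific to this cluster):

  # what /etc/mtab (and hence plain "mount") believes is mounted
  mount -t ocfs2

  # what the kernel itself has mounted
  grep ocfs2 /proc/mounts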
Hi,

I'm having some dlm issues on a system. It looks like the scenario went
something like this:

19:45:29 --> 19:47:31   16 dlm locks are released using o2dlm_unlock().
                        ocfs2 logs an error into /var/log/messages, but
                        returns ok to the application.
19:49:15                A dlm lock (o2dlm_lock()) is taken on P00000 -- ok.
19:49:37                The lock on P00000 is released -- ok.
19:49:40                A lock on P00000 is attempted again and it fails.
                        The returned error is "Trylock failed".

Here is the data from /var/log/messages:

Jan 31 19:45:29 N1 kernel: (25033,1):dlmfs_unlink:512 ERROR: unlink P50005, error -16 from destroy
Jan 31 19:45:43 N1 kernel: (25038,1):dlmfs_unlink:512 ERROR: unlink P20010, error -16 from destroy
Jan 31 19:45:44 N1 kernel: (25030,1):dlmfs_unlink:512 ERROR: unlink P20002, error -16 from destroy
Jan 31 19:45:59 N1 kernel: (25034,3):dlmfs_unlink:512 ERROR: unlink P60006, error -16 from destroy
Jan 31 19:46:07 N1 kernel: (25043,0):dlmfs_unlink:512 ERROR: unlink P70015, error -16 from destroy
Jan 31 19:46:08 N1 kernel: (25035,0):dlmfs_unlink:512 ERROR: unlink P70007, error -16 from destroy
Jan 31 19:46:10 N1 kernel: (25041,1):dlmfs_unlink:512 ERROR: unlink P50013, error -16 from destroy
Jan 31 19:46:25 N1 kernel: (25040,1):dlmfs_unlink:512 ERROR: unlink P40012, error -16 from destroy
Jan 31 19:46:30 N1 kernel: (25042,1):dlmfs_unlink:512 ERROR: unlink P60014, error -16 from destroy
Jan 31 19:46:30 N1 kernel: (25032,1):dlmfs_unlink:512 ERROR: unlink P40004, error -16 from destroy
Jan 31 19:47:07 N1 kernel: (25028,1):dlmfs_unlink:512 ERROR: unlink P00000, error -16 from destroy
Jan 31 19:47:08 N1 kernel: (25036,1):dlmfs_unlink:512 ERROR: unlink P00008, error -16 from destroy
Jan 31 19:47:09 N1 kernel: (25029,1):dlmfs_unlink:512 ERROR: unlink P10001, error -16 from destroy
Jan 31 19:47:19 N1 kernel: (25037,0):dlmfs_unlink:512 ERROR: unlink P10009, error -16 from destroy
Jan 31 19:47:30 N1 kernel: (25039,1):dlmfs_unlink:512 ERROR: unlink P30011, error -16 from destroy
Jan 31 19:47:31 N1 kernel: (25031,1):dlmfs_unlink:512 ERROR: unlink P30003, error -16 from destroy

Here is the data from the application dlm log file:

01/31/2008 19:42:50 C000: Dlm Lock fd/id 150/P00000, returning: ok
01/31/2008 19:42:50 C001: Dlm Lock fd/id 152/P10001, returning: ok
01/31/2008 19:42:50 C002: Dlm Lock fd/id 154/P20002, returning: ok
01/31/2008 19:42:51 C003: Dlm Lock fd/id 156/P30003, returning: ok
01/31/2008 19:42:51 C004: Dlm Lock fd/id 158/P40004, returning: ok
01/31/2008 19:42:52 C005: Dlm Lock fd/id 160/P50005, returning: ok
01/31/2008 19:42:52 C006: Dlm Lock fd/id 162/P60006, returning: ok
01/31/2008 19:42:52 C007: Dlm Lock fd/id 164/P70007, returning: ok
01/31/2008 19:42:53 C008: Dlm Lock fd/id 166/P00008, returning: ok
01/31/2008 19:42:53 C009: Dlm Lock fd/id 168/P10009, returning: ok
01/31/2008 19:42:53 C00A: Dlm Lock fd/id 170/P20010, returning: ok
01/31/2008 19:42:54 C00B: Dlm Lock fd/id 172/P30011, returning: ok
01/31/2008 19:42:54 C00C: Dlm Lock fd/id 174/P40012, returning: ok
01/31/2008 19:42:54 C00D: Dlm Lock fd/id 178/P50013, returning: ok
01/31/2008 19:42:55 C00E: Dlm Lock fd/id 180/P60014, returning: ok
01/31/2008 19:42:58 C00F: Dlm Lock fd/id 182/P70015, returning: ok
01/31/2008 19:45:29 C005: Dlm UnLock. fd/id 160/P50005, returning ok
01/31/2008 19:45:43 C00A: Dlm UnLock. fd/id 170/P20010, returning ok
01/31/2008 19:45:44 C002: Dlm UnLock. fd/id 154/P20002, returning ok
01/31/2008 19:45:59 C006: Dlm UnLock. fd/id 162/P60006, returning ok
01/31/2008 19:46:07 C00F: Dlm UnLock. fd/id 182/P70015, returning ok
01/31/2008 19:46:08 C007: Dlm UnLock. fd/id 164/P70007, returning ok
01/31/2008 19:46:10 C00D: Dlm UnLock. fd/id 178/P50013, returning ok
01/31/2008 19:46:25 C00C: Dlm UnLock. fd/id 174/P40012, returning ok
01/31/2008 19:46:30 C00E: Dlm UnLock. fd/id 180/P60014, returning ok
01/31/2008 19:46:30 C004: Dlm UnLock. fd/id 158/P40004, returning ok
01/31/2008 19:47:07 C000: Dlm UnLock. fd/id 150/P00000, returning ok
01/31/2008 19:47:08 C008: Dlm UnLock. fd/id 166/P00008, returning ok
01/31/2008 19:47:09 C001: Dlm UnLock. fd/id 152/P10001, returning ok
01/31/2008 19:47:19 C009: Dlm UnLock. fd/id 168/P10009, returning ok
01/31/2008 19:47:30 C00B: Dlm UnLock. fd/id 172/P30011, returning ok
01/31/2008 19:47:31 C003: Dlm UnLock. fd/id 156/P30003, returning ok
01/31/2008 19:49:15 C000: Dlm Lock fd/id 150/P00000, returning: ok
01/31/2008 19:49:37 C000: Dlm UnLock. fd/id 150/P00000, returning ok
01/31/2008 19:49:40 C000: Dlm Lock fd/id 150/P00000, returning: Trylock failed
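For context on how these application-level lock calls map onto the dlmfs
operations in the kernel log above: o2dlm userspace locking goes through
ocfs2's dlmfs, where taking a lock is an open(2) of a file under the dlmfs
mount, closing the file drops the lock, and unlink(2) destroys the lock
resource (error -16 from destroy is EBUSY). The following is only a minimal
sketch of that pattern, not the application's actual code; the /dlm mount
point and the "appdomain" directory name are assumptions:

/* Minimal sketch of ocfs2 dlmfs userspace locking (not the application's code).
 * Assumes ocfs2_dlmfs is mounted at /dlm and that a domain directory,
 * e.g. /dlm/appdomain, has already been created with mkdir(2). */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *lockfile = "/dlm/appdomain/P00000"; /* domain is a placeholder */

        /* Take an exclusive lock: opening the file for write acquires it.
         * O_NONBLOCK makes this a trylock; if the lock cannot be granted
         * immediately, open() fails (the o2dlm library reports that as
         * "Trylock failed", as in the application log above). */
        int fd = open(lockfile, O_CREAT | O_RDWR | O_NONBLOCK, 0644);
        if (fd < 0) {
                fprintf(stderr, "trylock %s: %s\n", lockfile, strerror(errno));
                return 1;
        }

        /* ... the lock is held for as long as the file stays open ... */

        /* Dropping the lock is just a close(2). */
        close(fd);

        /* Destroying the lock resource is an unlink(2). If the DLM still
         * holds references on the resource, the unlink fails with EBUSY,
         * which is the "error -16 from destroy" in /var/log/messages. */
        if (unlink(lockfile) < 0)
                fprintf(stderr, "destroy %s: %s\n", lockfile, strerror(errno));

        return 0;
}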
I also had a problem with this system the day before this. Here is the data
from that:

     SYSTEM MAP: /boot/System.map-2.6.16.46-0.14.PTF.284042.0-smp
   DEBUG KERNEL: ../vmlinux.debug (2.6.16.46-0.14.PTF.284042.0-smp)
       DUMPFILE: vmcore
           CPUS: 4
           DATE: Wed Jan 30 17:44:49 2008
         UPTIME: 9 days, 00:55:51
   LOAD AVERAGE: 1.10, 1.06, 1.01
          TASKS: 341
       NODENAME: N1
        RELEASE: 2.6.16.46-0.14.PTF.284042.0-smp
        VERSION: #1 SMP Thu May 17 14:00:09 UTC 2007
        MACHINE: i686 (2327 Mhz)
         MEMORY: 2 GB
          PANIC: "kernel BUG at fs/ocfs2/dlm/dlmmaster.c:2780!"
            PID: 31585
        COMMAND: "masx"
           TASK: dab912d0 [THREAD_INFO: d5840000]
            CPU: 0
          STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 31585  TASK: dab912d0  CPU: 0  COMMAND: "masx"
 #0 [d5841d78] crash_kexec at c013bb1a
 #1 [d5841dbc] die at c01055fe
 #2 [d5841dec] do_invalid_op at c0105ce2
 #3 [d5841e9c] error_code (via invalid_op) at c0104e4d
    EAX: 00000051  EBX: ca668280  ECX: 00000000  EDX: 00000296  EBP: da0c1c00
    DS:  007b      ESI: da0c1c00  ES:  007b      EDI: ca668280
    CS:  0060      EIP: fb835e95  ERR: ffffffff  EFLAGS: 00010296
 #4 [d5841ed0] dlm_empty_lockres at fb835e95
 #5 [d5841ee0] dlm_unregister_domain at fb827305
 #6 [d5841f18] dlmfs_clear_inode at fb6c2eae
 #7 [d5841f24] clear_inode at c0175dfe
 #8 [d5841f30] generic_delete_inode at c0175eee
 #9 [d5841f3c] iput at c0175838
#10 [d5841f48] dput at c01744e0
#11 [d5841f54] do_rmdir at c016e63d
#12 [d5841fb8] sysenter_entry at c0103bd4
    EAX: 00000028  EBX: 08299988  ECX: 00000000  EDX: 08273be4
    DS:  007b      ESI: 00000000  ES:  007b      EDI: 082ebf28
    SS:  007b      ESP: bf999f7c  EBP: bf999fa8
    CS:  0073      EIP: ffffe410  ERR: 00000028  EFLAGS: 00000246

This is a SuSE SLES10 SP1 system, with a SuSE nfs patch.
Ocfs2 tools version 1.2.3-0.7
Ocfs2 version 1.2.5-SLES-r2997

I was hoping you would have some ideas on this.

Also, another question. I have been trying to run one of the debugging
scripts, for example, scanlocks. I keep getting the message 'Module debugfs
not loaded'. I don't see any debugfs.ko on the system. Isn't it a part of
the ocfs2 tools?

Thank you,
Charlie
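For reference on that last question: the debugfs in that message is the
kernel's debug filesystem (CONFIG_DEBUG_FS), which is built into the kernel
rather than shipped as a debugfs.ko with ocfs2-tools, so there is no module
file to find. If it is enabled in the kernel, it only needs to be mounted
before the debugging scripts can use it, typically with:

  mount -t debugfs debugfs /sys/kernel/debug

(or an equivalent entry in /etc/fstab).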