Eduardo Diaz - Gmail
2012-Feb-09 10:08 UTC
[Ocfs2-users] OCFS2 Error in the filesystem after of some weeks running ocfs2
Hi all,

I am running a very simple DRBD primary/primary configuration. I ran all my tests some weeks ago and everything worked very well (shutting down the nodes, etc.). I repeated the tests yesterday and now :(... I don't know what is happening, again! Every time I stop one node (shutdown, not power-off) the cluster breaks :-(

I shut down the filesystem and ran fsck.ocfs2, and it reported many errors in the cluster files. Is there no way to check that the OCFS2 filesystem is OK? I can stop it at night, but this is crazy to me, because every two months the filesystem is broken, and if I stop one node the running node goes down...

All my systems run Debian squeeze with ocfs2 1.6.3.

Any ideas?

Feb 7 13:58:33 servidoradantra2 kernel: [1864496.744051] block drbd0: conn( Unconnected -> WFConnection )
Feb 7 13:59:24 servidoradantra2 kernel: [1864547.064015] o2net: connection to node servidoradantra1 (num 0) at 192.168.2.1:7777 has been idle for 60.0 seconds, shutting it down.
Feb 7 13:59:24 servidoradantra2 kernel: [1864547.064025] (0,0):o2net_idle_timer:1495 here are some times that might help debug the situation: (tmr 1328619504.71832 now 1328619564.71605 dr 1328619504.71815 adv 1328619504.71839:1328619504.71840 func (18797194:507) 1328619488.80748:1328619488.80749)
Feb 7 13:59:24 servidoradantra2 kernel: [1864547.064048] o2net: no longer connected to node servidoradantra1 (num 0) at 192.168.2.1:7777
Feb 7 13:59:31 servidoradantra2 kernel: [1864554.860190] (2950,0):o2dlm_eviction_cb:269 o2dlm has evicted node 0 from group F0E244E5687046DBAAF6A928CCDEEEF1
Feb 7 13:59:31 servidoradantra2 kernel: [1864554.874012] (28219,0):dlm_get_lock_resource:839 F0E244E5687046DBAAF6A928CCDEEEF1:M00000000000000000000120766ee68: at least one node (0) to recover before lock mastery can begin
Feb 7 13:59:32 servidoradantra2 kernel: [1864555.876011] (28219,0):dlm_get_lock_resource:893 F0E244E5687046DBAAF6A928CCDEEEF1:M00000000000000000000120766ee68: at least one node (0) to recover before lock mastery can begin
Feb 7 13:59:35 servidoradantra2 kernel: [1864558.309527] (3132,3):dlm_get_lock_resource:839 F0E244E5687046DBAAF6A928CCDEEEF1:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Feb 7 13:59:35 servidoradantra2 kernel: [1864558.309533] (3132,3):dlm_get_lock_resource:873 F0E244E5687046DBAAF6A928CCDEEEF1: recovery map is not empty, but must master $RECOVERY lock now
Feb 7 13:59:35 servidoradantra2 kernel: [1864558.309549] (3132,3):dlm_do_recovery:523 (3132) Node 1 is the Recovery Master for the Dead Node 0 for Domain F0E244E5687046DBAAF6A928CCDEEEF1
Feb 7 13:59:43 servidoradantra2 kernel: [1864566.880235] (28219,0):ocfs2_replay_journal:1607 Recovering node 0 from slot 0 on device (147,0)
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.884880] ------------[ cut here ]------------
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.884902] kernel BUG at /build/buildd-linux-2.6_2.6.32-39squeeze1-i386-F5tMlP/linux-2.6-2.6.32/debian/build/source_i386_none/fs/ocfs2/journal.c:1702!
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.884938] invalid opcode: 0000 [#1] SMP
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.884960] last sysfs file: /sys/devices/pci0000:00/0000:00:1f.2/host5/target5:0:0/5:0:0:0/model
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.884991] Modules linked in: ocfs2 jbd2 quota_tree crc32c drbd lru_cache cn pci_stub vboxpci vboxnetadp vboxnetflt vboxdrv cls_u32 sch_htb sch_ingress sch_sfq xt_time xt_connlimit xt_realm iptable_raw xt_TPROXY nf_tproxy_core xt_hashlimit xt_comment xt_owner xt_recent xt_iprange xt_policy xt_multiport ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_LOG ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah ipt_addrtype xt_tcpmss xt_pkttype xt_physdev xt_NFQUEUE xt_MARK xt_mark xt_mac xt_limit xt_length xt_helper xt_dccp xt_conntrack xt_CONNMARK xt_connmark xt_CLASSIFY xt_tcpudp xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 iptable_mangle nfnetlink iptable_filter ip_tables x_tables ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs xfs exportfs it87 hwmon_vid coretemp loop firewire_sbp2 snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep nouveau ttm drm_kms_helper snd_pcm drm snd_timer snd soundcore i2c_i801 i2c_
Feb 7 13:59:47 servidoradantra2 kernel: algo_bit parport_pc i2c_core snd_page_alloc parport psmouse evdev button pcspkr serio_raw processor ext3 jbd mbcache dm_mod sg usbhid hid sr_mod cdrom ata_generic sd_mod crc_t10dif uhci_hcd pata_jmicron firewire_ohci thermal ahci firewire_core floppy crc_itu_t libata r8169 mii ehci_hcd scsi_mod thermal_sys sky2 usbcore nls_base [last unloaded: scsi_wait_scan]
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886462]
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886477] Pid: 28219, comm: ocfs2rec Not tainted (2.6.32-5-686-bigmem #1) 965P-DS4
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886505] EIP: 0060:[<fd01d47a>] EFLAGS: 00010246 CPU: 0
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886532] EIP is at __ocfs2_recovery_thread+0x3af/0x146d [ocfs2]
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886550] EAX: 00000001 EBX: f5da6800 ECX: 00000001 EDX: 00000001
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886569] ESI: 00000001 EDI: f6ade038 EBP: 00000000 ESP: e0cb9ed4
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886587] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886605] Process ocfs2rec (pid: 28219, ti=e0cb8000 task=c91c0440 task.ti=e0cb8000)
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886633] Stack:
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886647] c91c0440 c91c0440 f5da689c 00000001 00000001 f5da6800 f6ade038 f6b21930
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886682] <0> 00000002 00010000 00000000 00010000 00000000 e6f91000 d2baa848 00000000
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886731] <0> 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886790] Call Trace:
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886814] [<fd01d0cb>] ? __ocfs2_recovery_thread+0x0/0x146d [ocfs2]
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886835] [<c104a420>] ? kthread+0x61/0x66
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886853] [<c104a3bf>] ? kthread+0x0/0x66
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886871] [<c1008d87>] ? kernel_thread_helper+0x7/0x10
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.886888] Code: 00 00 68 24 b7 05 fd 50 ff b2 2c 01 00 00 68 c9 47 06 fd e8 99 10 26 c4 83 c4 20 8b 5c 24 14 8b 44 24 0c 39 83 bc 00 00 00 75 04 <0f> 0b eb fe 8d 84 24 d0 00 00 00 c7 84 24 d0 00 00 00 00 00 00
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.887102] EIP: [<fd01d47a>] __ocfs2_recovery_thread+0x3af/0x146d [ocfs2] SS:ESP 0068:e0cb9ed4
Feb 7 13:59:47 servidoradantra2 kernel: [1864570.887413] ---[ end trace 22961f2e1f624b7d ]---
Feb 7 14:07:19 servidoradantra2 kernel: imklog 4.6.4, log source /proc/kmsg started.
Adi Kriegisch
2012-Feb-10 12:54 UTC
[Ocfs2-users] OCFS2 Error in the filesystem after of some weeks running ocfs2
Dear Eduardo,

> I shutdown the filesystem an make a fsck.ocfs2 and there is many
> errors y cluster file but there is no way to test that the ocfs2 are
> ok? I can stop in night but for me this are crazy, because every to
> months the filesystem are broken and if I stop one node the running
> node go down...
>
> I have all system in debian squezee with ocfs2 1.6.3
>
> Any Ideas??

We had similar issues (also running 32-bit Debian squeeze). At the time we suspected that there was not enough LOWMEM available for the recovery to complete successfully. Switching to amd64 solved the issue for us. Luckily, ocfs2 can run with mixed 32-bit and 64-bit clients, so we could migrate our servers one by one without interrupting production too much.

-- Adi
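
One way to look into the LOWMEM suspicion on a 32-bit node is to watch the LowTotal and LowFree fields that highmem kernels expose in /proc/meminfo. The following is only a minimal sketch: the function names are made up for illustration, it cannot say how much low memory the ocfs2 recovery thread actually needs, and a low value would not by itself prove that this is the cause of the BUG above.

```python
from typing import Dict, Optional


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}."""
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def lowmem_free_fraction(text: str) -> Optional[float]:
    """Return LowFree/LowTotal, or None if the kernel does not report them.

    LowTotal/LowFree are only present on kernels built with highmem
    support (typical for 32-bit "bigmem" kernels); 64-bit kernels omit them.
    """
    fields = parse_meminfo(text)
    total = fields.get("LowTotal")
    free = fields.get("LowFree")
    if not total or free is None:
        return None
    return free / total


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        fraction = lowmem_free_fraction(f.read())
    if fraction is None:
        print("No LowTotal/LowFree fields (probably not a highmem kernel)")
    else:
        print("Low memory free: {:.1%}".format(fraction))
```

Running this periodically (for example from cron) on both nodes before a planned shutdown test would at least show whether low memory is nearly exhausted when recovery starts.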