Hi,

I have recently installed CentOS4 (i386 version) on a server.

After I run

  mdadm --create /dev/md0 --level=raid5 --raid-devices=6 --spare-devices=1 $DEVLIST

($DEVLIST is /dev/sda1 ..., 7 identical SATA disks on three different
SATA controllers, two on-board, one on a PCI card), it starts to build
the RAID, but at the end (> 3 hours) it produces a kernel bug.

Could it be a hardware error, or is it really a kernel bug and should I
try a newer kernel? (I prefer using the stock CentOS kernel!)

Relevant part of the log:

Mar 14 20:55:48 server kernel: md: syncing RAID array md0
Mar 14 20:55:48 server kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
Mar 14 20:55:48 server kernel: md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) for reconstruction.
Mar 14 20:55:48 server kernel: md: using 128k window, over a total of 390711296 blocks.
Mar 14 20:55:48 server kernel: md: md0: sync done.

(this repeats for the whole duration of the reconstruction)

Mar 14 20:55:48 server kernel: eip: c011cf67
Mar 14 20:55:48 server kernel: ------------[ cut here ]------------
Mar 14 20:55:48 server kernel: kernel BUG at include/asm/spinlock.h:146!
Mar 14 20:55:48 server kernel: invalid operand: 0000 [#1]
Mar 14 20:55:48 server kernel: SMP
Mar 14 20:55:48 server kernel: Modules linked in: vfat fat raid5 xor parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc dm_mod button battery ac uhci_hcd ehci_hcd sata_sil e1000 floppy ext3 jbd ata_piix aacraid sata_promise libata sd_mod scsi_mod
Mar 14 20:55:48 server kernel: CPU: 1
Mar 14 20:55:48 server kernel: EIP: 0060:[<c02c4d04>] Not tainted VLI
Mar 14 20:55:48 server kernel: EFLAGS: 00010046 (2.6.9-5.0.3.ELsmp)
Mar 14 20:55:48 server kernel: EIP is at _spin_lock_irqsave+0x20/0x45
Mar 14 20:55:48 server kernel: eax: c011cf67 ebx: 00000246 ecx: c02d7fe4 edx: c02d7fe4
Mar 14 20:55:48 server kernel: esi: f44f4f54 edi: f4ae4000 ebp: f4ae4f9c esp: f4ae4f84
Mar 14 20:55:48 server kernel: ds: 007b es: 007b ss: 0068
Mar 14 20:55:48 server kernel: Process md0_resync (pid: 12079, threadinfo=f4ae4000 task=f620f0b0)
Mar 14 20:55:48 server kernel: Stack: f44f4f50 f44f4f54 c011cf67 f4dc1280 00000000 f4ae4000 00000000 c02650fd
Mar 14 20:55:48 server kernel:        00000000 f620f0b0 c011e8d2 f4ae4fd0 f4ae4fd0 f44f4ee4 c02c5fca f396a630
Mar 14 20:55:48 server kernel:        00000000 f620f0b0 c011e8d2 f4ae4fd0 f4ae4fd0 00000000 00000000 0000007b
Mar 14 20:55:48 server kernel: Call Trace:
Mar 14 20:55:48 server kernel:  [<c011cf67>] complete+0x12/0x3d
Mar 14 20:55:48 server kernel:  [<c02650fd>] md_thread+0x15f/0x168
Mar 14 20:55:48 server kernel:  [<c011e8d2>] autoremove_wake_function+0x0/0x2d
Mar 14 20:55:48 server kernel:  [<c02c5fca>] ret_from_fork+0x6/0x14
Mar 14 20:55:48 server kernel:  [<c011e8d2>] autoremove_wake_function+0x0/0x2d
Mar 14 20:55:48 server kernel:  [<c0264f9e>] md_thread+0x0/0x168
Mar 14 20:55:48 server kernel:  [<c01041f1>] kernel_thread_helper+0x5/0xb
Mar 14 20:55:48 server kernel: Code: 81 00 00 00 00 01 c3 f0 ff 00 c3 56 89 c6 53 9c 5b fa 81 78 04 ad 4e ad de 74 18 ff 74 24 08 68 e4 7f 2d c0 e8 4c be e5 ff 59 58 <0f> 0b 92 00 f1 70 2d c0 f0 fe 0e 79 13 f7 c3 00 02 00 00 74 01

The system is an up-to-date (via yum) CentOS4:

[root@server tmp]# free
             total       used       free     shared    buffers     cached
Mem:       1034676     286364     748312          0      30464     182996
-/+ buffers/cache:      72904     961772
Swap:      2096440          0    2096440

[root@server tmp]# uname -r
2.6.9-5.0.3.ELsmp

[root@server tmp]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 2.40GHz
stepping        : 9
cpu MHz         : 2394.536
cache size      : 512 KB
physical id     : 0
siblings        : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 4718.59

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 2.40GHz
stepping        : 9
cpu MHz         : 2394.536
cache size      : 512 KB
physical id     : 0
siblings        : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 4784.12

--
Sincerely
  Ivo Panacek
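PS: in case it matters, I watch the rebuild progress with nothing more
than the standard md status interfaces, roughly like this:

  cat /proc/mdstat               # shows the resync percentage and speed
  mdadm --detail /dev/md0        # shows the array state and rebuild status
  watch -n 60 cat /proc/mdstat   # refresh the progress once a minute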
On Tue, 2005-03-15 at 11:54 +0100, Ivo Panacek wrote:
> After I run
>
>   mdadm --create /dev/md0 --level=raid5 --raid-devices=6 --spare-devices=1 $DEVLIST
>
> ($DEVLIST is /dev/sda1 ..., 7 identical SATA disks on three different
> SATA controllers, two on-board, one on a PCI card), it starts to build
> the RAID, but at the end (> 3 hours) it produces a kernel bug.
>
> Could it be a hardware error, or is it really a kernel bug and should I
> try a newer kernel? (I prefer using the stock CentOS kernel!)

Is it a real dual-processor machine, or HyperThreading?

I am assuming that the drivers work for the drives, that you see them
all via fdisk -l, and that they are the correct size, etc.

Maybe try booting the non-SMP (regular) kernel for building the RAID.

Should your device list entry be /dev/sda[1-7]?

Just a couple of thoughts.
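PS: to double-check that all the members are really there and the same
size before you kick off another build, something like this should do
(I am only guessing that your seven disks are /dev/sda through /dev/sdg;
adjust it to your actual layout):

  echo $DEVLIST                                      # confirm what the variable really expands to
  fdisk -l /dev/sd[a-g] 2>/dev/null | grep '^Disk'   # one "Disk /dev/sdX: ..." line per drive, with its size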
Johnny Hughes wrote:
> On Tue, 2005-03-15 at 11:54 +0100, Ivo Panacek wrote:
>> After I run
>>
>>   mdadm --create /dev/md0 --level=raid5 --raid-devices=6 --spare-devices=1 $DEVLIST
>>
>> it starts to build the RAID, but at the end (> 3 hours) it produces a kernel bug.
>
> Is it a real dual-processor machine, or HyperThreading?
>
> Maybe try booting the non-SMP (regular) kernel for building the RAID.

I would also recommend building the RAID without the SMP kernel to see
if the bug goes away. I'm not a kernel developer, but I follow the list,
and the spinlock issue seems to be a recurring theme with multiprocessor
machines. Also, as Johnny mentions, the behaviour may be quite different
between a "genuine" SMP machine and a virtual SMP setup with a
hyperthreading P4.

You might also want to report the bug to the kernel gods and see if they
have a more detailed plan for either fixing it or avoiding it in the
future.

Cheers,
C
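PS: if you do want to try the non-SMP kernel, something along these
lines should be all that is needed (assuming the usual GRUB setup and
that the uniprocessor kernel package is actually installed; I have not
checked your box, obviously):

  rpm -q kernel kernel-smp          # see which kernel packages are installed
  grep title /boot/grub/grub.conf   # find the menu entry for the non-SMP kernel

Then pick the non-SMP entry at the GRUB menu on the next boot (or point
the "default=" line in grub.conf at it), build the array, and switch
back afterwards.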
I've got CentOS4 with two software RAID5s running on a Proliant ML370G3
(dual Xeon 3GHz) and have had no problem with them (using 6 SCSI drives).
So I'd expect the problem isn't with raid5; I'd guess it's either
hardware or SATA driver related...

On Tue, 15 Mar 2005, Ivo Panacek wrote:
> After I run
>
>   mdadm --create /dev/md0 --level=raid5 --raid-devices=6 --spare-devices=1 $DEVLIST
>
> ($DEVLIST is /dev/sda1 ..., 7 identical SATA disks on three different
> SATA controllers, two on-board, one on a PCI card), it starts to build
> the RAID, but at the end (> 3 hours) it produces a kernel bug.
>
> Could it be a hardware error, or is it really a kernel bug and should I
> try a newer kernel? (I prefer using the stock CentOS kernel!)
Thanks to all for the tips; I will try them (it is a slow process :).

It is NOT a truly SMP machine, just HyperThreading.

All disks are visible; fdisk -l /dev/sdX shows what it should show.

Now I will start the next test and will report here after all the tests.

--
Sincerely
  Ivo Panacek
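PS: during the next rebuild I will also keep an eye on the kernel log
for drive errors, roughly like this (nothing sophisticated, just the
usual places):

  tail -f /var/log/messages               # watch the log live while the resync runs
  dmesg | grep -i -E 'ata|scsi|error'     # crude filter for SATA/SCSI error messages afterwards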
> Is it a real dual-processor machine, or HyperThreading?
>
> I am assuming that the drivers work for the drives, that you see them
> all via fdisk -l, and that they are the correct size, etc.
>
> Maybe try booting the non-SMP (regular) kernel for building the RAID.

Maybe, but at least on my dual Xeon (2 CPUs, 4 hyperthreads) it works...

Cheers,
 MaZe.
Result: it works now; one of the disks was faulty.

I ran the RAID creation once more, with no modifications at all.
Instead of the kernel bug, there were messages that one disk (sdd) is
faulty (at 98%, i.e. after about 3 hours). I removed it and everything
works now (no spare disk at the moment).

So I think the kernel has some hazard (race) in the case of faulty
hardware. But it could be complicated to reproduce, since one cannot
know what exactly went wrong.

--
Sincerely
  Ivo Panacek
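PS: when a replacement disk arrives I plan to put it back roughly like
this (assuming it shows up as /dev/sdd again and gets a single partition
like the others; /dev/sdd1 is just an example name):

  mdadm /dev/md0 --add /dev/sdd1     # add the new partition back to the array as the spare
  cat /proc/mdstat                   # watch the array pick it up
  mdadm --detail /dev/md0            # confirm the final state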