hi,
after we switched our servers from centos-3 to centos-4 (aka rhel-4), one
of our servers crashes about once a week without any oops. this happens
with both the normal kernel-2.6.9-11.EL and
kernel-2.6.9-11.106.unsupported. we changed the motherboard, the
raid controller and the cables too, and we still get it. finally we set up
netdump, and yesterday we got a crash log and a core file.
it seems there is a bug in the raid5 code of the kernel.
this is our backup server with 8 x 200GB hdd in raid5 (for the data)
plus 2 x 40GB hdd in raid1 (for the system), running with a 3ware 8xxx
raid controller. i attached the netdump log of the last crash.
how can i fix it?
yours.

-- 
Levente "Si vis pacem para bellum!"
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: log
URL: <http://lists.centos.org/pipermail/centos/attachments/20050707/11321ba7/attachment-0002.ksh>

On 7/7/05, Farkas Levente <lfarkas at bppiac.hu> wrote:
> hi,
> after we switched our servers from centos-3 to centos-4 (aka rhel-4), one
> of our servers crashes about once a week without any oops. this happens
> with both the normal kernel-2.6.9-11.EL and
> kernel-2.6.9-11.106.unsupported. we changed the motherboard, the
> raid controller and the cables too, and we still get it. finally we set up
> netdump, and yesterday we got a crash log and a core file.
> it seems there is a bug in the raid5 code of the kernel.
> this is our backup server with 8 x 200GB hdd in raid5 (for the data)
> plus 2 x 40GB hdd in raid1 (for the system), running with a 3ware 8xxx
> raid controller. i attached the netdump log of the last crash.
> how can i fix it?
> yours.
>

Hi,

I have seen similar (but not quite the same) crashes in the raid code on
RHEL 3 kernels. They typically occurred due to a race condition between
something updating the linked lists of raid devices and something trying
to read them. For RHEL 3, my co-workers and I found where one particular
race condition had been fixed in the 2.6 kernel and backported the fix to
the RHEL 3 kernel. Ultimately this patch was included in one of the
updates for the RHEL 3 kernel.

Anyway, it is likely your problem is yet another race condition. What I
would suggest is to get a box configured with true RHEL 4 and reproduce
the crash there. Once it is reproduced, file a Bugzilla report with
Red Hat. We have had very good success with this approach for a number of
kernel bugs we found in the CentOS 3/RHEL 3 kernels. Fixes have not
always come quickly, but they generally do come.

Good luck...james

although this mail created a long thread, does anybody have a good
solution to the original question?

Farkas Levente wrote:
> hi,
> after we switched our servers from centos-3 to centos-4 (aka rhel-4), one
> of our servers crashes about once a week without any oops. this happens
> with both the normal kernel-2.6.9-11.EL and
> kernel-2.6.9-11.106.unsupported. we changed the motherboard, the
> raid controller and the cables too, and we still get it. finally we set up
> netdump, and yesterday we got a crash log and a core file.
> it seems there is a bug in the raid5 code of the kernel.
> this is our backup server with 8 x 200GB hdd in raid5 (for the data)
> plus 2 x 40GB hdd in raid1 (for the system), running with a 3ware 8xxx
> raid controller. i attached the netdump log of the last crash.
> how can i fix it?
> yours.

-- 
Levente "Si vis pacem para bellum!"