Tim Burgess
2009-Jun-23 05:23 UTC
[Lustre-discuss] Hung software raid in 2.6.18-92.1.26 + lustre 1.6.7.2
Hi All,

Was wondering if anyone might be able to shed any light on some more problems we've been seeing since our 1.6.7.2 upgrade over the weekend...

We've upgraded all the OSSes and the MDS to the SDLC 2.6.18-92.1.26.el5_lustre.1.6.7.2smp kernel, and now it appears that something is causing the software RAID layer on the OSSes to freeze completely. Even:

[root@oss006 md]# dd if=/dev/md2 of=/dev/null bs=1024k count=1

hangs forever. /dev/md2 is the OST volume (one OST per OSS), but we see the same effect on /dev/md0 (and presumably /dev/md1). This of course causes all the Lustre I/O threads to go into D state one by one and never return:

[root@oss006 ~]# ps -elf | grep D
F S UID        PID  PPID  C PRI  NI ADDR    SZ WCHAN  STIME TTY          TIME CMD
1 D root       250    27  0  75   0 -        0 get_ac Jun21 ?        00:00:02 [pdflush]
1 D root       251    27  0  70  -5 -        0 get_ac Jun21 ?        00:00:01 [kswapd0]
1 D root      3232     1  0  75   0 -        0 log_wa Jun21 ?        00:00:00 [obd_zombid]
1 D root      3288    27  0  70  -5 -        0 sync_b Jun21 ?        00:00:37 [kjournald]
1 D root      3303     1  0  75   0 -        0 get_ac Jun21 ?        00:00:00 [ldlm_cn_00]
1 D root      3305     1  0  75   0 -        0 -      Jun21 ?        00:00:00 [ldlm_cn_01]
1 D root      3306     1  0  75   0 -        0 -      Jun21 ?        00:00:00 [ldlm_cn_02]
1 D root      3307     1  0  75   0 -        0 get_ac Jun21 ?        00:00:00 [ldlm_cn_03]
1 D root      3308     1  0  75   0 -        0 -      Jun21 ?        00:00:00 [ldlm_cn_04]
1 D root      3309     1  0  75   0 -        0 -      Jun21 ?        00:00:00 [ldlm_cn_05]
1 D root      3310     1  0  75   0 -        0 -      Jun21 ?        00:00:00 [ldlm_cn_06]
....
1 D root      3455     1  0  75   0 -        0 -      Jun21 ?        00:00:00 [ll_evictor]
1 D root      5996     1  0  75   0 -        0 get_ac 12:16 ?        00:00:00 [ldlm_cn_08]
1 D root      5997     1  0  75   0 -        0 -      12:18 ?        00:00:00 [ldlm_cn_09]
1 D root      6020     1  0  75   0 -        0 get_ac 12:41 ?        00:00:00 [ldlm_cn_10]
4 D root      6107     1  0  77   0 -    16819 get_ac 12:53 ?        00:00:00 dd if=/dev/md2 of=/dev/null bs=4096k count=10 skip=10000
4 D root      6138     1  0  78   0 -    16818 get_ac 12:55 ?        00:00:01 dd if=/dev/md0 of=/dev/null bs=4096k count=10 skip=10000

If it's relevant: we haven't _yet_ seen this on our newer OSSes, which are 7+1 RAID5s. We are only seeing it on the older 5+1s for now.

Any help would be greatly appreciated!

Thanks again,
Tim
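
P.S. In case it helps with diagnosis, below is a rough sketch of the extra state we could capture on an OSS the next time it wedges. The host and device names (oss006, /dev/md2) are just from our setup above, and sysrq has to be enabled for the task dump to work:

[root@oss006 ~]# cat /proc/mdstat                 # array/resync state and any failed members
[root@oss006 ~]# mdadm --detail /dev/md2          # per-disk status of the hung OST array
[root@oss006 ~]# echo 1 > /proc/sys/kernel/sysrq  # make sure sysrq is enabled
[root@oss006 ~]# echo t > /proc/sysrq-trigger     # dump all task stacks (including the D-state md/ldlm threads) to the kernel log
[root@oss006 ~]# dmesg | tail -200                # grab the stack traces for the stuck threads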