I've got an i386 server running centos 4.2 with 3 3ware controllers in it -- an 8006-2 for the system disks and 2 7500-8s. On the 7500s, I'm running an all software RAID50. This morning I came in to find the system hung. Turns out a disk went overnight on one of the 7500s, and rather than a graceful failover I got this: Jan 6 01:03:58 $SERVER kernel: 3w-xxxx: scsi2: Command failed: status = 0xc7,flags = 0x40, unit #3. Jan 6 01:04:02 $SERVER kernel: 3w-xxxx: scsi2: AEN: ERROR: Drive error: Port #3. Jan 6 01:04:10 $SERVER 3w-xxxx[2781]: ERROR: Drive error encountered on port 3 on controller ID:2. Check cables and drives for media errors. (0xa) Jan 6 01:04:10 $SERVER kernel: Debug: sleeping function called from invalid context at include/asm/uaccess.h:556 Jan 6 01:04:10 $SERVER kernel: in_atomic():0[expected: 0], irqs_disabled():1 Jan 6 01:04:10 $SERVER kernel: [<c011fbe9>] __might_sleep+0x7d/0x88 Jan 6 01:04:10 $SERVER kernel: [<f885f056>] tw_ioctl+0x478/0xb07 [3w_xxxx] Jan 6 01:04:10 $SERVER kernel: [<c011fec9>] autoremove_wake_function+0x0/0x2d Jan 6 01:04:10 $SERVER kernel: [<f883f905>] scsi_done+0x0/0x16 [scsi_mod] Jan 6 01:04:10 $SERVER kernel: [<f8860529>] tw_scsi_queue+0x163/0x1f1 [3w_xxxx] Jan 6 01:04:10 $SERVER kernel: [<f883f748>] scsi_dispatch_cmd+0x1e9/0x24f [scsi_mod] Jan 6 01:04:10 $SERVER kernel: [<f884417e>] scsi_request_fn+0x297/0x30d [scsi_mod] Jan 6 01:04:10 $SERVER kernel: [<c0221f68>] __generic_unplug_device+0x2b/0x2d Jan 6 01:04:10 $SERVER kernel: [<c0221f7f>] generic_unplug_device+0x15/0x21 Jan 6 01:04:10 $SERVER kernel: [<c022288d>] blk_execute_rq+0x88/0xb0 Jan 6 01:04:10 $SERVER kernel: [<c022096e>] elv_set_request+0xa/0x17 Jan 6 01:04:10 $SERVER kernel: [<c022251f>] get_request+0x1de/0x1e8 Jan 6 01:04:10 $SERVER kernel: [<c01a97c3>] task_has_capability+0x4a/0x52 Jan 6 01:04:10 $SERVER kernel: [<c0225cd5>] sg_scsi_ioctl+0x2bf/0x3c1 Jan 6 01:04:10 $SERVER kernel: [<c02261aa>] scsi_cmd_ioctl+0x3d3/0x475 Jan 6 01:04:10 $SERVER kernel: [<c014d41b>] handle_mm_fault+0xbd/0x175 Jan 6 01:04:10 $SERVER kernel: [<c011ad67>] do_page_fault+0x1ae/0x5c6 Jan 6 01:04:10 $SERVER kernel: [<c014e4a6>] vma_adjust+0x286/0x2d6 Jan 6 01:04:10 $SERVER kernel: [<f88228ea>] sd_ioctl+0xb3/0xd4 [sd_mod] Jan 6 01:04:10 $SERVER kernel: [<c02246e8>] blkdev_ioctl+0x328/0x334 The 3ware boards are running the latest firmware and I'm using the stock driver in the kernel. I *did* have to upgrade mdadm to 1.12.0, as the stock version doesn't support stacked arrays. 3dmd is running. Any ideas as to what I can do to prevent this in the future? Having the system hang every time a disk dies is, well, less than optimal. Thanks. -- Joshua Baker-LePain Department of Biomedical Engineering Duke University
Joshua Baker-LePain <jlb17 at duke.edu> wrote:> I'm running an all software RAID50 ... > This morning I came in to find the system hung. > Turns out a disk went overnight on one of the 7500s, > and rather than a graceful failover I got this: > Jan 6 01:03:58 $SERVER kernel: 3w-xxxx: scsi2: Command > failed: status = 0xc7,flags = 0x40, unit #3. > Jan 6 01:04:02 $SERVER kernel: 3w-xxxx: scsi2: AEN: ERROR: > Drive error: Port #3. > Jan 6 01:04:10 $SERVER 3w-xxxx[2781]: ERROR: Drive error > encountered on port 3 on controller ID:2. Check cables and > drives for media errors. (0xa)Yes, the drive failed. Had you used the 3Ware's intelligent hardware RAID, it would have hidden the drive disconnect from the system. You'd see a log entry on the failure, and that the array was in a "downgraded" state. Instead, you're using software RAID, and it's up to the kernel to not panic on itself because a disk is no longer available. The problem isn't the 3Ware controller, it's the software RAID logic in the kernel.> Any ideas as to what I can do to prevent this in the > future?Use the 3Ware card as it is intended, a hardware RAID card.> Having the system hang every time a disk dies is, well,less> than optimal.No joke. It wasn't until even kernel 2.6 that hotplug support was offered, and it still does _not_ work as advertised. It's stuff like this that makes me want to strangle most advocates of using 3Ware cards with software RAID. There are countless issues like this -- far more than the alleged "hardware lock-in" negative of using hardware RAID. -- Bryan J. Smith Professional, Technical Annoyance b.j.smith at ieee.org http://thebs413.blogspot.com ---------------------------------------------------- *** Speed doesn't kill, difference in speed does ***
Joshua Baker-LePain wrote:> I've got an i386 server running centos 4.2 with 3 3ware controllers in > it -- an 8006-2 for the system disks and 2 7500-8s. On the 7500s, I'm > running an all software RAID50. This morning I came in to find the > system hung. Turns out a disk went overnight on one of the 7500s, and > rather than a graceful failover I got this: > > Jan 6 01:03:58 $SERVER kernel: 3w-xxxx: scsi2: Command failed: status = > 0xc7,flags = 0x40, unit #3. > Jan 6 01:04:02 $SERVER kernel: 3w-xxxx: scsi2: AEN: ERROR: Drive error: > Port #3. > Jan 6 01:04:10 $SERVER 3w-xxxx[2781]: ERROR: Drive error encountered on > port 3 on controller ID:2. Check cables and drives for media errors. (0xa) > Jan 6 01:04:10 $SERVER kernel: Debug: sleeping function called from > invalid context at include/asm/uaccess.h:556 > Jan 6 01:04:10 $SERVER kernel: in_atomic():0[expected: 0], > irqs_disabled():1 > Jan 6 01:04:10 $SERVER kernel: [<c011fbe9>] __might_sleep+0x7d/0x88 > Jan 6 01:04:10 $SERVER kernel: [<f885f056>] tw_ioctl+0x478/0xb07 > [3w_xxxx] > Jan 6 01:04:10 $SERVER kernel: [<c011fec9>] > autoremove_wake_function+0x0/0x2d > Jan 6 01:04:10 $SERVER kernel: [<f883f905>] scsi_done+0x0/0x16 [scsi_mod] > Jan 6 01:04:10 $SERVER kernel: [<f8860529>] tw_scsi_queue+0x163/0x1f1 > [3w_xxxx] > Jan 6 01:04:10 $SERVER kernel: [<f883f748>] > scsi_dispatch_cmd+0x1e9/0x24f [scsi_mod] > Jan 6 01:04:10 $SERVER kernel: [<f884417e>] > scsi_request_fn+0x297/0x30d [scsi_mod] > Jan 6 01:04:10 $SERVER kernel: [<c0221f68>] > __generic_unplug_device+0x2b/0x2d > Jan 6 01:04:10 $SERVER kernel: [<c0221f7f>] > generic_unplug_device+0x15/0x21 > Jan 6 01:04:10 $SERVER kernel: [<c022288d>] blk_execute_rq+0x88/0xb0 > Jan 6 01:04:10 $SERVER kernel: [<c022096e>] elv_set_request+0xa/0x17 > Jan 6 01:04:10 $SERVER kernel: [<c022251f>] get_request+0x1de/0x1e8 > Jan 6 01:04:10 $SERVER kernel: [<c01a97c3>] task_has_capability+0x4a/0x52 > Jan 6 01:04:10 $SERVER kernel: [<c0225cd5>] sg_scsi_ioctl+0x2bf/0x3c1 > Jan 6 01:04:10 $SERVER kernel: [<c02261aa>] scsi_cmd_ioctl+0x3d3/0x475 > Jan 6 01:04:10 $SERVER kernel: [<c014d41b>] handle_mm_fault+0xbd/0x175 > Jan 6 01:04:10 $SERVER kernel: [<c011ad67>] do_page_fault+0x1ae/0x5c6 > Jan 6 01:04:10 $SERVER kernel: [<c014e4a6>] vma_adjust+0x286/0x2d6 > Jan 6 01:04:10 $SERVER kernel: [<f88228ea>] sd_ioctl+0xb3/0xd4 [sd_mod] > Jan 6 01:04:10 $SERVER kernel: [<c02246e8>] blkdev_ioctl+0x328/0x334 > > The 3ware boards are running the latest firmware and I'm using the stock > driver in the kernel. I *did* have to upgrade mdadm to 1.12.0, as the > stock version doesn't support stacked arrays. 3dmd is running. > > Any ideas as to what I can do to prevent this in the future? Having the > system hang every time a disk dies is, well, less than optimal.I have a similar problem with using the hardware raid(mirror). Every time 3dmd started a scheduled verify at midnight... anywhere from 0 to 26 minutes later the kernel would crash. This happened every night at 12. I finally disabled the verify task in 3dmd and the crashes stopped. I now just use smartd to do extended tests which do not show any problems with the disks. The crash dump and log indicates that port0 is bad though. I have the crash dumps and it is reproducible if I enable verify again... Anyone know of a way to get to the bottom of the crash and find a fix? I keep getting the feeling of "See... you should have bought RHEL to get support!". Too expensive for my use of this system though. 3w-xxxx: scsi0: Command failed: status = 0xc4, flags = 0x3b, unit #0. 3w-xxxx: scsi0: AEN: INFO: Initialization started: Unit #0. 3w-xxxx: scsi0: AEN: INFO: Initialization started: Unit #0. 3w-xxxx: scsi0: AEN: INFO: Initialization complete: Unit #0. 3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0. 3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0. 3w-xxxx: scsi0: AEN: INFO: Verify complete: Unit #0. 3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0. 3w-xxxx: scsi0: AEN: INFO: Verify complete: Unit #0. 3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0. 3w-xxxx: scsi0: AEN: INFO: Verify started: Unit #0. 3w-xxxx: scsi0: AEN: ERROR: Verify failed: Port #0. 3w-xxxx: scsi0: AEN: INFO: Initialization started: Unit #0. Debug: sleeping function called from invalid context at include/asm/uaccess.h:556 in_atomic():0[expected: 0], irqs_disabled():1 [<c011df50>] __might_sleep+0x7d/0x89 [<d0840550>] tw_ioctl+0x466/0xd80 [3w_xxxx] [<d0858dfc>] scsi_done+0x0/0x16 [scsi_mod] [<d08422a0>] tw_scsi_queue+0x222/0x312 [3w_xxxx] [<d0858bee>] scsi_dispatch_cmd+0x2f6/0x3ad [scsi_mod] [<d085eab2>] scsi_request_fn+0x385/0x5b4 [scsi_mod] [<c024ee5a>] __generic_unplug_device+0x2b/0x2d [<c024eed7>] generic_unplug_device+0x7b/0xe0 [<c024fbd6>] blk_execute_rq+0xbb/0xe3 [<c011e57b>] autoremove_wake_function+0x0/0x2d [<c02573c5>] cfq_set_request+0x33/0x6b [<c0257392>] cfq_set_request+0x0/0x6b [<c024db99>] elv_set_request+0xa/0x17 [<c024f77b>] get_request+0x395/0x39f [<c0253181>] sg_scsi_ioctl+0x2bf/0x3c1 [<c0253656>] scsi_cmd_ioctl+0x3d3/0x475 [<c0157a48>] do_no_page+0x55/0x3bf [<c018103f>] dput+0x33/0x423 [<c0157f92>] handle_mm_fault+0xd5/0x1fd [<c011a8ed>] do_page_fault+0x1ac/0x4dc [<d081a8f4>] sd_ioctl+0xb6/0xd7 [sd_mod] [<c0251d86>] blkdev_ioctl+0x32b/0x337 [<c0171c2e>] block_ioctl+0x11/0x13 [<c017c0f1>] sys_ioctl+0x297/0x336 [<c030f8cb>] syscall_call+0x7/0xb [<c030007b>] xfrm_policy_gc_kill+0x39/0x68 Debug: sleeping function called from invalid context at include/asm/uaccess.h:531 in_atomic():0[expected: 0], irqs_disabled():1 [<c011df50>] __might_sleep+0x7d/0x89 [<d0840b85>] tw_ioctl+0xa9b/0xd80 [3w_xxxx] [<d0858dfc>] scsi_done+0x0/0x16 [scsi_mod] [<d08422a0>] tw_scsi_queue+0x222/0x312 [3w_xxxx] [<d0858bee>] scsi_dispatch_cmd+0x2f6/0x3ad [scsi_mod] [<d085eab2>] scsi_request_fn+0x385/0x5b4 [scsi_mod] [<c024ee5a>] __generic_unplug_device+0x2b/0x2d [<c024eed7>] generic_unplug_device+0x7b/0xe0 [<c024fbd6>] blk_execute_rq+0xbb/0xe3 [<c011e57b>] autoremove_wake_function+0x0/0x2d [<c02573c5>] cfq_set_request+0x33/0x6b [<c0257392>] cfq_set_request+0x0/0x6b [<c024db99>] elv_set_request+0xa/0x17 [<c024f77b>] get_request+0x395/0x39f [<c0253181>] sg_scsi_ioctl+0x2bf/0x3c1 [<c0253656>] scsi_cmd_ioctl+0x3d3/0x475 [<c0157a48>] do_no_page+0x55/0x3bf [<c018103f>] dput+0x33/0x423 [<c0157f92>] handle_mm_fault+0xd5/0x1fd [<c011a8ed>] do_page_fault+0x1ac/0x4dc [<d081a8f4>] sd_ioctl+0xb6/0xd7 [sd_mod] [<c0251d86>] blkdev_ioctl+0x32b/0x337 [<c0171c2e>] block_ioctl+0x11/0x13 [<c017c0f1>] sys_ioctl+0x297/0x336 [<c030f8cb>] syscall_call+0x7/0xb [<c030007b>] xfrm_policy_gc_kill+0x39/0x68 Kernel panic - not syncing: drivers/scsi/scsi_lib.c:1245: spin_lock(drivers/scsi/hosts.c:cf950034) already locked by drivers/scsi/3w-xxxx.c/1950