Hans Petter Selasky
2011-Jan-25 07:49 UTC
Fwd: Re: System lockups caused by USB external HDD
----------  Forwarded Message  ----------

Subject: Re: System lockups caused by USB external HDD
Date: Tuesday 25 January 2011, 01:48:03
From: CDP <dr.clau@gmail.com>
To: freebsd-usb@freebsd.org
CC: Hans Petter Selasky <hselasky@c2i.net>, mav@freebsd.org

On 01/24/11 13:27, Hans Petter Selasky wrote:
> On Monday 24 January 2011 12:08:47 CDP wrote:
>> On 01/24/11 11:34, Hans Petter Selasky wrote:
>>> On Monday 24 January 2011 10:00:53 CDP wrote:
>>>> On 01/24/11 01:56, Daniel O'Connor wrote:
>>>>> On 24/01/2011, at 9:10, CDP wrote:
>>>>>> g_vfs_done():da0s2[WRITE(offset=xxxxxxxxxxxx, length=16384)]error = 5
>>>>>> [several more lines similar to the above]
>>>>>> panic: softdep_move_dependencies: need merge code
>>>>>> cpuid = 0
>>>>>> KDB: stack backtrace:
>>>>>> #0 0x... at kdb_backtrace+0x5e
>>>>>> #1 0x... at panic+0x182
>>>>>
>>>>> It looks like the disk is dying, or the FS is corrupt (the former might
>>>>> cause the latter).
>>>>>
>>>>> Can you run smartctl on the disk? Unfortunately a lot of enclosures
>>>>> reject SMART commands so you might not be able to :(
>>>>
>>>> I have attached the output of smartctl -d sat -a /dev/da0. I didn't yet
>>>> run a SMART long test for the simple reason that the disk is going into
>>>> sleep mode and interrupts it. Haven't bothered to keep it alive for a
>>>> long test but I might just do that.
>>>>
>>>> Although, I doubt it's a disk failure, since I do backups on it without
>>>> problems by using FreeBSD 7.3, on the same space where FreeBSD 8.x
>>>> fails. And I am talking about over 150GB of data in one run, while
>>>> 8.2-RC2 crashes after 5-10GB. I have experienced disk failure in the
>>>> past, on SATA, and a few read/write errors never caused a system lockup.
>>>>
>>>> My feeling is that enough traffic on USB causes the problem, and that
>>>> this problem is only present in the new USB stack.
>>>> Unfortunately downgrading to 7.x is not an option because there are
>>>> things that won't work on this notebook.
>>>
>>> If you run a simple test like this:
>>>
>>> dd if=/dev/da0 of=/dev/null bs=65536
>>> dd if=/dev/da0 of=/dev/null bs=16384
>>>
>>> Do you then see any errors?
>>>
>>> Do you have a spare USB memory stick which you could run similar write
>>> tests on?
>>
>> Both reads fail with I/O error, while writes to an unused partition seem
>> to be fine (I interrupted the writes after a while):
>>
>> % dd if=/dev/da0 of=/dev/null bs=65536
>> dd: /dev/da0: Input/output error
>> 191732+0 records in
>> 191732+0 records out
>> 12565348352 bytes transferred in 429.999272 secs (29221790 bytes/sec)
>>
>> % dd if=/dev/da0 of=/dev/null bs=16384
>> dd: /dev/da0: Input/output error
>> 126427+0 records in
>> 126427+0 records out
>> 2071379968 bytes transferred in 169.431766 secs (12225452 bytes/sec)
>>
>> # dd if=/dev/random of=/dev/da0s3 bs=65536
>> ^C329378+0 records in
>> 329377+0 records out
>> 21586051072 bytes transferred in 1003.020293 secs (21521051 bytes/sec)
>>
>> # dd if=/dev/random of=/dev/da0s3 bs=16384
>> ^C679571+0 records in
>> 679571+0 records out
>> 11134091264 bytes transferred in 690.135793 secs (16133189 bytes/sec)
>>
>> This is what I get in /var/log/messages when the I/O error occurs:
>> (da0:umass-sim0:0:0:0): AutoSense failed
>>
>> However, I experience no lockup. Maybe this situation is not handled
>> correctly at another level ?
>
> I haven't looked into the code of CAM or GEOM that much so I won't say too
> much about that. I believe the USB/umass is not to blame. What you could do is
> to add a conditional error printout in "umass_t_bbb_status_callback()" in
> /sys/dev/usb/storage/umass.c when the error happens. If that error is not a
> USB transport error, then we are most likely seeing a SCSI issue in layers
> above umass. Or if you have access to USB analyser use that.
> There is now also the option to trace USB from the kernel itself, but the
> feature is in its early development.

The panics I was able to catch/inspect (latest from add_to_worklist() /
ffs_softdep.c) indicated they were thrown by ffs/softupdates code,
therefore I tried disabling softupdates. The system doesn't panic anymore.

The operations on the USB HDD still stop, but after several tens of seconds
the system logs the 'autosense failed' error, a bunch of write errors, and
the copy operation resumes. md5 shows the copied files are identical to the
source files.

In 7.x I don't recall having any kind of errors, nor temporary locks in
disk operations, so I'm guessing the 'autosense failed' situation is
handled differently in 8.x, compared to 7.x.

Claudiu.

-----------------------------------------
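
The two dd read tests in the thread stopped at different offsets depending on
the block size. A minimal sketch of the same sequential read test, which also
reports the byte offset at which the first read error appears, could look like
the following. The function name scan() and its limit parameter are
illustrative and not part of any FreeBSD tool. The script only reads, and on a
raw device the block size should stay a multiple of the sector size.

```python
import os
import sys


def scan(path, block_size=65536, limit=None):
    """Read `path` sequentially in `block_size` chunks.

    Returns (bytes_read, error), where error is the OSError raised by the
    first failing read, or None if end of file (or `limit`) was reached.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while limit is None or total < limit:
            try:
                chunk = os.read(fd, block_size)
            except OSError as exc:
                # e.g. EIO ("Input/output error"), as dd reported in the thread
                return total, exc
            if not chunk:
                break  # end of file or device
            total += len(chunk)
    finally:
        os.close(fd)
    return total, None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: script.py /dev/da0 [block_size]
    bs = int(sys.argv[2]) if len(sys.argv) > 2 else 65536
    done, err = scan(sys.argv[1], bs)
    if err is None:
        print("read %d bytes without error" % done)
    else:
        print("read error after %d bytes: %s" % (done, err))
```

Running it once with 65536 and once with 16384 mirrors the two dd invocations
above and makes it easy to compare where each run fails.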