Hi.
On 01/30/12 22:46, Andrew Boyer wrote:
> I have a system that appears to have a flaky SATA controller (one of the
> Intel ESB2 variants) and it seems to be exposing a weakness in the ATA driver
> (not using ATA_CAM). If a command with ATA_R_DIRECT set times out, the channel
> gets reinitialized, but from the soft interrupt context. It panics when it
> tries to sleep in ata_queue_request().
>
> Timeouts work if ATA_R_DIRECT isn't set because in that case it uses a
> taskqueue to complete the request.
>
> Here is the backtrace:
>> #0 kdb_enter (why=0xffffffff80962cfa "panic", msg=0xa <Address 0xa out of bounds>) at ../../../kern/subr_kdb.c:349
>> #1 0xffffffff805d6d0b in panic (fmt=Variable "fmt" is not available.
>> ) at ../../../kern/kern_shutdown.c:689
>> #2 0xffffffff8061bc53 in sleepq_add (wchan=0xffffff00052c3e58, lock=0xffffff00052c3e38, wmesg=0xffffffff808fa213 "ATA request done",
>> flags=1, queue=0) at ../../../kern/subr_sleepqueue.c:320
>> #3 0xffffffff80590c95 in _cv_timedwait (cvp=0xffffff00052c3e58, lock=0xffffff00052c3e38, timo=40000) at ../../../kern/kern_condvar.c:313
>> #4 0xffffffff805d61af in _sema_timedwait (sema=0xffffff00052c3e38, timo=40000, file=0xffffffff808fa1f6 "../../../dev/ata/ata-queue.c",
>> line=118) at ../../../kern/kern_sema.c:123
>> #5 0xffffffff8028559f in ata_queue_request (request=0xffffff00052c3dc0) at ../../../dev/ata/ata-queue.c:117
>> #6 0xffffffff80286628 in ata_controlcmd (dev=0xffffff0002e83d00, command=239 '?', feature=Variable "feature" is not available.
>> ) at ../../../dev/ata/ata-queue.c:153
>> #7 0xffffffff8027ffd3 in ata_setmode (dev=0xffffff0002e83d00) at ../../../dev/ata/ata-all.c:637
>> #8 0xffffffff802a0af9 in ad_init (dev=0xffffff0002e83d00) at ../../../dev/ata/ata-disk.c:405
>> #9 0xffffffff802a0c29 in ad_reinit (dev=0xffffff0002e83d00) at ../../../dev/ata/ata-disk.c:221
>> #10 0xffffffff80280cad in ata_reinit (dev=0xffffff0002902800) at ata_if.h:79
>> #11 0xffffffff802856c4 in ata_completed (context=Variable "context" is not available.
>> ) at ../../../dev/ata/ata-queue.c:313
>> #12 0xffffffff80285ffb in ata_finish (request=0xffffff00054ec8c0) at ../../../dev/ata/ata-queue.c:265
>> #13 0xffffffff805ed419 in softclock (arg=Variable "arg" is not available.
>> ) at ../../../kern/kern_timeout.c:430
>
> This is very repeatable. I'm not sure what's the best fix - always
> use a taskqueue on timeouts? Don't reinit if direct commands fail?

This is one of the messiest parts of the old ata(4). The problem is that
reinit is implemented to work synchronously. That means that if some
command times out and triggers a reinit, the reinit runs from the
taskqueue and blocks it. As a result, we can't use the taskqueue for
completion there, and we can't do a reinit when one of the reinit
commands itself times out. That is handled using the ATA_STALL_QUEUE
flag. I remember that I intentionally blocked new device detection
during reinit to avoid problems with the taskqueue there.

As for ATA_R_DIRECT, sorry, I don't remember why it is used there or why
it is needed at all. It was done before my time. The only place where I
see it set, apart from ataraid, is ata_getparam(), which should be
called only on the initial bus probe.
--
Alexander Motin