Mike Christie
2023-Jul-11 20:21 UTC
[PATCH] Revert "virtio-scsi: Send "REPORTED LUNS CHANGED" sense data upon disk hotplug events"
What was the issue you are seeing? Was it something like you get the UA. We retry then on one of the retries the sense is not setup correctly, so the scsi error handler runs? That fails and the device goes offline? If you turn on scsi debugging you would see: [ 335.445922] sd 0:0:0:0: [sda] tag#15 Add. Sense: Reported luns data has changed [ 335.445922] sd 0:0:0:0: [sda] tag#16 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 335.445925] sd 0:0:0:0: [sda] tag#16 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 335.445929] sd 0:0:0:0: [sda] tag#17 Done: FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s [ 335.445932] sd 0:0:0:0: [sda] tag#17 CDB: Write(10) 2a 00 00 db 4f c0 00 00 20 00 [ 335.445934] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 335.445936] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 335.445938] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 335.445940] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 335.445942] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 335.445945] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 335.451447] scsi host0: scsi_eh_0: waking up 0/2/2 [ 335.451453] scsi host0: Total of 2 commands on 1 devices require eh work [ 335.451457] sd 0:0:0:0: [sda] tag#16 scsi_eh_0: requesting sense I don't know the qemu scsi code well, but I scanned the code for my co-worker and my guess was commit 8cc5583abe6419e7faaebc9fbd109f34f4c850f2 had a race in it. How is locking done? when it is a bus level UA but there are multiple devices on the bus? Is it possible, devA is clearing the sense on devB. For example, thread1 for devA is doing scsi_clear_unit_attention but then thread2 for devB has seen that bus->unit_attention so it set req ops to reqops_unit_attention. But when we run reqops_unit_attention.send_command scsi_unit_attention does not see req->bus->unit_attention set anymore so we get a CC with no sense. If the linux kernel scsi layer sees a CC with no sense then we fire the SCSI error handler like seen above in the logs. On 7/11/23 12:06 PM, Stefano Garzarella wrote:> CCing `./scripts/get_maintainer.pl -f drivers/scsi/virtio_scsi.c`, > since I found a few things in the virtio-scsi driver... > > FYI we have seen that Linux has problems with a QEMU patch for the > virtio-scsi device (details at the bottom of this email in the revert > commit message and BZ). > > > This is what I found when I looked at the Linux code: > > In scsi_report_sense() in linux/drivers/scsi/scsi_error.c linux calls > scsi_report_lun_change() that set `sdev_target->expecting_lun_change > 1` when we receive a UNIT ATTENTION with REPORT LUNS CHANGED > (sshdr->asc == 0x3f && sshdr->ascq == 0x0e). > > When `sdev_target->expecting_lun_change = 1` is set and we call > scsi_check_sense(), for example to check the next UNIT ATTENTION, it > will return NEEDS_RETRY, that I think will cause the issues we are > seeing. > > `sdev_target->expecting_lun_change` is reset only in > scsi_decide_disposition() when `REPORT_LUNS` command returns with > SAM_STAT_GOOD. > That command is issued in scsi_report_lun_scan() called by > __scsi_scan_target(), called for example by scsi_scan_target(), > scsi_scan_host(), etc. > > So, checking QEMU, we send VIRTIO_SCSI_EVT_RESET_RESCAN during hotplug > and VIRTIO_SCSI_EVT_RESET_REMOVED during hotunplug. In both cases now we > send also the UNIT ATTENTION. > > In the virtio-scsi driver, when we receive VIRTIO_SCSI_EVT_RESET_RESCAN > (hotplug) we call scsi_scan_target() or scsi_add_device(). Both of them > will call __scsi_scan_target() at some points, sending `REPORT_LUNS` > command to the device. This does not happen for > VIRTIO_SCSI_EVT_RESET_REMOVED (hotunplug). Indeed if I remove the > UNIT ATTENTION from the hotunplug in QEMU, everything works well. > > So, I tried to add a scan also for VIRTIO_SCSI_EVT_RESET_REMOVED: > > diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c > index bd5633667d01..c57658a63097 100644 > --- a/drivers/scsi/virtio_scsi.c > +++ b/drivers/scsi/virtio_scsi.c > @@ -291,6 +291,7 @@ static void virtscsi_handle_transport_reset(struct virtio_scsi *vscsi, > ??????????????? } > ??????????????? break; > ??????? case VIRTIO_SCSI_EVT_RESET_REMOVED: > +?????????????? scsi_scan_host(shost); > ??????????????? sdev = scsi_device_lookup(shost, 0, target, lun); > ??????????????? if (sdev) { > ??????????????????????? scsi_remove_device(sdev); > > This somehow helps, now linux only breaks if the plug/unplug frequency > is really high. If I put a 5 second sleep between plug/unplug events, it > doesn't break (at least for the duration of my test which has been > running for about 30 minutes, before it used to break after about a > minute). > > Another thing I noticed is that in QEMU maybe we should set the UNIT > ATTENTION first and then send the event on the virtqueue, because the > scan should happen after the unit attention, but I don't know if in any > case the unit attention is processed before the virtqueue. > > I mean something like this: > > diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c > index 45b95ea070..13db40f4f3 100644 > --- a/hw/scsi/virtio-scsi.c > +++ b/hw/scsi/virtio-scsi.c > @@ -1079,8 +1079,8 @@ static void virtio_scsi_hotplug(HotplugHandler *hotplug_dev, DeviceState *dev, > ???????? }; > > ???????? virtio_scsi_acquire(s); > -??????? virtio_scsi_push_event(s, &info); > ???????? scsi_bus_set_ua(&s->bus, SENSE_CODE(REPORTED_LUNS_CHANGED)); > +??????? virtio_scsi_push_event(s, &info); > ???????? virtio_scsi_release(s); > ???? } > ?} > @@ -1111,8 +1111,8 @@ static void virtio_scsi_hotunplug(HotplugHandler *hotplug_dev, DeviceState *dev, > > ???? if (virtio_vdev_has_feature(vdev, VIRTIO_SCSI_F_HOTPLUG)) { > ???????? virtio_scsi_acquire(s); > -??????? virtio_scsi_push_event(s, &info); > ???????? scsi_bus_set_ua(&s->bus, SENSE_CODE(REPORTED_LUNS_CHANGED)); > +??????? virtio_scsi_push_event(s, &info); > ???????? virtio_scsi_release(s); > ???? } > ?} > > At this point I think the problem is on the handling of the > VIRTIO_SCSI_EVT_RESET_REMOVED event in the virtio-scsi driver, where > somehow we have to redo the bus scan, but scsi_scan_host() doesn't seem > to be enough when the event rate is very high. > > I don't know if along with this fix, we also need to limit the rate in > QEMU somehow. > > Sorry for the length of this email, but I'm not familiar with SCSI and > wanted some suggestions on how to proceed. > > Paolo, Stefan, Linux SCSI maintainers, any suggestion? > > > Thanks, > Stefano > > On Wed, Jul 05, 2023 at 09:15:23AM +0200, Stefano Garzarella wrote: >> This reverts commit 8cc5583abe6419e7faaebc9fbd109f34f4c850f2. >> >> That commit causes several problems in Linux as described in the BZ. >> In particular, after a while, other devices on the bus are no longer >> usable even if those devices are not affected by the hotunplug. >> This may be a problem in Linux, but we have not been able to identify >> it so far. So better to revert this patch until we find a solution. >> >> Also, Oracle, which initially proposed this patch for a problem with >> Solaris, seems to have already reversed it downstream: >> ?? https://linux.oracle.com/errata/ELSA-2023-12065.html >> >> Suggested-by: Thomas Huth <thuth at redhat.com> >> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2176702 >> Cc: qemu-stable at nongnu.org >> Cc: Mark Kanda <mark.kanda at oracle.com> >> Signed-off-by: Stefano Garzarella <sgarzare at redhat.com> >> --- >> include/hw/scsi/scsi.h |? 1 - >> hw/scsi/scsi-bus.c???? | 18 ------------------ >> hw/scsi/virtio-scsi.c? |? 2 -- >> 3 files changed, 21 deletions(-) >> >> diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h >> index e2bb1a2fbf..7c8adf10b1 100644 >> --- a/include/hw/scsi/scsi.h >> +++ b/include/hw/scsi/scsi.h >> @@ -198,7 +198,6 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, BlockBackend *blk, >> ????????????????????????????????????? BlockdevOnError rerror, >> ????????????????????????????????????? BlockdevOnError werror, >> ????????????????????????????????????? const char *serial, Error **errp); >> -void scsi_bus_set_ua(SCSIBus *bus, SCSISense sense); >> void scsi_bus_legacy_handle_cmdline(SCSIBus *bus); >> >> SCSIRequest *scsi_req_alloc(const SCSIReqOps *reqops, SCSIDevice *d, >> diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c >> index f80f4cb4fc..42a915f8b7 100644 >> --- a/hw/scsi/scsi-bus.c >> +++ b/hw/scsi/scsi-bus.c >> @@ -1617,24 +1617,6 @@ static int scsi_ua_precedence(SCSISense sense) >> ??? return (sense.asc << 8) | sense.ascq; >> } >> >> -void scsi_bus_set_ua(SCSIBus *bus, SCSISense sense) >> -{ >> -??? int prec1, prec2; >> -??? if (sense.key != UNIT_ATTENTION) { >> -??????? return; >> -??? } >> - >> -??? /* >> -???? * Override a pre-existing unit attention condition, except for a more >> -???? * important reset condition. >> -???? */ >> -??? prec1 = scsi_ua_precedence(bus->unit_attention); >> -??? prec2 = scsi_ua_precedence(sense); >> -??? if (prec2 < prec1) { >> -??????? bus->unit_attention = sense; >> -??? } >> -} >> - >> void scsi_device_set_ua(SCSIDevice *sdev, SCSISense sense) >> { >> ??? int prec1, prec2; >> diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c >> index 45b95ea070..1f56607100 100644 >> --- a/hw/scsi/virtio-scsi.c >> +++ b/hw/scsi/virtio-scsi.c >> @@ -1080,7 +1080,6 @@ static void virtio_scsi_hotplug(HotplugHandler *hotplug_dev, DeviceState *dev, >> >> ??????? virtio_scsi_acquire(s); >> ??????? virtio_scsi_push_event(s, &info); >> -??????? scsi_bus_set_ua(&s->bus, SENSE_CODE(REPORTED_LUNS_CHANGED)); >> ??????? virtio_scsi_release(s); >> ??? } >> } >> @@ -1112,7 +1111,6 @@ static void virtio_scsi_hotunplug(HotplugHandler *hotplug_dev, DeviceState *dev, >> ??? if (virtio_vdev_has_feature(vdev, VIRTIO_SCSI_F_HOTPLUG)) { >> ??????? virtio_scsi_acquire(s); >> ??????? virtio_scsi_push_event(s, &info); >> -??????? scsi_bus_set_ua(&s->bus, SENSE_CODE(REPORTED_LUNS_CHANGED)); >> ??????? virtio_scsi_release(s); >> ??? } >> } >> --? >> 2.41.0 >> > > _______________________________________________ > Virtualization mailing list > Virtualization at lists.linux-foundation.org > https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Paolo Bonzini
2023-Jul-12 08:06 UTC
[PATCH] Revert "virtio-scsi: Send "REPORTED LUNS CHANGED" sense data upon disk hotplug events"
On 7/11/23 22:21, Mike Christie wrote:> What was the issue you are seeing? > > Was it something like you get the UA. We retry then on one of the > retries the sense is not setup correctly, so the scsi error handler > runs? That fails and the device goes offline? > > If you turn on scsi debugging you would see: > > > [ 335.445922] sd 0:0:0:0: [sda] tag#15 Add. Sense: Reported luns data has changed > [ 335.445922] sd 0:0:0:0: [sda] tag#16 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > [ 335.445925] sd 0:0:0:0: [sda] tag#16 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > [ 335.445929] sd 0:0:0:0: [sda] tag#17 Done: FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s > [ 335.445932] sd 0:0:0:0: [sda] tag#17 CDB: Write(10) 2a 00 00 db 4f c0 00 00 20 00 > [ 335.445934] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > [ 335.445936] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > [ 335.445938] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > [ 335.445940] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > [ 335.445942] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > [ 335.445945] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > [ 335.451447] scsi host0: scsi_eh_0: waking up 0/2/2 > [ 335.451453] scsi host0: Total of 2 commands on 1 devices require eh work > [ 335.451457] sd 0:0:0:0: [sda] tag#16 scsi_eh_0: requesting senseDoes this log come from internal discussions within Oracle?> I don't know the qemu scsi code well, but I scanned the code for my co-worker > and my guess was commit 8cc5583abe6419e7faaebc9fbd109f34f4c850f2 had a race in it. > > How is locking done? when it is a bus level UA but there are multiple devices > on the bus?No locking should be necessary, the code is single threaded. However, what can happen is that two consecutive calls to virtio_scsi_handle_cmd_req_prepare use the unit attention ReqOps, and then the second virtio_scsi_handle_cmd_req_submit finds no unit attention (see the loop in virtio_scsi_handle_cmd_vq). That can definitely explain the log above. Paolo