Michael Kluge
2010-Oct-22 08:48 UTC
[Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Hi list,
DID_BUS_BUSY means that the controller is unable to handle the SCSI command and is basically asking the host to send it again later. I think I had just one concurrent region and 32 threads running. What would be the appropriate action in this case? Reducing the queue depth on the HBA? We have QLogic HBAs here, and the kernel module has an option for this.
Regards, Michael
-- 
Michael Kluge, M.Sc., Technische Universität Dresden, Center for Information Services and High Performance Computing (ZIH), D-01062 Dresden, Germany. Contact: Willersbau, Room A 208, Phone: (+49) 351 463-34217, Fax: (+49) 351 463-37773, e-mail: michael.kluge at tu-dresden.de, WWW: http://www.tu-dresden.de/zih
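For readers who only see a raw SCSI result word in a log, the following is a minimal sketch of how the host byte can be pulled out of it. It assumes the classic Linux midlayer layout of the 32-bit result (status byte in bits 0-7, message byte in bits 8-15, host byte in bits 16-23, driver byte in bits 24-31); the function names are made up for this illustration and are not part of sgpdd-survey or sg3_utils.

```python
# Subset of the Linux SCSI midlayer host-byte codes (DID_*).
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
}


def host_byte(result: int) -> int:
    """Return bits 16-23 of a 32-bit SCSI result word (assumed layout)."""
    return (result >> 16) & 0xFF


def describe_host_byte(result: int) -> str:
    """Map the host byte to its DID_* name, if it is in the table above."""
    code = host_byte(result)
    return HOST_BYTE_NAMES.get(code, f"unknown host byte 0x{code:02x}")


if __name__ == "__main__":
    # Hypothetical result word with only the host byte set to 0x02.
    print(describe_host_byte(0x00020000))
```

Tools that use the sg driver directly usually report the host status as a separate field, in which case no shifting is needed and only the table lookup applies.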
Michael Kluge
2010-Oct-22 09:02 UTC
[Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Reducing the queue depth from the default of 32 to 8 did not help. It looks like this problem always shows up when I am writing to more than one region. 2 regions and 2 threads are enough to see the problem. The last test that succeeds is one region with 16 threads; 1/32 (one region, 32 threads) is not being tested.
Michael
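When changing the queue depth it can be worth confirming that the new value really reached every LUN. The sketch below reads the per-device `queue_depth` attribute that Linux exposes under `/sys/block/<dev>/device/`. It is only an illustration: the helper name is made up, and it assumes the LUNs show up as `sd*` block devices. The thread does not name the module option; to my knowledge the qla2xxx parameter is `ql2xmaxqdepth`, but please check `modinfo qla2xxx` on your system.

```python
from pathlib import Path


def read_queue_depths(sysfs_block="/sys/block"):
    """Return {device name: queue depth} for all sd* devices found.

    Devices without a readable queue_depth attribute are skipped.
    """
    depths = {}
    for dev in sorted(Path(sysfs_block).glob("sd*")):
        attr = dev / "device" / "queue_depth"
        try:
            depths[dev.name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            # Attribute missing or not a number: ignore this device.
            continue
    return depths


if __name__ == "__main__":
    for name, depth in read_queue_depths().items():
        print(f"{name}: queue_depth={depth}")
```

The directory argument exists only so the function can be pointed at a test tree; on a real OSS the default is what you want.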
Bernd Schubert
2010-Oct-22 11:34 UTC
[Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
On Friday, October 22, 2010, Michael Kluge wrote:
> What would be the appropriate action in this case? Reducing the queue depth on the HBA? We have Qlogic here, there is an option for the kernel module for this.
I think you have run into a known issue with the QLogic driver and the SFA10K. You will need at least qla2xxx version 8.03.01.06.05.06-k. The optimal number of commands is likely to be 16 (with 4 OSS connected).
Hope it helps,
Bernd
-- 
Bernd Schubert, DataDirect Networks
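A quick way to compare a loaded driver against the minimum version mentioned above is sketched here. It assumes that qla2xxx version strings can be ordered by comparing their dotted numeric fields and ignoring the suffix after the first "-"; that ordering is my assumption, not something QLogic or DDN documents in this thread. The function names are hypothetical.

```python
MIN_VERSION = "8.03.01.06.05.06-k"  # minimum version named in the thread


def parse_version(text):
    """Turn '8.03.01.06.05.06-k' into the tuple (8, 3, 1, 6, 5, 6)."""
    numeric = text.strip().split("-", 1)[0]
    return tuple(int(part) for part in numeric.split(".") if part.isdigit())


def is_at_least(found, required=MIN_VERSION):
    """True if 'found' is not older than 'required' (numeric fields only)."""
    return parse_version(found) >= parse_version(required)


if __name__ == "__main__":
    # On a host with the module loaded, the string could be read from
    # /sys/module/qla2xxx/version; here a literal is used instead.
    print(is_at_least("8.03.01.04.05.05-k"))
```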
Michael Kluge
2010-Oct-22 12:33 UTC
[Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Hi Bernd,
I have found a RHEL-only release of this version. It does not compile on a 2.6.27 kernel :( I really don't want to go back to 2.6.18 just to get a new driver.
Michael
Bernd Schubert
2010-Oct-22 13:03 UTC
[Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Hello Michael,
I'm sorry to hear that. Unfortunately, I really do not have the time to port this version to your kernel version.
I remember that you use Debian, but I guess you are still using a SLES kernel then? You could ask SUSE about it, although I guess they only care about SP1 with 2.6.32-sles now. If you use Debian Lenny, the RHEL5 kernel should work (and despite its name, it is internally more or less a 2.6.29 to 2.6.32 kernel). Later Debian and Ubuntu releases have a more recent udev, which requires at least 2.6.27.
You could also ask our support department whether they have any news for 2.6.27. I'm in Lustre engineering, and as we only support RHEL5 right now, I have not cared much about other kernel versions so far.
If all else fails, you will need to set the queue depth to 1, but that will also impose a big performance hit :(
Cheers,
Bernd
Michael Kluge
2010-Oct-22 13:53 UTC
[Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Hi Bernd,
> I'm sorry to hear that. Unfortunately, I really do not have the time to port this version to your kernel version.
No worries. I don't expect this :)
> I remember that you use Debian. But I guess you are still using a SLES kernel then? You could ask Suse about it, although I guess they only do care about SP1 with 2.6.32-sles now. If you use Debian Lenny, the RHEL5 kernel should work (and besides its name, it is internally more or less a 2.6.29 to 2.6.32 kernel). Later Debian and Ubuntu releases have a more recent udev, which requires at least 2.6.27.
OK, if the 2.6.18 works like a charm, I'll give the 2.6.18-194 a try.
Michael
Michael Kluge
2010-Oct-23 15:51 UTC
[Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Hi Bernd,
do you have an RPM with OFED 1.4 kernel modules for your kernel? I took a 2.6.18-164 from the Lustre kernels and OFED won't build against it. The OFED backports report lots and lots of symbols as "redefined".
Michael
On 22.10.2010 at 23:30, Bernd Schubert wrote:
> Hello Michael,
>> OK, if the 2.6.18 works like a charm, I'll give the 2.6.18-194 it a try.
> Just don't forget that -194 requires 1.8.4 (I think you had been at 1.8.3 previously). We also have this driver added as a Lustre kernel patch in our -ddn releases. 1.8.4 is in testing, but I have not uploaded it yet. 1.8.3-ddn also includes the driver together with recent security backports.
> http://eu.ddn.com:8080/lustre/lustre/1.8.3/
> Cheers,
> Bernd
Michael Kluge
2010-Oct-23 16:13 UTC
[Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Hi Bernd,
I get the same message with your kernel RPMs:
    In file included from include/linux/list.h:6,
                     from include/linux/mutex.h:13,
                     from /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/drivers/infiniband/core/addr.c:36:
    /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_addons/backport/2.6.18_FC6/include/linux/stddef.h:9: error: redeclaration of enumerator 'false'
    include/linux/stddef.h:16: error: previous definition of 'false' was here
    /var/tmp/OFED_topdir/BUILD/ofa_kernel-1.4/kernel_addons/backport/2.6.18_FC6/include/linux/stddef.h:11: error: redeclaration of enumerator 'true'
    include/linux/stddef.h:18: error: previous definition of 'true' was here
Could it be that this "2.6.18 being almost a 2.6.28/29" confuses the OFED backports, so that the 2.6.18 backport does not work anymore? Is that solvable? I found nothing in the OFED bugzilla.
Michael
Bernd Schubert
2010-Oct-23 18:38 UTC
[Lustre-discuss] sgpdd-survey provokes DID_BUS_BUSY on an SFA10K
Hello Michael,
On Saturday, October 23, 2010, Michael Kluge wrote:
> Could it be that this '2.6.18 being almost an 2.6.28/29' confuses the OFED backports and the 2.6.18 backport does not work anymore? Is that solvable? I found nothing in the OFED bugzilla.
Somewhere there is a support matrix showing which OFED version supports which RHEL version, but I would also need to search for it. Anyway, OFED 1.4 is already included in 2.6.18-164, so there is no need for any additional compilation. 2.6.18-194 (RHEL5.5) also still mostly has OFED 1.4, but with an important Mellanox driver backport (you will still additionally need a beta version to get reliable QDR with recent chips). So if you have Mellanox QDR HCAs and your connection is flaky, fluctuating between SDR and QDR, just compile OFED 1.5; it works fine with Lustre (fortunately there have been no interface changes recently). But still make sure you compile Lustre against that stack...
I also just updated our download page a bit and uploaded sources for kernel, lustre, tar and e2fsprogs.
Cheers,
Bernd