Eli Cohen
2014-Aug-31 07:53 UTC
[PATCH] Avoid Lustre failure on temporary failure on create QP
Lustre code tries to create a QP with max_send_wr which depends on a
module parameter. The device capabilities do provide the maximum number
of send work requests that the device supports but the actual number of
work requests that can be supported in a specific case depends on other
characteristics of the work queue, the transport type, etc. This is in
compliance with the IB spec:

    11.2.1.2 QUERY HCA
    Description:
    Returns the attributes for the specified HCA.
    The maximum values defined in this section are guaranteed
    not-to-exceed values. It is possible for an implementation to
    allocate some HCA resources from the same space. In that case, the
    maximum values returned are not guaranteed for all of those
    resources simultaneously.

This patch tries to decrease the number of requested work requests to a
level that can be supported by the HCA. This prevents unnecessary
failures.

Signed-off-by: Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 lnet/klnds/o2iblnd/o2iblnd.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/lnet/klnds/o2iblnd/o2iblnd.c b/lnet/klnds/o2iblnd/o2iblnd.c
index 4061db00cba2..ef1c6e07cb45 100644
--- a/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/lnet/klnds/o2iblnd/o2iblnd.c
@@ -736,6 +736,7 @@ kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
 	int cpt;
 	int rc;
 	int i;
+	int orig_wr;
 
 	LASSERT(net != NULL);
 	LASSERT(!in_interrupt());
@@ -862,13 +863,23 @@ kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
 
 	conn->ibc_sched = sched;
 
-	rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
-	if (rc != 0) {
-		CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d\n",
-		       rc, init_qp_attr->cap.max_send_wr,
-		       init_qp_attr->cap.max_recv_wr);
-		goto failed_2;
-	}
+	orig_wr = init_qp_attr->cap.max_send_wr;
+	do {
+		rc = rdma_create_qp(cmid, conn->ibc_hdev->ibh_pd, init_qp_attr);
+		if (!rc || init_qp_attr->cap.max_send_wr < 16)
+			break;
+
+		init_qp_attr->cap.max_send_wr /= 2;
+	} while (rc);
+	if (rc != 0) {
+		CERROR("Can't create QP: %d, send_wr: %d, recv_wr: %d\n",
+		       rc, init_qp_attr->cap.max_send_wr,
+		       init_qp_attr->cap.max_recv_wr);
+		goto failed_2;
+	}
+	if (orig_wr != init_qp_attr->cap.max_send_wr)
+		pr_info("original send wr %d, created with %d\n",
+			orig_wr, init_qp_attr->cap.max_send_wr);
 
 	LIBCFS_FREE(init_qp_attr, sizeof(*init_qp_attr));
 
-- 
2.1.0