Hi Guys,

I just deployed a new Lustre filesystem and was unable to mount it on a client for the first time. I was able to reach everything using:

lctl ping IP@o2ib3

All networks were up, but the client hung during the mount. So I decided to reboot my client, and was then able to mount the filesystem without any issues. Here is something that I saw on the MDS server. Any ideas what might be the problem?

Thanks in advance for your input.

--
..snip..
Jan 14 17:06:36 resmds01 kernel: Lustre: 7790:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1324894173265938 sent from reshpcfs-OST0001-osc to NID 10.0.250.47@o2ib3 0s ago has failed due to network error (limit 15s).
Jan 14 17:06:36 resmds01 kernel: Lustre: 7722:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7712:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7719:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7714:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7712:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7722:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7714:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7709:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7719:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7714:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7709:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7719:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7714:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7709:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7717:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7714:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7717:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7714:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 7712:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
Jan 14 17:06:51 resmds01 kernel: Lustre: 5616:0:(o2iblnd_cb.c:1953:kiblnd_peer_connect_failed()) Deleting messages for 10.0.250.47@o2ib3: connection failed
Jan 14 17:07:36 resmds01 kernel: LustreError: 11-0: an error occurred while communicating with 10.0.250.46@o2ib3. The ost_connect operation failed with -19
Jan 14 17:13:55 resmds01 kernel: LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_connect operation failed with -16
Jan 14 17:14:20 resmds01 kernel: LustreError: 11-0: an error occurred while communicating with 0@lo. The mds_connect operation failed with -16
..snip..
--
On Fri, 2010-01-15 at 10:09 -0800, Jagga Soorma wrote:
> Hi Guys,

Hi Jagga,

> Jan 14 17:06:36 resmds01 kernel: Lustre: 7790:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1324894173265938 sent from reshpcfs-OST0001-osc to NID 10.0.250.47@o2ib3 0s ago has failed due to network error (limit 15s).
> Jan 14 17:06:36 resmds01 kernel: Lustre: 7722:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
> Jan 14 17:06:51 resmds01 kernel: Lustre: 7712:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5
> Jan 14 17:06:51 resmds01 kernel: Lustre: 7719:0:(o2iblnd_cb.c:459:kiblnd_rx_complete()) Rx from 10.0.250.47@o2ib3 failed: 5

These are networking failures.

b.
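
For anyone reading the later "LustreError: 11-0" lines in the original log: the negative numbers in "operation failed with -N" look like standard Linux errno values (this is an assumption based on the usual kernel convention, not something stated in the thread). Below is a minimal Python sketch that pulls those codes out of a log excerpt and maps them to errno names. The names `OP_FAILED` and `decode_failures` are hypothetical helpers, not part of any Lustre tool. The sketch deliberately ignores the "Rx from ... failed: 5" lines, since that number may not be an errno.

```python
import errno
import re

# Matches the tail of "LustreError: 11-0" lines such as
# "The ost_connect operation failed with -19".
# Assumption: the number is a negated Linux errno value.
OP_FAILED = re.compile(r"The (\w+) operation failed with (-?\d+)")


def decode_failures(log_text):
    """Return (operation, code, errno_name) for each matching log entry."""
    results = []
    for op, code in OP_FAILED.findall(log_text):
        number = abs(int(code))
        # errno.errorcode maps numeric values to symbolic names.
        name = errno.errorcode.get(number, "unknown")
        results.append((op, int(code), name))
    return results


# Two entries taken from the log excerpt posted in this thread.
sample = (
    "Jan 14 17:07:36 resmds01 kernel: LustreError: 11-0: an error occurred "
    "while communicating with 10.0.250.46@o2ib3. "
    "The ost_connect operation failed with -19\n"
    "Jan 14 17:13:55 resmds01 kernel: LustreError: 11-0: an error occurred "
    "while communicating with 0@lo. "
    "The mds_connect operation failed with -16\n"
)

for op, code, name in decode_failures(sample):
    print(f"{op}: {code} ({name})")
```

Under that assumption, -19 corresponds to ENODEV and -16 to EBUSY, which may help narrow down which target was not reachable or not yet ready at the time of the mount.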