Unfortunately, we've had lots of reports of IB instability. It seems to
happen fairly often, and it generally is not a Lustre problem at all.
- Check all mechanical connections, cables, etc., and replace them if need
be. Many issues have been cable-related.
- Check the firmware versions of all IB cards and find the best version for
your hardware.
- Make sure your IB cards are in the proper (best performing) slots in your
backplane.
- If you have an IB switch with monitoring/error reporting, you may be able
to get more data from it.
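
While the fabric is still responding, the port counters that perfquery
prints are one place to look for signs of a bad cable or port. What follows
is only a minimal Python sketch under stated assumptions, not a definitive
tool: it picks the nonzero error-type counters out of saved perfquery
output. It assumes each counter is printed as a name, a colon, dot padding,
and a decimal value. The function name and the keyword list are hypothetical
choices for illustration, and counter names differ between versions of the
diagnostics, so adjust the keywords to match what your perfquery prints.

```python
import re

# Assumed line shape: "CounterName:.......value" (dots pad name and value).
# Lines that do not fit, such as comment headers or hex values, are skipped.
COUNTER_LINE = re.compile(r"^(\w+):\.*(\d+)\s*$")

# Hypothetical keyword list: substrings that mark a counter as error-related.
# Counter names vary between versions, so treat this as a starting point.
SUSPECT_KEYWORDS = ("Err", "LinkDowned", "LinkRecover", "Discard")


def nonzero_error_counters(text):
    """Return {counter_name: value} for error-type counters above zero."""
    found = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line.strip())
        if not match:
            continue
        name, value = match.group(1), int(match.group(2))
        if value > 0 and any(key in name for key in SUSPECT_KEYWORDS):
            found[name] = value
    return found
```

To use it, save the perfquery output for a port to a file while the node is
healthy, read the file into a string, and pass that string to
nonzero_error_counters(). Counters that keep climbing between two samples
are more telling than a single nonzero value.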
cliffw
On Thu, Mar 17, 2011 at 10:54 AM, Kevin Hildebrand <kevin at umd.edu> wrote:
>
> We've been seeing occasional hangs on our MDS and I'd like to see if
> anyone else is seeing this or can provide suggestions on where to look.
> This might not even be a Lustre problem at all.
>
> We're running Lustre 1.8.4 with OFED 1.5.2, and kernel version
> 2.6.18-194.3.1.el5_lustre.1.8.4.
>
> The problem is that at some point it appears that something in the IB
> stack is going out to lunch: pings to the IPoIB interface time out, and
> anything that touches IB (perfquery, etc) goes into a hard hang and cannot
> be killed.
>
> The only solution to the problem once it occurs is to power-cycle the
> machine, as shutdown/reboot hang as well.
>
> From what I can see, the first abnormal entries in the system logs on
> the MDS are messages showing that connections to the OSSes are timing out.
>
> Any insight would be appreciated.
>
> Thanks,
>
> Kevin
--
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com