Hi everyone,

I have been having a problem lately with our Lustre 1.8 deployment. It crashes periodically in such a way that the nodes can mount the storage and I can't access the Lustre server machine either. So I have to restart the machine manually every time to get everything back to normal. I looked at the logs, memory usage, and lock counts to see whether any of them might be the cause, but I don't think they account for this issue.

An interesting symptom I see every time this happens is that the network activity lights on the InfiniBand switch blink very fast. I think heavy traffic on the InfiniBand network to the Lustre server may be causing the server to crash. Does this connection seem plausible?

Anyway, I hope some of you have experienced this problem before and can help me understand what is happening and how to keep the server from crashing again!

Thanks,
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss-aLEFhgZF4x6X6Mz3xDxJMA@public.gmane.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Sorry, I have to correct this: "the nodes CANNOT mount the storage and I can't access the Lustre server machine either".
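One way to check the traffic hypothesis from the original post, rather than relying on the switch lights, would be to sample the InfiniBand port counters on the server while it is still responsive. The following is only a minimal sketch under stated assumptions: it assumes the kernel exposes the usual sysfs counter files under `/sys/class/infiniband/<device>/ports/<port>/counters/`, that `port_rcv_data` counts 32-bit words (so it is multiplied by 4 to get bytes), and the device name `mlx4_0` is purely hypothetical. On some HCAs these counters may be 32-bit and saturate, in which case the computed rate would be misleading.

```python
import time
from pathlib import Path

# Assumption: the InfiniBand data counters count 32-bit words, not bytes.
WORD_BYTES = 4


def read_counter(path):
    """Read a single integer counter value from a sysfs-style file."""
    return int(Path(path).read_text().strip())


def bytes_per_second(first, second, interval_s):
    """Convert two counter samples taken interval_s seconds apart to bytes/s."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (second - first) * WORD_BYTES / interval_s


def sample_rate(counter_path, interval_s=1.0):
    """Sample the counter twice and return the average receive rate in bytes/s."""
    first = read_counter(counter_path)
    time.sleep(interval_s)
    second = read_counter(counter_path)
    return bytes_per_second(first, second, interval_s)


if __name__ == "__main__":
    # Hypothetical device and port; adjust to whatever the server actually has.
    path = "/sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_data"
    print(f"receive rate: {sample_rate(path):.0f} bytes/s")
```

Logging this rate periodically to a file on another machine would show whether traffic really spikes before each hang, or whether the blinking lights are a side effect (for example, clients retrying against an unresponsive server).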
<Abraham.Alawi-uXK74I8fpZ0@public.gmane.org>
2013-Oct-10 00:25 UTC
Re: Lustre crashes periodically
Did you run lfsck against it? No kernel crash dumps? Maybe it's not a Lustre-related problem? If you don't have an active/passive MDS setup, the Lustre file system will be unusable if the MDS server crashes for whatever reason.

Abraham Alawi
Linux/UNIX Systems and Storage Specialist | STACC Project | Information Management & Technology (IMT) | CSIRO

From: lustre-discuss-bounces@lists.lustre.org [mailto:lustre-discuss-bounces@lists.lustre.org] On Behalf Of Arya Mazaheri
Sent: Wednesday, 9 October 2013 6:52 PM
To: lustre-discuss@lists.lustre.org
Subject: [Lustre-discuss] Lustre crashes periodically
On 2013/10/09 6:25 PM, "Abraham.Alawi-uXK74I8fpZ0@public.gmane.org" <Abraham.Alawi-uXK74I8fpZ0@public.gmane.org> wrote:
>Did you run lfsck against it?

To be honest, I can't think of any reason why a crashing server would be fixed by running lfsck. In some rare cases it might be that running e2fsck could fix a crash (if there is incorrect error handling in the ldiskfs or Lustre code).

>No kernel crash dumps?

The first thing to do is to connect a serial console, or enable netconsole/netdump. It is possible to cross-cable two serial ports between failover server pairs and use mgetty or maybe conman to capture the kernel oops messages. Without that, it is almost impossible to figure out what the problem is.

>Maybe it's not Lustre related problem? If you have no Active/Passive MDS
>setup, Lustre file system will be unusable if the MDS server crashes for
>whatever reason.

Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division
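To illustrate the netconsole suggestion above: netconsole sends kernel messages as plain UDP datagrams to a remote host, so something on that host has to listen and write them to disk. The sketch below is one possible receiver, not the tool anyone in this thread used. It assumes the crashing server has netconsole configured to send to this machine on UDP port 6666 (netconsole's usual default target port), and the log file name `netconsole.log` is an arbitrary choice. In practice an existing syslog daemon or `nc` could do the same job.

```python
import socket
from datetime import datetime, timezone


def format_record(payload, sender, when):
    """Turn one netconsole datagram into a timestamped log line."""
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n")
    return f"{when.isoformat()} {sender} {text}"


def open_listener(bind_addr="0.0.0.0", port=6666):
    """Open a UDP socket on the port netconsole is assumed to target."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def capture(sock, log_path, max_packets=None):
    """Append received datagrams to log_path; run forever if max_packets is None."""
    received = 0
    with open(log_path, "a", encoding="utf-8") as log:
        while max_packets is None or received < max_packets:
            payload, (host, _port) = sock.recvfrom(65535)
            log.write(format_record(payload, host, datetime.now(timezone.utc)) + "\n")
            # Flush after every message so the last lines before a crash are on disk.
            log.flush()
            received += 1
    return received


if __name__ == "__main__":
    capture(open_listener(), "netconsole.log")
```

Flushing after every datagram is deliberate: the messages of interest are the final oops or panic lines, and those are exactly the ones that would be lost if they sat in a buffer when the server hung.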