We have been running Lustre for a few years now and today was the first
time I came upon something I haven't seen before. The Lustre partition was
mounted and I could access files within it, however the minute I started
opening the large files, it became unstable and hung. The system load shot
up to 33 (on the headnode client) and Lustre was using approximately 6 GB
of memory. I stopped all of our services that write into the Lustre
partition and unmounted /lustre. Tailing the logs during this process, I saw:

LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 308135 previous similar messages
LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 308135 previous similar messages
LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 710099 previous similar messages
LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 710099 previous similar messages

Over and over again. A few minutes later, Lustre unmounted and freed up
the 6 GB of memory it was using. I didn't see anything wrong with our OSTs
and remounted the Lustre partition on the headnode, and now everything is
back to normal. I'm wondering what could have caused this in the first place?

Rocks 5 (RHEL5), Lustre 1.6.5.1, Kernel 2.6.18-53.1.14.el5_lustre.1.6.5.1smp

--
Jeremy Mann
jeremy at biochem.uthscsa.edu

University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672
Andreas Dilger
2009-Jan-23 20:17 UTC
[Lustre-discuss] Hung Lustre filesystem until a remount
On Jan 22, 2009 14:05 -0600, Jeremy Mann wrote:
> We have been running Lustre for a few years now and today was the first
> time I came upon something I haven't seen before. The Lustre partition was
> mounted and I could access files within it, however the minute I started
> opening the large files, it became unstable and hung. The system load shot
> up to 33 (on the headnode client) and Lustre was using approximately 6 GB
> of memory. I stopped all of our services that write into the Lustre
> partition and unmounted /lustre. Tailing the logs during this process, I
> saw:
>
> LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108
> from cancel RPC: canceling anyway
> LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped
> 308135 previous similar messages
> LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list())
> ldlm_cli_cancel_list: -108
> LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped
> 308135 previous similar messages
> LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108
> from cancel RPC: canceling anyway
> LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped
> 710099 previous similar messages
> LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list())
> ldlm_cli_cancel_list: -108
> LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped
> 710099 previous similar messages

With so many skipped messages, it appears this node is in a tight loop for
some reason. Is this client mounted on the same node as the MDS perhaps?
That isn't an excuse for hitting such a problem, but it might explain why
it was in such a tight loop that it was DOS-ing your filesystem.

> Over and over again. A few minutes later, Lustre unmounted and freed up
> the 6 GB of memory it was using. I didn't see anything wrong with our OSTs
> and remounted the Lustre partition on the headnode and now everything is
> back to normal. I'm wondering what could have caused this in the first
> place?
>
> Rocks 5 (RHEL5), Lustre 1.6.5.1, Kernel 2.6.18-53.1.14.el5_lustre.1.6.5.1smp

If it is 1.6.5.1 it might be the statahead bug. Please check the archives for
the many discussions of workarounds. There was also a recent patch (not in any
release yet) to fix the dynamic lock LRU sizing code to use less CPU, which
may have contributed to this problem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
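For anyone hitting the same symptoms: the workaround most often cited in those
archive threads is to disable statahead on the affected client, and the dynamic
lock LRU behavior can be side-stepped by pinning the LRU to a fixed size. A
minimal sketch, assuming the stock Lustre 1.6 /proc tunables; the paths and the
lru_size value are illustrative rather than taken from this thread:

  # Disable client-side statahead (the commonly cited 1.6.x statahead-bug workaround).
  for f in /proc/fs/lustre/llite/*/statahead_max; do
      echo 0 > "$f"
  done

  # Optionally pin the DLM lock LRU to a fixed size, which turns off dynamic
  # LRU resizing; 400 locks per namespace is only an illustrative value.
  for f in /proc/fs/lustre/ldlm/namespaces/*/lru_size; do
      echo 400 > "$f"
  done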
Andreas Dilger wrote:
> With so many skipped messages, it appears this node is in a tight loop for
> some reason. Is this client mounted on the same node as the MDS perhaps?
> That isn't an excuse for hitting such a problem, but it might explain why
> it was in such a tight loop that it was DOS-ing your filesystem.

We separated the MGS/MDT onto a separate node quite a while ago. This is
just a client connecting to our OSTs.

> If it is 1.6.5.1 it might be the statahead bug. Please check the archives
> for the many discussions of workarounds. There was also a recent patch
> (not in any release yet) to fix the dynamic lock LRU sizing code to use
> less CPU, which may have contributed to this problem.

Thank you Andreas, I will do that.

--
Jeremy Mann
jeremy at biochem.uthscsa.edu

University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672