We have been running Lustre for a few years now and today was the first
time I came upon something I haven't seen before. The Lustre partition was
mounted and I could access files within it, however the minute I started
opening the large files, it became unstable and hung. The system load shot
up to 33 (on the headnode client) and Lustre was using approximately 6 GB
of memory. I stopped all of our services that write into the Lustre
partition and unmounted /lustre. Tailing the logs during this process, I saw:

LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 308135 previous similar messages
LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 308135 previous similar messages
LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 710099 previous similar messages
LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 710099 previous similar messages

Over and over again. A few minutes later, Lustre unmounted and freed up
the 6 GB of memory it was using. I didn't see anything wrong with our OSTs
and remounted the Lustre partition on the headnode, and now everything is
back to normal. I'm wondering what could have caused this in the first place?

Rocks 5 (RHEL5), Lustre 1.6.5.1, Kernel 2.6.18-53.1.14.el5_lustre.1.6.5.1smp

--
Jeremy Mann
jeremy at biochem.uthscsa.edu

University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672
Andreas Dilger
2009-Jan-23 20:17 UTC
[Lustre-discuss] Hung Lustre filesystem until a remount
On Jan 22, 2009 14:05 -0600, Jeremy Mann wrote:
> We have been running Lustre for a few years now and today was the first
> time I came upon something I haven't seen before. The Lustre partition was
> mounted and I could access files within it, however the minute I started
> opening the large files, it became unstable and hung. The system load shot
> up to 33 (on the headnode client) and Lustre was using approximately 6 GB
> of memory. I stopped all of our services that write into the Lustre
> partition and unmounted /lustre. Tailing the logs during this process, I
> saw:
>
> LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108
> from cancel RPC: canceling anyway
> LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped
> 308135 previous similar messages
> LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list())
> ldlm_cli_cancel_list: -108
> LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped
> 308135 previous similar messages
> LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -108
> from cancel RPC: canceling anyway
> LustreError: 8620:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped
> 710099 previous similar messages
> LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list())
> ldlm_cli_cancel_list: -108
> LustreError: 8620:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped
> 710099 previous similar messages

With so many skipped messages, it appears this node is in a tight loop for
some reason. Is this client mounted on the same node as the MDS perhaps?
That isn't an excuse for hitting such a problem, but it might explain why
it was in such a tight loop that it was DOS-ing your filesystem.

> Over and over again. A few minutes later, Lustre unmounted and freed up
> the 6 GB of memory it was using. I didn't see anything wrong with our OSTs
> and remounted the Lustre partition on the headnode and now everything is
> back to normal. I'm wondering what could have caused this in the first
> place?
>
> Rocks 5 (RHEL5), Lustre 1.6.5.1, Kernel 2.6.18-53.1.14.el5_lustre.1.6.5.1smp

If it is 1.6.5.1 it might be the statahead bug. Please check the archives for
the many discussions of workarounds. There was also a recent patch (not in any
release yet) to fix the dynamic lock LRU sizing code to use less CPU, which
may have contributed to this problem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
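For anyone hitting the same symptoms: the workaround most often cited in those
archive threads is to disable statahead on the affected client, and the dynamic
lock LRU behavior can be side-stepped by pinning the LRU to a fixed size. A
minimal sketch, assuming the stock Lustre 1.6 /proc tunables; the paths and the
lru_size value are illustrative rather than taken from this thread:

  # Disable client-side statahead (the commonly cited 1.6.x statahead-bug workaround).
  for f in /proc/fs/lustre/llite/*/statahead_max; do
      echo 0 > "$f"
  done

  # Optionally pin the DLM lock LRU to a fixed size, which turns off dynamic
  # LRU resizing; 400 locks per namespace is only an illustrative value.
  for f in /proc/fs/lustre/ldlm/namespaces/*/lru_size; do
      echo 400 > "$f"
  done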
Andreas Dilger wrote:
> With so many skipped messages, it appears this node is in a tight loop for
> some reason. Is this client mounted on the same node as the MDS perhaps?
> That isn't an excuse for hitting such a problem, but it might explain why
> it was in such a tight loop that it was DOS-ing your filesystem.

We separated the MGS/MDT onto a separate node quite a while ago. This is
just a client connecting to our OSTs.

> If it is 1.6.5.1 it might be the statahead bug. Please check the archives
> for the many discussions of workarounds. There was also a recent patch
> (not in any release yet) to fix the dynamic lock LRU sizing code to use
> less CPU, which may have contributed to this problem.

Thank you Andreas, I will do that.

--
Jeremy Mann
jeremy at biochem.uthscsa.edu

University of Texas Health Science Center
Bioinformatics Core Facility
http://www.bioinformatics.uthscsa.edu
Phone: (210) 567-2672