Christopher J. Walker
2009-Dec-17 09:57 UTC
[Lustre-discuss] ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
I have a stream of errors in the logs of my Lustre clients:

21544:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Skipped 888 previous similar messages
Dec 16 15:42:33 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1622:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108

We have Lustre 1.8.1.1 servers and both Lustre 1.6.7 and 1.8.1.1 clients (with a patch applied so we can use the 2.6.18-164.6.1.el5 kernel).

Can anyone shed any light on what is causing them, and what I should do to fix them?

Chris

PS: more examples below:

Dec 16 10:53:47 se03 kernel: Lustre: Client lustre_0-client has started
Dec 16 15:42:33 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Dec 16 15:42:33 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Skipped 888 previous similar messages
Dec 16 15:42:33 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1622:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Dec 16 15:42:33 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1622:ldlm_cli_cancel_list()) Skipped 888 previous similar messages
Dec 16 15:42:33 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Dec 16 15:42:33 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Skipped 98 previous similar messages
Dec 16 15:42:33 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1622:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Dec 16 15:42:33 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1622:ldlm_cli_cancel_list()) Skipped 98 previous similar messages
Dec 16 15:42:34 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Dec 16 15:42:34 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Skipped 147 previous similar messages
Dec 16 15:42:34 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1622:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Dec 16 15:42:34 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1622:ldlm_cli_cancel_list()) Skipped 147 previous similar messages
Dec 16 15:42:35 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Dec 16 15:42:35 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Skipped 159 previous similar messages
Dec 16 15:42:35 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1622:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Dec 16 15:42:35 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1622:ldlm_cli_cancel_list()) Skipped 159 previous similar messages
Dec 16 15:42:37 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Dec 16 15:42:37 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Skipped 288 previous similar messages
Dec 16 15:42:37 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1622:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
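For reading logs like the ones above, a small parser can help. The following is a minimal sketch and not part of the original post: it assumes the console message layout shown in these excerpts, and the names `summarize`, `LINE_RE`, `RC_RE` and `SKIP_RE` are hypothetical. It extracts the return code and adds up the "Skipped N previous similar messages" counts, since those rate-limited lines hide most of the volume. Lustre return codes generally follow negated Linux errno values, and on Linux errno 108 is ESHUTDOWN ("Cannot send after transport endpoint shutdown"), which would be consistent with cancel RPCs being issued on a connection that is already being torn down.

```python
import errno
import re
from collections import Counter

# Matches the "(file.c:line:function())" part of a Lustre console message.
# The layout is an assumption taken from the log excerpts in this thread.
LINE_RE = re.compile(r"\((?P<file>\w+\.c):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$")
RC_RE = re.compile(r"(?:Got rc|ldlm_cli_cancel_list:)\s*(-?\d+)")
SKIP_RE = re.compile(r"Skipped (\d+) previous similar messages")


def summarize(lines):
    """Count logged and rate-limited messages per function and collect return codes."""
    logged = Counter()   # messages that were actually printed
    skipped = Counter()  # messages suppressed by rate limiting
    codes = set()        # distinct return codes seen
    for raw in lines:
        m = LINE_RE.search(raw)
        if not m:
            continue  # not a Lustre source-location message
        func, msg = m.group("func"), m.group("msg")
        skip = SKIP_RE.search(msg)
        if skip:
            skipped[func] += int(skip.group(1))
            continue
        logged[func] += 1
        rc = RC_RE.search(msg)
        if rc:
            codes.add(int(rc.group(1)))
    return logged, skipped, codes


sample = [
    "Dec 16 15:42:33 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway",
    "Dec 16 15:42:33 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Skipped 888 previous similar messages",
    "Dec 16 15:42:33 se03 kernel: LustreError: 21544:0:(ldlm_request.c:1622:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108",
]
logged, skipped, codes = summarize(sample)
for rc in sorted(codes):
    # errno names are platform-specific; "unknown" is printed if there is no match
    print(rc, errno.errorcode.get(-rc, "unknown"))
print(dict(logged), dict(skipped))
```

Summing the skipped counts gives a better sense of how many cancel attempts failed in a given second than counting printed lines alone.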
Ewan Mac Mahon
2009-Dec-18 13:37 UTC
[Lustre-discuss] ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
On Thu, Dec 17, 2009 at 09:57:12AM +0000, Christopher J. Walker wrote:
> I have a stream of errors in the logs of my Lustre clients.
>
> 21544:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Skipped 888 previous
> similar messages
> Dec 16 15:42:33 se03 kernel: LustreError:
> 21544:0:(ldlm_request.c:1622:ldlm_cli_cancel_list())
> ldlm_cli_cancel_list: -108
>
> We have Lustre 1.8.1.1 servers and both Lustre 1.6.7 and 1.8.1.1 clients
> (with a patch applied so we can use the 2.6.18-164.6.1.el5
> kernel).

I'm seeing this too on a very vanilla Lustre 1.8.1.1 setup. It's essentially an x86_64 RHEL5-esque system (actually Scientific Linux); the servers are running the Sun-built 2.6.18-128.7.1.el5_lustre.1.8.1.1 kernel, and the troublesome client is running the stock 2.6.18-128.7.1.el5 kernel with the 1.8.1.1-2.6.18_128.7.1.el5_lustre.1.8.1.1 versions of the lustre-client and lustre-client-modules packages.

The client automounts the Lustre filesystem, and it seems to fall over when doing a du scan of the filesystem, so there's a lot of stat-ing going on, but not much else.

Ewan
Miguel Angel Gila Arrondo
2009-Dec-22 16:44 UTC
[Lustre-discuss] ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Hi all,

We're seeing the same error message in our Lustre clients when unmounting a Lustre filesystem, but only in a very specific situation:

When trying to empty an OST with the simple data migration script (http://manual.lustre.org/manual/LustreManual18_HTML/LustreOperatingTips.html#50532487_pgfId-1292867), inconsistencies sometimes appear. Running ls -lh on the client that just ran the script shows a file size of 0 for some files, while running it on another client shows the correct file size for the same files.

Unmounting and remounting the Lustre filesystem solves this, although the error messages mentioned above appear in /var/log/messages on the client:

Dec 22 17:18:49 proof kernelLustreError: 10917:0: (ldlm_request.c:1030:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Dec 22 17:18:49 proof kernelLustreError: 10917:0: (ldlm_request.c:1030:ldlm_cli_cancel_req()) Skipped 169 previous similar messages
Dec 22 17:18:49 proof kernelLustreError: 10917:0: (ldlm_request.c:1533:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Dec 22 17:18:49 proof kernelLustreError: 10917:0: (ldlm_request.c:1533:ldlm_cli_cancel_list()) Skipped 169 previous similar messages
Dec 22 17:18:49 proof kernelLustre: client ffff81042fccf400 umount complete
Dec 22 17:19:02 proof kernelLustre: Client userdata-client has started

Is anybody else seeing these messages in this situation? Does anybody know of a workaround?

Cheers,
Miguel

--
Miguel Angel Gila Arrondo
e-mail: miguel.gila at uam.es
Dpto Fisica Teorica. C-XI. Laboratorio de Altas Energias
Universidad Autonoma de Madrid.
Cantoblanco, 28049 Madrid, Spain.
Phone: 34 91 497 3976
Fax: 34 91 497 4086
Christopher J. Walker
2009-Dec-23 11:22 UTC
[Lustre-discuss] ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Miguel Angel Gila Arrondo wrote:
> Hi all,
>
> We're seeing the same error message in our Lustre clients when umounting a
> lustre filesystem, but only in a very specific situation:
>
> When trying to empty an OST with the simple data migration script
> (http://manual.lustre.org/manual/LustreManual18_HTML/LustreOperatingTips.html#50532487_pgfId-1292867),
> sometimes inconsistencies appear. Doing ls -lh in the client that just ran the
> script shows 0 file size for some files, while doing it in another client shows
> the correct file size for the same files.
>
> This gets solved by umounting and mounting again the Lustre filesystem,
> although the mentioned error messages appear in /var/log/messages of the
> client:
>
> Dec 22 17:18:49 proof kernelLustreError: 10917:0:
> (ldlm_request.c:1030:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC:
> canceling anyway
> Dec 22 17:18:49 proof kernelLustreError: 10917:0:
> (ldlm_request.c:1030:ldlm_cli_cancel_req()) Skipped 169 previous similar
> messages
> Dec 22 17:18:49 proof kernelLustreError: 10917:0:
> (ldlm_request.c:1533:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
> Dec 22 17:18:49 proof kernelLustreError: 10917:0:
> (ldlm_request.c:1533:ldlm_cli_cancel_list()) Skipped 169 previous similar
> messages
> Dec 22 17:18:49 proof kernelLustre: client ffff81042fccf400 umount complete
> Dec 22 17:19:02 proof kernelLustre: Client userdata-client has started
>
> Is anybody else seeing these messages in this situation? Does anybody know of
> a workaround?

Like Ewan, our Lustre filesystem is automounted. Whilst I haven't done a detailed study, it does look as though these messages occur immediately before unmounting the filesystem.

Is automounting a bad idea?

Chris
Heiko Schröter
2009-Dec-23 11:57 UTC
[Lustre-discuss] ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
On Wednesday 23 December 2009 12:22:17, Christopher J. Walker wrote:
> > Dec 22 17:18:49 proof kernelLustreError: 10917:0:
> > (ldlm_request.c:1030:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC:
> > canceling anyway
> > Dec 22 17:18:49 proof kernelLustreError: 10917:0:
> > (ldlm_request.c:1030:ldlm_cli_cancel_req()) Skipped 169 previous similar
> > messages
> > Dec 22 17:18:49 proof kernelLustreError: 10917:0:
> > (ldlm_request.c:1533:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
> > Dec 22 17:18:49 proof kernelLustreError: 10917:0:
> > (ldlm_request.c:1533:ldlm_cli_cancel_list()) Skipped 169 previous similar
> > messages
> > Dec 22 17:18:49 proof kernelLustre: client ffff81042fccf400 umount complete
> > Dec 22 17:19:02 proof kernelLustre: Client userdata-client has started
> >
> > Is anybody else seeing these messages in this situation? Does anybody know of
> > a workaround?
>
> Like Ewan, our Lustre filesystem is automounted. Whilst I haven't done a
> detailed study, it does look as though these messages occur immediately
> before unmounting the filesystem.

Yes. These messages do occur before 'auto'-unmounting, so there is nothing to worry about.
The above is the mount process.

Unmounting should look like this:

Jun 17 04:00:16 cluster1 LustreError: 6460:0:(ldlm_request.c:1043:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Jun 17 04:00:16 cluster1 LustreError: 6460:0:(ldlm_request.c:1632:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Jun 17 04:00:16 cluster1 Lustre: client ffff8100c44d1000 umount complete

If you don't see the last line, 'umount complete', automount + Lustre will hang and there should be no further access to the Lustre system. This happened to us in our scenario.

> Is automounting a bad idea?

It depends. We had some bad experiences with lustre-1.6.6 and automount. See the mail archive about it, subject: 'Stalled autofs + lustre'.
Our problem should be resolved by upgrading to 1.8.x. We will test again in Jan/Feb 2010 when the upgrade is scheduled.

Regards
Heiko
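The pattern described in this message (rc -108 cancel errors followed by 'umount complete') can be checked mechanically. This is a minimal sketch that assumes the message texts shown in this thread; `find_unfinished_umounts` is a hypothetical helper, not a Lustre tool. It reports each run of -108 cancel errors that is not followed by an 'umount complete' line before the next mount or the end of the log.

```python
def find_unfinished_umounts(lines):
    """Return the first line of each run of -108 cancel errors that is never
    followed by an 'umount complete' message."""
    unfinished = []
    pending = None  # first line of the current run of cancel errors
    for line in lines:
        if "umount complete" in line:
            pending = None  # the unmount finished, so the run was benign
        elif "ldlm_cli_cancel" in line and "-108" in line:
            if pending is None:
                pending = line
        elif "has started" in line and pending is not None:
            # a new mount began without a completed unmount in between
            unfinished.append(pending)
            pending = None
    if pending is not None:
        unfinished.append(pending)
    return unfinished


clean = [
    "Jun 17 04:00:16 cluster1 LustreError: 6460:0:(ldlm_request.c:1043:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway",
    "Jun 17 04:00:16 cluster1 LustreError: 6460:0:(ldlm_request.c:1632:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108",
    "Jun 17 04:00:16 cluster1 Lustre: client ffff8100c44d1000 umount complete",
]
hung = clean[:2]  # same errors, but the 'umount complete' line never arrives

print(len(find_unfinished_umounts(clean)))
print(len(find_unfinished_umounts(hung)))
```

The sketch does not distinguish between several Lustre filesystems mounted on the same client, so on such a machine a flagged run would still need to be checked by hand.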
Christopher J. Walker
2010-Jan-04 17:16 UTC
[Lustre-discuss] ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Heiko Schröter wrote:
> On Wednesday 23 December 2009 12:22:17, Christopher J. Walker wrote:
>>> Dec 22 17:18:49 proof kernelLustreError: 10917:0:
>>> (ldlm_request.c:1030:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC:
>>> canceling anyway
>>> Dec 22 17:18:49 proof kernelLustreError: 10917:0:
>>> (ldlm_request.c:1030:ldlm_cli_cancel_req()) Skipped 169 previous similar
>>> messages
>>> Dec 22 17:18:49 proof kernelLustreError: 10917:0:
>>> (ldlm_request.c:1533:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
>>> Dec 22 17:18:49 proof kernelLustreError: 10917:0:
>>> (ldlm_request.c:1533:ldlm_cli_cancel_list()) Skipped 169 previous similar
>>> messages
>>> Dec 22 17:18:49 proof kernelLustre: client ffff81042fccf400 umount complete
>>> Dec 22 17:19:02 proof kernelLustre: Client userdata-client has started
>>>
>>> Is anybody else seeing these messages in this situation? Does anybody know of
>>> a workaround?
>> Like Ewan, our Lustre filesystem is automounted. Whilst I haven't done a
>> detailed study, it does look as though these messages occur immediately
>> before unmounting the filesystem.
>
> Yes. These messages do occur before 'auto'-un-mounting. So nothing to worry about.
> The above is the mount process.
>
> Unmounting should look like this:
> Jun 17 04:00:16 cluster1 LustreError: 6460:0:(ldlm_request.c:1043:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
> Jun 17 04:00:16 cluster1 LustreError: 6460:0:(ldlm_request.c:1632:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
> Jun 17 04:00:16 cluster1 Lustre: client ffff8100c44d1000 umount complete
>
> If you don't see the last line 'umount complete' automount + lustre will hang and there should be no further access to the lustre system.
> Happened to us in our scenario.

Something similar has been happening to us with Lustre 1.8 - which is partly what prompted the question.

When I look at the machine, the lustre_0 filesystem doesn't seem to be there - and looking doesn't prompt any Lustre errors. The lustre_1 filesystem automounts fine.

I think that forcing the filesystem to stay mounted helps - but I need to do some more investigating.

>> Is automounting a bad idea?
>
> It depends. We had some bad experiences with lustre-1.6.6 and automount. See the mail archive about it. Subject: 'Stalled autofs + lustre'
> Our problem should be resolved with upgrading to 1.8.x.
> We will test again in Jan/Feb 2010 when the upgrade is scheduled.

Do let me know.

Thanks,

Chris
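To see which Lustre filesystems are actually mounted on a client at a given moment, one option is to read /proc/mounts. This is a minimal sketch, assuming a Linux client where Lustre mounts are listed with filesystem type `lustre`; the function name `lustre_mounts`, the sample device strings and the mount points are hypothetical.

```python
def lustre_mounts(mounts_text):
    """Return {mount_point: device} for entries whose filesystem type is 'lustre'."""
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fstype, options, dump, pass
        if len(fields) >= 3 and fields[2] == "lustre":
            found[fields[1]] = fields[0]
    return found


# Hypothetical sample text in /proc/mounts format; not taken from a real client.
sample = """\
/dev/sda1 / ext3 rw 0 0
mds01@tcp:/lustre_1 /mnt/lustre_1 lustre rw 0 0
"""
mounted = lustre_mounts(sample)
print("/mnt/lustre_0" in mounted)
print("/mnt/lustre_1" in mounted)
```

On a real client the text would come from reading /proc/mounts. With an automounter, a filesystem that has not been accessed recently may legitimately be absent, so a missing entry only points to a problem if accessing the mount point fails to bring it back.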