I have a Lustre client that was randomly evicted early this morning. The errors from dmesg are below. It's running InfiniBand. There were no InfiniBand errors that I could tell, and all the MDS/MGS and OSSes said was "haven't heard from client xyz in 2277 seconds. Evicting". The client has halfway come back and now shows this:

aaron@cola10:~ $ lfs df -h
UUID                 bytes   Used   Available  Use%  Mounted on
data-MDT0000_UUID    87.5G   6.4G       81.1G    7%  /data[MDT:0]
data-OST0000_UUID     5.4T   4.9T      439.6G   92%  /data[OST:0]
data-OST0001_UUID  : inactive device
data-OST0002_UUID  : inactive device
data-OST0003_UUID  : inactive device
data-OST0004_UUID  : inactive device
data-OST0005_UUID  : inactive device
data-OST0006_UUID  : inactive device
data-OST0007_UUID  : inactive device
data-OST0008_UUID  : inactive device
data-OST0009_UUID  : inactive device

filesystem summary:   5.4T   4.9T      439.6G   92%  /data

so it's reconnected to one of 10 OSTs. I tried to do an "lctl --device {device} reconnect" and it said "Error: Operation in progress". I have no idea what went wrong, and while I'm confident a reboot would fix it, I'd like to avoid one if possible.

Thanks in advance.

LustreError: 11-0: an error occurred while communicating with 192.168.64.70@o2ib. The mds_statfs operation failed with -107
Lustre: data-MDT0000-mdc-ffff81013037b800: Connection to service data-MDT0000 via nid 192.168.64.70@o2ib was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 167-0: This client was evicted by data-MDT0000; in progress operations using this service will fail.
LustreError: 22345:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -5
LustreError: 22396:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff810136334400 x81717113/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22396:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22454:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff8101136d2000 x81717114/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22454:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22463:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff810024ee4c00 x81717115/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22463:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22734:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff8101316c8200 x81717138/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22734:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22736:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff8101136d2c00 x81717139/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22736:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22912:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff8101136d2c00 x81717140/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22912:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22971:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81012cebb000 x81717143/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22971:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
LustreError: 22971:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22971:0:(llite_lib.c:1508:ll_statfs_internal()) Skipped 2 previous similar messages
LustreError: 23781:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81012bd02000 x81717144/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 23781:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 23796:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81006c776000 x81717156/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 23827:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff81013cbae400 x81717157/t0 o41->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 23827:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 23827:0:(llite_lib.c:1508:ll_statfs_internal()) Skipped 1 previous similar message
LustreError: 22346:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at ffff8100a5f3d400 x81717169/t0 o35->data-MDT0000_UUID@192.168.64.70@o2ib:12 lens 296/896 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22346:0:(file.c:97:ll_close_inode_openhandle()) inode 21601226 mdc close failed: rc = -108
Lustre: data-MDT0000-mdc-ffff81013037b800: Connection restored to service data-MDT0000 using nid 192.168.64.70@o2ib.
LustreError: 11-0: an error occurred while communicating with 192.168.64.71@o2ib. The ost_statfs operation failed with -107
Lustre: data-OST0001-osc-ffff81013037b800: Connection to service data-OST0001 via nid 192.168.64.71@o2ib was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 11-0: an error occurred while communicating with 192.168.64.71@o2ib. The ost_statfs operation failed with -107
LustreError: 167-0: This client was evicted by data-OST0001; in progress operations using this service will fail.
LustreError: 167-0: This client was evicted by data-OST0002; in progress operations using this service will fail.
LustreError: 24093:0:(llite_lib.c:1520:ll_statfs_internal()) obd_statfs fails: rc = -5
Lustre: data-OST0000-osc-ffff81013037b800: Connection restored to service data-OST0000 using nid 192.168.64.71@o2ib.
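(One possible way to prod the stuck imports without a reboot, sketched here assuming 1.6-era lctl syntax; the device number 7 below is a placeholder that would come from the actual lctl dl listing:)

  # list all local obd devices with their state; the inactive OSCs
  # (data-OST0001-osc ... data-OST0009-osc) appear here with device numbers
  lctl dl
  # ask one stuck OSC import to reconnect (substitute its real device number)
  lctl --device 7 recover
  # if the import was administratively marked inactive, re-activate it
  lctl --device 7 activate
  # then verify which servers the client can actually reach
  lfs check servers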
Some more information that might be helpful. There is a particular code that one of our users runs. Personally, after the trouble this code has caused us, we'd like to hand him a calculator and disable his accounts, but sadly that's not an option. Since the time of the hang, there is what seems to be one process associated with Lustre running as the userid of the problem user: "ll_sa_15530". A trace of this process in its current state shows this:

Apr 30 11:29:30 cola10 kernel: ll_sa_15530   S 0000000000000000     0 15531      1         17700 18228 (L-TLB)
Apr 30 11:29:30 cola10 kernel: ffff810116c31c10 0000000000000046 ffff81013e7747a0 ffffffff80087d0e
Apr 30 11:29:30 cola10 kernel: 0000000000000007 ffff81003a76b040 ffff81012f11f0c0 000fcb5175eba398
Apr 30 11:29:30 cola10 kernel: 0000000000001407 ffff81003a76b228 0000000000000001 0000000000000068
Apr 30 11:29:30 cola10 kernel: Call Trace:
Apr 30 11:29:30 cola10 kernel: [<ffffffff80087d0e>] enqueue_task+0x41/0x56
Apr 30 11:29:30 cola10 kernel: [<ffffffff8862b7e4>] :ptlrpc:ldlm_prep_enqueue_req+0x1b4/0x2e0
Apr 30 11:29:30 cola10 kernel: [<ffffffff886e528c>] :mdc:mdc_req_avail+0x6c/0xf0
Apr 30 11:29:30 cola10 kernel: [<ffffffff886e6275>] :mdc:mdc_enter_request+0x145/0x1e0
Apr 30 11:29:30 cola10 kernel: [<ffffffff800884ed>] default_wake_function+0x0/0xe
Apr 30 11:29:30 cola10 kernel: [<ffffffff886e6410>] :mdc:mdc_intent_lookup_pack+0xd0/0xf0
Apr 30 11:29:30 cola10 kernel: [<ffffffff886e6644>] :mdc:mdc_intent_getattr_async+0x214/0x420
Apr 30 11:29:30 cola10 kernel: [<ffffffff887ae63d>] :lustre:ll_i2gids+0x5d/0x150
Apr 30 11:29:30 cola10 kernel: [<ffffffff887b94c5>] :lustre:ll_statahead_thread+0xf75/0x1810
Apr 30 11:29:30 cola10 kernel: [<ffffffff800884ed>] default_wake_function+0x0/0xe
Apr 30 11:29:30 cola10 kernel: [<ffffffff8005bfb1>] child_rip+0xa/0x11
Apr 30 11:29:30 cola10 kernel: [<ffffffff887b8550>] :lustre:ll_statahead_thread+0x0/0x1810
Apr 30 11:29:30 cola10 kernel: [<ffffffff8005bfa7>] child_rip+0x0/0x11

Is this a problem with the Lustre statahead code? If so, would this fix it?

"echo 0 > /proc/fs/lustre/llite/*/statahead_count"

Thank you so much for all your help.

-Aaron

On Apr 30, 2008, at 11:16 AM, Aaron S. Knister wrote:

> I have a Lustre client that was randomly evicted early this morning.
> The errors from dmesg are below. It's running InfiniBand. There were
> no InfiniBand errors that I could tell, and all the MDS/MGS and OSSes
> said was "haven't heard from client xyz in 2277 seconds. Evicting".
> The client has halfway come back and now shows this -
> [...]
> so it's reconnected to one of 10 OSTs. I tried to do an "lctl --device
> {device} reconnect" and it said "Error: Operation in progress". I have
> no idea what went wrong, and while I'm confident a reboot would fix
> it, I'd like to avoid one if possible.
> [...]

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies

(301) 595-7000
aaron@iges.org
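(For what it's worth, the ll_sa_15530 name can be tied back to its owner: the trace shows the helper thread's own PID is 15531, and the numeric suffix, 15530, appears to be the PID of the user process whose directory scan triggered statahead. A quick check, assuming that reading of the suffix:)

  # 15530 is the suffix from ll_sa_15530; show the suspected owning process
  ps -o pid,user,etime,comm -p 15530
  # and its full command line (the user's job, if the assumption holds)
  tr '\0' ' ' < /proc/15530/cmdline; echo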
On Apr 30, 2008 11:40 -0400, Aaron Knister wrote:
> Some more information that might be helpful. There is a particular
> code that one of our users runs. [...] Since the time of the hang,
> there is what seems to be one process associated with Lustre running
> as the userid of the problem user: "ll_sa_15530". A trace of this
> process in its current state shows this -
> [...]
> Is this a problem with the Lustre statahead code? If so, would this fix it?
> "echo 0 > /proc/fs/lustre/llite/*/statahead_count"

Yes, this appears to be a statahead problem. There were fixes added to 1.6.5 that should resolve the problems seen with statahead. In the meantime I'd recommend disabling it as you suggest above.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Fri, May 02, 2008 at 03:16:31PM -0700, Andreas Dilger wrote:
> On Apr 30, 2008 11:40 -0400, Aaron Knister wrote:
>> [...]
>> Is this a problem with the Lustre statahead code? If so, would this fix it?
>> "echo 0 > /proc/fs/lustre/llite/*/statahead_count"
>
> Yes, this appears to be a statahead problem. There were fixes added to
> 1.6.5 that should resolve the problems seen with statahead. In the
> meantime I'd recommend disabling it as you suggest above.

we're seeing the same problem.

I think the workaround should be:
  echo 0 > /proc/fs/lustre/llite/*/statahead_max
??

/proc/fs/lustre/llite/*/statahead_count is -r--r--r--

cheers,
robin

ps. sorry I've been too busy this week to look at the llite_lloop stuff.
Robin Humble wrote:
> On Fri, May 02, 2008 at 03:16:31PM -0700, Andreas Dilger wrote:
>> [...]
>> Yes, this appears to be a statahead problem. There were fixes added to
>> 1.6.5 that should resolve the problems seen with statahead. In the
>> meantime I'd recommend disabling it as you suggest above.
>
> we're seeing the same problem.
>
> I think the workaround should be:
>   echo 0 > /proc/fs/lustre/llite/*/statahead_max
> ??
>
> /proc/fs/lustre/llite/*/statahead_count is -r--r--r--

Sure.
"/proc/fs/lustre/llite/*/statahead_count" is a statistics variable.
"/proc/fs/lustre/llite/*/statahead_max" is the switch to enable/disable directory statahead.

Regards!
--
Fan Yong
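(A minimal consolidated form of the workaround, assuming the 1.6-era /proc paths shown above; the loop just guards against more than one mounted Lustre client on the node, since shell redirection cannot write to a glob with multiple matches:)

  # statahead_max is the writable switch; 0 disables directory statahead
  for f in /proc/fs/lustre/llite/*/statahead_max; do
      echo 0 > "$f"
  done
  # confirm the new value
  cat /proc/fs/lustre/llite/*/statahead_max
  # statahead_count is read-only statistics (-r--r--r--), not a knob:
  ls -l /proc/fs/lustre/llite/*/statahead_*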
Ah! That would make a lot of sense. Echoing 0 to statahead_count doesn't really do anything other than hang my session. Thanks!

-Aaron

On May 15, 2008, at 4:36 AM, Yong Fan wrote:

> Robin Humble wrote:
>> I think the workaround should be:
>>   echo 0 > /proc/fs/lustre/llite/*/statahead_max
>> ??
>>
>> /proc/fs/lustre/llite/*/statahead_count is -r--r--r--
>
> Sure.
> "/proc/fs/lustre/llite/*/statahead_count" is a statistics variable.
> "/proc/fs/lustre/llite/*/statahead_max" is the switch to
> enable/disable directory statahead.
>
> Regards!
> --
> Fan Yong

Aaron Knister
Systems Administrator
Center for Research on Environment and Water

(301) 595-7000
aaron@iges.org
On Thu, May 15, 2008 at 08:23:20AM -0400, Aaron Knister wrote:
> Ah! That would make a lot of sense. Echoing 0 to statahead_count
> doesn't really do anything other than hang my session. Thanks!

I think the hang echo'ing into /proc is another bug, but yeah, deal with the big ones first :-)

cheers,
robin
Robin Humble wrote:
> On Thu, May 15, 2008 at 08:23:20AM -0400, Aaron Knister wrote:
>> Ah! That would make a lot of sense. Echoing 0 to statahead_count
>> doesn't really do anything other than hang my session. Thanks!
>
> I think the hang echo'ing into /proc is another bug, but yeah, deal
> with the big ones first :-)

Ah! A little issue there; I will fix it soon.

Regards!
--
Fan Yong