On a file system that's been up for only 57 days, I have 505 lustre-log. dumps.

The problem at hand is that a user has many jobs that are now hung trying to create a directory from his PBS script. On every client his jobs are on, I see:

LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_connect operation failed with -16
LustreError: Skipped 2 previous similar messages

In the most recent /tmp/lustre-log. on the MDS/MGS I see these messages:

@@@ processing error (-16) req@000001001af9a600 x12808293/t0 o38->32633f05-02c6-50a5-b496-047150f1fe81@NET_0x200000aa4003e_UUID:-1 lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0
ldlm_lib.c target_handle_reconnect nobackup-MDT0000: 34b4fbea-200b-1f7c-dac0-516b8ce786fc reconnecting
ldlm_lib.c target_handle_connect nobackup-MDT0000: refuse reconnection from 34b4fbea-200b-1f7c-dac0-516b8ce786fc@10.164.0.111@tcp to 0x00000100069a7000; still busy with 2 active RPCs
ldlm_lib.c target_send_reply_msg @@@ processing error (-16) req@0000010019159a00 x11199816/t0 o38->34b4fbea-200b-1f7c-dac0-516b8ce786fc@NET_0x200000aa4006f_UUID:-1 lens 304/200 ref 0 fl Interpret:/0/0 rc -16/0

I see messages about active RPCs in other logs as well. What would this mean? Is something stuck someplace?

Brock Palen
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
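To see which clients the MDS is turning away, the "refuse reconnection" lines can be tallied per client address. The following is only a rough sketch in Python: it assumes the log has already been converted to plain text, the function name `tally_refused` is made up for illustration, and the pattern is modeled solely on the single line quoted in this thread, so other Lustre versions may format the message differently. Some mail archives rewrite "@" as " at ", so the pattern accepts both.

```python
import re
from collections import Counter

# Pattern modeled on the one "refuse reconnection" line quoted in this thread.
# (?:@| at ) tolerates archives that rewrite "@" as " at ".
REFUSED = re.compile(
    r"refuse reconnection from (?P<uuid>[0-9a-f-]+)(?:@| at )(?P<nid>\S+) "
    r"to \S+; still busy with (?P<rpcs>\d+) active RPCs"
)


def tally_refused(lines):
    """Count refused reconnections per client NID (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = REFUSED.search(line)
        if match:
            counts[match.group("nid")] += 1
    return counts


# Sample lines copied from the message in this thread.
sample = [
    "nobackup-MDT0000: 34b4fbea-200b-1f7c-dac0-516b8ce786fc reconnecting",
    "nobackup-MDT0000: refuse reconnection from "
    "34b4fbea-200b-1f7c-dac0-516b8ce786fc@10.164.0.111@tcp "
    "to 0x00000100069a7000; still busy with 2 active RPCs",
]

for nid, count in tally_refused(sample).items():
    print(nid, count)
```

A client that keeps showing up in such a tally is one whose earlier requests the server has not finished.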
Hi! I have a few questions for you:

1. How many nodes was his job running on?
2. What version of Lustre and the Linux kernel are you running on your servers/clients?
3. What ethernet module are you using on the servers/clients?

I honestly am not sure what the RPC errors mean, but I've had similar issues caused by ethernet-level errors.

-Aaron

On Mar 7, 2008, at 6:45 PM, Brock Palen wrote:
> I see messages about active RPCs in other logs. What would
> this mean? Is something stuck someplace?
>
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies
(301) 595-7000
aaron at iges.org
On Fri, 2008-03-07 at 18:45 -0500, Brock Palen wrote:
> nobackup-MDT0000: refuse reconnection from
> 34b4fbea-200b-1f7c-dac0-516b8ce786fc@10.164.0.111@tcp to
> 0x00000100069a7000; still busy with 2 active RPCs
>
> I see messages about active RPCs in other logs. What would
> this mean? Is something stuck someplace?

-16 = EBUSY. It means the client tried to reconnect while the server was still working on earlier requests from that same client. Once the old RPCs from this client have finished, the client will be reconnected.

--
Alex Lyashkov <Alexey.lyashkov at sun.com>
Lustre Group, Sun Microsystems
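For anyone wanting to look up other codes the same way, here is a small sketch in Python that uses the standard `errno` module to turn a negative return code such as -16 into its symbolic name. The helper name `describe_rc` is hypothetical, and the mapping comes from the machine the script runs on, so it should be run on a Linux host to match what the Lustre servers report.

```python
import errno
import os


def describe_rc(rc):
    """Return (symbolic name, message) for a negative kernel-style code."""
    code = abs(rc)
    # errno.errorcode maps numeric values to names such as "EBUSY".
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, message = describe_rc(-16)
print(name, "-", message)
```

On Linux the name printed for -16 is EBUSY, which matches the explanation above.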
On Mar 9, 2008, at 10:01 PM, Aaron Knister wrote:
> 1. How many nodes was his job running on?

Around 64 serial jobs accessing the same directory (not the same files).

> 2. What version of lustre and linux kernel are you running on your
> servers/clients?

Lustre servers: 2.6.9-55.0.9.EL_lustre.1.6.4.1smp
Clients: 2.6.9-67.0.1.ELsmp

> 3. What ethernet module are you using on the servers/clients?

Most use tg3; some use e1000.

> I honestly am not sure what the RPC errors mean but I've had
> similar issues caused by ethernet-level errors.

Over the weekend the MDS/MGS went into an unhealthy state, which forced a reboot and fsck. When it came back up, the directory was accessible again and the jobs started working again.