On our cluster, which has been running Lustre for about 1 month, I have 1 MDT/MGS and 1 OSS with 2 OSTs.

Our cluster uses all GigE and has about 608 nodes / 1854 cores.

We have a lot of jobs that die and/or go into high I/O wait; strace shows processes stuck in fstat().

The big problem (I think), and the one I would like some feedback on, is that of these 608 nodes, 209 of them have the string "This client was evicted by" in dmesg.

Is this normal for clients to be dropped like this? Is there some tuning that needs to be done to the server to carry this many nodes out of the box? We are using a default Lustre install with GigE.

Brock Palen
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
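(For reference, a count like the one above can be gathered with something along these lines; this is only a sketch, and it assumes pdsh is installed and a node list exists in /etc/machines, neither of which is part of the original setup.)

    # Count clients whose kernel log mentions an eviction.
    pdsh -w ^/etc/machines 'dmesg | grep -c "This client was evicted by"' 2>/dev/null \
      | awk -F: '$2 > 0 {n++} END {print n+0, "nodes report evictions"}'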
Hi Brock,

On Monday 04 February 2008 07:11:11 am Brock Palen wrote:
> on our cluster that has been running lustre for about 1 month. I have
> 1 MDT/MGS and 1 OSS with 2 OST's.
>
> Our cluster uses all Gige and has about 608 nodes 1854 cores.

This seems to be a lot of clients for only one OSS (and thus for only one GigE link to the OSS).

> We have a lot of jobs that die, and/or go into high IO wait, strace
> shows processes stuck in fstat().
>
> The big problem is (i think) I would like some feedback on it that of
> these 608 nodes 209 of them have in dmesg the string
>
> "This client was evicted by"
>
> Is this normal for clients to be dropped like this?

I'm not an expert here, but evictions typically occur when a client hasn't been seen for a certain period by the OSS/MDS. This is often related to network problems. Considering your number of clients, if they all do I/O operations on the filesystem concurrently, maybe your Ethernet switches are the bottleneck and have to drop packets. Is your GigE network working fine outside of Lustre?

To eliminate networking issues from the equation, you can try to lctl ping your MDS and OSS from a freshly evicted node, and see what you get. (lctl ping <your-oss-nid>)

> Is there some
> tuning that needs to be done to the server to carry this many nodes
> out of the box? We are using default lustre install with Gige.

Do your MDS or OSS show any particularly high load or memory usage? Do you see any Lustre-related error messages in their logs?

Cheers,
--
Kilian
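(As a sketch of that suggestion, run on a recently evicted node; <mds-nid> and <oss-nid> are placeholders for your own NIDs, and this assumes lctl ping returns non-zero when the ping fails.)

    # Ping both servers over LNET from this client and flag failures.
    for nid in <mds-nid> <oss-nid>; do
        lctl ping $nid || echo "lctl ping $nid FAILED on $(hostname)"
    done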
> Hi Brock,
>
> On Monday 04 February 2008 07:11:11 am Brock Palen wrote:
>> Our cluster uses all Gige and has about 608 nodes 1854 cores.
>
> This seems to be a lot of clients for only one OSS (and thus for only
> one GigE link to the OSS).

It's more for evaluation; the 'real' file system is an NFS file system provided by an OnStor Bobcat, so anything is an improvement. The cluster IS too big, but there isn't a person at the university who is willing to pay for anything other than more cluster nodes. Enough with politics.

> I'm not an expert here, but evictions typically occur when a client
> hasn't been seen for a certain period by the OSS/MDS. This is often
> related to network problems. [...]
>
> To eliminate networking issues from the equation, you can try to lctl
> ping your MDS and OSS from a freshly evicted node, and see what you
> get. (lctl ping <your-oss-nid>)

I just had another node get evicted while running code, causing the code to lock up. This time it was the MDS that evicted it. Pinging works though:

[root at nyx350 ~]# lctl ping 141.212.30.184 at tcp
12345-0 at lo
12345-141.212.30.184 at tcp

Recovery is slow; this client has been evicted for about 10 minutes. I have attached the output of lctl dk from the client and some syslog messages from the MDS.

> Do your MDS or OSS show any particularly high load or memory usage? Do
> you see any Lustre-related error messages in their logs?

Nope, both servers have 2GB RAM, and load is almost 0. No swapping.

Thanks

-------------- next part --------------
A non-text attachment was scrubbed...
Name: client.err
Type: application/octet-stream
Size: 27024 bytes
Desc: not available
Url: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20080204/3d7714b2/attachment-0004.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mds.log
Type: application/octet-stream
Size: 997 bytes
Desc: not available
Url: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20080204/3d7714b2/attachment-0005.obj
On Monday 04 February 2008 10:17:37 am Brock Palen wrote:
> The cluster IS too big, but there isn't a person at the university who is
> willing to pay for anything other than more cluster nodes. Enough
> with politics.

That's the first time I hear that a cluster is too big; people usually complain about the contrary. :) But the second part sounds very, very familiar... Anyway.

> I just had another node get evicted while running code, causing the
> code to lock up. This time it was the MDS that evicted it. Pinging
> works though:
>
> [root at nyx350 ~]# lctl ping 141.212.30.184 at tcp
> 12345-0 at lo
> 12345-141.212.30.184 at tcp

Ok.

> I have attached the output of lctl dk from the client and some
> syslog messages from the MDS.

(recover.c:188:ptlrpc_request_handle_notconn()) import nobackup-MDT0000-mdc-000001012bd27c00 of nobackup-MDT0000_UUID at 141.212.30.184@tcp abruptly disconnected: reconnecting
(import.c:133:ptlrpc_set_import_discon()) nobackup-MDT0000-mdc-000001012bd27c00: Connection to service nobackup-MDT0000 via nid 141.212.30.184 at tcp was lost;

I will let the Lustre people comment on this, but this sure looks like a network problem to me.

Is there any information you can get out of the switches (logs, dropped packets, retries, stats, anything)?

> Nope, both servers have 2GB RAM, and load is almost 0. No swapping.

Do you see dropped packets or errors in your ifconfig output, on the servers and/or clients?

Cheers,
--
Kilian
On Feb 4, 2008, at 1:43 PM, Kilian CAVALOTTI wrote:
> I will let the Lustre people comment on this, but this sure looks like a
> network problem to me.
>
> Is there any information you can get out of the switches (logs, dropped
> packets, retries, stats, anything)?

The client shows 107 dropped packets. The servers have none. I think you're right; the client is the same client that caused problems a week earlier by losing connections to the OSS, and is now losing the connection to the MDT.

I have asked networking to look at the counters between the Force10 and the Cisco. Lustre doesn't care about frames at 6000 MTU, right?

> Do you see dropped packets or errors in your ifconfig output, on the
> servers and/or clients?
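(To answer that question across many nodes at once, something like the following would do; again a sketch, assuming pdsh, a node list in /etc/machines, and that the Lustre interface is eth0 with a driver that supports ethtool -S.)

    # Interface-level drop/error counters as seen by the kernel, non-clean nodes only.
    pdsh -w ^/etc/machines "ifconfig eth0 | grep -E 'RX packets|TX packets'" \
      | grep -v 'errors:0 dropped:0'
    # On a single host: driver/firmware counters, often more detailed.
    ethtool -S eth0 | grep -iE 'drop|err|discard'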
Which version of lustre do you use? Server and clients same version and same os? Which one?

Harald

On Monday 04 February 2008 04:11 pm, Brock Palen wrote:
> on our cluster that has been running lustre for about 1 month. I have
> 1 MDT/MGS and 1 OSS with 2 OST's.
> [...]
> Is this normal for clients to be dropped like this? Is there some
> tuning that needs to be done to the server to carry this many nodes
> out of the box? We are using default lustre install with Gige.

--
Harald van Pee

Helmholtz-Institut fuer Strahlen- und Kernphysik der Universitaet Bonn
On Feb 04, 2008 13:17 -0500, Brock Palen wrote:
>> This seems to be a lot of clients for only one OSS (and thus for only
>> one GigE link to the OSS).
>
> It's more for evaluation; the 'real' file system is an NFS file system
> provided by an OnStor Bobcat, so anything is an improvement. [...]

I'd suggest increasing the lustre timeout, to avoid eviction if the system is overloaded:

Temporarily (on the MDS, OSS, and all client nodes):
[root at mds]# sysctl -w lustre.timeout=300

If this helps you can set it permanently on the MGS (MDS) node:
mgs> lctl conf_param testfs-MDT0000.sys.timeout=300

replacing "testfs" with the actual name of your filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
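(If applying the temporary setting to several hundred clients by hand is impractical, a rough sketch along these lines works; pdsh and the host-list path are assumptions, not part of the advice above.)

    # Push the temporary timeout to every node, then verify it took effect everywhere.
    pdsh -w ^/etc/machines 'sysctl -w lustre.timeout=300'
    pdsh -w ^/etc/machines 'sysctl -n lustre.timeout' | awk '{print $2}' | sort | uniq -c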
Just some clarifying questions:

> I'd suggest increasing the lustre timeout, to avoid eviction if the
> system is overloaded:

Is this more for clients being overloaded? (Say the user is swapping some.)

> Temporarily (on the MDS, OSS, and all client nodes):
> [root at mds]# sysctl -w lustre.timeout=300
>
> If this helps you can set it permanently on the MGS (MDS) node:
> mgs> lctl conf_param testfs-MDT0000.sys.timeout=300

When changing options like this, does it take effect only for newly mounted clients, or is it forced onto all currently mounted clients? Should this only be done on a 'down' filesystem, or can conf_param values be changed while live?

Thanks
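(For what it's worth, the value a node is actually using at any moment can simply be read back with the same knob Andreas mentioned, e.g.:)

    # Read the timeout currently in effect on this node.
    sysctl -n lustre.timeout
    # equivalently:
    cat /proc/sys/lustre/timeout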
> Which version of lustre do you use?
> Server and clients same version and same os? which one?

lustre-1.6.4.1

The servers (OSS and MDS/MGS) use the RHEL4 RPM from lustre.org:
2.6.9-55.0.9.EL_lustre.1.6.4.1smp

The clients run patchless RHEL4:
2.6.9-67.0.1.ELsmp

One set of clients is on a 10.x network while the servers and the other half of the clients are on a 141. network; because we are using the tcp network type, we have not set up any lnet routes. I don't think this should cause a problem, but I include the information for clarity. We do route 10.x on campus.
Craig Tierney
2008-Feb-04 21:52 UTC
[Lustre-discuss] Question about building Lustre, correct version of GCC
I am trying to build lustre-1.6.4.2 on my system, and I am reading through the documentation to figure out how to do it. I am reading version 1.6_man_v1.9 of the Operations manual.

On page 31, regarding compiler choice, it says:

    Compiler Choice
    The compiler must be greater than GCC version 3.3.4. Currently,
    GCC v4.0 is not supported. GCC v3.3.4 has been used to successfully
    compile all of the pre-packaged releases made available by CFS, and it
    is the only officially-supported compiler. Your mileage may vary with
    other compilers, or even with other versions of GCC.

    NOTE:
    GCC v3.3.4 was used to build 2.6 series kernels.

So, which is it? Is 3.3.4 the right compiler, or does it have to be "greater than" 3.3.4?

Has anyone built Lustre using CentOS 5.X? I am trying to get Lustre working with 5.1, and have downgraded the kernel for simplicity. Using a vanilla 2.6.18 kernel, I have been able to build Lustre and mount some basic filesystems, but I have not tested it thoroughly enough to say it works.

Craig

--
Craig Tierney (craig.tierney at noaa.gov)
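(A rough sketch of the kind of build described above, for reference only; the kernel source path is made up, and the exact configure options should be checked against the 1.6 Operations manual rather than taken from here.)

    # Build Lustre against an already-configured kernel source tree (path is an assumption).
    cd lustre-1.6.4.2
    ./configure --with-linux=/usr/src/linux-2.6.18
    make
    make rpms    # or 'make install' directly on the build host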
Andreas Dilger
2008-Feb-04 23:19 UTC
[Lustre-discuss] Question about building Lustre, correct version of GCC
On Feb 04, 2008 14:52 -0700, Craig Tierney wrote:
> On page 31, regarding compiler choice, it says:
> [...]
> So, which is it? Is 3.3.4 the right compiler, or does it have to be
> "greater than" 3.3.4?

The right answer today is "Lustre is built with the kernel shipped with the distro". The documentation needs to be updated.

> Has anyone built Lustre using CentOS 5.X? I am trying to get Lustre working
> with 5.1, and have downgraded the kernel for simplicity. Using a vanilla
> 2.6.18 kernel, I have been able to build Lustre and mount some basic filesystems,
> but I have not tested it thoroughly enough to say it works.

Why not use the RHEL5 2.6.18 kernel?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
The timeouts fixed the random evictions. The problem we were trying to solve in the first place is still there, though. In talking with the user of the code, the problem is related to a similar problem in another code.

One code is from NOAA, the other is S3D from Sandia (I think).

Both these codes write one file per process (NetCDF for one, Tecplot for the other). When the code has finished with an iteration, it copies/tars/cpios the files to another location. This is where the job will hang *some* times. Most of the time it works, but with enough iterations of this method a job will hang at some point. The job does not die; it just hangs.

The NOAA code does the mv+cpio in its PBS script. The S3D code uses system() to run tar. In the end they have the same behavior.

Has anyone seen similar behavior?

Brock Palen
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
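(In case it helps anyone reproduce the hang outside the real applications, a minimal stand-in for the same pattern might look like this; the paths, file sizes, and counts are entirely made up.)

    # Write many per-process output files on Lustre, archive them elsewhere, repeat.
    cd /lustre/scratch/hangtest
    for iter in $(seq 1 100); do
        for f in $(seq 1 64); do
            dd if=/dev/zero of=out.$iter.$f bs=1M count=4 2>/dev/null
        done
        tar cf /tmp/iter.$iter.tar out.$iter.* && rm -f out.$iter.* /tmp/iter.$iter.tar
        echo "iteration $iter done"
    done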
Brock Palen wrote:
> Both these codes write one file per process (NetCDF for one,
> Tecplot for the other). When the code has finished with an iteration,
> it copies/tars/cpios the files to another location. This is where the
> job will hang *some* times. [...] The job does not die; it just hangs.
>
> Has anyone seen similar behavior?

If a client gets an eviction from the server, it might be triggered by:

1) the server did not get the client pinger msg in a long time.
2) the client is too busy to handle the server lock cancel req.
3) the client cancelled the lock, but the network just dropped the cancel reply to the server.
4) the server is too busy to handle the lock cancel reply from the client, or is blocked somewhere.

It seems there are a lot of metadata operations in your job. I guess your eviction might be caused by the latter 2 reasons. If you could provide the process stack trace on the MDS when the job died, it might help us to figure out what is going on there.

WangDi
> If a client gets an eviction from the server, it might be triggered by:
>
> 1) the server did not get the client pinger msg in a long time.
> 2) the client is too busy to handle the server lock cancel req.

Clients show a load of 4.2 (4 cores total, 1 process per core).

> 3) the client cancelled the lock, but the network just dropped the
> cancel reply to the server.

I see a very small number (6339) of dropped packets on the interfaces of the OSS. Links between the switches show no errors.

> 4) the server is too busy to handle the lock cancel reply from the
> client, or is blocked somewhere.

I started paying attention to the OSS more once you said this; sometimes I see the CPU use of socknal_sd00 get to 100%. Is this process used to keep all the obd_pings going?

Both the OSS and the MDS/MGS are SMP systems and run single interfaces. If I dual-homed the servers, would that create another socknal process for lnet?

> It seems there are a lot of metadata operations in your job. I guess
> your eviction might be caused by the latter 2 reasons. If you could
> provide the process stack trace on the MDS when the job died, it might
> help us to figure out what is going on there.
>
> WangDi
On Tue, Feb 05, 2008 at 11:01:47AM -0500, Brock Palen wrote:
> The timeouts fixed the random evictions. The problem we were trying
> to solve in the first place is still there, though. [...]
>
> Has anyone seen similar behavior?

We have seen evictions several times, and I noticed that it's worth investigating them. You can get evictions from bad applications, e.g. if lots of nodes write a few bytes each to a shared file. One time the reason was a Tecplot routine, and the user reported that it includes bad code (in preutil.c).

Regards,
Roland

--
Roland Laifer
Rechenzentrum, Universitaet Karlsruhe (TH), D-76128 Karlsruhe, Germany
Email: Roland.Laifer at rz.uni-karlsruhe.de
Phone: +49 721 608 4861, Fax: +49 721 32550
Web: www.rz.uni-karlsruhe.de/personen/roland.laifer
I was able to catch a client and server in the act.

Client dmesg:

eth0: no IPv6 routers present
Lustre: nobackup-MDT0000-mdc-000001012bd39800: Connection to service nobackup-MDT0000 via nid 141.212.30.184 at tcp was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 167-0: This client was evicted by nobackup-MDT0000; in progress operations using this service will fail.
LustreError: 2757:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at 00000100cfce6800 x3216741/t0 o101->nobackup-MDT0000_UUID at 141.212.30.184@tcp:12 lens 448/768 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 2757:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
LustreError: 2822:0:(file.c:97:ll_close_inode_openhandle()) inode 11237379 mdc close failed: rc = -108
LustreError: 2822:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at 000001002966d000 x3216837/t0 o35->nobackup-MDT0000_UUID at 141.212.30.184@tcp:12 lens 296/448 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 2822:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 95 previous similar messages
LustreError: 2822:0:(file.c:97:ll_close_inode_openhandle()) inode 11270746 mdc close failed: rc = -108
LustreError: 2757:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
LustreError: 2757:0:(mdc_locks.c:423:mdc_finish_enqueue()) Skipped 30 previous similar messages
LustreError: 2822:0:(file.c:97:ll_close_inode_openhandle()) Skipped 62 previous similar messages
LustreError: 2757:0:(dir.c:258:ll_get_dir_page()) lock enqueue: rc: -108
LustreError: 2757:0:(dir.c:412:ll_readdir()) error reading dir 11239903/324715747 page 2: rc -108
Lustre: nobackup-MDT0000-mdc-000001012bd39800: Connection restored to service nobackup-MDT0000 using nid 141.212.30.184 at tcp.
LustreError: 11-0: an error occurred while communicating with 141.212.30.184 at tcp. The mds_close operation failed with -116
LustreError: 11-0: an error occurred while communicating with 141.212.30.184 at tcp. The mds_close operation failed with -116
LustreError: 11-0: an error occurred while communicating with 141.212.30.184 at tcp. The mds_close operation failed with -116
LustreError: 2834:0:(file.c:97:ll_close_inode_openhandle()) inode 11270686 mdc close failed: rc = -116
LustreError: 2834:0:(file.c:97:ll_close_inode_openhandle()) Skipped 40 previous similar messages
LustreError: 11-0: an error occurred while communicating with 141.212.30.184 at tcp. The mds_close operation failed with -116
LustreError: 2728:0:(file.c:97:ll_close_inode_openhandle()) inode 11240591 mdc close failed: rc = -116

MDT dmesg:

LustreError: 9042:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req at 000001002b52b000 x445020/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback timer expired: evicting client 2faf3c9e-26fb-64b7-ca6c-7c5b09374e67 at NET_0x200000aa4008d_UUID nid 10.164.0.141 at tcp ns: mds-nobackup-MDT0000_UUID lock: 00000100476df240/0xbc269e05c512de3a lrc: 1/0,0 mode: CR/CR res: 11240142/324715850 bits 0x5 rrc: 2 type: IBT flags: 20 remote: 0x4e54bc800174cd08 expref: 372 pid 26925
Lustre: 3170:0:(mds_reint.c:127:mds_finish_transno()) commit transaction for disconnected client 2faf3c9e-26fb-64b7-ca6c-7c5b09374e67: rc 0
LustreError: 27505:0:(mds_open.c:1474:mds_close()) @@@ no handle for file close ino 11239903: cookie 0xbc269e05c51912d8 req at 000001001e69c600 x3216892/t0 o35->2faf3c9e-26fb-64b7-ca6c-7c5b09374e67 at NET_0x200000aa4008d_UUID:-1 lens 296/448 ref 0 fl Interpret:/0/0 rc 0/0
LustreError: 27505:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-116) req at 000001001e69c600 x3216892/t0 o35->2faf3c9e-26fb-64b7-ca6c-7c5b09374e67 at NET_0x200000aa4008d_UUID:-1 lens 296/448 ref 0 fl Interpret:/0/0 rc -116/0

I hope this helps, but I see it happen more often with the OSS evicting the client.

On Feb 6, 2008, at 10:59 AM, Brock Palen wrote:
> [...]
Hello,

Brock Palen wrote:
> I was able to catch a client and server in the act.
>
> MDT dmesg:
> [...]
> LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock
> callback timer expired: evicting client
> 2faf3c9e-26fb-64b7-ca6c-7c5b09374e67 at NET_0x200000aa4008d_UUID
> nid 10.164.0.141 at tcp ns: mds-nobackup-MDT0000_UUID lock:
> 00000100476df240/0xbc269e05c512de3a lrc: 1/0,0 mode: CR/CR
> res: 11240142/324715850 bits 0x5 rrc: 2 type: IBT flags: 20
> remote: 0x4e54bc800174cd08 expref: 372 pid 26925

The client was evicted because this lock could not be released on the client in time. Could you provide the stack trace of the client at that time?

I assume increasing obd_timeout could fix your problem. Or maybe you should wait for the 1.6.5 release, which includes a new feature, adaptive timeouts, that will adjust the timeout value according to the network congestion and server load. It should help with your problem.

> [...]
> I hope this helps, but I see it happen more often with the OSS
> evicting the client.
Brock Palen
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

On Feb 7, 2008, at 11:09 PM, Tom.Wang wrote:
> The client was evicted because this lock could not be released on the
> client in time. Could you provide the stack trace of the client at
> that time?
>
> I assume increasing obd_timeout could fix your problem. Or maybe you
> should wait for the 1.6.5 release, which includes a new feature,
> adaptive timeouts, that will adjust the timeout value according to
> the network congestion and server load. It should help with your
> problem.

Waiting for the next version of Lustre might be the best thing. I had upped the timeout a few days back, but the next day I had errors on the MDS box, so I have switched it back:

lctl conf_param nobackup-MDT0000.sys.timeout=300

I would love to give you that trace, but I don't know how to get it. Is there a debug option to turn on in the clients?
Brock Palen wrote:
> Waiting for the next version of Lustre might be the best thing. I had
> upped the timeout a few days back, but the next day I had errors on
> the MDS box, so I have switched it back:
>
> lctl conf_param nobackup-MDT0000.sys.timeout=300
>
> I would love to give you that trace, but I don't know how to get it.
> Is there a debug option to turn on in the clients?

You can get that by echo t > /proc/sysrq-trigger on the client.
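(If the resulting trace scrolls out of the kernel ring buffer, something like the following captures it to a file; the buffer size given to dmesg is a guess and may need to be larger on a busy node.)

    # Dump all task stacks into the kernel log, then save the log to a file.
    echo t > /proc/sysrq-trigger
    dmesg -s 1048576 > /tmp/sysrq-trace.$(hostname).txt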
> I would love to give you that trace, but I don't know how to get it.
> Is there a debug option to turn on in the clients?
>
>> You can get that by echo t > /proc/sysrq-trigger on the client.

Cool command; the output from the client is attached. The four m45_amp214_om processes are the application that hung when working off of Lustre; you can see they are stuck in I/O state.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: trace
Type: application/octet-stream
Size: 117493 bytes
Desc: not available
Url: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20080208/99f28028/attachment-0002.obj
Hello,

m45_amp214_om D 0000000000000000     0  2587      1         31389  2586 (NOTLB)
00000101f6b435f8 0000000000000006 000001022c7fc030 0000000000000001
00000100080f1a40 0000000000000246 00000101f6b435a8 0000000380136025
00000102270a1030 00000000000000d0
Call Trace:
<ffffffffa0216e79>{:lnet:LNetPut+1689} <ffffffff8030e45f>{__down+147}
<ffffffff80134659>{default_wake_function+0} <ffffffff8030ff7d>{__down_failed+53}
<ffffffffa04292e1>{:lustre:.text.lock.file+5}
<ffffffffa044b12e>{:lustre:ll_mdc_blocking_ast+798}
<ffffffffa02c8eb8>{:ptlrpc:ldlm_resource_get+456}
<ffffffffa02c3bbb>{:ptlrpc:ldlm_cancel_callback+107}
<ffffffffa02da615>{:ptlrpc:ldlm_cli_cancel_local+213}
<ffffffffa02c3c48>{:ptlrpc:ldlm_lock_addref_internal_nolock+56}
<ffffffffa02c3dbc>{:ptlrpc:search_queue+284}
<ffffffffa02dbc03>{:ptlrpc:ldlm_cancel_list+99}
<ffffffffa02dc113>{:ptlrpc:ldlm_cancel_lru_local+915}
<ffffffffa02ca293>{:ptlrpc:ldlm_resource_putref+435}
<ffffffffa02dc2c9>{:ptlrpc:ldlm_prep_enqueue_req+313}
<ffffffffa0394e6f>{:mdc:mdc_enqueue+1023} <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53}
<ffffffffa0268730>{:obdclass:class_handle2object+224}
<ffffffffa02c5fea>{:ptlrpc:__ldlm_handle2lock+794}
<ffffffffa02c106f>{:ptlrpc:unlock_res_and_lock+31}
<ffffffffa02c5c03>{:ptlrpc:ldlm_lock_decref_internal+595}
<ffffffffa02c156c>{:ptlrpc:ldlm_lock_add_to_lru+140}
<ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53}
<ffffffffa02c6f0a>{:ptlrpc:ldlm_lock_decref+154}
<ffffffffa039617d>{:mdc:mdc_intent_lock+685}
<ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
<ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0}
<ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
<ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0}
<ffffffffa044b64b>{:lustre:ll_prepare_mdc_op_data+139}
<ffffffffa0418a32>{:lustre:ll_intent_file_open+450}
<ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
<ffffffff80192006>{__d_lookup+287}
<ffffffffa0419724>{:lustre:ll_file_open+2100}
<ffffffffa0428a18>{:lustre:ll_inode_permission+184}
<ffffffff80179bdb>{sys_access+349} <ffffffff8017a1ee>{__dentry_open+201}
<ffffffff8017a3a9>{filp_open+95} <ffffffff80179bdb>{sys_access+349}
<ffffffff801f00b5>{strncpy_from_user+74} <ffffffff8017a598>{sys_open+57}
<ffffffff8011026a>{system_call+126}

It seems the blocking_ast process was blocked here. Could you dump lustre/llite/namei.o with objdump -S lustre/llite/namei.o and send it to me?

Thanks,
WangDi

Brock Palen wrote:
> Cool command; the output from the client is attached. The four
> m45_amp214_om processes are the application that hung when working
> off of Lustre; you can see they are stuck in I/O state.
> [...]
Sure, Attached, note though, we rebuilt our lustre source for another box that uses the largesmp kernel. but it used the same options and compiler. -------------- next part -------------- A non-text attachment was scrubbed... Name: objdump Type: application/octet-stream Size: 354530 bytes Desc: not available Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20080208/abfba097/attachment-0002.obj -------------- next part -------------- Brock Palen Center for Advanced Computing brockp at umich.edu (734)936-1985 On Feb 8, 2008, at 2:47 PM, Tom.Wang wrote:> Hello, > > m45_amp214_om D 0000000000000000 0 2587 1 31389 > 2586 (NOTLB) > 00000101f6b435f8 0000000000000006 000001022c7fc030 0000000000000001 > 00000100080f1a40 0000000000000246 00000101f6b435a8 > 0000000380136025 > 00000102270a1030 00000000000000d0 > Call Trace:<ffffffffa0216e79>{:lnet:LNetPut+1689} <ffffffff8030e45f> > {__down+147} > <ffffffff80134659>{default_wake_function+0} <ffffffff8030ff7d> > {__down_failed+53} > <ffffffffa04292e1>{:lustre:.text.lock.file+5} > <ffffffffa044b12e>{:lustre:ll_mdc_blocking_ast+798} > <ffffffffa02c8eb8>{:ptlrpc:ldlm_resource_get+456} > <ffffffffa02c3bbb>{:ptlrpc:ldlm_cancel_callback+107} > <ffffffffa02da615>{:ptlrpc:ldlm_cli_cancel_local+213} > <ffffffffa02c3c48>{:ptlrpc:ldlm_lock_addref_internal_nolock+56} > <ffffffffa02c3dbc>{:ptlrpc:search_queue+284} > <ffffffffa02dbc03>{:ptlrpc:ldlm_cancel_list+99} > <ffffffffa02dc113>{:ptlrpc:ldlm_cancel_lru_local+915} > <ffffffffa02ca293>{:ptlrpc:ldlm_resource_putref+435} > <ffffffffa02dc2c9>{:ptlrpc:ldlm_prep_enqueue_req+313} > <ffffffffa0394e6f>{:mdc:mdc_enqueue+1023} <ffffffffa02c1035> > {:ptlrpc:lock_res_and_lock+53} > <ffffffffa0268730>{:obdclass:class_handle2object+224} > <ffffffffa02c5fea>{:ptlrpc:__ldlm_handle2lock+794} > <ffffffffa02c106f>{:ptlrpc:unlock_res_and_lock+31} > <ffffffffa02c5c03>{:ptlrpc:ldlm_lock_decref_internal+595} > <ffffffffa02c156c>{:ptlrpc:ldlm_lock_add_to_lru+140} > <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53} > <ffffffffa02c6f0a>{:ptlrpc:ldlm_lock_decref+154} > <ffffffffa039617d>{:mdc:mdc_intent_lock+685} > <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} > <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0} > <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} > <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0} > <ffffffffa044b64b>{:lustre:ll_prepare_mdc_op_data+139} > <ffffffffa0418a32>{:lustre:ll_intent_file_open+450} > <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} > <ffffffff80192006>{__d_lookup+287} > <ffffffffa0419724>{:lustre:ll_file_open+2100} > <ffffffffa0428a18>{:lustre:ll_inode_permission+184} > <ffffffff80179bdb>{sys_access+349} <ffffffff8017a1ee> > {__dentry_open+201} > <ffffffff8017a3a9>{filp_open+95} <ffffffff80179bdb>{sys_access > +349} > <ffffffff801f00b5>{strncpy_from_user+74} <ffffffff8017a598> > {sys_open+57} > <ffffffff8011026a>{system_call+126} > > It seems blocking_ast process was blocked here. Could you dump the > lustre/llite/namei.o by objdump -S lustre/llite/namei.o and send > to me? 
> > Thanks > WangDi > > Brock Palen wrote: >>>> On Feb 7, 2008, at 11:09 PM, Tom.Wang wrote: >>>>>> MDT dmesg: >>>>>> >>>>>> LustreError: 9042:0:(ldlm_lib.c:1442:target_send_reply_msg()) >>>>>> @@@ processing error (-107) req at 000001002b >>>>>> 52b000 x445020/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl >>>>>> Interpret:/0/0 rc -107/0 >>>>>> LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) >>>>>> ### lock callback timer expired: evicting cl >>>>>> ient 2faf3c9e-26fb-64b7- >>>>>> ca6c-7c5b09374e67 at NET_0x200000aa4008d_UUID nid >>>>>> 10.164.0.141 at tcp ns: mds-nobackup >>>>>> -MDT0000_UUID lock: 00000100476df240/0xbc269e05c512de3a lrc: >>>>>> 1/0,0 mode: CR/CR res: 11240142/324715850 bi >>>>>> ts 0x5 rrc: 2 type: IBT flags: 20 remote: 0x4e54bc800174cd08 >>>>>> expref: 372 pid 26925 >>>>>> >>>>> The client was evicted because of this lock can not be released >>>>> on client >>>>> on time. Could you provide the stack strace of client at that >>>>> time? >>>>> >>>>> I assume increase obd_timeout could fix your problem. Then maybe >>>>> you should wait 1.6.5 released, including a new feature >>>>> adaptive_timeout, >>>>> which will adjust the timeout value according to the network >>>>> congestion >>>>> and server load. And it should help your problem. >>>> >>>> Waiting for the next version of lustre might be the best thing. >>>> I had upped the timeout a few days back but the next day i had >>>> errors on the MDS box. I have switched it back: >>>> >>>> lctl conf_param nobackup-MDT0000.sys.timeout=300 >>>> >>>> I would love to give you that trace but I don''t know how to get >>>> it. Is there a debug option to turn on in the clients? >>> You can get that by echo t > /proc/sysrq-trigger on client. >>> >> Cool command, output of the client is attached. The four >> processes m45_amp214_om, is the application that hung when >> working off of luster. you can see its stuck in IO state. >> >>> >>> >>> >>> >>> >> --------------------------------------------------------------------- >> --- >> >> _______________________________________________ >> Lustre-discuss mailing list >> Lustre-discuss at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss > > >
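A quick sketch of the objdump step WangDi asks for above, assuming the modules were built with debug info and that the command is run from the top of the Lustre source tree the client modules came from (the path and output name are examples):

    cd /path/to/lustre-1.6-source                       # the tree used to build this client's modules
    objdump -S lustre/llite/namei.o > namei.objdump     # interleave source with disassembly; attach the result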
Hi, Aha, this is bug has been fixed in 14360. https://bugzilla.lustre.org/show_bug.cgi?id=14360 The patch there should fix your problem, which should be released in 1.6.5 Thanks Brock Palen wrote:> Sure, Attached, note though, we rebuilt our lustre source for another > box that uses the largesmp kernel. but it used the same options and > compiler. > > > Brock Palen > Center for Advanced Computing > brockp at umich.edu > (734)936-1985 > > > On Feb 8, 2008, at 2:47 PM, Tom.Wang wrote: > >> Hello, >> >> m45_amp214_om D 0000000000000000 0 2587 1 31389 >> 2586 (NOTLB) >> 00000101f6b435f8 0000000000000006 000001022c7fc030 0000000000000001 >> 00000100080f1a40 0000000000000246 00000101f6b435a8 >> 0000000380136025 >> 00000102270a1030 00000000000000d0 >> Call Trace:<ffffffffa0216e79>{:lnet:LNetPut+1689} >> <ffffffff8030e45f>{__down+147} >> <ffffffff80134659>{default_wake_function+0} >> <ffffffff8030ff7d>{__down_failed+53} >> <ffffffffa04292e1>{:lustre:.text.lock.file+5} >> <ffffffffa044b12e>{:lustre:ll_mdc_blocking_ast+798} >> <ffffffffa02c8eb8>{:ptlrpc:ldlm_resource_get+456} >> <ffffffffa02c3bbb>{:ptlrpc:ldlm_cancel_callback+107} >> <ffffffffa02da615>{:ptlrpc:ldlm_cli_cancel_local+213} >> <ffffffffa02c3c48>{:ptlrpc:ldlm_lock_addref_internal_nolock+56} >> <ffffffffa02c3dbc>{:ptlrpc:search_queue+284} >> <ffffffffa02dbc03>{:ptlrpc:ldlm_cancel_list+99} >> <ffffffffa02dc113>{:ptlrpc:ldlm_cancel_lru_local+915} >> <ffffffffa02ca293>{:ptlrpc:ldlm_resource_putref+435} >> <ffffffffa02dc2c9>{:ptlrpc:ldlm_prep_enqueue_req+313} >> <ffffffffa0394e6f>{:mdc:mdc_enqueue+1023} >> <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53} >> <ffffffffa0268730>{:obdclass:class_handle2object+224} >> <ffffffffa02c5fea>{:ptlrpc:__ldlm_handle2lock+794} >> <ffffffffa02c106f>{:ptlrpc:unlock_res_and_lock+31} >> <ffffffffa02c5c03>{:ptlrpc:ldlm_lock_decref_internal+595} >> <ffffffffa02c156c>{:ptlrpc:ldlm_lock_add_to_lru+140} >> <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53} >> <ffffffffa02c6f0a>{:ptlrpc:ldlm_lock_decref+154} >> <ffffffffa039617d>{:mdc:mdc_intent_lock+685} >> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} >> <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0} >> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} >> <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0} >> <ffffffffa044b64b>{:lustre:ll_prepare_mdc_op_data+139} >> <ffffffffa0418a32>{:lustre:ll_intent_file_open+450} >> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} >> <ffffffff80192006>{__d_lookup+287} >> <ffffffffa0419724>{:lustre:ll_file_open+2100} >> <ffffffffa0428a18>{:lustre:ll_inode_permission+184} >> <ffffffff80179bdb>{sys_access+349} >> <ffffffff8017a1ee>{__dentry_open+201} >> <ffffffff8017a3a9>{filp_open+95} >> <ffffffff80179bdb>{sys_access+349} >> <ffffffff801f00b5>{strncpy_from_user+74} >> <ffffffff8017a598>{sys_open+57} >> <ffffffff8011026a>{system_call+126} >> >> It seems blocking_ast process was blocked here. Could you dump the >> lustre/llite/namei.o by objdump -S lustre/llite/namei.o and send to me? 
>> >> Thanks >> WangDi >> >> Brock Palen wrote: >>>>> On Feb 7, 2008, at 11:09 PM, Tom.Wang wrote: >>>>>>> MDT dmesg: >>>>>>> >>>>>>> LustreError: 9042:0:(ldlm_lib.c:1442:target_send_reply_msg()) >>>>>>> @@@ processing error (-107) req at 000001002b >>>>>>> 52b000 x445020/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl >>>>>>> Interpret:/0/0 rc -107/0 >>>>>>> LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### >>>>>>> lock callback timer expired: evicting cl >>>>>>> ient >>>>>>> 2faf3c9e-26fb-64b7-ca6c-7c5b09374e67 at NET_0x200000aa4008d_UUID >>>>>>> nid 10.164.0.141 at tcp ns: mds-nobackup >>>>>>> -MDT0000_UUID lock: 00000100476df240/0xbc269e05c512de3a lrc: >>>>>>> 1/0,0 mode: CR/CR res: 11240142/324715850 bi >>>>>>> ts 0x5 rrc: 2 type: IBT flags: 20 remote: 0x4e54bc800174cd08 >>>>>>> expref: 372 pid 26925 >>>>>>> >>>>>> The client was evicted because of this lock can not be released >>>>>> on client >>>>>> on time. Could you provide the stack strace of client at that time? >>>>>> >>>>>> I assume increase obd_timeout could fix your problem. Then maybe >>>>>> you should wait 1.6.5 released, including a new feature >>>>>> adaptive_timeout, >>>>>> which will adjust the timeout value according to the network >>>>>> congestion >>>>>> and server load. And it should help your problem. >>>>> >>>>> Waiting for the next version of lustre might be the best thing. I >>>>> had upped the timeout a few days back but the next day i had >>>>> errors on the MDS box. I have switched it back: >>>>> >>>>> lctl conf_param nobackup-MDT0000.sys.timeout=300 >>>>> >>>>> I would love to give you that trace but I don''t know how to get >>>>> it. Is there a debug option to turn on in the clients? >>>> You can get that by echo t > /proc/sysrq-trigger on client. >>>> >>> Cool command, output of the client is attached. The four processes >>> m45_amp214_om, is the application that hung when working off of >>> luster. you can see its stuck in IO state. >>> >>>> >>>> >>>> >>>> >>>> >>> ------------------------------------------------------------------------ >>> >>> >>> _______________________________________________ >>> Lustre-discuss mailing list >>> Lustre-discuss at lists.lustre.org >>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >> >> >> > ------------------------------------------------------------------------ > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss
I''m having a similar issue with lustre 1.6.4.2 and infiniband. Under load, the clients hand about every 10 minutes which is really bad for a production machine. The only way to fix the hang is to reboot the server. My users are getting extremely impatient :-/ I see this on the clients- LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@ timeout (sent at 1202756629, 301s ago) req at ffff8100af233600 x1796079/ t0 o6->data-OST0000_UUID at 192.168.64.71@o2ib:28 lens 336/336 ref 1 fl Rpc:/0/0 rc 0/-22 Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service data- OST0000 via nid 192.168.64.71 at o2ib was lost; in progress operations using this service will wait for recovery to complete. LustreError: 11-0: an error occurred while communicating with 192.168.64.71 at o2ib. The ost_connect operation failed with -16 LustreError: 11-0: an error occurred while communicating with 192.168.64.71 at o2ib. The ost_connect operation failed with -16 I''ve increased the timeout to 300seconds and it has helped marginally. -Aaron On Feb 9, 2008, at 12:06 AM, Tom.Wang wrote:> Hi, > Aha, this is bug has been fixed in 14360. > > https://bugzilla.lustre.org/show_bug.cgi?id=14360 > > The patch there should fix your problem, which should be released in > 1.6.5 > > Thanks > > Brock Palen wrote: >> Sure, Attached, note though, we rebuilt our lustre source for >> another >> box that uses the largesmp kernel. but it used the same options and >> compiler. >> >> >> Brock Palen >> Center for Advanced Computing >> brockp at umich.edu >> (734)936-1985 >> >> >> On Feb 8, 2008, at 2:47 PM, Tom.Wang wrote: >> >>> Hello, >>> >>> m45_amp214_om D 0000000000000000 0 2587 1 31389 >>> 2586 (NOTLB) >>> 00000101f6b435f8 0000000000000006 000001022c7fc030 0000000000000001 >>> 00000100080f1a40 0000000000000246 00000101f6b435a8 >>> 0000000380136025 >>> 00000102270a1030 00000000000000d0 >>> Call Trace:<ffffffffa0216e79>{:lnet:LNetPut+1689} >>> <ffffffff8030e45f>{__down+147} >>> <ffffffff80134659>{default_wake_function+0} >>> <ffffffff8030ff7d>{__down_failed+53} >>> <ffffffffa04292e1>{:lustre:.text.lock.file+5} >>> <ffffffffa044b12e>{:lustre:ll_mdc_blocking_ast+798} >>> <ffffffffa02c8eb8>{:ptlrpc:ldlm_resource_get+456} >>> <ffffffffa02c3bbb>{:ptlrpc:ldlm_cancel_callback+107} >>> <ffffffffa02da615>{:ptlrpc:ldlm_cli_cancel_local+213} >>> <ffffffffa02c3c48>{:ptlrpc:ldlm_lock_addref_internal_nolock+56} >>> <ffffffffa02c3dbc>{:ptlrpc:search_queue+284} >>> <ffffffffa02dbc03>{:ptlrpc:ldlm_cancel_list+99} >>> <ffffffffa02dc113>{:ptlrpc:ldlm_cancel_lru_local+915} >>> <ffffffffa02ca293>{:ptlrpc:ldlm_resource_putref+435} >>> <ffffffffa02dc2c9>{:ptlrpc:ldlm_prep_enqueue_req+313} >>> <ffffffffa0394e6f>{:mdc:mdc_enqueue+1023} >>> <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53} >>> <ffffffffa0268730>{:obdclass:class_handle2object+224} >>> <ffffffffa02c5fea>{:ptlrpc:__ldlm_handle2lock+794} >>> <ffffffffa02c106f>{:ptlrpc:unlock_res_and_lock+31} >>> <ffffffffa02c5c03>{:ptlrpc:ldlm_lock_decref_internal+595} >>> <ffffffffa02c156c>{:ptlrpc:ldlm_lock_add_to_lru+140} >>> <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53} >>> <ffffffffa02c6f0a>{:ptlrpc:ldlm_lock_decref+154} >>> <ffffffffa039617d>{:mdc:mdc_intent_lock+685} >>> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} >>> <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0} >>> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} >>> <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0} >>> <ffffffffa044b64b>{:lustre:ll_prepare_mdc_op_data+139} >>> 
<ffffffffa0418a32>{:lustre:ll_intent_file_open+450} >>> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} >>> <ffffffff80192006>{__d_lookup+287} >>> <ffffffffa0419724>{:lustre:ll_file_open+2100} >>> <ffffffffa0428a18>{:lustre:ll_inode_permission+184} >>> <ffffffff80179bdb>{sys_access+349} >>> <ffffffff8017a1ee>{__dentry_open+201} >>> <ffffffff8017a3a9>{filp_open+95} >>> <ffffffff80179bdb>{sys_access+349} >>> <ffffffff801f00b5>{strncpy_from_user+74} >>> <ffffffff8017a598>{sys_open+57} >>> <ffffffff8011026a>{system_call+126} >>> >>> It seems blocking_ast process was blocked here. Could you dump the >>> lustre/llite/namei.o by objdump -S lustre/llite/namei.o and send >>> to me? >>> >>> Thanks >>> WangDi >>> >>> Brock Palen wrote: >>>>>> On Feb 7, 2008, at 11:09 PM, Tom.Wang wrote: >>>>>>>> MDT dmesg: >>>>>>>> >>>>>>>> LustreError: 9042:0:(ldlm_lib.c:1442:target_send_reply_msg()) >>>>>>>> @@@ processing error (-107) req at 000001002b >>>>>>>> 52b000 x445020/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl >>>>>>>> Interpret:/0/0 rc -107/0 >>>>>>>> LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) >>>>>>>> ### >>>>>>>> lock callback timer expired: evicting cl >>>>>>>> ient >>>>>>>> 2faf3c9e-26fb-64b7-ca6c-7c5b09374e67 at NET_0x200000aa4008d_UUID >>>>>>>> nid 10.164.0.141 at tcp ns: mds-nobackup >>>>>>>> -MDT0000_UUID lock: 00000100476df240/0xbc269e05c512de3a lrc: >>>>>>>> 1/0,0 mode: CR/CR res: 11240142/324715850 bi >>>>>>>> ts 0x5 rrc: 2 type: IBT flags: 20 remote: 0x4e54bc800174cd08 >>>>>>>> expref: 372 pid 26925 >>>>>>>> >>>>>>> The client was evicted because of this lock can not be released >>>>>>> on client >>>>>>> on time. Could you provide the stack strace of client at that >>>>>>> time? >>>>>>> >>>>>>> I assume increase obd_timeout could fix your problem. Then maybe >>>>>>> you should wait 1.6.5 released, including a new feature >>>>>>> adaptive_timeout, >>>>>>> which will adjust the timeout value according to the network >>>>>>> congestion >>>>>>> and server load. And it should help your problem. >>>>>> >>>>>> Waiting for the next version of lustre might be the best >>>>>> thing. I >>>>>> had upped the timeout a few days back but the next day i had >>>>>> errors on the MDS box. I have switched it back: >>>>>> >>>>>> lctl conf_param nobackup-MDT0000.sys.timeout=300 >>>>>> >>>>>> I would love to give you that trace but I don''t know how to get >>>>>> it. Is there a debug option to turn on in the clients? >>>>> You can get that by echo t > /proc/sysrq-trigger on client. >>>>> >>>> Cool command, output of the client is attached. The four >>>> processes >>>> m45_amp214_om, is the application that hung when working off of >>>> luster. you can see its stuck in IO state. 
>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>> ------------------------------------------------------------------------ >>>> >>>> >>>> _______________________________________________ >>>> Lustre-discuss mailing list >>>> Lustre-discuss at lists.lustre.org >>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >>> >>> >>> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Lustre-discuss mailing list >> Lustre-discuss at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discussAaron Knister Associate Systems Analyst Center for Ocean-Land-Atmosphere Studies (301) 595-7000 aaron at iges.org
Aaron Knister wrote:> I'm having a similar issue with lustre 1.6.4.2 and infiniband. Under > load, the clients hang about every 10 minutes, which is really bad for > a production machine. The only way to fix the hang is to reboot the > server. My users are getting extremely impatient :-/ > > I see this on the clients- > > LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@ > timeout (sent at 1202756629, 301s ago) req at ffff8100af233600 > x1796079/t0 o6->data-OST0000_UUID at 192.168.64.71@o2ib:28 lens 336/336 > ref 1 fl Rpc:/0/0 rc 0/-22 It means the OST could not respond to the request (unlink, o6) within 300 seconds, so the client disconnected its import to the OST and is trying to reconnect. Does this disconnection always happen when doing unlinks? Could you please post a process trace and the console messages of the OST at that time? Thanks WangDi> Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service > data-OST0000 via nid 192.168.64.71 at o2ib was lost; in progress > operations using this service will wait for recovery to complete. > LustreError: 11-0: an error occurred while communicating with > 192.168.64.71 at o2ib. The ost_connect operation failed with -16 > LustreError: 11-0: an error occurred while communicating with > 192.168.64.71 at o2ib. The ost_connect operation failed with -16 > > I've increased the timeout to 300 seconds and it has helped marginally. > > -Aaron >> > > > >
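One way to gather what WangDi asks for here, as a sketch (run as root on the OSS while a client is hung; filenames are arbitrary):

    echo t > /proc/sysrq-trigger          # dump every task's stack on the OSS
    dmesg > /tmp/oss-console.txt          # save the console/ring-buffer messages, including the task dump
    lctl dk /tmp/oss-lustre-debug.txt     # optionally also dump the Lustre debug log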
Aaron Knister wrote:> I'm having a similar issue with lustre 1.6.4.2 and infiniband. Under > load, the clients hang about every 10 minutes, which is really bad for > a production machine. The only way to fix the hang is to reboot the > server. My users are getting extremely impatient :-/ > > I see this on the clients- > > LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@ > timeout (sent at 1202756629, 301s ago) req at ffff8100af233600 x1796079/ > t0 o6->data-OST0000_UUID at 192.168.64.71@o2ib:28 lens 336/336 ref 1 fl > Rpc:/0/0 rc 0/-22 > Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service data- > OST0000 via nid 192.168.64.71 at o2ib was lost; in progress operations > using this service will wait for recovery to complete. > LustreError: 11-0: an error occurred while communicating with > 192.168.64.71 at o2ib. The ost_connect operation failed with -16 > LustreError: 11-0: an error occurred while communicating with > 192.168.64.71 at o2ib. The ost_connect operation failed with -16 > > I've increased the timeout to 300 seconds and it has helped marginally. Hi Aaron; We set the timeout to a big number (1000 secs) on our 400-node cluster (mostly o2ib, some tcp clients). Until we did this, we had loads of evictions. In our case, it solved the problem. Cheers, Craig
>> I've increased the timeout to 300 seconds and it has helped >> marginally. > > Hi Aaron; > > We set the timeout to a big number (1000 secs) on our 400-node cluster > (mostly o2ib, some tcp clients). Until we did this, we had loads > of evictions. In our case, it solved the problem. This feels excessive. But at this point I guess I'll try it.> > Cheers, > Craig > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss > >
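For anyone following the thread, raising the timeout uses the same conf_param syntax quoted earlier; a sketch, with an illustrative value and "nobackup" standing in for the local filesystem name:

    # on the MGS node:
    lctl conf_param nobackup-MDT0000.sys.timeout=1000
    # on a client, the value currently in effect should show up at the 1.6.x proc path (check your build):
    cat /proc/sys/lustre/timeout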
So far it's helped. If this doesn't fix it, I'm going to apply the patch mentioned here - https://bugzilla.lustre.org/attachment.cgi?id=14006&action=edit I'll let you know how it goes. If you'd like a copy of the patched version, let me know. Are you running RHEL/SLES? What version of the OS and Lustre? -Aaron On Feb 11, 2008, at 4:17 PM, Brock Palen wrote:>>> I've increased the timeout to 300 seconds and it has helped >>> marginally. >> >> Hi Aaron; >> >> We set the timeout to a big number (1000 secs) on our 400-node cluster >> (mostly o2ib, some tcp clients). Until we did this, we had loads >> of evictions. In our case, it solved the problem. > > This feels excessive. But at this point I guess I'll try it. > >> >> Cheers, >> Craig >> _______________________________________________ >> Lustre-discuss mailing list >> Lustre-discuss at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss >> >> >Aaron Knister Associate Systems Analyst Center for Ocean-Land-Atmosphere Studies (301) 595-7000 aaron at iges.org
RHEL4 x86_64, lustre-1.6.4.1. I'll wait and see if this helps. We won't be running any patched kernels outside of the OSS/MGS/MDS ever. Brock Palen Center for Advanced Computing brockp at umich.edu (734)936-1985 On Feb 11, 2008, at 4:48 PM, Aaron Knister wrote:> So far it's helped. If this doesn't fix it I'm going to apply the > patch mentioned here - https://bugzilla.lustre.org/attachment.cgi? > id=14006&action=edit I'll let you know how it goes. If you'd like a > copy of the patched version let me know. Are you running RHEL/SLES? > what version of the OS and lustre? > > -Aaron > > On Feb 11, 2008, at 4:17 PM, Brock Palen wrote: > >>>> I've increased the timeout to 300 seconds and it has helped >>>> marginally. >>> >>> Hi Aaron; >>> >>> We set the timeout to a big number (1000 secs) on our 400-node cluster >>> (mostly o2ib, some tcp clients). Until we did this, we had loads >>> of evictions. In our case, it solved the problem. >> >> This feels excessive. But at this point I guess I'll try it. >> >>> >>> Cheers, >>> Craig >>> _______________________________________________ >>> Lustre-discuss mailing list >>> Lustre-discuss at lists.lustre.org >>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >>> >>> >> > > Aaron Knister > Associate Systems Analyst > Center for Ocean-Land-Atmosphere Studies > > (301) 595-7000 > aaron at iges.org > > > > > >
Hi Aaron, FYI, the patch in 14360 is unlikely to help your problem, since the problem here seems to be that the OST load is too high or the OST is stuck somewhere, so we need more information. Actually, we have met some similar problems with unlink before. If increasing obd_timeout helps you, that is good. But if it is not much trouble, could you provide a stack trace and the console messages of the OST at that time? That will help us figure out what happened there. Thanks WangDi Aaron Knister wrote:> So far it's helped. If this doesn't fix it I'm going to apply the > patch mentioned here - https://bugzilla.lustre.org/attachment.cgi?id=14006&action=edit > I'll let you know how it goes. If you'd like a copy of the patched > version let me know. Are you running RHEL/SLES? what version of the OS > and lustre? > > -Aaron > > On Feb 11, 2008, at 4:17 PM, Brock Palen wrote: > > >>>> I've increased the timeout to 300 seconds and it has helped >>>> marginally. >>>> >>> Hi Aaron; >>> >>> We set the timeout to a big number (1000 secs) on our 400-node cluster >>> (mostly o2ib, some tcp clients). Until we did this, we had loads >>> of evictions. In our case, it solved the problem. >>> >> This feels excessive. But at this point I guess I'll try it. >> >> >>> Cheers, >>> Craig >>> _______________________________________________ >>> Lustre-discuss mailing list >>> Lustre-discuss at lists.lustre.org >>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >>> >>> >>> > > Aaron Knister > Associate Systems Analyst > Center for Ocean-Land-Atmosphere Studies > > (301) 595-7000 > aaron at iges.org > > > > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss >
Aaron, We are running 1.6.3 with some patches that we applied by hand after rummaging through the Lustre bugzilla database. We run CentOS 5.0 on the servers and 4.5 on the clients with an updated kernel. # uname -a Linux submit.ufhpc 2.6.18-8.1.14.el5Lustre #1 SMP Fri Oct 12 15:51:56 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux We also run OFED 1.2, which we built with Lustre by configuring IB out of the CentOS kernel entirely and then installing OFED. We then build the Lustre modules against the resulting kernel and IB modules. We are pretty stable right now and are very pleased with Lustre. It took a little work to get there with the base 1.6.3 release, which we needed for o2ib nids, but it has worked out for us so far. BTW, we agree with the post of a few days ago. We think the Lustre team has done a fantastic job for the open source community. Thanks, Charlie Taylor UF HPC Center On Feb 11, 2008, at 4:48 PM, Aaron Knister wrote:> So far it's helped. If this doesn't fix it I'm going to apply the > patch mentioned here - https://bugzilla.lustre.org/attachment.cgi? > id=14006&action=edit > I'll let you know how it goes. If you'd like a > copy of the patched > version let me know. Are you running RHEL/SLES? what version of the OS > and lustre? > > -Aaron > > On Feb 11, 2008, at 4:17 PM, Brock Palen wrote: > >>>> I've increased the timeout to 300 seconds and it has helped >>>> marginally. >>> >>> Hi Aaron; >>> >>> We set the timeout to a big number (1000 secs) on our 400-node cluster >>> (mostly o2ib, some tcp clients). Until we did this, we had loads >>> of evictions. In our case, it solved the problem. >> >> This feels excessive. But at this point I guess I'll try it. >> >>> >>> Cheers, >>> Craig >>> _______________________________________________ >>> Lustre-discuss mailing list >>> Lustre-discuss at lists.lustre.org >>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >>> >>> >> > > Aaron Knister > Associate Systems Analyst > Center for Ocean-Land-Atmosphere Studies > > (301) 595-7000 > aaron at iges.org > > > > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss
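A rough sketch of the kind of build Charlie describes; the option names are from memory of the 1.6-era configure script and the paths are placeholders, so verify them against ./configure --help in your own tree:

    cd lustre-1.6.x
    ./configure --with-linux=/usr/src/kernels/2.6.18-8.1.14.el5-x86_64 \
                --with-o2ib=/usr/src/ofa_kernel    # point at the OFED kernel source tree
    make rpms                                      # or plain make followed by a module install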
I found that something is getting overloaded some place. If i just go start and stop a job over and over quickly the client will lose contact with one of the servers, ether OST or MDT. Would more ram in the servers help? I dont see a high load or IO wait, but both servers are older (dual 1.4Ghz amd) with only 2 gb of memory. Brock Palen Center for Advanced Computing brockp at umich.edu (734)936-1985 On Feb 8, 2008, at 2:47 PM, Tom.Wang wrote:> Hello, > > m45_amp214_om D 0000000000000000 0 2587 1 31389 > 2586 (NOTLB) > 00000101f6b435f8 0000000000000006 000001022c7fc030 0000000000000001 > 00000100080f1a40 0000000000000246 00000101f6b435a8 > 0000000380136025 > 00000102270a1030 00000000000000d0 > Call Trace:<ffffffffa0216e79>{:lnet:LNetPut+1689} <ffffffff8030e45f> > {__down+147} > <ffffffff80134659>{default_wake_function+0} <ffffffff8030ff7d> > {__down_failed+53} > <ffffffffa04292e1>{:lustre:.text.lock.file+5} > <ffffffffa044b12e>{:lustre:ll_mdc_blocking_ast+798} > <ffffffffa02c8eb8>{:ptlrpc:ldlm_resource_get+456} > <ffffffffa02c3bbb>{:ptlrpc:ldlm_cancel_callback+107} > <ffffffffa02da615>{:ptlrpc:ldlm_cli_cancel_local+213} > <ffffffffa02c3c48>{:ptlrpc:ldlm_lock_addref_internal_nolock+56} > <ffffffffa02c3dbc>{:ptlrpc:search_queue+284} > <ffffffffa02dbc03>{:ptlrpc:ldlm_cancel_list+99} > <ffffffffa02dc113>{:ptlrpc:ldlm_cancel_lru_local+915} > <ffffffffa02ca293>{:ptlrpc:ldlm_resource_putref+435} > <ffffffffa02dc2c9>{:ptlrpc:ldlm_prep_enqueue_req+313} > <ffffffffa0394e6f>{:mdc:mdc_enqueue+1023} <ffffffffa02c1035> > {:ptlrpc:lock_res_and_lock+53} > <ffffffffa0268730>{:obdclass:class_handle2object+224} > <ffffffffa02c5fea>{:ptlrpc:__ldlm_handle2lock+794} > <ffffffffa02c106f>{:ptlrpc:unlock_res_and_lock+31} > <ffffffffa02c5c03>{:ptlrpc:ldlm_lock_decref_internal+595} > <ffffffffa02c156c>{:ptlrpc:ldlm_lock_add_to_lru+140} > <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53} > <ffffffffa02c6f0a>{:ptlrpc:ldlm_lock_decref+154} > <ffffffffa039617d>{:mdc:mdc_intent_lock+685} > <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} > <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0} > <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} > <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0} > <ffffffffa044b64b>{:lustre:ll_prepare_mdc_op_data+139} > <ffffffffa0418a32>{:lustre:ll_intent_file_open+450} > <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} > <ffffffff80192006>{__d_lookup+287} > <ffffffffa0419724>{:lustre:ll_file_open+2100} > <ffffffffa0428a18>{:lustre:ll_inode_permission+184} > <ffffffff80179bdb>{sys_access+349} <ffffffff8017a1ee> > {__dentry_open+201} > <ffffffff8017a3a9>{filp_open+95} <ffffffff80179bdb>{sys_access > +349} > <ffffffff801f00b5>{strncpy_from_user+74} <ffffffff8017a598> > {sys_open+57} > <ffffffff8011026a>{system_call+126} > > It seems blocking_ast process was blocked here. Could you dump the > lustre/llite/namei.o by objdump -S lustre/llite/namei.o and send > to me? 
> > Thanks > WangDi > > Brock Palen wrote: >>>> On Feb 7, 2008, at 11:09 PM, Tom.Wang wrote: >>>>>> MDT dmesg: >>>>>> >>>>>> LustreError: 9042:0:(ldlm_lib.c:1442:target_send_reply_msg()) >>>>>> @@@ processing error (-107) req at 000001002b >>>>>> 52b000 x445020/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl >>>>>> Interpret:/0/0 rc -107/0 >>>>>> LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) >>>>>> ### lock callback timer expired: evicting cl >>>>>> ient 2faf3c9e-26fb-64b7- >>>>>> ca6c-7c5b09374e67 at NET_0x200000aa4008d_UUID nid >>>>>> 10.164.0.141 at tcp ns: mds-nobackup >>>>>> -MDT0000_UUID lock: 00000100476df240/0xbc269e05c512de3a lrc: >>>>>> 1/0,0 mode: CR/CR res: 11240142/324715850 bi >>>>>> ts 0x5 rrc: 2 type: IBT flags: 20 remote: 0x4e54bc800174cd08 >>>>>> expref: 372 pid 26925 >>>>>> >>>>> The client was evicted because of this lock can not be released >>>>> on client >>>>> on time. Could you provide the stack strace of client at that >>>>> time? >>>>> >>>>> I assume increase obd_timeout could fix your problem. Then maybe >>>>> you should wait 1.6.5 released, including a new feature >>>>> adaptive_timeout, >>>>> which will adjust the timeout value according to the network >>>>> congestion >>>>> and server load. And it should help your problem. >>>> >>>> Waiting for the next version of lustre might be the best thing. >>>> I had upped the timeout a few days back but the next day i had >>>> errors on the MDS box. I have switched it back: >>>> >>>> lctl conf_param nobackup-MDT0000.sys.timeout=300 >>>> >>>> I would love to give you that trace but I don''t know how to get >>>> it. Is there a debug option to turn on in the clients? >>> You can get that by echo t > /proc/sysrq-trigger on client. >>> >> Cool command, output of the client is attached. The four >> processes m45_amp214_om, is the application that hung when >> working off of luster. you can see its stuck in IO state. >> >>> >>> >>> >>> >>> >> --------------------------------------------------------------------- >> --- >> >> _______________________________________________ >> Lustre-discuss mailing list >> Lustre-discuss at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss > > >
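To answer the RAM/load question above empirically, it may be enough to watch both servers while the job start/stop loop runs; a sketch using standard tools (nothing Lustre-specific, the interval is arbitrary):

    vmstat 5        # watch free memory, swap activity and the run queue
    iostat -x 5     # watch per-disk utilisation and service times (from the sysstat package)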
Brock Palen wrote:> I found that something is getting overloaded some place. If i just > go start and stop a job over and over quickly the client will lose > contact with one of the servers, ether OST or MDT. > >Server might be stuck somewhere. It should depend on what does the job do if you start and stop it over and over? Will it create, then unlink a lot of file if you start and stop the job? Whether the memory will help your problem depend on what triggers this server stuck. Could you find some console error msg when they are stuck? Usually memory is more helpful on MDS, if you have a large number of clients and big directory in your system. Memory on OST is also helpful, but not directly for read and write. I think you can find the reason of this easily on the list, and there are many discussion about the hardware requirement for lustre before. Thanks WangDi> Would more ram in the servers help? I dont see a high load or IO > wait, but both servers are older (dual 1.4Ghz amd) with only 2 gb of > memory. > > Brock Palen > Center for Advanced Computing > brockp at umich.edu > (734)936-1985 > > > On Feb 8, 2008, at 2:47 PM, Tom.Wang wrote: > > >> Hello, >> >> m45_amp214_om D 0000000000000000 0 2587 1 31389 >> 2586 (NOTLB) >> 00000101f6b435f8 0000000000000006 000001022c7fc030 0000000000000001 >> 00000100080f1a40 0000000000000246 00000101f6b435a8 >> 0000000380136025 >> 00000102270a1030 00000000000000d0 >> Call Trace:<ffffffffa0216e79>{:lnet:LNetPut+1689} <ffffffff8030e45f> >> {__down+147} >> <ffffffff80134659>{default_wake_function+0} <ffffffff8030ff7d> >> {__down_failed+53} >> <ffffffffa04292e1>{:lustre:.text.lock.file+5} >> <ffffffffa044b12e>{:lustre:ll_mdc_blocking_ast+798} >> <ffffffffa02c8eb8>{:ptlrpc:ldlm_resource_get+456} >> <ffffffffa02c3bbb>{:ptlrpc:ldlm_cancel_callback+107} >> <ffffffffa02da615>{:ptlrpc:ldlm_cli_cancel_local+213} >> <ffffffffa02c3c48>{:ptlrpc:ldlm_lock_addref_internal_nolock+56} >> <ffffffffa02c3dbc>{:ptlrpc:search_queue+284} >> <ffffffffa02dbc03>{:ptlrpc:ldlm_cancel_list+99} >> <ffffffffa02dc113>{:ptlrpc:ldlm_cancel_lru_local+915} >> <ffffffffa02ca293>{:ptlrpc:ldlm_resource_putref+435} >> <ffffffffa02dc2c9>{:ptlrpc:ldlm_prep_enqueue_req+313} >> <ffffffffa0394e6f>{:mdc:mdc_enqueue+1023} <ffffffffa02c1035> >> {:ptlrpc:lock_res_and_lock+53} >> <ffffffffa0268730>{:obdclass:class_handle2object+224} >> <ffffffffa02c5fea>{:ptlrpc:__ldlm_handle2lock+794} >> <ffffffffa02c106f>{:ptlrpc:unlock_res_and_lock+31} >> <ffffffffa02c5c03>{:ptlrpc:ldlm_lock_decref_internal+595} >> <ffffffffa02c156c>{:ptlrpc:ldlm_lock_add_to_lru+140} >> <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53} >> <ffffffffa02c6f0a>{:ptlrpc:ldlm_lock_decref+154} >> <ffffffffa039617d>{:mdc:mdc_intent_lock+685} >> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} >> <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0} >> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} >> <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0} >> <ffffffffa044b64b>{:lustre:ll_prepare_mdc_op_data+139} >> <ffffffffa0418a32>{:lustre:ll_intent_file_open+450} >> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} >> <ffffffff80192006>{__d_lookup+287} >> <ffffffffa0419724>{:lustre:ll_file_open+2100} >> <ffffffffa0428a18>{:lustre:ll_inode_permission+184} >> <ffffffff80179bdb>{sys_access+349} <ffffffff8017a1ee> >> {__dentry_open+201} >> <ffffffff8017a3a9>{filp_open+95} <ffffffff80179bdb>{sys_access >> +349} >> <ffffffff801f00b5>{strncpy_from_user+74} <ffffffff8017a598> >> {sys_open+57} >> <ffffffff8011026a>{system_call+126} >> >> It seems 
blocking_ast process was blocked here. Could you dump the >> lustre/llite/namei.o by objdump -S lustre/llite/namei.o and send >> to me? >> >> Thanks >> WangDi >> >> Brock Palen wrote: >> >>>>> On Feb 7, 2008, at 11:09 PM, Tom.Wang wrote: >>>>> >>>>>>> MDT dmesg: >>>>>>> >>>>>>> LustreError: 9042:0:(ldlm_lib.c:1442:target_send_reply_msg()) >>>>>>> @@@ processing error (-107) req at 000001002b >>>>>>> 52b000 x445020/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl >>>>>>> Interpret:/0/0 rc -107/0 >>>>>>> LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) >>>>>>> ### lock callback timer expired: evicting cl >>>>>>> ient 2faf3c9e-26fb-64b7- >>>>>>> ca6c-7c5b09374e67 at NET_0x200000aa4008d_UUID nid >>>>>>> 10.164.0.141 at tcp ns: mds-nobackup >>>>>>> -MDT0000_UUID lock: 00000100476df240/0xbc269e05c512de3a lrc: >>>>>>> 1/0,0 mode: CR/CR res: 11240142/324715850 bi >>>>>>> ts 0x5 rrc: 2 type: IBT flags: 20 remote: 0x4e54bc800174cd08 >>>>>>> expref: 372 pid 26925 >>>>>>> >>>>>>> >>>>>> The client was evicted because of this lock can not be released >>>>>> on client >>>>>> on time. Could you provide the stack strace of client at that >>>>>> time? >>>>>> >>>>>> I assume increase obd_timeout could fix your problem. Then maybe >>>>>> you should wait 1.6.5 released, including a new feature >>>>>> adaptive_timeout, >>>>>> which will adjust the timeout value according to the network >>>>>> congestion >>>>>> and server load. And it should help your problem. >>>>>> >>>>> Waiting for the next version of lustre might be the best thing. >>>>> I had upped the timeout a few days back but the next day i had >>>>> errors on the MDS box. I have switched it back: >>>>> >>>>> lctl conf_param nobackup-MDT0000.sys.timeout=300 >>>>> >>>>> I would love to give you that trace but I don''t know how to get >>>>> it. Is there a debug option to turn on in the clients? >>>>> >>>> You can get that by echo t > /proc/sysrq-trigger on client. >>>> >>>> >>> Cool command, output of the client is attached. The four >>> processes m45_amp214_om, is the application that hung when >>> working off of luster. you can see its stuck in IO state. >>> >>> >>>> >>>> >>>> >>>> >>> --------------------------------------------------------------------- >>> --- >>> >>> _______________________________________________ >>> Lustre-discuss mailing list >>> Lustre-discuss at lists.lustre.org >>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >>> >> >> > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss >