ashok bharat bayana
2008-Feb-12 05:48 UTC
[Lustre-discuss] Contents of Lustre-discuss digest...
Hi,
I just want to know whether there are any alternative file systems to HP SFS. I have heard of the Cluster Gateway from PolyServe. Can anybody please help me find out more about this Cluster Gateway?

Thanks and Regards,
Ashok Bharat

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org on behalf of lustre-discuss-request at lists.lustre.org
Sent: Tue 2/12/2008 11:05 AM
To: lustre-discuss at lists.lustre.org
Subject: Lustre-discuss Digest, Vol 25, Issue 19

Send Lustre-discuss mailing list submissions to
	lustre-discuss at lists.lustre.org

To subscribe or unsubscribe via the World Wide Web, visit
	http://lists.lustre.org/mailman/listinfo/lustre-discuss
or, via email, send a message with subject or body 'help' to
	lustre-discuss-request at lists.lustre.org

You can reach the person managing the list at
	lustre-discuss-owner at lists.lustre.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Lustre-discuss digest..."

Today's Topics:

   1. Re: multihomed clients ignoring lnet options (Cliff White)
   2. Re: multihomed clients ignoring lnet options (Joe Little)
   3. Re: multihomed clients ignoring lnet options (Steden Klaus)
   4. Re: Lustre-discuss Digest, Vol 25, Issue 17 (ashok bharat bayana)

----------------------------------------------------------------------

Message: 1
Date: Mon, 11 Feb 2008 20:00:10 -0800
From: Cliff White <Cliff.White at Sun.COM>
Subject: Re: [Lustre-discuss] multihomed clients ignoring lnet options
To: Aaron Knister <aaron at iges.org>
Cc: lustre-discuss at lists.lustre.org

Aaron Knister wrote:
> I believe that's correct. The NIDs of the various server components
> are stored on the filesystem itself.

Yes, and you can always see them with
tunefs.lustre --print <device>

cliffw

> On Feb 10, 2008, at 12:58 AM, Joe Little wrote:
>
>> Never mind... The problem was resolved by recreating the MGS and
>> the OSTs again, using the same parameters on the server. I was able to
>> change the parameters and still have the servers working, but my guess
>> is that those options are permanently etched into the filesystem.
>>
>> On Feb 9, 2008 8:16 PM, Joe Little <jmlittle at gmail.com> wrote:
>>> I have all of my servers and clients using eth1 for the tcp lustre
>>> lnet.
>>>
>>> All have modprobe.conf entries of:
>>>
>>> options lnet networks="tcp0(eth1)"
>>>
>>> and all report with "lctl list_nids" that they are using the IP
>>> address associated with that interface (a net 192.168.200.x address).
>>>
>>> However, when my client connects, it ignores the above and goes with
>>> eth0 for routing, even though the mds/mgs is on that network range:
>>>
>>> client dmesg:
>>>
>>> Lustre: 4756:0:(module.c:382:init_libcfs_module()) maximum lustre
>>> stack 8192
>>> Lustre: Added LNI 192.168.200.100 at tcp [8/256]
>>> Lustre: Accept secure, port 988
>>> Lustre: OBD class driver, info at clusterfs.com
>>> Lustre Version: 1.6.4.2
>>> Build Version:
>>> 1.6.4.2-19691231190000-PRISTINE-.cache.build.BUILD.lustre-
>>> kernel-2.6.9.lustre.linux-2.6.9-55.0.9.EL_lustre.1.6.4.2smp
>>> Lustre: Lustre Client File System; info at clusterfs.com
>>> LustreError: 4799:0:(socklnd_cb.c:2167:ksocknal_recv_hello()) Error
>>> -104 reading HELLO from 192.168.2.201
>>> LustreError: 11b-b: Connection to 192.168.2.201 at tcp at host
>>> 192.168.2.201 on port 988 was reset: is it running a compatible
>>> version of Lustre and is 192.168.2.201 at tcp one of its NIDs?
>>>
>>> server dmesg:
>>> LustreError: 120-3: Refusing connection from 192.168.2.192 for
>>> 192.168.2.201 at tcp: No matching NI
>>
>> _______________________________________________
>> Lustre-discuss mailing list
>> Lustre-discuss at lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
> Aaron Knister
> Associate Systems Analyst
> Center for Ocean-Land-Atmosphere Studies
>
> (301) 595-7000
> aaron at iges.org
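For anyone reproducing the multihome problem above, a minimal sketch of the configuration and checks being discussed; the interface name and device path are only examples:

  # /etc/modprobe.conf on every server and client: restrict LNET to eth1
  options lnet networks="tcp0(eth1)"

  # after reloading the Lustre modules, confirm the NID sits on the eth1 address
  lctl list_nids

  # on a server, show the NIDs and parameters stored on a target
  tunefs.lustre --print /dev/sdb1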
------------------------------

Message: 2
Date: Mon, 11 Feb 2008 20:51:20 -0800
From: "Joe Little" <jmlittle at gmail.com>
Subject: Re: [Lustre-discuss] multihomed clients ignoring lnet options
To: "Cliff White" <Cliff.White at sun.com>
Cc: lustre-discuss at lists.lustre.org

On Feb 11, 2008 8:00 PM, Cliff White <Cliff.White at sun.com> wrote:
> Aaron Knister wrote:
>> I believe that's correct. The NIDs of the various server components
>> are stored on the filesystem itself.
>
> Yes, and you can always see them with
> tunefs.lustre --print <device>
>
> cliffw

Is there any way for anyone to change them after the fact?

------------------------------

Message: 3
Date: Mon, 11 Feb 2008 20:53:41 -0800
From: "Steden Klaus" <Klaus.Steden at thomson.net>
Subject: Re: [Lustre-discuss] multihomed clients ignoring lnet options
To: <jmlittle at gmail.com>, <Cliff.White at sun.com>
Cc: lustre-discuss at lists.lustre.org

If you have root, you can change them using tunefs.lustre after the
file system has been shut down.

I've done this a number of times to test various lnet configs.

Klaus

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
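A rough sketch of the kind of change Klaus is describing, on a Lustre 1.6 server with the target unmounted; the mount point, device path, and MGS NID here are placeholders, and this shows only the common case of pointing a target at a new MGS NID:

  umount /mnt/ost0                       # the target must be offline first
  tunefs.lustre --print /dev/sdb1        # inspect the currently stored parameters
  tunefs.lustre --erase-params --mgsnode=192.168.200.10@tcp0 --writeconf /dev/sdb1
  mount -t lustre /dev/sdb1 /mnt/ost0

Since --writeconf regenerates the configuration logs, the usual writeconf procedure from the Lustre manual (all targets unmounted, MDT before OSTs) should be followed.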
------------------------------

Message: 4
Date: Tue, 12 Feb 2008 11:15:18 +0530
From: "ashok bharat bayana" <ashok.bharat.bayana at iiitb.ac.in>
Subject: Re: [Lustre-discuss] Lustre-discuss Digest, Vol 25, Issue 17
To: <lustre-discuss at lists.lustre.org>

Hi,
I just want to know whether there are any alternative file systems to HP SFS. I have heard of the Cluster Gateway from PolyServe. Can anybody please help me find out more about this Cluster Gateway?

Thanks and Regards,
Ashok Bharat

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org on behalf of lustre-discuss-request at lists.lustre.org
Sent: Tue 2/12/2008 3:18 AM
To: lustre-discuss at lists.lustre.org
Subject: Lustre-discuss Digest, Vol 25, Issue 17

Send Lustre-discuss mailing list submissions to
	lustre-discuss at lists.lustre.org

To subscribe or unsubscribe via the World Wide Web, visit
	http://lists.lustre.org/mailman/listinfo/lustre-discuss
or, via email, send a message with subject or body 'help' to
	lustre-discuss-request at lists.lustre.org

You can reach the person managing the list at
	lustre-discuss-owner at lists.lustre.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Lustre-discuss digest..."

Today's Topics:

   1. Re: Benchmarking Lustre (Marty Barnaby)
   2. Re: Luster clients getting evicted (Aaron Knister)
   3. Re: Luster clients getting evicted (Tom.Wang)
   4. Re: Luster clients getting evicted (Craig Prescott)
   5. Re: rc -43: Identifier removed (Andreas Dilger)
   6. Re: Luster clients getting evicted (Brock Palen)
   7. Re: Luster clients getting evicted (Aaron Knister)

----------------------------------------------------------------------

Message: 1
Date: Mon, 11 Feb 2008 11:25:48 -0700
From: "Marty Barnaby" <mlbarna at sandia.gov>
Subject: Re: [Lustre-discuss] Benchmarking Lustre
To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>

Do you have any special interests, like: writing from a true MPI job;
collective vs. independent; one-file-per-processor vs. a single, shared
file; or writing via MPI-IO vs. POSIX?

Marty Barnaby

mayur bhosle wrote:
> Hi everyone,
>
> I am a student at Georgia Tech University, and as part of a project I
> need to benchmark the Lustre file system. I did a lot of searching
> regarding possible benchmarks, but I need some advice on which
> benchmarks would be more suitable. If anyone can post a suggestion,
> that would be really helpful.
>
> Thanks in advance,
>
> Mayur

------------------------------

Message: 2
Date: Mon, 11 Feb 2008 14:16:20 -0500
From: Aaron Knister <aaron at iges.org>
Subject: Re: [Lustre-discuss] Luster clients getting evicted
To: Tom.Wang <Tom.Wang at Sun.COM>
Cc: lustre-discuss at lists.lustre.org

I'm having a similar issue with lustre 1.6.4.2 and infiniband. Under
load, the clients hang about every 10 minutes, which is really bad for
a production machine. The only way to fix the hang is to reboot the
server. My users are getting extremely impatient :-/

I see this on the clients-

LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1202756629, 301s ago) req at ffff8100af233600 x1796079/
t0 o6->data-OST0000_UUID at 192.168.64.71@o2ib:28 lens 336/336 ref 1 fl
Rpc:/0/0 rc 0/-22
Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service data-
OST0000 via nid 192.168.64.71 at o2ib was lost; in progress operations
using this service will wait for recovery to complete.
LustreError: 11-0: an error occurred while communicating with
192.168.64.71 at o2ib. The ost_connect operation failed with -16
LustreError: 11-0: an error occurred while communicating with
192.168.64.71 at o2ib.
The ost_connect operation failed with -16 I''ve increased the timeout to 300seconds and it has helped marginally. -Aaron On Feb 9, 2008, at 12:06 AM, Tom.Wang wrote:> Hi, > Aha, this is bug has been fixed in 14360. > > https://bugzilla.lustre.org/show_bug.cgi?id=14360 > > The patch there should fix your problem, which should be released in > 1.6.5 > > Thanks > > Brock Palen wrote: >> Sure, Attached, note though, we rebuilt our lustre source for >> another >> box that uses the largesmp kernel. but it used the same options and >> compiler. >> >> >> Brock Palen >> Center for Advanced Computing >> brockp at umich.edu >> (734)936-1985 >> >> >> On Feb 8, 2008, at 2:47 PM, Tom.Wang wrote: >> >>> Hello, >>> >>> m45_amp214_om D 0000000000000000 0 2587 1 31389 >>> 2586 (NOTLB) >>> 00000101f6b435f8 0000000000000006 000001022c7fc030 0000000000000001 >>> 00000100080f1a40 0000000000000246 00000101f6b435a8 >>> 0000000380136025 >>> 00000102270a1030 00000000000000d0 >>> Call Trace:<ffffffffa0216e79>{:lnet:LNetPut+1689} >>> <ffffffff8030e45f>{__down+147} >>> <ffffffff80134659>{default_wake_function+0} >>> <ffffffff8030ff7d>{__down_failed+53} >>> <ffffffffa04292e1>{:lustre:.text.lock.file+5} >>> <ffffffffa044b12e>{:lustre:ll_mdc_blocking_ast+798} >>> <ffffffffa02c8eb8>{:ptlrpc:ldlm_resource_get+456} >>> <ffffffffa02c3bbb>{:ptlrpc:ldlm_cancel_callback+107} >>> <ffffffffa02da615>{:ptlrpc:ldlm_cli_cancel_local+213} >>> <ffffffffa02c3c48>{:ptlrpc:ldlm_lock_addref_internal_nolock+56} >>> <ffffffffa02c3dbc>{:ptlrpc:search_queue+284} >>> <ffffffffa02dbc03>{:ptlrpc:ldlm_cancel_list+99} >>> <ffffffffa02dc113>{:ptlrpc:ldlm_cancel_lru_local+915} >>> <ffffffffa02ca293>{:ptlrpc:ldlm_resource_putref+435} >>> <ffffffffa02dc2c9>{:ptlrpc:ldlm_prep_enqueue_req+313} >>> <ffffffffa0394e6f>{:mdc:mdc_enqueue+1023} >>> <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53} >>> <ffffffffa0268730>{:obdclass:class_handle2object+224} >>> <ffffffffa02c5fea>{:ptlrpc:__ldlm_handle2lock+794} >>> <ffffffffa02c106f>{:ptlrpc:unlock_res_and_lock+31} >>> <ffffffffa02c5c03>{:ptlrpc:ldlm_lock_decref_internal+595} >>> <ffffffffa02c156c>{:ptlrpc:ldlm_lock_add_to_lru+140} >>> <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53} >>> <ffffffffa02c6f0a>{:ptlrpc:ldlm_lock_decref+154} >>> <ffffffffa039617d>{:mdc:mdc_intent_lock+685} >>> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} >>> <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0} >>> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} >>> <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0} >>> <ffffffffa044b64b>{:lustre:ll_prepare_mdc_op_data+139} >>> <ffffffffa0418a32>{:lustre:ll_intent_file_open+450} >>> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0} >>> <ffffffff80192006>{__d_lookup+287} >>> <ffffffffa0419724>{:lustre:ll_file_open+2100} >>> <ffffffffa0428a18>{:lustre:ll_inode_permission+184} >>> <ffffffff80179bdb>{sys_access+349} >>> <ffffffff8017a1ee>{__dentry_open+201} >>> <ffffffff8017a3a9>{filp_open+95} >>> <ffffffff80179bdb>{sys_access+349} >>> <ffffffff801f00b5>{strncpy_from_user+74} >>> <ffffffff8017a598>{sys_open+57} >>> <ffffffff8011026a>{system_call+126} >>> >>> It seems blocking_ast process was blocked here. Could you dump the >>> lustre/llite/namei.o by objdump -S lustre/llite/namei.o and send >>> to me? 
>>> >>> Thanks >>> WangDi >>> >>> Brock Palen wrote: >>>>>> On Feb 7, 2008, at 11:09 PM, Tom.Wang wrote: >>>>>>>> MDT dmesg: >>>>>>>> >>>>>>>> LustreError: 9042:0:(ldlm_lib.c:1442:target_send_reply_msg()) >>>>>>>> @@@ processing error (-107) req at 000001002b >>>>>>>> 52b000 x445020/t0 o400-><?>@<?>:-1 lens 128/0 ref 0 fl >>>>>>>> Interpret:/0/0 rc -107/0 >>>>>>>> LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) >>>>>>>> ### >>>>>>>> lock callback timer expired: evicting cl >>>>>>>> ient >>>>>>>> 2faf3c9e-26fb-64b7-ca6c-7c5b09374e67 at NET_0x200000aa4008d_UUID >>>>>>>> nid 10.164.0.141 at tcp ns: mds-nobackup >>>>>>>> -MDT0000_UUID lock: 00000100476df240/0xbc269e05c512de3a lrc: >>>>>>>> 1/0,0 mode: CR/CR res: 11240142/324715850 bi >>>>>>>> ts 0x5 rrc: 2 type: IBT flags: 20 remote: 0x4e54bc800174cd08 >>>>>>>> expref: 372 pid 26925 >>>>>>>> >>>>>>> The client was evicted because of this lock can not be released >>>>>>> on client >>>>>>> on time. Could you provide the stack strace of client at that >>>>>>> time? >>>>>>> >>>>>>> I assume increase obd_timeout could fix your problem. Then maybe >>>>>>> you should wait 1.6.5 released, including a new feature >>>>>>> adaptive_timeout, >>>>>>> which will adjust the timeout value according to the network >>>>>>> congestion >>>>>>> and server load. And it should help your problem. >>>>>> >>>>>> Waiting for the next version of lustre might be the best >>>>>> thing. I >>>>>> had upped the timeout a few days back but the next day i had >>>>>> errors on the MDS box. I have switched it back: >>>>>> >>>>>> lctl conf_param nobackup-MDT0000.sys.timeout=300 >>>>>> >>>>>> I would love to give you that trace but I don''t know how to get >>>>>> it. Is there a debug option to turn on in the clients? >>>>> You can get that by echo t > /proc/sysrq-trigger on client. >>>>> >>>> Cool command, output of the client is attached. The four >>>> processes >>>> m45_amp214_om, is the application that hung when working off of >>>> luster. you can see its stuck in IO state. >>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>> ------------------------------------------------------------------------ >>>> >>>> >>>> _______________________________________________ >>>> Lustre-discuss mailing list >>>> Lustre-discuss at lists.lustre.org >>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss >>> >>> >>> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Lustre-discuss mailing list >> Lustre-discuss at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discussAaron Knister Associate Systems Analyst Center for Ocean-Land-Atmosphere Studies (301) 595-7000 aaron at iges.org ------------------------------ Message: 3 Date: Mon, 11 Feb 2008 15:04:05 -0500 From: "Tom.Wang" <Tom.Wang at Sun.COM> Subject: Re: [Lustre-discuss] Luster clients getting evicted To: Aaron Knister <aaron at iges.org> Cc: lustre-discuss at lists.lustre.org Message-ID: <47B0AA35.7070303 at sun.com> Content-Type: text/plain; format=flowed; charset=ISO-8859-1 Aaron Knister wrote:> I''m having a similar issue with lustre 1.6.4.2 and infiniband. Under > load, the clients hand about every 10 minutes which is really bad for > a production machine. The only way to fix the hang is to reboot the > server. 
My users are getting extremely impatient :-/
>
> I see this on the clients-
>
> LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@
> timeout (sent at 1202756629, 301s ago) req at ffff8100af233600
> x1796079/t0 o6->data-OST0000_UUID at 192.168.64.71@o2ib:28 lens 336/336
> ref 1 fl Rpc:/0/0 rc 0/-22

It means the OST could not respond to the request (unlink, o6) within 300 seconds,
so the client disconnects the import to the OST and tries to reconnect.
Does this disconnection always happen when doing an unlink? Could you
please post a process trace and the console messages of the OST at that time?

Thanks
WangDi

> Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service
> data-OST0000 via nid 192.168.64.71 at o2ib was lost; in progress
> operations using this service will wait for recovery to complete.
> LustreError: 11-0: an error occurred while communicating with
> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
> LustreError: 11-0: an error occurred while communicating with
> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
>
> I've increased the timeout to 300seconds and it has helped marginally.
>
> -Aaron

------------------------------

Message: 4
Date: Mon, 11 Feb 2008 15:19:21 -0500
From: Craig Prescott <prescott at hpc.ufl.edu>
Subject: Re: [Lustre-discuss] Luster clients getting evicted
To: Aaron Knister <aaron at iges.org>
Cc: "Tom.Wang" <Tom.Wang at Sun.COM>, lustre-discuss at lists.lustre.org

Aaron Knister wrote:
> I'm having a similar issue with lustre 1.6.4.2 and infiniband. Under
> load, the clients hang about every 10 minutes, which is really bad for
> a production machine. The only way to fix the hang is to reboot the
> server. My users are getting extremely impatient :-/
>
> I've increased the timeout to 300seconds and it has helped marginally.

Hi Aaron;

We set the timeout to a big number (1000 secs) on our 400 node cluster
(mostly o2ib, some tcp clients). Until we did this, we had loads
of evictions. In our case, it solved the problem.

Cheers,
Craig
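For reference, the timeout Craig is describing is the single obd timeout tunable already mentioned earlier in this thread; a small sketch of raising it, where the filesystem name and the value are only examples:

  # on the MGS, raise the system-wide timeout for the filesystem "nobackup"
  lctl conf_param nobackup-MDT0000.sys.timeout=1000

  # on any node, check the value currently in effect
  cat /proc/sys/lustre/timeout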
------------------------------

Message: 5
Date: Mon, 11 Feb 2008 14:11:45 -0700
From: Andreas Dilger <adilger at sun.com>
Subject: Re: [Lustre-discuss] rc -43: Identifier removed
To: Per Lundqvist <perl at nsc.liu.se>
Cc: Lustre Discuss <lustre-discuss at lists.lustre.org>

On Feb 11, 2008 17:04 +0100, Per Lundqvist wrote:
> I got this error today when testing a newly set up 1.6 filesystem:
>
> n50 1% cd /mnt/test
> n50 2% ls
> ls: reading directory .: Identifier removed
>
> n50 3% ls -alrt
> total 8
> ?---------  ? ?    ?       ?            ? dir1
> ?---------  ? ?    ?       ?            ? dir2
> drwxr-xr-x  4 root root 4096 Feb  8 15:46 ../
> drwxr-xr-x  4 root root 4096 Feb 11 15:11 ./
>
> n50 4% stat .
>   File: `.'
>   Size: 4096          Blocks: 8          IO Block: 4096   directory
> Device: b438c888h/-1271347064d  Inode: 27616681  Links: 2
> Access: (0755/drwxr-xr-x)  Uid: ( 1120/   faxen)   Gid: (  500/     nsc)
> Access: 2008-02-11 16:11:48.336621154 +0100
> Modify: 2008-02-11 15:11:27.000000000 +0100
> Change: 2008-02-11 15:11:31.352841294 +0100
>
> This seems to happen almost all the time when I am running as a
> specific user on this system. Note that the stat call always works... I
> haven't yet been able to reproduce this problem when running as my own
> user.

EIDRM (Identifier removed) means that your MDS has a user database
(/etc/passwd and /etc/group) that is missing the particular user ID.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
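A quick sketch of the check Andreas is pointing at: the UID and GID shown in the stat output above have to resolve on the MDS, not just on the clients. The user and group names are simply the ones from the listing:

  # run on the MDS; each of these should return an entry for the affected user
  getent passwd faxen
  getent group nsc
  id faxen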
------------------------------

Message: 6
Date: Mon, 11 Feb 2008 16:17:37 -0500
From: Brock Palen <brockp at umich.edu>
Subject: Re: [Lustre-discuss] Luster clients getting evicted
To: Craig Prescott <prescott at hpc.ufl.edu>
Cc: "Tom.Wang" <Tom.Wang at Sun.COM>, lustre-discuss at lists.lustre.org

>> I've increased the timeout to 300seconds and it has helped
>> marginally.
>
> Hi Aaron;
>
> We set the timeout to a big number (1000 secs) on our 400 node cluster
> (mostly o2ib, some tcp clients). Until we did this, we had loads
> of evictions. In our case, it solved the problem.

This feels excessive. But at this point I guess I'll try it.

> Cheers,
> Craig

------------------------------

Message: 7
Date: Mon, 11 Feb 2008 16:48:05 -0500
From: Aaron Knister <aaron at iges.org>
Subject: Re: [Lustre-discuss] Luster clients getting evicted
To: Brock Palen <brockp at umich.edu>
Cc: "Tom.Wang" <Tom.Wang at Sun.COM>, lustre-discuss at lists.lustre.org

So far it's helped. If this doesn't fix it I'm going to apply the
patch mentioned here - https://bugzilla.lustre.org/attachment.cgi?id=14006&action=edit
I'll let you know how it goes. If you'd like a copy of the patched
version let me know. Are you running RHEL/SLES? What version of the OS
and lustre?

-Aaron

On Feb 11, 2008, at 4:17 PM, Brock Palen wrote:

>>>> I've increased the timeout to 300seconds and it has helped
>>>> marginally.
>>>
>>> Hi Aaron;
>>>
>>> We set the timeout to a big number (1000 secs) on our 400 node cluster
>>> (mostly o2ib, some tcp clients). Until we did this, we had loads
>>> of evictions. In our case, it solved the problem.
>>
>> This feels excessive. But at this point I guess I'll try it.

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies

(301) 595-7000
aaron at iges.org

------------------------------

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

End of Lustre-discuss Digest, Vol 25, Issue 17
**********************************************

------------------------------

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

End of Lustre-discuss Digest, Vol 25, Issue 19
**********************************************
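For anyone following the eviction thread quoted above, the client-side task trace that Tom Wang asked for can be captured with the sysrq trigger he mentions; a small sketch, with only standard paths assumed:

  # enable the sysrq interface if it is not already on
  echo 1 > /proc/sys/kernel/sysrq

  # dump the state and stack of every task into the kernel log
  echo t > /proc/sysrq-trigger

  # collect the resulting traces for the mailing list
  dmesg > /tmp/client-task-dump.txt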
ashok bharat bayana wrote:
> Hi,
> I just want to know whether there are any alternative file systems to HP SFS.
> I have heard of the Cluster Gateway from PolyServe. Can anybody please help me find out more about this Cluster Gateway?

PolyServe is now owned by HP, so I would ask there.

cliffw