ashok bharat bayana
2008-Feb-12 05:45 UTC
[Lustre-discuss] Lustre-discuss Digest, Vol 25, Issue 17
Hi, I just want to know whether there are any alternative file systems to HP
SFS. I heard that there is Cluster Gateway from PolyServe. Can anybody please
help me find out more about this Cluster Gateway?

Thanks and Regards,
Ashok Bharat

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org on behalf of lustre-discuss-request at lists.lustre.org
Sent: Tue 2/12/2008 3:18 AM
To: lustre-discuss at lists.lustre.org
Subject: Lustre-discuss Digest, Vol 25, Issue 17

Send Lustre-discuss mailing list submissions to
	lustre-discuss at lists.lustre.org

To subscribe or unsubscribe via the World Wide Web, visit
	http://lists.lustre.org/mailman/listinfo/lustre-discuss
or, via email, send a message with subject or body 'help' to
	lustre-discuss-request at lists.lustre.org

You can reach the person managing the list at
	lustre-discuss-owner at lists.lustre.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of Lustre-discuss digest..."

Today's Topics:

   1. Re: Benchmarking Lustre (Marty Barnaby)
   2. Re: Lustre clients getting evicted (Aaron Knister)
   3. Re: Lustre clients getting evicted (Tom.Wang)
   4. Re: Lustre clients getting evicted (Craig Prescott)
   5. Re: rc -43: Identifier removed (Andreas Dilger)
   6. Re: Lustre clients getting evicted (Brock Palen)
   7. Re: Lustre clients getting evicted (Aaron Knister)

----------------------------------------------------------------------

Message: 1
Date: Mon, 11 Feb 2008 11:25:48 -0700
From: "Marty Barnaby" <mlbarna at sandia.gov>
Subject: Re: [Lustre-discuss] Benchmarking Lustre
To: "lustre-discuss at lists.lustre.org" <lustre-discuss at lists.lustre.org>
Message-ID: <47B0932C.2090200 at sandia.gov>
Content-Type: text/plain; charset=iso-8859-1; format=flowed

Do you have any special interests, like: writing from a true MPI job;
collective vs. independent; one-file-per-processor vs. a single, shared
file; or writing via MPI-IO vs. POSIX?
Marty Barnaby

mayur bhosle wrote:
> hi everyone,
>
> I am a student at Georgia Tech university, and as part of a project I
> need to benchmark the Lustre file system. I did a lot of searching
> regarding the possible benchmarks, but I need some advice on which
> benchmarks would be most suitable. If anyone can post a suggestion,
> that would be really helpful.
>
> thanks in advance,
>
> Mayur

------------------------------

Message: 2
Date: Mon, 11 Feb 2008 14:16:20 -0500
From: Aaron Knister <aaron at iges.org>
Subject: Re: [Lustre-discuss] Lustre clients getting evicted
To: Tom.Wang <Tom.Wang at Sun.COM>
Cc: lustre-discuss at lists.lustre.org
Message-ID: <79343CD8-77EA-4686-A2AE-BEE6FAC59914 at iges.org>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

I'm having a similar issue with lustre 1.6.4.2 and InfiniBand. Under
load, the clients hang about every 10 minutes, which is really bad for
a production machine. The only way to fix the hang is to reboot the
server. My users are getting extremely impatient :-/

I see this on the clients:

LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@
timeout (sent at 1202756629, 301s ago) req at ffff8100af233600
x1796079/t0 o6->data-OST0000_UUID at 192.168.64.71@o2ib:28 lens 336/336
ref 1 fl Rpc:/0/0 rc 0/-22
Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service
data-OST0000 via nid 192.168.64.71 at o2ib was lost; in progress
operations using this service will wait for recovery to complete.
LustreError: 11-0: an error occurred while communicating with
192.168.64.71 at o2ib. The ost_connect operation failed with -16
LustreError: 11-0: an error occurred while communicating with
192.168.64.71 at o2ib. The ost_connect operation failed with -16

I've increased the timeout to 300 seconds and it has helped marginally.

-Aaron

On Feb 9, 2008, at 12:06 AM, Tom.Wang wrote:
> Hi,
> Aha, this bug has been fixed in 14360.
>
> https://bugzilla.lustre.org/show_bug.cgi?id=14360
>
> The patch there should fix your problem; it should be released in
> 1.6.5.
>
> Thanks
>
> Brock Palen wrote:
>> Sure, attached. Note though, we rebuilt our lustre source for another
>> box that uses the largesmp kernel, but it used the same options and
>> compiler.
>>
>> Brock Palen
>> Center for Advanced Computing
>> brockp at umich.edu
>> (734)936-1985
>>
>> On Feb 8, 2008, at 2:47 PM, Tom.Wang wrote:
>>
>>> Hello,
>>>
>>> m45_amp214_om D 0000000000000000  0  2587  1  31389  2586 (NOTLB)
>>> 00000101f6b435f8 0000000000000006 000001022c7fc030 0000000000000001
>>> 00000100080f1a40 0000000000000246 00000101f6b435a8 0000000380136025
>>> 00000102270a1030 00000000000000d0
>>> Call Trace:
>>> <ffffffffa0216e79>{:lnet:LNetPut+1689}
>>> <ffffffff8030e45f>{__down+147}
>>> <ffffffff80134659>{default_wake_function+0}
>>> <ffffffff8030ff7d>{__down_failed+53}
>>> <ffffffffa04292e1>{:lustre:.text.lock.file+5}
>>> <ffffffffa044b12e>{:lustre:ll_mdc_blocking_ast+798}
>>> <ffffffffa02c8eb8>{:ptlrpc:ldlm_resource_get+456}
>>> <ffffffffa02c3bbb>{:ptlrpc:ldlm_cancel_callback+107}
>>> <ffffffffa02da615>{:ptlrpc:ldlm_cli_cancel_local+213}
>>> <ffffffffa02c3c48>{:ptlrpc:ldlm_lock_addref_internal_nolock+56}
>>> <ffffffffa02c3dbc>{:ptlrpc:search_queue+284}
>>> <ffffffffa02dbc03>{:ptlrpc:ldlm_cancel_list+99}
>>> <ffffffffa02dc113>{:ptlrpc:ldlm_cancel_lru_local+915}
>>> <ffffffffa02ca293>{:ptlrpc:ldlm_resource_putref+435}
>>> <ffffffffa02dc2c9>{:ptlrpc:ldlm_prep_enqueue_req+313}
>>> <ffffffffa0394e6f>{:mdc:mdc_enqueue+1023}
>>> <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53}
>>> <ffffffffa0268730>{:obdclass:class_handle2object+224}
>>> <ffffffffa02c5fea>{:ptlrpc:__ldlm_handle2lock+794}
>>> <ffffffffa02c106f>{:ptlrpc:unlock_res_and_lock+31}
>>> <ffffffffa02c5c03>{:ptlrpc:ldlm_lock_decref_internal+595}
>>> <ffffffffa02c156c>{:ptlrpc:ldlm_lock_add_to_lru+140}
>>> <ffffffffa02c1035>{:ptlrpc:lock_res_and_lock+53}
>>> <ffffffffa02c6f0a>{:ptlrpc:ldlm_lock_decref+154}
>>> <ffffffffa039617d>{:mdc:mdc_intent_lock+685}
>>> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>> <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0}
>>> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>> <ffffffffa02d85f0>{:ptlrpc:ldlm_completion_ast+0}
>>> <ffffffffa044b64b>{:lustre:ll_prepare_mdc_op_data+139}
>>> <ffffffffa0418a32>{:lustre:ll_intent_file_open+450}
>>> <ffffffffa044ae10>{:lustre:ll_mdc_blocking_ast+0}
>>> <ffffffff80192006>{__d_lookup+287}
>>> <ffffffffa0419724>{:lustre:ll_file_open+2100}
>>> <ffffffffa0428a18>{:lustre:ll_inode_permission+184}
>>> <ffffffff80179bdb>{sys_access+349}
>>> <ffffffff8017a1ee>{__dentry_open+201}
>>> <ffffffff8017a3a9>{filp_open+95}
>>> <ffffffff80179bdb>{sys_access+349}
>>> <ffffffff801f00b5>{strncpy_from_user+74}
>>> <ffffffff8017a598>{sys_open+57}
>>> <ffffffff8011026a>{system_call+126}
>>>
>>> It seems the blocking_ast process was blocked here. Could you dump
>>> lustre/llite/namei.o with objdump -S lustre/llite/namei.o and send
>>> the output to me?
>>>
>>> Thanks
>>> WangDi
>>>
>>> Brock Palen wrote:
>>>>>> On Feb 7, 2008, at 11:09 PM, Tom.Wang wrote:
>>>>>>>> MDT dmesg:
>>>>>>>>
>>>>>>>> LustreError: 9042:0:(ldlm_lib.c:1442:target_send_reply_msg())
>>>>>>>> @@@ processing error (-107) req at 000001002b52b000 x445020/t0
>>>>>>>> o400-><?>@<?>:-1 lens 128/0 ref 0 fl Interpret:/0/0 rc -107/0
>>>>>>>> LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ###
>>>>>>>> lock callback timer expired: evicting client
>>>>>>>> 2faf3c9e-26fb-64b7-ca6c-7c5b09374e67 at NET_0x200000aa4008d_UUID
>>>>>>>> nid 10.164.0.141 at tcp ns: mds-nobackup-MDT0000_UUID lock:
>>>>>>>> 00000100476df240/0xbc269e05c512de3a lrc: 1/0,0 mode: CR/CR res:
>>>>>>>> 11240142/324715850 bits 0x5 rrc: 2 type: IBT flags: 20 remote:
>>>>>>>> 0x4e54bc800174cd08 expref: 372 pid 26925
>>>>>>>>
>>>>>>> The client was evicted because this lock could not be released
>>>>>>> on the client in time. Could you provide the stack trace of the
>>>>>>> client at that time?
>>>>>>>
>>>>>>> I assume increasing obd_timeout could fix your problem. Or maybe
>>>>>>> you should wait for the 1.6.5 release, which includes a new
>>>>>>> feature, adaptive timeouts, that will adjust the timeout value
>>>>>>> according to network congestion and server load. It should help
>>>>>>> with your problem.
>>>>>>
>>>>>> Waiting for the next version of lustre might be the best thing. I
>>>>>> had upped the timeout a few days back, but the next day I had
>>>>>> errors on the MDS box. I have switched it back:
>>>>>>
>>>>>> lctl conf_param nobackup-MDT0000.sys.timeout=300
>>>>>>
>>>>>> I would love to give you that trace but I don't know how to get
>>>>>> it. Is there a debug option to turn on in the clients?
>>>>> You can get that by running echo t > /proc/sysrq-trigger on the
>>>>> client.
>>>>>
>>>> Cool command; output from the client is attached. The four
>>>> processes named m45_amp214_om are the application that hung when
>>>> working off of lustre.
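[Editor's note: a minimal sketch of the sysrq task-dump tip above. The
/tmp output path and the 200-line dmesg tail are illustrative choices,
not from the thread; the write is guarded so the script is a harmless
no-op without root or sysrq support.]

```shell
# Dump all kernel task states (the "echo t" trick above), then keep the
# tail of the kernel log so the traces can be attached to a reply.
# Needs root and CONFIG_MAGIC_SYSRQ; otherwise it only prints a notice.
if [ -w /proc/sysrq-trigger ]; then
    echo t > /proc/sysrq-trigger              # ask the kernel to log every task's stack
    dmesg | tail -n 200 > /tmp/task-dump.txt  # grab the freshly logged traces
    echo "trace saved to /tmp/task-dump.txt"
else
    echo "cannot write /proc/sysrq-trigger (need root and sysrq support)"
fi
```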
>>>> You can see it's stuck in IO state.
>>>>
>>>> ------------------------------------------------------------------------
>>>>
>>>> _______________________________________________
>>>> Lustre-discuss mailing list
>>>> Lustre-discuss at lists.lustre.org
>>>> http://lists.lustre.org/mailman/listinfo/lustre-discuss

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies
(301) 595-7000
aaron at iges.org

------------------------------

Message: 3
Date: Mon, 11 Feb 2008 15:04:05 -0500
From: "Tom.Wang" <Tom.Wang at Sun.COM>
Subject: Re: [Lustre-discuss] Lustre clients getting evicted
To: Aaron Knister <aaron at iges.org>
Cc: lustre-discuss at lists.lustre.org
Message-ID: <47B0AA35.7070303 at sun.com>
Content-Type: text/plain; format=flowed; charset=ISO-8859-1

Aaron Knister wrote:
> I'm having a similar issue with lustre 1.6.4.2 and InfiniBand. Under
> load, the clients hang about every 10 minutes, which is really bad for
> a production machine. The only way to fix the hang is to reboot the
> server. My users are getting extremely impatient :-/
>
> I see this on the clients:
>
> LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@
> timeout (sent at 1202756629, 301s ago) req at ffff8100af233600
> x1796079/t0 o6->data-OST0000_UUID at 192.168.64.71@o2ib:28 lens 336/336
> ref 1 fl Rpc:/0/0 rc 0/-22

It means the OST could not respond to the request (unlink, o6) within
300 seconds, so the client disconnected the import to the OST and will
try to reconnect.
Does this disconnection always happen when doing unlink? Could you
please post a process trace and the console messages of the OST at that
time?

Thanks
WangDi

> Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service
> data-OST0000 via nid 192.168.64.71 at o2ib was lost; in progress
> operations using this service will wait for recovery to complete.
> LustreError: 11-0: an error occurred while communicating with
> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
> LustreError: 11-0: an error occurred while communicating with
> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
>
> I've increased the timeout to 300 seconds and it has helped marginally.
>
> -Aaron

------------------------------

Message: 4
Date: Mon, 11 Feb 2008 15:19:21 -0500
From: Craig Prescott <prescott at hpc.ufl.edu>
Subject: Re: [Lustre-discuss] Lustre clients getting evicted
To: Aaron Knister <aaron at iges.org>
Cc: "Tom.Wang" <Tom.Wang at Sun.COM>, lustre-discuss at lists.lustre.org
Message-ID: <47B0ADC9.8020501 at hpc.ufl.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Aaron Knister wrote:
> I'm having a similar issue with lustre 1.6.4.2 and InfiniBand. Under
> load, the clients hang about every 10 minutes, which is really bad for
> a production machine. The only way to fix the hang is to reboot the
> server. My users are getting extremely impatient :-/
>
> I see this on the clients:
>
> LustreError: 2814:0:(client.c:975:ptlrpc_expire_one_request()) @@@
> timeout (sent at 1202756629, 301s ago) req at ffff8100af233600
> x1796079/t0 o6->data-OST0000_UUID at 192.168.64.71@o2ib:28 lens 336/336
> ref 1 fl Rpc:/0/0 rc 0/-22
> Lustre: data-OST0000-osc-ffff810139ce4800: Connection to service
> data-OST0000 via nid 192.168.64.71 at o2ib was lost; in progress
> operations using this service will wait for recovery to complete.
> LustreError: 11-0: an error occurred while communicating with
> 192.168.64.71 at o2ib.
> The ost_connect operation failed with -16
> LustreError: 11-0: an error occurred while communicating with
> 192.168.64.71 at o2ib. The ost_connect operation failed with -16
>
> I've increased the timeout to 300 seconds and it has helped marginally.

Hi Aaron;

We set the timeout to a big number (1000 secs) on our 400 node cluster
(mostly o2ib, some tcp clients). Until we did this, we had loads of
evictions. In our case, it solved the problem.

Cheers,
Craig

------------------------------

Message: 5
Date: Mon, 11 Feb 2008 14:11:45 -0700
From: Andreas Dilger <adilger at sun.com>
Subject: Re: [Lustre-discuss] rc -43: Identifier removed
To: Per Lundqvist <perl at nsc.liu.se>
Cc: Lustre Discuss <lustre-discuss at lists.lustre.org>
Message-ID: <20080211211145.GJ3029 at webber.adilger.int>
Content-Type: text/plain; charset=us-ascii

On Feb 11, 2008 17:04 +0100, Per Lundqvist wrote:
> I got this error today when testing a newly set up 1.6 filesystem:
>
> n50 1% cd /mnt/test
> n50 2% ls
> ls: reading directory .: Identifier removed
>
> n50 3% ls -alrt
> total 8
> ?--------- ? ?    ?       ?            ? dir1
> ?--------- ? ?    ?       ?            ? dir2
> drwxr-xr-x 4 root root 4096 Feb  8 15:46 ../
> drwxr-xr-x 4 root root 4096 Feb 11 15:11 ./
>
> n50 4% stat .
>   File: `.'
>   Size: 4096   Blocks: 8   IO Block: 4096   directory
> Device: b438c888h/-1271347064d  Inode: 27616681  Links: 2
> Access: (0755/drwxr-xr-x)  Uid: ( 1120/ faxen)  Gid: ( 500/ nsc)
> Access: 2008-02-11 16:11:48.336621154 +0100
> Modify: 2008-02-11 15:11:27.000000000 +0100
> Change: 2008-02-11 15:11:31.352841294 +0100
>
> This seems to happen almost all the time when I am running as a
> specific user on this system. Note that the stat call always works... I
> haven't yet been able to reproduce this problem when running as my own
> user.

EIDRM (Identifier removed) means that your MDS has a user database
(/etc/passwd and /etc/group) that is missing the particular user ID.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

------------------------------

Message: 6
Date: Mon, 11 Feb 2008 16:17:37 -0500
From: Brock Palen <brockp at umich.edu>
Subject: Re: [Lustre-discuss] Lustre clients getting evicted
To: Craig Prescott <prescott at hpc.ufl.edu>
Cc: "Tom.Wang" <Tom.Wang at Sun.COM>, lustre-discuss at lists.lustre.org
Message-ID: <38A6B1A2-E20A-40BC-80C2-CEBB971BDC09 at umich.edu>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed

>> I've increased the timeout to 300 seconds and it has helped
>> marginally.
>
> Hi Aaron;
>
> We set the timeout to a big number (1000 secs) on our 400 node cluster
> (mostly o2ib, some tcp clients). Until we did this, we had loads of
> evictions. In our case, it solved the problem.

This feels excessive. But at this point I guess I'll try it.

> Cheers,
> Craig

------------------------------

Message: 7
Date: Mon, 11 Feb 2008 16:48:05 -0500
From: Aaron Knister <aaron at iges.org>
Subject: Re: [Lustre-discuss] Lustre clients getting evicted
To: Brock Palen <brockp at umich.edu>
Cc: "Tom.Wang" <Tom.Wang at Sun.COM>, lustre-discuss at lists.lustre.org
Message-ID: <7A1D46E5-CC69-4C37-9CC7-B229FCA43BA1 at iges.org>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

So far it's helped. If this doesn't fix it, I'm going to apply the
patch mentioned here:

https://bugzilla.lustre.org/attachment.cgi?id=14006&action=edit

I'll let you know how it goes. If you'd like a copy of the patched
version, let me know. Are you running RHEL/SLES? What version of the
OS and lustre?

-Aaron

On Feb 11, 2008, at 4:17 PM, Brock Palen wrote:
>>> I've increased the timeout to 300 seconds and it has helped
>>> marginally.
>>
>> Hi Aaron;
>>
>> We set the timeout to a big number (1000 secs) on our 400 node cluster
>> (mostly o2ib, some tcp clients). Until we did this, we had loads of
>> evictions. In our case, it solved the problem.
>
> This feels excessive. But at this point I guess I'll try it.
>
>> Cheers,
>> Craig

Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies
(301) 595-7000
aaron at iges.org

------------------------------

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

End of Lustre-discuss Digest, Vol 25, Issue 17
**********************************************
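[Editor's note: for readers skimming this digest, the two concrete fixes
discussed above — checking the MDS user database when EIDRM appears
(Message 5) and raising the server timeout (Messages 4 and 6) — can be
sketched as below. UID 1120 comes from Per's stat output and the
filesystem name nobackup-MDT0000 from Brock's command; both are
placeholders for your own system, and the lctl step only runs where
lctl exists.]

```shell
# 1. EIDRM ("Identifier removed") on readdir: make sure the affected
#    UID resolves on the MDS. 1120 (user faxen) is the UID from Per's
#    stat output above; substitute the UID your client reports.
uid=1120
if getent passwd "$uid" > /dev/null; then
    echo "uid $uid resolves on this host"
else
    echo "uid $uid is missing: add the user to the MDS passwd/group files"
fi

# 2. Raise the server timeout (Craig's cluster uses 1000 secs). Run on
#    the MGS; guarded so this sketch is a no-op on machines without Lustre.
if command -v lctl > /dev/null 2>&1; then
    lctl conf_param nobackup-MDT0000.sys.timeout=1000
fi
```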