anil kumar
2008-Dec-02 05:10 UTC
[Lustre-discuss] NFS Stale Handling with Lustre on RHEL 4 U7 x86_64
Hi All,

We have noticed a few issues with patchless clients being evicted; this does not happen when we use the Lustre-patched kernel. As an alternative we exported the Lustre filesystem over NFS, but we see intermittent "Input/output error" failures on a few folders over NFS, and they recover on their own.

We are using RHEL 4 x86_64 with Lustre 1.6.6. The export is:

/lin_results *(rw,no_root_squash,async)

Note: if I don't use no_root_squash, I get a consistent I/O error while executing "ls".

Please let us know if anyone knows a workaround or fix for this.

Thanks,
Anil
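For reference, a rough sketch of the export described above and how it is typically applied. The path /lin_results and the export options are taken from the message; the reload command is the standard nfs-utils one and is assumed here:

    # /etc/exports on the node re-exporting the Lustre mount over NFS
    /lin_results *(rw,no_root_squash,async)

    # re-read the exports table after editing it
    exportfs -ra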
Alex Lyashkov
2008-Dec-02 15:12 UTC
[Lustre-discuss] NFS Stale Handling with Lustre on RHEL 4 U7 x86_64
On Tue, 2008-12-02 at 10:40 +0530, anil kumar wrote:
> Hi All,
>
> We have noticed a few issues with patchless clients being evicted;
> this does not happen when we use the Lustre-patched kernel.

What kernel version is used for the patchless client? RHEL4 or something else?

> As an alternative we exported the Lustre filesystem over NFS, but we
> see intermittent "Input/output error" failures on a few folders over
> NFS, and they recover on their own.

Can you post more info about it? Console logs, a cut from /var/log/messages on the affected client, the Lustre log?

> We are using RHEL 4 x86_64 with Lustre 1.6.6. The export is:
> "/lin_results *(rw,no_root_squash,async)"
> Note: if I don't use no_root_squash, I get a consistent I/O error
> while executing "ls".
>
> Please let us know if anyone knows a workaround or fix for this.
>
> Thanks,
> Anil
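For what it is worth, a rough sketch of how the requested information is usually gathered on an affected client; the paths are the common defaults, and the commands assume the Lustre userspace tools (lctl) are installed:

    # kernel/console messages around the time of the eviction
    dmesg | grep -i lustre
    grep -i lustre /var/log/messages

    # dump the in-kernel Lustre debug log to a file (lctl dk = debug_kernel)
    lctl dk /tmp/lustre-debug.log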
anil kumar
2008-Dec-04 07:48 UTC
[Lustre-discuss] NFS Stale Handling with Lustre on RHEL 4 U7 x86_64
Alex,

We are working on checking Lustre scalability so that we can take it up in our production infrastructure. Below are the details of our setup, the tests conducted, and the issues faced so far.

Setup details
-------------
Hardware used - HP DL360
MDT/MGS - 1
OST - 13 (13 HP DL360 servers, 1 OSS = 1 OST, 700 GB x 13)

Issue 1
-------
Test environment:
Operating system - Red Hat EL4 Update 7, x86_64
Lustre version - 1.6.5.1
Lustre kernel - kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5.1.x86_64
Lustre client - Xen virtual machines with 2.6.9-78.0.0.0.1.ELxenU kernel (patchless)

Test conducted:
Performed heavy read/write operations from 190 Lustre clients. Each client tries to read and write 14,000 files in parallel.

Errors noticed:
Multiple clients were evicted while writing a huge number of files. The Lustre mount is not accessible on the evicted clients; we need to umount and mount again to make Lustre accessible on the affected clients.

Server-side errors noticed
--------------------------
Nov 26 01:03:48 kernel: LustreError: 29774:0:(handler.c:1515:mds_handle()) operation 41 on unconnected MDS from 12345-[CLIENT IP HERE]@tcp
Nov 26 01:07:46 kernel: Lustre: farmres-MDT0000: haven't heard from client 2379a0f4-f298-9c78-fce6-3d8db74f912b (at [CLIENT IP HERE]@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Nov 26 01:43:58 kernel: Lustre: MGS: haven't heard from client 0c239c47-e1f7-47de-0b43-19d5819081e1 (at [CLIENT IP HERE]@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Nov 26 01:54:37 kernel: LustreError: 29766:0:(handler.c:1515:mds_handle()) operation 101 on unconnected MDS from 12345-[CLIENT IP HERE]@tcp
Nov 26 02:09:49 kernel: LustreError: 29760:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-107) req@000001080ba29400 x260230/t0 o101-><?>@<?>:0/0 lens 440/0 e 0 to 0 dl 1227665489 ref 1 fl Interpret:/0/0 rc -107/0
Nov 27 01:06:07 kernel: LustreError: 30478:0:(mgs_handler.c:538:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Nov 27 02:21:39 kernel: Lustre: 18420:0:(ldlm_lib.c:525:target_handle_reconnect()) farmres-MDT0000: 180cf598-1e43-3ea4-6cf6-0ee40e5a2d5e reconnecting
Nov 27 02:22:16 kernel: Lustre: Request x2282604 sent from farmres-MDT0000 to NID [CLIENT IP HERE]@tcp 6s ago has timed out (limit 6s).
Nov 27 02:22:16 kernel: LustreError: 138-a: farmres-MDT0000: A client on nid [CLIENT IP HERE]@tcp was evicted due to a lock blocking callback to [CLIENT IP HERE]@tcp timed out: rc -107
Nov 27 08:58:46 kernel: LustreError: 29755:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout exceeded for key 0
Nov 27 08:59:11 kernel: LustreError: 18473:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout exceeded for key 0
Nov 27 13:23:25 kernel: Lustre: 29752:0:(ldlm_lib.c:525:target_handle_reconnect()) farmres-MDT0000: 3d5efff1-1652-6669-94de-c93ee73a4bc7 reconnecting
Nov 27 02:17:16 kernel: nfs_statfs: statfs error = 116

Client errors
-------------
cp: cannot open `/master/jdk16/sample/jnlp/webpad/src/version1/ClipboardHandler.java' for reading: Input/output error
cp: cannot stat `/master/jdk16/sample/jnlp/webpad/src/version1/CopyAction.java': Cannot send after transport endpoint shutdown
cp: cannot stat `/master/jdk16/sample/jnlp/webpad/src/version1/CutAction.java': Cannot send after transport endpoint shutdown
cp: cannot stat `/master/jdk16/sample/jnlp/webpad/src/version1/ExitAction.java': Cannot send after transport endpoint shutdown
cp: cannot stat `/master/jdk16/sample/jnlp/webpad/src/version1/FileHandler.java': Cannot send after transport endpoint shutdown
cp: cannot stat `/master/jdk16/sample/jnlp/webpad/src/version1/HelpAction.java': Cannot send after transport endpoint shutdown
cp: cannot stat `/master/jdk16/sample/jnlp/webpad/src/version1/HelpHandler.java': Cannot send after transport endpoint shutdown
cp: cannot stat `/master/jdk16/sample/jnlp/webpad/src/version1/JLFAbstractAction.java': Cannot send after transport endpoint shutdown

Does Lustre support the Xen kernel 2.6.9-78.0.0.0.1.ELxenU as a patchless client?

Issue 2 - Tested with Lustre 1.6.6 to see whether the client eviction issue persists
-------------------------------------------------------------------------------------
Test environment:
Operating system - Red Hat EL4 Update 7, x86_64
Lustre version - 1.6.6
Lustre kernel - kernel-lustre-smp-2.6.9-67.0.22.EL_lustre.1.6.6.x86_64
Lustre client - Xen virtual machines with 2.6.9-78.0.0.0.1.ELxenU kernel (patchless)

Test conducted:
Performed heavy read/write operations from 190 Lustre clients. Each client tries to read and write 14,000 files in parallel.

Errors noticed:
The same eviction issue was seen with 1.6.6 too, but the evicted clients reconnected and the Lustre filesystem became accessible again without needing to umount and mount. The write operations in progress on the evicted clients were terminated.

Errors - same as Issue 1.

Issue 3 - Tried accessing Lustre via NFS, since the eviction issue was noticed only on patchless clients
---------------------------------------------------------------------------------------------------------
Test environment:
Operating system - Red Hat EL4 Update 7, x86_64
Lustre version - 1.6.6
Lustre kernel - kernel-lustre-smp-2.6.9-67.0.22.EL_lustre.1.6.6.x86_64
Lustre client - Xen virtual machines with 2.6.9-78.0.0.0.1.ELxenU kernel (NFS clients)
All 13 OSTs act as Lustre clients and NFS servers, exporting the Lustre filesystem via NFS.

NFS export options - *(rw,no_root_squash,async). We finally settled on this option because of the problems seen with the other options:
*(rw) -- I/O errors seen on the clients (nfs terminate -61); this was fixed by adding "no_root_squash".
*(rw,no_root_squash,sync) -- with sync, the MDT was overloaded and writes took a long time.
Test conducted:
Performed heavy read/write operations from 190 NFS clients. Each client tries to read and write 14,000 files in parallel.

Errors noticed:
Lustre was able to withstand the read and write operations, but we see I/O errors while deleting a large number of files. The filesystem remains accessible on the OST (NFS server), and no logs related to this issue appear on the NFS server or the client. We understand from some threads that this I/O issue is fixed in the EL5 kernel, so we moved all MDT and OST nodes to EL5.

Issue 4
-------
Test environment:
Operating system - Red Hat EL5 Update 2, x86_64
Lustre version - 1.6.5.1
Lustre kernel - 2.6.18-53.1.14.el5_lustre.1.6.5.1smp
Lustre client - Xen virtual machines with 2.6.9-78.0.0.0.1.ELxenU kernel (NFS clients)
All OSTs act as Lustre clients and NFS servers, exporting the Lustre filesystem via NFS.

Test conducted:
Performed heavy read/write operations from 190 NFS clients. Each client tries to read and write 14,000 files in parallel.

Errors:
We still intermittently see I/O errors on the clients while deleting a large number of files, but the filesystem remains accessible on the OST (NFS server), and no logs related to this issue appear on the NFS server or the client. We also noticed that the MDT is more loaded than with the EL4 kernel on the same hardware (load average consistently above 20 while writing a large number of small files).

Additionally, the iSCSI modules are not working with the EL5 Lustre kernel, so we could not use iSCSI volumes for MDT failover:
(iscsistart: Missing or Invalid version from /sys/module/scsi_transport_iscsi/version. Make sure a up to date scsi_transport_iscsi module is loaded and a up todate version of iscsid is running. Exiting.)

Thanks,
Anil
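For reference, a rough sketch of the re-export setup described in Issues 3 and 4, with placeholder NIDs and mount points; only the fsname "farmres" (from the logs above) and the export options (rw,no_root_squash,async) come from the thread, everything else is assumed:

    # on each OSS that also acts as a Lustre client and NFS server
    mount -t lustre [MGS NID]@tcp:/farmres /mnt/farmres    # mount point assumed

    # /etc/exports
    /mnt/farmres *(rw,no_root_squash,async)

    # publish the export and start the NFS server (RHEL 4/5 style)
    exportfs -ra
    service nfs start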
Alex Lyashkov
2008-Dec-04 13:53 UTC
[Lustre-discuss] NFS Stale Handling with Lustre on RHEL 4 U7 x86_64
On Thu, 2008-12-04 at 13:18 +0530, anil kumar wrote:
> Alex,
>
> We are working on checking Lustre scalability so that we can take it
> up in our production infrastructure. Below are the details of our
> setup, the tests conducted, and the issues faced so far.
>
> Setup details:
> Hardware used - HP DL360
> MDT/MGS - 1
> OST - 13 (13 HP DL360 servers, 1 OSS = 1 OST, 700 GB x 13)
>
> Issue 1
> Test environment:
> Operating system - Red Hat EL4 Update 7, x86_64
> Lustre version - 1.6.5.1
> Lustre kernel - kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5.1.x86_64

I think this is for the server?

> Lustre client - Xen virtual machines with 2.6.9-78.0.0.0.1.ELxenU
> kernel (patchless)

A 2.6.9 kernel for a patchless client is dangerous - some problems cannot be fixed because of kernel-internal limitations. I suggest applying the vfs_intent and dcache patches.

> Test conducted: Performed heavy read/write operations from 190 Lustre
> clients. Each client tries to read and write 14,000 files in parallel.
>
> Errors noticed: Multiple clients were evicted while writing a huge
> number of files. The Lustre mount is not accessible on the evicted
> clients; we need to umount and mount again to make Lustre accessible
> on the affected clients.
>
> Server-side errors noticed:
> Nov 26 01:03:48 kernel: LustreError: 29774:0:(handler.c:1515:mds_handle()) operation 41 on unconnected MDS from 12345-[CLIENT IP HERE]@tcp
> Nov 26 01:07:46 kernel: Lustre: farmres-MDT0000: haven't heard from client 2379a0f4-f298-9c78-fce6-3d8db74f912b (at [CLIENT IP HERE]@tcp) in 227 seconds. I think it's dead, and I am evicting it.
> Nov 26 01:43:58 kernel: Lustre: MGS: haven't heard from client 0c239c47-e1f7-47de-0b43-19d5819081e1 (at [CLIENT IP HERE]@tcp) in 227 seconds. I think it's dead, and I am evicting it.

Both the MDS and the MGS are evicting the client - is the network link OK?

> Nov 26 01:54:37 kernel: LustreError: 29766:0:(handler.c:1515:mds_handle()) operation 101 on unconnected MDS from 12345-[CLIENT IP HERE]@tcp
> Nov 26 02:09:49 kernel: LustreError: 29760:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-107) req@000001080ba29400 x260230/t0 o101-><?>@<?>:0/0 lens 440/0 e 0 to 0 dl 1227665489 ref 1 fl Interpret:/0/0 rc -107/0
> Nov 27 01:06:07 kernel: LustreError: 30478:0:(mgs_handler.c:538:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Nov 27 02:21:39 kernel: Lustre: 18420:0:(ldlm_lib.c:525:target_handle_reconnect()) farmres-MDT0000: 180cf598-1e43-3ea4-6cf6-0ee40e5a2d5e reconnecting
> Nov 27 02:22:16 kernel: Lustre: Request x2282604 sent from farmres-MDT0000 to NID [CLIENT IP HERE]@tcp 6s ago has timed out (limit 6s).
> Nov 27 02:22:16 kernel: LustreError: 138-a: farmres-MDT0000: A client on nid [CLIENT IP HERE]@tcp was evicted due to a lock blocking callback to [CLIENT IP HERE]@tcp timed out: rc -107
> Nov 27 08:58:46 kernel: LustreError: 29755:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout exceeded for key 0
> Nov 27 08:59:11 kernel: LustreError: 18473:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout exceeded for key 0

Hm... as far as I know this is a bug in the filesystem configuration.
Can you reset mdt.group_upcall to 'NONE'? (See the sketch after this message.)

> Nov 27 13:23:25 kernel: Lustre: 29752:0:(ldlm_lib.c:525:target_handle_reconnect()) farmres-MDT0000: 3d5efff1-1652-6669-94de-c93ee73a4bc7 reconnecting
> Nov 27 02:17:16 kernel: nfs_statfs: statfs error = 116
>
> Client errors:
> cp: cannot stat `/master/jdk16/sample/jnlp/webpad/src/version1/JLFAbstractAction.java': Cannot send after transport endpoint shutdown
>
> Does Lustre support the Xen kernel 2.6.9-78.0.0.0.1.ELxenU as a
> patchless client?

With some limitations. I suggest using 2.6.15 and up for a patchless client. For 2.6.16 I know about one limitation - the FMODE_EXEC patch is absent.

What is in the clients' /var/log/messages at the same time?
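For reference, a rough sketch of what disabling the group upcall could look like on a Lustre 1.6 MDS; the fsname "farmres" is taken from the logs above, and the exact parameter name and /proc path should be checked against your version:

    # persistent setting, run on the MGS
    lctl conf_param farmres.mdt.group_upcall=NONE

    # or temporarily, directly via /proc on the MDS
    echo NONE > /proc/fs/lustre/mds/farmres-MDT0000/group_upcall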
Peter Kjellstrom
2008-Dec-08 19:39 UTC
[Lustre-discuss] NFS Stale Handling with Lustre on RHEL 4 U7 x86_64
On Thursday 04 December 2008, Alex Lyashkov wrote:
> On Thu, 2008-12-04 at 13:18 +0530, anil kumar wrote:
> > Lustre client - Xen virtual machines with 2.6.9-78.0.0.0.1.ELxenU
> > kernel (patchless)
>
> A 2.6.9 kernel for a patchless client is dangerous - some problems
> cannot be fixed because of kernel-internal limitations. I suggest
> applying the vfs_intent and dcache patches.

Hmm... Brian stated that patchless was OK now for modern EL4 kernels like 2.6.9-78 (thread "[Lustre-discuss] Is patchless ok for EL4 now?", early November). Are you saying that this is not true and that CFS/Sun still thinks the patchless client will have problems on EL4?

/Peter