anil kumar
2008-Dec-02 05:10 UTC
[Lustre-discuss] NFS Stale Handling with Lustre on RHEL 4 U7 x86_64
Hi All,

We have noticed a few issues with patchless clients being evicted; this does not happen when we use the Lustre-patched kernel. As an alternative we exported the Lustre filesystem over NFS, but we see intermittent "Input/output error" failures on a few folders over NFS, and they recover on their own.

We are using RHEL 4 x86_64 with Lustre 1.6.6. The export is:

/lin_results *(rw,no_root_squash,async)

Note: if I don't use no_root_squash, I get a consistent I/O error while executing "ls".

Please let us know if anyone knows a workaround or fix for this.

Thanks,
Anil
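For reference, a rough sketch of the export described above and how it is typically applied. The path /lin_results and the export options are taken from the message; the reload command is the standard nfs-utils one and is assumed here:

    # /etc/exports on the node re-exporting the Lustre mount over NFS
    /lin_results *(rw,no_root_squash,async)

    # re-read the exports table after editing it
    exportfs -ra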
Alex Lyashkov
2008-Dec-02 15:12 UTC
[Lustre-discuss] NFS Stale Handling with Lustre on RHEL 4 U7 x86_64
On Tue, 2008-12-02 at 10:40 +0530, anil kumar wrote:
> Hi All,
>
> We have noticed a few issues with patchless clients being evicted;
> this does not happen when we use the Lustre-patched kernel.

What kernel version is used for the patchless client? RHEL4 or something else?

> As an alternative we exported the Lustre filesystem over NFS, but we
> see intermittent "Input/output error" failures on a few folders over
> NFS, and they recover on their own.

Can you post more info about it? Console logs, a cut from /var/log/messages on the affected client, the Lustre log?

> We are using RHEL 4 x86_64 with Lustre 1.6.6. The export is:
> "/lin_results *(rw,no_root_squash,async)"
> Note: if I don't use no_root_squash, I get a consistent I/O error
> while executing "ls".
>
> Please let us know if anyone knows a workaround or fix for this.
>
> Thanks,
> Anil
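For what it is worth, a rough sketch of how the requested information is usually gathered on an affected client; the paths are the common defaults, and the commands assume the Lustre userspace tools (lctl) are installed:

    # kernel/console messages around the time of the eviction
    dmesg | grep -i lustre
    grep -i lustre /var/log/messages

    # dump the in-kernel Lustre debug log to a file (lctl dk = debug_kernel)
    lctl dk /tmp/lustre-debug.log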
anil kumar
2008-Dec-04 07:48 UTC
[Lustre-discuss] NFS Stale Handling with Lustre on RHEL 4 U7 x86_64
Alex,

We are working on checking Lustre scalability so that we can take it up in our production infrastructure. Below are the details of our setup, the tests conducted, and the issues faced so far.

Setup details
-------------
Hardware used - HP DL360
MDT/MGS - 1
OST - 13 (13 HP DL360 servers, 1 OSS = 1 OST, 700 GB x 13)

Issue 1
-------
Test environment:
Operating system - Red Hat EL4 Update 7, x86_64
Lustre version - 1.6.5.1
Lustre kernel - kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5.1.x86_64
Lustre client - Xen virtual machines with 2.6.9-78.0.0.0.1.ELxenU kernel (patchless)

Test conducted:
Performed heavy read/write operations from 190 Lustre clients. Each client tries to read and write 14,000 files in parallel.

Errors noticed:
Multiple clients were evicted while writing a huge number of files. The Lustre mount is not accessible on the evicted clients; we need to umount and mount again to make Lustre accessible on the affected clients.

Server-side errors noticed
--------------------------
Nov 26 01:03:48 kernel: LustreError: 29774:0:(handler.c:1515:mds_handle()) operation 41 on unconnected MDS from 12345-[CLIENT IP HERE]@tcp
Nov 26 01:07:46 kernel: Lustre: farmres-MDT0000: haven't heard from client 2379a0f4-f298-9c78-fce6-3d8db74f912b (at [CLIENT IP HERE]@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Nov 26 01:43:58 kernel: Lustre: MGS: haven't heard from client 0c239c47-e1f7-47de-0b43-19d5819081e1 (at [CLIENT IP HERE]@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Nov 26 01:54:37 kernel: LustreError: 29766:0:(handler.c:1515:mds_handle()) operation 101 on unconnected MDS from 12345-[CLIENT IP HERE]@tcp
Nov 26 02:09:49 kernel: LustreError: 29760:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-107) req@000001080ba29400 x260230/t0 o101-><?>@<?>:0/0 lens 440/0 e 0 to 0 dl 1227665489 ref 1 fl Interpret:/0/0 rc -107/0
Nov 27 01:06:07 kernel: LustreError: 30478:0:(mgs_handler.c:538:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
Nov 27 02:21:39 kernel: Lustre: 18420:0:(ldlm_lib.c:525:target_handle_reconnect()) farmres-MDT0000: 180cf598-1e43-3ea4-6cf6-0ee40e5a2d5e reconnecting
Nov 27 02:22:16 kernel: Lustre: Request x2282604 sent from farmres-MDT0000 to NID [CLIENT IP HERE]@tcp 6s ago has timed out (limit 6s).
Nov 27 02:22:16 kernel: LustreError: 138-a: farmres-MDT0000: A client on nid [CLIENT IP HERE]@tcp was evicted due to a lock blocking callback to [CLIENT IP HERE]@tcp timed out: rc -107
Nov 27 08:58:46 kernel: LustreError: 29755:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout exceeded for key 0
Nov 27 08:59:11 kernel: LustreError: 18473:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout exceeded for key 0
Nov 27 13:23:25 kernel: Lustre: 29752:0:(ldlm_lib.c:525:target_handle_reconnect()) farmres-MDT0000: 3d5efff1-1652-6669-94de-c93ee73a4bc7 reconnecting
Nov 27 02:17:16 kernel: nfs_statfs: statfs error = 116

Client errors
-------------
cp: cannot open `/master/jdk16/sample/jnlp/webpad/src/version1/ClipboardHandler.java' for reading: Input/output error
cp: cannot stat `/master/jdk16/sample/jnlp/webpad/src/version1/CopyAction.java': Cannot send after transport endpoint shutdown
cp: cannot stat `/master/jdk16/sample/jnlp/webpad/src/version1/CutAction.java': Cannot send after transport endpoint shutdown
cp: cannot stat `/master/jdk16/sample/jnlp/webpad/src/version1/ExitAction.java': Cannot send after transport endpoint shutdown
cp: cannot stat `/master/jdk16/sample/jnlp/webpad/src/version1/FileHandler.java': Cannot send after transport endpoint shutdown
cp: cannot stat `/master/jdk16/sample/jnlp/webpad/src/version1/HelpAction.java': Cannot send after transport endpoint shutdown
cp: cannot stat `/master/jdk16/sample/jnlp/webpad/src/version1/HelpHandler.java': Cannot send after transport endpoint shutdown
cp: cannot stat `/master/jdk16/sample/jnlp/webpad/src/version1/JLFAbstractAction.java': Cannot send after transport endpoint shutdown

Does Lustre support the Xen kernel 2.6.9-78.0.0.0.1.ELxenU as a patchless client?

Issue 2 - Tested with Lustre 1.6.6 to see whether the client eviction issue persists
-------------------------------------------------------------------------------------
Test environment:
Operating system - Red Hat EL4 Update 7, x86_64
Lustre version - 1.6.6
Lustre kernel - kernel-lustre-smp-2.6.9-67.0.22.EL_lustre.1.6.6.x86_64
Lustre client - Xen virtual machines with 2.6.9-78.0.0.0.1.ELxenU kernel (patchless)

Test conducted:
Performed heavy read/write operations from 190 Lustre clients. Each client tries to read and write 14,000 files in parallel.

Errors noticed:
The same eviction issue was seen with 1.6.6 too, but the evicted clients reconnected and the Lustre filesystem became accessible again without needing to umount and mount. The write operations in progress on the evicted clients were terminated.

Errors - same as Issue 1.

Issue 3 - Tried accessing Lustre via NFS, since the eviction issue was noticed only on patchless clients
---------------------------------------------------------------------------------------------------------
Test environment:
Operating system - Red Hat EL4 Update 7, x86_64
Lustre version - 1.6.6
Lustre kernel - kernel-lustre-smp-2.6.9-67.0.22.EL_lustre.1.6.6.x86_64
Lustre client - Xen virtual machines with 2.6.9-78.0.0.0.1.ELxenU kernel (NFS clients)
All 13 OSTs act as Lustre clients and NFS servers, exporting the Lustre filesystem via NFS.

NFS export options - *(rw,no_root_squash,async). We finally settled on this option because of the problems seen with the other options:
*(rw) -- I/O errors seen on the clients (nfs terminate -61); this was fixed by adding "no_root_squash".
*(rw,no_root_squash,sync) -- with sync, the MDT was overloaded and writes took a long time.
Test conducted:
Performed heavy read/write operations from 190 NFS clients. Each client tries to read and write 14,000 files in parallel.

Errors noticed:
Lustre was able to withstand the read and write operations, but we see I/O errors while deleting a large number of files. The filesystem remains accessible on the OST (NFS server), and no logs related to this issue appear on the NFS server or the client. We understand from some threads that this I/O issue is fixed in the EL5 kernel, so we moved all MDT and OST nodes to EL5.

Issue 4
-------
Test environment:
Operating system - Red Hat EL5 Update 2, x86_64
Lustre version - 1.6.5.1
Lustre kernel - 2.6.18-53.1.14.el5_lustre.1.6.5.1smp
Lustre client - Xen virtual machines with 2.6.9-78.0.0.0.1.ELxenU kernel (NFS clients)
All OSTs act as Lustre clients and NFS servers, exporting the Lustre filesystem via NFS.

Test conducted:
Performed heavy read/write operations from 190 NFS clients. Each client tries to read and write 14,000 files in parallel.

Errors:
We still intermittently see I/O errors on the clients while deleting a large number of files, but the filesystem remains accessible on the OST (NFS server), and no logs related to this issue appear on the NFS server or the client. We also noticed that the MDT is more loaded than with the EL4 kernel on the same hardware (load average consistently above 20 while writing a large number of small files).

Additionally, the iSCSI modules are not working with the EL5 Lustre kernel, so we could not use iSCSI volumes for MDT failover:
(iscsistart: Missing or Invalid version from /sys/module/scsi_transport_iscsi/version. Make sure a up to date scsi_transport_iscsi module is loaded and a up todate version of iscsid is running. Exiting.)

Thanks,
Anil
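For reference, a rough sketch of the re-export setup described in Issues 3 and 4, with placeholder NIDs and mount points; only the fsname "farmres" (from the logs above) and the export options (rw,no_root_squash,async) come from the thread, everything else is assumed:

    # on each OSS that also acts as a Lustre client and NFS server
    mount -t lustre [MGS NID]@tcp:/farmres /mnt/farmres    # mount point assumed

    # /etc/exports
    /mnt/farmres *(rw,no_root_squash,async)

    # publish the export and start the NFS server (RHEL 4/5 style)
    exportfs -ra
    service nfs start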
Alex Lyashkov
2008-Dec-04 13:53 UTC
[Lustre-discuss] NFS Stale Handling with Lustre on RHEL 4 U7 x86_64
On Thu, 2008-12-04 at 13:18 +0530, anil kumar wrote:
> Alex,
>
> We are working on checking Lustre scalability so that we can take it
> up in our production infrastructure. Below are the details of our
> setup, the tests conducted, and the issues faced so far.
>
> Setup details:
> Hardware used - HP DL360
> MDT/MGS - 1
> OST - 13 (13 HP DL360 servers, 1 OSS = 1 OST, 700 GB x 13)
>
> Issue 1
> Test environment:
> Operating system - Red Hat EL4 Update 7, x86_64
> Lustre version - 1.6.5.1
> Lustre kernel - kernel-lustre-smp-2.6.9-67.0.7.EL_lustre.1.6.5.1.x86_64

I think this is for the server?

> Lustre client - Xen virtual machines with 2.6.9-78.0.0.0.1.ELxenU
> kernel (patchless)

A 2.6.9 kernel for a patchless client is dangerous - some problems cannot be fixed because of kernel-internal limitations. I suggest applying the vfs_intent and dcache patches.

> Test conducted: Performed heavy read/write operations from 190 Lustre
> clients. Each client tries to read and write 14,000 files in parallel.
>
> Errors noticed: Multiple clients were evicted while writing a huge
> number of files. The Lustre mount is not accessible on the evicted
> clients; we need to umount and mount again to make Lustre accessible
> on the affected clients.
>
> Server-side errors noticed:
> Nov 26 01:03:48 kernel: LustreError: 29774:0:(handler.c:1515:mds_handle()) operation 41 on unconnected MDS from 12345-[CLIENT IP HERE]@tcp
> Nov 26 01:07:46 kernel: Lustre: farmres-MDT0000: haven't heard from client 2379a0f4-f298-9c78-fce6-3d8db74f912b (at [CLIENT IP HERE]@tcp) in 227 seconds. I think it's dead, and I am evicting it.
> Nov 26 01:43:58 kernel: Lustre: MGS: haven't heard from client 0c239c47-e1f7-47de-0b43-19d5819081e1 (at [CLIENT IP HERE]@tcp) in 227 seconds. I think it's dead, and I am evicting it.

Both the MDS and the MGS are evicting the client - is the network link OK?

> Nov 26 01:54:37 kernel: LustreError: 29766:0:(handler.c:1515:mds_handle()) operation 101 on unconnected MDS from 12345-[CLIENT IP HERE]@tcp
> Nov 26 02:09:49 kernel: LustreError: 29760:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-107) req@000001080ba29400 x260230/t0 o101-><?>@<?>:0/0 lens 440/0 e 0 to 0 dl 1227665489 ref 1 fl Interpret:/0/0 rc -107/0
> Nov 27 01:06:07 kernel: LustreError: 30478:0:(mgs_handler.c:538:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
> Nov 27 02:21:39 kernel: Lustre: 18420:0:(ldlm_lib.c:525:target_handle_reconnect()) farmres-MDT0000: 180cf598-1e43-3ea4-6cf6-0ee40e5a2d5e reconnecting
> Nov 27 02:22:16 kernel: Lustre: Request x2282604 sent from farmres-MDT0000 to NID [CLIENT IP HERE]@tcp 6s ago has timed out (limit 6s).
> Nov 27 02:22:16 kernel: LustreError: 138-a: farmres-MDT0000: A client on nid [CLIENT IP HERE]@tcp was evicted due to a lock blocking callback to [CLIENT IP HERE]@tcp timed out: rc -107
> Nov 27 08:58:46 kernel: LustreError: 29755:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout exceeded for key 0
> Nov 27 08:59:11 kernel: LustreError: 18473:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout exceeded for key 0

Hm... as far as I know this is a bug in the filesystem configuration.
Can you reset mdt.group_upcall to 'NONE'? (See the sketch after this message.)

> Nov 27 13:23:25 kernel: Lustre: 29752:0:(ldlm_lib.c:525:target_handle_reconnect()) farmres-MDT0000: 3d5efff1-1652-6669-94de-c93ee73a4bc7 reconnecting
> Nov 27 02:17:16 kernel: nfs_statfs: statfs error = 116
>
> Client errors:
> cp: cannot stat `/master/jdk16/sample/jnlp/webpad/src/version1/JLFAbstractAction.java': Cannot send after transport endpoint shutdown
>
> Does Lustre support the Xen kernel 2.6.9-78.0.0.0.1.ELxenU as a
> patchless client?

With some limitations. I suggest using 2.6.15 and up for a patchless client. For 2.6.16 I know about one limitation - the FMODE_EXEC patch is absent.

What is in the clients' /var/log/messages at the same time?
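For reference, a rough sketch of what disabling the group upcall could look like on a Lustre 1.6 MDS; the fsname "farmres" is taken from the logs above, and the exact parameter name and /proc path should be checked against your version:

    # persistent setting, run on the MGS
    lctl conf_param farmres.mdt.group_upcall=NONE

    # or temporarily, directly via /proc on the MDS
    echo NONE > /proc/fs/lustre/mds/farmres-MDT0000/group_upcall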
Peter Kjellstrom
2008-Dec-08 19:39 UTC
[Lustre-discuss] NFS Stale Handling with Lustre on RHEL 4 U7 x86_64
On Thursday 04 December 2008, Alex Lyashkov wrote:
> On Thu, 2008-12-04 at 13:18 +0530, anil kumar wrote:
> > Lustre client - Xen virtual machines with 2.6.9-78.0.0.0.1.ELxenU
> > kernel (patchless)
>
> A 2.6.9 kernel for a patchless client is dangerous - some problems
> cannot be fixed because of kernel-internal limitations. I suggest
> applying the vfs_intent and dcache patches.

Hmm... Brian stated that patchless was OK now for modern EL4 kernels like 2.6.9-78 (thread "[Lustre-discuss] Is patchless ok for EL4 now?", early November). Are you saying that this is not true and that CFS/Sun still thinks the patchless client will have problems on EL4?

/Peter