Hello,

I am experiencing some weird behavior on my Lustre clients. I have worked with Novell support and they keep pointing to Lustre as the culprit for these issues. I am getting intermittent I/O errors when running df/ls on any NFS mounts, without anything being logged in syslog. After putting nfs and rpc in debug mode by running:

rpcdebug -m nfs -s all
rpcdebug -m rpc -s all

I now see the following errors in my logs:

..snip..
Aug 8 02:32:56 reshpc115 kernel: RPC: 2440 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug 8 02:32:56 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug 8 02:32:59 reshpc115 kernel: RPC: 2441 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug 8 02:32:59 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug 8 02:47:59 reshpc115 kernel: RPC: 2447 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug 8 02:47:59 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug 8 02:57:59 reshpc115 kernel: RPC: 2451 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug 8 02:57:59 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug 8 02:58:00 reshpc115 kernel: RPC: 2452 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug 8 02:58:00 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug 8 02:58:13 reshpc115 kernel: RPC: 2453 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug 8 02:58:13 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug 8 02:58:26 reshpc115 kernel: RPC: 2454 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug 8 02:58:26 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug 8 02:58:30 reshpc115 kernel: RPC: 2455 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug 8 02:58:30 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug 8 02:58:32 reshpc115 kernel: RPC: 2456 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug 8 02:58:32 reshpc115 kernel: nfs_statfs: statfs error = 5
..snip..

I am using all supported packages/kernels for Lustre, and on servers without the Lustre clients installed I have no issues with NFS. Does the interval between these errors mean anything?

Any help would be greatly appreciated.

Thanks,
-J

--
reshpc115:~ # uname -a
Linux reshpc115 2.6.27.29-0.1-default #1 SMP 2009-08-15 17:53:59 +0200 x86_64 x86_64 x86_64 GNU/Linux
reshpc115:~ # rpm -qa | grep -i lustre
lustre-client-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default
lustre-client-modules-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default
reshpc115:~ # rpm -qa | grep -i kernel-ib
kernel-ib-1.4.2-2.6.27.29_0.1_default
--
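For readers following the log excerpt: on Linux, errno 99 is EADDRNOTAVAIL and errno 5 is EIO, so each pair of lines appears to show a failed connect attempt followed by the I/O error that df/ls reports. This does not explain why the connect fails. The sketch below is only a lookup aid; the function name errno_name is made up for illustration, and the numbers are the Linux values (other systems number them differently).

```shell
#!/bin/sh
# Map the two error numbers seen in the log to their Linux errno names.
# Only the values from the excerpt are covered; errno_name is a hypothetical helper.
errno_name() {
    case "$1" in
        5)  echo "EIO (Input/output error)" ;;
        99) echo "EADDRNOTAVAIL (Cannot assign requested address)" ;;
        *)  echo "unknown ($1)" ;;
    esac
}

errno_name 99   # from "xprt_connect_status: error 99"
errno_name 5    # from "nfs_statfs: statfs error = 5"
```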
One other piece of information. It seems I have found a workaround: a cron job that runs a df command every 2 minutes. Is there some caching issue that might be caused by Lustre?

Thanks,
-J

On Sun, Aug 8, 2010 at 3:15 AM, Jagga Soorma <jagga13 at gmail.com> wrote:
> Hello,
>
> I am experiencing some weird behavior on my lustre clients.
> ..snip..
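The workaround described above could look roughly like the following. This is a sketch under assumptions, not the poster's actual cron job: the script path in the crontab comment, the function name nfs_mountpoints, and the choice to touch only NFS mounts listed in /proc/mounts are all illustrative. Mount points containing spaces are not handled.

```shell
#!/bin/sh
# Hypothetical keepalive script, run from cron every 2 minutes, e.g.:
#   */2 * * * * /usr/local/sbin/nfs-keepalive.sh    (path is an assumption)

# Print the mount point of every NFS filesystem listed in a mounts file.
# Fields in /proc/mounts: device, mount point, filesystem type, options, ...
nfs_mountpoints() {
    awk '$3 == "nfs" || $3 == "nfs4" { print $2 }' "${1:-/proc/mounts}"
}

# Run df against each NFS mount and discard the output; only the
# statfs call itself matters for the workaround.
nfs_mountpoints | while read -r mp; do
    df -P "$mp" > /dev/null 2>&1
done
```

Whether a plain df with no arguments or a per-mount df is used should not matter for the effect described; the per-mount form just avoids touching unrelated filesystems.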
On 2010-08-08, at 16:44, Jagga Soorma wrote:
> One other piece of information. It seems like I have found a workaround by adding a cronjob that runs every 2mins and runs a df command. Is there some caching issue that might be caused by lustre?

Are the IO errors on NFS filesystems that have nothing to do with Lustre, or is this from NFS re-exporting of a Lustre filesystem?

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
Andreas,

Yes, these I/O errors are for any NFS filesystems mounted on all Lustre clients. Even though this NFS mount has nothing to do with Lustre, there seems to be something specific to the Lustre clients, which have the kernel-ib and Lustre client modules installed, that is causing this problem. I believe Lustre caches data locally and then flushes it out on a regular basis, but I don't know enough to rule Lustre out.

It looks like this issue is happening every 8-10 minutes. Is there something that Lustre is doing on the system that might be flushing some type of cache, or otherwise causing this problem? If I do a df every 5 minutes or so then I never see this problem.

I have just run out of things to try and wanted to check the Lustre route as a last resort, in hopes of getting more information that might help me find a permanent solution for this issue. Any assistance/comments would be appreciated.

Thanks,
-J

On Sun, Aug 8, 2010 at 6:53 PM, Andreas Dilger <andreas.dilger at oracle.com> wrote:
> Are the IO errors on NFS filesystems that have nothing to do with Lustre,
> or is this from NFS re-exporting of a Lustre filesystem?
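The spacing of the errors can be measured straight from the syslog lines rather than estimated. A minimal sketch, assuming classic syslog timestamps and that all lines fall on the same day (a midnight rollover is not handled); the function name error_gaps is made up for illustration, and the sample input is taken from the log excerpt in the first message.

```shell
#!/bin/sh
# Print the time of each RPC connect error and the gap, in seconds,
# since the previous one. Reads syslog lines on stdin.
error_gaps() {
    awk '/xprt_connect_status: error/ {
        split($3, t, ":")                       # $3 is HH:MM:SS
        now = t[1] * 3600 + t[2] * 60 + t[3]    # seconds since midnight
        if (seen) print $3, now - prev
        prev = now
        seen = 1
    }'
}

error_gaps <<'EOF'
Aug 8 02:32:56 reshpc115 kernel: RPC: 2440 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug 8 02:32:56 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug 8 02:32:59 reshpc115 kernel: RPC: 2441 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug 8 02:47:59 reshpc115 kernel: RPC: 2447 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug 8 02:57:59 reshpc115 kernel: RPC: 2451 xprt_connect_status: error 99 connecting to server nas-rwc-is2
EOF
```

For these sample lines the gaps come out as 3, 900 and 600 seconds. Run against the full log (for example with the system log file on stdin), the same function would show whether the longer gaps cluster around a fixed period.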