thr3ads.net - Lustre discuss - [Lustre-discuss] Weird behavior on lustre clients [Aug 2010]

Jagga Soorma

2010-Aug-08 10:15 UTC

[Lustre-discuss] Weird behavior on lustre clients

Hello,

I am experiencing some weird behavior on my lustre clients.  I have worked
with Novell support and they keep pointing to lustre as the culprit for
these issues.  I am getting intermittent I/O errors when running df/ls on
any nfs mounts without anything being logged in syslog.  After putting nfs
and rpc in debug mode by running:

rpcdebug -m nfs -s all
rpcdebug -m rpc -s all

I now see the following errors in my logs:

..snip..
Aug  8 02:32:56 reshpc115 kernel: RPC:  2440 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug  8 02:32:56 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug  8 02:32:59 reshpc115 kernel: RPC:  2441 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug  8 02:32:59 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug  8 02:47:59 reshpc115 kernel: RPC:  2447 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug  8 02:47:59 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug  8 02:57:59 reshpc115 kernel: RPC:  2451 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug  8 02:57:59 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug  8 02:58:00 reshpc115 kernel: RPC:  2452 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug  8 02:58:00 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug  8 02:58:13 reshpc115 kernel: RPC:  2453 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug  8 02:58:13 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug  8 02:58:26 reshpc115 kernel: RPC:  2454 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug  8 02:58:26 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug  8 02:58:30 reshpc115 kernel: RPC:  2455 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug  8 02:58:30 reshpc115 kernel: nfs_statfs: statfs error = 5
Aug  8 02:58:32 reshpc115 kernel: RPC:  2456 xprt_connect_status: error 99 connecting to server nas-rwc-is2
Aug  8 02:58:32 reshpc115 kernel: nfs_statfs: statfs error = 5
..snip..
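
For reference, the two numeric codes in the log excerpt look like plain Linux errno values, which is the usual convention for kernel debug messages of this kind. Under that assumption, 99 is EADDRNOTAVAIL ("Cannot assign requested address") and 5 is EIO ("Input/output error"), which matches the I/O errors seen from df/ls. The short Python sketch below decodes such lines; the table is hardcoded with the Linux values, and the `decode` helper and the regular expression are illustrative names, not part of any NFS tool.

```python
import re

# Linux errno values relevant to the log excerpt. Hardcoded on purpose so
# the sketch gives the same answer on any platform it is run on.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    99: ("EADDRNOTAVAIL", "Cannot assign requested address"),
}

# Matches both "xprt_connect_status: error 99" and "statfs error = 5".
ERROR_RE = re.compile(r"(xprt_connect_status|nfs_statfs).*?error\s*=?\s*(\d+)")


def decode(line):
    """Return (source, code, name, text) for a log line, or None if no match."""
    m = ERROR_RE.search(line)
    if m is None:
        return None
    code = int(m.group(2))
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in table"))
    return (m.group(1), code, name, text)


sample = [
    "Aug  8 02:32:56 reshpc115 kernel: RPC:  2440 xprt_connect_status: error 99",
    "Aug  8 02:32:56 reshpc115 kernel: nfs_statfs: statfs error = 5",
]
for line in sample:
    print(decode(line))
```

Read this way, the RPC layer fails to set up its connection with EADDRNOTAVAIL and the NFS statfs call then reports EIO to the caller. The sketch only names the codes; it does not say why the address could not be assigned.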

I am using all supported packages/kernels for lustre and on servers without
the lustre clients installed I have no issues with nfs.  Does the interval
between these errors mean anything?
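
One way to look at the interval question is to compute the gaps between consecutive error timestamps. The Python sketch below does that for the nine timestamps in the excerpt above; the `gaps_in_seconds` helper is an illustrative name, and the calculation assumes all entries fall on the same day.

```python
from datetime import datetime

# Timestamps copied from the log excerpt (one per pair of error lines).
STAMPS = [
    "02:32:56", "02:32:59", "02:47:59", "02:57:59", "02:58:00",
    "02:58:13", "02:58:26", "02:58:30", "02:58:32",
]


def gaps_in_seconds(stamps):
    """Return the gap in seconds between consecutive HH:MM:SS stamps."""
    times = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


print(gaps_in_seconds(STAMPS))  # [3, 900, 600, 1, 13, 13, 4, 2]
```

The two long gaps are 900 and 600 seconds (15 and 10 minutes); the rest are a few seconds apart. Because the errors only appear when df/ls is actually run, the excerpt alone cannot show whether those gaps reflect a timer on the client or simply when the commands happened to be issued.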

Any help would be greatly appreciated.

Thanks,
-J

--
reshpc115:~ # uname -a
Linux reshpc115 2.6.27.29-0.1-default #1 SMP 2009-08-15 17:53:59 +0200 x86_64 x86_64 x86_64 GNU/Linux
reshpc115:~ # rpm -qa | grep -i lustre
lustre-client-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default
lustre-client-modules-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default
reshpc115:~ # rpm -qa | grep -i kernel-ib
kernel-ib-1.4.2-2.6.27.29_0.1_default
--

Jagga Soorma

2010-Aug-08 20:44 UTC

One other piece of information.  It seems like I have found a workaround by
adding a cronjob that runs every 2mins and runs a df command.  Is there some
caching issue that might be caused by lustre?
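
The workaround amounts to issuing a statfs-style call against each NFS mount at a fixed interval, which is what df does. A minimal Python sketch of the same idea follows, assuming it would be run from cron or a similar scheduler; the `poll_mounts` helper and the `/mnt/nfs` path are hypothetical and stand in for the real mount points.

```python
import os


def poll_mounts(paths):
    """Call statvfs on each path; map path -> None on success, errno on failure."""
    results = {}
    for path in paths:
        try:
            os.statvfs(path)
            results[path] = None
        except OSError as exc:
            # For the failure described in this thread the value would be
            # expected to be 5 (EIO), matching the nfs_statfs log line.
            results[path] = exc.errno
    return results


if __name__ == "__main__":
    # Hypothetical mount point; replace with the actual NFS mounts.
    print(poll_mounts(["/mnt/nfs"]))
```

Recording the returned errno on each run, rather than discarding the output as a bare df in cron would, keeps a record of whether the failure still occurs between polls.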

Thanks,
-J


Andreas Dilger

2010-Aug-09 01:53 UTC

On 2010-08-08, at 16:44, Jagga Soorma wrote:
> One other piece of information.  It seems like I have found a workaround by
> adding a cronjob that runs every 2mins and runs a df command.  Is there some
> caching issue that might be caused by lustre?

Are the IO errors on NFS filesystems that have nothing to do with Lustre, or is
this from NFS re-exporting of a Lustre filesystem?

>> I am experiencing some weird behavior on my lustre clients.  I have worked
>> with Novell support and they keep pointing to lustre as the culprit for
>> these issues.  I am getting intermittent I/O errors when running df/ls on
>> any nfs mounts without anything being logged in syslog.  After putting nfs
>> and rpc in debug mode by running:
>>
>> I am using all supported packages/kernels for lustre and on servers without
>> the lustre clients installed I have no issues with nfs.  Does the interval
>> between these errors mean anything?
>>
>> Any help would be greatly appreciated.
>>
>> reshpc115:~ # uname -a
>> Linux reshpc115 2.6.27.29-0.1-default #1 SMP 2009-08-15 17:53:59 +0200 x86_64 x86_64 x86_64 GNU/Linux
>> reshpc115:~ # rpm -qa | grep -i lustre
>> lustre-client-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default
>> lustre-client-modules-1.8.1.1-2.6.27.29_0.1_lustre.1.8.1.1_default
>> reshpc115:~ # rpm -qa | grep -i kernel-ib
>> kernel-ib-1.4.2-2.6.27.29_0.1_default

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

Jagga Soorma

2010-Aug-09 03:04 UTC

Andreas,

Yes, these I/O errors are for any NFS filesystems mounted on all lustre
clients.  Even though this nfs mount has nothing to do with lustre there
seems to be something specific on the lustre clients with the kernel-ib and
lustre client modules installed that seems to be causing this problem.

I believe lustre caches data locally and then flushes it out on a regular
basis, but I don't know enough to rule lustre out.  It looks like this issue
is happening every 8-10mins.  Is there something that lustre is doing on the
system that might be flushing some type of a cache or might be causing this
problem?  If I do a df every 5mins or so then I never see this problem.

I have just run out of things to try and wanted to check the lustre route as
a last resort in hopes of getting more information that might help me find a
permanent solution for this issue.

Any assistance/comments would be appreciated.

Thanks,
-J

