This bug/warning/error has not necessarily been associated with data loss, but we
are finding that our gluster fs is interrupting our cluster jobs with
"Stale NFS file handle" warnings like this (on the client):
[2013-01-03 12:30:59.149230] W [client3_1-fops.c:2630:client3_1_lookup_cbk] 0-gl-client-0: remote operation failed: Stale NFS file handle. Path: /bio/krthornt/build_div/yak/line06_CY08A/prinses (3b0aa7b2-bf7f-4b27-b515-32e94b1206e3)
(and 7 more, identical except for timestamps differing by <<1s).
The dir mentioned existed before the job was asked to read from it, and
shortly after the SGE job failed I checked that the glusterfs (/bio) was
still mounted and that the dir was still r/w.
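For what it's worth, the gfid in parentheses can also be checked directly on
the bricks: assuming the 3.3-style brick layout, every entry is also linked
under <brick>/.glusterfs by the first two byte-pairs of its gfid. A quick
sketch over our 8 bricks (hostnames/paths as in the volume info below):

GFID=3b0aa7b2-bf7f-4b27-b515-32e94b1206e3
for h in bs1 bs2 bs3 bs4; do
  for b in /raid1 /raid2; do
    # the substring expansions happen locally, yielding .glusterfs/3b/0a/<gfid>
    ssh "$h" "ls -ld $b/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID" 2>/dev/null
  done
done

(For a directory the .glusterfs entry is a symlink rather than a hard link,
so this also shows where each brick thinks the dir lives.)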
We are getting these errors infrequently, but fairly regularly (a couple of
times a week, usually during a big array job that heavily reads from a
particular dir), and I haven't seen any resolution of the fault besides the
vocabulary being corrected. I know it's not necessarily an NFS problem, but I
haven't seen a fix from the gluster folks.
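(As Kaushal notes in the quoted thread below, the string is just libc's
default message for errno ESTALE and has nothing to do with NFS; you can see
it without gluster at all. A trivial check, assuming python is on the client:

$ python -c 'import errno, os; print(os.strerror(errno.ESTALE))'

Older glibc prints "Stale NFS file handle"; newer releases say "Stale file
handle".)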
Our glusterfs on this system is set up like this (over QDR/tcp-over-IB):
$ gluster volume info gl
Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
auth.allow: 10.2.*.*,10.1.*.*
performance.io-thread-count: 64
performance.quick-read: on
performance.io-cache: on
nfs.disable: on
performance.cache-size: 268435456
performance.flush-behind: on
performance.write-behind-window-size: 1024MB
and otherwise appears to be happy.
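If the caching translators ever turn out to be suspects (quick-read and
io-cache sit in the read path), they can be toggled per volume without
remounting clients. A possible experiment, not something we've tried yet:

$ gluster volume set gl performance.quick-read off
$ gluster volume set gl performance.io-cache off
# and to restore the current config:
$ gluster volume set gl performance.quick-read on
$ gluster volume set gl performance.io-cache on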
We were having a low-level problem with the RAID servers, where this
LSI/3ware error was temporally close (~2 min) to the gluster error:
LSI 3DM2 alert -- host: biostor4.oit.uci.edu
Jan 03, 2013 03:32:09PM - Controller 6
ERROR - Drive timeout detected: encl=1, slot=3
This error seemed to be related to construction around our data center and
the dust associated with it. We have had tens of these LSI/3ware errors with
no related gluster errors or apparent problems with the RAIDs. No drives were
ejected from the RAIDs and the errors did not repeat. 3ware explains:
<http://cholla.mmto.org/computers/3ware/3dm2/en/3DM_2_OLH-8-6.html>
==============================
009h Drive timeout detected
The 3ware RAID controller has a sophisticated recovery mechanism to handle
various types of failures of a disk drive. One such possible failure of a
disk drive is a failure of a command that is pending from the 3ware RAID
controller to complete within a reasonable amount of time. If the 3ware RAID
controller detects this condition, it notifies the user, prior to entering
the recovery phase, by displaying this AEN.
Possible causes of APORT time-outs include a bad or intermittent disk drive,
power cable or interface cable.
================================
We've checked into this and it doesn't seem to be related, but I thought I'd
bring it up.
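If it recurs, the controller and the drive in that slot can be interrogated
directly with the 3ware CLI and smartmontools' 3ware passthrough, e.g.
(controller/slot numbers from the alert above; the twa device number may
differ on your box):

$ tw_cli /c6 show alarms              # recent AENs on controller 6
$ smartctl -a -d 3ware,3 /dev/twa0    # SMART data for the drive in slot 3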
hjm
On Thursday, August 23, 2012 09:54:13 PM Joe Julian wrote:
> *Bug 832694* <https://bugzilla.redhat.com/show_bug.cgi?id=832694>
> - ESTALE error text should be reworded
>
> On 08/23/2012 09:50 PM, Kaushal M wrote:
> > The "Stale NFS file handle" message is the default string
given by
> > strerror() for errno ESTALE.
> > Gluster uses ESTALE as errno to indicate that the file being referred
> > to no longer exists, i.e. the reference is stale.
> >
> > - Kaushal
> >
> > On Fri, Aug 24, 2012 at 7:03 AM, Jules Wang <lancelotds at 163.com> wrote:
> > Hi, Jon,
> >
> > I also met the same issue, and reported a bug:
> > https://bugzilla.redhat.com/show_bug.cgi?id=851381
> >
> > In the bug report, I share a simple way to reproduce the bug.
> > Have fun.
> >
> > Best Regards.
> > Jules Wang.
> >
> > At 2012-08-23 23:02:34, "Bùi Hùng Việt" <buihungviet at gmail.com> wrote:
> > Hi Jon,
> > I have no answer for you. Just want to share with you guys
> > that I met the same issue with this message. In my gluster
> > system, the client log files have a lot of these messages.
> > I tried to ask and found nothing on the Web. Amazingly,
> > Gluster has been running for a long time :)
> >
> > On Thu, Aug 23, 2012 at 8:43 PM, Jon Tegner <tegner at renget.se> wrote:
> > Hi, I'm a bit curious about error messages of the
> > type "remote operation failed: Stale NFS file handle". All
> > clients using the file system use the Gluster Native Client,
> > so why should a stale NFS file handle be reported?
> >
> > Regards,
> >
> > /jon
> >
> > _______________________________________________
> > Gluster-users mailing list
> > Gluster-users at gluster.org
> > http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)