harry mangalam
2013-Mar-22 17:51 UTC
[Gluster-users] gluster 3.3 sporadic 'Permission denied' failure under load in cluster env
We have a ~2500-core academic cluster with saturating amounts of use. The main data store is a 4-node/8-brick/340TB/QDR-IB gluster 3.3 filesystem. All servers are 8x Opteron/32GB systems with 3ware 9750 SAS controllers, running SL6.2, and are stable, with load holding steady at about 2. gluster is config'ed as:

Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*

Many of our users run large array jobs under SGE, and especially during those runs, where there is LOTS of IO, we will VERY occasionally (20 times since last June, according to the brick logs) see the errors below, resulting in the failure of that particular element of the array job. Sometimes that is acceptable, but often the next job depends on all elements of the array job completing correctly; at any rate, from the fs point of view they should all complete. The rarity of the error, its type, and where it occurs suggest that it might be a hash collision..? It doesn't seem to be a registered bug in the gluster bugzilla, so here I am asking whether others have seen this and how it might be addressed.

========================================================================
The error being reported by Grid Engine says:

user "root" 03/21/2013 15:29:23 [507:26777]: error: can't open output file
"/gl/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103":
Permission denied 03/21/2013 15:29:23 [400:25458]: wait3
========================================================================

Looking through all the server logs (/var/log/glusterfs/etc-glusterfs-glusterd.vol.log) reveals nothing about this error, but the brick logs yield this set of lines referencing that file at the corresponding time:

/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:18.667171] W [posix-handle.c:461:posix_handle_hard] 0-gl-posix: link /raid1/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 -> /raid1/.glusterfs/5a/0e/5a0e87a6-e35d-4368-841e-b45802fecc4e failed (File exists)
/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:18.667249] E [posix.c:1730:posix_create] 0-gl-posix: setting gfid on /raid1/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 failed
/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:19.241602] I [server3_1-fops.c:1538:server_open_cbk] 0-gl-server: 644765: OPEN /bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 (5a0e87a6-e35d-4368-841e-b45802fecc4e) ==> -1 (Permission denied)
/var/log/glusterfs/bricks/raid1.log:[2013-03-21 15:43:19.520455] I [server3_1-fops.c:1538:server_open_cbk] 0-gl-server: 644970: OPEN /bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103 (5a0e87a6-e35d-4368-841e-b45802fecc4e) ==> -1 (Permission denied)
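For anyone who wants to poke at a brick after one of these failures: the W/E pair above says the .glusterfs gfid hardlink already existed when posix_create tried to set the gfid on the new file. Something like the following (a rough sketch only; run as root on the affected brick server, getfattr comes from the attr package, and the paths are copied from the log lines above) should show whether the gfid link and the file actually share an inode:

    BRICK=/raid1
    F=$BRICK/bio/krthornt/WTCCC/autosomal_analysis_Jan2013/1958BC/COMPUTE_1958BC.o254058.103
    # the gfid is stored as a trusted.* xattr on the file itself (needs root)
    getfattr -n trusted.gfid -e hex "$F"
    # the .glusterfs path is built from the first two byte-pairs of the gfid
    G=$BRICK/.glusterfs/5a/0e/5a0e87a6-e35d-4368-841e-b45802fecc4e
    # print inode number, hardlink count, and name for both paths
    stat -c '%i %h %n' "$F" "$G"

If the two inode numbers differ, the pre-existing gfid entry under .glusterfs is a link to some other file, not to the one the job was trying to create.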
---
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
---
harry mangalam
2013-Apr-10 18:35 UTC
[Gluster-users] gluster 3.3 sporadic 'Permission denied' failure under load in cluster env
Sending this again, since I'm not even sure the 1st made it to the list, and it's just happened again, even with the same user (one of our heaviest users, but I don't think there's anything odd about his usage). In the last 3 days we've had 6 such errors, each resulting in the logged error:

E [posix.c:1730:posix_create] 0-gl-posix: setting gfid on [file] failed

A question that could be answered is: has anyone else had such errors show up in their brick logs? ie:

grep -n "posix.c:1730:posix_create" /var/log/glusterfs/bricks/raid[12].log
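If you run more than one brick server, a loop like this (just a sketch; it assumes passwordless ssh and uses our hostnames and brick-log paths from the volume info above, so adjust for your setup) will count the hits per brick:

    for h in bs1 bs2 bs3 bs4; do
        echo "== $h"
        # grep -c prints one file:count pair per brick log
        ssh $h 'grep -c "posix.c:1730:posix_create" /var/log/glusterfs/bricks/raid[12].log'
    done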
hjm

=== previously == [the 2013-Mar-22 message, quoted in full above]