harry mangalam
2012-Oct-03  02:51 UTC
[Gluster-users] Retraction: Protocol stacking: gluster over NFS
Hi All,
Well, it <http://goo.gl/hzxyw> was too good to be true. Under extreme,
extended IO on a 48-core node, some part of the NFS stack collapses and
leads to an IO lockup thru NFS.  We've replicated it on 48-core and 64-core
nodes, but don't know yet whether it acts similarly on lower-core-count nodes.
Tho I haven't had time to figure out exactly /how/ it collapses, I owe it to
those who might be thinking of using it to tell them not to.
This is what I wrote, describing the situation to some co-workers:
==============
With Joseph's and Kevin's help, I've been able to replicate Kevin's complete
workflow on BDUC and executed it with a normally mounted gluster fs and my
gluster-via-NFS-loopback (on both NFS3 and NFS4 clients).
The good news is that the workflow went to completion on BDUC with the native 
gluster fs mount, doing pretty decent IO on one node - topping out at about 
250MB/s in and 75MB/s out (DDR IB)
        ib1        
  KB/s in  KB/s out
 268248.1  62278.40
 262835.1  64813.55
 248466.0  61000.24
 250071.3  67770.03
 252924.1  67235.13
 196261.3  56165.20
 255562.3  68524.45
 237479.3  68813.99
 209901.8  73147.73
 217020.4  70855.45
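
For anyone who wants to check the summary figures against the ib1 samples above, here is a minimal Python sketch that computes the mean and peak of each column. The function name `summarize` is made up for illustration, and the MB/s conversion assumes decimal units (1 MB = 1000 KB), which may not be what the monitoring tool used.

```python
# ib1 samples copied from the table above: KB/s in, KB/s out.
SAMPLES = """\
268248.1  62278.40
262835.1  64813.55
248466.0  61000.24
250071.3  67770.03
252924.1  67235.13
196261.3  56165.20
255562.3  68524.45
237479.3  68813.99
209901.8  73147.73
217020.4  70855.45
"""


def summarize(text):
    """Return (mean_in, peak_in, mean_out, peak_out) in KB/s."""
    rows = [
        tuple(float(field) for field in line.split())
        for line in text.splitlines()
        if line.strip()
    ]
    ins = [row[0] for row in rows]
    outs = [row[1] for row in rows]
    return (sum(ins) / len(ins), max(ins), sum(outs) / len(outs), max(outs))


mean_in, peak_in, mean_out, peak_out = summarize(SAMPLES)
# Assumption: decimal units, 1 MB = 1000 KB.
print(f"in:  mean {mean_in / 1000:.1f} MB/s, peak {peak_in / 1000:.1f} MB/s")
print(f"out: mean {mean_out / 1000:.1f} MB/s, peak {peak_out / 1000:.1f} MB/s")
```

This only covers the ten samples shown; it says nothing about the rest of the run.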
The bad news is that I've been able to replicate the failures that JF has
seen.  The workflow starts normally but then eats up free RAM as KT's workflow
saturates the nodes with about 26 instances of samtools, which does a LOT of
IO (10s of GB in the ~30m of the run).  This was the case even when I
increased the number of nfsd's to 16 and even 32.
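
To watch the free RAM being eaten while the workflow runs, something like the following Python sketch could be left running on the node. It is not what I used; it just samples a few standard /proc/meminfo fields (MemFree, Cached, Dirty, Writeback). The function names, the 10 s interval and the sample count are arbitrary choices for illustration.

```python
import time


def parse_meminfo(text):
    """Map field name -> integer value from /proc/meminfo-style text.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def pressure_snapshot(fields):
    """Pick out the fields most relevant to page-cache pressure."""
    keys = ("MemFree", "Cached", "Dirty", "Writeback")
    return {key: fields.get(key) for key in keys}


def watch(path="/proc/meminfo", interval=10.0, samples=6):
    """Print a timestamped snapshot every `interval` seconds, `samples` times."""
    for _ in range(samples):
        with open(path) as fh:
            snap = pressure_snapshot(parse_meminfo(fh.read()))
        print(time.strftime("%H:%M:%S"), snap, flush=True)
        time.sleep(interval)
```

Calling `watch()` in a separate shell before starting the workflow would show whether MemFree drops and Dirty/Writeback climb in the run-up to the lockup.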
When using native gluster, the workflow goes to completion in about 23 hrs - 
about the same as when KT executed it on his machine (using NFS I think..?).
However, when using the loopback mount, on both NFS3 and NFS4, it locks up the
NFS side (the gluster mount continues to be R/W), requiring a hard reset on
the node to clear the NFS error.  It is interesting that the samtools
processes lock up during /reads/, not writes (via stracing several of the
processes).
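
A quicker way to spot the hung processes than stracing them one by one might be to look for samtools processes sitting in state D (uninterruptible sleep), which is where a process blocked on IO would typically show up. The Python sketch below is an assumption about how one could check, not something from the runs described here; `parse_stat` and `blocked_processes` are hypothetical helper names.

```python
import os


def parse_stat(line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first "(" and the last ")".
    """
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc", name="samtools"):
    """List (pid, comm) for processes matching `name` that are in state D."""
    hits = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process exited or stat was unreadable
        if state == "D" and name in comm:
            hits.append((pid, comm))
    return hits
```

A process that is briefly in D during normal IO is not a problem; it is the ones that stay there across repeated calls to `blocked_processes()` that would be worth stracing.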
I found this entry in a FraunhoferFS discussion:
from <https://groups.google.com/forum/?fromgroups=#!topic/fhgfs-user/XoGPbv3kfhc>
[[
In general, any network file system that uses the standard kernel page 
cache on the client side (including e.g. NFS, just to give another 
example) is not suitable for running client and server on the same 
machine, because that would lead to memory allocation deadlocks under 
high memory pressure - so you might want to watch out for that. 
(fhgfs uses a different caching mechanism on the clients to allow 
running it in such scenarios.) 
]]
but why this would be the case, I'm not sure - the server and client processes
should be unable to step on each other's data structures, so why they would
interfere with each other is unclear.  Others on this list have mentioned
similar opinions - I'd be interested in why this is theoretically the case.
The upshot is that under extreme, extended IO, NFS will lock up, so while we
haven't seen it on BDUC except for KT's workflow, it's repeatable and we can't
recover from it smoothly.  So we should move away from it.
I haven't been able to test it on a 3.x kernel (but will after this weekend);
it's possible that it might work better, but I'm not optimistic.
-- 
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
--
Passive-Aggressive Supporter of the The Canada Party:
  <http://www.americabutbetter.com/>
Brian Foster
2012-Oct-03  12:32 UTC
[Gluster-users] Retraction: Protocol stacking: gluster over NFS
On 10/02/2012 10:51 PM, harry mangalam wrote:
> Hi All,
>
> Well, it <http://goo.gl/hzxyw> was too good to be true. Under extreme,
> extended IO on a 48-core node, some part of the NFS stack collapses and
> leads to an IO lockup thru NFS. We've replicated it on 48-core and 64-core
> nodes, but don't know yet whether it acts similarly on lower-core-count nodes.
>
> Tho I haven't had time to figure out exactly /how/ it collapses, I owe it to
> those who might be thinking of using it to tell them not to.
>

If you are willing and able to do this kind of testing, why not give the
fuse writeback patches a shot?

http://article.gmane.org/gmane.linux.file-systems/65661

My understanding is that they are fundamentally susceptible to the same
kind of problem, but there is an apparent attempt to avoid the deadlock
scenario. The kind of testing you're doing here might be informative as to
how well it works.

Brian

> This is what I wrote, describing the situation to some co-workers:
> ==============
> With Joseph's and Kevin's help, I've been able to replicate Kevin's complete
> workflow on BDUC and executed it with a normally mounted gluster fs and my
> gluster-via-NFS-loopback (on both NFS3 and NFS4 clients).
>
> The good news is that the workflow went to completion on BDUC with the native
> gluster fs mount, doing pretty decent IO on one node - topping out at about
> 250MB/s in and 75MB/s out (DDR IB)
>         ib1
>   KB/s in  KB/s out
>  268248.1  62278.40
>  262835.1  64813.55
>  248466.0  61000.24
>  250071.3  67770.03
>  252924.1  67235.13
>  196261.3  56165.20
>  255562.3  68524.45
>  237479.3  68813.99
>  209901.8  73147.73
>  217020.4  70855.45
>
> The bad news is that I've been able to replicate the failures that JF has
> seen. The workflow starts normally but then eats up free RAM as KT's workflow
> saturates the nodes with about 26 instances of samtools, which does a LOT of
> IO (10s of GB in the ~30m of the run). This was the case even when I
> increased the number of nfsd's to 16 and even 32.
>
> When using native gluster, the workflow goes to completion in about 23 hrs -
> about the same as when KT executed it on his machine (using NFS I think..?).
> However, when using the loopback mount, on both NFS3 and NFS4, it locks up the
> NFS side (the gluster mount continues to be R/W), requiring a hard reset on
> the node to clear the NFS error. It is interesting that the samtools
> processes lock up during /reads/, not writes (via stracing several of the
> processes).
>
> I found this entry in a FraunhoferFS discussion:
> from <https://groups.google.com/forum/?fromgroups=#!topic/fhgfs-user/XoGPbv3kfhc>
> [[
> In general, any network file system that uses the standard kernel page
> cache on the client side (including e.g. NFS, just to give another
> example) is not suitable for running client and server on the same
> machine, because that would lead to memory allocation deadlocks under
> high memory pressure - so you might want to watch out for that.
> (fhgfs uses a different caching mechanism on the clients to allow
> running it in such scenarios.)
> ]]
> but why this would be the case, I'm not sure - the server and client processes
> should be unable to step on each other's data structures, so why they would
> interfere with each other is unclear. Others on this list have mentioned
> similar opinions - I'd be interested in why this is theoretically the case.
>
> The upshot is that under extreme, extended IO, NFS will lock up, so while we
> haven't seen it on BDUC except for KT's workflow, it's repeatable and we can't
> recover from it smoothly. So we should move away from it.
>
> I haven't been able to test it on a 3.x kernel (but will after this weekend);
> it's possible that it might work better, but I'm not optimistic.