On 02/25/2016 08:20 PM, Ravishankar N wrote:
> On 02/25/2016 11:36 PM, Kyle Maas wrote:
>> How can I tell what AFR version a cluster is using for self-heal?
> If all your servers and clients are 3.7.8, then they are by default
> running afr-v2. Afr-v2 was a rewrite of afr that went in for 3.6,
> so any gluster package from then on has this code; you don't need to
> explicitly enable anything.

That was what I thought until I ran across this IRC log where JoeJulian
asked if it was explicitly enabled:

https://irclog.perlgeek.de/gluster/2015-10-29

>> The reason I ask is that I have a two-node replicated 3.7.8 cluster (no
>> arbiters) which has locking behavior during self-heal which looks very
>> similar to that of AFRv1 (only heals one file at a time per self-heal
>> daemon, appears to lock the full inode while it's healing it instead of
>> just ranges, etc.),
> Both v1 and v2 use range locks while healing a given file, so clients
> shouldn't block when heals happen. What is the problem you're facing?
> Are your clients also at 3.7.8?

Primary symptoms are:

1. While a self-heal is running, only one file at a time is healed per
brick. As I understand it, AFRv2 and up should allow for multiple files
to be healed concurrently, or at least multiple ranges within a file,
particularly with io-thread-count set to >1. During a self-heal,
neither I/O nor network is saturated, which leads me to believe that I'm
looking at a single synchronous self-healing process.

3. More troubling is that during a self-heal, clients cannot so much as
list the files on the volume until the self-heal is done. No errors.
No timeouts. They just freeze. As soon as the self-heal is complete,
they unfreeze and list the contents.

4. Any file access during a self-heal also freezes, just like a
directory listing, until the self-heal is done.
This wreaks havoc on users who have files open when one of the bricks is
rebooted and has to be healed, since, with as much data as is stored on
this cluster, a self-heal can take almost 24 hours.

I experience the same problems when I run without any clients other than
the bricks themselves mounting the volume, so yes, it happens with the
clients on 3.7.8 as well.

Warm Regards,
Kyle Maas
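[For anyone else trying to confirm what their cluster is running and
whether heals are in progress, a quick sanity check might look like the
following; the volume name `gv0` is a placeholder, and `gluster volume
get` requires 3.7 or later:]

```shell
# Confirm the installed GlusterFS version on every server and client;
# any 3.6+ package carries the afr-v2 code unconditionally.
gluster --version | head -1

# List files still pending heal on each brick of the volume.
gluster volume heal gv0 info

# Inspect the effective self-heal related options on the volume.
gluster volume get gv0 cluster.self-heal-daemon
gluster volume get gv0 cluster.data-self-heal
```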
On 02/26/2016 10:02 AM, Kyle Maas wrote:
> On 02/25/2016 08:20 PM, Ravishankar N wrote:
>> On 02/25/2016 11:36 PM, Kyle Maas wrote:
>>> How can I tell what AFR version a cluster is using for self-heal?
>> If all your servers and clients are 3.7.8, then they are by default
>> running afr-v2. Afr-v2 was a rewrite of afr that went in for 3.6,
>> so any gluster package from then on has this code; you don't need to
>> explicitly enable anything.
> That was what I thought until I ran across this IRC log where JoeJulian
> asked if it was explicitly enabled:
>
> https://irclog.perlgeek.de/gluster/2015-10-29
>
>>> The reason I ask is that I have a two-node replicated 3.7.8 cluster (no
>>> arbiters) which has locking behavior during self-heal which looks very
>>> similar to that of AFRv1 (only heals one file at a time per self-heal
>>> daemon, appears to lock the full inode while it's healing it instead of
>>> just ranges, etc.),
>> Both v1 and v2 use range locks while healing a given file, so clients
>> shouldn't block when heals happen. What is the problem you're facing?
>> Are your clients also at 3.7.8?
> Primary symptoms are:
>
> 1. While a self-heal is running, only one file at a time is healed per
> brick. As I understand it, AFRv2 and up should allow for multiple files
> to be healed concurrently, or at least multiple ranges within a file,
> particularly with io-thread-count set to >1. During a self-heal,
> neither I/O nor network is saturated, which leads me to believe that I'm
> looking at a single synchronous self-healing process.

The self-heal daemon on each node processes one file at a time per
replica, so in that sense it is serial. We are working on the
multi-threaded self-heal patch (http://review.gluster.org/#/c/13329/)
for parallel heals.

> 3. More troubling is that during a self-heal, clients cannot so much as
> list the files on the volume until the self-heal is done. No errors.
> No timeouts. They just freeze.
> As soon as the self-heal is complete,
> they unfreeze and list the contents.

I'm guessing http://review.gluster.org/#/c/13207/ would fix that. But as
a workaround, can you see if `gluster vol set volname data-self-heal off`
makes them more responsive?

> 4. Any file access during a self-heal also freezes, just like a
> directory listing, until the self-heal is done.

Ditto as above; please see if disabling client-side heal helps.

Regards,
Ravi

> This wreaks havoc on users who have files open when one of the bricks
> is rebooted and has to be healed, since, with as much data as is stored
> on this cluster, a self-heal can take almost 24 hours.
>
> I experience the same problems when I run without any clients other than
> the bricks themselves mounting the volume, so yes, it happens with the
> clients on 3.7.8 as well.
>
> Warm Regards,
> Kyle Maas
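[Spelled out, the suggested workaround would look something like the
commands below. `volname` is the placeholder from Ravi's example; the
`cluster.`-prefixed names are the canonical forms of the same options.]

```shell
# Turn off client-side data self-heal, which can block reads while a
# file is being repaired; the self-heal daemon keeps healing regardless.
gluster volume set volname cluster.data-self-heal off

# If that alone doesn't help, metadata and entry self-heal can be
# disabled on the client side as well.
gluster volume set volname cluster.metadata-self-heal off
gluster volume set volname cluster.entry-self-heal off

# Re-enable once responsiveness has been confirmed:
gluster volume set volname cluster.data-self-heal on
```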
On February 25, 2016 8:32:44 PM PST, Kyle Maas <kyle at virtualinterconnect.com> wrote:
> On 02/25/2016 08:20 PM, Ravishankar N wrote:
>> On 02/25/2016 11:36 PM, Kyle Maas wrote:
>>> How can I tell what AFR version a cluster is using for self-heal?
>> If all your servers and clients are 3.7.8, then they are by default
>> running afr-v2. Afr-v2 was a rewrite of afr that went in for 3.6,
>> so any gluster package from then on has this code; you don't need to
>> explicitly enable anything.
>
> That was what I thought until I ran across this IRC log where JoeJulian
> asked if it was explicitly enabled:
>
> https://irclog.perlgeek.de/gluster/2015-10-29

A couple lines down, though, I continued "Ah, I was confusing that with nsr."

>>> The reason I ask is that I have a two-node replicated 3.7.8 cluster (no
>>> arbiters) which has locking behavior during self-heal which looks very
>>> similar to that of AFRv1 (only heals one file at a time per self-heal
>>> daemon, appears to lock the full inode while it's healing it instead of
>>> just ranges, etc.),
>> Both v1 and v2 use range locks while healing a given file, so clients
>> shouldn't block when heals happen. What is the problem you're facing?
>> Are your clients also at 3.7.8?
>
> Primary symptoms are:
>
> 1. While a self-heal is running, only one file at a time is healed per
> brick. As I understand it, AFRv2 and up should allow for multiple files
> to be healed concurrently, or at least multiple ranges within a file,
> particularly with io-thread-count set to >1. During a self-heal,
> neither I/O nor network is saturated, which leads me to believe that I'm
> looking at a single synchronous self-healing process.
>
> 3. More troubling is that during a self-heal, clients cannot so much as
> list the files on the volume until the self-heal is done. No errors.
> No timeouts. They just freeze. As soon as the self-heal is complete,
> they unfreeze and list the contents.
>
> 4.
> Any file access during a self-heal also freezes, just like a
> directory listing, until the self-heal is done. This wreaks havoc on
> users who have files open when one of the bricks is rebooted and has to
> be healed, since, with as much data as is stored on this cluster, a
> self-heal can take almost 24 hours.
>
> I experience the same problems when I run without any clients other than
> the bricks themselves mounting the volume, so yes, it happens with the
> clients on 3.7.8 as well.
>
> Warm Regards,
> Kyle Maas
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.