Daniel Mons
2013-Apr-10 10:44 UTC
[Gluster-users] GlusterFS 3.3.1 split-brain rsync question
Our production GlusterFS 3.3.1GA setup is a 3x2 distribute-replicate, with 100TB usable for staff. This is one of 4 identical GlusterFS clusters we're running.

Very early in the life of our production Gluster rollout, we ran Netatalk 2.X to share files with MacOSX clients (due to slow negative lookup on CIFS/Samba for those pesky resource fork files in MacOSX's Finder). Netatalk 2.X wrote its CNID_DB files back to Gluster, which caused enormous IO, locking up many nodes at a time (lots of "hung task" errors in dmesg/syslog).

We've since moved to Netatalk 3.X, which puts its CNID_DB files elsewhere (we put them on local SSD RAID), and the lockups have vanished. However, our split-brain files number in the tens of thousands due to those previous lockups, and they aren't always predictable (i.e. it's not always the case that brick0 is "good" and brick1 is "bad"). Manually fixing the files is far too time consuming.

I've written a rudimentary script that trawls /var/log/glusterfs/glustershd.log for split-brain GFIDs, tracks each one down on the matching pair of bricks, and decides via a few rules (size tends to be a good indicator for us, as bigger files tend to be more recent ones) which is the "good" file. This works for about 80% of files, which will dramatically reduce the amount of data we have to check manually. (A rough sketch is at the end of this mail.)

My question is: what should I do from here? Options are:

Option 1) Delete the file from the "bad" brick

Option 2) rsync the file from the "good" brick to the "bad" brick with the -aX flag (preserve everything, including the trusted.afr.$server and trusted.gfid xattrs)

Option 3) rsync the file from "good" to "bad", and then setfattr -x trusted.* on the bad brick

Which of these is considered the better (more glustershd compatible) option? Or alternatively, is there something else that's preferred?

Normally I'd just test this on our backup gluster, but as it was never running Netatalk, it has no split-brain problems, so I can't test the functionality.

Thanks for any insight provided,

-Dan
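P.S. For the curious, the scan boils down to something like the sketch below. The log location, the brick paths and the bigger-file-wins rule are just what I described above, and the GFID regex simply pulls UUID-shaped strings out of any glustershd.log line mentioning "split-brain", so treat it as a starting point rather than the actual script.

#!/bin/bash
# Sketch only: list split-brain GFIDs and guess the "good" brick by size.
# BRICK_A/BRICK_B are placeholder paths for one replica pair.
LOG=/var/log/glusterfs/glustershd.log
BRICK_A=/data/brick0/gv0
BRICK_B=/data/brick1/gv0

grep -i 'split-brain' "$LOG" \
  | grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' \
  | sort -u \
  | while read -r gfid; do
      # Every file on a 3.3 brick is hard-linked under .glusterfs/xx/yy/<gfid>
      rel=".glusterfs/${gfid:0:2}/${gfid:2:2}/${gfid}"
      size_a=$(stat -c %s "$BRICK_A/$rel" 2>/dev/null || echo 0)
      size_b=$(stat -c %s "$BRICK_B/$rel" 2>/dev/null || echo 0)
      if [ "$size_a" -ge "$size_b" ]; then
          echo "$gfid good=$BRICK_A bad=$BRICK_B"
      else
          echo "$gfid good=$BRICK_B bad=$BRICK_A"
      fi
    done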
Pete Smith
2013-Apr-10 15:41 UTC
[Gluster-users] GlusterFS 3.3.1 split-brain rsync question
Hi Dan

I've come up against this recently whilst trying to delete large amounts of files from our cluster.

I'm resolving it with the method from
http://comments.gmane.org/gmane.comp.file-systems.gluster.user/1917

With Fabric as a helping hand, it's not too tedious (rough per-file sketch at the end of this mail).

Not sure about the level of glustershd compatibility, but it's working for me.

HTH

Pete

-- 
Pete Smith
DevOp/System Administrator
Realise Studio
12/13 Poland Street, London W1F 8QB
T. +44 (0)20 7165 9644

realisestudio.com
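P.S. In case it helps, the per-file step my Fabric task runs boils down to the sketch below. The brick path, client mount and file path are placeholders, and the gist is only roughly what that thread describes as I remember it: remove the bad copy and its .glusterfs hard link, then stat the file through a client mount so self-heal pulls it back from the good brick. Test it on something unimportant first.

#!/bin/bash
# Sketch only -- run as root on the server holding the "bad" brick.
BAD_BRICK=/data/brick1/gv0       # placeholder brick path
MOUNT=/mnt/gv0                   # placeholder client mount
FILE=path/to/broken/file         # path relative to the volume root

# Read the file's GFID from the trusted.gfid xattr on the brick copy
gfid=$(getfattr -n trusted.gfid -e hex "$BAD_BRICK/$FILE" 2>/dev/null \
        | awk -F'0x' '/trusted.gfid=/{print $2}')
[ -n "$gfid" ] || { echo "could not read gfid for $FILE" >&2; exit 1; }

# Rebuild the dashed GFID form used by the .glusterfs directory layout
gfid_d="${gfid:0:8}-${gfid:8:4}-${gfid:12:4}-${gfid:16:4}-${gfid:20:12}"

# Remove the bad copy and its GFID hard link...
rm -f "$BAD_BRICK/$FILE"
rm -f "$BAD_BRICK/.glusterfs/${gfid_d:0:2}/${gfid_d:2:2}/${gfid_d}"

# ...then stat through the client mount to trigger self-heal from the good brick
stat "$MOUNT/$FILE" > /dev/null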