Marcus Bointon
2013-Mar-01 00:37 UTC
[Gluster-users] 3.3.1 Replicate only replicating one way
I've given up on trying to upgrade a 3.2.5 installation to 3.3.1 directly, so I'm scrapping it and starting again. I'm on Ubuntu Lucid, using stock packages from the semiosis PPA.

My config is very simple: 2 nodes running replicate on a single volume with 4GB of small files, created like this:

# gluster volume create shared replica 2 transport tcp 192.168.0.8:/var/shared 192.168.0.34:/var/shared

I copied all files off the gluster volume, removed all signs of gluster 3.2.5, installed 3.3.1, and reconfigured using the same commands as for 3.2.5. Install, peer probe, volume creation and mount (via NFS) all reported working correctly. The problem I'm now seeing is that I can touch a file on one side and it appears on the other, but not the other way around.

If I ask for heal info on the volume, both nodes report zero differences, but ls shows there are! If I request a full heal, the files appear correctly and the fixed files appear in the healed list. Something is clearly not talking...

I doubt it's a firewall issue, since this was previously a working setup and the firewall hasn't been touched.

I'm finding it hard to track down since gluster's logs are spread across so many places (just this simple config has 20+ logs), and I've not found anything to explain this behaviour.

Node 1:

# gluster peer status
Number of Peers: 1

Hostname: 192.168.0.8
Uuid: 8f30902f-f125-47bc-87dd-fa48e583efd3
State: Peer in Cluster (Connected)

# gluster volume status
Status of volume: shared
Gluster process                                Port    Online  Pid
------------------------------------------------------------------------------
Brick 192.168.0.8:/var/shared                  24010   Y       22440
Brick 192.168.0.34:/var/shared                 24009   Y       16957
NFS Server on localhost                        38467   Y       16963
Self-heal Daemon on localhost                  N/A     Y       16969
NFS Server on 192.168.0.8                      38467   Y       22446
Self-heal Daemon on 192.168.0.8                N/A     Y       22452

Node 2:

# gluster peer status
Number of Peers: 1

Hostname: 192.168.0.34
Uuid: cf6d4c23-a5a2-4c35-859c-52410b6429e1
State: Peer in Cluster (Connected)

# gluster volume status
Status of volume: shared
Gluster process                                Port    Online  Pid
------------------------------------------------------------------------------
Brick 192.168.0.8:/var/shared                  24010   Y       22440
Brick 192.168.0.34:/var/shared                 24009   Y       16957
NFS Server on localhost                        38467   Y       22446
Self-heal Daemon on localhost                  N/A     Y       22452
NFS Server on 192.168.0.34                     38467   Y       16963
Self-heal Daemon on 192.168.0.34               N/A     Y       16969

Having said all that, I've just noticed that files *are* appearing on the other node in the direction I thought they were not - but it's *really* slow. I copied about 10,000 files onto it and they are all visible on one node, but after 30 minutes only 10% of them are present on the other node, and they are all listed in the 'info healed' output. This sounds to me as if the replication is only happening in one direction via self-heal, and not through the normal replication route - it's certainly not synchronous. Any idea what could be amiss?

Marcus
--
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info at hand CRM solutions
marcus at synchromedia.co.uk | http://www.synchromedia.co.uk/
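For anyone retracing these steps, the operations described above correspond to the following 3.3.x commands (a sketch: the volume name comes from the create command above, the mount point /mnt/shared is a placeholder, and Gluster's built-in NFS server speaks NFSv3 over TCP only, hence the mount options):

# gluster volume heal shared info            # entries currently pending heal
# gluster volume heal shared full            # trigger a full crawl and heal of the volume
# gluster volume heal shared info healed     # entries the self-heal daemon has already fixed
# mount -t nfs -o vers=3,proto=tcp 192.168.0.34:/shared /mnt/shared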
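The firewall theory is also easy to rule out directly. A quick connectivity check from each node might look like this (a sketch using netcat; the brick ports are taken from the volume status output above, and 24007 is glusterd's standard management port):

# nc -z -w 3 192.168.0.8 24007 && echo "glusterd on .8 reachable"
# nc -z -w 3 192.168.0.8 24010 && echo "brick on .8 reachable"
# nc -z -w 3 192.168.0.34 24009 && echo "brick on .34 reachable"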
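The "one direction via self-heal only" suspicion can also be checked on the bricks themselves by inspecting the AFR changelog xattrs (a sketch; /var/shared/somefile is a placeholder for any file present on one brick, and non-zero trusted.afr.* counters mean operations are still pending against the other replica):

# getfattr -d -m trusted.afr -e hex /var/shared/somefile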
Todd Stansell
2013-Mar-06 08:02 UTC
[Gluster-users] 3.3.1 Replicate only replicating one way
In our recent testing, we saw all kinds of weird problems while testing rebuilding a failed brick in the same kind of 2-node replicate cluster. Several times we had to kill off all gluster processes and restart things from scratch to get the two sides talking correctly again (where both sides thought they were happily talking to the other side, but self-heal wasn't doing anything). We'd run a full heal or stat some files and they wouldn't replicate back to the other side. After restarting the processes (not just glusterd, but all of the glusterfs ones too), things would start working. Once things were running and the nodes were properly replicating, replication appeared to flow both ways nicely.

We also saw an lstat on a client mount hang once for 105 seconds while we were rsyncing data into our cluster. No idea why things would lock up for that long. It was an lstat of a directory full of 4GB ISO files, so maybe it was waiting for the ISOs to copy to both boxes. At gigabit speed (~950Mbps, roughly 119MB/s), though, 105 seconds is something like 12GB of data. And I'm not sure why it would lock out lstat calls.

I'm new to glusterfs, so I don't really have anything more to add. I just wanted you to know I've seen similar weirdness with 3.3.1 in a relatively simple replicate configuration.

Todd

On Fri, Mar 01, 2013 at 01:37:42AM +0100, Marcus Bointon wrote:
> [snip]
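For reference, the "kill everything and restart" sequence Todd describes looks roughly like this (a sketch; the init script is glusterfs-server on Ubuntu and glusterd on some other distros, and the pkill will also take down any local FUSE client mounts):

# service glusterfs-server stop
# pkill glusterfsd                 # brick daemons
# pkill glusterfs                  # NFS server, self-heal daemon, local clients
# service glusterfs-server start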