John Gardeniers
2014-Jun-29 22:33 UTC
[Gluster-users] Self-heal still not finished after 2 days
Hi All,

We have 2 servers, each with one 5TB brick, configured as replica 2. After a series of events that caused the 2 bricks to become way out of step, gluster was turned off on one server and its brick was wiped of everything, but the attributes were left untouched.

This weekend we stopped the client and gluster and made a backup of the remaining brick, just to play safe. Gluster was then turned back on, first on the "master" and then on the "slave". Self-heal kicked in and started rebuilding the second brick. However, after 2 full days all files in the volume are still showing heal-failed errors.

The rebuild was, in my opinion at least, very slow, taking most of a day even though the system is on a 10Gb LAN. The data is a little under 1.4TB committed, 2TB allocated.

Once the 2 bricks were very close to having the same amount of space used, things slowed right down. For the last day both bricks have shown a very slow increase in used space, even though there are no changes being written by the client. By slow I mean just a few KB per minute.

The logs are confusing, to say the least. In etc-glusterfs-glusterd.vol.log on both servers there are thousands of entries such as (possibly because I was using watch to monitor self-heal progress):

[2014-06-29 21:41:11.289742] I [glusterd-volume-ops.c:478:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume gluster-rhev

That timestamp is the latest on either server, which is about 9 hours ago as I type this. I find that a bit disconcerting, as I have requested volume heal-failed info since then.

The brick log on the "master" server (the one from which we are rebuilding the new brick) contains no entries since before the rebuild started.

On the "slave" server the brick log shows a lot of entries such as:

[2014-06-28 08:49:47.887353] E [marker.c:2140:marker_removexattr_cbk] 0-gluster-rhev-marker: Numerical result out of range occurred while creating symlinks
[2014-06-28 08:49:47.887382] I [server-rpc-fops.c:745:server_removexattr_cbk] 0-gluster-rhev-server: 10311315: REMOVEXATTR /44d30b24-1ed7-48a0-b905-818dc0a006a2/images/02d4bd3c-b057-4f04-ada5-838f83d0b761/d962466d-1894-4716-b5d0-3a10979145ec (1c1f53ac-afe2-420d-8c93-b1eb53ffe8b1) of key ==> (Numerical result out of range)

Those entries are from around the time the rebuild was starting. The final entries in that same log (immediately after those listed above) are:

[2014-06-29 12:47:28.473999] I [server-rpc-fops.c:243:server_inodelk_cbk] 0-gluster-rhev-server: 2869: INODELK (null) (c67e9bbe-5956-4c61-b650-2cd5df4c4df0) ==> (No such file or directory)
[2014-06-29 12:47:28.489527] I [server-rpc-fops.c:1572:server_open_cbk] 0-gluster-rhev-server: 2870: OPEN (null) (c67e9bbe-5956-4c61-b650-2cd5df4c4df0) ==> (No such file or directory)

As I type it's 2014-06-30 08:31.

What do these entries mean and how can I rectify it?

regards,
John
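For reference, a minimal sketch of the commands that can be used to check heal status from the CLI and to inspect the replica (AFR) changelog xattrs directly on a brick; the volume name gluster-rhev comes from the logs above, while /bricks/gluster-rhev and the file path are placeholders for the real brick layout:

# Pending, failed and split-brain heals, as reported by glusterd
# (run on either server):
gluster volume heal gluster-rhev info
gluster volume heal gluster-rhev info heal-failed
gluster volume heal gluster-rhev info split-brain

# AFR changelog xattrs for a file, read directly from the brick (on the
# server, against the brick path, not the client mount). Non-zero
# trusted.afr.* counters mean the replica still considers the file dirty.
getfattr -d -m . -e hex /bricks/gluster-rhev/path/to/file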
Pranith Kumar Karampuri
2014-Jun-30 01:42 UTC
[Gluster-users] Self-heal still not finished after 2 days
On 06/30/2014 04:03 AM, John Gardeniers wrote:
> Hi All,
>
> We have 2 servers, each with one 5TB brick, configured as replica 2.
> After a series of events that caused the 2 bricks to become way out of
> step, gluster was turned off on one server and its brick was wiped of
> everything, but the attributes were left untouched.
>
> This weekend we stopped the client and gluster and made a backup of the
> remaining brick, just to play safe. Gluster was then turned back on,
> first on the "master" and then on the "slave". Self-heal kicked in and
> started rebuilding the second brick. However, after 2 full days all
> files in the volume are still showing heal-failed errors.
>
> The rebuild was, in my opinion at least, very slow, taking most of a day
> even though the system is on a 10Gb LAN. The data is a little under
> 1.4TB committed, 2TB allocated.

How much more is there to be healed? 0.6TB?

> Once the 2 bricks were very close to having the same amount of space
> used, things slowed right down. For the last day both bricks have shown
> a very slow increase in used space, even though there are no changes
> being written by the client. By slow I mean just a few KB per minute.

Is I/O still in progress on the mount? In 3.4.x, self-heal does not happen on files that have ongoing I/O from the mount, so that could be the reason if I/O is still going on.

> The logs are confusing, to say the least. In
> etc-glusterfs-glusterd.vol.log on both servers there are thousands of
> entries such as (possibly because I was using watch to monitor self-heal
> progress):
>
> [2014-06-29 21:41:11.289742] I
> [glusterd-volume-ops.c:478:__glusterd_handle_cli_heal_volume]
> 0-management: Received heal vol req for volume gluster-rhev

What version of gluster are you using?

> That timestamp is the latest on either server, which is about 9 hours
> ago as I type this. I find that a bit disconcerting, as I have requested
> volume heal-failed info since then.
>
> The brick log on the "master" server (the one from which we are
> rebuilding the new brick) contains no entries since before the rebuild
> started.
>
> On the "slave" server the brick log shows a lot of entries such as:
>
> [2014-06-28 08:49:47.887353] E [marker.c:2140:marker_removexattr_cbk]
> 0-gluster-rhev-marker: Numerical result out of range occurred while
> creating symlinks
> [2014-06-28 08:49:47.887382] I
> [server-rpc-fops.c:745:server_removexattr_cbk] 0-gluster-rhev-server:
> 10311315: REMOVEXATTR
> /44d30b24-1ed7-48a0-b905-818dc0a006a2/images/02d4bd3c-b057-4f04-ada5-838f83d0b761/d962466d-1894-4716-b5d0-3a10979145ec
> (1c1f53ac-afe2-420d-8c93-b1eb53ffe8b1) of key ==> (Numerical result out
> of range)

CC Raghavendra, who knows about the marker translator.

> Those entries are from around the time the rebuild was starting. The
> final entries in that same log (immediately after those listed above)
> are:
>
> [2014-06-29 12:47:28.473999] I
> [server-rpc-fops.c:243:server_inodelk_cbk] 0-gluster-rhev-server: 2869:
> INODELK (null) (c67e9bbe-5956-4c61-b650-2cd5df4c4df0) ==> (No such file
> or directory)
> [2014-06-29 12:47:28.489527] I [server-rpc-fops.c:1572:server_open_cbk]
> 0-gluster-rhev-server: 2870: OPEN (null)
> (c67e9bbe-5956-4c61-b650-2cd5df4c4df0) ==> (No such file or directory)

These logs are harmless and were fixed in 3.5, I think. Are you on 3.4.x?

> As I type it's 2014-06-30 08:31.
>
> What do these entries mean and how can I rectify it?
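A minimal sketch, assuming the 3.4.x CLI and the gluster-rhev volume name from the logs above, of how to confirm the running version and re-trigger healing once client I/O has quiesced:

# Confirm which gluster version is actually running on each server:
glusterfs --version

# Once I/O from the client has stopped, kick off another heal pass:
gluster volume heal gluster-rhev          # heal files marked as needing heal
gluster volume heal gluster-rhev full     # full crawl of the bricks

# Check progress less aggressively; each "heal ... info" request logs a
# "Received heal vol req" line like the ones quoted above.
watch -n 60 'gluster volume heal gluster-rhev info'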