Branden Timm
2014-Feb-03 21:35 UTC
[Gluster-users] Problems after upgrade/volume expansion
Hello,

I'm experiencing some major problems with my GlusterFS filesystem after an upgrade/expansion, and I'm hoping I can get pointed in the right direction for troubleshooting it.

I had a 5-server, 5-brick distributed volume on 3.3.1. I brought the volume offline, stopped glusterd and glusterfsd on all servers, then upgraded to 3.4.2 and brought glusterd and glusterfsd back online. So far so good.

Once the volume was back online and healthy, I added a new server to the trusted storage pool and added two bricks attached to that server to the volume. Everything still looked fine: "gluster volume status <volname>" showed all six servers and seven bricks as online.

The problem came next, when I tried to rebalance. I ran "gluster volume rebalance <volname> start force", then once it returned I ran "status" and saw that the rebalance had failed on all but one node, which showed as in progress. The node it was running successfully on was a pre-existing server, not the new server with the new bricks. The other five servers report "1 subvolume(s) are down. Skipping fix layout." Somebody in the IRC channel suggested this means that one of my bricks is down, but "gluster volume status <volname>" reports all servers and bricks as online. Full pastebin of the rebalance log (essentially the same on all five failing servers) here: http://fpaste.org/74082/14615971/

Currently, I have both missing files and files that report "Transport endpoint not connected" when they are accessed. This really does seem to be related to the rebalance failures, and the layout seems incorrect as well.

I'm really hoping somebody can point me in the right direction of where to look next. Thanks in advance for any help.

-Branden
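For anyone comparing logs across the servers, here is a minimal sketch of a helper that counts how often the messages quoted above appear in a log's text. It is only an illustration: the function name, the marker list, and the sample lines are hypothetical, and the log file location is not given in this post, so the helper takes the text directly rather than assuming a path.

```python
# Hypothetical helper: count lines in a log's text that contain the
# failure messages quoted in this post. Plain substring matching only.
MARKERS = (
    "subvolume(s) are down. Skipping fix layout.",
    "Transport endpoint",
)


def count_markers(log_text, markers=MARKERS):
    """Return a dict mapping each marker to the number of lines containing it."""
    counts = {marker: 0 for marker in markers}
    for line in log_text.splitlines():
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


# Made-up sample input; real log lines carry timestamps and other fields.
sample = "\n".join([
    "1 subvolume(s) are down. Skipping fix layout.",
    "some unrelated line",
])

print(count_markers(sample))
```

Running the same check against the log text from each server would show whether the "subvolume(s) are down" message appears everywhere or only on some nodes.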
Branden Timm
2014-Feb-03 22:15 UTC
[Gluster-users] Problems after upgrade/volume expansion
I should mention that the following line from the log is also worrying, since every server in the trusted pool is running GlusterFS 3.4.2 (verified by running /usr/sbin/glusterd -V):

Using Program GlusterFS 3.3, Num (1298437), Version (330)

Branden