Brian Candler
2012-Jul-11 10:27 UTC
[Gluster-users] Recovering a broken distributed volume
I had a RAID array fail due to a number of Seagate drives going down, so this gave me an opportunity to check the recovery of gluster volumes. I found that the replicated volumes came up just fine, but the non-replicated ones have not. I'm wondering if there's a better solution than simply blowing them away and creating fresh ones (especially to keep the half data set in the distributed volume).

The platform is Ubuntu 12.04, glusterfs 3.3.0. There are two nodes, dev-storage1/2, and four volumes:

* A distributed volume across the two nodes

    Volume Name: fast
    Type: Distribute
    Volume ID: 864fd12d-d879-4310-abaa-a2cb99b7f695
    Status: Started
    Number of Bricks: 2
    Transport-type: tcp
    Bricks:
    Brick1: dev-storage1:/disk/storage1/fast
    Brick2: dev-storage2:/disk/storage2/fast

* A replicated volume across the two nodes

    Volume Name: safe
    Type: Replicate
    Volume ID: 47a8f326-0e48-4a71-9cfe-f9ef8d555db7
    Status: Started
    Number of Bricks: 1 x 2 = 2
    Transport-type: tcp
    Bricks:
    Brick1: dev-storage1:/disk/storage1/safe
    Brick2: dev-storage2:/disk/storage2/safe

* Two single-brick volumes, one on each node

    Volume Name: single1
    Type: Distribute
    Volume ID: 74d62eb4-176e-4671-8471-779d909e19f0
    Status: Started
    Number of Bricks: 1
    Transport-type: tcp
    Bricks:
    Brick1: dev-storage1:/disk/storage1/single1

    Volume Name: single2
    Type: Distribute
    Volume ID: edab496f-c204-4122-ad10-c5f2e2ac92bd
    Status: Started
    Number of Bricks: 1
    Transport-type: tcp
    Bricks:
    Brick1: dev-storage2:/disk/storage2/single2

These four volumes are FUSE-mounted on /gluster/safe, /gluster/fast, /gluster/single1 and /gluster/single2 on both servers. The bricks share their underlying filesystems, i.e. dev-storage1:/disk/storage1 and dev-storage2:/disk/storage2.

Now, the filesystem dev-storage1:/disk/storage1 failed. I created a new filesystem mounted on dev-storage1:/disk/storage1, did mkdir /disk/storage1/{single1,safe,fast} and restarted glusterd.
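For reference, the replacement of the failed brick filesystem amounted to something like the following sketch (the device name /dev/md0 and the choice of XFS are assumptions for illustration, not details from my actual setup):

```shell
# Rebuild and mount a fresh filesystem where the failed brick lived
# (/dev/md0 and XFS are hypothetical here)
mkfs.xfs -f /dev/md0
mount /dev/md0 /disk/storage1

# Re-create the empty brick directories the volumes expect to find
mkdir -p /disk/storage1/{single1,safe,fast}

# Restart the gluster daemon so it re-exports the bricks
service glusterd restart
```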
After a couple of minutes, the contents of the replicated volume ("safe") were synchronised between the two nodes. That is,

    ls -lR /gluster/safe
    ls -lR /disk/storage1/safe   # on dev-storage1
    ls -lR /disk/storage2/safe   # on dev-storage2

all showed the same. This is excellent.

However, the other two volumes which depend on dev-storage1 are broken. As this is a dev system I could just blow them away, but I would like to use this as an exercise for fixing broken volumes, which I may have to do in production later. Here are the problems:

(1) The "single1" volume is empty, which I expected since it's a brand new empty directory, but I cannot create files in it.

    root at dev-storage1:~# touch /gluster/single1/test
    touch: cannot touch `/gluster/single1/test': Read-only file system

I guess gluster doesn't like the lack of metadata on this directory. Is there a quick recovery procedure here, or do I need to destroy the volume and recreate it?

(2) The "fast" (distributed) volume appears empty to the clients:

    root at dev-storage1:~# ls /gluster/fast
    root at dev-storage1:~#

However, there is still half the content available in the brick which didn't fail:

    root at dev-storage2:~# ls /disk/storage2/fast
    images  iso
    root at dev-storage2:~#

Although this is a test system, ideally I would like to reactivate this volume and make the half data set available. I guess I could destroy the volume, move the data to a safe place, create a new volume and copy in the data. Is there a more direct way?

Thanks, Brian.
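For the record, in 3.3 you don't have to wait for glustershd to notice the replicated volume on its own; you can kick off and monitor the heal explicitly (sketched below for the "safe" volume as named above):

```shell
# Trigger a self-heal pass on files that need it
gluster volume heal safe

# List files still pending heal, and files healed so far
gluster volume heal safe info
gluster volume heal safe info healed
```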
Brian Candler
2012-Jul-11 14:07 UTC
[Gluster-users] Recovering a broken distributed volume
On Wed, Jul 11, 2012 at 11:27:58AM +0100, Brian Candler wrote:
> (1) The "single1" volume is empty, which I expected since it's a brand new
> empty directory, but I cannot create files in it.
>
>     root at dev-storage1:~# touch /gluster/single1/test
>     touch: cannot touch `/gluster/single1/test': Read-only file system

Sorry, this was my problem: it turns out a few more drives failed, and the underlying brick filesystem went read-only. Unbelievably, that's 7 Seagate drives failed out of an array of 12!

Anyway, after rebuilding the array with the remaining 5 working disks, the single volume came up fine. Also, the distributed volume healed itself after I did 'ls' a few times on it.

    root at dev-storage1:~# ls /gluster/fast
    ...
    root at dev-storage1:~# ls /gluster/fast
    images  iso
    root at dev-storage1:~# ls /gluster/fast/images/
    root at dev-storage1:~# ls /gluster/fast/iso
    linuxmint-11-gnome-dvd-64bit.iso
    root at dev-storage1:~# ls /gluster/fast/images/
    lucidtest
    root at dev-storage1:~# ls /gluster/fast/images/lucidtest/
    tmpaJqTD9.qcow2

I can only see one other strange thing: the newly-created replica appears to have made a sparse copy of a file which wasn't sparse on the original.

On the original working side of the replicated volume:

    root at dev-storage2:~# ls -l /disk/storage2/safe/images/lucidtest/
    total 756108
    -rw-r--r-- 2 root root 774307840 Jul 11 14:55 tmpaJqTD9.qcow2
    root at dev-storage2:~# du -k /disk/storage2/safe/images/lucidtest/
    756116  /disk/storage2/safe/images/lucidtest/

On the newly-created side, which glustershd rebuilt automatically:

    root at dev-storage1:~# ls -l /disk/storage1/safe/images/lucidtest/
    total 422728
    -rw-r--r-- 2 root root 774307840 Jul 11 14:55 tmpaJqTD9.qcow2
    root at dev-storage1:~# du -k /disk/storage1/safe/images/lucidtest/
    422736  /disk/storage1/safe/images/lucidtest/

Is this intentional? Does glustershd notice runs of zeros and create a sparse file on the target? (This may or may not be desirable; e.g. for performance you might want to fully preallocate a VM image.)

Regards, Brian.
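A quick way to spot sparseness like this is to compare a file's apparent size against the blocks actually allocated, as ls vs du did above. Here is a small self-contained illustration with a deliberately sparse temp file (not the qcow2 image itself):

```shell
# Make a 10 MiB file that is entirely a hole
f=$(mktemp)
truncate -s 10M "$f"

# %s = apparent size in bytes, %b = 512-byte blocks actually allocated
apparent=$(stat -c %s "$f")
allocated=$(( $(stat -c %b "$f") * 512 ))
echo "apparent=$apparent allocated=$allocated"

rm -f "$f"
```

If allocated is well below apparent, the file is sparse; on the rebuilt brick above, 422736 KiB allocated for a 774307840-byte file shows exactly that.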
Arnold Krille
2012-Jul-11 21:01 UTC
[Gluster-users] Recovering a broken distributed volume
On 11.07.2012 22:37, Mailing Lists wrote:
> I had that some years ago on two servers at a customer's office, 2 disks
> in each in raid 1 so 4 disks. Same series ... failing in the same
> afternoon after 10 months of service!

I can only repeat myself: most people argue "but it's two devices, it's statistically independent". Well, two devices(*) manufactured at the same time and on the same assembly line (preferably with consecutive serial numbers), running the same firmware version, bought at the same time, used in the same array with the same external stress and the same usage patterns. Even all my non-mathematical, non-informatics friends know that this isn't what you call "statistically independent".

(*) Doesn't matter if it's disks or switches or motherboards or processors or memory chips or power supplies or UPSes or backplanes or power-distribution boards. If you "remove" your SPOF by using two of the exact same kind, statistics (and sad experience) say that it's still a SPOF.

So, subs (or is it "Mailing Lists"? :) and probably Brian, thanks for another data point for this never-happens-in-real-life scenario. We feel with you for such scenarios.

Have fun,

Arnold

--
This email was created electronically and is valid without a handwritten signature.