Pranith Kumar Karampuri
2017-Mar-28 18:19 UTC
[Gluster-users] Gluster 3.8.10 rebalance VMs corruption
On Mon, Mar 27, 2017 at 11:29 PM, Mahdi Adnan <mahdi.adnan at outlook.com> wrote:
> Hi,
>
> Do you guys have any update regarding this issue ?

I do not actively work on this issue, so I do not have an accurate update. From what I heard from Krutika and Raghavendra (who works on DHT): Krutika debugged it initially and found that the issue seems more likely to be in DHT; Satheesaran, who helped us recreate the issue in the lab, found that just fix-layout without rebalance also caused the corruption 1 out of 3 times. Raghavendra came up with a possible RCA for why this can happen. Raghavendra (CCed) would be the right person to provide an accurate update.

> --
> Respectfully
> Mahdi A. Mahdi
>
> From: Krutika Dhananjay <kdhananj at redhat.com>
> Sent: Tuesday, March 21, 2017 3:02:55 PM
> To: Mahdi Adnan
> Cc: Nithya Balachandran; Gowdappa, Raghavendra; Susant Palai; gluster-users at gluster.org List
> Subject: Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption
>
> Hi,
>
> So it looks like Satheesaran managed to recreate this issue. We will be seeking his help in debugging this. It will be easier that way.
>
> -Krutika
>
> On Tue, Mar 21, 2017 at 1:35 PM, Mahdi Adnan <mahdi.adnan at outlook.com> wrote:
>
>> Hello, and thank you for your email.
>> Actually no, I didn't check the GFIDs of the VMs.
>> If this will help, I can set up a new test cluster and get all the data you need.
>>
>> Get Outlook for Android <https://aka.ms/ghei36>
>>
>> From: Nithya Balachandran
>> Sent: Monday, March 20, 20:57
>> Subject: Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption
>> To: Krutika Dhananjay
>> Cc: Mahdi Adnan, Gowdappa, Raghavendra, Susant Palai, gluster-users at gluster.org List
>>
>> Hi,
>>
>> Do you know the GFIDs of the VM images which were corrupted?
>>
>> Regards,
>> Nithya
>>
>> On 20 March 2017 at 20:37, Krutika Dhananjay <kdhananj at redhat.com> wrote:
>>
>> I looked at the logs.
>>
>> From the time the new graph (since the add-brick command you shared, where bricks 41 through 44 are added) is switched to (line 3011 onwards in nfs-gfapi.log), I see the following kinds of errors:
>>
>> 1. Lookups to a bunch of files failed with ENOENT on both replicas, which protocol/client converts to ESTALE. I am guessing these entries got migrated to other subvolumes, leading to 'No such file or directory' errors.
>>
>> DHT and thereafter shard get the same error code and log the following:
>>
>> [2017-03-17 14:04:26.353444] E [MSGID: 109040] [dht-helper.c:1198:dht_migration_complete_check_task] 17-vmware2-dht: <gfid:a68ce411-e381-46a3-93cd-d2af6a7c3532>: failed to lookup the file on vmware2-dht [Stale file handle]
>>
>> [2017-03-17 14:04:26.353528] E [MSGID: 133014] [shard.c:1253:shard_common_stat_cbk] 17-vmware2-shard: stat failed: a68ce411-e381-46a3-93cd-d2af6a7c3532 [Stale file handle]
>>
>> which is fine.
>>
>> 2. The other kind are from AFR logging possible split-brain, which I suppose is harmless too:
>>
>> [2017-03-17 14:23:36.968883] W [MSGID: 108008] [afr-read-txn.c:228:afr_read_txn] 17-vmware2-replicate-13: Unreadable subvolume -1 found with event generation 2 for gfid 74d49288-8452-40d4-893e-ff4672557ff9. (Possible split-brain)
>>
>> Since you are saying the bug is hit only on VMs that are undergoing IO while rebalance is running (as opposed to those that remained powered off), rebalance + IO could be causing some issues.
>>
>> CC'ing DHT devs.
>>
>> Raghavendra/Nithya/Susant,
>>
>> Could you take a look?
>>
>> -Krutika
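For anyone checking their own setup against the errors above, the three message IDs can be pulled out of the gfapi client log with a plain grep. The directory below is an assumption; use whatever path holds the nfs-gfapi.log referenced above.

  # Count DHT migration-check (109040), shard stat (133014) and AFR
  # possible-split-brain (108008) messages in the gfapi client log.
  grep -cE 'MSGID: (109040|133014|108008)' /var/log/glusterfs/nfs-gfapi.log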
>> On Sun, Mar 19, 2017 at 4:55 PM, Mahdi Adnan <mahdi.adnan at outlook.com> wrote:
>>
>> Thank you for your email, mate.
>>
>> Yes, I'm aware of this, but to save costs I chose replica 2; this cluster is all flash.
>>
>> In version 3.7.x I had issues with ping timeout: if one host went down for a few seconds, the whole cluster would hang and become unavailable. To avoid this I adjusted the ping timeout to 5 seconds.
>>
>> As for choosing Ganesha over gfapi, VMware does not support Gluster (FUSE or gfapi), so I'm stuck with NFS for this volume.
>>
>> The other volume is mounted using gfapi in an oVirt cluster.
>>
>> --
>> Respectfully
>> Mahdi A. Mahdi
>>
>> From: Krutika Dhananjay <kdhananj at redhat.com>
>> Sent: Sunday, March 19, 2017 2:01:49 PM
>> To: Mahdi Adnan
>> Cc: gluster-users at gluster.org
>> Subject: Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption
>>
>> While I'm still going through the logs, just wanted to point out a couple of things:
>>
>> 1. It is recommended that you use 3-way replication (replica count 3) for the VM store use case.
>>
>> 2. network.ping-timeout at 5 seconds is way too low. Please change it to 30.
>>
>> Is there any specific reason for using NFS-Ganesha over gfapi/FUSE?
>>
>> Will get back with anything else I might find, or more questions if I have any.
>>
>> -Krutika
>>
>> On Sun, Mar 19, 2017 at 2:36 PM, Mahdi Adnan <mahdi.adnan at outlook.com> wrote:
>>
>> Thanks mate,
>>
>> Kindly, check the attachment.
>>
>> --
>> Respectfully
>> Mahdi A. Mahdi
>>
>> From: Krutika Dhananjay <kdhananj at redhat.com>
>> Sent: Sunday, March 19, 2017 10:00:22 AM
>> To: Mahdi Adnan
>> Cc: gluster-users at gluster.org
>> Subject: Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption
>>
>> In that case could you share the ganesha-gfapi logs?
>>
>> -Krutika
>>
>> On Sun, Mar 19, 2017 at 12:13 PM, Mahdi Adnan <mahdi.adnan at outlook.com> wrote:
>>
>> I have two volumes: one is mounted using libgfapi for the oVirt mount, the other one is exported via NFS-Ganesha for VMware, which is the one I'm testing now.
>>
>> --
>> Respectfully
>> Mahdi A. Mahdi
>>
>> From: Krutika Dhananjay <kdhananj at redhat.com>
>> Sent: Sunday, March 19, 2017 8:02:19 AM
>> To: Mahdi Adnan
>> Cc: gluster-users at gluster.org
>> Subject: Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption
>>
>> On Sat, Mar 18, 2017 at 10:36 PM, Mahdi Adnan <mahdi.adnan at outlook.com> wrote:
>>
>> Kindly, check the attached new log file; I don't know if it's helpful or not, but I couldn't find the log with the name you just described.
>>
>> No. Are you using FUSE or libgfapi for accessing the volume? Or is it NFS?
>>
>> -Krutika
>>
>> --
>> Respectfully
>> Mahdi A. Mahdi
>>
>> From: Krutika Dhananjay <kdhananj at redhat.com>
>> Sent: Saturday, March 18, 2017 6:10:40 PM
>> To: Mahdi Adnan
>> Cc: gluster-users at gluster.org
>> Subject: Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption
>>
>> mnt-disk11-vmware2.log seems like a brick log. Could you attach the fuse mount logs? They should be right under the /var/log/glusterfs/ directory, named after the mount point name, only hyphenated.
>>
>> -Krutika
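To illustrate the naming convention Krutika describes, a FUSE mount's client log takes its name from the mount path with the slashes replaced by hyphens. The mount point below is a hypothetical example, not one from this thread.

  # A volume FUSE-mounted at /mnt/vmstore (hypothetical path) writes its
  # client log to /var/log/glusterfs/mnt-vmstore.log on that client.
  mount -t glusterfs gluster01:/vmware2 /mnt/vmstore
  less /var/log/glusterfs/mnt-vmstore.log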
>> On Sat, Mar 18, 2017 at 7:27 PM, Mahdi Adnan <mahdi.adnan at outlook.com> wrote:
>>
>> Hello Krutika,
>>
>> Kindly, check the attached logs.
>>
>> --
>> Respectfully
>> Mahdi A. Mahdi
>>
>> From: Krutika Dhananjay <kdhananj at redhat.com>
>> Sent: Saturday, March 18, 2017 3:29:03 PM
>> To: Mahdi Adnan
>> Cc: gluster-users at gluster.org
>> Subject: Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption
>>
>> Hi Mahdi,
>>
>> Could you attach the mount, brick and rebalance logs?
>>
>> -Krutika
>>
>> On Sat, Mar 18, 2017 at 12:14 AM, Mahdi Adnan <mahdi.adnan at outlook.com> wrote:
>>
>> Hi,
>>
>> I upgraded to Gluster 3.8.10 today and ran the add-brick procedure on a volume containing a few VMs.
>>
>> After the rebalance completed, I rebooted the VMs; some ran just fine, and others just crashed.
>>
>> Windows boots to recovery mode, and Linux throws XFS errors and does not boot.
>>
>> I ran the test again and it went just like the first time, but I noticed that only VMs doing disk IO are affected by this bug.
>>
>> The VMs that were powered off started fine, and even the md5 of the disk file did not change after the rebalance.
>>
>> Can anyone else confirm this?
>>
>> Volume info:
>>
>> Volume Name: vmware2
>> Type: Distributed-Replicate
>> Volume ID: 02328d46-a285-4533-aa3a-fb9bfeb688bf
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 22 x 2 = 44
>> Transport-type: tcp
>> Bricks:
>> Brick1: gluster01:/mnt/disk1/vmware2
>> Brick2: gluster03:/mnt/disk1/vmware2
>> Brick3: gluster02:/mnt/disk1/vmware2
>> Brick4: gluster04:/mnt/disk1/vmware2
>> Brick5: gluster01:/mnt/disk2/vmware2
>> Brick6: gluster03:/mnt/disk2/vmware2
>> Brick7: gluster02:/mnt/disk2/vmware2
>> Brick8: gluster04:/mnt/disk2/vmware2
>> Brick9: gluster01:/mnt/disk3/vmware2
>> Brick10: gluster03:/mnt/disk3/vmware2
>> Brick11: gluster02:/mnt/disk3/vmware2
>> Brick12: gluster04:/mnt/disk3/vmware2
>> Brick13: gluster01:/mnt/disk4/vmware2
>> Brick14: gluster03:/mnt/disk4/vmware2
>> Brick15: gluster02:/mnt/disk4/vmware2
>> Brick16: gluster04:/mnt/disk4/vmware2
>> Brick17: gluster01:/mnt/disk5/vmware2
>> Brick18: gluster03:/mnt/disk5/vmware2
>> Brick19: gluster02:/mnt/disk5/vmware2
>> Brick20: gluster04:/mnt/disk5/vmware2
>> Brick21: gluster01:/mnt/disk6/vmware2
>> Brick22: gluster03:/mnt/disk6/vmware2
>> Brick23: gluster02:/mnt/disk6/vmware2
>> Brick24: gluster04:/mnt/disk6/vmware2
>> Brick25: gluster01:/mnt/disk7/vmware2
>> Brick26: gluster03:/mnt/disk7/vmware2
>> Brick27: gluster02:/mnt/disk7/vmware2
>> Brick28: gluster04:/mnt/disk7/vmware2
>> Brick29: gluster01:/mnt/disk8/vmware2
>> Brick30: gluster03:/mnt/disk8/vmware2
>> Brick31: gluster02:/mnt/disk8/vmware2
>> Brick32: gluster04:/mnt/disk8/vmware2
>> Brick33: gluster01:/mnt/disk9/vmware2
>> Brick34: gluster03:/mnt/disk9/vmware2
>> Brick35: gluster02:/mnt/disk9/vmware2
>> Brick36: gluster04:/mnt/disk9/vmware2
>> Brick37: gluster01:/mnt/disk10/vmware2
>> Brick38: gluster03:/mnt/disk10/vmware2
>> Brick39: gluster02:/mnt/disk10/vmware2
>> Brick40: gluster04:/mnt/disk10/vmware2
>> Brick41: gluster01:/mnt/disk11/vmware2
>> Brick42: gluster03:/mnt/disk11/vmware2
>> Brick43: gluster02:/mnt/disk11/vmware2
>> Brick44: gluster04:/mnt/disk11/vmware2
>> Options Reconfigured:
>> cluster.server-quorum-type: server
>> nfs.disable: on
>> performance.readdir-ahead: on
>> transport.address-family: inet
>> performance.quick-read: off
>> performance.read-ahead: off
>> performance.io-cache: off
>> performance.stat-prefetch: off
>> cluster.eager-lock: enable
>> network.remote-dio: enable
>> features.shard: on
>> cluster.data-self-heal-algorithm: full
>> features.cache-invalidation: on
>> ganesha.enable: on
>> features.shard-block-size: 256MB
>> client.event-threads: 2
>> server.event-threads: 2
>> cluster.favorite-child-policy: size
>> storage.build-pgfid: off
>> network.ping-timeout: 5
>> cluster.enable-shared-storage: enable
>> nfs-ganesha: enable
>> cluster.server-quorum-ratio: 51%
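Note that network.ping-timeout is still 5 here and the volume is replica 2, while Krutika's recommendations earlier in the thread are 30 seconds and 3-way replication. A minimal sketch of how those changes would be applied with the standard gluster CLI follows; it is an illustration, not a step from this thread.

  # Raise the ping timeout from 5 to the recommended 30 seconds.
  gluster volume set vmware2 network.ping-timeout 30

  # Moving to 3-way replication is done with an add-brick call of the form
  #   gluster volume add-brick vmware2 replica 3 <one new brick per pair>...
  # which, for this 22 x 2 layout, needs 22 new bricks in a single command.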
>> Adding bricks:
>>
>> gluster volume add-brick vmware2 replica 2 gluster01:/mnt/disk11/vmware2 gluster03:/mnt/disk11/vmware2 gluster02:/mnt/disk11/vmware2 gluster04:/mnt/disk11/vmware2
>>
>> Starting fix-layout:
>>
>> gluster volume rebalance vmware2 fix-layout start
>>
>> Starting rebalance:
>>
>> gluster volume rebalance vmware2 start
>>
>> --
>> Respectfully
>> Mahdi A. Mahdi

--
Pranith
Gandalf Corvotempesta
2017-Mar-29 07:02 UTC
[Gluster-users] Gluster 3.8.10 rebalance VMs corruption
Are rebalance and fix-layout needed when adding new bricks? Is there any workaround for extending a cluster without losing data?

On 28 Mar 2017 at 8:19 PM, "Pranith Kumar Karampuri" <pkarampu at redhat.com> wrote:

> On Mon, Mar 27, 2017 at 11:29 PM, Mahdi Adnan <mahdi.adnan at outlook.com> wrote:
>
>> Hi,
>>
>> Do you guys have any update regarding this issue ?
>
> I do not actively work on this issue, so I do not have an accurate update. From what I heard from Krutika and Raghavendra (who works on DHT): Krutika debugged it initially and found that the issue seems more likely to be in DHT; Satheesaran, who helped us recreate the issue in the lab, found that just fix-layout without rebalance also caused the corruption 1 out of 3 times. Raghavendra came up with a possible RCA for why this can happen. Raghavendra (CCed) would be the right person to provide an accurate update.
>
> [...]
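For reference, the sequence that preceded the corruption reported earlier in this thread was an add-brick followed by fix-layout and then a full rebalance; progress can be checked with the rebalance status subcommand. The commands below reuse the volume and brick names from that report.

  gluster volume add-brick vmware2 replica 2 gluster01:/mnt/disk11/vmware2 gluster03:/mnt/disk11/vmware2 gluster02:/mnt/disk11/vmware2 gluster04:/mnt/disk11/vmware2
  gluster volume rebalance vmware2 fix-layout start
  gluster volume rebalance vmware2 start

  # Watch migration progress until the rebalance reports completed.
  gluster volume rebalance vmware2 status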