Erik Jacobson
2020-Mar-30 16:54 UTC
[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
> Hi Erik,
> Sadly I didn't have the time to take a look in your logs, but I would like to ask you whether you have statistics of the network bandwidth usage.
> Could it be possible that the gNFS server is starved for bandwidth and fails to reach all bricks, leading to 'split-brain' errors?

I understand. I doubt there is a bandwidth issue, but I'll add this to my checks. We normally have 288 nodes per server and they run fine with all servers up. The 76 number is just what we happened to have access to on an internal system.

Question: what you mentioned above, and a feeling I have personally too -- is the split-brain error actually a generic catch-all for not being able to access a file? So when it says "split-brain", could it really mean any type of access error? Could it also be given when there is an I/O timeout or something?

I'm starting to break open the source code to look around, but I think my head will explode before I understand it enough. I will still give it a shot.

I have access to this system until later tonight. Then it goes away. We have duplicated the problem on another system that stays, but that machine is so contended for internally that I wouldn't get a time slot until later in the week anyway. Trying to make as much use of this "gift" machine as I can :) :)

Thanks again for the replies so far.

Erik
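[For anyone following along: one way to test both points above -- whether the "split-brain" message reflects a real split-brain and whether bandwidth is the bottleneck -- is with the standard gluster CLI. This is only a sketch; VOLNAME is a placeholder for the actual volume name.]

```shell
# Ask gluster which files it actually considers split-brain
# (VOLNAME is a placeholder for the real volume name):
gluster volume heal VOLNAME info split-brain

# If this lists no entries while gnfs keeps logging "split-brain",
# the message is more likely a transient read/access failure than
# a true split-brain on disk.

# Per-brick I/O statistics, to check the bandwidth-starvation theory:
gluster volume profile VOLNAME start
gluster volume profile VOLNAME info   # per-brick latency and FOP stats
gluster volume profile VOLNAME stop

# Raw NIC throughput on each server, sampled every 5 seconds, 12 samples:
sar -n DEV 5 12
```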
Strahil Nikolov
2020-Mar-30 19:35 UTC
[Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request
On March 30, 2020 7:54:59 PM GMT+03:00, Erik Jacobson <erik.jacobson at hpe.com> wrote:
> [full message quoted above - snipped]

Hey Erik,

Sadly I am not a developer, so I can't answer your questions. Still, bandwidth starvation looks like a possible cause (at least to me), although in that case error messages and timeouts should fill the logs.

I recommend increasing the logging for both the bricks and the volume to the maximum and trying to reproduce the issue. Keep in mind that the logs can grow very fast.

Best Regards,
Strahil Nikolov
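[Strahil's suggestion can be done with the `diagnostics.*` volume options. A minimal sketch, with VOLNAME as a placeholder for the real volume name; TRACE is the most verbose level, DEBUG is a lighter alternative.]

```shell
# Raise brick-side and client-side (gnfs) logging to TRACE:
gluster volume set VOLNAME diagnostics.brick-log-level TRACE
gluster volume set VOLNAME diagnostics.client-log-level TRACE

# ... reproduce the failure with one server down, then collect
#     /var/log/glusterfs/ from all servers ...

# Logs grow very fast at TRACE; reset to the defaults afterwards:
gluster volume reset VOLNAME diagnostics.brick-log-level
gluster volume reset VOLNAME diagnostics.client-log-level
```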