Alex K
2018-May-29 20:27 UTC
[Gluster-users] [ovirt-users] Re: Gluster problems, cluster performance issues
I would check the disk status and the accessibility of the mount points where
your gluster volumes reside.
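For example, something along these lines on each host should show whether the
physical disks and the gluster mounts are still healthy. The brick path and
volume name below are taken from the output further down in this thread; sdX
is a placeholder for whatever disk actually backs the brick, and lsblk will
show which device the dm-2 from your kernel errors maps to:

  dmesg | grep -iE 'blk_update_request|I/O error'   # kernel-level disk errors
  smartctl -H /dev/sdX                              # SMART health of the backing disk (sdX is a placeholder)
  lsblk                                             # shows which LV/disk dm-2 actually is
  df -h /gluster/brick3                             # is the brick filesystem still mounted?
  gluster volume status data-hdd                    # are all bricks and self-heal daemons online?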
On Tue, May 29, 2018, 22:28 Jim Kusznir <jim at palousetech.com> wrote:

> On one ovirt server, I'm now seeing these messages:
> [56474.239725] blk_update_request: 63 callbacks suppressed
> [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0
> [56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472
> [56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584
> [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048
> [56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424
> [56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536
> [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0
> [56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472
> [56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584
> [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048
>
> On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir <jim at palousetech.com> wrote:
>
>> I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2):
>>
>> May 29 11:54:41 ovirt3 ovs-vsctl:
>> ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database
>> connection failed (No such file or directory)
>> May 29 11:54:51 ovirt3 ovs-vsctl:
>> ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database
>> connection failed (No such file or directory)
>> May 29 11:55:01 ovirt3 ovs-vsctl:
>> ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database
>> connection failed (No such file or directory)
>> (appears a lot).
>>
>> I also found, in the ssh session on that host, some sysv warnings about the
>> backing disk for one of the gluster volumes (straight replica 3). The
>> glusterfs process for that disk on that machine went offline. It's my
>> understanding that it should continue to work with the other two machines
>> while I attempt to replace that disk, right? Attempted writes (touching an
>> empty file) can take 15 seconds; repeating the write later is much faster.
>>
>> Gluster generates a bunch of different log files; I don't know which ones
>> you want, or from which machine(s).
>>
>> How do I do "volume profiling"?
>>
>> Thanks!
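To answer the volume profiling question: profiling is built into the gluster
CLI and can be run from any of the hosts. Roughly (data-hdd is the volume
name from your heal output; the /tmp path is just an example):

  gluster volume profile data-hdd start
  # reproduce the slowness for a few minutes, then dump the counters and stop:
  gluster volume profile data-hdd info > /tmp/data-hdd-profile.txt
  gluster volume profile data-hdd stop

The mount logs Sahina asked about are under /var/log/glusterfs/ on each host,
in a file named after the mount point (for an oVirt storage domain it is
usually something like rhev-data-center-mnt-glusterSD-<server>:_data-hdd.log).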
>> On Tue, May 29, 2018 at 11:53 AM, Sahina Bose <sabose at redhat.com> wrote:
>>
>>> Do you see errors reported in the mount logs for the volume? If so,
>>> could you attach the logs?
>>> Any issues with your underlying disks? Can you also attach the output of
>>> volume profiling?
>>>
>>> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir <jim at palousetech.com>
>>> wrote:
>>>
>>>> Ok, things have gotten MUCH worse this morning. I'm getting random
>>>> errors from VMs; right now about a third of my VMs have been paused due to
>>>> storage issues, and most of the remaining VMs are not performing well.
>>>>
>>>> At this point I am in full EMERGENCY mode, as my production services
>>>> are now impacted, and I'm getting calls coming in with problems...
>>>>
>>>> I'd greatly appreciate help... VMs are running VERY slowly (when they
>>>> run), and they are steadily getting worse. I don't know why. I was seeing
>>>> CPU peaks (to 100%) on several VMs, in perfect sync, for a few minutes at a
>>>> time (while the VM became unresponsive and any Linux VMs I was logged into
>>>> were giving me the CPU-stuck messages from my original post). Is all this
>>>> storage related?
>>>>
>>>> I also have two different gluster volumes for VM storage, and only one
>>>> had the issues, but now VMs in both are being affected at the same time
>>>> and in the same way.
>>>>
>>>> --Jim
>>>>
>>>> On Mon, May 28, 2018 at 10:50 PM, Sahina Bose <sabose at redhat.com>
>>>> wrote:
>>>>
>>>>> [Adding gluster-users to look at the heal issue]
>>>>>
>>>>> On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir <jim at palousetech.com>
>>>>> wrote:
>>>>>
>>>>>> Hello:
>>>>>>
>>>>>> I've been having some cluster and gluster performance issues lately.
>>>>>> I also found that my cluster was out of date and was trying to apply
>>>>>> updates (hoping to fix some of these), and discovered the ovirt 4.1 repos
>>>>>> were taken completely offline. So I was forced to begin an upgrade to
>>>>>> 4.2. According to the docs I found/read, I needed only to add the new repo,
>>>>>> do a yum update, reboot, and be good on my hosts (did the yum update, plus
>>>>>> the engine-setup on my hosted engine). Things seemed to work relatively
>>>>>> well, except for a gluster sync issue that showed up.
>>>>>>
>>>>>> My cluster is a 3-node hyperconverged cluster. I upgraded the hosted
>>>>>> engine first, then engine 3. When engine 3 came back up, for some reason
>>>>>> one of my gluster volumes would not sync. Here's sample output:
>>>>>>
>>>>>> [root at ovirt3 ~]# gluster volume heal data-hdd info
>>>>>> Brick 172.172.1.11:/gluster/brick3/data-hdd
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>> Status: Connected
>>>>>> Number of entries: 8
>>>>>>
>>>>>> Brick 172.172.1.12:/gluster/brick3/data-hdd
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>> Status: Connected
>>>>>> Number of entries: 8
>>>>>>
>>>>>> Brick 172.172.1.13:/gluster/brick3/data-hdd
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>> Status: Connected
>>>>>> Number of entries: 8
>>>>>>
>>>>>> ---------
>>>>>> It's been in this state for a couple of days now, and bandwidth
>>>>>> monitoring shows no appreciable data moving. I've tried repeatedly
>>>>>> commanding a full heal from all three nodes in the cluster. It's always
>>>>>> the same files that need healing.
>>>>>>
>>>>>> When running gluster volume heal data-hdd statistics, I sometimes see
>>>>>> different information, but always some number of "heal failed" entries.
>>>>>> It shows 0 for split brain.
>>>>>>
>>>>>> I'm not quite sure what to do. I suspect it may be due to nodes 1
>>>>>> and 2 still being on the older ovirt/gluster release, but I'm afraid to
>>>>>> upgrade and reboot them until I have a good gluster sync (don't want to
>>>>>> create a split-brain issue). How do I proceed with this?
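On the heal issue: since the full heal is not making progress, I would first
confirm that every brick and every self-heal daemon is actually online, and
then look at what the self-heal daemon log says about those eight files.
Something like this (volume name again taken from your output):

  gluster volume status data-hdd                      # every brick and "Self-heal Daemon" entry should be online
  gluster volume heal data-hdd statistics heal-count
  gluster volume heal data-hdd info split-brain
  gluster volume heal data-hdd full                   # retrigger once everything is up
  less /var/log/glusterfs/glustershd.log              # self-heal daemon log; look for why the heals fail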
>>>>>> Second issue: I've been experiencing VERY POOR performance on most of
>>>>>> my VMs. To the tune that logging into a Windows 10 VM via remote desktop
>>>>>> can take 5 minutes, and launching QuickBooks inside said VM can easily
>>>>>> take 10 minutes. On some Linux VMs, I get random messages like this:
>>>>>> Message from syslogd at unifi at May 28 20:39:23 ...
>>>>>> kernel:[6171996.308904] NMI watchdog: BUG: soft lockup - CPU#0 stuck
>>>>>> for 22s! [mongod:14766]
>>>>>>
>>>>>> (the process and PID are often different)
>>>>>>
>>>>>> I'm not quite sure what to do about this either. My initial thought
>>>>>> was to upgrade everything to current and see if it's still there, but I
>>>>>> cannot move forward with that until my gluster is healed...
>>>>>>
>>>>>> Thanks!
>>>>>> --Jim
>>>>>>
>>>>>> _______________________________________________
>>>>>> Users mailing list -- users at ovirt.org
>>>>>> To unsubscribe send an email to users-leave at ovirt.org
>>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>>>> oVirt Code of Conduct:
>>>>>> https://www.ovirt.org/community/about/community-guidelines/
>>>>>> List Archives:
>>>>>> https://lists.ovirt.org/archives/list/users at ovirt.org/message/3LEV6ZQ3JV2XLAL7NYBTXOYMYUOTIRQF/
>>>>>>
>>>>>
>>>>
>>>
>>
>
> _______________________________________________
> Users mailing list -- users at ovirt.org
> To unsubscribe send an email to users-leave at ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/users at ovirt.org/message/ACO7RFSLBSRBAIONIC2HQ6Z24ZDES5MF/
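On the slow VMs and the soft-lockup messages: in a hyperconverged setup those
can simply mean the guests are stalled waiting on storage rather than anything
CPU related. While a VM is hanging, it is worth watching iowait and disk
latency on the hosts (iostat comes from the sysstat package):

  top -b -n1 | head -5          # the "wa" value in the %Cpu(s) line is iowait
  iostat -xm 5                  # high await / %util on the brick disk points back at the storage layer

If the disk behind dm-2 really is failing, that alone could explain both the
heal backlog and the VM pauses.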