Alex K
2018-May-29 20:27 UTC
[Gluster-users] [ovirt-users] Re: Gluster problems, cluster performance issues
I would check the disk status and the accessibility of the mount points where
your gluster volumes reside.
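For example, something along these lines on each host should show whether the
physical disks and the gluster mounts are still healthy. The brick path and
volume name below are taken from the output further down in this thread; sdX
is a placeholder for whatever disk actually backs the brick, and lsblk will
show which device the dm-2 from your kernel errors maps to:

  dmesg | grep -iE 'blk_update_request|I/O error'   # kernel-level disk errors
  smartctl -H /dev/sdX                              # SMART health of the backing disk (sdX is a placeholder)
  lsblk                                             # shows which LV/disk dm-2 actually is
  df -h /gluster/brick3                             # is the brick filesystem still mounted?
  gluster volume status data-hdd                    # are all bricks and self-heal daemons online?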
On Tue, May 29, 2018, 22:28 Jim Kusznir <jim at palousetech.com> wrote:

> On one ovirt server, I'm now seeing these messages:
> [56474.239725] blk_update_request: 63 callbacks suppressed
> [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0
> [56474.240602] blk_update_request: I/O error, dev dm-2, sector 3905945472
> [56474.241346] blk_update_request: I/O error, dev dm-2, sector 3905945584
> [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048
> [56474.243072] blk_update_request: I/O error, dev dm-2, sector 3905943424
> [56474.243997] blk_update_request: I/O error, dev dm-2, sector 3905943536
> [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0
> [56474.248315] blk_update_request: I/O error, dev dm-2, sector 3905945472
> [56474.249231] blk_update_request: I/O error, dev dm-2, sector 3905945584
> [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048
>
> On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir <jim at palousetech.com> wrote:
>
>> I see in messages on ovirt3 (my 3rd machine, the one upgraded to 4.2):
>>
>> May 29 11:54:41 ovirt3 ovs-vsctl:
>> ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database
>> connection failed (No such file or directory)
>> May 29 11:54:51 ovirt3 ovs-vsctl:
>> ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database
>> connection failed (No such file or directory)
>> May 29 11:55:01 ovirt3 ovs-vsctl:
>> ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: database
>> connection failed (No such file or directory)
>> (appears a lot).
>>
>> I also found, in the ssh session on that host, some sysv warnings about the
>> backing disk for one of the gluster volumes (straight replica 3). The
>> glusterfs process for that disk on that machine went offline. It's my
>> understanding that it should continue to work with the other two machines
>> while I attempt to replace that disk, right? Attempted writes (touching an
>> empty file) can take 15 seconds; repeating the write later is much faster.
>>
>> Gluster generates a bunch of different log files; I don't know which ones
>> you want, or from which machine(s).
>>
>> How do I do "volume profiling"?
>>
>> Thanks!
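To answer the volume profiling question: profiling is built into the gluster
CLI and can be run from any of the hosts. Roughly (data-hdd is the volume
name from your heal output; the /tmp path is just an example):

  gluster volume profile data-hdd start
  # reproduce the slowness for a few minutes, then dump the counters and stop:
  gluster volume profile data-hdd info > /tmp/data-hdd-profile.txt
  gluster volume profile data-hdd stop

The mount logs Sahina asked about are under /var/log/glusterfs/ on each host,
in a file named after the mount point (for an oVirt storage domain it is
usually something like rhev-data-center-mnt-glusterSD-<server>:_data-hdd.log).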
>> On Tue, May 29, 2018 at 11:53 AM, Sahina Bose <sabose at redhat.com> wrote:
>>
>>> Do you see errors reported in the mount logs for the volume? If so,
>>> could you attach the logs?
>>> Any issues with your underlying disks? Can you also attach the output of
>>> volume profiling?
>>>
>>> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir <jim at palousetech.com>
>>> wrote:
>>>
>>>> Ok, things have gotten MUCH worse this morning. I'm getting random
>>>> errors from VMs; right now about a third of my VMs have been paused due to
>>>> storage issues, and most of the remaining VMs are not performing well.
>>>>
>>>> At this point I am in full EMERGENCY mode, as my production services
>>>> are now impacted, and I'm getting calls coming in with problems...
>>>>
>>>> I'd greatly appreciate help... VMs are running VERY slowly (when they
>>>> run), and they are steadily getting worse. I don't know why. I was seeing
>>>> CPU peaks (to 100%) on several VMs, in perfect sync, for a few minutes at a
>>>> time (while the VM became unresponsive and any Linux VMs I was logged into
>>>> were giving me the CPU-stuck messages from my original post). Is all this
>>>> storage related?
>>>>
>>>> I also have two different gluster volumes for VM storage, and only one
>>>> had the issues, but now VMs in both are being affected at the same time
>>>> and in the same way.
>>>>
>>>> --Jim
>>>>
>>>> On Mon, May 28, 2018 at 10:50 PM, Sahina Bose <sabose at redhat.com>
>>>> wrote:
>>>>
>>>>> [Adding gluster-users to look at the heal issue]
>>>>>
>>>>> On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir <jim at palousetech.com>
>>>>> wrote:
>>>>>
>>>>>> Hello:
>>>>>>
>>>>>> I've been having some cluster and gluster performance issues lately.
>>>>>> I also found that my cluster was out of date and was trying to apply
>>>>>> updates (hoping to fix some of these), and discovered the ovirt 4.1 repos
>>>>>> were taken completely offline. So I was forced to begin an upgrade to
>>>>>> 4.2. According to the docs I found/read, I needed only to add the new repo,
>>>>>> do a yum update, reboot, and be good on my hosts (did the yum update, plus
>>>>>> the engine-setup on my hosted engine). Things seemed to work relatively
>>>>>> well, except for a gluster sync issue that showed up.
>>>>>>
>>>>>> My cluster is a 3-node hyperconverged cluster. I upgraded the hosted
>>>>>> engine first, then engine 3. When engine 3 came back up, for some reason
>>>>>> one of my gluster volumes would not sync. Here's sample output:
>>>>>>
>>>>>> [root at ovirt3 ~]# gluster volume heal data-hdd info
>>>>>> Brick 172.172.1.11:/gluster/brick3/data-hdd
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>> Status: Connected
>>>>>> Number of entries: 8
>>>>>>
>>>>>> Brick 172.172.1.12:/gluster/brick3/data-hdd
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>> Status: Connected
>>>>>> Number of entries: 8
>>>>>>
>>>>>> Brick 172.172.1.13:/gluster/brick3/data-hdd
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-42a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-446b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba
>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-44f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625
>>>>>> Status: Connected
>>>>>> Number of entries: 8
>>>>>>
>>>>>> ---------
>>>>>> It's been in this state for a couple of days now, and bandwidth
>>>>>> monitoring shows no appreciable data moving. I've tried repeatedly
>>>>>> commanding a full heal from all three nodes in the cluster. It's always
>>>>>> the same files that need healing.
>>>>>>
>>>>>> When running gluster volume heal data-hdd statistics, I sometimes see
>>>>>> different information, but always some number of "heal failed" entries.
>>>>>> It shows 0 for split brain.
>>>>>>
>>>>>> I'm not quite sure what to do. I suspect it may be due to nodes 1
>>>>>> and 2 still being on the older ovirt/gluster release, but I'm afraid to
>>>>>> upgrade and reboot them until I have a good gluster sync (don't want to
>>>>>> create a split-brain issue). How do I proceed with this?
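On the heal issue: since the full heal is not making progress, I would first
confirm that every brick and every self-heal daemon is actually online, and
then look at what the self-heal daemon log says about those eight files.
Something like this (volume name again taken from your output):

  gluster volume status data-hdd                      # every brick and "Self-heal Daemon" entry should be online
  gluster volume heal data-hdd statistics heal-count
  gluster volume heal data-hdd info split-brain
  gluster volume heal data-hdd full                   # retrigger once everything is up
  less /var/log/glusterfs/glustershd.log              # self-heal daemon log; look for why the heals fail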
>>>>>> Second issue: I've been experiencing VERY POOR performance on most of
>>>>>> my VMs. To the tune that logging into a Windows 10 VM via remote desktop
>>>>>> can take 5 minutes, and launching QuickBooks inside said VM can easily
>>>>>> take 10 minutes. On some Linux VMs, I get random messages like this:
>>>>>> Message from syslogd at unifi at May 28 20:39:23 ...
>>>>>> kernel:[6171996.308904] NMI watchdog: BUG: soft lockup - CPU#0 stuck
>>>>>> for 22s! [mongod:14766]
>>>>>>
>>>>>> (the process and PID are often different)
>>>>>>
>>>>>> I'm not quite sure what to do about this either. My initial thought
>>>>>> was to upgrade everything to current and see if it's still there, but I
>>>>>> cannot move forward with that until my gluster is healed...
>>>>>>
>>>>>> Thanks!
>>>>>> --Jim
>>>>>>
>>>>>> _______________________________________________
>>>>>> Users mailing list -- users at ovirt.org
>>>>>> To unsubscribe send an email to users-leave at ovirt.org
>>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
>>>>>> oVirt Code of Conduct:
>>>>>> https://www.ovirt.org/community/about/community-guidelines/
>>>>>> List Archives:
>>>>>> https://lists.ovirt.org/archives/list/users at ovirt.org/message/3LEV6ZQ3JV2XLAL7NYBTXOYMYUOTIRQF/
>>>>>>
>>>>>
>>>>
>>>
>>
>
> _______________________________________________
> Users mailing list -- users at ovirt.org
> To unsubscribe send an email to users-leave at ovirt.org
> Privacy Statement: https://www.ovirt.org/site/privacy-policy/
> oVirt Code of Conduct:
> https://www.ovirt.org/community/about/community-guidelines/
> List Archives:
> https://lists.ovirt.org/archives/list/users at ovirt.org/message/ACO7RFSLBSRBAIONIC2HQ6Z24ZDES5MF/
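On the slow VMs and the soft-lockup messages: in a hyperconverged setup those
can simply mean the guests are stalled waiting on storage rather than anything
CPU related. While a VM is hanging, it is worth watching iowait and disk
latency on the hosts (iostat comes from the sysstat package):

  top -b -n1 | head -5          # the "wa" value in the %Cpu(s) line is iowait
  iostat -xm 5                  # high await / %util on the brick disk points back at the storage layer

If the disk behind dm-2 really is failing, that alone could explain both the
heal backlog and the VM pauses.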