Sahina Bose
2018-May-30 07:03 UTC
[Gluster-users] [ovirt-users] Re: Gluster problems, cluster performance issues
[Adding gluster-users back] Nothing amiss with volume info and status. Can you check the agent.log and broker.log - they will be under /var/log/ovirt-hosted-engine-ha/ Also the gluster client logs - under /var/log/glusterfs/rhev-data-center-mnt-glusterSD<volume>.log On Wed, May 30, 2018 at 12:08 PM, Jim Kusznir <jim at palousetech.com> wrote:> I believe the gluster data store for the engine is up and working > correctly. The rest are not mounted, as the engine hasn't managed to start > correctly yet. I did perform the copy-a-junk-file into the data store, > then check to ensure it's there on another host, then delete that and see > that it disappeared on the first host; it passed that test. Here's the > info and status. (I have NOT performed the steps that Krutika and > Ravishankar suggested yet, as I don't have my data volumes working again > yet.) > > [root at ovirt2 images]# gluster volume info > > Volume Name: data > Type: Replicate > Volume ID: e670c488-ac16-4dd1-8bd3-e43b2e42cc59 > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x (2 + 1) = 3 > Transport-type: tcp > Bricks: > Brick1: ovirt1.nwfiber.com:/gluster/brick2/data > Brick2: ovirt2.nwfiber.com:/gluster/brick2/data > Brick3: ovirt3.nwfiber.com:/gluster/brick2/data (arbiter) > Options Reconfigured: > diagnostics.count-fop-hits: on > diagnostics.latency-measurement: on > changelog.changelog: on > geo-replication.ignore-pid-check: on > geo-replication.indexing: on > server.allow-insecure: on > performance.readdir-ahead: on > performance.quick-read: off > performance.read-ahead: off > performance.io-cache: off > performance.stat-prefetch: off > cluster.eager-lock: enable > network.remote-dio: enable > cluster.quorum-type: auto > cluster.server-quorum-type: server > storage.owner-uid: 36 > storage.owner-gid: 36 > features.shard: on > features.shard-block-size: 512MB > performance.low-prio-threads: 32 > cluster.data-self-heal-algorithm: full > cluster.locking-scheme: granular > cluster.shd-wait-qlength: 10000 > cluster.shd-max-threads: 8 > network.ping-timeout: 30 > user.cifs: off > nfs.disable: on > performance.strict-o-direct: on > > Volume Name: data-hdd > Type: Replicate > Volume ID: d342a3ab-16f3-49f0-bbcf-f788be8ac5f1 > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x 3 = 3 > Transport-type: tcp > Bricks: > Brick1: 172.172.1.11:/gluster/brick3/data-hdd > Brick2: 172.172.1.12:/gluster/brick3/data-hdd > Brick3: 172.172.1.13:/gluster/brick3/data-hdd > Options Reconfigured: > diagnostics.count-fop-hits: on > diagnostics.latency-measurement: on > network.ping-timeout: 30 > server.allow-insecure: on > storage.owner-gid: 36 > storage.owner-uid: 36 > user.cifs: off > features.shard: on > cluster.shd-wait-qlength: 10000 > cluster.shd-max-threads: 8 > cluster.locking-scheme: granular > cluster.data-self-heal-algorithm: full > cluster.server-quorum-type: server > cluster.quorum-type: auto > cluster.eager-lock: enable > network.remote-dio: enable > performance.low-prio-threads: 32 > performance.stat-prefetch: off > performance.io-cache: off > performance.read-ahead: off > performance.quick-read: off > changelog.changelog: on > geo-replication.ignore-pid-check: on > geo-replication.indexing: on > transport.address-family: inet > performance.readdir-ahead: on > > Volume Name: engine > Type: Replicate > Volume ID: 87ad86b9-d88b-457e-ba21-5d3173c612de > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x (2 + 1) = 3 > Transport-type: tcp > Bricks: > Brick1: ovirt1.nwfiber.com:/gluster/brick1/engine > Brick2: 
ovirt2.nwfiber.com:/gluster/brick1/engine > Brick3: ovirt3.nwfiber.com:/gluster/brick1/engine (arbiter) > Options Reconfigured: > changelog.changelog: on > geo-replication.ignore-pid-check: on > geo-replication.indexing: on > performance.readdir-ahead: on > performance.quick-read: off > performance.read-ahead: off > performance.io-cache: off > performance.stat-prefetch: off > cluster.eager-lock: enable > network.remote-dio: off > cluster.quorum-type: auto > cluster.server-quorum-type: server > storage.owner-uid: 36 > storage.owner-gid: 36 > features.shard: on > features.shard-block-size: 512MB > performance.low-prio-threads: 32 > cluster.data-self-heal-algorithm: full > cluster.locking-scheme: granular > cluster.shd-wait-qlength: 10000 > cluster.shd-max-threads: 6 > network.ping-timeout: 30 > user.cifs: off > nfs.disable: on > performance.strict-o-direct: on > > Volume Name: iso > Type: Replicate > Volume ID: b1ba15f5-0f0f-4411-89d0-595179f02b92 > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x (2 + 1) = 3 > Transport-type: tcp > Bricks: > Brick1: ovirt1.nwfiber.com:/gluster/brick4/iso > Brick2: ovirt2.nwfiber.com:/gluster/brick4/iso > Brick3: ovirt3.nwfiber.com:/gluster/brick4/iso (arbiter) > Options Reconfigured: > performance.strict-o-direct: on > nfs.disable: on > user.cifs: off > network.ping-timeout: 30 > cluster.shd-max-threads: 6 > cluster.shd-wait-qlength: 10000 > cluster.locking-scheme: granular > cluster.data-self-heal-algorithm: full > performance.low-prio-threads: 32 > features.shard-block-size: 512MB > features.shard: on > storage.owner-gid: 36 > storage.owner-uid: 36 > cluster.server-quorum-type: server > cluster.quorum-type: auto > network.remote-dio: off > cluster.eager-lock: enable > performance.stat-prefetch: off > performance.io-cache: off > performance.read-ahead: off > performance.quick-read: off > performance.readdir-ahead: on > > > [root at ovirt2 images]# gluster volume status > Status of volume: data > Gluster process TCP Port RDMA Port Online > Pid > ------------------------------------------------------------ > ------------------ > Brick ovirt1.nwfiber.com:/gluster/brick2/da > ta 49152 0 Y > 3226 > Brick ovirt2.nwfiber.com:/gluster/brick2/da > ta 49152 0 Y > 2967 > Brick ovirt3.nwfiber.com:/gluster/brick2/da > ta 49152 0 Y > 2554 > Self-heal Daemon on localhost N/A N/A Y > 4818 > Self-heal Daemon on ovirt3.nwfiber.com N/A N/A Y > 17853 > Self-heal Daemon on ovirt1.nwfiber.com N/A N/A Y > 4771 > > Task Status of Volume data > ------------------------------------------------------------ > ------------------ > There are no active volume tasks > > Status of volume: data-hdd > Gluster process TCP Port RDMA Port Online > Pid > ------------------------------------------------------------ > ------------------ > Brick 172.172.1.11:/gluster/brick3/data-hdd 49153 0 Y > 3232 > Brick 172.172.1.12:/gluster/brick3/data-hdd 49153 0 Y > 2976 > Brick 172.172.1.13:/gluster/brick3/data-hdd N/A N/A N > N/A > NFS Server on localhost N/A N/A N > N/A > Self-heal Daemon on localhost N/A N/A Y > 4818 > NFS Server on ovirt3.nwfiber.com N/A N/A N > N/A > Self-heal Daemon on ovirt3.nwfiber.com N/A N/A Y > 17853 > NFS Server on ovirt1.nwfiber.com N/A N/A N > N/A > Self-heal Daemon on ovirt1.nwfiber.com N/A N/A Y > 4771 > > Task Status of Volume data-hdd > ------------------------------------------------------------ > ------------------ > There are no active volume tasks > > Status of volume: engine > Gluster process TCP Port RDMA Port Online > Pid > 
------------------------------------------------------------ > ------------------ > Brick ovirt1.nwfiber.com:/gluster/brick1/en > gine 49154 0 Y > 3239 > Brick ovirt2.nwfiber.com:/gluster/brick1/en > gine 49154 0 Y > 2982 > Brick ovirt3.nwfiber.com:/gluster/brick1/en > gine 49154 0 Y > 2578 > Self-heal Daemon on localhost N/A N/A Y > 4818 > Self-heal Daemon on ovirt3.nwfiber.com N/A N/A Y > 17853 > Self-heal Daemon on ovirt1.nwfiber.com N/A N/A Y > 4771 > > Task Status of Volume engine > ------------------------------------------------------------ > ------------------ > There are no active volume tasks > > Status of volume: iso > Gluster process TCP Port RDMA Port Online > Pid > ------------------------------------------------------------ > ------------------ > Brick ovirt1.nwfiber.com:/gluster/brick4/is > o 49155 0 Y > 3247 > Brick ovirt2.nwfiber.com:/gluster/brick4/is > o 49155 0 Y > 2990 > Brick ovirt3.nwfiber.com:/gluster/brick4/is > o 49155 0 Y > 2580 > Self-heal Daemon on localhost N/A N/A Y > 4818 > Self-heal Daemon on ovirt3.nwfiber.com N/A N/A Y > 17853 > Self-heal Daemon on ovirt1.nwfiber.com N/A N/A Y > 4771 > > Task Status of Volume iso > ------------------------------------------------------------ > ------------------ > There are no active volume tasks > > > On Tue, May 29, 2018 at 11:30 PM, Sahina Bose <sabose at redhat.com> wrote: > >> >> >> On Wed, May 30, 2018 at 10:42 AM, Jim Kusznir <jim at palousetech.com> >> wrote: >> >>> hosted-engine --deploy failed (would not come up on my existing gluster >>> storage). However, I realized no changes were written to my existing >>> storage. So, I went back to trying to get my old engine running. >>> >>> hosted-engine --vm-status is now taking a very long time (5+ minutes) to >>> return, and it returns stale information everywhere. I thought perhaps the >>> lockspace is corrupt, so tried to clean that and metadata, but both are >>> failing (--clean-metadata has hung and I can't even ctrl-c out of it). >>> >>> How can I reinitialize all the lockspace/metadata safely? There is no >>> engine or VMs running currently.... >>> >> >> I think the first thing to make sure is that your storage is up and >> running. So can you mount the gluster volumes and access the >> contents there? >> Please provide the gluster volume info and gluster volume status of the >> volumes that you're using. >> >> >> >>> --Jim >>> >>> On Tue, May 29, 2018 at 9:33 PM, Jim Kusznir <jim at palousetech.com> >>> wrote: >>> >>>> Well, things went from bad to very, very bad.... >>>> >>>> It appears that during one of the 2-minute lockups, the fencing agents >>>> decided that another node in the cluster was down. As a result, 2 of the 3 >>>> nodes were simultaneously reset with fencing agent reboot. After the nodes >>>> came back up, the engine would not start. All running VMs (including VMs >>>> on the 3rd node that was not rebooted) crashed. >>>> >>>> I've now been working for about 3 hours trying to get the engine to >>>> come up. I don't know why it won't start. hosted-engine --vm-start says >>>> it's starting, but it doesn't start (virsh doesn't show any VMs running). >>>> I'm currently running --deploy, as I had run out of any other options I >>>> could come up with. I hope this will allow me to re-import all my >>>> existing VMs and allow me to start them back up after everything comes back >>>> up. 
>>>> >>>> I do have an unverified geo-rep backup; I don't know if it is a good >>>> backup (there were several prior messages to this list, but I didn't get >>>> replies to my questions. It was running in what I believe to be a "strange" >>>> way, and the data directories are larger than their source). >>>> >>>> I'll see if my --deploy works, and if not, I'll be back with another >>>> message/help request. >>>> >>>> When the dust settles and I'm at least minimally functional again, I >>>> really want to understand why all these technologies designed to offer >>>> redundancy conspired to reduce uptime and create failures where there >>>> weren't any otherwise. I thought that hosted engine, 3 ovirt servers and >>>> glusterfs with minimum replica 2+arb or replica 3 should have offered >>>> strong resilience against server failure or disk failure, and should have >>>> prevented / recovered from data corruption. Instead, all of the above >>>> happened (once I get my cluster back up, I still have to try and recover my >>>> webserver VM, which won't boot due to XFS corrupt journal issues created >>>> during the gluster crashes). I think a lot of these issues were rooted >>>> in the upgrade from 4.1 to 4.2. >>>> >>>> --Jim >>>> >>>> On Tue, May 29, 2018 at 6:25 PM, Jim Kusznir <jim at palousetech.com> >>>> wrote: >>>> >>>>> I also finally found the following in my system log on one server: >>>>> >>>>> [10679.524491] INFO: task glusterclogro:14933 blocked for more than >>>>> 120 seconds. >>>>> [10679.525826] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. >>>>> [10679.527144] glusterclogro D ffff97209832bf40 0 14933 1 >>>>> 0x00000080 >>>>> [10679.527150] Call Trace: >>>>> [10679.527161] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [10679.527218] [<ffffffffc060e388>] _xfs_log_force_lsn+0x2e8/0x340 >>>>> [xfs] >>>>> [10679.527225] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [10679.527254] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 [xfs] >>>>> [10679.527260] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>>> [10679.527268] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>>> 0x160 >>>>> [10679.527271] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20 >>>>> [10679.527275] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>>> [10679.527279] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>>> 0x160 >>>>> [10679.527283] INFO: task glusterposixfsy:14941 blocked for more than >>>>> 120 seconds. >>>>> [10679.528608] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. >>>>> [10679.529956] glusterposixfsy D ffff972495f84f10 0 14941 1 >>>>> 0x00000080 >>>>> [10679.529961] Call Trace: >>>>> [10679.529966] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [10679.530003] [<ffffffffc060e388>] _xfs_log_force_lsn+0x2e8/0x340 >>>>> [xfs] >>>>> [10679.530008] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [10679.530038] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 [xfs] >>>>> [10679.530042] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>>> [10679.530046] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>>> 0x160 >>>>> [10679.530050] [<ffffffffb944f3f3>] SyS_fdatasync+0x13/0x20 >>>>> [10679.530054] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>>> [10679.530058] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>>> 0x160 >>>>> [10679.530062] INFO: task glusteriotwr13:15486 blocked for more than >>>>> 120 seconds. >>>>> [10679.531805] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. 
>>>>> [10679.533732] glusteriotwr13 D ffff9720a83f0000 0 15486 1 >>>>> 0x00000080 >>>>> [10679.533738] Call Trace: >>>>> [10679.533747] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [10679.533799] [<ffffffffc060e388>] _xfs_log_force_lsn+0x2e8/0x340 >>>>> [xfs] >>>>> [10679.533806] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [10679.533846] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 [xfs] >>>>> [10679.533852] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>>> [10679.533858] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>>> 0x160 >>>>> [10679.533863] [<ffffffffb944f3f3>] SyS_fdatasync+0x13/0x20 >>>>> [10679.533868] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>>> [10679.533873] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>>> 0x160 >>>>> [10919.512757] INFO: task glusterclogro:14933 blocked for more than >>>>> 120 seconds. >>>>> [10919.514714] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. >>>>> [10919.516663] glusterclogro D ffff97209832bf40 0 14933 1 >>>>> 0x00000080 >>>>> [10919.516677] Call Trace: >>>>> [10919.516690] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [10919.516696] [<ffffffffb99118e9>] schedule_timeout+0x239/0x2c0 >>>>> [10919.516703] [<ffffffffb951cc04>] ? blk_finish_plug+0x14/0x40 >>>>> [10919.516768] [<ffffffffc05e9224>] ? _xfs_buf_ioapply+0x334/0x460 >>>>> [xfs] >>>>> [10919.516774] [<ffffffffb991432d>] wait_for_completion+0xfd/0x140 >>>>> [10919.516782] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [10919.516821] [<ffffffffc05eb0a3>] ? _xfs_buf_read+0x23/0x40 [xfs] >>>>> [10919.516859] [<ffffffffc05eafa9>] xfs_buf_submit_wait+0xf9/0x1d0 >>>>> [xfs] >>>>> [10919.516902] [<ffffffffc061b279>] ? xfs_trans_read_buf_map+0x199/0x400 >>>>> [xfs] >>>>> [10919.516940] [<ffffffffc05eb0a3>] _xfs_buf_read+0x23/0x40 [xfs] >>>>> [10919.516977] [<ffffffffc05eb1b9>] xfs_buf_read_map+0xf9/0x160 [xfs] >>>>> [10919.517022] [<ffffffffc061b279>] xfs_trans_read_buf_map+0x199/0x400 >>>>> [xfs] >>>>> [10919.517057] [<ffffffffc05c8d04>] xfs_da_read_buf+0xd4/0x100 [xfs] >>>>> [10919.517091] [<ffffffffc05c8d53>] xfs_da3_node_read+0x23/0xd0 [xfs] >>>>> [10919.517126] [<ffffffffc05c9fee>] xfs_da3_node_lookup_int+0x6e/0x2f0 >>>>> [xfs] >>>>> [10919.517160] [<ffffffffc05d5a1d>] xfs_dir2_node_lookup+0x4d/0x170 >>>>> [xfs] >>>>> [10919.517194] [<ffffffffc05ccf5d>] xfs_dir_lookup+0x1bd/0x1e0 [xfs] >>>>> [10919.517233] [<ffffffffc05fd8d9>] xfs_lookup+0x69/0x140 [xfs] >>>>> [10919.517271] [<ffffffffc05fa018>] xfs_vn_lookup+0x78/0xc0 [xfs] >>>>> [10919.517278] [<ffffffffb9425cf3>] lookup_real+0x23/0x60 >>>>> [10919.517283] [<ffffffffb9426702>] __lookup_hash+0x42/0x60 >>>>> [10919.517288] [<ffffffffb942d519>] SYSC_renameat2+0x3a9/0x5a0 >>>>> [10919.517296] [<ffffffffb94d3753>] ? selinux_file_free_security+0x2 >>>>> 3/0x30 >>>>> [10919.517304] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>>> 0x160 >>>>> [10919.517309] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>>> 0x160 >>>>> [10919.517313] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>>> 0x160 >>>>> [10919.517318] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>>> 0x160 >>>>> [10919.517323] [<ffffffffb942e58e>] SyS_renameat2+0xe/0x10 >>>>> [10919.517328] [<ffffffffb942e5ce>] SyS_rename+0x1e/0x20 >>>>> [10919.517333] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>>> [10919.517339] [<ffffffffb992077b>] ? 
system_call_after_swapgs+0xc8/ >>>>> 0x160 >>>>> [11159.496095] INFO: task glusteriotwr9:15482 blocked for more than >>>>> 120 seconds. >>>>> [11159.497546] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. >>>>> [11159.498978] glusteriotwr9 D ffff971fa0fa1fa0 0 15482 1 >>>>> 0x00000080 >>>>> [11159.498984] Call Trace: >>>>> [11159.498995] [<ffffffffb9911f00>] ? bit_wait+0x50/0x50 >>>>> [11159.498999] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [11159.499003] [<ffffffffb99118e9>] schedule_timeout+0x239/0x2c0 >>>>> [11159.499056] [<ffffffffc05dd9b7>] ? xfs_iext_bno_to_ext+0xa7/0x1a0 >>>>> [xfs] >>>>> [11159.499082] [<ffffffffc05dd43e>] ? xfs_iext_bno_to_irec+0x8e/0xd0 >>>>> [xfs] >>>>> [11159.499090] [<ffffffffb92f7a12>] ? ktime_get_ts64+0x52/0xf0 >>>>> [11159.499093] [<ffffffffb9911f00>] ? bit_wait+0x50/0x50 >>>>> [11159.499097] [<ffffffffb991348d>] io_schedule_timeout+0xad/0x130 >>>>> [11159.499101] [<ffffffffb9913528>] io_schedule+0x18/0x20 >>>>> [11159.499104] [<ffffffffb9911f11>] bit_wait_io+0x11/0x50 >>>>> [11159.499107] [<ffffffffb9911ac1>] __wait_on_bit_lock+0x61/0xc0 >>>>> [11159.499113] [<ffffffffb9393634>] __lock_page+0x74/0x90 >>>>> [11159.499118] [<ffffffffb92bc210>] ? wake_bit_function+0x40/0x40 >>>>> [11159.499121] [<ffffffffb9394154>] __find_lock_page+0x54/0x70 >>>>> [11159.499125] [<ffffffffb9394e85>] grab_cache_page_write_begin+0x >>>>> 55/0xc0 >>>>> [11159.499130] [<ffffffffb9484b76>] iomap_write_begin+0x66/0x100 >>>>> [11159.499135] [<ffffffffb9484edf>] iomap_write_actor+0xcf/0x1d0 >>>>> [11159.499140] [<ffffffffb9484e10>] ? iomap_write_end+0x80/0x80 >>>>> [11159.499144] [<ffffffffb94854e7>] iomap_apply+0xb7/0x150 >>>>> [11159.499149] [<ffffffffb9485621>] iomap_file_buffered_write+0xa1 >>>>> /0xe0 >>>>> [11159.499153] [<ffffffffb9484e10>] ? iomap_write_end+0x80/0x80 >>>>> [11159.499182] [<ffffffffc05f025d>] xfs_file_buffered_aio_write+0x12d/0x2c0 >>>>> [xfs] >>>>> [11159.499213] [<ffffffffc05f057d>] xfs_file_aio_write+0x18d/0x1b0 >>>>> [xfs] >>>>> [11159.499217] [<ffffffffb941a533>] do_sync_write+0x93/0xe0 >>>>> [11159.499222] [<ffffffffb941b010>] vfs_write+0xc0/0x1f0 >>>>> [11159.499225] [<ffffffffb941c002>] SyS_pwrite64+0x92/0xc0 >>>>> [11159.499230] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>>> 0x160 >>>>> [11159.499234] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>>> [11159.499238] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>>> 0x160 >>>>> [11279.488720] INFO: task xfsaild/dm-10:1134 blocked for more than 120 >>>>> seconds. >>>>> [11279.490197] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. >>>>> [11279.491665] xfsaild/dm-10 D ffff9720a8660fd0 0 1134 2 >>>>> 0x00000000 >>>>> [11279.491671] Call Trace: >>>>> [11279.491682] [<ffffffffb92a3a2e>] ? try_to_del_timer_sync+0x5e/0x90 >>>>> [11279.491688] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [11279.491744] [<ffffffffc060de36>] _xfs_log_force+0x1c6/0x2c0 [xfs] >>>>> [11279.491750] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [11279.491783] [<ffffffffc0619fec>] ? xfsaild+0x16c/0x6f0 [xfs] >>>>> [11279.491817] [<ffffffffc060df5c>] xfs_log_force+0x2c/0x70 [xfs] >>>>> [11279.491849] [<ffffffffc0619e80>] ? xfs_trans_ail_cursor_first+0x90/0x90 >>>>> [xfs] >>>>> [11279.491880] [<ffffffffc0619fec>] xfsaild+0x16c/0x6f0 [xfs] >>>>> [11279.491913] [<ffffffffc0619e80>] ? xfs_trans_ail_cursor_first+0x90/0x90 >>>>> [xfs] >>>>> [11279.491919] [<ffffffffb92bb161>] kthread+0xd1/0xe0 >>>>> [11279.491926] [<ffffffffb92bb090>] ? 
insert_kthread_work+0x40/0x40 >>>>> [11279.491932] [<ffffffffb9920677>] ret_from_fork_nospec_begin+0x2 >>>>> 1/0x21 >>>>> [11279.491936] [<ffffffffb92bb090>] ? insert_kthread_work+0x40/0x40 >>>>> [11279.491976] INFO: task glusterclogfsyn:14934 blocked for more than >>>>> 120 seconds. >>>>> [11279.493466] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. >>>>> [11279.494952] glusterclogfsyn D ffff97209832af70 0 14934 1 >>>>> 0x00000080 >>>>> [11279.494957] Call Trace: >>>>> [11279.494979] [<ffffffffc0309839>] ? __split_and_process_bio+0x2e9/0x520 >>>>> [dm_mod] >>>>> [11279.494983] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [11279.494987] [<ffffffffb99118e9>] schedule_timeout+0x239/0x2c0 >>>>> [11279.494997] [<ffffffffc0309d98>] ? dm_make_request+0x128/0x1a0 >>>>> [dm_mod] >>>>> [11279.495001] [<ffffffffb991348d>] io_schedule_timeout+0xad/0x130 >>>>> [11279.495005] [<ffffffffb99145ad>] wait_for_completion_io+0xfd/0x140 >>>>> [11279.495010] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [11279.495016] [<ffffffffb951e574>] blkdev_issue_flush+0xb4/0x110 >>>>> [11279.495049] [<ffffffffc06064b9>] xfs_blkdev_issue_flush+0x19/0x20 >>>>> [xfs] >>>>> [11279.495079] [<ffffffffc05eec40>] xfs_file_fsync+0x1b0/0x1e0 [xfs] >>>>> [11279.495086] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>>> [11279.495090] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>>> 0x160 >>>>> [11279.495094] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20 >>>>> [11279.495098] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>>> [11279.495102] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>>> 0x160 >>>>> [11279.495105] INFO: task glusterposixfsy:14941 blocked for more than >>>>> 120 seconds. >>>>> [11279.496606] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. >>>>> [11279.498114] glusterposixfsy D ffff972495f84f10 0 14941 1 >>>>> 0x00000080 >>>>> [11279.498118] Call Trace: >>>>> [11279.498134] [<ffffffffc0309839>] ? __split_and_process_bio+0x2e9/0x520 >>>>> [dm_mod] >>>>> [11279.498138] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [11279.498142] [<ffffffffb99118e9>] schedule_timeout+0x239/0x2c0 >>>>> [11279.498152] [<ffffffffc0309d98>] ? dm_make_request+0x128/0x1a0 >>>>> [dm_mod] >>>>> [11279.498156] [<ffffffffb991348d>] io_schedule_timeout+0xad/0x130 >>>>> [11279.498160] [<ffffffffb99145ad>] wait_for_completion_io+0xfd/0x140 >>>>> [11279.498165] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [11279.498169] [<ffffffffb951e574>] blkdev_issue_flush+0xb4/0x110 >>>>> [11279.498202] [<ffffffffc06064b9>] xfs_blkdev_issue_flush+0x19/0x20 >>>>> [xfs] >>>>> [11279.498231] [<ffffffffc05eec40>] xfs_file_fsync+0x1b0/0x1e0 [xfs] >>>>> [11279.498238] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>>> [11279.498242] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>>> 0x160 >>>>> [11279.498246] [<ffffffffb944f3f3>] SyS_fdatasync+0x13/0x20 >>>>> [11279.498250] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>>> [11279.498254] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>>> 0x160 >>>>> [11279.498257] INFO: task glusteriotwr1:14950 blocked for more than >>>>> 120 seconds. >>>>> [11279.499789] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. 
>>>>> [11279.501343] glusteriotwr1 D ffff97208b6daf70 0 14950 1 >>>>> 0x00000080 >>>>> [11279.501348] Call Trace: >>>>> [11279.501353] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [11279.501390] [<ffffffffc060e388>] _xfs_log_force_lsn+0x2e8/0x340 >>>>> [xfs] >>>>> [11279.501396] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [11279.501428] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 [xfs] >>>>> [11279.501432] [<ffffffffb944ef3f>] generic_write_sync+0x4f/0x70 >>>>> [11279.501461] [<ffffffffc05f0545>] xfs_file_aio_write+0x155/0x1b0 >>>>> [xfs] >>>>> [11279.501466] [<ffffffffb941a533>] do_sync_write+0x93/0xe0 >>>>> [11279.501471] [<ffffffffb941b010>] vfs_write+0xc0/0x1f0 >>>>> [11279.501475] [<ffffffffb941c002>] SyS_pwrite64+0x92/0xc0 >>>>> [11279.501479] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>>> 0x160 >>>>> [11279.501483] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>>> [11279.501489] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>>> 0x160 >>>>> [11279.501493] INFO: task glusteriotwr4:14953 blocked for more than >>>>> 120 seconds. >>>>> [11279.503047] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" >>>>> disables this message. >>>>> [11279.504630] glusteriotwr4 D ffff972499f2bf40 0 14953 1 >>>>> 0x00000080 >>>>> [11279.504635] Call Trace: >>>>> [11279.504640] [<ffffffffb9913f79>] schedule+0x29/0x70 >>>>> [11279.504676] [<ffffffffc060e388>] _xfs_log_force_lsn+0x2e8/0x340 >>>>> [xfs] >>>>> [11279.504681] [<ffffffffb92cf1b0>] ? wake_up_state+0x20/0x20 >>>>> [11279.504710] [<ffffffffc05eeb97>] xfs_file_fsync+0x107/0x1e0 [xfs] >>>>> [11279.504714] [<ffffffffb944f0e7>] do_fsync+0x67/0xb0 >>>>> [11279.504718] [<ffffffffb992076f>] ? system_call_after_swapgs+0xbc/ >>>>> 0x160 >>>>> [11279.504722] [<ffffffffb944f3d0>] SyS_fsync+0x10/0x20 >>>>> [11279.504725] [<ffffffffb992082f>] system_call_fastpath+0x1c/0x21 >>>>> [11279.504730] [<ffffffffb992077b>] ? system_call_after_swapgs+0xc8/ >>>>> 0x160 >>>>> [12127.466494] perf: interrupt took too long (8263 > 8150), lowering >>>>> kernel.perf_event_max_sample_rate to 24000 >>>>> >>>>> -------------------- >>>>> I think this is the cause of the massive ovirt performance issues >>>>> irrespective of gluster volume. At the time this happened, I was also >>>>> ssh'ed into the host, and was doing some rpm query commands. I had just >>>>> run rpm -qa |grep glusterfs (to verify what version was actually >>>>> installed), and that command took almost 2 minutes to return! Normally it >>>>> takes less than 2 seconds. That is all pure local SSD IO, too.... >>>>> >>>>> I'm no expert, but it's my understanding that any time software causes >>>>> these kinds of issues, it's a serious bug in the software, even if it's >>>>> mis-handled exceptions. Is this correct? >>>>> >>>>> --Jim >>>>> >>>>> On Tue, May 29, 2018 at 3:01 PM, Jim Kusznir <jim at palousetech.com> >>>>> wrote: >>>>> >>>>>> I think this is the profile information for one of the volumes that >>>>>> lives on the SSDs and is fully operational with no down/problem disks: >>>>>> >>>>>> [root at ovirt2 yum.repos.d]# gluster volume profile data info >>>>>> Brick: ovirt2.nwfiber.com:/gluster/brick2/data >>>>>> ---------------------------------------------- >>>>>> Cumulative Stats: >>>>>> Block Size: 256b+ 512b+ >>>>>> 1024b+ >>>>>> No. of Reads: 983 2696 >>>>>> 1059 >>>>>> No. of Writes: 0 1113 >>>>>> 302 >>>>>> >>>>>> Block Size: 2048b+ 4096b+ >>>>>> 8192b+ >>>>>> No. of Reads: 852 88608 >>>>>> 53526 >>>>>> No. 
of Writes: 522 812340 >>>>>> 76257 >>>>>> >>>>>> Block Size: 16384b+ 32768b+ >>>>>> 65536b+ >>>>>> No. of Reads: 54351 241901 >>>>>> 15024 >>>>>> No. of Writes: 21636 8656 >>>>>> 8976 >>>>>> >>>>>> Block Size: 131072b+ >>>>>> No. of Reads: 524156 >>>>>> No. of Writes: 296071 >>>>>> %-latency Avg-latency Min-Latency Max-Latency No. of calls >>>>>> Fop >>>>>> --------- ----------- ----------- ----------- ------------ >>>>>> ---- >>>>>> 0.00 0.00 us 0.00 us 0.00 us 4189 >>>>>> RELEASE >>>>>> 0.00 0.00 us 0.00 us 0.00 us 1257 >>>>>> RELEASEDIR >>>>>> 0.00 46.19 us 12.00 us 187.00 us 69 >>>>>> FLUSH >>>>>> 0.00 147.00 us 78.00 us 367.00 us 86 >>>>>> REMOVEXATTR >>>>>> 0.00 223.46 us 24.00 us 1166.00 us 149 >>>>>> READDIR >>>>>> 0.00 565.34 us 76.00 us 3639.00 us 88 >>>>>> FTRUNCATE >>>>>> 0.00 263.28 us 20.00 us 28385.00 us 228 >>>>>> LK >>>>>> 0.00 98.84 us 2.00 us 880.00 us 1198 >>>>>> OPENDIR >>>>>> 0.00 91.59 us 26.00 us 10371.00 us 3853 >>>>>> STATFS >>>>>> 0.00 494.14 us 17.00 us 193439.00 us 1171 >>>>>> GETXATTR >>>>>> 0.00 299.42 us 35.00 us 9799.00 us 2044 >>>>>> READDIRP >>>>>> 0.00 1965.31 us 110.00 us 382258.00 us 321 >>>>>> XATTROP >>>>>> 0.01 113.40 us 24.00 us 61061.00 us 8134 >>>>>> STAT >>>>>> 0.01 755.38 us 57.00 us 607603.00 us 3196 >>>>>> DISCARD >>>>>> 0.05 2690.09 us 58.00 us 2704761.00 us 3206 >>>>>> OPEN >>>>>> 0.10 119978.25 us 97.00 us 9406684.00 us 154 >>>>>> SETATTR >>>>>> 0.18 101.73 us 28.00 us 700477.00 us 313379 >>>>>> FSTAT >>>>>> 0.23 1059.84 us 25.00 us 2716124.00 us 38255 >>>>>> LOOKUP >>>>>> 0.47 1024.11 us 54.00 us 6197164.00 us 81455 >>>>>> FXATTROP >>>>>> 1.72 2984.00 us 15.00 us 37098954.00 us 103020 >>>>>> FINODELK >>>>>> 5.92 44315.32 us 51.00 us 24731536.00 us 23957 >>>>>> FSYNC >>>>>> 13.27 2399.78 us 25.00 us 22089540.00 us 991005 >>>>>> READ >>>>>> 37.00 5980.43 us 52.00 us 22099889.00 us 1108976 >>>>>> WRITE >>>>>> 41.04 5452.75 us 13.00 us 22102452.00 us 1349053 >>>>>> INODELK >>>>>> >>>>>> Duration: 10026 seconds >>>>>> Data Read: 80046027759 bytes >>>>>> Data Written: 44496632320 bytes >>>>>> >>>>>> Interval 1 Stats: >>>>>> Block Size: 256b+ 512b+ >>>>>> 1024b+ >>>>>> No. of Reads: 983 2696 >>>>>> 1059 >>>>>> No. of Writes: 0 838 >>>>>> 185 >>>>>> >>>>>> Block Size: 2048b+ 4096b+ >>>>>> 8192b+ >>>>>> No. of Reads: 852 85856 >>>>>> 51575 >>>>>> No. of Writes: 382 705802 >>>>>> 57812 >>>>>> >>>>>> Block Size: 16384b+ 32768b+ >>>>>> 65536b+ >>>>>> No. of Reads: 52673 232093 >>>>>> 14984 >>>>>> No. of Writes: 13499 4908 >>>>>> 4242 >>>>>> >>>>>> Block Size: 131072b+ >>>>>> No. of Reads: 460040 >>>>>> No. of Writes: 6411 >>>>>> %-latency Avg-latency Min-Latency Max-Latency No. 
of calls >>>>>> Fop >>>>>> --------- ----------- ----------- ----------- ------------ >>>>>> ---- >>>>>> 0.00 0.00 us 0.00 us 0.00 us 2093 >>>>>> RELEASE >>>>>> 0.00 0.00 us 0.00 us 0.00 us 1093 >>>>>> RELEASEDIR >>>>>> 0.00 53.38 us 26.00 us 111.00 us 16 >>>>>> FLUSH >>>>>> 0.00 145.14 us 78.00 us 367.00 us 71 >>>>>> REMOVEXATTR >>>>>> 0.00 190.96 us 114.00 us 298.00 us 71 >>>>>> SETATTR >>>>>> 0.00 213.38 us 24.00 us 1145.00 us 90 >>>>>> READDIR >>>>>> 0.00 263.28 us 20.00 us 28385.00 us 228 >>>>>> LK >>>>>> 0.00 101.76 us 2.00 us 880.00 us 1093 >>>>>> OPENDIR >>>>>> 0.01 93.60 us 27.00 us 10371.00 us 3090 >>>>>> STATFS >>>>>> 0.02 537.47 us 17.00 us 193439.00 us 1038 >>>>>> GETXATTR >>>>>> 0.03 297.44 us 35.00 us 9799.00 us 1990 >>>>>> READDIRP >>>>>> 0.03 2357.28 us 110.00 us 382258.00 us 253 >>>>>> XATTROP >>>>>> 0.04 385.93 us 58.00 us 47593.00 us 2091 >>>>>> OPEN >>>>>> 0.04 114.86 us 24.00 us 61061.00 us 7715 >>>>>> STAT >>>>>> 0.06 444.59 us 57.00 us 333240.00 us 3053 >>>>>> DISCARD >>>>>> 0.42 316.24 us 25.00 us 290728.00 us 29823 >>>>>> LOOKUP >>>>>> 0.73 257.92 us 54.00 us 344812.00 us 63296 >>>>>> FXATTROP >>>>>> 1.37 98.30 us 28.00 us 67621.00 us 313172 >>>>>> FSTAT >>>>>> 1.58 2124.69 us 51.00 us 849200.00 us 16717 >>>>>> FSYNC >>>>>> 5.73 162.46 us 52.00 us 748492.00 us 794079 >>>>>> WRITE >>>>>> 7.19 2065.17 us 16.00 us 37098954.00 us 78381 >>>>>> FINODELK >>>>>> 36.44 886.32 us 25.00 us 2216436.00 us 925421 >>>>>> READ >>>>>> 46.30 1178.04 us 13.00 us 1700704.00 us 884635 >>>>>> INODELK >>>>>> >>>>>> Duration: 7485 seconds >>>>>> Data Read: 71250527215 bytes >>>>>> Data Written: 5119903744 bytes >>>>>> >>>>>> Brick: ovirt3.nwfiber.com:/gluster/brick2/data >>>>>> ---------------------------------------------- >>>>>> Cumulative Stats: >>>>>> Block Size: 1b+ >>>>>> No. of Reads: 0 >>>>>> No. of Writes: 3264419 >>>>>> %-latency Avg-latency Min-Latency Max-Latency No. of calls >>>>>> Fop >>>>>> --------- ----------- ----------- ----------- ------------ >>>>>> ---- >>>>>> 0.00 0.00 us 0.00 us 0.00 us 90 >>>>>> FORGET >>>>>> 0.00 0.00 us 0.00 us 0.00 us 9462 >>>>>> RELEASE >>>>>> 0.00 0.00 us 0.00 us 0.00 us 4254 >>>>>> RELEASEDIR >>>>>> 0.00 50.52 us 13.00 us 190.00 us 71 >>>>>> FLUSH >>>>>> 0.00 186.97 us 87.00 us 713.00 us 86 >>>>>> REMOVEXATTR >>>>>> 0.00 79.32 us 33.00 us 189.00 us 228 >>>>>> LK >>>>>> 0.00 220.98 us 129.00 us 513.00 us 86 >>>>>> SETATTR >>>>>> 0.01 259.30 us 26.00 us 2632.00 us 137 >>>>>> READDIR >>>>>> 0.02 322.76 us 145.00 us 2125.00 us 321 >>>>>> XATTROP >>>>>> 0.03 109.55 us 2.00 us 1258.00 us 1193 >>>>>> OPENDIR >>>>>> 0.05 70.21 us 21.00 us 431.00 us 3196 >>>>>> DISCARD >>>>>> 0.05 169.26 us 21.00 us 2315.00 us 1545 >>>>>> GETXATTR >>>>>> 0.12 176.85 us 63.00 us 2844.00 us 3206 >>>>>> OPEN >>>>>> 0.61 303.49 us 90.00 us 3085.00 us 9633 >>>>>> FSTAT >>>>>> 2.44 305.66 us 28.00 us 3716.00 us 38230 >>>>>> LOOKUP >>>>>> 4.52 266.22 us 55.00 us 53424.00 us 81455 >>>>>> FXATTROP >>>>>> 6.96 1397.99 us 51.00 us 64822.00 us 23889 >>>>>> FSYNC >>>>>> 16.48 84.74 us 25.00 us 6917.00 us 932592 >>>>>> WRITE >>>>>> 30.16 106.90 us 13.00 us 3920189.00 us 1353046 >>>>>> INODELK >>>>>> 38.55 1794.52 us 14.00 us 16210553.00 us 103039 >>>>>> FINODELK >>>>>> >>>>>> Duration: 66562 seconds >>>>>> Data Read: 0 bytes >>>>>> Data Written: 3264419 bytes >>>>>> >>>>>> Interval 1 Stats: >>>>>> Block Size: 1b+ >>>>>> No. of Reads: 0 >>>>>> No. of Writes: 794080 >>>>>> %-latency Avg-latency Min-Latency Max-Latency No. 
of calls >>>>>> Fop >>>>>> --------- ----------- ----------- ----------- ------------ >>>>>> ---- >>>>>> 0.00 0.00 us 0.00 us 0.00 us 2093 >>>>>> RELEASE >>>>>> 0.00 0.00 us 0.00 us 0.00 us 1093 >>>>>> RELEASEDIR >>>>>> 0.00 70.31 us 26.00 us 125.00 us 16 >>>>>> FLUSH >>>>>> 0.00 193.10 us 103.00 us 713.00 us 71 >>>>>> REMOVEXATTR >>>>>> 0.01 227.32 us 133.00 us 513.00 us 71 >>>>>> SETATTR >>>>>> 0.01 79.32 us 33.00 us 189.00 us 228 >>>>>> LK >>>>>> 0.01 259.83 us 35.00 us 1138.00 us 89 >>>>>> READDIR >>>>>> 0.03 318.26 us 145.00 us 2047.00 us 253 >>>>>> XATTROP >>>>>> 0.04 112.67 us 3.00 us 1258.00 us 1093 >>>>>> OPENDIR >>>>>> 0.06 167.98 us 23.00 us 1951.00 us 1014 >>>>>> GETXATTR >>>>>> 0.08 70.97 us 22.00 us 431.00 us 3053 >>>>>> DISCARD >>>>>> 0.13 183.78 us 66.00 us 2844.00 us 2091 >>>>>> OPEN >>>>>> 1.01 303.82 us 90.00 us 3085.00 us 9610 >>>>>> FSTAT >>>>>> 3.27 316.59 us 30.00 us 3716.00 us 29820 >>>>>> LOOKUP >>>>>> 5.83 265.79 us 59.00 us 53424.00 us 63296 >>>>>> FXATTROP >>>>>> 7.95 1373.89 us 51.00 us 64822.00 us 16717 >>>>>> FSYNC >>>>>> 23.17 851.99 us 14.00 us 16210553.00 us 78555 >>>>>> FINODELK >>>>>> 24.04 87.44 us 27.00 us 6917.00 us 794081 >>>>>> WRITE >>>>>> 34.36 111.91 us 14.00 us 984871.00 us 886790 >>>>>> INODELK >>>>>> >>>>>> Duration: 7485 seconds >>>>>> Data Read: 0 bytes >>>>>> Data Written: 794080 bytes >>>>>> >>>>>> >>>>>> ----------------------- >>>>>> Here is the data from the volume that is backed by the SHDDs and has >>>>>> one failed disk: >>>>>> [root at ovirt2 yum.repos.d]# gluster volume profile data-hdd info >>>>>> Brick: 172.172.1.12:/gluster/brick3/data-hdd >>>>>> -------------------------------------------- >>>>>> Cumulative Stats: >>>>>> Block Size: 256b+ 512b+ >>>>>> 1024b+ >>>>>> No. of Reads: 1702 86 >>>>>> 16 >>>>>> No. of Writes: 0 767 >>>>>> 71 >>>>>> >>>>>> Block Size: 2048b+ 4096b+ >>>>>> 8192b+ >>>>>> No. of Reads: 19 51841 >>>>>> 2049 >>>>>> No. of Writes: 76 60668 >>>>>> 35727 >>>>>> >>>>>> Block Size: 16384b+ 32768b+ >>>>>> 65536b+ >>>>>> No. of Reads: 1744 639 >>>>>> 1088 >>>>>> No. of Writes: 8524 2410 >>>>>> 1285 >>>>>> >>>>>> Block Size: 131072b+ >>>>>> No. of Reads: 771999 >>>>>> No. of Writes: 29584 >>>>>> %-latency Avg-latency Min-Latency Max-Latency No. 
of calls >>>>>> Fop >>>>>> --------- ----------- ----------- ----------- ------------ >>>>>> ---- >>>>>> 0.00 0.00 us 0.00 us 0.00 us 2902 >>>>>> RELEASE >>>>>> 0.00 0.00 us 0.00 us 0.00 us 1517 >>>>>> RELEASEDIR >>>>>> 0.00 197.00 us 197.00 us 197.00 us 1 >>>>>> FTRUNCATE >>>>>> 0.00 70.24 us 16.00 us 758.00 us 51 >>>>>> FLUSH >>>>>> 0.00 143.93 us 82.00 us 305.00 us 57 >>>>>> REMOVEXATTR >>>>>> 0.00 178.63 us 105.00 us 712.00 us 60 >>>>>> SETATTR >>>>>> 0.00 67.30 us 19.00 us 572.00 us 555 >>>>>> LK >>>>>> 0.00 322.80 us 23.00 us 4673.00 us 138 >>>>>> READDIR >>>>>> 0.00 336.56 us 106.00 us 11994.00 us 237 >>>>>> XATTROP >>>>>> 0.00 84.70 us 28.00 us 1071.00 us 3469 >>>>>> STATFS >>>>>> 0.01 387.75 us 2.00 us 146017.00 us 1467 >>>>>> OPENDIR >>>>>> 0.01 148.59 us 21.00 us 64374.00 us 4454 >>>>>> STAT >>>>>> 0.02 783.02 us 16.00 us 93502.00 us 1902 >>>>>> GETXATTR >>>>>> 0.03 1516.10 us 17.00 us 210690.00 us 1364 >>>>>> ENTRYLK >>>>>> 0.03 2555.47 us 300.00 us 674454.00 us 1064 >>>>>> READDIRP >>>>>> 0.07 85.74 us 19.00 us 68340.00 us 62849 >>>>>> FSTAT >>>>>> 0.07 1978.12 us 59.00 us 202596.00 us 2729 >>>>>> OPEN >>>>>> 0.22 708.57 us 15.00 us 394799.00 us 25447 >>>>>> LOOKUP >>>>>> 5.94 2331.74 us 15.00 us 1099530.00 us 207534 >>>>>> FINODELK >>>>>> 7.31 8311.75 us 58.00 us 1800216.00 us 71668 >>>>>> FXATTROP >>>>>> 12.49 7735.19 us 51.00 us 3595513.00 us 131642 >>>>>> WRITE >>>>>> 17.70 957.08 us 16.00 us 13700466.00 us 1508160 >>>>>> INODELK >>>>>> 24.55 2546.43 us 26.00 us 5077347.00 us 786060 >>>>>> READ >>>>>> 31.56 49699.15 us 47.00 us 3746331.00 us 51777 >>>>>> FSYNC >>>>>> >>>>>> Duration: 10101 seconds >>>>>> Data Read: 101562897361 bytes >>>>>> Data Written: 4834450432 bytes >>>>>> >>>>>> Interval 0 Stats: >>>>>> Block Size: 256b+ 512b+ >>>>>> 1024b+ >>>>>> No. of Reads: 1702 86 >>>>>> 16 >>>>>> No. of Writes: 0 767 >>>>>> 71 >>>>>> >>>>>> Block Size: 2048b+ 4096b+ >>>>>> 8192b+ >>>>>> No. of Reads: 19 51841 >>>>>> 2049 >>>>>> No. of Writes: 76 60668 >>>>>> 35727 >>>>>> >>>>>> Block Size: 16384b+ 32768b+ >>>>>> 65536b+ >>>>>> No. of Reads: 1744 639 >>>>>> 1088 >>>>>> No. of Writes: 8524 2410 >>>>>> 1285 >>>>>> >>>>>> Block Size: 131072b+ >>>>>> No. of Reads: 771999 >>>>>> No. of Writes: 29584 >>>>>> %-latency Avg-latency Min-Latency Max-Latency No. 
of calls >>>>>> Fop >>>>>> --------- ----------- ----------- ----------- ------------ >>>>>> ---- >>>>>> 0.00 0.00 us 0.00 us 0.00 us 2902 >>>>>> RELEASE >>>>>> 0.00 0.00 us 0.00 us 0.00 us 1517 >>>>>> RELEASEDIR >>>>>> 0.00 197.00 us 197.00 us 197.00 us 1 >>>>>> FTRUNCATE >>>>>> 0.00 70.24 us 16.00 us 758.00 us 51 >>>>>> FLUSH >>>>>> 0.00 143.93 us 82.00 us 305.00 us 57 >>>>>> REMOVEXATTR >>>>>> 0.00 178.63 us 105.00 us 712.00 us 60 >>>>>> SETATTR >>>>>> 0.00 67.30 us 19.00 us 572.00 us 555 >>>>>> LK >>>>>> 0.00 322.80 us 23.00 us 4673.00 us 138 >>>>>> READDIR >>>>>> 0.00 336.56 us 106.00 us 11994.00 us 237 >>>>>> XATTROP >>>>>> 0.00 84.70 us 28.00 us 1071.00 us 3469 >>>>>> STATFS >>>>>> 0.01 387.75 us 2.00 us 146017.00 us 1467 >>>>>> OPENDIR >>>>>> 0.01 148.59 us 21.00 us 64374.00 us 4454 >>>>>> STAT >>>>>> 0.02 783.02 us 16.00 us 93502.00 us 1902 >>>>>> GETXATTR >>>>>> 0.03 1516.10 us 17.00 us 210690.00 us 1364 >>>>>> ENTRYLK >>>>>> 0.03 2555.47 us 300.00 us 674454.00 us 1064 >>>>>> READDIRP >>>>>> 0.07 85.73 us 19.00 us 68340.00 us 62849 >>>>>> FSTAT >>>>>> 0.07 1978.12 us 59.00 us 202596.00 us 2729 >>>>>> OPEN >>>>>> 0.22 708.57 us 15.00 us 394799.00 us 25447 >>>>>> LOOKUP >>>>>> 5.94 2334.57 us 15.00 us 1099530.00 us 207534 >>>>>> FINODELK >>>>>> 7.31 8311.49 us 58.00 us 1800216.00 us 71668 >>>>>> FXATTROP >>>>>> 12.49 7735.32 us 51.00 us 3595513.00 us 131642 >>>>>> WRITE >>>>>> 17.71 957.08 us 16.00 us 13700466.00 us 1508160 >>>>>> INODELK >>>>>> 24.56 2546.42 us 26.00 us 5077347.00 us 786060 >>>>>> READ >>>>>> 31.54 49651.63 us 47.00 us 3746331.00 us 51777 >>>>>> FSYNC >>>>>> >>>>>> Duration: 10101 seconds >>>>>> Data Read: 101562897361 bytes >>>>>> Data Written: 4834450432 bytes >>>>>> >>>>>> >>>>>> On Tue, May 29, 2018 at 2:55 PM, Jim Kusznir <jim at palousetech.com> >>>>>> wrote: >>>>>> >>>>>>> Thank you for your response. >>>>>>> >>>>>>> I have 4 gluster volumes. 3 are replica 2 + arbitrator. Replica >>>>>>> bricks are on ovirt1 and ovirt2, arbitrator on ovirt3. The 4th volume is >>>>>>> replica 3, with a brick on all three ovirt machines. >>>>>>> >>>>>>> The first 3 volumes are on an SSD disk; the 4th is on a Seagate SSHD >>>>>>> (same in all three machines). On ovirt3, the SSHD has reported hard IO >>>>>>> failures, and that brick is offline. However, the other two replicas are >>>>>>> fully operational (although they still show contents in the heal info >>>>>>> command that won't go away, but that may be the case until I replace the >>>>>>> failed disk). >>>>>>> >>>>>>> What is bothering me is that ALL 4 gluster volumes are showing >>>>>>> horrible performance issues. At this point, as the bad disk has been >>>>>>> completely offlined, I would expect gluster to perform at normal speed, but >>>>>>> that is definitely not the case. >>>>>>> >>>>>>> I've also noticed that the performance hits seem to come in waves: >>>>>>> things seem to work acceptably (but slow) for a while, then suddenly, it's >>>>>>> as if all disk IO on all volumes (including non-gluster local OS disk >>>>>>> volumes for the hosts) pauses for about 30 seconds, then IO resumes again. >>>>>>> During those times, I start getting VM not responding and host not >>>>>>> responding notices as well as the applications having major issues. >>>>>>> >>>>>>> I've shut down most of my VMs and am down to just my essential core >>>>>>> VMs (shed about 75% of my VMs). I am still experiencing the same issues. 
>>>>>>> >>>>>>> Am I correct in believing that once the failed disk was brought >>>>>>> offline, performance should return to normal? >>>>>>> >>>>>>> On Tue, May 29, 2018 at 1:27 PM, Alex K <rightkicktech at gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> I would check disk status and accessibility of the mount points where >>>>>>>> your gluster volumes reside. >>>>>>>> >>>>>>>> On Tue, May 29, 2018, 22:28 Jim Kusznir <jim at palousetech.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> On one ovirt server, I'm now seeing these messages: >>>>>>>>> [56474.239725] blk_update_request: 63 callbacks suppressed >>>>>>>>> [56474.239732] blk_update_request: I/O error, dev dm-2, sector 0 >>>>>>>>> [56474.240602] blk_update_request: I/O error, dev dm-2, sector >>>>>>>>> 3905945472 >>>>>>>>> [56474.241346] blk_update_request: I/O error, dev dm-2, sector >>>>>>>>> 3905945584 >>>>>>>>> [56474.242236] blk_update_request: I/O error, dev dm-2, sector 2048 >>>>>>>>> [56474.243072] blk_update_request: I/O error, dev dm-2, sector >>>>>>>>> 3905943424 >>>>>>>>> [56474.243997] blk_update_request: I/O error, dev dm-2, sector >>>>>>>>> 3905943536 >>>>>>>>> [56474.247347] blk_update_request: I/O error, dev dm-2, sector 0 >>>>>>>>> [56474.248315] blk_update_request: I/O error, dev dm-2, sector >>>>>>>>> 3905945472 >>>>>>>>> [56474.249231] blk_update_request: I/O error, dev dm-2, sector >>>>>>>>> 3905945584 >>>>>>>>> [56474.250221] blk_update_request: I/O error, dev dm-2, sector 2048 >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, May 29, 2018 at 11:59 AM, Jim Kusznir <jim at palousetech.com >>>>>>>>> > wrote: >>>>>>>>> >>>>>>>>>> I see in messages on ovirt3 (my 3rd machine, the one upgraded to >>>>>>>>>> 4.2): >>>>>>>>>> >>>>>>>>>> May 29 11:54:41 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: >>>>>>>>>> database connection failed (No such file or directory) >>>>>>>>>> May 29 11:54:51 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: >>>>>>>>>> database connection failed (No such file or directory) >>>>>>>>>> May 29 11:55:01 ovirt3 ovs-vsctl: ovs|00001|db_ctl_base|ERR|unix:/var/run/openvswitch/db.sock: >>>>>>>>>> database connection failed (No such file or directory) >>>>>>>>>> (appears a lot). >>>>>>>>>> >>>>>>>>>> I also found, in the ssh session on that host, some sysv warnings about >>>>>>>>>> the backing disk for one of the gluster volumes (straight replica 3). The >>>>>>>>>> glusterfs process for that disk on that machine went offline. It's my >>>>>>>>>> understanding that it should continue to work with the other two machines >>>>>>>>>> while I attempt to replace that disk, right? Attempted writes (touching an >>>>>>>>>> empty file) can take 15 seconds; repeating it later will be much faster. >>>>>>>>>> >>>>>>>>>> Gluster generates a bunch of different log files, I don't know >>>>>>>>>> which ones you want, or from which machine(s). >>>>>>>>>> >>>>>>>>>> How do I do "volume profiling"? >>>>>>>>>> >>>>>>>>>> Thanks! >>>>>>>>>> >>>>>>>>>> On Tue, May 29, 2018 at 11:53 AM, Sahina Bose <sabose at redhat.com> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Do you see errors reported in the mount logs for the volume? If >>>>>>>>>>> so, could you attach the logs? >>>>>>>>>>> Any issues with your underlying disks? Can you also attach >>>>>>>>>>> output of volume profiling? >>>>>>>>>>> >>>>>>>>>>> On Wed, May 30, 2018 at 12:13 AM, Jim Kusznir < >>>>>>>>>>> jim at palousetech.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Ok, things have gotten MUCH worse this morning. 
I'm getting >>>>>>>>>>>> random errors from VMs; right now, about a third of my VMs have been paused >>>>>>>>>>>> due to storage issues, and most of the remaining VMs are not performing >>>>>>>>>>>> well. >>>>>>>>>>>> >>>>>>>>>>>> At this point, I am in full EMERGENCY mode, as my production >>>>>>>>>>>> services are now impacted, and I'm getting calls coming in with problems... >>>>>>>>>>>> >>>>>>>>>>>> I'd greatly appreciate help...VMs are running VERY slowly (when >>>>>>>>>>>> they run), and they are steadily getting worse. I don't know why. I was >>>>>>>>>>>> seeing CPU peaks (to 100%) on several VMs, in perfect sync, for a few >>>>>>>>>>>> minutes at a time (while the VM became unresponsive and any VMs I was >>>>>>>>>>>> logged into that were Linux were giving me the CPU stuck messages in my >>>>>>>>>>>> original post). Is all this storage related? >>>>>>>>>>>> >>>>>>>>>>>> I also have two different gluster volumes for VM storage, and >>>>>>>>>>>> only one had the issues, but now VMs in both are being affected at the same >>>>>>>>>>>> time and in the same way. >>>>>>>>>>>> >>>>>>>>>>>> --Jim >>>>>>>>>>>> >>>>>>>>>>>> On Mon, May 28, 2018 at 10:50 PM, Sahina Bose < >>>>>>>>>>>> sabose at redhat.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> [Adding gluster-users to look at the heal issue] >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, May 29, 2018 at 9:17 AM, Jim Kusznir < >>>>>>>>>>>>> jim at palousetech.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hello: >>>>>>>>>>>>>> >>>>>>>>>>>>>> I've been having some cluster and gluster performance issues >>>>>>>>>>>>>> lately. I also found that my cluster was out of date, and was trying to >>>>>>>>>>>>>> apply updates (hoping to fix some of these), and discovered the ovirt 4.1 >>>>>>>>>>>>>> repos were taken completely offline. So, I was forced to begin an upgrade >>>>>>>>>>>>>> to 4.2. According to docs I found/read, I needed only to add the new repo, do >>>>>>>>>>>>>> a yum update, reboot, and be good on my hosts (did the yum update, and the >>>>>>>>>>>>>> engine-setup on my hosted engine). Things seemed to work relatively well, >>>>>>>>>>>>>> except for a gluster sync issue that showed up. >>>>>>>>>>>>>> >>>>>>>>>>>>>> My cluster is a 3-node hyperconverged cluster. I upgraded >>>>>>>>>>>>>> the hosted engine first, then engine 3. When engine 3 came back up, for >>>>>>>>>>>>>> some reason one of my gluster volumes would not sync. 
Here's sample output: >>>>>>>>>>>>>> >>>>>>>>>>>>>> [root at ovirt3 ~]# gluster volume heal data-hdd info >>>>>>>>>>>>>> Brick 172.172.1.11:/gluster/brick3/data-hdd >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4 >>>>>>>>>>>>>> 725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9 >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4 >>>>>>>>>>>>>> cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971 >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-4 >>>>>>>>>>>>>> 46b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-4 >>>>>>>>>>>>>> 4f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625 >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-4 >>>>>>>>>>>>>> 2a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4 >>>>>>>>>>>>>> ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32 >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4 >>>>>>>>>>>>>> 810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4 >>>>>>>>>>>>>> aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2 >>>>>>>>>>>>>> Status: Connected >>>>>>>>>>>>>> Number of entries: 8 >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick 172.172.1.12:/gluster/brick3/data-hdd >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4 >>>>>>>>>>>>>> 725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9 >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4 >>>>>>>>>>>>>> cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971 >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4 >>>>>>>>>>>>>> ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32 >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-4 >>>>>>>>>>>>>> 46b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-4 >>>>>>>>>>>>>> 4f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625 >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4 >>>>>>>>>>>>>> 810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4 >>>>>>>>>>>>>> aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2 >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-4 >>>>>>>>>>>>>> 2a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b >>>>>>>>>>>>>> Status: Connected >>>>>>>>>>>>>> Number of entries: 8 >>>>>>>>>>>>>> >>>>>>>>>>>>>> Brick 172.172.1.13:/gluster/brick3/data-hdd >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/b1ea3f62-0f05-4 >>>>>>>>>>>>>> ded-8c82-9c91c90e0b61/d5d6bf5a-499f-431d-9013-5453db93ed32 >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/8c8b5147-e9d6-4 >>>>>>>>>>>>>> 810-b45b-185e3ed65727/16f08231-93b0-489d-a2fd-687b6bf88eaa >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/12924435-b9c2-4 >>>>>>>>>>>>>> aab-ba19-1c1bc31310ef/07b3db69-440e-491e-854c-bbfa18a7cff2 >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/f3c8e7aa-6ef2-4 >>>>>>>>>>>>>> 2a7-93d4-e0a4df6dd2fa/2eb0b1ad-2606-44ef-9cd3-ae59610a504b >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/647be733-f153-4 >>>>>>>>>>>>>> 
cdc-85bd-ba72544c2631/b453a300-0602-4be1-8310-8bd5abe00971 >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/48d7ecb8-7ac5-4 >>>>>>>>>>>>>> 725-bca5-b3519681cf2f/0d6080b0-7018-4fa3-bb82-1dd9ef07d9b9 >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/6da854d1-b6be-4 >>>>>>>>>>>>>> 46b-9bf0-90a0dbbea830/3c93bd1f-b7fa-4aa2-b445-6904e31839ba >>>>>>>>>>>>>> /cc65f671-3377-494a-a7d4-1d9f7c3ae46c/images/7f647567-d18c-4 >>>>>>>>>>>>>> 4f1-a58e-9b8865833acb/f9364470-9770-4bb1-a6b9-a54861849625 >>>>>>>>>>>>>> Status: Connected >>>>>>>>>>>>>> Number of entries: 8 >>>>>>>>>>>>>> >>>>>>>>>>>>>> --------- >>>>>>>>>>>>>> It's been in this state for a couple of days now, and bandwidth >>>>>>>>>>>>>> monitoring shows no appreciable data moving. I've repeatedly >>>>>>>>>>>>>> tried commanding a full heal from all three nodes in the cluster. It's always the >>>>>>>>>>>>>> same files that need healing. >>>>>>>>>>>>>> >>>>>>>>>>>>>> When running gluster volume heal data-hdd statistics, I >>>>>>>>>>>>>> sometimes see different information, but always some number of "heal failed" >>>>>>>>>>>>>> entries. It shows 0 for split brain. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm not quite sure what to do. I suspect it may be due to >>>>>>>>>>>>>> nodes 1 and 2 still being on the older ovirt/gluster release, but I'm >>>>>>>>>>>>>> afraid to upgrade and reboot them until I have a good gluster sync (don't >>>>>>>>>>>>>> need to create a split brain issue). How do I proceed with this? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Second issue: I've been experiencing VERY POOR performance on >>>>>>>>>>>>>> most of my VMs. To the tune that logging into a Windows 10 VM via remote >>>>>>>>>>>>>> desktop can take 5 minutes, launching QuickBooks inside said VM can easily >>>>>>>>>>>>>> take 10 minutes. On some Linux VMs, I get random messages like this: >>>>>>>>>>>>>> Message from syslogd at unifi at May 28 20:39:23 ... >>>>>>>>>>>>>> kernel:[6171996.308904] NMI watchdog: BUG: soft lockup - >>>>>>>>>>>>>> CPU#0 stuck for 22s! [mongod:14766] >>>>>>>>>>>>>> >>>>>>>>>>>>>> (the process and PID are often different) >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm not quite sure what to do about this either. My initial >>>>>>>>>>>>>> thought was to upgrade everything to current and see if it's still there, but I >>>>>>>>>>>>>> cannot move forward with that until my gluster is healed... >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks! 
>>>>>>>>>>>>>> --Jim >>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> Users mailing list -- users at ovirt.org >>>>>>>>>>>>>> To unsubscribe send an email to users-leave at ovirt.org >>>>>>>>>>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/ >>>>>>>>>>>>>> oVirt Code of Conduct: https://www.ovirt.org/communit >>>>>>>>>>>>>> y/about/community-guidelines/ >>>>>>>>>>>>>> List Archives: https://lists.ovirt.org/archiv >>>>>>>>>>>>>> es/list/users at ovirt.org/message/3LEV6ZQ3JV2XLAL7NYBTXOYMYUOT >>>>>>>>>>>>>> IRQF/ >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Users mailing list -- users at ovirt.org >>>>>>>>> To unsubscribe send an email to users-leave at ovirt.org >>>>>>>>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/ >>>>>>>>> oVirt Code of Conduct: https://www.ovirt.org/communit >>>>>>>>> y/about/community-guidelines/ >>>>>>>>> List Archives: https://lists.ovirt.org/archiv >>>>>>>>> es/list/users at ovirt.org/message/ACO7RFSLBSRBAIONIC2HQ6Z24ZDES5MF/ >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >>> _______________________________________________ >>> Users mailing list -- users at ovirt.org >>> To unsubscribe send an email to users-leave at ovirt.org >>> Privacy Statement: https://www.ovirt.org/site/privacy-policy/ >>> oVirt Code of Conduct: https://www.ovirt.org/communit >>> y/about/community-guidelines/ >>> List Archives: https://lists.ovirt.org/archiv >>> es/list/users at ovirt.org/message/JKBSLSZFRMFGILXGZB4YAMJIJKEKNGMA/ >>> >>> >> >
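For readers following this thread: the log checks and the "volume profiling" asked about above are all driven from the standard hosted-engine and gluster CLIs. A minimal sketch, assuming the volume name data-hdd used in this thread; the gluster client log file name depends on the actual mount path, so the <volume> placeholder must be filled in per host:

    # hosted-engine HA logs (agent state machine and storage broker)
    less /var/log/ovirt-hosted-engine-ha/agent.log
    less /var/log/ovirt-hosted-engine-ha/broker.log

    # gluster FUSE client log for a mounted volume
    less /var/log/glusterfs/rhev-data-center-mnt-glusterSD<volume>.log

    # volume profiling: enable per-FOP latency accounting, dump stats, disable
    gluster volume profile data-hdd start
    gluster volume profile data-hdd info
    gluster volume profile data-hdd stop

Profiling adds a small accounting overhead on every file operation, so it is usually stopped again once the stats (like the "Cumulative Stats" blocks quoted above) have been collected.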