Strahil Nikolov
2020-Jul-02 14:45 UTC
[Gluster-users] Problems with qemu and disperse volumes (live merge)
On 2 July 2020 16:33:51 GMT+03:00, Marco Fais <evilmf at gmail.com> wrote:
> Hi Strahil,
>
>> WARNING: As you enabled sharding - NEVER DISABLE SHARDING, EVER !
>
> Thanks -- good to be reminded :)
>
>>> When you say they will not be optimal are you referring mainly to
>>> performance considerations? We did plenty of testing, and in terms of
>>> performance didn't have issues even with I/O-intensive workloads (using
>>> SSDs; I had issues with spinning disks).
>>
>> Yes, the client side has to connect to 6 bricks (4+2) at a time and
>> calculate the data in order to obtain the necessary information. The same
>> is valid for writing.
>> If you need to conserve space, you can test VDO without compression (or
>> even with it).
>
> Understood -- will explore VDO. Storage usage efficiency is less important
> than fault tolerance or performance for us -- disperse volumes seemed to
> tick all the boxes, so we looked at them primarily.
> But clearly I had missed that they are not used as mainstream VM storage
> for oVirt (I did know they weren't supported, but as explained I thought it
> was more on the management side).
>
>> Also with replica volumes, you can use 'choose-local' (in case you have
>> storage faster than the network, like NVMe) and increase the read speed.
>> Of course this feature is useful for a hyperconverged setup (Gluster +
>> oVirt on the same node).
>
> Will explore this option as well, thanks for the suggestion.
>
>> If you were using oVirt 4.3, I would recommend you to focus on Gluster.
>> Yet, you use oVirt 4.4, which is quite new and needs some polishing.
>
> oVirt 4.3.9 (using the older CentOS 7 qemu/libvirt) unfortunately had
> similar issues with the disperse volumes. Not sure if they were exactly the
> same, as I never looked deeper into it, but the results were similar.
> oVirt 4.4.0 has some issues with snapshot deletion that are independent
> from Gluster (I have raised the issue here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1840414, should be sorted with
> 4.4.2 I guess), so at the moment it only works with the "testing" AV repo.

In that case I can recommend you to:
1. Ensure you have enough space on all bricks for the logs (/var/log/glusterfs). Several gigs should be OK.
2. Set all logs to 'TRACE'. Red Hat's documentation on the topic is quite good:
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level
3. Reproduce the issue on a fresh VM (one that has never had a snapshot deleted).
4. Switch all logs back to 'INFO', as per the link in point 2.

The logs will be spread among all nodes. If you have remote logging available, you can also use it for analysis of the logs.

Most probably the brick logs can provide useful information.

>> Check the oVirt engine logs (on the HostedEngine VM or your standalone
>> engine), the vdsm logs on the host that was running the VM, and next
>> check the brick logs.
>
> Will do.
>
> Thanks,
> Marco

About VDO - it might require some tuning, and even afterwards it won't be very performant, so it depends on your needs.

Best Regards,
Strahil Nikolov
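The log levels from steps 2 and 4, and the 'choose-local' option mentioned for replica volumes, are ordinary per-volume options, so they can be toggled with the gluster CLI. A rough sketch, not taken from this thread: it assumes the affected volume is named SSD_Storage (the name that appears in the client logs quoted later in the thread), and uses a made-up VM_Replica volume name for the replica-only example.

# Step 2: raise brick- and client-side logging to TRACE for the affected volume
gluster volume set SSD_Storage diagnostics.brick-log-level TRACE
gluster volume set SSD_Storage diagnostics.client-log-level TRACE

# Step 4: once the issue has been reproduced, drop the levels back to INFO
gluster volume set SSD_Storage diagnostics.brick-log-level INFO
gluster volume set SSD_Storage diagnostics.client-log-level INFO

# 'choose-local' applies to replica volumes only (VM_Replica is a hypothetical name,
# since SSD_Storage above is a disperse volume)
gluster volume set VM_Replica cluster.choose-local on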
Marco Fais
2020-Jul-07 21:46 UTC
[Gluster-users] Problems with qemu and disperse volumes (live merge)
Hi Strahil,

first of all, thanks a million for your help -- really appreciate it. Thanks also for the pointers on the debugging.

I have tried it, and while I can't interpret the results, I think I might have found something. There is a lot of information, so hopefully this is the relevant part.

During snapshot creation and deletion, I can see the following errors in the client log:

[2020-07-07 21:23:06.837381] W [MSGID: 122019] [ec-helpers.c:401:ec_loc_gfid_check] 0-SSD_Storage-disperse-0: Mismatching GFID's in loc
[2020-07-07 21:23:06.837387] D [MSGID: 0] [defaults.c:1328:default_mknod_cbk] 0-stack-trace: stack-address: 0x7f0dc0001a78, SSD_Storage-disperse-0 returned -1 error: Input/output error [Input/output error]
[2020-07-07 21:23:06.837392] W [MSGID: 109002] [dht-rename.c:1019:dht_rename_links_create_cbk] 0-SSD_Storage-dht: link/file /8d49207e-f6b9-41d1-8d35-f6e0fb121980/images/4802e66e-a7e3-42df-a570-7155135566ad/b51133ee-54e0-4001-ab4b-9f0dc1e5c6fc.meta on SSD_Storage-disperse-0 failed [Input/output error]
[2020-07-07 21:23:06.837850] D [MSGID: 0] [stack.h:502:copy_frame] 0-stack: groups is null (ngrps: 0) [Invalid argument]
[2020-07-07 21:23:06.839252] D [dict.c:1168:data_to_uint32] (-->/lib64/libglusterfs.so.0(dict_foreach_match+0x77) [0x7f0ddb1855e7] -->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x384cf) [0x7f0dd23c54cf] -->/lib64/libglusterfs.so.0(data_to_uint32+0x8e) [0x7f0ddb184f2e] ) 0-dict: key null, unsigned integer type asked, has integer type [Invalid argument]
[2020-07-07 21:23:06.839272] D [MSGID: 0] [dht-common.c:6674:dht_readdirp_cbk] 0-SSD_Storage-dht: Processing entries from SSD_Storage-disperse-0
[2020-07-07 21:23:06.839281] D [MSGID: 0] [dht-common.c:6681:dht_readdirp_cbk] 0-SSD_Storage-dht: SSD_Storage-disperse-0: entry = ., type = 4
[2020-07-07 21:23:06.839291] D [MSGID: 0] [dht-common.c:6813:dht_readdirp_cbk] 0-SSD_Storage-dht: SSD_Storage-disperse-0: Adding entry = .
[2020-07-07 21:23:06.839297] D [MSGID: 0] [dht-common.c:6681:dht_readdirp_cbk] 0-SSD_Storage-dht: SSD_Storage-disperse-0: entry = .., type = 4
[2020-07-07 21:23:06.839324] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-6 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839327] D [dict.c:1800:dict_get_int32] (-->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x227d6) [0x7f0dd23af7d6] -->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x17661) [0x7f0dd23a4661] -->/lib64/libglusterfs.so.0(dict_get_int32+0x107) [0x7f0ddb186437] ) 0-dict: key glusterfs.inodelk-count, integer type asked, has unsigned integer type [Invalid argument]
[2020-07-07 21:23:06.839361] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-11 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839395] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-15 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839419] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-9 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839473] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-18 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839471] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-10 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839491] D [dict.c:1800:dict_get_int32] (-->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x256ad) [0x7f0dd23b26ad] -->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x17661) [0x7f0dd23a4661] -->/lib64/libglusterfs.so.0(dict_get_int32+0x107) [0x7f0ddb186437] ) 0-dict: key glusterfs.inodelk-count, integer type asked, has unsigned integer type [Invalid argument]
[2020-07-07 21:23:06.839512] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-7 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839526] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-23 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839543] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-22 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839543] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-16 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839556] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-21 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839596] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-12 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839617] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-14 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839631] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-13 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839636] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc00395a8, SSD_Storage-client-17 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839643] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0034598, SSD_Storage-client-8 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839656] D [MSGID: 0] [defaults.c:1548:default_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc007c428, SSD_Storage-disperse-2 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839665] D [MSGID: 0] [dht-common.c:998:dht_discover_cbk] 0-SSD_Storage-dht: lookup of (null) on SSD_Storage-disperse-2 returned error [Stale file handle]
[2020-07-07 21:23:06.839666] D [MSGID: 0] [defaults.c:1548:default_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc007c428, SSD_Storage-disperse-1 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839683] D [MSGID: 0] [dht-common.c:998:dht_discover_cbk] 0-SSD_Storage-dht: lookup of (null) on SSD_Storage-disperse-1 returned error [Stale file handle]
[2020-07-07 21:23:06.839686] D [dict.c:1168:data_to_uint32] (-->/lib64/libglusterfs.so.0(dict_foreach_match+0x77) [0x7f0ddb1855e7] -->/usr/lib64/glusterfs/7.5/xlator/cluster/disperse.so(+0x384cf) [0x7f0dd23c54cf] -->/lib64/libglusterfs.so.0(data_to_uint32+0x8e) [0x7f0ddb184f2e] ) 0-dict: key null, unsigned integer type asked, has integer type [Invalid argument]
[2020-07-07 21:23:06.839698] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-19 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839703] D [MSGID: 0] [dht-common.c:6674:dht_readdirp_cbk] 0-SSD_Storage-dht: Processing entries from SSD_Storage-disperse-0
[2020-07-07 21:23:06.839714] D [MSGID: 0] [dht-common.c:6681:dht_readdirp_cbk] 0-SSD_Storage-dht: SSD_Storage-disperse-0: entry = .., type = 4
[2020-07-07 21:23:06.839716] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-30 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839724] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-34 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839720] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-35 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839755] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-31 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839759] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc009c108, SSD_Storage-client-20 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839774] D [MSGID: 0] [defaults.c:1548:default_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc007c428, SSD_Storage-disperse-3 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839775] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-32 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839783] D [MSGID: 0] [dht-common.c:998:dht_discover_cbk] 0-SSD_Storage-dht: lookup of (null) on SSD_Storage-disperse-3 returned error [Stale file handle]
[2020-07-07 21:23:06.839798] D [MSGID: 0] [dht-common.c:601:dht_discover_complete] 0-SSD_Storage-dht: key trusted.glusterfs.quota.read-only not present in dict
[2020-07-07 21:23:06.839807] D [MSGID: 0] [client-rpc-fops_v2.c:2641:client4_0_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc0024b48, SSD_Storage-client-33 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839807] D [MSGID: 0] [dht-layout.c:789:dht_layout_preset] 0-SSD_Storage-dht: file 00000000-0000-0000-0000-000000000000, subvol = SSD_Storage-disperse-4
[2020-07-07 21:23:06.839825] D [MSGID: 0] [defaults.c:1548:default_lookup_cbk] 0-stack-trace: stack-address: 0x7f0dc007c428, SSD_Storage-disperse-5 returned -1 error: Stale file handle [Stale file handle]
[2020-07-07 21:23:06.839835] D [MSGID: 0] [dht-common.c:998:dht_discover_cbk] 0-SSD_Storage-dht: lookup of (null) on SSD_Storage-disperse-5 returned error [Stale file handle]

The above is logged shortly before the qemu-kvm process crashes with the usual error:

Unexpected error in raw_check_lock_bytes() at block/file-posix.c:811:
2020-07-07T21:23:06.847336Z qemu-kvm: Failed to get shared "write" lock

I have also looked at the brick logs, but there is too much information there and I would need to know what to look for.

Not sure if there is any benefit in looking into this any further?

Thanks,
Marco
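On what to look for in the brick logs: rather than reading them end to end, one option is to grep the per-brick logs for the seconds around the crash, keeping only warnings and errors, and to compare the trusted.gfid xattr of the affected .meta file across the bricks behind SSD_Storage-disperse-0, since that is the subvolume reporting "Mismatching GFID's in loc". A rough sketch, assuming the default brick log location and a made-up brick mount point of /gluster/brick1 (neither path is taken from this thread):

# On each node: pull the window around the crash, keeping only W/E entries
grep "^\[2020-07-07 21:23:0" /var/log/glusterfs/bricks/*.log | grep -E "\] [WE] \["

# On each brick of SSD_Storage-disperse-0 (as root): compare the file's GFID
getfattr -n trusted.gfid -e hex \
  /gluster/brick1/8d49207e-f6b9-41d1-8d35-f6e0fb121980/images/4802e66e-a7e3-42df-a570-7155135566ad/b51133ee-54e0-4001-ab4b-9f0dc1e5c6fc.meta

If the hex GFID differs between the bricks of that subvolume, it would line up with the ec_loc_gfid_check warning above.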
On Thu, 2 Jul 2020 at 15:45, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
> In that case I can recommend you to:
> 1. Ensure you have enough space on all bricks for the logs (/var/log/glusterfs). Several gigs should be OK.
> 2. Set all logs to 'TRACE'. Red Hat's documentation on the topic is quite good:
> https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3/html/administration_guide/configuring_the_log_level
> 3. Reproduce the issue on a fresh VM (one that has never had a snapshot deleted).
> 4. Switch all logs back to 'INFO', as per the link in point 2.
>
> The logs will be spread among all nodes. If you have remote logging
> available, you can also use it for analysis of the logs.
>
> Most probably the brick logs can provide useful information.
>
> About VDO - it might require some tuning, and even afterwards it won't be
> very performant, so it depends on your needs.
>
> Best Regards,
> Strahil Nikolov