Olaf Buitelaar
2020-Jan-27 21:49 UTC
[Gluster-users] [Errno 107] Transport endpoint is not connected
Dear Gluster users,

I'm a bit at a loss here, and any help would be appreciated. I've lost a couple of virtual machines, since the disks suffered from severe XFS errors, and some won't boot because they can't resolve the size of the image, as reported by vdsm:
"VM kube-large-01 is down with error. Exit message: Unable to get volume size for domain 5f17d41f-d617-48b8-8881-a53460b02829 volume f16492a6-2d0e-4657-88e3-a9f4d8e48e74."

which is also reported by the vdsm-client;

vdsm-client Volume getSize storagepoolID=59cd53a9-0003-02d7-00eb-0000000001e3 storagedomainID=5f17d41f-d617-48b8-8881-a53460b02829 imageID=2f96fd46-1851-49c8-9f48-78bb50dbdffd volumeID=f16492a6-2d0e-4657-88e3-a9f4d8e48e74
vdsm-client: Command Volume.getSize with args {'storagepoolID': '59cd53a9-0003-02d7-00eb-0000000001e3', 'storagedomainID': '5f17d41f-d617-48b8-8881-a53460b02829', 'volumeID': 'f16492a6-2d0e-4657-88e3-a9f4d8e48e74', 'imageID': '2f96fd46-1851-49c8-9f48-78bb50dbdffd'} failed: (code=100, message=[Errno 107] Transport endpoint is not connected)

with the corresponding gluster mount log;

[2020-01-27 19:42:22.678793] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-14: remote operation failed. Path: /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 (a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851) [Permission denied]
[2020-01-27 19:42:22.678828] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-13: remote operation failed. Path: /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 (a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851) [Permission denied]
[2020-01-27 19:42:22.679806] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-14: remote operation failed.
Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.679862] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-13: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.679981] W [MSGID: 108027] [afr-common.c:2274:afr_attempt_readsubvol_set] 0-ovirt-data-replicate-3: no read subvols for /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
[2020-01-27 19:42:22.680606] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-14: remote operation failed. Path: /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.680622] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-13: remote operation failed. Path: /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.681742] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-13: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.681871] W [MSGID: 108027] [afr-common.c:2274:afr_attempt_readsubvol_set] 0-ovirt-data-replicate-3: no read subvols for /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
[2020-01-27 19:42:22.682344] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-14: remote operation failed.
Path: /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 (00000000-0000-0000-0000-000000000000) [Permission denied]
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-14: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 2 times between [2020-01-27 19:42:22.679806] and [2020-01-27 19:42:22.683308]
[2020-01-27 19:42:22.683327] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-data-client-13: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:42:22.683438] W [MSGID: 108027] [afr-common.c:2274:afr_attempt_readsubvol_set] 0-ovirt-data-replicate-3: no read subvols for /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
[2020-01-27 19:42:22.683495] I [dict.c:560:dict_get] (-->/usr/lib64/glusterfs/6.7/xlator/cluster/replicate.so(+0x6e92b) [0x7faaaadeb92b] -->/usr/lib64/glusterfs/6.7/xlator/cluster/distribute.so(+0x45c78) [0x7faaaab08c78] -->/lib64/libglusterfs.so.0(dict_get+0x94) [0x7faab36ac254] ) 0-dict: !this || key=trusted.glusterfs.dht.mds [Invalid argument]
[2020-01-27 19:42:22.683506] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 176728: LOOKUP() /5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74 => -1 (Transport endpoint is not connected)

In addition to this, vdsm also reported that it couldn't find the image of the HostedEngine, which refused to boot;

2020-01-25 10:03:45,345+0000 ERROR (vm/20d69acd) [storage.TaskManager.Task] (Task='ffdc4242-17ae-4ea1-9535-0e6fcb81944d') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in
prepareImage
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 3203, in prepareImage
    raise se.VolumeDoesNotExist(leafUUID)
VolumeDoesNotExist: Volume does not exist: ('38e4fba7-f140-4630-afab-0f744ebe3b57',)

2020-01-25 10:03:45,345+0000 ERROR (vm/20d69acd) [virt.vm] (vmId='20d69acd-edfd-4aeb-a2ae-49e9c121b7e9') The vm start process failed (vm:933)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 867, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2795, in _run
    self._devices = self._make_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2635, in _make_devices
    disk_objs = self._perform_host_local_adjustment()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2708, in _perform_host_local_adjustment
    self._preparePathsForDrives(disk_params)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 1036, in _preparePathsForDrives
    drive, self.id, path=path
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 426, in prepareVolumePath
    raise vm.VolumeError(drive)
VolumeError: Bad volume specification {'protocol': 'gluster', 'address': {'function': '0x0', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'slot': '0x06'}, 'serial': '9191ca25-536f-42cd-8373-c04ff9cc1a64', 'index': 0, 'iface': 'virtio', 'apparentsize': '62277025792', 'specParams': {}, 'cache': 'none', 'imageID': '9191ca25-536f-42cd-8373-c04ff9cc1a64', 'shared': 'exclusive', 'truesize': '50591027712', 'type': 'disk', 'domainID': '313f5d25-76af-4ecd-9a20-82a2fe815a3c', 'reqsize': '0', 'format': 'raw', 'poolID': '00000000-0000-0000-0000-000000000000', 'device': 'disk', 'path': 'ovirt-engine/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/9191ca25-536f-42cd-8373-c04ff9cc1a64/38e4fba7-f140-4630-afab-0f744ebe3b57', 'propagateErrors': 'off', 'name': 'vda',
'volumeID': '38e4fba7-f140-4630-afab-0f744ebe3b57', 'diskType': 'network', 'alias': 'ua-9191ca25-536f-42cd-8373-c04ff9cc1a64', 'hosts': [{'name': '10.201.0.9', 'port': '0'}], 'discard': False}

And last, there is a storage domain which refuses to activate (from the vdsm.log);

2020-01-25 10:01:11,750+0000 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata (monitor:499)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/monitor.py", line 497, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 391, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata', 1, bytearray(b"/usr/bin/dd: failed to open \'/rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata\': Transport endpoint is not connected\n"))

with the corresponding gluster mount log;

The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-0: remote operation failed. Path: /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md (00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4 times between [2020-01-27 19:58:33.063826] and [2020-01-27 19:59:21.690134]
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-1: remote operation failed. Path: /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md (00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4 times between [2020-01-27 19:58:33.063734] and [2020-01-27 19:59:21.690150]
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-0: remote operation failed.
Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4 times between [2020-01-27 19:58:33.065027] and [2020-01-27 19:59:21.691313]
The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]" repeated 4 times between [2020-01-27 19:58:33.065106] and [2020-01-27 19:59:21.691328]
The message "W [MSGID: 108027] [afr-common.c:2274:afr_attempt_readsubvol_set] 0-ovirt-mon-2-replicate-0: no read subvols for /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md" repeated 4 times between [2020-01-27 19:58:33.065163] and [2020-01-27 19:59:21.691369]
[2020-01-27 19:59:50.539315] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-0: remote operation failed. Path: /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:59:50.539321] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-1: remote operation failed. Path: /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:59:50.540412] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:59:50.540477] W [MSGID: 114031] [client-rpc-fops_v2.c:2634:client4_0_lookup_cbk] 0-ovirt-mon-2-client-0: remote operation failed.
Path: (null) (00000000-0000-0000-0000-000000000000) [Permission denied]
[2020-01-27 19:59:50.540533] W [MSGID: 108027] [afr-common.c:2274:afr_attempt_readsubvol_set] 0-ovirt-mon-2-replicate-0: no read subvols for /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md
[2020-01-27 19:59:50.540604] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 99: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 19:59:51.488775] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 105: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 19:59:58.713818] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 112: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 19:59:59.007467] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 118: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:00.136599] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 125: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:00.781763] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 131: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:00.878852] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 137: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:01.580272] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 144: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:01.686464] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 150: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:01.757087] W
[fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 156: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:03.061635] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 163: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:03.161894] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 169: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:04.801107] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 176: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:07.251125] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 183: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)

And some apps connecting directly to gluster mounts report these errors as well;

2020-01-27 1:10:48 0 [ERROR] mysqld: File '/binlog/binlog.~rec~' not found (Errcode: 107 "Transport endpoint is not connected")
2020-01-27 3:28:01 0 [ERROR] mysqld: File '/binlog/binlog.000113' not found (Errcode: 107 "Transport endpoint is not connected")

So the errors seem to hint at either a connection issue or a quorum loss of some sort. However, gluster is running on its own private and separate network, with no firewall rules or anything else that could obstruct the connection. In addition, gluster volume status reports that all bricks and nodes are up, and gluster volume heal reports no pending heals.

What makes this issue even more interesting is that when I manually check the files, all seems fine; for the first issue, where the machine won't start because vdsm cannot determine the size.
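As an aside: every one of these failures carries the same errno, 107, which maps to ENOTCONN. As far as I understand, the FUSE client can return ENOTCONN not only on a real network disconnect, but also when AFR cannot pick any readable subvolume (the "no read subvols" warnings above), which would match a cluster where gluster volume status shows everything up. A quick sanity check of the mapping in Python:

```python
import errno
import os

# Errno 107 is ENOTCONN, the code behind every
# "Transport endpoint is not connected" message above.
print(errno.ENOTCONN)               # 107 on Linux
print(errno.errorcode[107])         # ENOTCONN
print(os.strerror(errno.ENOTCONN))  # Transport endpoint is not connected
```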
qemu is able to report the size;

qemu-img info /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
image: /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
file format: raw
virtual size: 34T (37580963840000 bytes)
disk size: 7.1T

In addition, I'm able to mount the volume using a loop device;

losetup /dev/loop0 /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
kpartx -av /dev/loop0
vgscan
vgchange -ay
mount /dev/mapper/cl--data5-data5 /data5/

After this I'm able to see all contents of the disk, and in fact write to it. So the earlier reported connection error doesn't seem to apply here? This is actually how I'm currently running the VM: I detached the disk and mounted it in the VM via the loop device. The disk is a data disk for a heavily loaded mysql instance, and mysql reports no errors and has been running for about a day now. Of course this is not the way it should run, but it is at least working; only performance seems a bit off. So I would like to solve the issue and be able to attach the image as a disk again.

For the second issue, where the image of the HostedEngine couldn't be found, all also seems correct. The file is there and has the correct permissions;

ls -la /rhev/data-center/mnt/glusterSD/10.201.0.9\:_ovirt-engine/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/9191ca25-536f-42cd-8373-c04ff9cc1a64/
total 49406333
drwxr-xr-x.  2 vdsm kvm        4096 Jan 25 12:03 .
drwxr-xr-x. 13 vdsm kvm        4096 Jan 25 14:16 ..
-rw-rw----.  1 vdsm kvm 62277025792 Jan 23 03:04 38e4fba7-f140-4630-afab-0f744ebe3b57
-rw-rw----.
1 vdsm kvm     1048576 Jan 25 21:48 38e4fba7-f140-4630-afab-0f744ebe3b57.lease
-rw-r--r--.  1 vdsm kvm         285 Jan 27  2018 38e4fba7-f140-4630-afab-0f744ebe3b57.meta

And I'm able to mount the image using a loop device and access its contents. Unfortunately the VM wouldn't boot due to XFS errors. After tinkering with this for about a day to make it boot, I gave up and restored from a recent backup. But I took the postgres data dir from the mounted old image over to the new VM, and postgres was perfectly fine with it, also indicating the image wasn't completely toast.

And the last issue, where the storage domain wouldn't activate: the file it claims it cannot read in the log is perfectly readable and writable;

cat /rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata
CLASS=Data
DESCRIPTION=ovirt-mon-2
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICYLOCKRENEWALINTERVALSEC=5
POOL_UUID=59cd53a9-0003-02d7-00eb-0000000001e3
REMOTE_PATH=10.201.0.11:/ovirt-mon-2
ROLE=Regular
SDUUID=47edf8ee-83c4-4bd2-b275-20ccd9de4458
TYPE=GLUSTERFS
VERSION=4
_SHA_CKSUM=d49b4a74e70a22a1b816519e8ed4167994672807

So I've no clue where these "Transport endpoint is not connected" errors are coming from, or how to resolve them. I think there are 4 possible causes for this issue;

1) I was trying to optimize the throughput of gluster on some volumes, since we recently gained some additional write load which we had difficulty keeping up with. So I tried to incrementally raise server.event-threads, via: gluster v set ovirt-data server.event-threads X. Since this didn't seem to improve the performance, I changed it back to its original value. But when I did that, the VMs running on these volumes all locked up and required a reboot, which was by then still possible. Please note that for the volumes ovirt-engine and ovirt-mon-2 this setting wasn't changed.
2) I had a mix of gluster 6.6 and 6.7 running, since I was in the middle of upgrading everything to 6.7.

3) On one of the physical brick nodes, XFS errors were reported after a reboot and resolved by xfs_repair, which removed some inodes in the process. I wasn't too worried about that, since I expected the gluster self-heal daemon to resolve them. That seemed true for all volumes except one, where a single gfid was pending heal for about 2 days; in this case it was exactly the image for which vdsm reports it cannot resolve the size. There are other VM images with the same issue, which I left out for brevity. However, the pending heal of the single gfid resolved once I mounted the image via the loop device and started writing to it, which is probably due to the nature of how gluster determines what needs healing, despite a gluster heal X full having been issued before. I could also confirm that, while the heal was still pending, the pending gfid was in fact missing in the underlying brick directory on the brick node.

4) I did some brick replaces, but only of arbiter bricks of the affected volume in the first issue (the ovirt-data volume).
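For reference, this is roughly how I checked for the pending gfid on the brick side: gluster keeps a hardlink for every file under the brick's .glusterfs directory, at a path derived from the first two hex-character pairs of the gfid. A small helper (the brick path is just an example from the ovirt-data layout below, and the gfid is the one from the mount log above):

```python
def gfid_brick_path(brick_root, gfid):
    """Path of the .glusterfs hardlink gluster keeps for a gfid:
    <brick>/.glusterfs/<first 2 hex chars>/<next 2>/<full gfid>."""
    g = gfid.lower()
    return "%s/.glusterfs/%s/%s/%s" % (brick_root, g[:2], g[2:4], g)

# the gfid reported in the mount log, on one of the ovirt-data bricks
print(gfid_brick_path("/data5/gfs/bricks/brick1/ovirt-data",
                      "a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851"))
# /data5/gfs/bricks/brick1/ovirt-data/.glusterfs/a1/9a/a19abb2f-8e7e-42f0-a3c1-dad1eeb3a851
```

If that path (or the regular file it should hardlink to) is missing while heal info still lists the gfid, the entry exists only on the other replicas.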
The volume info of the affected volumes looks like this;

Volume Name: ovirt-data
Type: Distributed-Replicate
Volume ID: 2775dc10-c197-446e-a73f-275853d38666
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (2 + 1) = 12
Transport-type: tcp
Bricks:
Brick1: 10.201.0.5:/data5/gfs/bricks/brick1/ovirt-data
Brick2: 10.201.0.1:/data5/gfs/bricks/brick1/ovirt-data
Brick3: 10.201.0.9:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
Brick4: 10.201.0.7:/data5/gfs/bricks/brick1/ovirt-data
Brick5: 10.201.0.9:/data5/gfs/bricks/brick1/ovirt-data
Brick6: 10.201.0.11:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
Brick7: 10.201.0.6:/data5/gfs/bricks/brick1/ovirt-data
Brick8: 10.201.0.8:/data5/gfs/bricks/brick1/ovirt-data
Brick9: 10.201.0.12:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
Brick10: 10.201.0.12:/data5/gfs/bricks/brick1/ovirt-data
Brick11: 10.201.0.11:/data5/gfs/bricks/brick1/ovirt-data
Brick12: 10.201.0.10:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
Options Reconfigured:
cluster.choose-local: off
server.outstanding-rpc-limit: 1024
storage.owner-gid: 36
storage.owner-uid: 36
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
performance.write-behind-window-size: 512MB
performance.cache-size: 384MB
server.event-threads: 5
performance.strict-o-direct: on
cluster.brick-multiplex: on

Volume Name: ovirt-engine
Type: Distributed-Replicate
Volume ID: 9cc4dade-ef2e-4112-bcbf-e0fbc5df4ebc
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.201.0.5:/data5/gfs/bricks/brick1/ovirt-engine
Brick2: 10.201.0.1:/data5/gfs/bricks/brick1/ovirt-engine
Brick3: 10.201.0.2:/data5/gfs/bricks/brick1/ovirt-engine
Brick4: 10.201.0.8:/data5/gfs/bricks/brick1/ovirt-engine
Brick5: 10.201.0.9:/data5/gfs/bricks/brick1/ovirt-engine
Brick6: 10.201.0.3:/data5/gfs/bricks/brick1/ovirt-engine
Brick7: 10.201.0.12:/data5/gfs/bricks/brick1/ovirt-engine
Brick8: 10.201.0.11:/data5/gfs/bricks/brick1/ovirt-engine
Brick9: 10.201.0.7:/data5/gfs/bricks/brick1/ovirt-engine
Options Reconfigured:
performance.strict-o-direct: on
performance.write-behind-window-size: 512MB
features.shard-block-size: 64MB
performance.cache-size: 128MB
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: enable
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
storage.owner-uid: 36
storage.owner-gid: 36
cluster.brick-multiplex: on

Volume Name: ovirt-mon-2
Type: Replicate
Volume ID: 111ff79a-565a-4d31-9f31-4c839749bafd
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.201.0.10:/data0/gfs/bricks/brick1/ovirt-mon-2
Brick2: 10.201.0.11:/data0/gfs/bricks/brick1/ovirt-mon-2
Brick3: 10.201.0.12:/data0/gfs/bricks/bricka/ovirt-mon-2 (arbiter)
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
storage.owner-uid: 36
storage.owner-gid: 36
performance.strict-o-direct: on
performance.cache-size: 64MB
performance.write-behind-window-size: 128MB
features.shard-block-size: 64MB
cluster.brick-multiplex: on

Thanks,
Olaf
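P.S. one thing that stands out when comparing the reconfigured options of the volumes is that they disagree on a few settings. A quick way to spot the divergence (using only a hand-picked subset of the options listed above, purely illustrative):

```python
# subset of "Options Reconfigured" from the volume info above
ovirt_data = {
    "network.remote-dio": "off",
    "performance.strict-o-direct": "on",
    "cluster.choose-local": "off",
}
ovirt_engine = {
    "network.remote-dio": "enable",
    "performance.strict-o-direct": "on",
}

# options set on both volumes but with different values
diff = {k: (ovirt_data[k], ovirt_engine[k])
        for k in ovirt_data.keys() & ovirt_engine.keys()
        if ovirt_data[k] != ovirt_engine[k]}
print(diff)  # {'network.remote-dio': ('off', 'enable')}
```

I don't know whether the remote-dio / strict-o-direct combination matters for these errors, but it is one of the few places where the affected ovirt-data volume differs from ovirt-engine.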
Strahil Nikolov
2020-Jan-28 16:31 UTC
[Gluster-users] [Errno 107] Transport endpoint is not connected
Path: (null) >(00000000-0000-0000-0000-000000000000) [Permission denied] >[2020-01-27 19:59:50.540533] W [MSGID: 108027] >[afr-common.c:2274:afr_attempt_readsubvol_set] >0-ovirt-mon-2-replicate-0: >no read subvols for /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md >[2020-01-27 19:59:50.540604] W [fuse-bridge.c:942:fuse_entry_cbk] >0-glusterfs-fuse: 99: LOOKUP() >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md >=> -1 (Transport endpoint is not connected) >[2020-01-27 19:59:51.488775] W [fuse-bridge.c:942:fuse_entry_cbk] >0-glusterfs-fuse: 105: LOOKUP() >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint >is >not connected) >[2020-01-27 19:59:58.713818] W [fuse-bridge.c:942:fuse_entry_cbk] >0-glusterfs-fuse: 112: LOOKUP() >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint >is >not connected) >[2020-01-27 19:59:59.007467] W [fuse-bridge.c:942:fuse_entry_cbk] >0-glusterfs-fuse: 118: LOOKUP() >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint >is >not connected) >[2020-01-27 20:00:00.136599] W [fuse-bridge.c:942:fuse_entry_cbk] >0-glusterfs-fuse: 125: LOOKUP() >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint >is >not connected) >[2020-01-27 20:00:00.781763] W [fuse-bridge.c:942:fuse_entry_cbk] >0-glusterfs-fuse: 131: LOOKUP() >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint >is >not connected) >[2020-01-27 20:00:00.878852] W [fuse-bridge.c:942:fuse_entry_cbk] >0-glusterfs-fuse: 137: LOOKUP() >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint >is >not connected) >[2020-01-27 20:00:01.580272] W [fuse-bridge.c:942:fuse_entry_cbk] >0-glusterfs-fuse: 144: LOOKUP() >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint >is >not connected) >[2020-01-27 20:00:01.686464] W [fuse-bridge.c:942:fuse_entry_cbk] >0-glusterfs-fuse: 150: LOOKUP() >/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint >is >not connected) 
[2020-01-27 20:00:01.757087] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 156: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:03.061635] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 163: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:03.161894] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 169: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:04.801107] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 176: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)
[2020-01-27 20:00:07.251125] W [fuse-bridge.c:942:fuse_entry_cbk] 0-glusterfs-fuse: 183: LOOKUP() /47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md => -1 (Transport endpoint is not connected)

And some apps connecting directly to gluster mounts report these errors;
2020-01-27  1:10:48 0 [ERROR] mysqld: File '/binlog/binlog.~rec~' not found (Errcode: 107 "Transport endpoint is not connected")
2020-01-27  3:28:01 0 [ERROR] mysqld: File '/binlog/binlog.000113' not found (Errcode: 107 "Transport endpoint is not connected")

So the errors seem to hint at either a connection issue or a quorum loss of some sort. However, gluster runs on its own private, separate network, with no firewall rules or anything else that could obstruct the connection. In addition, gluster volume status reports that all bricks and nodes are up, and gluster volume heal reports no pending heals. What makes this issue even more interesting is that when I manually check the files, all seems fine.

For the first issue, where the machine won't start because vdsm cannot determine the size:
qemu is able to report the size;

qemu-img info /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
image: /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
file format: raw
virtual size: 34T (37580963840000 bytes)
disk size: 7.1T

In addition, I'm able to mount the volume using a loop device;

losetup /dev/loop0 /rhev/data-center/mnt/glusterSD/10.201.0.7:_ovirt-data/5f17d41f-d617-48b8-8881-a53460b02829/images/2f96fd46-1851-49c8-9f48-78bb50dbdffd/f16492a6-2d0e-4657-88e3-a9f4d8e48e74
kpartx -av /dev/loop0
vgscan
vgchange -ay
mount /dev/mapper/cl--data5-data5 /data5/

After this I'm able to see all contents of the disk, and in fact write to it. So the earlier reported connection error doesn't seem to apply here? This is actually how I'm currently running the VM: I detached the disk and mounted it in the VM via the loop device instead. The disk is a data disk for a heavily loaded mysql instance; mysql reports no errors and has been running like this for about a day now. Of course this is not the way it should run, but it is at least working, only performance seems a bit off. So I would like to solve the issue and be able to attach the image as a disk again.

For the second issue, where the image of the HostedEngine couldn't be found, everything also seems correct; the file is there, with the correct permissions:

ls -la /rhev/data-center/mnt/glusterSD/10.201.0.9\:_ovirt-engine/313f5d25-76af-4ecd-9a20-82a2fe815a3c/images/9191ca25-536f-42cd-8373-c04ff9cc1a64/
total 49406333
drwxr-xr-x.  2 vdsm kvm        4096 Jan 25 12:03 .
drwxr-xr-x. 13 vdsm kvm        4096 Jan 25 14:16 ..
-rw-rw----.  1 vdsm kvm 62277025792 Jan 23 03:04 38e4fba7-f140-4630-afab-0f744ebe3b57
-rw-rw----.  1 vdsm kvm     1048576 Jan 25 21:48 38e4fba7-f140-4630-afab-0f744ebe3b57.lease
-rw-r--r--.  1 vdsm kvm         285 Jan 27  2018 38e4fba7-f140-4630-afab-0f744ebe3b57.meta

And I'm able to mount the image using a loop device and access its contents. Unfortunately the VM wouldn't boot due to XFS errors. After tinkering with it for about a day to make it boot, I gave up and restored from a recent backup. But I took the postgres data dir from the mounted old image over to the new VM, and postgres was perfectly fine with it, also indicating the image wasn't completely toast.

And the last issue, where the storage domain wouldn't activate: the file the log claims it cannot read is perfectly readable and writable;

cat /rhev/data-center/mnt/glusterSD/10.201.0.11:_ovirt-mon-2/47edf8ee-83c4-4bd2-b275-20ccd9de4458/dom_md/metadata
CLASS=Data
DESCRIPTION=ovirt-mon-2
IOOPTIMEOUTSEC=10
LEASERETRIES=3
LEASETIMESEC=60
LOCKPOLICY=
LOCKRENEWALINTERVALSEC=5
POOL_UUID=59cd53a9-0003-02d7-00eb-0000000001e3
REMOTE_PATH=10.201.0.11:/ovirt-mon-2
ROLE=Regular
SDUUID=47edf8ee-83c4-4bd2-b275-20ccd9de4458
TYPE=GLUSTERFS
VERSION=4
_SHA_CKSUM=d49b4a74e70a22a1b816519e8ed4167994672807

So I've no clue where these "Transport endpoint is not connected" errors are coming from, or how to resolve them.

I think there are 4 possible causes for this issue;

1) I was trying to optimize the throughput of gluster on some volumes, since we recently gained some additional write load which we had difficulty keeping up with. So I tried to incrementally raise server.event-threads, via;
gluster v set ovirt-data server.event-threads X
Since this didn't seem to improve performance, I changed it back to its original value. But when I did that, the VMs running on these volumes all locked up and required a reboot, which was by then still possible. Please note that for the volumes ovirt-engine and ovirt-mon-2 this setting wasn't changed.
2) I had a mix of gluster 6.6 and 6.7 running, since I was in the middle of upgrading everything to 6.7.

3) On one of the physical brick nodes, XFS errors were reported after a reboot and resolved with xfs_repair, which removed some inodes in the process. I wasn't too worried about that, since I expected the gluster self-heal daemon to resolve them. That seemed true for all volumes except one, where a single gfid was pending heal for about 2 days - in this case exactly the image vdsm reports it cannot resolve the size of. (There are other VM images with the same issue, which I left out for brevity.) However, the pending heal of that single gfid resolved once I mounted the image via the loop device and started writing to it, which is probably due to the way gluster determines what needs healing - despite a gluster heal X full having been issued before. I could also confirm that, while the heal was still pending, the pending gfid was in fact missing from the underlying brick directory on the brick node.

4) I did some brick replaces, but only of arbiter bricks of the volume affected in the first issue (ovirt-data).
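For cause 3, the pending-heal state of a single file can also be inspected directly on the bricks via its AFR extended attributes. A sketch (the brick path is taken from the volume info below; the relative image path is a placeholder you'd substitute):

```shell
# On each replica's brick host, dump the xattrs of the affected file.
# Non-zero trusted.afr.ovirt-data-client-* values indicate pending
# heals against the corresponding brick.
getfattr -d -m . -e hex \
  /data5/gfs/bricks/brick1/ovirt-data/<relative-path-of-image>

# Re-trigger healing and re-check the queue afterwards:
gluster volume heal ovirt-data full
gluster volume heal ovirt-data info
```

This only reads metadata on the brick, so it is safe to run while the volume is online.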
The volume infos of the affected volumes look like this;

Volume Name: ovirt-data
Type: Distributed-Replicate
Volume ID: 2775dc10-c197-446e-a73f-275853d38666
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x (2 + 1) = 12
Transport-type: tcp
Bricks:
Brick1: 10.201.0.5:/data5/gfs/bricks/brick1/ovirt-data
Brick2: 10.201.0.1:/data5/gfs/bricks/brick1/ovirt-data
Brick3: 10.201.0.9:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
Brick4: 10.201.0.7:/data5/gfs/bricks/brick1/ovirt-data
Brick5: 10.201.0.9:/data5/gfs/bricks/brick1/ovirt-data
Brick6: 10.201.0.11:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
Brick7: 10.201.0.6:/data5/gfs/bricks/brick1/ovirt-data
Brick8: 10.201.0.8:/data5/gfs/bricks/brick1/ovirt-data
Brick9: 10.201.0.12:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
Brick10: 10.201.0.12:/data5/gfs/bricks/brick1/ovirt-data
Brick11: 10.201.0.11:/data5/gfs/bricks/brick1/ovirt-data
Brick12: 10.201.0.10:/data0/gfs/bricks/bricka/ovirt-data (arbiter)
Options Reconfigured:
cluster.choose-local: off
server.outstanding-rpc-limit: 1024
storage.owner-gid: 36
storage.owner-uid: 36
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
performance.write-behind-window-size: 512MB
performance.cache-size: 384MB
server.event-threads: 5
performance.strict-o-direct: on
cluster.brick-multiplex: on

Volume Name: ovirt-engine
Type: Distributed-Replicate
Volume ID: 9cc4dade-ef2e-4112-bcbf-e0fbc5df4ebc
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.201.0.5:/data5/gfs/bricks/brick1/ovirt-engine
Brick2: 10.201.0.1:/data5/gfs/bricks/brick1/ovirt-engine
Brick3: 10.201.0.2:/data5/gfs/bricks/brick1/ovirt-engine
Brick4: 10.201.0.8:/data5/gfs/bricks/brick1/ovirt-engine
Brick5: 10.201.0.9:/data5/gfs/bricks/brick1/ovirt-engine
Brick6: 10.201.0.3:/data5/gfs/bricks/brick1/ovirt-engine
Brick7: 10.201.0.12:/data5/gfs/bricks/brick1/ovirt-engine
Brick8: 10.201.0.11:/data5/gfs/bricks/brick1/ovirt-engine
Brick9: 10.201.0.7:/data5/gfs/bricks/brick1/ovirt-engine
Options Reconfigured:
performance.strict-o-direct: on
performance.write-behind-window-size: 512MB
features.shard-block-size: 64MB
performance.cache-size: 128MB
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: enable
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
storage.owner-uid: 36
storage.owner-gid: 36
cluster.brick-multiplex: on

Volume Name: ovirt-mon-2
Type: Replicate
Volume ID: 111ff79a-565a-4d31-9f31-4c839749bafd
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.201.0.10:/data0/gfs/bricks/brick1/ovirt-mon-2
Brick2: 10.201.0.11:/data0/gfs/bricks/brick1/ovirt-mon-2
Brick3: 10.201.0.12:/data0/gfs/bricks/bricka/ovirt-mon-2 (arbiter)
Options Reconfigured:
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.low-prio-threads: 32
network.remote-dio: off
cluster.eager-lock: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
cluster.data-self-heal-algorithm: full
cluster.locking-scheme: granular
cluster.shd-max-threads: 8
cluster.shd-wait-qlength: 10000
features.shard: on
user.cifs: off
cluster.choose-local: off
client.event-threads: 4
server.event-threads: 4
storage.owner-uid: 36
storage.owner-gid: 36
performance.strict-o-direct: on
performance.cache-size: 64MB
performance.write-behind-window-size: 128MB
features.shard-block-size: 64MB
cluster.brick-multiplex: on

Thanks Olaf

---

Hi Olaf,

Thanks for the detailed output. At first glance I noticed that you have a HostedEngine domain serving both oVirt's engine VM and other VMs - is that right? If so, that's against best practices and not recommended.

Second, you use brick multiplexing, but according to the RH documentation that feature is not supported for your workload - so in your case it's drawing attention, but should not be the problem.

Can you specify how many physical hosts you have?

I will try to check the output more deeply, but I think you need to check:
1. Check the gluster heal status - any pending heals should be resolved.
2. Use telnet/nc/ncat/netcat to verify that each host can reach the peers' brick ports.
3. 'gluster volume heal <volume> info' should report that all bricks are connected, and 'gluster volume status' must report a pid for every brick.
4. OPTIONAL - Try to create smaller disks via oVirt (it's not a good idea to have large qcow2 disks) and assign them to your mysql VM. Then try to pvmove the LVs from the old disk (mounted with loop) to the new disks - that way you can get rid of the old qcow disk.
5. What is your oVirt version? Could it be an old 3.x?

Don't forget to backup :)

Best Regards,
Strahil Nikolov
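The connectivity and heal checks in steps 1-3 above can be sketched as a small script (the host list and brick ports below are examples - take the real ports from the output of `gluster volume status`):

```shell
#!/bin/sh
# Check heal state and brick reachability for one volume.
VOL=ovirt-mon-2

# Steps 1 and 3: every brick should show a pid/port in 'status',
# and 'heal info' should list no pending entries and all bricks
# as Connected.
gluster volume status "$VOL"
gluster volume heal "$VOL" info

# Step 2: from every host, probe the management port (24007) and
# each brick port reported by 'gluster volume status'.
for host in 10.201.0.10 10.201.0.11 10.201.0.12; do
    for port in 24007 49152 49153; do
        if nc -z -w 2 "$host" "$port"; then
            echo "$host:$port open"
        else
            echo "$host:$port CLOSED"
        fi
    done
done
```

Run it from each gluster node in turn; a CLOSED port on any host-to-host path would explain the "Transport endpoint is not connected" errors even when the bricks themselves look healthy.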