Krutika Dhananjay
2019-May-13 08:20 UTC
[Gluster-users] VMs blocked for more than 120 seconds
OK. In that case, can you check if the following two changes help:

# gluster volume set $VOL network.remote-dio off
# gluster volume set $VOL performance.strict-o-direct on

Preferably one option changed at a time, its impact tested, and then the
next change applied and tested.

Also, gluster version please?

-Krutika

On Mon, May 13, 2019 at 1:02 PM Martin Toth <snowmailer at gmail.com> wrote:

> Cache in qemu is none. That should be correct. This is the full command:
>
> /usr/bin/qemu-system-x86_64 -name one-312 -S
> -machine pc-i440fx-xenial,accel=kvm,usb=off -m 4096 -realtime mlock=off
> -smp 4,sockets=4,cores=1,threads=1 -uuid e95a774e-a594-4e98-b141-9f30a3f848c1
> -no-user-config -nodefaults
> -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-one-312/monitor.sock,server,nowait
> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime
> -no-shutdown -boot order=c,menu=on,splash-time=3000,strict=on
> -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
>
> -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4
> -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5
> -drive file=/var/lib/one//datastores/116/312/*disk.0*,format=raw,if=none,id=drive-virtio-disk1,cache=none
> -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk1,id=virtio-disk1
> -drive file=gluster://localhost:24007/imagestore/*7b64d6757acc47a39503f68731f89b8e*,format=qcow2,if=none,id=drive-scsi0-0-0-0,cache=none
> -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0
> -drive file=/var/lib/one//datastores/116/312/*disk.1*,format=raw,if=none,id=drive-ide0-0-0,readonly=on
> -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
>
> -netdev tap,fd=26,id=hostnet0
> -device e1000,netdev=hostnet0,id=net0,mac=02:00:5c:f0:e4:39,bus=pci.0,addr=0x3
> -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0
> -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-one-312/org.qemu.guest_agent.0,server,nowait
> -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
> -vnc 0.0.0.0:312,password -device cirrus-vga,id=video0,bus=pci.0,addr=0x2
> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 -msg timestamp=on
>
> I've highlighted the disks. The first is the VM context disk (FUSE used),
> the second is SDA (the OS is installed here; libgfapi used), and the third
> is SWAP (FUSE used).
>
> Krutika,
> I will start profiling on the Gluster volumes and wait for the next VM to
> fail. Then I will attach/send the profiling info once some VM has failed.
> I suppose this is the correct profiling strategy.
>

About this: how many VMs do you need to recreate it? A single VM? Or
multiple VMs doing IO in parallel?

> Thanks,
> BR!
> Martin
>
> On 13 May 2019, at 09:21, Krutika Dhananjay <kdhananj at redhat.com> wrote:
>
> Also, what's the caching policy that qemu is using on the affected VMs?
> Is it cache=none? Or something else? You can get this information in the
> command line of the qemu-kvm process corresponding to your VM in the ps
> output.
>
> -Krutika
>
> On Mon, May 13, 2019 at 12:49 PM Krutika Dhananjay <kdhananj at redhat.com>
> wrote:
>
>> What version of gluster are you using?
>> Also, can you capture and share volume-profile output for a run where
>> you manage to recreate this issue?
>> https://docs.gluster.org/en/v3/Administrator%20Guide/Monitoring%20Workload/#running-glusterfs-volume-profile-command
>> Let me know if you have any questions.
>>
>> -Krutika
>>
>> On Mon, May 13, 2019 at 12:34 PM Martin Toth <snowmailer at gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> there is no healing operation, no peer disconnects, no read-only
>>> filesystem. Yes, the storage is slow and unavailable for 120 seconds,
>>> but why? It's SSD with 10G networking and performance is good.
>>>
>>> > you'd have its log on qemu's standard output,
>>>
>>> If you mean /var/log/libvirt/qemu/vm.log, there is nothing. I have been
>>> looking for the problem for more than a month and have tried everything.
>>> I can't find anything. Any more clues or leads?
>>>
>>> BR,
>>> Martin
>>>
>>> > On 13 May 2019, at 08:55, lemonnierk at ulrar.net wrote:
>>> >
>>> > On Mon, May 13, 2019 at 08:47:45AM +0200, Martin Toth wrote:
>>> >> Hi all,
>>> >
>>> > Hi
>>> >
>>> >>
>>> >> I am running replica 3 on SSDs with 10G networking. Everything works
>>> OK, but VMs stored in the Gluster volume occasionally freeze with "Task
>>> XY blocked for more than 120 seconds".
>>> >> The only solution is to power off (hard) the VM and then boot it up
>>> again. I am unable to SSH in and also cannot log in on the console; it's
>>> stuck, probably on some disk operation. No error/warning logs or messages
>>> are stored in the VM's logs.
>>> >>
>>> >
>>> > As far as I know this should be unrelated, I get this during heals
>>> > without any freezes, it just means the storage is slow I think.
>>> >
>>> >> KVM/libvirt (qemu) uses libgfapi and a FUSE mount to access VM disks
>>> on the replica volume. Can someone advise how to debug this problem or
>>> what can cause these issues?
>>> >> It's really annoying; I've tried to google everything but nothing
>>> came up. I've tried changing the virtio-scsi-pci to virtio-blk-pci disk
>>> drivers, but it's not related.
>>> >>
>>> >
>>> > Any chance your gluster goes read-only? Have you checked your gluster
>>> > logs to see if maybe they lose each other sometimes?
>>> > /var/log/glusterfs
>>> >
>>> > For libgfapi accesses you'd have its log on qemu's standard output;
>>> > that might contain the actual error at the time of the freeze.
>>> > _______________________________________________
>>> > Gluster-users mailing list
>>> > Gluster-users at gluster.org
>>> > https://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-users
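A minimal sketch of the workflow suggested above, assuming a root shell on
one of the gluster nodes and that $VOL is the affected volume (imagestore
in the qemu command line); the profile commands follow the documentation
linked in the message:

# gluster volume set $VOL network.remote-dio off
# gluster volume info $VOL
(the change should now show up under "Options Reconfigured"; run the VM
workload and watch for hangs before touching the next option)
# gluster volume set $VOL performance.strict-o-direct on
# gluster volume info $VOL

# gluster volume profile $VOL start
(wait for a VM to hit the 120-second hang, then capture and stop)
# gluster volume profile $VOL info > /tmp/$VOL-profile.txt
# gluster volume profile $VOL stop

The profile info output lists per-brick FOP latency statistics for the
sampled run, which is roughly what can be attached to the thread once the
hang has been reproduced.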
Hi Krutika,

> Also, gluster version please?

I am running old 3.7.6. (Yes, I know I should upgrade ASAP.)

I applied "network.remote-dio off" first; the behaviour did not change and
VMs got stuck after some time again. Then I set
"performance.strict-o-direct on" and the problem completely disappeared.
No more hangs at all (7 days without any problems at all). This SOLVED the
issue.

Can you explain what the remote-dio and strict-o-direct options changed in
the behaviour of my Gluster? It would be great for the archive and later
users to understand what solved my issue and why.

Anyway, thanks a LOT!!!

BR,
Martin

> On 13 May 2019, at 10:20, Krutika Dhananjay <kdhananj at redhat.com> wrote:
>
> OK. In that case, can you check if the following two changes help:
>
> # gluster volume set $VOL network.remote-dio off
> # gluster volume set $VOL performance.strict-o-direct on
>
> Preferably one option changed at a time, its impact tested, and then the
> next change applied and tested.
>
> Also, gluster version please?
>
> -Krutika
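As an aside, the cache-mode check described earlier in the thread (reading
it off the qemu-kvm command line in the ps output) can be done with
something like the following one-liner; it is only an illustration and
assumes a Linux host with the usual procps tools:

# ps -eo args | grep '[q]emu-system-x86_64' | tr ',' '\n' | grep 'cache='

Each -drive option that sets a cache mode shows up as its own cache=...
line, so a quick scan confirms whether every gluster-backed disk really
runs with cache=none.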