Hans van Kranenburg
2018-Feb-26 23:40 UTC
[Pkg-xen-devel] Bug#880554: Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64
On 02/26/2018 07:35 PM, Hans van Kranenburg wrote:
> On 02/26/2018 03:52 PM, Ian Jackson wrote:
>> Christian Schwamborn writes ("Re: Bug#880554: xen domu freezes with
>> kernel linux-image-4.9.0-4-amd64"):
>>> I can try, but the only system I can really test this on is a
>>> production system, as it 'reliably' shows this issue (and I don't
>>> want to crash it on purpose on a regular basis). Since I set
>>> gnttab_max_frames to a higher value it runs smoothly. If you're
>>> confident this will work I can try this in the evening, when all
>>> users have logged off.
>>
>> Thanks. I understand your reluctance. I don't want to mislead you.
>> I think the odds of it working are probably ~75%.
>>
>> Unless you want to tolerate that risk, it might be better for us to
>> try to come up with a better way to test it.
>
> I can try this.
>
> I can run a dom0 with Xen 4.8 and a 4.9 domU; I already have xen-diag
> for it (so I confirmed that the patch in this bug report builds OK;
> we should include it for stretch, it's really useful).
>
> I think it's mainly a matter of getting a domU running with various
> combinations of domU kernel, number of disks and vcpus, and then
> looking at the output of xen-diag.

Ok, I spent some time trying things.

Xen:              4.8.3+comet2+shim4.10.0+comet3-1+deb9u4.1
dom0 kernel:      4.9.65-3+deb9u2
domU (PV) kernel: 4.9.82-1+deb9u2

Observation so far: nr_frames increases as soon as a combination of
disk+vcpu has actually been doing disk activity, and then it never
decreases again.

I ended up with a 64-vcpu domU with 10 additional 1GiB disks (xvdc,
xvdd, etc). I created an ext4 fs on each disk and mounted them. Then I
used fio to throw some I/O at the disks, trying to hit as many
combinations of vcpu and disk as possible:

[things]
rw=randwrite
rwmixread=75
size=8M
directory=/mnt/xvdBLAH
ioengine=libaio
direct=1
iodepth=16
numjobs=64

...with BLAH replaced by c, d, e, f, etc.
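The per-disk job files can be generated from that template with a small
loop; a sketch (the file names fio-xvdc..fio-xvdl and the /mnt/xvd*
mount points mirror the setup described above, but are assumptions, not
quoted from my shell history):

```shell
# Generate one fio job file per test disk (xvdc..xvdl), each pointing
# at its own mounted filesystem; contents mirror the [things] template.
for i in c d e f g h i j k l; do
  cat > "fio-xvd$i" <<EOF
[things]
rw=randwrite
rwmixread=75
size=8M
directory=/mnt/xvd$i
ioengine=libaio
direct=1
iodepth=16
numjobs=64
EOF
done

# Then run them back to back:
# for i in c d e f g h i j k l; do fio fio-xvd$i; done
```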
-# rm */things*; for i in c d e f g h i j k l; do fio fio-xvd$i; done

-# while true; do /usr/lib/xen-4.8/bin/xen-diag gnttab_query_size 2; sleep 10; done
domid=2: nr_frames=6, max_nr_frames=128
domid=2: nr_frames=7, max_nr_frames=128
domid=2: nr_frames=7, max_nr_frames=128
domid=2: nr_frames=10, max_nr_frames=128
domid=2: nr_frames=10, max_nr_frames=128
domid=2: nr_frames=11, max_nr_frames=128
domid=2: nr_frames=13, max_nr_frames=128
domid=2: nr_frames=14, max_nr_frames=128
domid=2: nr_frames=15, max_nr_frames=128
domid=2: nr_frames=16, max_nr_frames=128
domid=2: nr_frames=18, max_nr_frames=128
domid=2: nr_frames=18, max_nr_frames=128
domid=2: nr_frames=19, max_nr_frames=128
domid=2: nr_frames=21, max_nr_frames=128
domid=2: nr_frames=21, max_nr_frames=128
domid=2: nr_frames=23, max_nr_frames=128
domid=2: nr_frames=24, max_nr_frames=128
domid=2: nr_frames=24, max_nr_frames=128
domid=2: nr_frames=24, max_nr_frames=128
domid=2: nr_frames=24, max_nr_frames=128

So I can push it up to about 24 when doing this.

-# grep . /sys/module/xen_blkback/parameters/*
/sys/module/xen_blkback/parameters/log_stats:0
/sys/module/xen_blkback/parameters/max_buffer_pages:1024
/sys/module/xen_blkback/parameters/max_persistent_grants:1056
/sys/module/xen_blkback/parameters/max_queues:4
/sys/module/xen_blkback/parameters/max_ring_page_order:4

Now, I rebooted my test dom0 and put the modprobe file in place.
(Note: the filename has to end in .conf!)

-# grep . /sys/module/xen_blkback/parameters/*
/sys/module/xen_blkback/parameters/log_stats:0
/sys/module/xen_blkback/parameters/max_buffer_pages:1024
/sys/module/xen_blkback/parameters/max_persistent_grants:1056
/sys/module/xen_blkback/parameters/max_queues:1
/sys/module/xen_blkback/parameters/max_ring_page_order:0

After doing the same tests, the result ends up being exactly 24 again.
So, the modprobe settings don't seem to do anything.

-# tree /sys/block/xvda/mq
/sys/block/xvda/mq
└── 0
    ├── active
    ├── cpu0
    │   ├── completed
    │   ├── dispatched
    │   ├── merged
    │   └── rq_list
    ├── cpu1
    │   ├── completed
    │   ├── dispatched
    │   ├── merged
    │   └── rq_list
[...]
    ├── cpu63
    │   ├── completed
    │   ├── dispatched
    │   ├── merged
    │   └── rq_list
[...]
    ├── cpu_list
    ├── dispatched
    ├── io_poll
    ├── pending
    ├── queued
    ├── run
    └── tags

65 directories, 264 files

Mwooop mwooop mwoop mwooooo (failure trombone).

It obviously didn't involve any network traffic yet. And this is all
with stretch kernels etc., which are reported to already be
problematic. But the main thing I wanted to test is whether the change
would result in a much lower total amount of grants, and that is not
the case.

So, does anyone have a better idea, or should we just add some clear
documentation for the max frames setting in the grub config example?

Hans
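P.S. For reference, the modprobe file mentioned above: I didn't quote
its exact contents, but based on the /sys values after the reboot it
would be something along these lines (an assumed reconstruction, using
the parameter names shown in the grep output):

```
# /etc/modprobe.d/xen-blkback.conf  (the filename must end in .conf)
# Limit blkback to a single queue per vbd and single-page (order 0)
# rings, matching the post-reboot /sys values above.
options xen-blkback max_queues=1 max_ring_page_order=0
```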
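As for documenting the max frames setting: a sketch of what such an
example could look like on a stretch dom0 (the file path and the value
256 are illustrative assumptions, not tested recommendations):

```
# /etc/default/grub.d/xen.cfg  (illustrative)
# Raise the hypervisor's per-domain grant table frame limit; pick a
# value based on the nr_frames you actually observe with xen-diag.
GRUB_CMDLINE_XEN_DEFAULT="gnttab_max_frames=256"
```

Followed by running update-grub and rebooting the hypervisor.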