Sergio Gelato
2019-Jan-16 09:31 UTC
[Pkg-xen-devel] Bug#919460: xen: disk I/O problems on stretch PV guest restores after security update
Source: xen
Version: 4.8.5+shim4.10.2+xsa282-1+deb9u11

Yesterday I upgraded a test dom0 to this version (from 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10; stretch amd64, Xeon E5430), then rebooted. Running domU's were saved and restored in the usual way. However, all PV domU's running stretch (both i386 and amd64, all kernel 4.9.130-2) lost write access to xvda on restore due to I/O errors. Sample kernel log attached. (/var/log/kern.log stopped recording entries, so I grabbed dmesg output to show what happened afterwards.)

I also had PV domUs running jessie (kernel 3.16.59-1, again both i386 and amd64). These were restored successfully.

In all cases, xvda is backed by an LVM logical volume local to the dom0.

After "reboot -f"ing some of the affected domUs (which made them functional again), I rebooted the dom0. This time all domUs were restored normally. (Of course those that still had their filesystems mounted read-only stayed that way.)

Is anyone else seeing this?

-------------- next part --------------
Jan 15 07:40:46 bst1 kernel: [1096816.319075] Freezing user space processes ... (elapsed 0.001 seconds) done.
Jan 15 07:40:46 bst1 kernel: [1096816.320237] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
Jan 15 07:40:46 bst1 kernel: [1096816.321499] PM: freeze of devices complete after 0.113 msecs
Jan 15 07:40:46 bst1 kernel: [1096816.321501] suspending xenstore...
Jan 15 07:40:46 bst1 kernel: [1096816.321540] PM: late freeze of devices complete after 0.034 msecs
Jan 15 07:40:46 bst1 kernel: [1096816.321582] PM: noirq freeze of devices complete after 0.038 msecs
Jan 15 07:40:46 bst1 kernel: [1096816.321609] xen:grant_table: Grant tables using version 1 layout
Jan 15 07:40:46 bst1 kernel: [1096816.321609] Suspended for 122.096 seconds
Jan 15 07:40:46 bst1 kernel: [1096816.321609] PM: noirq restore of devices complete after 0.098 msecs
Jan 15 07:40:46 bst1 kernel: [1096816.321609] PM: early restore of devices complete after 0.038 msecs
Jan 15 07:40:46 bst1 kernel: [1096816.328857] PM: restore of devices complete after 6.076 msecs
Jan 15 07:40:46 bst1 kernel: [1096816.328909] Restarting tasks ... done.
Jan 15 07:40:51 bst1 kernel: [1096821.985693] blk_update_request: I/O error, dev xvda, sector 0

[1096816.319075] Freezing user space processes ... (elapsed 0.001 seconds) done.
[1096816.320237] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[1096816.321499] PM: freeze of devices complete after 0.113 msecs
[1096816.321501] suspending xenstore...
[1096816.321540] PM: late freeze of devices complete after 0.034 msecs
[1096816.321582] PM: noirq freeze of devices complete after 0.038 msecs
[1096816.321609] xen:grant_table: Grant tables using version 1 layout
[1096816.321609] Suspended for 122.096 seconds
[1096816.321609] PM: noirq restore of devices complete after 0.098 msecs
[1096816.321609] PM: early restore of devices complete after 0.038 msecs
[1096816.328857] PM: restore of devices complete after 6.076 msecs
[1096816.328909] Restarting tasks ... done.
[1096821.985693] blk_update_request: I/O error, dev xvda, sector 0
[1096821.988866] blk_update_request: I/O error, dev xvda, sector 0
[1096821.988892] blk_update_request: I/O error, dev xvda, sector 0
[1096821.988908] blk_update_request: I/O error, dev xvda, sector 12941838
[1096821.988950] Aborting journal on device dm-3-8.
[1096821.991190] blk_update_request: I/O error, dev xvda, sector 12931074
[1096821.991213] Buffer I/O error on dev dm-3, logical block 139265, lost sync page write
[1096821.991230] blk_update_request: I/O error, dev xvda, sector 3663168
[1096821.991247] blk_update_request: I/O error, dev xvda, sector 9413656
[1096821.991270] Aborting journal on device dm-1-8.
[1096821.991334] Aborting journal on device dm-0-8.
[1096821.991386] JBD2: Error -5 detected when updating journal superblock for dm-3-8.
[1096821.993331] blk_update_request: I/O error, dev xvda, sector 9351168
[1096821.993349] Buffer I/O error on dev dm-1, logical block 196608, lost sync page write
[1096821.993363] blk_update_request: I/O error, dev xvda, sector 3649536
[1096821.993372] Buffer I/O error on dev dm-0, logical block 393216, lost sync page write
[1096821.993388] blk_update_request: I/O error, dev xvda, sector 7778304
[1096821.993398] Buffer I/O error on dev dm-1, logical block 0, lost sync page write
[1096821.993427] JBD2: Error -5 detected when updating journal superblock for dm-1-8.
[1096821.993480] EXT4-fs error (device dm-1): ext4_journal_check_start:56: Detected aborted journal
[1096821.993497] EXT4-fs (dm-1): Remounting filesystem read-only
[1096821.993514] EXT4-fs (dm-1): previous I/O error to superblock detected
[1096821.993553] JBD2: Error -5 detected when updating journal superblock for dm-0-8.
[1096821.994306] Buffer I/O error on dev dm-1, logical block 0, lost sync page write
[1097011.549253] blk_update_request: 1 callbacks suppressed
[1097011.549258] blk_update_request: I/O error, dev xvda, sector 12652546
[1097011.549286] Buffer I/O error on dev dm-3, logical block 1, lost sync page write
[1097011.549320] EXT4-fs error (device dm-3): ext4_journal_check_start:56: Detected aborted journal
[1097011.549335] EXT4-fs (dm-3): Remounting filesystem read-only
[1097011.549346] EXT4-fs (dm-3): previous I/O error to superblock detected
[1097011.550397] blk_update_request: I/O error, dev xvda, sector 12652546
[1097011.550416] Buffer I/O error on dev dm-3, logical block 1, lost sync page write
[1098633.834762] blk_update_request: I/O error, dev xvda, sector 503808
[1098633.834790] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[1098633.834822] EXT4-fs error (device dm-0): ext4_journal_check_start:56: Detected aborted journal
[1098633.834848] EXT4-fs (dm-0): Remounting filesystem read-only
[1098633.834859] EXT4-fs (dm-0): previous I/O error to superblock detected
[1098633.835550] blk_update_request: I/O error, dev xvda, sector 503808
[1098633.835569] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[1155878.881255] blk_update_request: I/O error, dev xvda, sector 0
[1155878.882053] blk_update_request: I/O error, dev xvda, sector 16807376
[1155878.882084] Aborting journal on device dm-4-8.
[1155878.882805] blk_update_request: I/O error, dev xvda, sector 16805888
[1155878.882825] Buffer I/O error on dev dm-4, logical block 425984, lost sync page write
[1155878.882854] JBD2: Error -5 detected when updating journal superblock for dm-4-8.
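A log like the one attached above can be summarized mechanically. The following is only a rough sketch, not something the reporter used: it counts `blk_update_request` I/O errors per block device and lists the filesystems that ext4 remounted read-only. The function name `summarize` and the two regular expressions are assumptions made for illustration; they match the message wording seen in this particular log and may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Patterns modelled on the messages in the attached log (assumed wording).
BLK_ERR = re.compile(r"blk_update_request: I/O error, dev ([^,\s]+), sector (\d+)")
REMOUNT_RO = re.compile(r"EXT4-fs \(([^)]+)\): Remounting filesystem read-only")


def summarize(log_text):
    """Return (errors per block device, devices remounted read-only)."""
    errors = Counter()
    readonly = []
    for line in log_text.splitlines():
        match = BLK_ERR.search(line)
        if match:
            errors[match.group(1)] += 1
            continue
        match = REMOUNT_RO.search(line)
        if match and match.group(1) not in readonly:
            readonly.append(match.group(1))
    return dict(errors), readonly


# Three lines taken from the dmesg output above.
sample = "\n".join([
    "[1096821.985693] blk_update_request: I/O error, dev xvda, sector 0",
    "[1096821.988908] blk_update_request: I/O error, dev xvda, sector 12941838",
    "[1096821.993497] EXT4-fs (dm-1): Remounting filesystem read-only",
])
errors, readonly = summarize(sample)
```

Feeding it the output of `dmesg` from an affected domU would give a quick view of which devices saw errors and which filesystems went read-only.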
Hans van Kranenburg
2019-Jan-16 23:41 UTC
[Pkg-xen-devel] Bug#919460: Bug#919460: xen: disk I/O problems on stretch PV guest restores after security update
Hi Sergio,

On 1/16/19 10:31 AM, Sergio Gelato wrote:
> Source: xen
> Version: 4.8.5+shim4.10.2+xsa282-1+deb9u11
>
> Yesterday I upgraded a test dom0 to this version (from
> 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10; stretch amd64, Xeon E5430),
> then rebooted. Running domU's were saved and restored in the usual
> way. However, all PV domU's running stretch (both i386 and amd64, all
> kernel 4.9.130-2) lost write access to xvda on restore due to I/O
> errors. Sample kernel log attached. (/var/log/kern.log stopped
> recording entries, so I grabbed dmesg output to show what happened
> afterwards.)

Thanks for your report.

> and amd64). These were restored successfully.
>
> In all cases, xvda is backed by an LVM logical volume local to the
> dom0.
>
> After "reboot -f"ing some of the affected domUs (which made them
> functional again), I rebooted the dom0. This time all domUs were
> restored normally. (Of course those that still had their filesystems
> mounted read-only stayed that way.)
>
> Is anyone else seeing this?

The usual questions here would be like "can you reproduce the issue" etc... Because if you can consistently cause the problem to happen, you're in a position to start trying things.

The following is not an answer to your question, but a personal suggestion:

Speaking for myself, I've had major trouble with Linux 4.9 in the dom0 causing all kinds of crashes when using live migrate (similar to the suspend/resume you're doing). I was never able to track them down, and they were never explained or fixed. Crashes related to blk-mq or storage in general were among them.

At this point, with the 4.19 kernel for buster already in pretty good shape and in stretch-backports as well, I can recommend trying it out.

Hans
Sergio Gelato
2019-Jan-17 07:58 UTC
[Pkg-xen-devel] Bug#919460: Bug#919460: xen: disk I/O problems on stretch PV guest restores after security update
* Hans van Kranenburg [2019-01-17 00:41:39 +0100]:
> > After "reboot -f"ing some of the affected domUs (which made them
> > functional again), I rebooted the dom0. This time all domUs were
> > restored normally. (Of course those that still had their filesystems
> > mounted read-only stayed that way.)
> >
> > Is anyone else seeing this?
>
> The usual questions here would be like "can you reproduce the issue"
> etc... Because if you consistently can cause the problem to happen,
> you're in a position to start trying things.

Since this only happened on the reboot immediately following the Xen upgrade, in order to reproduce it I would need to either try it on another system or downgrade Xen on my test system and upgrade it again. The latter doesn't look like a good use of my limited time.

I will eventually want to upgrade my production dom0's. Any domU's that I care about on these systems will be shut down prior to the reboot (that was the main lesson for me from this test), but I can leave a canary behind and see what happens to it. So eventually I should have an idea whether this is reproducible across systems. This will take a few weeks, though: plenty of time for others to run into the same problem if it's reproducible.

I've been wondering whether the observed behaviour might be explained in terms of the specific changes made by the latest security patches. I don't see any such obvious explanation myself, but maybe to an expert it would leap out; that's partly why I reported this. If this is the case, then maybe there is no fix other than documenting the issue in release notes.

It's also conceivable that the bug is in that domU kernel rather than in Xen. I might set up canary domU's with 4.9.144 (stretch-proposed-updates) and/or 4.19.x.
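One way a canary domU could report whether it survived a restore is to look for filesystems that ended up mounted read-only, which is the symptom described in the original report. The sketch below is illustrative only and was not part of the thread: the function name `read_only_mounts` and the default restriction to ext4 are assumptions, and it expects text in the usual /proc/mounts layout (device, mount point, filesystem type, options).

```python
def read_only_mounts(mounts_text, fs_types=("ext4",)):
    """Return (device, mount_point) pairs for read-only mounts of the given types."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type, options = fields[:4]
        # Mount options are comma-separated; "ro" marks a read-only mount.
        if fs_type in fs_types and "ro" in options.split(","):
            result.append((device, mount_point))
    return result


# Hypothetical mount table for a domU after a bad restore.
sample = "\n".join([
    "/dev/mapper/vg-root / ext4 ro,relatime,data=ordered 0 0",
    "/dev/mapper/vg-var /var ext4 rw,relatime,data=ordered 0 0",
    "proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0",
])
affected = read_only_mounts(sample)
```

On a real guest the input would be the contents of /proc/mounts, read after the dom0 has restored the domU; an empty result suggests the restore did not trigger the read-only remount.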
Diederik de Haas
2020-Oct-30 22:18 UTC
[Pkg-xen-devel] Bug#919460: Bug#919460: xen: disk I/O problems on stretch PV guest restores after security update
Control: tag -1 moreinfo

On Thursday, 17 January 2019 08:58:53 CET Sergio Gelato wrote:
> * Hans van Kranenburg [2019-01-17 00:41:39 +0100]:
> > The usual questions here would be like "can you reproduce the issue"
> > etc... Because if you consistently can cause the problem to happen,
> > you're in a position to start trying things.
>
> Since this only happened on the reboot immediately following the Xen
> upgrade, in order to reproduce it I would need to either try it on
> another system ...
>
> ... but I can leave a canary behind and see what happens to it. So
> eventually I should have an idea whether this is reproducible
> across systems. This will take a few weeks, though: plenty of time
> for others to run into the same problem if it's reproducible.

Do you have more info wrt this problem? Did it happen again?
Debian Bug Tracking System
2020-Oct-30 22:27 UTC
[Pkg-xen-devel] Processed: Re: Bug#919460: Bug#919460: xen: disk I/O problems on stretch PV guest restores after security update
Processing control commands:

> tag -1 moreinfo
Bug #919460 [src:xen] xen: disk I/O problems on stretch PV guest restores after security update
Added tag(s) moreinfo.

--
919460: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=919460
Debian Bug Tracking System
Contact owner at bugs.debian.org with problems
Sergio Gelato
2020-Oct-30 22:46 UTC
[Pkg-xen-devel] Bug#919460: Bug#919460: xen: disk I/O problems on stretch PV guest restores after security update
No, it did not happen again, at least not to me (and I've since upgraded to buster and Xen 4.11), and if no one else has run into it either it's probably not worth continued attention. In particular, it's unlikely to have been a regression triggered by the +deb9u11 security update.
Diederik de Haas
2020-Oct-30 23:02 UTC
[Pkg-xen-devel] Bug#919460: Bug#919460: xen: disk I/O problems on stretch PV guest restores after security update
Control: tag -1 unreproducible

On Friday, 30 October 2020 23:46:17 CET Sergio Gelato wrote:
> No, it did not happen again, at least not to me (and I've since upgraded to
> buster and Xen 4.11), and if no one else has run into it either it's
> probably not worth continued attention. In particular, it's unlikely to
> have been a regression triggered by the +deb9u11 security update.

Thanks for reporting back. I'm inclined to agree with you that it's not worth continued attention. If others on the Debian Xen team agree, I'll likely close this bug.

Cheers,
Diederik
Debian Bug Tracking System
2020-Oct-30 23:03 UTC
[Pkg-xen-devel] Processed: Re: Bug#919460: Bug#919460: xen: disk I/O problems on stretch PV guest restores after security update
Processing control commands:

> tag -1 unreproducible
Bug #919460 [src:xen] xen: disk I/O problems on stretch PV guest restores after security update
Added tag(s) unreproducible.

--
919460: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=919460
Debian Bug Tracking System
Contact owner at bugs.debian.org with problems