I don't have time to bisect this, currently - but just thought I'd let
the list know that, while xen-4.2 works (with the recent S3 changes
I've submitted) - 4.3 is broken again.

I'm not sure if it is the hypervisor, or the kernel, since I upgraded
both in my "unstable" build environment.

Since this is something that XenClient really relies on working, it
has been a pain point with every upgrade of Xen for us.
It is enormously time consuming to debug on every upgrade, and has a
long tail in discovering problems (I started debugging S3 last Aug on
xen-unstable, prior to 4.2 being cut)

How can we work with the community to try to get some sort of
regression testing for this feature that we rely on in our product?

Ben

On Thu, Apr 25, 2013 at 8:00 AM, Ben Guthro <ben@guthro.net> wrote:
> I don't have time to bisect this, currently - but just thought I'd let
> the list know that, while xen-4.2 works (with the recent S3 changes
> I've submitted) - 4.3 is broken again.
>
> I'm not sure if it is the hypervisor, or the kernel, since I upgraded
> both in my "unstable" build environment.
>

This appears to have been a transient issue. My xen tree was a few days old;
updating to the tip seems to have resolved this particular issue.

> Since this is something that XenClient really relies on working, it
> has been a pain point with every upgrade of Xen for us.
> It is enormously time consuming to debug on every upgrade, and has a
> long tail in discovering problems (I started debugging S3 last Aug on
> xen-unstable, prior to 4.2 being cut)
>
> How can we work with the community to try to get some sort of
> regression testing for this feature that we rely on in our product?

I am still interested in ideas for getting this into automated
testing, and any ideas people may have for this.

Would it be helpful to maintain a branch in my xenbits repo that could
be a rebased version of konrad's acpi-s3 patches against Linus' latest
kernel?

On Thu, 2013-04-25 at 18:02 +0100, Ben Guthro wrote:
> On Thu, Apr 25, 2013 at 8:00 AM, Ben Guthro <ben@guthro.net> wrote:
> > Since this is something that XenClient really relies on working, it
> > has been a pain point with every upgrade of Xen for us.
> > It is enormously time consuming to debug on every upgrade, and has a
> > long tail in discovering problems (I started debugging S3 last Aug on
> > xen-unstable, prior to 4.2 being cut)
> >
> > How can we work with the community to try to get some sort of
> > regression testing for this feature that we rely on in our product?
>
> I am still interested in ideas for getting this into automated
> testing, and any ideas people may have for this.

CCing Ian Jackson who runs the test infrastructure.

Contributing new tests is now less onerous than it once was (i.e. it
might even be possible at all). There is some info at
lists.xen.org/archives/html/xen-devel/2012-10/msg01517.html
although the branch may be out of date -- Ian was working on merging the
standalone branch at one point.

Some questions:
      * How automatable is s3?
      * In particular can we automate the wakeup? s3 is save to RAM
        IIRC, and most power control in the test system is done with PDU
        power cycling.
      * Would s3 ever be expected to work on the sorts of whitebox
        server systems which form the osstest pool or do we need to
        investigate additional hardware?
      * How hardware specific are the s3 failures -- we obviously can't
        have one of every laptop ever ;-)

So assuming the answers to the above are positive then contributing a
test case for s3 to the relevant flights seems like a reasonable first
step, even if the expectation is that it would always fail with the
current mainline Xen + mainline Linux. The test system only tracks
regressions, so always failing test cases are OK (you can think of this
in the test-driven development kind of way ;-)).

> Would it be helpful to maintain a branch in my xenbits repo that could
> be a rebased version of konrad's acpi-s3 patches against Linus' latest
> kernel?

What is keeping those out of Linus' tree?

Once we have a test case in the standard flights then we can consider
the options around new flights testing other trees.

Ian.

On Fri, Apr 26, 2013 at 4:10 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Thu, 2013-04-25 at 18:02 +0100, Ben Guthro wrote:
>> On Thu, Apr 25, 2013 at 8:00 AM, Ben Guthro <ben@guthro.net> wrote:
>> > Since this is something that XenClient really relies on working, it
>> > has been a pain point with every upgrade of Xen for us.
>> > It is enormously time consuming to debug on every upgrade, and has a
>> > long tail in discovering problems (I started debugging S3 last Aug on
>> > xen-unstable, prior to 4.2 being cut)
>> >
>> > How can we work with the community to try to get some sort of
>> > regression testing for this feature that we rely on in our product?
>>
>> I am still interested in ideas for getting this into automated
>> testing, and any ideas people may have for this.
>
> CCing Ian Jackson who runs the test infrastructure.

I've also CC'ed a few people here, who I mention in my reply below.

>
> Contributing new tests is now less onerous than it once was (i.e. it
> might even be possible at all). There is some info at
> lists.xen.org/archives/html/xen-devel/2012-10/msg01517.html
> although the branch may be out of date -- Ian was working on merging the
> standalone branch at one point.

I'll read up on this

>
> Some questions:
>       * How automatable is s3?
>       * In particular can we automate the wakeup? s3 is save to RAM
>         IIRC, and most power control in the test system is done with PDU
>         power cycling.

I spoke with George Dunlap a bit about this while I was over in the
UK a few weeks ago, and drew up an example shell script for this:
xen.markmail.org/thread/ghj2ffngemccq6p4
Marek also weighed in, and included some of his own tests, and experiences.

In my experience, this mechanism is about as reliable as your RTC. On
some systems you might tell it to sleep for 30s, and it will wake in
10s.

That said, when things go wrong, the machine does need to be power
cycled...so if you are not physically located near the machine under
test, you would need a PDU as a recovery mechanism, I suppose.

>       * Would s3 ever be expected to work on the sorts of whitebox
>         server systems which form the osstest pool or do we need to
>         investigate additional hardware?

I don't see why it wouldn't work, though admittedly I haven't dealt
with xen on servers since 2009.

>       * How hardware specific are the s3 failures -- we obviously can't
>         have one of every laptop ever ;-)

Clearly. I'm just looking to get a foot in the door here, so there is
a chance of catching gross regressions.
The hardware differences seem to be more timing related, due to
speed... ie, you are likely to uncover new failures when new, faster
hardware comes out for laptops.
Since typically server hardware is faster than laptop hardware, that
would theoretically catch problems at a higher frequency.

>
> So assuming the answers to the above are positive then contributing a
> test case for s3 to the relevant flights seems like a reasonable first
> step, even if the expectation is that it would always fail with the
> current mainline Xen + mainline Linux. The test system only tracks
> regressions, so always failing test cases are OK (you can think of this
> in the test-driven development kind of way ;-)).

I'll take a look at the test infrastructure, and see if I can make
heads/tails of it, and come up with a simplistic test.

>
>> Would it be helpful to maintain a branch in my xenbits repo that could
>> be a rebased version of konrad's acpi-s3 patches against Linus' latest
>> kernel?
>
> What is keeping those out of Linus' tree?

Added Konrad here, but I believe he is on vacation this week.
This has been a bullet point on his OSS presentation, as outstanding
pvops work for at least 3 years now.

IIRC, the x86 guys NACK'ed the change as being too invasive.
I googled around a bit, but can't seem to find the thread about it.

>
> Once we have a test case in the standard flights then we can consider
> the options around new flights testing other trees.

I'm not sure I understand this point.
Are you saying you want to see a test that fails in the standard test
flight first...because without Konrad's patches, it will be guaranteed
not to work.

...and without other changesets queued up for the 3.10 merge window,
non-boot CPUs will always have incorrect C-states.

Thanks
Ben

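For readers who want a concrete picture of the RTC-based wake-up discussed in this message, here is a minimal sketch. It is not the shell script from the linked thread. It assumes a Linux dom0 that exposes the standard sysfs files `/sys/class/rtc/rtc0/wakealarm` and `/sys/power/state`, that `rtc0` is the wake-capable clock, and that the caller is root. The function names, the `sysfs` and `now` parameters, and the 5 second tolerance are all hypothetical, chosen only so the logic can be exercised without suspending a real machine.

```python
import time
from pathlib import Path


def suspend_with_rtc_wake(seconds, sysfs=Path("/sys"), now=time.time):
    """Arm the RTC alarm, request suspend-to-RAM, return elapsed wall time.

    Wall-clock time is used on purpose: a monotonic clock may not advance
    while the machine is suspended.
    """
    alarm = sysfs / "class/rtc/rtc0/wakealarm"  # assumption: rtc0 is the wake source
    state = sysfs / "power/state"
    alarm.write_text("0\n")            # clear any alarm that is already armed
    alarm.write_text(f"+{seconds}\n")  # "+N" means N seconds from the RTC's current time
    start = now()
    state.write_text("mem\n")          # on a real system this returns only after resume
    return now() - start


def woke_on_schedule(requested, elapsed, tolerance=5.0):
    """Flag early or late wake-ups, e.g. asking for 30s and waking after 10s."""
    return abs(elapsed - requested) <= tolerance
```

A harness could call `suspend_with_rtc_wake(30)` and treat `woke_on_schedule(30, elapsed)` being false as a warning rather than a failure, since an early wake says more about the RTC than about Xen. If the write to `power/state` never returns, nothing in this sketch recovers the host; that is where the PDU comes in.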
On Fri, 2013-04-26 at 13:19 +0100, Ben Guthro wrote:
> On Fri, Apr 26, 2013 at 4:10 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > Some questions:
> >       * How automatable is s3?
> >       * In particular can we automate the wakeup? s3 is save to RAM
> >         IIRC, and most power control in the test system is done with PDU
> >         power cycling.
>
> I spoke with George Dunlap a bit about this while I was over in the
> UK a few weeks ago, and drew up an example shell script for this:
> xen.markmail.org/thread/ghj2ffngemccq6p4
> Marek also weighed in, and included some of his own tests, and experiences.
>
> In my experience, this mechanism is about as reliable as your RTC. On
> some systems you might tell it to sleep for 30s, and it will wake in
> 10s.
>
> That said, when things go wrong, the machine does need to be power
> cycled...so if you are not physically located near the machine under
> test, you would need a PDU as a recovery mechanism, I suppose.

That's OK, all the systems in the test harness would have to have PDU
for the other test cases (initial install etc) anyway.

> >> Would it be helpful to maintain a branch in my xenbits repo that could
> >> be a rebased version of konrad's acpi-s3 patches against Linus' latest
> >> kernel?
> >
> > What is keeping those out of Linus' tree?
>
> Added Konrad here, but I believe he is on vacation this week.
> This has been a bullet point on his OSS presentation, as outstanding
> pvops work for at least 3 years now.
>
> IIRC, the x86 guys NACK'ed the change as being too invasive.
> I googled around a bit, but can't seem to find the thread about it.

I wonder if it might be something like that :-/

> > Once we have a test case in the standard flights then we can consider
> > the options around new flights testing other trees.
>
> I'm not sure I understand this point.
> Are you saying you want to see a test that fails in the standard test
> flight first...because without Konrad's patches, it will be guaranteed
> not to work.

Right. AIUI the flights (and I may be using the wrong term here) are
somewhat uniform and few in number and get run with various
combinations of inputs (Xen tree, Linux tree, Qemu tree), so there is
effectively one "test Linux PV kernel flight" and one "test Xen PV
guests flight" etc, so we want to get S3 into those flights, with the
existing set of "* tree" inputs.

IOW we should add a new row to the grid
chiark.greenend.org.uk/~xensrcts/logs/17816 for s3 testing
and then we can consider adding a new column with a different set of
trees as input.

Ian J may have a different opinion on how to approach, but he's away
until mid next week.

> ...and without other changesets queued up for the 3.10 merge window,
> non-boot CPUs will always have incorrect C-states.

It's OK to add the tests before things work.

Ian.

On Fri, Apr 26, 2013 at 9:17 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
<snip>
>> I'm not sure I understand this point.
>> Are you saying you want to see a test that fails in the standard test
>> flight first...because without Konrad's patches, it will be guaranteed
>> not to work.
>
> Right. AIUI the flights (and I may be using the wrong term here) are
> somewhat uniform and few in number and get run with various
> combinations of inputs (Xen tree, Linux tree, Qemu tree), so there is
> effectively one "test Linux PV kernel flight" and one "test Xen PV
> guests flight" etc, so we want to get S3 into those flights, with the
> existing set of "* tree" inputs.
>
> IOW we should add a new row to the grid
> chiark.greenend.org.uk/~xensrcts/logs/17816 for s3 testing
> and then we can consider adding a new column with a different set of
> trees as input.
>
> Ian J may have a different opinion on how to approach, but he's away
> until mid next week.
>
>> ...and without other changesets queued up for the 3.10 merge window,
>> non-boot CPUs will always have incorrect C-states.
>
> It's OK to add the tests before things work.

I've attached a patch to osstest that would be the beginnings of this
sort of test, if I'm reading the code correctly.
However, I don't really have a setup that I can smoke test this.

I also left a number of TODOs, since I really just want an opinion to
see if I'm on the right track.

Thanks
Ben

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
lists.xen.org/xen-devel

On Fri, 2013-04-26 at 15:10 +0100, Ben Guthro wrote:
> On Fri, Apr 26, 2013 at 9:17 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> <snip>
> >> I'm not sure I understand this point.
> >> Are you saying you want to see a test that fails in the standard test
> >> flight first...because without Konrad's patches, it will be guaranteed
> >> not to work.
> >
> > Right. AIUI the flights (and I may be using the wrong term here) are
> > somewhat uniform and few in number and get run with various
> > combinations of inputs (Xen tree, Linux tree, Qemu tree), so there is
> > effectively one "test Linux PV kernel flight" and one "test Xen PV
> > guests flight" etc, so we want to get S3 into those flights, with the
> > existing set of "* tree" inputs.
> >
> > IOW we should add a new row to the grid
> > chiark.greenend.org.uk/~xensrcts/logs/17816 for s3 testing
> > and then we can consider adding a new column with a different set of
> > trees as input.
> >
> > Ian J may have a different opinion on how to approach, but he's away
> > until mid next week.
> >
> >> ...and without other changesets queued up for the 3.10 merge window,
> >> non-boot CPUs will always have incorrect C-states.
> >
> > It's OK to add the tests before things work.
>
> I've attached a patch to osstest that would be the beginnings of this
> sort of test, if I'm reading the code correctly.
> However, I don't really have a setup that I can smoke test this.
>
> I also left a number of TODOs, since I really just want an opinion to
> see if I'm on the right track.

Right track, as far as it goes, but I think most of what you have put in
TestSupport.pm should actually be in the ts-host-suspend test case
itself.

You probably need IanJ's input for anything more concrete.

Ian.

On Thu, Apr 25, 2013 at 01:02:35PM -0400, Ben Guthro wrote:
> On Thu, Apr 25, 2013 at 8:00 AM, Ben Guthro <ben@guthro.net> wrote:
> > I don't have time to bisect this, currently - but just thought I'd let
> > the list know that, while xen-4.2 works (with the recent S3 changes
> > I've submitted) - 4.3 is broken again.
> >
> > I'm not sure if it is the hypervisor, or the kernel, since I upgraded
> > both in my "unstable" build environment.
> >
>
> This appears to have been a transient issue. My xen tree was a few days old
> updating to the tip seems to have resolved this particular issue.
>

Ok, so master (xen-unstable) works OK regarding ACPI S3. Good.

What hypervisor-side patches are still missing from stable-4.2 branch?

-- Pasi

On Fri, Apr 26, 2013 at 4:47 PM, Pasi Kärkkäinen <pasik@iki.fi> wrote:
> On Thu, Apr 25, 2013 at 01:02:35PM -0400, Ben Guthro wrote:
>> On Thu, Apr 25, 2013 at 8:00 AM, Ben Guthro <ben@guthro.net> wrote:
>> > I don't have time to bisect this, currently - but just thought I'd let
>> > the list know that, while xen-4.2 works (with the recent S3 changes
>> > I've submitted) - 4.3 is broken again.
>> >
>> > I'm not sure if it is the hypervisor, or the kernel, since I upgraded
>> > both in my "unstable" build environment.
>> >
>>
>> This appears to have been a transient issue. My xen tree was a few days old
>> updating to the tip seems to have resolved this particular issue.
>>
>
> Ok, so master (xen-unstable) works OK regarding ACPI S3. Good.
>
> What hypervisor-side patches are still missing from stable-4.2 branch?

The final one that actually makes it work is
xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=9aa356bc9f7533c3cb7f02c823f532532876d444
Jan Beulich had already indicated that this would be picked up in the
4.2 release cycle, but it was too late to get it into 4.2.2

Then, also the ns16550 change.
While strictly not necessary to fix S3 in the normal path, it does fix
a bug that can lead to S3 not working if you
a. have one of these SuperIO controllers on the LPC bus.
b. have serial enabled.
xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=6e96c186d23873597896051b043cfeb119c4a7d5

On the linux side of things, The following are necessary:
One of the acpi-s3.vX branches. I use v9, but v10 is also available. I
don't think one has an advantage over the other.

acpi-s3.v10:
git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/commit/?h=devel/acpi-s3.v10&id=c268cd657314354f910b773a17a9de0299e1cc21
git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/commit/?h=devel/acpi-s3.v10&id=864848221b056aaf25416999c29cb0e14d3c3197
git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/commit/?h=devel/acpi-s3.v10&id=aa7eb7bbb3f2a39435a07c082f99893386ae83ec

stable/for-linus-3.10:
git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/commit/?h=stable/for-linus-3.10&id=3fac10145b766a2244422788f62dc35978613fd8

Ben

>>> On 25.04.13 at 14:00, Ben Guthro <ben@guthro.net> wrote:
> I don't have time to bisect this, currently - but just thought I'd let
> the list know that, while xen-4.2 works (with the recent S3 changes
> I've submitted) - 4.3 is broken again.
>
> I'm not sure if it is the hypervisor, or the kernel, since I upgraded
> both in my "unstable" build environment.
>
> Since this is something that XenClient really relies on working, it
> has been a pain point with every upgrade of Xen for us.

Perhaps one point here also is that you upgrade in too big steps?

More regular participation in development and patch review
would very likely also help keeping down the number of
regressions here.

Jan

> It is enormously time consuming to debug on every upgrade, and has a
> long tail in discovering problems (I started debugging S3 last Aug on
> xen-unstable, prior to 4.2 being cut)
>
> How can we work with the community to try to get some sort of
> regression testing for this feature that we rely on in our product?
>
> Ben

On Mon, Apr 29, 2013 at 4:45 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 25.04.13 at 14:00, Ben Guthro <ben@guthro.net> wrote:
>> I don't have time to bisect this, currently - but just thought I'd let
>> the list know that, while xen-4.2 works (with the recent S3 changes
>> I've submitted) - 4.3 is broken again.
>>
>> I'm not sure if it is the hypervisor, or the kernel, since I upgraded
>> both in my "unstable" build environment.
>>
>> Since this is something that XenClient really relies on working, it
>> has been a pain point with every upgrade of Xen for us.
>
> Perhaps one point here also is that you upgrade in too big steps?

Indeed. Unfortunately, the realities of shipping product, and of staffing
this effort, also need to be considered, and those are out of an
individual engineer's (my) control.

Since I am not on the open source platform team in Citrix, I am unable
to dedicate my time strictly to the open source development on this,
as much as I would like to.

Also - when a product is stable, and the newer versions are focused on
features we tend not to make use of, it is a tough sell to management
to come up with a reason to upgrade. We stayed on the 4.0.y release
train because it was stable.

>
> More regular participation in development and patch review
> would very likely also help keeping down the number of
> regressions here.

The breakage tends to be out of my expertise, until I have to debug it.
Consequently, I've been learning a lot about schedulers, lately.

That said, these breakages have happened in paths that went through
reviews, and were not caught. In development of XenClient, test
automation is where we tend to uncover S3 related bugs that are not
caught in the review process, because these problems are insidious in
breaking in unexpected ways.

I'll try to participate in reviews in the future, where I can, though,
since I do appreciate the value in doing so.

Ben

>
> Jan
>
>> It is enormously time consuming to debug on every upgrade, and has a
>> long tail in discovering problems (I started debugging S3 last Aug on
>> xen-unstable, prior to 4.2 being cut)
>>
>> How can we work with the community to try to get some sort of
>> regression testing for this feature that we rely on in our product?
>>
>> Ben
>
>

On Mon, Apr 29, 2013 at 9:45 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 25.04.13 at 14:00, Ben Guthro <ben@guthro.net> wrote:
>> I don't have time to bisect this, currently - but just thought I'd let
>> the list know that, while xen-4.2 works (with the recent S3 changes
>> I've submitted) - 4.3 is broken again.
>>
>> I'm not sure if it is the hypervisor, or the kernel, since I upgraded
>> both in my "unstable" build environment.
>>
>> Since this is something that XenClient really relies on working, it
>> has been a pain point with every upgrade of Xen for us.
>
> Perhaps one point here also is that you upgrade in too big steps?
>
> More regular participation in development and patch review
> would very likely also help keeping down the number of
> regressions here.

Perhaps, but given the incredible amounts of traffic on the list, how
is he supposed to know which patches might break suspend or not? And
even if he did, he would have to take the time to understand every
single hypervisor patch and predict how it would act on suspend, which
is just not reasonable. Remember we're on the other side of this
equation wrt Linux -- it's all too easy for someone to move something
apparently innocuous around and have it break dom0 pvops in a way
that's not noticed until 6 months later. That's why we do regular
testing of Linus' tree, as well as Ingo's x86 tree.

The right thing to do is to put at least a basic suspend test into the
testing push-gate, so that when someone submits a change that breaks
suspend, *they* are the ones that have to figure out what went wrong
and fix it.

 -George

On Fri, Apr 26, 2013 at 9:10 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Thu, 2013-04-25 at 18:02 +0100, Ben Guthro wrote:
>> On Thu, Apr 25, 2013 at 8:00 AM, Ben Guthro <ben@guthro.net> wrote:
>> > Since this is something that XenClient really relies on working, it
>> > has been a pain point with every upgrade of Xen for us.
>> > It is enormously time consuming to debug on every upgrade, and has a
>> > long tail in discovering problems (I started debugging S3 last Aug on
>> > xen-unstable, prior to 4.2 being cut)
>> >
>> > How can we work with the community to try to get some sort of
>> > regression testing for this feature that we rely on in our product?
>>
>> I am still interested in ideas for getting this into automated
>> testing, and any ideas people may have for this.
>
> CCing Ian Jackson who runs the test infrastructure.
>
> Contributing new tests is now less onerous than it once was (i.e. it
> might even be possible at all). There is some info at
> lists.xen.org/archives/html/xen-devel/2012-10/msg01517.html
> although the branch may be out of date -- Ian was working on merging the
> standalone branch at one point.
>
> Some questions:
>       * How automatable is s3?
>       * In particular can we automate the wakeup? s3 is save to RAM
>         IIRC, and most power control in the test system is done with PDU
>         power cycling.
>       * Would s3 ever be expected to work on the sorts of whitebox
>         server systems which form the osstest pool or do we need to
>         investigate additional hardware?
>       * How hardware specific are the s3 failures -- we obviously can't
>         have one of every laptop ever ;-)

When I discussed this with Ben before, it seemed that almost any
testing would be a very large improvement. Namely, it seemed to me
that even if the test just did a "null suspend" -- i.e., shut the
entire system down as though about to do a suspend but then just
resume without pulling the trigger -- would shake out a lot of bugs
(as well as make it very easy for devs that don't normally care about
suspending their machine to fix things). Having an RTC wake-up would
be the next thing to do after that.

 -George

>>> On 29.04.13 at 12:55, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Mon, Apr 29, 2013 at 9:45 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 25.04.13 at 14:00, Ben Guthro <ben@guthro.net> wrote:
>>> I don't have time to bisect this, currently - but just thought I'd let
>>> the list know that, while xen-4.2 works (with the recent S3 changes
>>> I've submitted) - 4.3 is broken again.
>>>
>>> I'm not sure if it is the hypervisor, or the kernel, since I upgraded
>>> both in my "unstable" build environment.
>>>
>>> Since this is something that XenClient really relies on working, it
>>> has been a pain point with every upgrade of Xen for us.
>>
>> Perhaps one point here also is that you upgrade in too big steps?
>>
>> More regular participation in development and patch review
>> would very likely also help keeping down the number of
>> regressions here.
>
> Perhaps, but given the incredible amounts of traffic on the list, how
> is he supposed to know which patches might break suspend or not? And
> even if he did, he would have to take the time to understand every
> single hypervisor patch and predict how it would act on suspend, which
> is just not reasonable. Remember we're on the other side of this
> equation wrt Linux -- it's all too easy for someone to move something
> apparently innocuous around and have it break dom0 pvops in a way
> that's not noticed until 6 months later. That's why we do regular
> testing of Linus' tree, as well as Ingo's x86 tree.
>
> The right thing to do is to put at least a basic suspend test into the
> testing push-gate, so that when someone submits a change that breaks
> suspend, *they* are the ones that have to figure out what went wrong
> and fix it.

I was in no way suggesting this to be a bad idea. What I was
trying to point out is that testing a certain feature only every
couple of major releases is very likely to not nearly help as
much as being involved regularly.

And no, I also didn't mean to suggest for _anyone_ to review
each and every individual patch. But looking at some key ones
before they go in would certainly help.

Jan

On 04/29/2013 03:00 PM, Ben Guthro wrote:
> On Mon, Apr 29, 2013 at 7:03 AM, George Dunlap
> <George.Dunlap@eu.citrix.com> wrote:
>> On Fri, Apr 26, 2013 at 9:10 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>>> On Thu, 2013-04-25 at 18:02 +0100, Ben Guthro wrote:
>>>> On Thu, Apr 25, 2013 at 8:00 AM, Ben Guthro <ben@guthro.net> wrote:
>>>>> Since this is something that XenClient really relies on working, it
>>>>> has been a pain point with every upgrade of Xen for us.
>>>>> It is enormously time consuming to debug on every upgrade, and has a
>>>>> long tail in discovering problems (I started debugging S3 last Aug on
>>>>> xen-unstable, prior to 4.2 being cut)
>>>>>
>>>>> How can we work with the community to try to get some sort of
>>>>> regression testing for this feature that we rely on in our product?
>>>>
>>>> I am still interested in ideas for getting this into automated
>>>> testing, and any ideas people may have for this.
>>>
>>> CCing Ian Jackson who runs the test infrastructure.
>>>
>>> Contributing new tests is now less onerous than it once was (i.e. it
>>> might even be possible at all). There is some info at
>>> lists.xen.org/archives/html/xen-devel/2012-10/msg01517.html
>>> although the branch may be out of date -- Ian was working on merging the
>>> standalone branch at one point.
>>>
>>> Some questions:
>>>       * How automatable is s3?
>>>       * In particular can we automate the wakeup? s3 is save to RAM
>>>         IIRC, and most power control in the test system is done with PDU
>>>         power cycling.
>>>       * Would s3 ever be expected to work on the sorts of whitebox
>>>         server systems which form the osstest pool or do we need to
>>>         investigate additional hardware?
>>>       * How hardware specific are the s3 failures -- we obviously can't
>>>         have one of every laptop ever ;-)
>>
>> When I discussed this with Ben before, it seemed that almost any
>> testing would be a very large improvement. Namely, it seemed to me
>> that even if the test just did a "null suspend" -- i.e., shut the
>> entire system down as though about to do a suspend but then just
>> resume without pulling the trigger -- would shake out a lot of bugs
>> (as well as make it very easy for devs that don't normally care about
>> suspending their machine to fix things). Having an RTC wake-up would
>> be the next thing to do after that.
>>
>> -George
>
> FWIW, the patch that implements this "fake s3" functionality is
> implemented here, should it be considered for inclusion:
> markmail.org/message/ghj2ffngemccq6p4

[Adding xen-devel back in to the cc]

FYI playing around with this is on my to-do list, but it will probably
have to wait until after the 4.3 release.

 -George

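As a rough illustration of the "null suspend" idea, the Linux kernel has a comparable debugging facility of its own: when built with CONFIG_PM_DEBUG, writing a test level to `/sys/power/pm_test` makes the next suspend request stop at that level, wait a few seconds, and resume without entering the sleep state. The sketch below drives that facility. It is not the Xen "fake s3" patch linked above, and as far as I understand it never reaches the hypervisor's own suspend code, so treat it as a dom0-kernel-only stand-in. The function name and the `sysfs` parameter are hypothetical.

```python
from pathlib import Path

# Test levels the kernel accepts in /sys/power/pm_test (besides "none").
PM_TEST_LEVELS = ("freezer", "devices", "platform", "processors", "core")


def null_suspend(level="core", sysfs=Path("/sys")):
    """Run the kernel suspend path up to `level`, then resume without sleeping."""
    if level not in PM_TEST_LEVELS:
        raise ValueError(f"unknown pm_test level: {level!r}")
    pm_test = sysfs / "power/pm_test"  # only present with CONFIG_PM_DEBUG
    state = sysfs / "power/state"
    pm_test.write_text(level + "\n")
    try:
        state.write_text("mem\n")      # returns once the test cycle has finished
    finally:
        pm_test.write_text("none\n")   # restore normal suspend behaviour
```

Because the machine never actually sleeps, a run like this needs no RTC alarm and is much less likely to leave the host needing a power cycle, which is what makes it attractive as a first test.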
Ben Guthro writes ("Re: [Xen-devel] S3 is broken again in xen-unstable"):
...
> That said, when things go wrong, the machine does need to be power
> cycled...so if you are not physically located near the machine under
> test, you would need a PDU as a recovery mechanism, I suppose.

Ah this makes matters a bit more complicated. The code which
implements the test schedule would need to know to power cycle the
host after a failure. Could we be confident that after a failed test
of this kind we wouldn't see filesystem corruption?

Also, looking at your test script, you seem to be testing using dom0
only. We're ignoring guests then. Perhaps this should be a separate
test column. (That might be a way to fudge the recovery question
too.)

> >       * How hardware specific are the s3 failures -- we obviously can't
> >         have one of every laptop ever ;-)
>
> Clearly. I'm just looking to get a foot in the door here, so there is
> a chance of catching gross regressions.
> The hardware differences seem to be more timing related, due to
> speed... ie, you are likely to uncover new failures when new, faster
> hardware comes out for laptops.
> Since typically server hardware is faster than laptop hardware, that
> would theoretically catch problems at a higher frequency.

If the hardware/BIOS is likely to be buggy, that's a bit of a pain.
We'd have to at least figure out which machines worked and flag them
so that the test was only run on those.

> > Once we have a test case in the standard flights then we can consider
> > the options around new flights testing other trees.
>
> I'm not sure I understand this point.
> Are you saying you want to see a test that fails in the standard test
> flight first...because without Konrad's patches, it will be guaranteed
> not to work.

As Ian says, there is no problem with deploying the test first and
fixing the actual code later...

Ian.

On Wed, May 1, 2013 at 7:01 AM, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> Ben Guthro writes ("Re: [Xen-devel] S3 is broken again in xen-unstable"):
> ...
>> That said, when things go wrong, the machine does need to be power
>> cycled...so if you are not physically located near the machine under
>> test, you would need a PDU as a recovery mechanism, I suppose.
>
> Ah this makes matters a bit more complicated. The code which
> implements the test schedule would need to know to power cycle the
> host after a failure. Could we be confident that after a failed test
> of this kind we wouldn't see filesystem corruption?

If you are using a journaled filesystem, I think the confidence level
is raised...but there are no guarantees, when you just yank a power
cord.

>
> Also, looking at your test script, you seem to be testing using dom0
> only. We're ignoring guests then. Perhaps this should be a separate
> test column. (That might be a way to fudge the recovery question
> too.)

I'm going for baby-steps here.
The vast majority of the S3 failures we have encountered have been
dom0 related, so I thought that would be a decent starting place.

>
>> >       * How hardware specific are the s3 failures -- we obviously can't
>> >         have one of every laptop ever ;-)
>>
>> Clearly. I'm just looking to get a foot in the door here, so there is
>> a chance of catching gross regressions.
>> The hardware differences seem to be more timing related, due to
>> speed... ie, you are likely to uncover new failures when new, faster
>> hardware comes out for laptops.
>> Since typically server hardware is faster than laptop hardware, that
>> would theoretically catch problems at a higher frequency.
>
> If the hardware/BIOS is likely to be buggy, that's a bit of a pain.
> We'd have to at least figure out which machines worked and flag them
> so that the test was only run on those.

I think testing a known good configuration for regression seems
appropriate, yes.
They all *should* work...but I'm just being conservative here.

>
>> > Once we have a test case in the standard flights then we can consider
>> > the options around new flights testing other trees.
>>
>> I'm not sure I understand this point.
>> Are you saying you want to see a test that fails in the standard test
>> flight first...because without Konrad's patches, it will be guaranteed
>> not to work.
>
> As Ian says, there is no problem with deploying the test first and
> fixing the actual code later...
>
> Ian.

On Fri, Apr 26, 2013 at 07:41:07PM -0400, Ben Guthro wrote:
> >
> > Ok, so master (xen-unstable) works OK regarding ACPI S3. Good.
> >
> > What hypervisor-side patches are still missing from stable-4.2 branch?
>
> The final one that actually makes it work is
> xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=9aa356bc9f7533c3cb7f02c823f532532876d444
> Jan Beulich had already indicated that this would be picked up in the
> 4.2 release cycle, but it was too late to get it into 4.2.2
>

Yep, I can see this already in the 4.2 branch for 4.2.3.

>
> Then, also the ns16550 change.
> While strictly not necessary to fix S3 in the normal path, it does fix
> a bug that can lead to S3 not working if you
> a. have one of these SuperIO controllers on the LPC bus.
> b. have serial enabled.
> xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=6e96c186d23873597896051b043cfeb119c4a7d5
>

Jan: I think this ns16550 patch should be backported to the 4.2 branch as well.

Thanks,

-- Pasi

>
> On the linux side of things, The following are necessary:
> One of the acpi-s3.vX branches. I use v9, but v10 is also available. I
> don't think one has an advantage over the other.
>
> acpi-s3.v10:
> git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/commit/?h=devel/acpi-s3.v10&id=c268cd657314354f910b773a17a9de0299e1cc21
> git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/commit/?h=devel/acpi-s3.v10&id=864848221b056aaf25416999c29cb0e14d3c3197
> git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/commit/?h=devel/acpi-s3.v10&id=aa7eb7bbb3f2a39435a07c082f99893386ae83ec
>
> stable/for-linus-3.10:
> git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/commit/?h=stable/for-linus-3.10&id=3fac10145b766a2244422788f62dc35978613fd8
>
>
> Ben
>>> On 07.05.13 at 10:34, Pasi Kärkkäinen <pasik@iki.fi> wrote:
> On Fri, Apr 26, 2013 at 07:41:07PM -0400, Ben Guthro wrote:
>> Then, also the ns16550 change.
>> While strictly not necessary to fix S3 in the normal path, it does fix
>> a bug that can lead to S3 not working if you
>> a. have one of these SuperIO controllers on the LPC bus.
>> b. have serial enabled.
>> xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=6e96c186d23873597896051b043cfeb119c4a7d5
>
> Jan: I think this ns16550 patch should be backported to 4.2 branch as well.

Yeah, as being secondary I left this off until we know that this
really is the only thing known to break resume (i.e. I saw no point
in backporting this when in the end S3 still wouldn't work anyway).

Ben - am I right in understanding your earlier summary in this
thread to mean that the 4.2 branch, according to your testing,
is now in such a state?

Jan
On May 7, 2013, at 4:40 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 07.05.13 at 10:34, Pasi Kärkkäinen <pasik@iki.fi> wrote:
>> On Fri, Apr 26, 2013 at 07:41:07PM -0400, Ben Guthro wrote:
>>> Then, also the ns16550 change.
>>> While strictly not necessary to fix S3 in the normal path, it does fix
>>> a bug that can lead to S3 not working if you
>>> a. have one of these SuperIO controllers on the LPC bus.
>>> b. have serial enabled.
>>> xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=6e96c186d23873597896051b043cfeb119c4a7d5
>>
>> Jan: I think this ns16550 patch should be backported to 4.2 branch as well.
>
> Yeah, as being secondary I left this off until we know that this
> really is the only thing known to break resume (i.e. I saw no point
> in backporting this when in the end S3 still wouldn't work anyway).
>
> Ben - am I right in understanding your earlier summary in this
> thread to mean that the 4.2 branch, according to your testing,
> is now in such a state?
>

Yes, S3 works on the 4.2.3 branch without this patch.
This fixes a specific corner case on some machines with the SuperIO hardware.
On Fri, Apr 26, 2013 at 9:17 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Fri, 2013-04-26 at 13:19 +0100, Ben Guthro wrote:
>> On Fri, Apr 26, 2013 at 4:10 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>
>> > Some questions:
>> > * How automatable is s3?
>> > * In particular can we automate the wakeup? s3 is save to RAM
>> > IIRC, and most power control in the test system is done with PDU
>> > power cycling.
>>
>> I spoke with George Dunlap a bit about this while I was over in the
>> UK a few weeks ago, and drew up an example shell script for this:
>> xen.markmail.org/thread/ghj2ffngemccq6p4
>> Marek also weighed in, and included some of his own tests, and experiences.
>>
>> In my experience, this mechanism is about as reliable as your RTC. On
>> some systems you might tell it to sleep for 30s, and it will wake in
>> 10s.
>>
>> That said, when things go wrong, the machine does need to be power
>> cycled...so if you are not physically located near the machine under
>> test, you would need a PDU as a recovery mechanism, I suppose.
>
> That's OK, all the systems in the test harness would have to have PDU
> for the other test cases (initial install etc) anyway.
>
>> >> Would it be helpful to maintain a branch in my xenbits repo that could
>> >> be a rebased version of konrad's acpi-s3 patches against Linus' latest
>> >> kernel?
>> >
>> > What is keeping those out of Linus' tree?
>>
>> Added Konrad here, but I believe he is on vacation this week.
>> This has been a bullet point on his OSS presentation, as outstanding
>> pvops work for at least 3 years now.
>>
>> IIRC, the x86 guys NACK'ed the change as being too invasive.
>> I googled around a bit, but can't seem to find the thread about it.
>
> I wonder if it might be something like that :-/

FWIW, I believe we are over another hurdle here.
I have a commitment from the acpi maintainer (Rafael Wysocki) that the
following patches will be included in the linux-3.11 merge window:

lkml.org/lkml/2013/5/14/465

When this is accepted, this should give "out of the box" S3
functionality with Xen.

Once Xen-4.3 is released, I would like to revisit trying to see if
there would be some way to get something into the automated test
system.

Ben

>> > Once we have a test case in the standard flights then we can consider
>> > the options around new flights testing other trees.
>>
>> I'm not sure I understand this point.
>> Are you saying you want to see a test that fails in the standard test
>> flight first...because without Konrad's patches, it will be guaranteed
>> not to work.
>
> Right. AIUI the flights (and I may be using the wrong term here) are
> somewhat uniform and few in number and get run with various
> combinations of inputs (Xen tree, Linux tree, Qemu tree), so there is
> effectively one "test Linux PV kernel flight" and one "test Xen PV
> guests flight" etc, so we want to get S3 into those flights, with the
> existing set of "* tree" inputs.
>
> IOW we should add a new row to the grid
> chiark.greenend.org.uk/~xensrcts/logs/17816 for s3 testing
> and then we can consider adding a new column with a different set of
> trees as input.
>
> Ian J may have a different opinion on how to approach, but he's away
> until mid next week.
>
>> ...and without other changesets queued up for the 3.10 merge window,
>> non-boot CPUs will always have incorrect C-states.
>
> It's OK to add the tests before things work.
>
> Ian.
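For the RTC-driven wakeup discussed in this message, a dom0-side check might look something like the sketch below. It assumes the wakeup is armed with util-linux rtcwake (the linked script is not reproduced here), and the helper names rtcwake_argv, classify_wake and suspend_once are made up for illustration. An early wake is reported but tolerated, since the RTC may fire well before the requested time.

```python
import subprocess
import time


def rtcwake_argv(seconds: int) -> list:
    """Build an rtcwake command line for suspend-to-RAM ("mem" is S3)."""
    return ["rtcwake", "-m", "mem", "-s", str(seconds)]


def classify_wake(requested: float, elapsed: float, slack: float = 60.0) -> str:
    """Judge a completed cycle by how long the host was away.

    The slack value is an arbitrary assumption, not a measured figure.
    """
    if elapsed > requested + slack:
        return "late"
    if elapsed < requested:
        return "early"  # tolerated: the RTC alarm is not always accurate
    return "on-time"


def suspend_once(seconds: int = 30) -> str:
    """Suspend via rtcwake and classify the wake. Needs root on the host."""
    # Wall-clock time is used because a monotonic clock may not count
    # the time spent suspended.
    start = time.time()
    subprocess.run(rtcwake_argv(seconds), check=True)
    return classify_wake(seconds, time.time() - start)
```

If the resume hangs, suspend_once never returns, so a check like this only covers the success path; detecting a hang still needs an external observer with access to the PDU.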
On Tue, May 21, 2013 at 02:29:23PM -0400, Ben Guthro wrote:
> >> >
> >> > What is keeping those out of Linus' tree?
> >>
> >> Added Konrad here, but I believe he is on vacation this week.
> >> This has been a bullet point on his OSS presentation, as outstanding
> >> pvops work for at least 3 years now.
> >>
> >> IIRC, the x86 guys NACK'ed the change as being too invasive.
> >> I googled around a bit, but can't seem to find the thread about it.
> >
> > I wonder if it might be something like that :-/
>
> FWIW, I believe we are over another hurdle here.
> I have commitment from the acpi maintainer (Rafael Wysocki) that the
> following patches will be included in the linux-3.11 merge window:
>
> lkml.org/lkml/2013/5/14/465
>
> When this is accepted, this should give "out of the box" S3
> functionality with Xen
>

This is great, thanks a lot!

> Once Xen-4.3 is released, I would like to revisit trying to see if
> there would be some way to get something into the automated test
> system.
>

-- Pasi