thr3ads.net - Xen devel - S3 is broken again in xen-unstable [Apr 2013]

Ben Guthro

2013-Apr-25 12:00 UTC

S3 is broken again in xen-unstable

I don't have time to bisect this, currently - but just thought I'd let
the list know that, while xen-4.2 works (with the recent S3 changes
I've submitted) - 4.3 is broken again.

I'm not sure if it is the hypervisor, or the kernel, since I upgraded
both in my "unstable" build environment.

Since this is something that XenClient really relies on working, it
has been a pain point with every upgrade of Xen for us.
It is enormously time consuming to debug on every upgrade, and has a
long tail in discovering problems (I started debugging S3 last Aug on
xen-unstable, prior to 4.2 being cut)

How can we work with the community to try to get some sort of
regression testing for this feature that we rely on in our product?

Ben

Ben Guthro

2013-Apr-25 17:02 UTC

Re: S3 is broken again in xen-unstable

On Thu, Apr 25, 2013 at 8:00 AM, Ben Guthro <ben@guthro.net> wrote:
> I don't have time to bisect this, currently - but just thought I'd let
> the list know that, while xen-4.2 works (with the recent S3 changes
> I've submitted) - 4.3 is broken again.
>
> I'm not sure if it is the hypervisor, or the kernel, since I upgraded
> both in my "unstable" build environment.
>
This appears to have been a transient issue. My xen tree was a few days old;
updating to the tip seems to have resolved this particular issue.
> Since this is something that XenClient really relies on working, it
> has been a pain point with every upgrade of Xen for us.
> It is enormously time consuming to debug on every upgrade, and has a
> long tail in discovering problems (I started debugging S3 last Aug on
> xen-unstable, prior to 4.2 being cut)
>
> How can we work with the community to try to get some sort of
> regression testing for this feature that we rely on in our product?
I am still interested in any ideas people may have for getting this
into automated testing.
Would it be helpful to maintain a branch in my xenbits repo that could
be a rebased version of Konrad's acpi-s3 patches against Linus' latest
kernel?

Ian Campbell

2013-Apr-26 08:10 UTC

Re: S3 is broken again in xen-unstable

On Thu, 2013-04-25 at 18:02 +0100, Ben Guthro wrote:
> On Thu, Apr 25, 2013 at 8:00 AM, Ben Guthro <ben@guthro.net> wrote:
> > Since this is something that XenClient really relies on working, it
> > has been a pain point with every upgrade of Xen for us.
> > It is enormously time consuming to debug on every upgrade, and has a
> > long tail in discovering problems (I started debugging S3 last Aug on
> > xen-unstable, prior to 4.2 being cut)
> >
> > How can we work with the community to try to get some sort of
> > regression testing for this feature that we rely on in our product?
> 
> I am still interested in ideas for getting this into automated
> testing, and any ideas people may have for this.
CCing Ian Jackson who runs the test infrastructure.

Contributing new tests is now less onerous than it once was (i.e. it
might even be possible at all). There is some info at
lists.xen.org/archives/html/xen-devel/2012-10/msg01517.html
although the branch may be out of date -- Ian was working on merging the
standalone branch at one point.

Some questions:
      * How automatable is s3?
      * In particular can we automate the wakeup? s3 is save to RAM
        IIRC, and most power control in the test system is done with PDU
        power cycling.
      * Would s3 ever be expected to work on the sorts of whitebox
        server systems which form the osstest pool or do we need to
        investigate additional hardware?
      * How hardware specific are the s3 failures -- we obviously can't
        have one of every laptop ever ;-)

So assuming the answers to the above are positive then contributing a
test case for s3 to the relevant flights seems like a reasonable first
step, even if the expectation is that it would always fail with the
current mainline Xen + mainline Linux. The test system only tracks
regressions, so always failing test cases are OK (you can think of this
in the test-driven development kind of way ;-)).
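
The regression-only behaviour described here can be illustrated with a
small sketch. This is not osstest code: the function name
find_regressions and the "pass"/"fail" result strings are assumptions
made purely for illustration. It compares a baseline run with a new run
and reports only tests that went from pass to fail, so a test that has
always failed (such as an S3 test before the fixes land) would not
count against a change.

```python
# Illustrative sketch only: NOT the osstest implementation.
# The function name and the "pass"/"fail" values are assumptions.

def find_regressions(baseline, current):
    """Return the sorted names of tests that passed before but fail now.

    Tests that fail in both runs, or that are new and failing,
    are not counted as regressions.
    """
    regressions = []
    for test, result in current.items():
        if result == "fail" and baseline.get(test) == "pass":
            regressions.append(test)
    return sorted(regressions)


baseline = {"xen-boot": "pass", "host-suspend": "fail"}
current = {"xen-boot": "pass", "host-suspend": "fail"}

# host-suspend fails in both runs, so nothing is flagged.
print(find_regressions(baseline, current))
```

With this rule, the first change that makes the suspend test pass sets
a new baseline, and only later breakage is reported.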
> Would it be helpful to maintain a branch in my xenbits repo that could
> be a rebased version of konrad's acpi-s3 patches against Linus' latest
> kernel?
What is keeping those out of Linus' tree?

Once we have a test case in the standard flights then we can consider
the options around new flights testing other trees.

Ian.

Ben Guthro

2013-Apr-26 12:19 UTC

Re: S3 is broken again in xen-unstable

On Fri, Apr 26, 2013 at 4:10 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Thu, 2013-04-25 at 18:02 +0100, Ben Guthro wrote:
>> On Thu, Apr 25, 2013 at 8:00 AM, Ben Guthro <ben@guthro.net> wrote:
>> > Since this is something that XenClient really relies on working, it
>> > has been a pain point with every upgrade of Xen for us.
>> > It is enormously time consuming to debug on every upgrade, and has a
>> > long tail in discovering problems (I started debugging S3 last Aug on
>> > xen-unstable, prior to 4.2 being cut)
>> >
>> > How can we work with the community to try to get some sort of
>> > regression testing for this feature that we rely on in our product?
>>
>> I am still interested in ideas for getting this into automated
>> testing, and any ideas people may have for this.
>
> CCing Ian Jackson who runs the test infrastructure.
I've also CC'ed a few people here, who I mention in my reply below.
>
> Contributing new tests is now less onerous than it once was (i.e. it
> might even possible at all). There is some info at
> lists.xen.org/archives/html/xen-devel/2012-10/msg01517.html
> although the branch may be out of date -- Ian was working on merging the
> standalone branch at one point.
I'll read up on this.
>
> Some questions:
>       * How automatable is s3?
>       * In particular can we automate the wakeup? s3 is save to RAM
>         IIRC, and most power control in the test system is done with PDU
>         power cycling.
I spoke with George Dunlap a bit about this while I was over in the
UK a few weeks ago, and drew up an example shell script for this:
xen.markmail.org/thread/ghj2ffngemccq6p4
Marek also weighed in, and included some of his own tests and experiences.

In my experience, this mechanism is about as reliable as your RTC. On
some systems you might tell it to sleep for 30s, and it will wake in
10s.

That said, when things go wrong, the machine does need to be power
cycled...so if you are not physically located near the machine under
test, you would need a PDU as a recovery mechanism, I suppose.
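
A minimal sketch of this kind of RTC-based wake check follows. It is
not the shell script linked above, whose contents are not reproduced
here. It assumes the standard Linux sysfs files
/sys/class/rtc/rtc0/wakealarm and /sys/power/state are usable from dom0;
the function names and the 5 second tolerance are hypothetical choices.

```python
# Illustrative sketch only: not the script referenced in the thread.
# Function names and the default tolerance are assumptions.
import time

WAKEALARM = "/sys/class/rtc/rtc0/wakealarm"   # Linux RTC sysfs alarm file
POWER_STATE = "/sys/power/state"              # Linux suspend control file


def classify_wake(requested_s, elapsed_s, tolerance_s=5.0):
    """Classify a resume by comparing elapsed time to the requested sleep."""
    if elapsed_s < requested_s - tolerance_s:
        return "early"
    if elapsed_s > requested_s + tolerance_s:
        return "late"
    return "ok"


def suspend_with_rtc_wake(seconds, wakealarm=WAKEALARM, power_state=POWER_STATE):
    """Arm the RTC alarm, suspend to RAM, and return wall-clock seconds elapsed.

    Needs root on real hardware. If the resume fails, this never returns
    and the host has to be power cycled.
    """
    with open(wakealarm, "w") as f:
        f.write("0\n")             # clear any alarm that is already set
    with open(wakealarm, "w") as f:
        f.write(f"+{seconds}\n")   # relative alarm, N seconds from now
    start = time.time()            # wall clock, so time spent asleep is counted
    with open(power_state, "w") as f:
        f.write("mem\n")           # the write returns once the machine resumes
    return time.time() - start


# The case mentioned above: asked for 30s, woke after 10s.
print(classify_wake(30, 10))  # early
```

On a real host the call would be something like
classify_wake(30, suspend_with_rtc_wake(30)); a run that never returns
is the failure case that needs the PDU.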
>       * Would s3 ever be expected to work on the sorts of whitebox
>         server systems which form the osstest pool or do we need to
>         investigate additional hardware?
I don't see why it wouldn't work, though admittedly I haven't dealt
with Xen on servers since 2009.
>       * How hardware specific are the s3 failures -- we obviously can't
>         have one of every laptop ever ;-)
Clearly. I'm just looking to get a foot in the door here, so there is
a chance of catching gross regressions.
The hardware differences seem to be more timing related, due to
speed... i.e., you are likely to uncover new failures when new, faster
hardware comes out for laptops.
Since typically server hardware is faster than laptop hardware, that
would theoretically catch problems at a higher frequency.
>
> So assuming the answers to the above are positive then contributing a
> test case for s3 to the relevant flights seems like a reasonable first
> step, even if the expectation is that it would always fail with the
> current mainline Xen + mainline Linux. The test system only tracks
> regressions, so always failing test cases are OK (you can think of this
> in the test-drive development kind of way ;-)).
I'll take a look at the test infrastructure, and see if I can make
heads or tails of it, and come up with a simplistic test.
>
>> Would it be helpful to maintain a branch in my xenbits repo that could
>> be a rebased version of konrad's acpi-s3 patches against Linus' latest
>> kernel?
>
> What is keeping those out of Linus' tree?
Added Konrad here, but I believe he is on vacation this week.
This has been a bullet point on his OSS presentation, as outstanding
pvops work for at least 3 years now.

IIRC, the x86 guys NACK'ed the change as being too invasive.
I googled around a bit, but can't seem to find the thread about it.
>
> Once we have a test case in the standard flights then we can consider
> the options around new flights testing other trees.
I'm not sure I understand this point.
Are you saying you want to see a test that fails in the standard test
flight first? Without Konrad's patches, it will be guaranteed
not to work.

...and without other changesets queued up for the 3.10 merge window,
non-boot CPUs will always have incorrect C-states.

Thanks
Ben

Ian Campbell

2013-Apr-26 13:17 UTC

Re: S3 is broken again in xen-unstable

On Fri, 2013-04-26 at 13:19 +0100, Ben Guthro wrote:
> On Fri, Apr 26, 2013 at 4:10 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> > Some questions:
> >       * How automatable is s3?
> >       * In particular can we automate the wakeup? s3 is save to RAM
> >         IIRC, and most power control in the test system is done with PDU
> >         power cycling.
> 
> I spoke with George Dunlap  a bit about this while I was over in the
> UK a few weeks ago, and drew up an example shell script for this:
> xen.markmail.org/thread/ghj2ffngemccq6p4
> Marek also weighed in, and included some of his own tests, and experiences.
> 
> In my experience, this mechanism is about as reliable as your RTC. On
> some systems you might tell it to sleep for 30s, and it will wake in
> 10s.
> 
> That said, when things go wrong, the machine does need to be power
> cycled...so if you are not physically located near the machine under
> test, you would need a PDU as a recovery mechanism, I suppose.
That's OK, all the systems in the test harness would have to have a PDU
for the other test cases (initial install etc) anyway.
> >> Would it be helpful to maintain a branch in my xenbits repo that could
> >> be a rebased version of konrad's acpi-s3 patches against Linus' latest
> >> kernel?
> >
> > What is keeping those out of Linus' tree?
> 
> Added Konrad here, but I believe he is on vacation this week.
> This has been a bullet point on his OSS presentation, as outstanding
> pvops work for at least 3 years now.
> 
> IIRC, the x86 guys NACK'ed the change as being too invasive.
> I googled around a bit, but can't seem to find the thread about it.
I wonder if it might be something like that :-/
> > Once we have a test case in the standard flights then we can consider
> > the options around new flights testing other trees.
> 
> I'm not sure I understand this point.
> Are you saying you want to see a test that fails in the standard test
> flight first...because without Konrad's patches, it will be guaranteed
> not to work.
Right. AIUI the flights (and I may be using the wrong term here) are
somewhat uniform and few in number and get run with various
combinations of inputs (Xen tree, Linux tree, Qemu tree), so there is
effectively one "test Linux PV kernel flight" and one "test Xen PV
guests flight" etc, so we want to get S3 into those flights, with the
existing set of "* tree" inputs.

IOW we should add a new row to the grid
chiark.greenend.org.uk/~xensrcts/logs/17816 for s3 testing
and then we can consider adding a new column with a different set of
trees as input.
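
The row/column idea can be sketched as follows. This is only an
illustration of the grid described here, not how osstest stores or
schedules anything; the job names and tree-set labels are hypothetical.

```python
# Illustrative sketch only; job names and tree-set labels are hypothetical
# and this is not how osstest represents its results grid.
from itertools import product

jobs = ["test-linux-pv-kernel", "test-xen-pv-guests"]   # existing rows
tree_sets = ["xen-unstable+linux-mainline"]              # existing column


def build_grid(jobs, tree_sets):
    """Return every (job, tree_set) cell that would need to be run."""
    return list(product(jobs, tree_sets))


# Step 1: add a new row for S3 testing against the existing tree inputs.
jobs.append("test-host-suspend")
# Step 2 (later): add a new column with a different set of trees.
tree_sets.append("xen-unstable+linux-acpi-s3")

for job, trees in build_grid(jobs, tree_sets):
    print(f"{job:22} {trees}")
```

Adding the row first means the S3 job runs (and is expected to fail)
against the trees already being tested; the extra column only matters
once there is a tree in which it could pass.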

Ian J may have a different opinion on how to approach, but he's away
until mid next week.
> ...and without other changesets queued up for the 3.10 merge window,
> non-boot CPUs will always have incorrect C-states.
It's OK to add the tests before things work.

Ian.

Ben Guthro

2013-Apr-26 14:10 UTC

Re: S3 is broken again in xen-unstable

On Fri, Apr 26, 2013 at 9:17 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
<snip>
>> I'm not sure I understand this point.
>> Are you saying you want to see a test that fails in the standard test
>> flight first...because without Konrad's patches, it will be guaranteed
>> not to work.
>
> Right. AIUI the flights (and I may be using the wrong term here) are
> somewhat uniform and and few in number and get run with various
> combinations inputs (Xen tree, Linux tree, Qemu tree), so there is
> effectively one "test Linux PV kernel flight" and one "test Xen PV
> guests flight" etc, so we want to get S3 into those flights, with the
> existing set of "* tree" inputs.
>
> IOW we should add a new row to the grid
> chiark.greenend.org.uk/~xensrcts/logs/17816 for s3 testing
> and then we can consider adding a new column with a different set of
> tree's as input.
>
> Ian J may have a different opinion on how to approach, but he's away
> until mid next week.
>
>> ...and without other changesets queued up for the 3.10 merge window,
>> non-boot CPUs will always have incorrect C-states.
>
> It's OK to add the tests before things work.
I've attached a patch to osstest that would be the beginnings of this
sort of test, if I'm reading the code correctly.
However, I don't really have a setup on which I can smoke test this.

I also left a number of TODOs, since I really just want an opinion to
see if I'm on the right track.

Thanks
Ben


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
lists.xen.org/xen-devel

Ian Campbell

2013-Apr-26 14:32 UTC

Re: S3 is broken again in xen-unstable

On Fri, 2013-04-26 at 15:10 +0100, Ben Guthro wrote:
> On Fri, Apr 26, 2013 at 9:17 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> <snip>
> >> I'm not sure I understand this point.
> >> Are you saying you want to see a test that fails in the standard test
> >> flight first...because without Konrad's patches, it will be guaranteed
> >> not to work.
> >
> > Right. AIUI the flights (and I may be using the wrong term here) are
> > somewhat uniform and and few in number and get run with various
> > combinations inputs (Xen tree, Linux tree, Qemu tree), so there is
> > effectively one "test Linux PV kernel flight" and one "test Xen PV
> > guests flight" etc, so we want to get S3 into those flights, with the
> > existing set of "* tree" inputs.
> >
> > IOW we should add a new row to the grid
> > chiark.greenend.org.uk/~xensrcts/logs/17816 for s3 testing
> > and then we can consider adding a new column with a different set of
> > tree's as input.
> >
> > Ian J may have a different opinion on how to approach, but he's away
> > until mid next week.
> >
> >> ...and without other changesets queued up for the 3.10 merge window,
> >> non-boot CPUs will always have incorrect C-states.
> >
> > It's OK to add the tests before things work.
> 
> I've attached a patch to osstest that would be the beginnings of this
> sort of test, if I'm reading the code correctly.
> However, I don't really have a setup that I can smoke test this.
>
> I also left a number of TODOs, since I really just want an opinion to
> see if I'm on the right track.
Right track, as far as it goes, but I think most of what you have put in
TestSupport.pm should actually be in the ts-host-suspend test case
itself.

You probably need IanJ's input for anything more concrete.

Ian.

Pasi Kärkkäinen

2013-Apr-26 20:47 UTC

Re: S3 is broken again in xen-unstable

On Thu, Apr 25, 2013 at 01:02:35PM -0400, Ben Guthro wrote:
> On Thu, Apr 25, 2013 at 8:00 AM, Ben Guthro <ben@guthro.net> wrote:
> > I don't have time to bisect this, currently - but just thought I'd let
> > the list know that, while xen-4.2 works (with the recent S3 changes
> > I've submitted) - 4.3 is broken again.
> >
> > I'm not sure if it is the hypervisor, or the kernel, since I upgraded
> > both in my "unstable" build environment.
> >
> 
> This appears to have been a transient issue. My xen tree was a few days old
> updating to the tip seems to have resolved this particular issue.
>
Ok, so master (xen-unstable) works OK regarding ACPI S3. Good.

What hypervisor-side patches are still missing from the stable-4.2 branch?


-- Pasi

Ben Guthro

2013-Apr-26 23:41 UTC

Re: S3 is broken again in xen-unstable

On Fri, Apr 26, 2013 at 4:47 PM, Pasi Kärkkäinen <pasik@iki.fi> wrote:
> On Thu, Apr 25, 2013 at 01:02:35PM -0400, Ben Guthro wrote:
>> On Thu, Apr 25, 2013 at 8:00 AM, Ben Guthro <ben@guthro.net> wrote:
>> > I don't have time to bisect this, currently - but just thought I'd let
>> > the list know that, while xen-4.2 works (with the recent S3 changes
>> > I've submitted) - 4.3 is broken again.
>> >
>> > I'm not sure if it is the hypervisor, or the kernel, since I upgraded
>> > both in my "unstable" build environment.
>> >
>>
>> This appears to have been a transient issue. My xen tree was a few days old
>> updating to the tip seems to have resolved this particular issue.
>>
>
> Ok, so master (xen-unstable) works OK regarding ACPI S3. Good.
>
> What hypervisor-side patches are still missing from stable-4.2 branch?
The final one that actually makes it work is
xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=9aa356bc9f7533c3cb7f02c823f532532876d444
Jan Beulich had already indicated that this would be picked up in the
4.2 release cycle, but it was too late to get it into 4.2.2.


Then, also the ns16550 change.
While not strictly necessary to fix S3 in the normal path, it does fix
a bug that can lead to S3 not working if you
   a. have one of these SuperIO controllers on the LPC bus.
   b. have serial enabled.
xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=6e96c186d23873597896051b043cfeb119c4a7d5


On the Linux side of things, the following are necessary:
One of the acpi-s3.vX branches. I use v9, but v10 is also available. I
don't think one has an advantage over the other.

acpi-s3.v10:
git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/commit/?h=devel/acpi-s3.v10&id=c268cd657314354f910b773a17a9de0299e1cc21
git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/commit/?h=devel/acpi-s3.v10&id=864848221b056aaf25416999c29cb0e14d3c3197
git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/commit/?h=devel/acpi-s3.v10&id=aa7eb7bbb3f2a39435a07c082f99893386ae83ec

stable/for-linus-3.10:
git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/commit/?h=stable/for-linus-3.10&id=3fac10145b766a2244422788f62dc35978613fd8


Ben

Jan Beulich

2013-Apr-29 08:45 UTC

Re: S3 is broken again in xen-unstable

>>> On 25.04.13 at 14:00, Ben Guthro <ben@guthro.net> wrote:
> I don't have time to bisect this, currently - but just thought I'd let
> the list know that, while xen-4.2 works (with the recent S3 changes
> I've submitted) - 4.3 is broken again.
> 
> I'm not sure if it is the hypervisor, or the kernel, since I upgraded
> both in my "unstable" build environment.
> 
> Since this is something that XenClient really relies on working, it
> has been a pain point with every upgrade of Xen for us.
Perhaps one point here also is that you upgrade in too big steps?

More regular participation in development and patch review
would very likely also help keep down the number of
regressions here.

Jan
> It is enormously time consuming to debug on every upgrade, and has a
> long tail in discovering problems (I started debugging S3 last Aug on
> xen-unstable, prior to 4.2 being cut)
> 
> How can we work with the community to try to get some sort of
> regression testing for this feature that we rely on in our product?
> 
> Ben

Ben Guthro

2013-Apr-29 10:24 UTC

Re: S3 is broken again in xen-unstable

On Mon, Apr 29, 2013 at 4:45 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 25.04.13 at 14:00, Ben Guthro <ben@guthro.net> wrote:
>> I don't have time to bisect this, currently - but just thought I'd let
>> the list know that, while xen-4.2 works (with the recent S3 changes
>> I've submitted) - 4.3 is broken again.
>>
>> I'm not sure if it is the hypervisor, or the kernel, since I upgraded
>> both in my "unstable" build environment.
>>
>> Since this is something that XenClient really relies on working, it
>> has been a pain point with every upgrade of Xen for us.
>
> Perhaps one point here also is that you upgrade in too big steps?
Indeed. Unfortunately, the realities of shipping a product and staffing
this effort, which are outside an individual engineer's control (mine),
also need to be considered.
Since I am not on the open source platform team in Citrix, I am unable
to dedicate my time strictly to the open source development on this,
as much as I would like to.

Also - when a product is stable, and the newer versions are focused on
features we tend not to make use of, it is a tough sell to management
to come up with a reason to upgrade. We stayed on the 4.0.y release
train because it was stable.
>
> More regular participation in development and patch review
> would very likely also help keeping down the number of
> regressions here.
The breakage tends to be out of my expertise, until I have to debug it.
Consequently, I've been learning a lot about schedulers lately.

That said, these breakages have happened in paths that went through
reviews, and were not caught.
In development of XenClient, test automation is where we tend to
uncover S3 related bugs that are not caught in the review process,
because these problems are insidious, breaking things in unexpected ways.

I'll try to participate in reviews in the future, where I can,
since I do appreciate the value in doing so.


Ben
>
> Jan
>
>> It is enormously time consuming to debug on every upgrade, and has a
>> long tail in discovering problems (I started debugging S3 last Aug on
>> xen-unstable, prior to 4.2 being cut)
>>
>> How can we work with the community to try to get some sort of
>> regression testing for this feature that we rely on in our product?
>>
>> Ben
>
>

George Dunlap

2013-Apr-29 10:55 UTC

Re: S3 is broken again in xen-unstable

On Mon, Apr 29, 2013 at 9:45 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 25.04.13 at 14:00, Ben Guthro <ben@guthro.net> wrote:
>> I don't have time to bisect this, currently - but just thought I'd let
>> the list know that, while xen-4.2 works (with the recent S3 changes
>> I've submitted) - 4.3 is broken again.
>>
>> I'm not sure if it is the hypervisor, or the kernel, since I upgraded
>> both in my "unstable" build environment.
>>
>> Since this is something that XenClient really relies on working, it
>> has been a pain point with every upgrade of Xen for us.
>
> Perhaps one point here also is that you upgrade in too big steps?
>
> More regular participation in development and patch review
> would very likely also help keeping down the number of
> regressions here.
Perhaps, but given the incredible amounts of traffic on the list, how
is he supposed to know which patches might break suspend or not?  And
even if he did, he would have to take the time to understand every
single hypervisor patch and predict how it would act on suspend, which
is just not reasonable.  Remember we're on the other side of this
equation wrt Linux -- it's all too easy for someone to move something
apparently innocuous around and have it break dom0 pvops in a way
that's not noticed until 6 months later.  That's why we do regular
testing of Linus' tree, as well as Ingo's x86 tree.

The right thing to do is to put at least a basic suspend test into the
testing push-gate, so that when someone submits a change that breaks
suspend, *they* are the ones that have to figure out what went wrong
and fix it.

 -George

George Dunlap

2013-Apr-29 11:03 UTC

Re: S3 is broken again in xen-unstable

On Fri, Apr 26, 2013 at 9:10 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Thu, 2013-04-25 at 18:02 +0100, Ben Guthro wrote:
>> On Thu, Apr 25, 2013 at 8:00 AM, Ben Guthro <ben@guthro.net> wrote:
>> > Since this is something that XenClient really relies on working, it
>> > has been a pain point with every upgrade of Xen for us.
>> > It is enormously time consuming to debug on every upgrade, and has a
>> > long tail in discovering problems (I started debugging S3 last Aug on
>> > xen-unstable, prior to 4.2 being cut)
>> >
>> > How can we work with the community to try to get some sort of
>> > regression testing for this feature that we rely on in our product?
>>
>> I am still interested in ideas for getting this into automated
>> testing, and any ideas people may have for this.
>
> CCing Ian Jackson who runs the test infrastructure.
>
> Contributing new tests is now less onerous than it once was (i.e. it
> might even possible at all). There is some info at
> lists.xen.org/archives/html/xen-devel/2012-10/msg01517.html
> although the branch may be out of date -- Ian was working on merging the
> standalone branch at one point.
>
> Some questions:
>       * How automatable is s3?
>       * In particular can we automate the wakeup? s3 is save to RAM
>         IIRC, and most power control in the test system is done with PDU
>         power cycling.
>       * Would s3 ever be expected to work on the sorts of whitebox
>         server systems which form the osstest pool or do we need to
>         investigate additional hardware?
>       * How hardware specific are the s3 failures -- we obviously can't
>         have one of every laptop ever ;-)
When I discussed this with Ben before, it seemed that almost any
testing would be a very large improvement.  Namely, it seemed to me
that even if the test just did a "null suspend" -- i.e., shut the
entire system down as though about to do a suspend but then just
resume without pulling the trigger -- would shake out a lot of bugs
(as well as make it very easy for devs that don't normally care about
suspending their machine to fix things).  Having an RTC wake-up would
be the next thing to do after that.

 -George

Jan Beulich

2013-Apr-29 11:07 UTC

Re: S3 is broken again in xen-unstable

>>> On 29.04.13 at 12:55, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Mon, Apr 29, 2013 at 9:45 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>>> On 25.04.13 at 14:00, Ben Guthro <ben@guthro.net> wrote:
>>> I don't have time to bisect this, currently - but just thought I'd let
>>> the list know that, while xen-4.2 works (with the recent S3 changes
>>> I've submitted) - 4.3 is broken again.
>>>
>>> I'm not sure if it is the hypervisor, or the kernel, since I upgraded
>>> both in my "unstable" build environment.
>>>
>>> Since this is something that XenClient really relies on working, it
>>> has been a pain point with every upgrade of Xen for us.
>>
>> Perhaps one point here also is that you upgrade in too big steps?
>>
>> More regular participation in development and patch review
>> would very likely also help keeping down the number of
>> regressions here.
> 
> Perhaps, but given the incredible amounts of traffic on the list, how
> is he supposed to know which patches might break suspend or not?  And
> even if he did, he would have to take the time to understand every
> single hypervisor patch and predict how it would act on suspend, which
> is just not reasonable.  Remember we're on the other side of this
> equation wrt Linux -- it's all to easy for someone to move something
> apparently innocuous around and have it break dom0 pvops in a way
> that's not noticed until 6 months later.  That's why we do regular
> testing of Linus' tree, as well as Ingo's x86 tree.
> 
> The right thing to do is to put at least a basic suspend test into the
> testing push-gate, so that when someone submits a change that breaks
> suspend, *they* are the ones that have to figure out what went wrong
> and fix it.
I was in no way suggesting this to be a bad idea. What I was
trying to point out is that testing a certain feature only every
couple of major releases is very likely not to help nearly as much
as being involved regularly. And no, I also didn't mean to suggest
for _anyone_ to review each and every individual patch. But
looking at some key ones before they go in would certainly help.

Jan

George Dunlap

2013-Apr-30 09:00 UTC

Re: S3 is broken again in xen-unstable

On 04/29/2013 03:00 PM, Ben Guthro wrote:
> On Mon, Apr 29, 2013 at 7:03 AM, George Dunlap
> <George.Dunlap@eu.citrix.com> wrote:
>> On Fri, Apr 26, 2013 at 9:10 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>>> On Thu, 2013-04-25 at 18:02 +0100, Ben Guthro wrote:
>>>> On Thu, Apr 25, 2013 at 8:00 AM, Ben Guthro <ben@guthro.net> wrote:
>>>>> Since this is something that XenClient really relies on working, it
>>>>> has been a pain point with every upgrade of Xen for us.
>>>>> It is enormously time consuming to debug on every upgrade, and has a
>>>>> long tail in discovering problems (I started debugging S3 last Aug on
>>>>> xen-unstable, prior to 4.2 being cut)
>>>>>
>>>>> How can we work with the community to try to get some sort of
>>>>> regression testing for this feature that we rely on in our product?
>>>>
>>>> I am still interested in ideas for getting this into automated
>>>> testing, and any ideas people may have for this.
>>>
>>> CCing Ian Jackson who runs the test infrastructure.
>>>
>>> Contributing new tests is now less onerous than it once was (i.e. it
>>> might even possible at all). There is some info at
>>> lists.xen.org/archives/html/xen-devel/2012-10/msg01517.html
>>> although the branch may be out of date -- Ian was working on merging the
>>> standalone branch at one point.
>>>
>>> Some questions:
>>>        * How automatable is s3?
>>>        * In particular can we automate the wakeup? s3 is save to RAM
>>>          IIRC, and most power control in the test system is done with PDU
>>>          power cycling.
>>>        * Would s3 ever be expected to work on the sorts of whitebox
>>>          server systems which form the osstest pool or do we need to
>>>          investigate additional hardware?
>>>        * How hardware specific are the s3 failures -- we obviously can't
>>>          have one of every laptop ever ;-)
>>
>> When I discussed this with Ben before, it seemed that almost any
>> testing would be a very large improvement.  Namely, it seemed to me
>> that even if the test just did a "null suspend" -- i.e., shut the
>> entire system down as though about to do a suspend but then just
>> resume without pulling the trigger -- would shake out a lot of bugs
>> (as well as make it very easy for devs that don't normally care about
>> suspending their machine to fix things).  Having an RTC wake-up would
>> be the next thing to do after that.
>>   -George
>
> FWIW, the patch that implements this "fake s3" functionality is
> implemented here, should it be considered for inclusion:
> markmail.org/message/ghj2ffngemccq6p4
[Adding xen-devel back in to the cc]

FYI playing around with this is on my to-do list, but it will probably
not happen until after the 4.3 release.

  -George

Ian Jackson

2013-May-01 11:01 UTC

Re: S3 is broken again in xen-unstable

Ben Guthro writes ("Re: [Xen-devel] S3 is broken again in xen-unstable"):
...
> That said, when things go wrong, the machine does need to be power
> cycled...so if you are not physically located near the machine under
> test, you would need a PDU as a recovery mechanism, I suppose.
Ah this makes matters a bit more complicated.  The code which
implements the test schedule would need to know to power cycle the
host after a failure.  Could we be confident that after a failed test
of this kind we wouldn't see filesystem corruption?

Also, looking at your test script, you seem to be testing using dom0
only.  We're ignoring guests then.  Perhaps this should be a separate
test column.  (That might be a way to fudge the recovery question
too.)
> >       * How hardware specific are the s3 failures -- we obviously can't
> >         have one of every laptop ever ;-)
> 
> Clearly. I'm just looking to get a foot in the door here, so there is
> a chance of catching gross regressions.
> The hardware differences seem to be more timing related, due to
> speed... ie, you are likely to uncover new failures when new, faster
> hardware comes out for laptops.
> Since typically server hardware is faster than laptop hardware, that
> would theoretically catch problems at a higher frequency.

If the hardware/BIOS is likely to be buggy, that's a bit of a pain.
We'd have to at least figure out which machines worked and flag them
so that the test was only run on those.

> > Once we have a test case in the standard flights then we can consider
> > the options around new flights testing other trees.
> 
> I'm not sure I understand this point.
> Are you saying you want to see a test that fails in the standard test
> flight first...because without Konrad's patches, it will be guaranteed
> not to work.

As Ian says, there is no problem with deploying the test first and
fixing the actual code later...

Ian.
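
The two harness requirements raised above (only run on machines flagged as working, and power cycle the host after a failure) might be sketched as follows. This is not osstest code: Host, s3_known_good, run_s3_step and the two callbacks are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    s3_known_good: bool  # hypothetical per-host flag, set after manual validation


def run_s3_step(host, suspend_resume, power_cycle):
    """Run one S3 test step on `host` and return "skip", "pass" or "fail".

    suspend_resume(host) -> bool   True if the host came back from S3
    power_cycle(host)              recovery hook, e.g. toggling a PDU outlet
    """
    if not host.s3_known_good:
        return "skip"               # do not blame Xen for known-bad hardware/BIOS
    try:
        resumed = suspend_resume(host)
    except Exception:
        resumed = False             # treat a timeout or lost connection as a failure
    if resumed:
        return "pass"
    power_cycle(host)               # host is assumed hung; recover it for later tests
    return "fail"
```

Keeping the recovery hook as a parameter means the same step works whether the host is recovered by a PDU or by someone pressing the power button.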

Ben Guthro

2013-May-01 12:03 UTC

Re: S3 is broken again in xen-unstable

On Wed, May 1, 2013 at 7:01 AM, Ian Jackson <Ian.Jackson@eu.citrix.com> wrote:
> Ben Guthro writes ("Re: [Xen-devel] S3 is broken again in xen-unstable"):
> ...
>> That said, when things go wrong, the machine does need to be power
>> cycled...so if you are not physically located near the machine under
>> test, you would need a PDU as a recovery mechanism, I suppose.
>
> Ah this makes matters a bit more complicated.  The code which
> implements the test schedule would need to know to power cycle the
> host after a failure.  Could we be confident that after a failed test
> of this kind we wouldn't see filesystem corruption?

If you are using a journaled filesystem, I think the confidence level
is raised...but there are no guarantees, when you just yank a power
cord.

>
> Also, looking at your test script, you seem to be testing using dom0
> only.  We're ignoring guests then.  Perhaps this should be a separate
> test column.  (That might be a way to fudge the recovery question
> too.)

I'm going for baby-steps here.
The vast majority of the S3 failures we have encountered have been
dom0 related, so I thought that would be a decent starting place.

>
>> >       * How hardware specific are the s3 failures -- we obviously can't
>> >         have one of every laptop ever ;-)
>>
>> Clearly. I'm just looking to get a foot in the door here, so there is
>> a chance of catching gross regressions.
>> The hardware differences seem to be more timing related, due to
>> speed... ie, you are likely to uncover new failures when new, faster
>> hardware comes out for laptops.
>> Since typically server hardware is faster than laptop hardware, that
>> would theoretically catch problems at a higher frequency.
>
> If the hardware/BIOS is likely to be buggy, that's a bit of a pain.
> We'd have to at least figure out which machines worked and flag them
> so that the test was only run on those.

I think testing a known good configuration for regression seems
appropriate, yes.
They all *should* work...but I'm just being conservative here.

>
>> > Once we have a test case in the standard flights then we can consider
>> > the options around new flights testing other trees.
>>
>> I'm not sure I understand this point.
>> Are you saying you want to see a test that fails in the standard test
>> flight first...because without Konrad's patches, it will be guaranteed
>> not to work.
>
> As Ian says, there is no problem with deploying the test first and
> fixing the actual code later...
>
> Ian.

Pasi Kärkkäinen

2013-May-07 08:34 UTC

Re: S3 is broken again in xen-unstable

On Fri, Apr 26, 2013 at 07:41:07PM -0400, Ben Guthro wrote:
> >
> > Ok, so master (xen-unstable) works OK regarding ACPI S3. Good.
> >
> > What hypervisor-side patches are still missing from stable-4.2 branch?
> 
> The final one that actually makes it work is
> xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=9aa356bc9f7533c3cb7f02c823f532532876d444
> Jan Beulich had already indicated that this would be picked up in the
> 4.2 release cycle, but it was too late to get it into 4.2.2
>

Yep, I can see this already in 4.2 branch for 4.2.3.

>
> Then, also the ns16550 change.
> While strictly not necessary to fix S3 in the normal path, it does fix
> a bug that can lead to S3 not working if you
>    a. have one of these SuperIO controllers on the LPC bus.
>    b. have serial enabled.
> xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=6e96c186d23873597896051b043cfeb119c4a7d5
>

Jan: I think this ns16550 patch should be backported to 4.2 branch as well.


Thanks,

-- Pasi
> 
> On the linux side of things, the following are necessary:
> One of the acpi-s3.vX branches. I use v9, but v10 is also available. I
> don't think one has an advantage over the other.
>
> acpi-s3.v10:
> git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/commit/?h=devel/acpi-s3.v10&id=c268cd657314354f910b773a17a9de0299e1cc21
> git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/commit/?h=devel/acpi-s3.v10&id=864848221b056aaf25416999c29cb0e14d3c3197
> git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/commit/?h=devel/acpi-s3.v10&id=aa7eb7bbb3f2a39435a07c082f99893386ae83ec
>
> stable/for-linus-3.10:
> git.kernel.org/cgit/linux/kernel/git/konrad/xen.git/commit/?h=stable/for-linus-3.10&id=3fac10145b766a2244422788f62dc35978613fd8
> 
> 
> Ben

Jan Beulich

2013-May-07 08:40 UTC

Re: S3 is broken again in xen-unstable

>>> On 07.05.13 at 10:34, Pasi Kärkkäinen <pasik@iki.fi> wrote:
> On Fri, Apr 26, 2013 at 07:41:07PM -0400, Ben Guthro wrote:
>> Then, also the ns16550 change.
>> While strictly not necessary to fix S3 in the normal path, it does fix
>> a bug that can lead to S3 not working if you
>>    a. have one of these SuperIO controllers on the LPC bus.
>>    b. have serial enabled.
>> 
>> xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=6e96c186d23873597896051b043cfeb119c4a7d5
>>
>
> Jan: I think this ns16550 patch should be backported to 4.2 branch as well.

Yeah, as being secondary I left this off until we know that this
really is the only thing known to break resume (i.e. I saw no point
in backporting this when in the end S3 still wouldn't work anyway).

Ben - am I right in understanding your earlier summary in this
thread to mean that the 4.2 branch, according to your testing,
is now in such a state?

Jan

Ben Guthro

2013-May-07 09:18 UTC

Re: S3 is broken again in xen-unstable

On May 7, 2013, at 4:40 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 07.05.13 at 10:34, Pasi Kärkkäinen <pasik@iki.fi> wrote:
>> On Fri, Apr 26, 2013 at 07:41:07PM -0400, Ben Guthro wrote:
>>> Then, also the ns16550 change.
>>> While strictly not necessary to fix S3 in the normal path, it does fix
>>> a bug that can lead to S3 not working if you
>>>   a. have one of these SuperIO controllers on the LPC bus.
>>>   b. have serial enabled.
>>>
>>> xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=6e96c186d23873597896051b043cfeb119c4a7d5
>>
>> Jan: I think this ns16550 patch should be backported to 4.2 branch as well.
>
> Yeah, as being secondary I left this off until we know that this
> really is the only thing known to break resume (i.e. I saw no point
> in backporting this when in the end S3 still wouldn't work anyway).
>
> Ben - am I right in understanding your earlier summary in this
> thread to mean that the 4.2 branch, according to your testing,
> is now in such a state?
>

Yea, S3 works on the 4.2.3 branch without this patch. This fixes a
specific corner case on some machines with the SuperIO hardware.

Ben Guthro

2013-May-21 18:29 UTC

Re: S3 is broken again in xen-unstable

On Fri, Apr 26, 2013 at 9:17 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Fri, 2013-04-26 at 13:19 +0100, Ben Guthro wrote:
>> On Fri, Apr 26, 2013 at 4:10 AM, Ian Campbell <Ian.Campbell@citrix.com> wrote:
>
>> > Some questions:
>> >       * How automatable is s3?
>> >       * In particular can we automate the wakeup? s3 is save to RAM
>> >         IIRC, and most power control in the test system is done with PDU
>> >         power cycling.
>>
>> I spoke with George Dunlap a bit about this while I was over in the
>> UK a few weeks ago, and drew up an example shell script for this:
>> xen.markmail.org/thread/ghj2ffngemccq6p4
>> Marek also weighed in, and included some of his own tests, and experiences.
>>
>> In my experience, this mechanism is about as reliable as your RTC. On
>> some systems you might tell it to sleep for 30s, and it will wake in
>> 10s.
>>
>> That said, when things go wrong, the machine does need to be power
>> cycled...so if you are not physically located near the machine under
>> test, you would need a PDU as a recovery mechanism, I suppose.
>
> That's OK, all the systems in the test harness would have to have PDU
> for the other test cases (initial install etc) anyway.
>
>> >> Would it be helpful to maintain a branch in my xenbits repo that could
>> >> be a rebased version of konrad's acpi-s3 patches against Linus' latest
>> >> kernel?
>> >
>> > What is keeping those out of Linus' tree?
>>
>> Added Konrad here, but I believe he is on vacation this week.
>> This has been a bullet point on his OSS presentation, as outstanding
>> pvops work for at least 3 years now.
>>
>> IIRC, the x86 guys NACK'ed the change as being too invasive.
>> I googled around a bit, but can't seem to find the thread about it.
>
> I wonder if it might be something like that :-/

FWIW, I believe we are over another hurdle here.
I have a commitment from the ACPI maintainer (Rafael Wysocki) that the
following patches will be included in the linux-3.11 merge window:

lkml.org/lkml/2013/5/14/465

When this is accepted, this should give "out of the box" S3
functionality with Xen

Once Xen-4.3 is released, I would like to revisit trying to see if
there would be some way to get something into the automated test
system.


Ben

>
>> > Once we have a test case in the standard flights then we can consider
>> > the options around new flights testing other trees.
>>
>> I'm not sure I understand this point.
>> Are you saying you want to see a test that fails in the standard test
>> flight first...because without Konrad's patches, it will be guaranteed
>> not to work.
>
> Right. AIUI the flights (and I may be using the wrong term here) are
> somewhat uniform and few in number and get run with various
> combinations of inputs (Xen tree, Linux tree, Qemu tree), so there is
> effectively one "test Linux PV kernel flight" and one "test Xen PV
> guests flight" etc, so we want to get S3 into those flights, with the
> existing set of "* tree" inputs.
>
> IOW we should add a new row to the grid
> chiark.greenend.org.uk/~xensrcts/logs/17816 for s3 testing
> and then we can consider adding a new column with a different set of
> trees as input.
>
> Ian J may have a different opinion on how to approach, but he's away
> until mid next week.
>
>> ...and without other changesets queued up for the 3.10 merge window,
>> non-boot CPUs will always have incorrect C-states.
>
> It's OK to add the tests before things work.
>
> Ian.
>
>
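
The "new row first, new column later" ordering described in the quoted text can be pictured with a toy model of the results grid. This is only a sketch: the job names and tree names below are hypothetical and are not taken from osstest.

```python
from itertools import product

# Hypothetical job and tree names, for illustration only.
BASE_TESTS = ["test-linux-pv-kernel", "test-xen-pv-guests"]
BASE_TREES = [("xen-unstable", "linux-mainline", "qemu-upstream")]


def build_grid(tests, tree_sets):
    """Return one cell per (test row, tree-set column), initially not run."""
    return {(test, trees): "not-run" for test, trees in product(tests, tree_sets)}


# Step 1: add an S3 row, keeping the existing tree inputs.
grid_with_s3_row = build_grid(BASE_TESTS + ["test-s3-suspend-resume"], BASE_TREES)

# Step 2 (later): add a column whose kernel tree carries the acpi-s3 patches.
grid_with_s3_column = build_grid(
    BASE_TESTS + ["test-s3-suspend-resume"],
    BASE_TREES + [("xen-unstable", "linux-acpi-s3", "qemu-upstream")],
)
```

With only step 1 applied, the S3 row is expected to fail against the existing trees, which matches the point that it is OK to add the tests before things work.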

Pasi Kärkkäinen

2013-May-21 18:52 UTC

Re: S3 is broken again in xen-unstable

On Tue, May 21, 2013 at 02:29:23PM -0400, Ben Guthro wrote:
> >> >
> >> > What is keeping those out of Linus' tree?
> >>
> >> Added Konrad here, but I believe he is on vacation this week.
> >> This has been a bullet point on his OSS presentation, as outstanding
> >> pvops work for at least 3 years now.
> >>
> >> IIRC, the x86 guys NACK'ed the change as being too invasive.
> >> I googled around a bit, but can't seem to find the thread about it.
> >
> > I wonder if it might be something like that :-/
> 
> FWIW, I believe we are over another hurdle here.
> I have a commitment from the ACPI maintainer (Rafael Wysocki) that the
> following patches will be included in the linux-3.11 merge window:
> 
> lkml.org/lkml/2013/5/14/465
> 
> When this is accepted, this should give "out of the box" S3
> functionality with Xen
>

This is great, thanks a lot!

> Once Xen-4.3 is released, I would like to revisit trying to see if
> there would be some way to get something into the automated test
> system.
>

-- Pasi
