thr3ads.net - llvm dev - [llvm-dev] TSAN hack on AArch64 for Android [Aug 2015]

If this information is useful, please help other people find it:
Share via:

Renato Golin via llvm-dev

2015-Aug-28 20:01 UTC

[llvm-dev] TSAN hack on AArch64 for Android

On 28 August 2015 at 18:02, Jason Kim <jasonk at codeaurora.org>
wrote:> IMO having to disable 2/3 of the tests means the patch isn't ready yet.
Dan is absolutely right.

> (*) The tests weren’t all stable to begin with.
That's not a reason to break them more. If you can't prove they're
correct, you can't prove you broken them more, especially on other
arches.

> (*) I am perfectly willing to triage the failures, but please keep in mind
> that any “threshold” that we cross from NOT testing TSAN on aarch64+android
> to actually testing on it is going to be a large hurdle.
I'm glad you are, but please do it off trunk.

Your changes touch other OSs (ex. AArch64 Linux) and also other Arches
(see some comments to that effect on the patch). You can't just push
something that isn't ready like that.

> (*) BIONIC is NOT glibc!
I missed the point here...

> (*) Its pretty darn near impossible to  have the perfect patch, which is
[a]
> small [b] doesn’t break anything [c] adds major new features.
> (*) What I have currently is: [a] not very big [b] doesn’t break anything
> VISIBLY [c] adds major new features
It's hard, I give you that. But that's why most of us chose compilers
to work with. It's also a lot of fun.

I've seen lots of people provide lots of ideas on how to do that
slowly, and re-use changes in Android, RT, etc.

> (*) The new feature (TSAN on aarch64-linux-androideabi) is currently turned
> OFF by default
You're disabling the tests on Android/TSAN/AArch64, but you're
touching other arches.

Linaro has added *a lot* of sanitizers to AArch64. The way we did was
to have all changes local until *all* the tests pass. If *one or two*
tests were flaky, we marked them as "REQUIRES: stable-runtime" and
investigated it.

Never 2/3 of the tests. Never when touching other arches / OSs.

Have you tested your changes in x86_64 Android/Linux? AArch64 Linux?
What other arches would it touch? Mips64?

If your changes are *that* unstable in Android, I'm very worried that
it will destabilise the whole AArch64 implementation, which we just
pushed in and it is working.

One good example is ASAN/AArch64. When Christophe implemented it last
year, it took us 2 months to commit it to the tree, but we left the
build and *all* tests disabled because it passed 100% on our boxes,
but not on the public buildbot. It was only recently, many months
later, that we enabled the build/tests. In your case, it doesn't even
pass on your box.

> (*) There are plenty of cases in prior history in llvm where a set of tests
> were turned off temporarily, especially when a major new platform is being
> introduced. I don’t think its unreasonable to have that as a starting
point.
A major new platform, ie. a new back-end, doesn't normally break
anything elsewhere. And when it touches generic parts, tests are
carried out to prove that it didn't break anything.

If your code was 100% new files/functions/macros on a new architecture
that no one uses, it'd be ok. But that's not the case.

A big feature is hard to implement. You have to carry along your
branch, re-base, adapt, and you have to add tests, and test on other
architectures and OSs to make sure everything is good. For that big a
change, you cannot rely on the buildbots to tell you that, especially
not on the architecture you're targeting, and the base architecture,
which is x86_64.

If the tests are unstable even on x86_64 (I know they are, more
unstable than on ARM/AArch64), you have to be extra *extra* careful.
We cannot simply disable everything.

cheers,
--renato

Jason Kim via llvm-dev

2015-Aug-31 22:23 UTC

head link

[llvm-dev] TSAN hack on AArch64 for Android

> -----Original Message----- > From: Renato Golin [mailto:renato.golin at
linaro.org]> 
> > (*) Its pretty darn near impossible to  have the perfect patch, which > > is [a] small [b] doesn’t break anything [c] adds major new features.
 > > (*) What I have currently is: [a] not very big [b] doesn’t break
 > > anything VISIBLY [c] adds major new features
 > >  > You're disabling the tests on Android/TSAN/AArch64, but you're
touching
 > other arches.

 [ jasonk --> ]  There seems to be a bit of unwarranted FUD here. 
My work is ONLY for TSAN on aarch64-android. Where are you seeing me
"touching" other architectures? The new bits for the __sanitizer_XXX
structs are directly from bionic .h headers, which *IS* platform-neutral.

Can you please poin to a specific point in the patch that you *know* will
negatively impact other architectures? Please be specific. I am unware of any
such bits in the patch, and if it really *does* impact other platforms, then it
is merely mistake on my part that was (correctly!) caught at code review time,
and thus will be corrected quickly.

 > 
 > Linaro has added *a lot* of sanitizers to AArch64. The way we did was to
 > have all changes local until *all* the tests pass. If *one or two* tests
were
 > flaky, we marked them as "REQUIRES: stable-runtime" and
investigated it.

[ jasonk --> ] Moot point. I think. I am not suggesting anything
substantively different than how ASAN/AARCH64 was integrated into compiler-rt.
Please read below.
> Have you tested your changes in x86_64 Android/Linux? AArch64 Linux? > What other arches would it touch? Mips64?

[ jasonk --> ] Uhh, my work is ONLY for android-aarch64. Others can jump in
and improve for other android platforms (non-aarch64) as they see fit. Where are
you getting all these other architectures from???? I am ONLY touching TSAN on
android AND aarch64. Nothing else. Its not even turned on unless a specific
CMAKE flag is passed during configure time.
 
 > If your changes are *that* unstable in Android, I'm very worried that
it will
 > destabilise the whole AArch64 implementation, which we just pushed in
 > and it is working.

[ jasonk --> ]  In the sense, that the tests were already not 100% stable, as
you say, all I've done is, for android-aarch64 (which, according to
lab.llvm.org is not even a tested platform yet), made sure of the following:

0. My changes should have NO impact on other architectures supported by
compiler-rt
1. tsan on android-aarch64 is still disabled by default, activated only if a
specific cmake flag is present when compiler-rt is being configured.
2. The tests that do not behave as expected on android-aarch64 are turned off,
ONLY for android-aarch64. They are run, as is, for other platforms.
3. This will give some amount of breathing room for me to go in and triage the
tests, one by one, while I have some amount of reassurance that some one else
does not come along and horribly break the work that I've already done.
4. Not yet mentioned, but is critical, is that my TSAN patch has to be tested in
conjunction with an android testing environment, which means a new buildbot, and
possibly a new VM, running on labs.llvm.org, specifically for this work.
5. Not to belabor the point, but right now, even if my sumo patch (D11532)
landed on compiler-rt RIGHT NOW, I would bet that it would not disturb the
existing compiler-rt build at all. Things that were already broken, would likely
still remain broken, and things that were passing, probably would not be
affected at all.

 > 
 > One good example is ASAN/AArch64. When Christophe implemented it last
 > year, it took us 2 months to commit it to the tree, but we left the build
and
 > *all* tests disabled because it passed 100% on our boxes, but not on the
 > public buildbot. It was only recently, many months later, that we enabled
 > the build/tests. In your case, it doesn't even pass on your box.

[ jasonk --> ] Ahh yes.  Moot point again. Tests don't pass SOMEWHERE.
Whether they are, as you say, *all* disabled on the public bot, or (as in my
case, a portion is marked unsupported, ONLY on aarch64-android, but is run as-is
for other existing configurations) is really a minor point.
If you were to have done as you are requesting me to do, then shouldn't the
ASAN/AARCH64 work been delayed being commited to compiler-rt until ALL the tests
passed 100%??

 > 
 > > A big feature is hard to implement. You have to carry along your branch,
re- > base, adapt, and you have to add tests, and test on other architectures
and
 > OSs to make sure everything is good. For that big a change, you cannot
rely
 > on the buildbots to tell you that, especially not on the architecture
you're
 > targeting, and the base architecture, which is x86_64.

[ jasonk --> ]  I am not disagreeing here. All I am asking is for a little
wiggle room for me to triage the tests without having to juggle 3 pins hopping
one legged during a hurricane purched on top of a chandelier, while doing the
occasional ankle-to-hand beer toast (to coin a phrase).

My tsan-on-android-aarch64 taskforce consists of exactly one person, me. Because
the work touches both BIONIC AND compiler-rt, it is more complex task than
X-sanitizer-on-linux type of patches.
For example, I will need a new build bot, new VM running ANDROID to even test
the work on lab.llvm.org. Without these, my D11532 patch, even if it landed
RIGHT now, should be invisible to the rest of the tested configurations.

Rebasing my android patches is significantly more involved, both in time and
effort than purely llvm-clang-compiler-rt rebase.
I am simply trying to reduce the circular tail chasing to a manageable level.

I will ensure that my work does not impact other supported architectures
negatively, but you (collective YOU, as in the compiler-rt community) please
have a bit of forbearance, and *do not believe (unwarranted!) FUD*. The
intention and I think also the work --- even in its current rough form ---
should not impact other tested compiler-rt configurations at all.
As I stated earlier, and blips to the contrary is something I am commited to
fixing as quickly as humanly possible.

 > 
 > If the tests are unstable even on x86_64 (I know they are, more unstable
 > than on ARM/AArch64), you have to be extra *extra* careful.
 > We cannot simply disable everything.

[ jasonk --> ]  Please please please! get this straight! I am not
"disabling everything." My work requires a new CMAKE flag to even be
active. It should not impact other platforms at all.

Thanks for reading!
-Jason

Renato Golin via llvm-dev

2015-Sep-01 18:09 UTC

head link

[llvm-dev] TSAN hack on AArch64 for Android

Hi Jason,

Let me reduce the defcon level of this thread... Whatever is marked
around <FUD></FUD>, is FUD, and should be taken lightly. Everything
else is a genuine question or opinion.

On 31 August 2015 at 23:23, Jason Kim <jasonk at codeaurora.org>
wrote:> My work is ONLY for TSAN on aarch64-android. Where are you seeing me
"touching" other architectures? The new bits for the __sanitizer_XXX
structs are directly from bionic .h headers, which *IS* platform-neutral.
There are ifdef ANDROID_SANITIZER blocks that are not guarded by
AArch64. As you say, it is target neutral, but you haven't tested on
other targets. Wouldn't that impact x86_64 Android? Or is that not
supported at all by TSAN?

Others where WORDSIZE was used, but I'm not sure it only applies on your
case.

Adhemerval had some comments about MIPS64 in the original patch, but I
think it got lost in the number of other comments.

What I'm getting at is that you think the macro separation you have is
enough to keep the TSAN+AArch64+Android at bay, but unless you test it
on AArch64+Linux, x86_64+Android, you can't be sure of it. You may
just as well be right, and if everybody in the sanitizer arena agree
that you are, I'm fine with it.

> [ jasonk --> ] Moot point. I think. I am not suggesting anything
substantively different than how ASAN/AARCH64 was integrated into compiler-rt.
Please read below.
I don't follow.

We had all-minus-2 tests passing on three AArch64 environments, out of
4. You have 90% of the tests failing on your own AArch64+Android
environments. As Dan, said, if 90% of your own tests fail, the patch
is not ready yet.

> 0. My changes should have NO impact on other architectures supported by
compiler-rt
*should* is the keyword here. But I have already agreed that if san
folks agree with you, I do too.

> 1. tsan on android-aarch64 is still disabled by default, activated only if
a specific cmake flag is present when compiler-rt is being configured.
That's good.

> 2. The tests that do not behave as expected on android-aarch64 are turned
off, ONLY for android-aarch64. They are run, as is, for other platforms.
As expected. That wasn't one of my concerns.

> 3. This will give some amount of breathing room for me to go in and triage
the tests, one by one, while I have some amount of reassurance that some one
else does not come along and horribly break the work that I've already done.
<FUD>
So, there is the possibility that the code you're adding, and the
future changes will make it more complicated for others to add/improve
their own code, for instance, AArch64 Linux. I don't know that for
sure, and just wanted to raise concerns that the san experts should be
aware of it.
</FUD>

> 4. Not yet mentioned, but is critical, is that my TSAN patch has to be
tested in conjunction with an android testing environment, which means a new
buildbot, and possibly a new VM, running on labs.llvm.org, specifically for this
work.
That'd be amazing! I was FUDding about not being able to test our own
patches on Android, but having a bot solves that problem.

> 5. Not to belabor the point, but right now, even if my sumo patch (D11532)
landed on compiler-rt RIGHT NOW, I would bet that it would not disturb the
existing compiler-rt build at all. Things that were already broken, would likely
still remain broken, and things that were passing, probably would not be
affected at all.
I'm not a gambling man... :)

> If you were to have done as you are requesting me to do, then shouldn't
the ASAN/AARCH64 work been delayed being commited to compiler-rt until ALL the
tests passed 100%??
No. They were passing all-2, which is about 99%. That's ok. There was
a problem with the buildbot (which is known to be problematic anyway),
so it was disabled by default so not to disturb the community, but we
knew it was correct, as it passed everywhere else.

Now we have more bots, and it passes on all of them, but that old one
is still fiddly. All in all, a clean cut.

You have 180 tests failing. If it was just up to the buildbots, you
disable the tests on Android+AArch64+TSAN and jobs done. But this is,
as Dan said, about the code itself.

It's a decision that we have to make consciously. Adding raw code is
common, but normally that code is well separated from the rest. With
the concerns I had on the macros, I wasn't sure that it was that clean
of a cut.

<FUD>
It is entirely possible that there will be no problem at all, that all
your code is Android only, as you say. I was expecting you'd say that
you tested the code on AArch64+Linux and Android+x86_64, just to make
sure you were right...
</FUD>

> My tsan-on-android-aarch64 taskforce consists of exactly one person, me.
Because the work touches both BIONIC AND compiler-rt, it is more complex task
than X-sanitizer-on-linux type of patches.
Sigh... I know the drill only too well. :(

> I am simply trying to reduce the circular tail chasing to a manageable
level.
Right. Let's get this thing on the road, then.

We don't have any Android environment to test anything from our side,
so any breakage on your side will have to be dealt with by you. I do a
lot of it here, and it's not easy. I hope you're aware of that.

We'll also be submitting patches to improve AArch64 support,
performance or fixing tests. These patches may break your stuff.

One thing I normally do is to download the patches from Phab reviews
and run on my local config, just to make sure that the check-all will
pass on our bot configuration before it actually goes. That helps a
lot the stabilization, and we should work with you to make sure you
can fix them with out help, if needed.

Let's get back to the patches in question, and make sure that we all
agree on the new structures and the TLS implementation. The biggest
part of my FUD, is that we didn't even discuss that beforehand, and
even now, there are at least 5 options to chose from.

Once the structures are accepted by the san experts, and the TLS
implementation is agreed and working, we can get this in, completely
disabled by default.

cheers,
--renato

llvm dev - Aug 2015 - TSAN hack on AArch64 for Android

[llvm-dev] TSAN hack on AArch64 for Android

[llvm-dev] TSAN hack on AArch64 for Android

[llvm-dev] TSAN hack on AArch64 for Android