thr3ads.net - opus - [opus] [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics [Dec 2014]

If this information is useful, please help other people find it:
Share via:

Timothy B. Terriberry

2014-Dec-19 11:26 UTC

[opus] [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

Viswanath Puttagunta wrote:> I responded to your feedback before I started on RFCv3.. and took your
> silence as approval :).. I guess that email got lost in your inbox sea
> some where.. so re-posting the responses.
Sorry, I did see it but I guess I read it rather more quickly than I 
thought. Apologies for that.
> guidance. I wouldn't know where else to put this. Without
Sorry for not being clear. I meant you shouldn't put the -O3 anywhere at 
all.
> "-mfpu=neon", the intrinsic code wouldn't even compile. And
without
> -O3.. the instructions that are generated are just painful. Please
> advise.
With what compiler? Using gcc 4.2 on my rather outdated Cortex A8 setup 
(Greg is working on getting a more modern one configured for us), the 
assembly output with -O3 and (the default) -O2 is identical. However, if 
someone actually wants to make an unoptimized build for some reason 
(e.g., testing or debugging), you've broken that.
> Now, that I unrolled it and tested it to make sure it works, is there
> a specific reason you think this is will be any slower than a simple
> loop and want to go back? I know it saves one load as compared to a
> simple loop. It would be hard to prove that it is measurably better
> than the simple loop. But I would sincerely prefer it this way.. It is
> fairly straight forward unroll in my opinion.
It makes sense to optimize for executed cycles or readability/code size, 
but this doesn't really do either.

You'll see the code in celt_pitch_xcorr_arm.s processes the last samples 
in groups of 2, 1, 1. It costs one extra compare (only one, because keep 
in mind the switch has an extra compare since gcc is not smart enough to 
prove that len==0 is impossible, despite our assert), but avoids an 
indirect jump. That's all probably irrelevant, as the bottlenecks here 
are going to be the stalls waiting for the loads and multiplies on the 
NEON side, and this has one fewer of both when len is 3 or 4. If you 
were going to fully unroll, that's kind of what I expected.

I agree that doing the extra load is suboptimal, but you can still loop 
the first three iterations, which means half the code. That's good for 
code cache size and readability (i.e., Don't Repeat Yourself).

If there were a measurable performance advantage to the switch 
statement, that would change my opinion, but if there isn't, then either 
of the other two approaches seem better.

One more nit I hadn't noticed before:
> +   float *xi = x;
> +   float *yi = y;
These need to be const float32_t (in both xcorr_kernel_neon_float and 
xcorr_kernel_neon_float_process1). They're currently causing a ton of 
warning spew. float32_t appears to not be considered equivalent to 
float, which means you'll also need casts here:
> +   vst1q_f32(sum, SUMM);
and here:
> +   vst1_lane_f32(sum, SUMM_2[0], 0);

Viswanath Puttagunta

2014-Dec-19 13:18 UTC

head link

[opus] [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

On 19 December 2014 at 05:26, Timothy B. Terriberry <tterribe at xiph.org>
wrote:> Viswanath Puttagunta wrote:
>> I responded to your feedback before I started on RFCv3.. and took your
>> silence as approval :).. I guess that email got lost in your inbox sea
>> some where.. so re-posting the responses.
>
> Sorry, I did see it but I guess I read it rather more quickly than I
> thought. Apologies for that.
>
>> guidance. I wouldn't know where else to put this. Without
>
> Sorry for not being clear. I meant you shouldn't put the -O3 anywhere
at
> all.
>
>> "-mfpu=neon", the intrinsic code wouldn't even compile.
And without
>> -O3.. the instructions that are generated are just painful. Please
>> advise.
>
> With what compiler? Using gcc 4.2 on my rather outdated Cortex A8 setup
> (Greg is working on getting a more modern one configured for us), the
> assembly output with -O3 and (the default) -O2 is identical. However, if
> someone actually wants to make an unoptimized build for some reason
> (e.g., testing or debugging), you've broken that.Thanks. After re-referring to the NEON Programmer's guide carefully
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0018a/index.html
the appropriate flag should have been "-ftree-vectorize".

I can also confirm that -O2 and -O3 produces almost same output. What
I was comparing when I meant "painful" is when I don't use any
optimization
level.. which I tried during unit testing. But it looks like for opus,
the default
optimization flag is -O2... so just adding -ftree-vectorize should be enough.

FYI, -O3 implied -ftree-vectorize.. and that is why it worked in the
past.
>
>> Now, that I unrolled it and tested it to make sure it works, is there
>> a specific reason you think this is will be any slower than a simple
>> loop and want to go back? I know it saves one load as compared to a
>> simple loop. It would be hard to prove that it is measurably better
>> than the simple loop. But I would sincerely prefer it this way.. It is
>> fairly straight forward unroll in my opinion.
>
> It makes sense to optimize for executed cycles or readability/code size,
> but this doesn't really do either.
>
> You'll see the code in celt_pitch_xcorr_arm.s processes the last
samples
> in groups of 2, 1, 1. It costs one extra compare (only one, because keep
> in mind the switch has an extra compare since gcc is not smart enough to
> prove that len==0 is impossible, despite our assert), but avoids an
> indirect jump. That's all probably irrelevant, as the bottlenecks here
> are going to be the stalls waiting for the loads and multiplies on the
> NEON side, and this has one fewer of both when len is 3 or 4. If you
> were going to fully unroll, that's kind of what I expected.
>
> I agree that doing the extra load is suboptimal, but you can still loop
> the first three iterations, which means half the code. That's good for
> code cache size and readability (i.e., Don't Repeat Yourself).
>
> If there were a measurable performance advantage to the switch
> statement, that would change my opinion, but if there isn't, then
either
> of the other two approaches seem better.
OK, thanks for your explanation. I will take a closer look at this part
of celt_pitch_xcorr_arm.s
>
>
> One more nit I hadn't noticed before:
>
>> +   float *xi = x;
>> +   float *yi = y;
>
> These need to be const float32_t (in both xcorr_kernel_neon_float and
> xcorr_kernel_neon_float_process1). They're currently causing a ton of
> warning spew. float32_t appears to not be considered equivalent to
> float, which means you'll also need casts here:
>
>> +   vst1q_f32(sum, SUMM);
>
> and here:
>
>> +   vst1_lane_f32(sum, SUMM_2[0], 0);
Thanks, will do.> _______________________________________________
> opus mailing list
> opus at xiph.org
> http://lists.xiph.org/mailman/listinfo/opus

Viswanath Puttagunta

2014-Dec-19 13:22 UTC

head link

[opus] [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

On 19 December 2014 at 07:18, Viswanath Puttagunta
<viswanath.puttagunta at linaro.org> wrote:> On 19 December 2014 at 05:26, Timothy B. Terriberry <tterribe at
xiph.org> wrote:
>> Viswanath Puttagunta wrote:
>>> I responded to your feedback before I started on RFCv3.. and took
your
>>> silence as approval :).. I guess that email got lost in your inbox
sea
>>> some where.. so re-posting the responses.
>>
>> Sorry, I did see it but I guess I read it rather more quickly than I
>> thought. Apologies for that.
>>
>>> guidance. I wouldn't know where else to put this. Without
>>
>> Sorry for not being clear. I meant you shouldn't put the -O3
anywhere at
>> all.
>>
>>> "-mfpu=neon", the intrinsic code wouldn't even
compile. And without
>>> -O3.. the instructions that are generated are just painful. Please
>>> advise.
>>
>> With what compiler? Using gcc 4.2 on my rather outdated Cortex A8 setup
>> (Greg is working on getting a more modern one configured for us), the
>> assembly output with -O3 and (the default) -O2 is identical. However,
if
>> someone actually wants to make an unoptimized build for some reason
>> (e.g., testing or debugging), you've broken that.
> Thanks. After re-referring to the NEON Programmer's guide carefully
>
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0018a/index.html
> the appropriate flag should have been "-ftree-vectorize".After further testing, just having -mfpu=neon is enough.. So, I'll just use
this
to keep it consistent with the checks in configure.ac
>
> I can also confirm that -O2 and -O3 produces almost same output. What
> I was comparing when I meant "painful" is when I don't use
any optimization
> level.. which I tried during unit testing. But it looks like for opus,
> the default
> optimization flag is -O2... so just adding -ftree-vectorize should be
enough.
>
> FYI, -O3 implied -ftree-vectorize.. and that is why it worked in the
> past.
>
>>
>>> Now, that I unrolled it and tested it to make sure it works, is
there
>>> a specific reason you think this is will be any slower than a
simple
>>> loop and want to go back? I know it saves one load as compared to a
>>> simple loop. It would be hard to prove that it is measurably better
>>> than the simple loop. But I would sincerely prefer it this way.. It
is
>>> fairly straight forward unroll in my opinion.
>>
>> It makes sense to optimize for executed cycles or readability/code
size,
>> but this doesn't really do either.
>>
>> You'll see the code in celt_pitch_xcorr_arm.s processes the last
samples
>> in groups of 2, 1, 1. It costs one extra compare (only one, because
keep
>> in mind the switch has an extra compare since gcc is not smart enough
to
>> prove that len==0 is impossible, despite our assert), but avoids an
>> indirect jump. That's all probably irrelevant, as the bottlenecks
here
>> are going to be the stalls waiting for the loads and multiplies on the
>> NEON side, and this has one fewer of both when len is 3 or 4. If you
>> were going to fully unroll, that's kind of what I expected.
>>
>> I agree that doing the extra load is suboptimal, but you can still loop
>> the first three iterations, which means half the code. That's good
for
>> code cache size and readability (i.e., Don't Repeat Yourself).
>>
>> If there were a measurable performance advantage to the switch
>> statement, that would change my opinion, but if there isn't, then
either
>> of the other two approaches seem better.
>
> OK, thanks for your explanation. I will take a closer look at this part
> of celt_pitch_xcorr_arm.s
>
>>
>>
>> One more nit I hadn't noticed before:
>>
>>> +   float *xi = x;
>>> +   float *yi = y;
>>
>> These need to be const float32_t (in both xcorr_kernel_neon_float and
>> xcorr_kernel_neon_float_process1). They're currently causing a ton
of
>> warning spew. float32_t appears to not be considered equivalent to
>> float, which means you'll also need casts here:
>>
>>> +   vst1q_f32(sum, SUMM);
>>
>> and here:
>>
>>> +   vst1_lane_f32(sum, SUMM_2[0], 0);
> Thanks, will do.
>> _______________________________________________
>> opus mailing list
>> opus at xiph.org
>> http://lists.xiph.org/mailman/listinfo/opus

Timothy B. Terriberry

2014-Dec-19 13:25 UTC

head link

[opus] [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

Viswanath Puttagunta wrote:> optimization flag is -O2... so just adding -ftree-vectorize should be
enough.
-ftree-vectorize is just for auto-vectorization, isn't it? What does 
that have to do with intrinsics?

Reasonably Related Threads

Search for more seemingly similar threads

opus - Dec 2014 - [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

[opus] [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

[opus] [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

[opus] [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

[opus] [RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

Reasonably Related Threads