thr3ads.net - opus - [opus] [OPUS] celt_inner_prod() and dual_inner

If this information is useful, please help other people find it:
Share via:

Ulrich Windl

2017-Jun-06 07:03 UTC

[opus] Antw: Re: [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics

>>> Linfeng Zhang <linfengz at google.com> schrieb am 06.06.2017
um 06:46 in Nachricht<CAKoqLCAfj+fDUMLfN4dLNSZ4NNAZpaSt_BWZRp+7XBqfhiSqiQ at
mail.gmail.com>:> Hi Jean-Marc,
> 
> I tried "==" before, and it failed when both results are 0.0.
Maybe the
> exponent or sign has difference because of the different 0.0 representation
> in NEON. If anybody know how to handle this 0.0 comparison, that would be
> great.
> Or just use if(a==b || (a==0.0 && b==0.0)) ... but I haven't
try this.
>From some faint memory of my math lessions I had produced code like this to
get the smallest floating-point number different from zero:
double  EPS;            /* smallest number not equal to 0.0 */

/* refined estimate of EPS */
static  double  get_EPS(double eps)
{

        while ( 1.0 + eps != 1.0 )
                eps /= 2;
        return(eps);
}

EPS = get_EPS(1.0);

On the x86_64 platform I get:
(gdb) p EPS
$1 = 1.1102230246251565e-16

Maybe it can help...

Regards,
Ulrich
> 
> Thanks,
> Linfeng
> 
> On Mon, Jun 5, 2017 at 8:43 PM Jean-Marc Valin <jmvalin at
jmvalin.ca> wrote:
> 
>> Hi Linfeng,
>>
>> On 05/06/17 03:31 PM, Linfeng Zhang wrote:
>> > Yes we'll have one more patch set related to xcorr in next
week. Please
>> > don't wait if it's too late for 1.2 release.
>>
>> Assuming there's no issue with the patches, next week isn't too
late.
>>
>> Also, I've started looking at your patches. So far there's one
thing
>> that puzzles me a bit. In the OPUS_CHECK_ASM section of patch 0004, you
>> have:
>>
>> +        celt_assert(ABS32(xy1_c - *xy1) <= VERY_SMALL);
>>
>> Given the normal range of the values (the xy values are often much
>> larger than one) and the precision involved here (24-bit mantissa), it
>> seems like this test can only succeed if the two values are actually
>> equal. Is the float patch actually bit-exact? If so, then maybe you
>> should be using actual equality. If not, then I guess we need to find
>> the right condition (which isn't obvious for floating point).
>>
>> Cheers,
>>
>>         Jean-Marc
>>
>>
>> > Thanks,
>> > Linfeng
>> >
>> > On Mon, Jun 5, 2017 at 12:28 PM, Linfeng Zhang <linfengz at
google.com
>> > <mailto:linfengz at google.com>> wrote:
>> >
>> >     Hi Jean-Marc,
>> >
>> >     I attached the new version in inner_prod_5patches_v2.zip which
>> >     synced to the current master.
>> >
>> >     For fixed-point ARM, only
>> >    
0003-Optimize-fixed-point-celt_inner_prod-and-dual_inner_.patch
>> >     changes the performance.
>> >     For floating-point ARM, only
>> >     0004-Optimize-floating-point-celt_inner_prod-and-dual_inn.pa
>> >     <http://elt_inner_prod-and-dual_inn.pa>tch changes the
performance.
>> >     Patch 1 and 2 are code clean-up and can only affect x86
performance.
>> >     Patch 5 has neglectable effect on floating-point ARM
performance.
>> >
>> >     Thanks,
>> >     Linfeng
>> >
>> >     On Fri, Jun 2, 2017 at 11:26 AM, Jean-Marc Valin <jmvalin
at jmvalin.ca
>> >     <mailto:jmvalin at jmvalin.ca>> wrote:
>> >
>> >         Hi Linfeng,
>> >
>> >         I'll look into your patches. Can you let me know
what's the
>> expected
>> >         effect on performance (if any) for each of your patches?
Also,
>> >         are these
>> >         all the patches you intend to merge for 1.2 or are there
more
>> >         upcoming ones?
>> >
>> >         Cheers,
>> >
>> >                 Jean-Marc
>> >
>> >         On 01/06/17 06:33 PM, Linfeng Zhang wrote:
>> >         > Hi,
>> >         >
>> >         > Attached are 5 patches related to celt_inner_prod()
>> >         > and dual_inner_prod() NEON intrinsics optimization.
>> >         >
>> >         > In
>> >        
0004-Optimize-floating-point-celt_inner_prod-and-dual_inn.pa
>> >         <http://elt_inner_prod-and-dual_inn.pa>tch, the
>> >         > optimization changed the order of floating-point
inner
>> >         products, which
>> >         > will change the results. I
>> >         > created celt_inner_prod_neon_float_c_simulation()
>> >         > and dual_inner_prod_neon_float_c_simulation() to
simulate the
>> >         order
>> >         > floating-point operations in NEON optimization and
compare
>> their
>> >         > results. Sorry that I cannot bond the distance
between
>> original C
>> >         > function and NEON function to any giving reasonable
small
>> >         number or
>> >         > ratio. It's easy to create an input which 0 and
1,000 are both
>> >         correct
>> >         > results by just manipulating the inner product order.
>> >         >
>> >         > The total speed gain is about 1.0% for fixed-point
encoder,
>> >         and 1.8% for
>> >         > floating-point encoder, in Complexity 8, tested on my
>> Chromebook.
>> >         >
>> >         > Thanks,
>> >         > Linfeng
>> >         >
>> >         >
>> >         > _______________________________________________
>> >         > opus mailing list
>> >         > opus at xiph.org <mailto:opus at xiph.org>
>> >         > http://lists.xiph.org/mailman/listinfo/opus 
>> >         <http://lists.xiph.org/mailman/listinfo/opus>
>> >         >
>> >
>> >
>> >
>>

Ian Malone

2017-Jun-06 11:39 UTC

head link

[opus] Antw: Re: [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics

On 6 June 2017 at 08:03, Ulrich Windl <Ulrich.Windl at
rz.uni-regensburg.de> wrote:>>>> Linfeng Zhang <linfengz at google.com> schrieb am
06.06.2017 um 06:46 in Nachricht
> <CAKoqLCAfj+fDUMLfN4dLNSZ4NNAZpaSt_BWZRp+7XBqfhiSqiQ at
mail.gmail.com>:
>> Hi Jean-Marc,
>>
>> I tried "==" before, and it failed when both results are 0.0.
Maybe the
>> exponent or sign has difference because of the different 0.0
representation
>> in NEON. If anybody know how to handle this 0.0 comparison, that would
be
>> great.
>> Or just use if(a==b || (a==0.0 && b==0.0)) ... but I
haven't try this.
>
>
> From some faint memory of my math lessions I had produced code like this to
get the smallest floating-point number different from zero:
>
> double  EPS;            /* smallest number not equal to 0.0 */
>
> /* refined estimate of EPS */
> static  double  get_EPS(double eps)
> {
>
>         while ( 1.0 + eps != 1.0 )
>                 eps /= 2;
>         return(eps);
> }
>
> EPS = get_EPS(1.0);
>
> On the x86_64 platform I get:
> (gdb) p EPS
> $1 = 1.1102230246251565e-16
>
That's an estimate of the precision of the floating point
representation (does adding it change the value of 1.0), rather than
the smallest magnitude number representable not zero.

-- 
imalone
http://ibmalone.blogspot.co.uk

Linfeng Zhang

2017-Jun-06 18:27 UTC

head link

[opus] [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics

Thank Ulrich!

Yes, using
        celt_assert(1.0 + celt_inner_prod_neon_float_c_simulation(x, y, N)
== 1.0 + xy);
        celt_assert(1.0 + xy1_c == 1.0 + *xy1);
        celt_assert(1.0 + xy2_c == 1.0 + *xy2);
can avoid the useage of VERY_SMALL.

Hi Jean-Marc,

I added
    {
        const opus_val32 xy_c = celt_inner_prod_neon_float_c_simulation(x,
y, N);
        const int32_t *x_bin = (int32_t*)x;
        const int32_t *y_bin = (int32_t*)y;
        const int32_t *xy_bin = (int32_t*)&xy;
        const int32_t *xy_bin_c = (int32_t*)&xy_c;
        // if((xy_c != xy) && (xy_c != 0.0) && (xy != 0.0)) {
        if(xy_c != xy) {
            printf("\n xy_c = %9f, xy   = %9f", xy_c, xy);
            printf(" | xy_c = %13e, xy   = %13e", xy_c, xy);
            printf(" | xy_c (bin) = 0x%8x, xy   (bin) = 0x%8x\n",
*xy_bin_c, *xy_bin);
            printf("\n N = %d", N);
            for (i = 0; i < N; i++) {
              printf("\n x[%d] = %9f, y[%d] = %9f", i, x[i], i, y[i]);
              printf(" | x[%d] = %13e, y[%d] = %13e", i, x[i], i,
y[i]);
              printf(" | x[%d] (bin) = 0x%8x, y[%d] (bin) = 0x%8x", i,
x_bin[i], i, y_bin[i]);
            }
            printf("\n\n");
        }
    }

And got the following two cases when testing speech_mono_32_48kHz.pcm
(Download: https://drive.google.com/file/d/0B2bjttuYjfVYaHBDZE1XV3B0MHM) on
NEON:

$ ./opus_demo -e voip  48000 1 32000 -complexity  8
speech_mono_32_48kHz.pcm  tmp.opus

libopus 1.2-beta-27-g6c51a195-dirty
Encoding 48000 Hz input at 32.000 kb/s in auto bandwidth with 960-sample
frames.

 xy_c =  0.000000, xy   =  0.000000 | xy_c =  5.605194e-45, xy    0.000000e+00 |
xy_c (bin) = 0x       4, xy   (bin) = 0x       0

 N = 8
 x[0] = -0.000000, y[0] = -0.000000 | x[0] = -7.783648e-23, y[0] -7.783648e-23 |
x[0] (bin) = 0x9abc3273, y[0] (bin) = 0x9abc3273
 x[1] = -0.000000, y[1] = -0.000000 | x[1] = -1.862279e-23, y[1] -1.862279e-23 |
x[1] (bin) = 0x99b41bca, y[1] (bin) = 0x99b41bca
 x[2] =  0.000000, y[2] =  0.000000 | x[2] =  1.092297e-23, y[2]  1.092297e-23 |
x[2] (bin) = 0x195347ee, y[2] (bin) = 0x195347ee
 x[3] = -0.000000, y[3] = -0.000000 | x[3] = -5.171255e-25, y[3] -5.171255e-25 |
x[3] (bin) = 0x97200ae8, y[3] (bin) = 0x97200ae8
 x[4] = -0.000000, y[4] = -0.000000 | x[4] = -4.773915e-24, y[4] -4.773915e-24 |
x[4] (bin) = 0x98b8ae90, y[4] (bin) = 0x98b8ae90
 x[5] = -0.000000, y[5] = -0.000000 | x[5] = -3.717311e-25, y[5] -3.717311e-25 |
x[5] (bin) = 0x96e61724, y[5] (bin) = 0x96e61724
 x[6] = -0.000000, y[6] = -0.000000 | x[6] = -1.724025e-24, y[6] -1.724025e-24 |
x[6] (bin) = 0x980563d5, y[6] (bin) = 0x980563d5
 x[7] = -0.000000, y[7] = -0.000000 | x[7] = -2.245937e-24, y[7] -2.245937e-24 |
x[7] (bin) = 0x982dc55f, y[7] (bin) = 0x982dc55f

=============================================================
 xy_c =  0.000000, xy   =  0.000000 | xy_c =  1.121039e-44, xy    0.000000e+00 |
xy_c (bin) = 0x       8, xy   (bin) = 0x       0

 N = 8
 x[0] = -0.000000, y[0] = -0.000000 | x[0] = -1.000134e-22, y[0] -1.000134e-22 |
x[0] (bin) = 0x9af1d148, y[0] (bin) = 0x9af1d148
 x[1] =  0.000000, y[1] =  0.000000 | x[1] =  3.052170e-23, y[1]  3.052170e-23 |
x[1] (bin) = 0x1a139809, y[1] (bin) = 0x1a139809
 x[2] = -0.000000, y[2] = -0.000000 | x[2] = -2.135591e-23, y[2] -2.135591e-23 |
x[2] (bin) = 0x99ce8aaf, y[2] (bin) = 0x99ce8aaf
 x[3] =  0.000000, y[3] =  0.000000 | x[3] =  1.180839e-23, y[3]  1.180839e-23 |
x[3] (bin) = 0x19646856, y[3] (bin) = 0x19646856
 x[4] = -0.000000, y[4] = -0.000000 | x[4] = -1.230446e-23, y[4] -1.230446e-23 |
x[4] (bin) = 0x996e00bc, y[4] (bin) = 0x996e00bc
 x[5] =  0.000000, y[5] =  0.000000 | x[5] =  6.443248e-24, y[5]  6.443248e-24 |
x[5] (bin) = 0x18f942d6, y[5] (bin) = 0x18f942d6
 x[6] = -0.000000, y[6] = -0.000000 | x[6] = -8.497414e-24, y[6] -8.497414e-24 |
x[6] (bin) = 0x99245d28, y[6] (bin) = 0x99245d28
 x[7] =  0.000000, y[7] =  0.000000 | x[7] =  3.849347e-24, y[7]  3.849347e-24 |
x[7] (bin) = 0x1894ea17, y[7] (bin) = 0x1894ea17

There are 3 possible reasons.
1. Of course celt_inner_prod_neon_float_c_simulation() may have bug. Please
help me find if any.
2. Though impossible, it's possible NEON is not IEEE 754-compliant when
dealing with near 0 floating-point values.
3. Though more impossible, it's possible gcc is not IEEE 754-compliant
here. :)

Since x[i] == y[i] in both cases, they are actually calculating the energy.
(-1.000134e-22 * -1.000134e-22) is larger than the smallest
single-precision number and should be represented as none-zero (such as
0x8). I don't know why NEON gives 0 result.

Thanks,
Linfeng

On Tue, Jun 6, 2017 at 12:03 AM, Ulrich Windl <Ulrich.Windl at
rz.uni-regensbur
g.de> wrote:
> >>> Linfeng Zhang <linfengz at google.com> schrieb am
06.06.2017 um 06:46 in
> Nachricht
> <CAKoqLCAfj+fDUMLfN4dLNSZ4NNAZpaSt_BWZRp+7XBqfhiSqiQ at
mail.gmail.com>:
> > Hi Jean-Marc,
> >
> > I tried "==" before, and it failed when both results are
0.0. Maybe the
> > exponent or sign has difference because of the different 0.0
> representation
> > in NEON. If anybody know how to handle this 0.0 comparison, that would
be
> > great.
> > Or just use if(a==b || (a==0.0 && b==0.0)) ... but I
haven't try this.
>
>
> From some faint memory of my math lessions I had produced code like this
> to get the smallest floating-point number different from zero:
>
> double  EPS;            /* smallest number not equal to 0.0 */
>
> /* refined estimate of EPS */
> static  double  get_EPS(double eps)
> {
>
>         while ( 1.0 + eps != 1.0 )
>                 eps /= 2;
>         return(eps);
> }
>
> EPS = get_EPS(1.0);
>
> On the x86_64 platform I get:
> (gdb) p EPS
> $1 = 1.1102230246251565e-16
>
> Maybe it can help...
>
> Regards,
> Ulrich
>
> >
> > Thanks,
> > Linfeng
> >
> > On Mon, Jun 5, 2017 at 8:43 PM Jean-Marc Valin <jmvalin at
jmvalin.ca>
> wrote:
> >
> >> Hi Linfeng,
> >>
> >> On 05/06/17 03:31 PM, Linfeng Zhang wrote:
> >> > Yes we'll have one more patch set related to xcorr in
next week.
> Please
> >> > don't wait if it's too late for 1.2 release.
> >>
> >> Assuming there's no issue with the patches, next week
isn't too late.
> >>
> >> Also, I've started looking at your patches. So far there's
one thing
> >> that puzzles me a bit. In the OPUS_CHECK_ASM section of patch
0004, you
> >> have:
> >>
> >> +        celt_assert(ABS32(xy1_c - *xy1) <= VERY_SMALL);
> >>
> >> Given the normal range of the values (the xy values are often much
> >> larger than one) and the precision involved here (24-bit
mantissa), it
> >> seems like this test can only succeed if the two values are
actually
> >> equal. Is the float patch actually bit-exact? If so, then maybe
you
> >> should be using actual equality. If not, then I guess we need to
find
> >> the right condition (which isn't obvious for floating point).
> >>
> >> Cheers,
> >>
> >>         Jean-Marc
> >>
> >>
> >> > Thanks,
> >> > Linfeng
> >> >
> >> > On Mon, Jun 5, 2017 at 12:28 PM, Linfeng Zhang <linfengz
at google.com
> >> > <mailto:linfengz at google.com>> wrote:
> >> >
> >> >     Hi Jean-Marc,
> >> >
> >> >     I attached the new version in inner_prod_5patches_v2.zip
which
> >> >     synced to the current master.
> >> >
> >> >     For fixed-point ARM, only
> >> >    
0003-Optimize-fixed-point-celt_inner_prod-and-dual_inner_.patch
> >> >     changes the performance.
> >> >     For floating-point ARM, only
> >> >    
0004-Optimize-floating-point-celt_inner_prod-and-dual_inn.pa
> >> >     <http://elt_inner_prod-and-dual_inn.pa>tch changes
the
> performance.
> >> >     Patch 1 and 2 are code clean-up and can only affect x86
> performance.
> >> >     Patch 5 has neglectable effect on floating-point ARM
performance.
> >> >
> >> >     Thanks,
> >> >     Linfeng
> >> >
> >> >     On Fri, Jun 2, 2017 at 11:26 AM, Jean-Marc Valin <
> jmvalin at jmvalin.ca
> >> >     <mailto:jmvalin at jmvalin.ca>> wrote:
> >> >
> >> >         Hi Linfeng,
> >> >
> >> >         I'll look into your patches. Can you let me know
what's the
> >> expected
> >> >         effect on performance (if any) for each of your
patches? Also,
> >> >         are these
> >> >         all the patches you intend to merge for 1.2 or are
there more
> >> >         upcoming ones?
> >> >
> >> >         Cheers,
> >> >
> >> >                 Jean-Marc
> >> >
> >> >         On 01/06/17 06:33 PM, Linfeng Zhang wrote:
> >> >         > Hi,
> >> >         >
> >> >         > Attached are 5 patches related to
celt_inner_prod()
> >> >         > and dual_inner_prod() NEON intrinsics
optimization.
> >> >         >
> >> >         > In
> >> >        
0004-Optimize-floating-point-celt_inner_prod-and-dual_inn.pa
> >> >         <http://elt_inner_prod-and-dual_inn.pa>tch, the
> >> >         > optimization changed the order of floating-point
inner
> >> >         products, which
> >> >         > will change the results. I
> >> >         > created
celt_inner_prod_neon_float_c_simulation()
> >> >         > and dual_inner_prod_neon_float_c_simulation() to
simulate
> the
> >> >         order
> >> >         > floating-point operations in NEON optimization
and compare
> >> their
> >> >         > results. Sorry that I cannot bond the distance
between
> >> original C
> >> >         > function and NEON function to any giving
reasonable small
> >> >         number or
> >> >         > ratio. It's easy to create an input which 0
and 1,000 are
> both
> >> >         correct
> >> >         > results by just manipulating the inner product
order.
> >> >         >
> >> >         > The total speed gain is about 1.0% for
fixed-point encoder,
> >> >         and 1.8% for
> >> >         > floating-point encoder, in Complexity 8, tested
on my
> >> Chromebook.
> >> >         >
> >> >         > Thanks,
> >> >         > Linfeng
> >> >         >
> >> >         >
> >> >         > _______________________________________________
> >> >         > opus mailing list
> >> >         > opus at xiph.org <mailto:opus at xiph.org>
> >> >         > http://lists.xiph.org/mailman/listinfo/opus
> >> >         <http://lists.xiph.org/mailman/listinfo/opus>
> >> >         >
> >> >
> >> >
> >> >
> >>
>
>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xiph.org/pipermail/opus/attachments/20170606/b0777ead/attachment-0001.html>

Jean-Marc Valin

2017-Jun-06 18:40 UTC

head link

[opus] [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics

OK, so it looks like the issue is with denormalized numbers. I did a bit
of searching, but it's not clear whether Neon handles then properly (vs
setting then to zero). If that's indeed the issue, then I'm OK with your
original patch (using VERY_SMALL, not 1.0+xy), but I'd like to make sure
that it's indeed the issue.

Cheers,

	Jean-Marc

On 06/06/17 02:27 PM, Linfeng Zhang wrote:> Thank Ulrich!
> 
> Yes, using
>         celt_assert(1.0 + celt_inner_prod_neon_float_c_simulation(x, y,
> N) == 1.0 + xy);
>         celt_assert(1.0 + xy1_c == 1.0 + *xy1);
>         celt_assert(1.0 + xy2_c == 1.0 + *xy2);
> can avoid the useage of VERY_SMALL.
> 
> Hi Jean-Marc,
> 
> I added
>     {
>         const opus_val32 xy_c >
celt_inner_prod_neon_float_c_simulation(x, y, N);
>         const int32_t *x_bin = (int32_t*)x;
>         const int32_t *y_bin = (int32_t*)y;
>         const int32_t *xy_bin = (int32_t*)&xy;
>         const int32_t *xy_bin_c = (int32_t*)&xy_c;
>         // if((xy_c != xy) && (xy_c != 0.0) && (xy != 0.0))
{
>         if(xy_c != xy) {
>             printf("\n xy_c = %9f, xy   = %9f", xy_c, xy);
>             printf(" | xy_c = %13e, xy   = %13e", xy_c, xy);
>             printf(" | xy_c (bin) = 0x%8x, xy   (bin) = 0x%8x\n",
> *xy_bin_c, *xy_bin);
>             printf("\n N = %d", N);
>             for (i = 0; i < N; i++) {
>               printf("\n x[%d] = %9f, y[%d] = %9f", i, x[i], i,
y[i]);
>               printf(" | x[%d] = %13e, y[%d] = %13e", i, x[i], i,
y[i]);
>               printf(" | x[%d] (bin) = 0x%8x, y[%d] (bin) =
0x%8x", i,
> x_bin[i], i, y_bin[i]);
>             }
>             printf("\n\n");
>         }
>     }
> 
> And got the following two cases when testing speech_mono_32_48kHz.pcm
> (Download: https://drive.google.com/file/d/0B2bjttuYjfVYaHBDZE1XV3B0MHM
> <https://drive.google.com/file/d/0B2bjttuYjfVYaHBDZE1XV3B0MHM>) on
NEON:
> 
> $ ./opus_demo -e voip  48000 1 32000 -complexity  8
> speech_mono_32_48kHz.pcm  tmp.opus
> 
> libopus 1.2-beta-27-g6c51a195-dirty
> Encoding 48000 Hz input at 32.000 kb/s in auto bandwidth with 960-sample
> frames.
> 
>  xy_c =  0.000000, xy   =  0.000000 | xy_c =  5.605194e-45, xy   > 
0.000000e+00 | xy_c (bin) = 0x       4, xy   (bin) = 0x       0
> 
>  N = 8
>  x[0] = -0.000000, y[0] = -0.000000 | x[0] = -7.783648e-23, y[0] >
-7.783648e-23 | x[0] (bin) = 0x9abc3273, y[0] (bin) = 0x9abc3273
>  x[1] = -0.000000, y[1] = -0.000000 | x[1] = -1.862279e-23, y[1] >
-1.862279e-23 | x[1] (bin) = 0x99b41bca, y[1] (bin) = 0x99b41bca
>  x[2] =  0.000000, y[2] =  0.000000 | x[2] =  1.092297e-23, y[2] > 
1.092297e-23 | x[2] (bin) = 0x195347ee, y[2] (bin) = 0x195347ee
>  x[3] = -0.000000, y[3] = -0.000000 | x[3] = -5.171255e-25, y[3] >
-5.171255e-25 | x[3] (bin) = 0x97200ae8, y[3] (bin) = 0x97200ae8
>  x[4] = -0.000000, y[4] = -0.000000 | x[4] = -4.773915e-24, y[4] >
-4.773915e-24 | x[4] (bin) = 0x98b8ae90, y[4] (bin) = 0x98b8ae90
>  x[5] = -0.000000, y[5] = -0.000000 | x[5] = -3.717311e-25, y[5] >
-3.717311e-25 | x[5] (bin) = 0x96e61724, y[5] (bin) = 0x96e61724
>  x[6] = -0.000000, y[6] = -0.000000 | x[6] = -1.724025e-24, y[6] >
-1.724025e-24 | x[6] (bin) = 0x980563d5, y[6] (bin) = 0x980563d5
>  x[7] = -0.000000, y[7] = -0.000000 | x[7] = -2.245937e-24, y[7] >
-2.245937e-24 | x[7] (bin) = 0x982dc55f, y[7] (bin) = 0x982dc55f
> 
> =============================================================> 
>  xy_c =  0.000000, xy   =  0.000000 | xy_c =  1.121039e-44, xy   > 
0.000000e+00 | xy_c (bin) = 0x       8, xy   (bin) = 0x       0
> 
>  N = 8
>  x[0] = -0.000000, y[0] = -0.000000 | x[0] = -1.000134e-22, y[0] >
-1.000134e-22 | x[0] (bin) = 0x9af1d148, y[0] (bin) = 0x9af1d148
>  x[1] =  0.000000, y[1] =  0.000000 | x[1] =  3.052170e-23, y[1] > 
3.052170e-23 | x[1] (bin) = 0x1a139809, y[1] (bin) = 0x1a139809
>  x[2] = -0.000000, y[2] = -0.000000 | x[2] = -2.135591e-23, y[2] >
-2.135591e-23 | x[2] (bin) = 0x99ce8aaf, y[2] (bin) = 0x99ce8aaf
>  x[3] =  0.000000, y[3] =  0.000000 | x[3] =  1.180839e-23, y[3] > 
1.180839e-23 | x[3] (bin) = 0x19646856, y[3] (bin) = 0x19646856
>  x[4] = -0.000000, y[4] = -0.000000 | x[4] = -1.230446e-23, y[4] >
-1.230446e-23 | x[4] (bin) = 0x996e00bc, y[4] (bin) = 0x996e00bc
>  x[5] =  0.000000, y[5] =  0.000000 | x[5] =  6.443248e-24, y[5] > 
6.443248e-24 | x[5] (bin) = 0x18f942d6, y[5] (bin) = 0x18f942d6
>  x[6] = -0.000000, y[6] = -0.000000 | x[6] = -8.497414e-24, y[6] >
-8.497414e-24 | x[6] (bin) = 0x99245d28, y[6] (bin) = 0x99245d28
>  x[7] =  0.000000, y[7] =  0.000000 | x[7] =  3.849347e-24, y[7] > 
3.849347e-24 | x[7] (bin) = 0x1894ea17, y[7] (bin) = 0x1894ea17
> 
> There are 3 possible reasons.
> 1. Of course celt_inner_prod_neon_float_c_simulation() may have bug.
> Please help me find if any.
> 2. Though impossible, it's possible NEON is not IEEE 754-compliant when
> dealing with near 0 floating-point values.
> 3. Though more impossible, it's possible gcc is not IEEE 754-compliant
> here. :)
> 
> Since x[i] == y[i] in both cases, they are actually calculating the energy.
> (-1.000134e-22 * -1.000134e-22) is larger than the smallest
> single-precision number and should be represented as none-zero (such as
> 0x8). I don't know why NEON gives 0 result.
> 
> Thanks,
> Linfeng
> 
> On Tue, Jun 6, 2017 at 12:03 AM, Ulrich Windl
> <Ulrich.Windl at rz.uni-regensburg.de
> <mailto:Ulrich.Windl at rz.uni-regensburg.de>> wrote:
> 
>     >>> Linfeng Zhang <linfengz at google.com
<mailto:linfengz at google.com>>
>     schrieb am 06.06.2017 um 06:46 in Nachricht
>     <CAKoqLCAfj+fDUMLfN4dLNSZ4NNAZpaSt_BWZRp+7XBqfhiSqiQ at
mail.gmail.com
>     <mailto:CAKoqLCAfj%2BfDUMLfN4dLNSZ4NNAZpaSt_BWZRp%2B7XBqfhiSqiQ at
mail.gmail.com>>:
>     > Hi Jean-Marc,
>     >
>     > I tried "==" before, and it failed when both results are
0.0. Maybe the
>     > exponent or sign has difference because of the different 0.0
representation
>     > in NEON. If anybody know how to handle this 0.0 comparison, that
would be
>     > great.
>     > Or just use if(a==b || (a==0.0 && b==0.0)) ... but I
haven't try this.
> 
> 
>     From some faint memory of my math lessions I had produced code like
>     this to get the smallest floating-point number different from zero:
> 
>     double  EPS;            /* smallest number not equal to 0.0 */
> 
>     /* refined estimate of EPS */
>     static  double  get_EPS(double eps)
>     {
> 
>             while ( 1.0 + eps != 1.0 )
>                     eps /= 2;
>             return(eps);
>     }
> 
>     EPS = get_EPS(1.0);
> 
>     On the x86_64 platform I get:
>     (gdb) p EPS
>     $1 = 1.1102230246251565e-16
> 
>     Maybe it can help...
> 
>     Regards,
>     Ulrich
> 
>     >
>     > Thanks,
>     > Linfeng
>     >
>     > On Mon, Jun 5, 2017 at 8:43 PM Jean-Marc Valin <jmvalin at
jmvalin.ca
>     <mailto:jmvalin at jmvalin.ca>> wrote:
>     >
>     >> Hi Linfeng,
>     >>
>     >> On 05/06/17 03:31 PM, Linfeng Zhang wrote:
>     >> > Yes we'll have one more patch set related to xcorr in
next
>     week. Please
>     >> > don't wait if it's too late for 1.2 release.
>     >>
>     >> Assuming there's no issue with the patches, next week
isn't too late.
>     >>
>     >> Also, I've started looking at your patches. So far
there's one thing
>     >> that puzzles me a bit. In the OPUS_CHECK_ASM section of patch
>     0004, you
>     >> have:
>     >>
>     >> +        celt_assert(ABS32(xy1_c - *xy1) <= VERY_SMALL);
>     >>
>     >> Given the normal range of the values (the xy values are often
much
>     >> larger than one) and the precision involved here (24-bit
>     mantissa), it
>     >> seems like this test can only succeed if the two values are
actually
>     >> equal. Is the float patch actually bit-exact? If so, then
maybe you
>     >> should be using actual equality. If not, then I guess we need
to find
>     >> the right condition (which isn't obvious for floating
point).
>     >>
>     >> Cheers,
>     >>
>     >>         Jean-Marc
>     >>
>     >>
>     >> > Thanks,
>     >> > Linfeng
>     >> >
>     >> > On Mon, Jun 5, 2017 at 12:28 PM, Linfeng Zhang
>     <linfengz at google.com <mailto:linfengz at google.com>
>     >> > <mailto:linfengz at google.com <mailto:linfengz at
google.com>>> wrote:
>     >> >
>     >> >     Hi Jean-Marc,
>     >> >
>     >> >     I attached the new version in
inner_prod_5patches_v2.zip which
>     >> >     synced to the current master.
>     >> >
>     >> >     For fixed-point ARM, only
>     >> >    
0003-Optimize-fixed-point-celt_inner_prod-and-dual_inner_.patch
>     >> >     changes the performance.
>     >> >     For floating-point ARM, only
>     >> >    
0004-Optimize-floating-point-celt_inner_prod-and-dual_inn.pa
>     >> >     <http://elt_inner_prod-and-dual_inn.pa
>     <http://elt_inner_prod-and-dual_inn.pa>>tch changes the
performance.
>     >> >     Patch 1 and 2 are code clean-up and can only affect
x86
>     performance.
>     >> >     Patch 5 has neglectable effect on floating-point ARM
>     performance.
>     >> >
>     >> >     Thanks,
>     >> >     Linfeng
>     >> >
>     >> >     On Fri, Jun 2, 2017 at 11:26 AM, Jean-Marc Valin
>     <jmvalin at jmvalin.ca <mailto:jmvalin at jmvalin.ca>
>     >> >     <mailto:jmvalin at jmvalin.ca <mailto:jmvalin
at jmvalin.ca>>> wrote:
>     >> >
>     >> >         Hi Linfeng,
>     >> >
>     >> >         I'll look into your patches. Can you let me
know what's the
>     >> expected
>     >> >         effect on performance (if any) for each of your
>     patches? Also,
>     >> >         are these
>     >> >         all the patches you intend to merge for 1.2 or
are
>     there more
>     >> >         upcoming ones?
>     >> >
>     >> >         Cheers,
>     >> >
>     >> >                 Jean-Marc
>     >> >
>     >> >         On 01/06/17 06:33 PM, Linfeng Zhang wrote:
>     >> >         > Hi,
>     >> >         >
>     >> >         > Attached are 5 patches related to
celt_inner_prod()
>     >> >         > and dual_inner_prod() NEON intrinsics
optimization.
>     >> >         >
>     >> >         > In
>     >> >       
>      0004-Optimize-floating-point-celt_inner_prod-and-dual_inn.pa
>     >> >         <http://elt_inner_prod-and-dual_inn.pa
>     <http://elt_inner_prod-and-dual_inn.pa>>tch, the
>     >> >         > optimization changed the order of
floating-point inner
>     >> >         products, which
>     >> >         > will change the results. I
>     >> >         > created
celt_inner_prod_neon_float_c_simulation()
>     >> >         > and
dual_inner_prod_neon_float_c_simulation() to
>     simulate the
>     >> >         order
>     >> >         > floating-point operations in NEON
optimization and
>     compare
>     >> their
>     >> >         > results. Sorry that I cannot bond the
distance between
>     >> original C
>     >> >         > function and NEON function to any giving
reasonable small
>     >> >         number or
>     >> >         > ratio. It's easy to create an input
which 0 and 1,000
>     are both
>     >> >         correct
>     >> >         > results by just manipulating the inner
product order.
>     >> >         >
>     >> >         > The total speed gain is about 1.0% for
fixed-point
>     encoder,
>     >> >         and 1.8% for
>     >> >         > floating-point encoder, in Complexity 8,
tested on my
>     >> Chromebook.
>     >> >         >
>     >> >         > Thanks,
>     >> >         > Linfeng
>     >> >         >
>     >> >         >
>     >> >         >
_______________________________________________
>     >> >         > opus mailing list
>     >> >         > opus at xiph.org <mailto:opus at
xiph.org>
>     <mailto:opus at xiph.org <mailto:opus at xiph.org>>
>     >> >         > http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>
>     >> >         <http://lists.xiph.org/mailman/listinfo/opus
>     <http://lists.xiph.org/mailman/listinfo/opus>>
>     >> >         >
>     >> >
>     >> >
>     >> >
>     >>
> 
> 
> 
>

Bjorn Reese

2017-Jun-06 21:29 UTC

head link

[opus] [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics

On 06/06/2017 08:40 PM, Jean-Marc Valin wrote:> OK, so it looks like the issue is with denormalized numbers. I did a bit
> of searching, but it's not clear whether Neon handles then properly (vs
> setting then to zero). If that's indeed the issue, then I'm OK with
your
> original patch (using VERY_SMALL, not 1.0+xy), but I'd like to make
sure
> that it's indeed the issue.
Check with "gcc -dM -E -x c /dev/null | grep DENORM"\

Apparently Analagous Threads

Search for more reasonably related threads

opus - Jun 2017 - celt_inner_prod() and dual_inner_prod() NEON intrinsics

[opus] Antw: Re: [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics

[opus] Antw: Re: [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics

[opus] [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics

[opus] [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics

[opus] [OPUS] celt_inner_prod() and dual_inner_prod() NEON intrinsics

Apparently Analagous Threads