Yes, those instructions exist, although they're a bit slower than the basic
16x16->32 multiply with 32-bit accumulation (SMLABB). So I'd be surprised if
the function with 64-bit accumulation ran as fast as the current code. I don't
know how much we care about 16-bit platforms. And accuracy should not matter.

On the other hand, a 64-bit implementation is much cleaner/shorter, which is
always a good argument :-)
All in all, no strong preference from my side.
koen.

On Fri, Jun 20, 2014 at 12:40 PM, Timothy B. Terriberry <tterribe at xiph.org> wrote:

> KOEN VOS wrote:
>> The current implementation is faster on a 32- or 16-bit processor. If you
>> would find the opposite to be true (i.e. that a 64-bit implementation is
>> faster on, say, a 32-bit ARM CPU) then perhaps we should reconsider.
>
> Doesn't ARMv6 have a dual signed 16x16->32 multiply with a 64-bit
> accumulator (SMLALD)? Even v5E should have a single 16x16->32 with a 64-bit
> accumulator (SMLALBB). I would think a 64-bit version could be made pretty
> fast on 32-bit ARM, without even resorting to SIMD.
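[As a concrete illustration of the pattern being discussed (this sketch is not
from the thread, and the function name is made up): a 16x16->32 product added
into a 64-bit accumulator is exactly the operation SMLALBB provides, so a plain
C loop like the one below can map to it on ARMv5E and later without any
explicit SIMD; on ARMv6, pairs of samples can in principle go through SMLALD.]

#include <stdint.h>

/* Illustrative only: 16x16->32 products accumulated into 64 bits.
 * The multiply-accumulate in the loop body is the SMLALBB pattern,
 * so a compiler targeting ARMv5E+ may emit it directly. */
static int64_t sum_sqr64(const int16_t *x, int len)
{
    int64_t acc = 0;
    for (int i = 0; i < len; i++) {
        acc += (int32_t)x[i] * x[i];   /* 16x16->32, widened into the 64-bit sum */
    }
    return acc;
}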
Yes, regarding the unsigned-to-signed conversion you are right, it is
implementation-defined. I just had an issue a couple of years ago with a
compiler which incorrectly treated unsigned overflow as undefined rather than
implementation-defined.

Regarding the 64-bit profiling: I looked at the disassembly
(gcc -c -S -O2 ../opus/silk/sum_sqr_shift.c -I../opus/include -I../opus/celt)
of the 64-bit accumulator version (unrolled twice like the current code) and
found that, as well as having only one loop, the loop has 12 instructions per
iteration.

The current version (after fixing the bug) gives 12 instructions per iteration
until shift becomes non-zero (first loop). After that there are 16 instructions
per iteration in the second loop, and that is without counting the additional
instructions we jump to when we need to handle the overflow and increase the
shift.

Although this is not proper profiling, it is good enough for me. And
considering the increase in code clarity, I would definitely go for the 64-bit
version.

Just for kicks, I also made an experimental version that uses 32 bits but
leaves 8 bits of margin before accumulating, and that shaves 2 instructions per
loop, making it 10. All in all, I am not sure it is worth the hassle though,
considering this is only about 1/order of the cost of the autocorrelation.

Marcello


From: KOEN VOS <koenvos74 at gmail.com>
Date: Friday, 20 June 2014 22:04
To: "Timothy B. Terriberry" <tterribe at xiph.org>
Cc: "opus at xiph.org" <opus at xiph.org>, Marcello Caramma <mcaramma at cisco.com>
Subject: Re: [opus] Alleged bug in Silk codec

Yes, those instructions exist, although they're a bit slower than the basic
16x16->32 multiply with 32-bit accumulation (SMLABB). So I'd be surprised if
the function with 64-bit accumulation ran as fast as the current code. I don't
know how much we care about 16-bit platforms. And accuracy should not matter.

On the other hand, a 64-bit implementation is much cleaner/shorter, which is
always a good argument :-)
All in all, no strong preference from my side.
koen.

On Fri, Jun 20, 2014 at 12:40 PM, Timothy B. Terriberry <tterribe at xiph.org> wrote:

> KOEN VOS wrote:
>> The current implementation is faster on a 32- or 16-bit processor. If you
>> would find the opposite to be true (i.e. that a 64-bit implementation is
>> faster on, say, a 32-bit ARM CPU) then perhaps we should reconsider.
>
> Doesn't ARMv6 have a dual signed 16x16->32 multiply with a 64-bit
> accumulator (SMLALD)? Even v5E should have a single 16x16->32 with a 64-bit
> accumulator (SMLALBB). I would think a 64-bit version could be made pretty
> fast on 32-bit ARM, without even resorting to SIMD.
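[For reference, a rough sketch of the single-loop, 64-bit accumulator shape
described above. This is illustrative only, not the merged patch: the function
name, the plain C99 types, and the way the shift is chosen here are simplified
assumptions rather than the real silk_sum_sqr_shift() behaviour.]

#include <stdint.h>

/* Sketch only: one loop, unrolled by two, with no mid-loop overflow
 * handling because the accumulator is 64 bits wide. */
static void sum_sqr_shift64_sketch(int32_t *energy, int *shift,
                                   const int16_t *x, int len)
{
    int64_t acc = 0;
    int i, shft;

    for (i = 0; i + 1 < len; i += 2) {
        acc += (int32_t)x[i]     * x[i];
        acc += (int32_t)x[i + 1] * x[i + 1];
    }
    if (i < len) {
        acc += (int32_t)x[i] * x[i];      /* odd tail sample */
    }

    /* Pick the smallest shift that brings the energy into 31 bits. */
    shft = 0;
    while ((acc >> shft) > INT32_MAX) {
        shft++;
    }
    *shift  = shft;
    *energy = (int32_t)(acc >> shft);
}

[The point of the 64-bit accumulator is that no overflow check is needed inside
the loop, which is what the second loop and overflow branch of the current code
are there for, and which is where its extra instructions per iteration come
from.]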
Hi Marcello,

FYI I just pushed the 64-bit C0 patch.

Cheers,

	Jean-Marc

On 25/06/14 06:37 AM, Marcello Caramma (mcaramma) wrote:
> Yes, regarding the unsigned-to-signed conversion you are right, it is
> implementation-defined. I just had an issue a couple of years ago with a
> compiler which incorrectly treated unsigned overflow as undefined rather
> than implementation-defined.
>
> Regarding the 64-bit profiling: I looked at the disassembly
> (gcc -c -S -O2 ../opus/silk/sum_sqr_shift.c -I../opus/include -I../opus/celt)
> of the 64-bit accumulator version (unrolled twice like the current code) and
> found that, as well as having only one loop, the loop has 12 instructions
> per iteration.
>
> The current version (after fixing the bug) gives 12 instructions per
> iteration until shift becomes non-zero (first loop). After that there are 16
> instructions per iteration in the second loop, and that is without counting
> the additional instructions we jump to when we need to handle the overflow
> and increase the shift.
>
> Although this is not proper profiling, it is good enough for me. And
> considering the increase in code clarity, I would definitely go for the
> 64-bit version.
>
> Just for kicks, I also made an experimental version that uses 32 bits but
> leaves 8 bits of margin before accumulating, and that shaves 2 instructions
> per loop, making it 10. All in all, I am not sure it is worth the hassle
> though, considering this is only about 1/order of the cost of the
> autocorrelation.
>
> Marcello
>
>
> From: KOEN VOS <koenvos74 at gmail.com>
> Date: Friday, 20 June 2014 22:04
> To: "Timothy B. Terriberry" <tterribe at xiph.org>
> Cc: "opus at xiph.org" <opus at xiph.org>, Marcello Caramma <mcaramma at cisco.com>
> Subject: Re: [opus] Alleged bug in Silk codec
>
> Yes, those instructions exist, although they're a bit slower than the basic
> 16x16->32 multiply with 32-bit accumulation (SMLABB). So I'd be surprised if
> the function with 64-bit accumulation ran as fast as the current code. I
> don't know how much we care about 16-bit platforms. And accuracy should not
> matter.
>
> On the other hand, a 64-bit implementation is much cleaner/shorter, which is
> always a good argument :-)
> All in all, no strong preference from my side.
> koen.
>
>
> On Fri, Jun 20, 2014 at 12:40 PM, Timothy B. Terriberry
> <tterribe at xiph.org> wrote:
>
>> KOEN VOS wrote:
>>> The current implementation is faster on a 32- or 16-bit processor. If you
>>> would find the opposite to be true (i.e. that a 64-bit implementation is
>>> faster on, say, a 32-bit ARM CPU) then perhaps we should reconsider.
>>
>> Doesn't ARMv6 have a dual signed 16x16->32 multiply with a 64-bit
>> accumulator (SMLALD)? Even v5E should have a single 16x16->32 with a 64-bit
>> accumulator (SMLALBB). I would think a 64-bit version could be made pretty
>> fast on 32-bit ARM, without even resorting to SIMD.