Yes, those instructions exist, although they're a bit slower than the basic
16x16->32 multiply with 32-bit accumulation (SMLABB). So I'd be surprised if
the function with 64-bit accumulation ran as fast as the current code. I don't
know how much we care about 16-bit platforms. And accuracy should not matter.

On the other hand, a 64-bit implementation is much cleaner/shorter, which is
always a good argument :-)
All in all, no strong preference from my side.
koen.

On Fri, Jun 20, 2014 at 12:40 PM, Timothy B. Terriberry <tterribe at xiph.org> wrote:

> KOEN VOS wrote:
>> The current implementation is faster on a 32- or 16-bit processor. If you
>> would find the opposite to be true (i.e. that a 64-bit implementation is
>> faster on, say, a 32-bit ARM CPU) then perhaps we should reconsider.
>
> Doesn't ARMv6 have a dual signed 16x16->32 multiply with a 64-bit
> accumulator (SMLALD)? Even v5E should have a single 16x16->32 with a 64-bit
> accumulator (SMLALBB). I would think a 64-bit version could be made pretty
> fast on 32-bit ARM, without even resorting to SIMD.
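[As a concrete illustration of the pattern being discussed (this sketch is not
from the thread, and the function name is made up): a 16x16->32 product added
into a 64-bit accumulator is exactly the operation SMLALBB provides, so a plain
C loop like the one below can map to it on ARMv5E and later without any
explicit SIMD; on ARMv6, pairs of samples can in principle go through SMLALD.]

#include <stdint.h>

/* Illustrative only: 16x16->32 products accumulated into 64 bits.
 * The multiply-accumulate in the loop body is the SMLALBB pattern,
 * so a compiler targeting ARMv5E+ may emit it directly. */
static int64_t sum_sqr64(const int16_t *x, int len)
{
    int64_t acc = 0;
    for (int i = 0; i < len; i++) {
        acc += (int32_t)x[i] * x[i];   /* 16x16->32, widened into the 64-bit sum */
    }
    return acc;
}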
Yes, regarding the unsigned-to-signed conversion you are right, it is
implementation-defined. I just had an issue a couple of years ago with a
compiler which incorrectly treated unsigned overflow as undefined rather than
implementation-defined.

Regarding the 64-bit profiling: I looked at the disassembly
(gcc -c -S -O2 ../opus/silk/sum_sqr_shift.c -I../opus/include -I../opus/celt)
of the 64-bit accumulator version (unrolled twice like the current code) and
found that, as well as having only one loop, the loop has 12 instructions per
iteration.

The current version (after fixing the bug) gives 12 instructions per iteration
until shift becomes non-zero (first loop). After that there are 16 instructions
per iteration in the second loop, and that is without counting the additional
instructions we jump to when we need to handle the overflow and increase the
shift.

Although this is not proper profiling, it is good enough for me. And
considering the increase in code clarity, I would definitely go for the 64-bit
version.

Just for kicks, I also made an experimental version that uses 32 bits but
leaves 8 bits of margin before accumulating, and that shaves 2 instructions per
loop, making it 10. All in all, I am not sure it is worth the hassle though,
considering this is only about 1/order of the cost of the autocorrelation.

Marcello


From: KOEN VOS <koenvos74 at gmail.com>
Date: Friday, 20 June 2014 22:04
To: "Timothy B. Terriberry" <tterribe at xiph.org>
Cc: "opus at xiph.org" <opus at xiph.org>, Marcello Caramma <mcaramma at cisco.com>
Subject: Re: [opus] Alleged bug in Silk codec

Yes, those instructions exist, although they're a bit slower than the basic
16x16->32 multiply with 32-bit accumulation (SMLABB). So I'd be surprised if
the function with 64-bit accumulation ran as fast as the current code. I don't
know how much we care about 16-bit platforms. And accuracy should not matter.

On the other hand, a 64-bit implementation is much cleaner/shorter, which is
always a good argument :-)
All in all, no strong preference from my side.
koen.

On Fri, Jun 20, 2014 at 12:40 PM, Timothy B. Terriberry <tterribe at xiph.org> wrote:

> KOEN VOS wrote:
>> The current implementation is faster on a 32- or 16-bit processor. If you
>> would find the opposite to be true (i.e. that a 64-bit implementation is
>> faster on, say, a 32-bit ARM CPU) then perhaps we should reconsider.
>
> Doesn't ARMv6 have a dual signed 16x16->32 multiply with a 64-bit
> accumulator (SMLALD)? Even v5E should have a single 16x16->32 with a 64-bit
> accumulator (SMLALBB). I would think a 64-bit version could be made pretty
> fast on 32-bit ARM, without even resorting to SIMD.
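[For reference, a rough sketch of the single-loop, 64-bit accumulator shape
described above. This is illustrative only, not the merged patch: the function
name, the plain C99 types, and the way the shift is chosen here are simplified
assumptions rather than the real silk_sum_sqr_shift() behaviour.]

#include <stdint.h>

/* Sketch only: one loop, unrolled by two, with no mid-loop overflow
 * handling because the accumulator is 64 bits wide. */
static void sum_sqr_shift64_sketch(int32_t *energy, int *shift,
                                   const int16_t *x, int len)
{
    int64_t acc = 0;
    int i, shft;

    for (i = 0; i + 1 < len; i += 2) {
        acc += (int32_t)x[i]     * x[i];
        acc += (int32_t)x[i + 1] * x[i + 1];
    }
    if (i < len) {
        acc += (int32_t)x[i] * x[i];      /* odd tail sample */
    }

    /* Pick the smallest shift that brings the energy into 31 bits. */
    shft = 0;
    while ((acc >> shft) > INT32_MAX) {
        shft++;
    }
    *shift  = shft;
    *energy = (int32_t)(acc >> shft);
}

[The point of the 64-bit accumulator is that no overflow check is needed inside
the loop, which is what the second loop and overflow branch of the current code
are there for, and which is where its extra instructions per iteration come
from.]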
Hi Marcello,

FYI I just pushed the 64-bit C0 patch.

Cheers,

	Jean-Marc

On 25/06/14 06:37 AM, Marcello Caramma (mcaramma) wrote:
> Yes, regarding the unsigned-to-signed conversion you are right, it is
> implementation-defined. I just had an issue a couple of years ago with a
> compiler which incorrectly treated unsigned overflow as undefined rather
> than implementation-defined.
>
> Regarding the 64-bit profiling: I looked at the disassembly
> (gcc -c -S -O2 ../opus/silk/sum_sqr_shift.c -I../opus/include -I../opus/celt)
> of the 64-bit accumulator version (unrolled twice like the current code) and
> found that, as well as having only one loop, the loop has 12 instructions
> per iteration.
>
> The current version (after fixing the bug) gives 12 instructions per
> iteration until shift becomes non-zero (first loop). After that there are 16
> instructions per iteration in the second loop, and that is without counting
> the additional instructions we jump to when we need to handle the overflow
> and increase the shift.
>
> Although this is not proper profiling, it is good enough for me. And
> considering the increase in code clarity, I would definitely go for the
> 64-bit version.
>
> Just for kicks, I also made an experimental version that uses 32 bits but
> leaves 8 bits of margin before accumulating, and that shaves 2 instructions
> per loop, making it 10. All in all, I am not sure it is worth the hassle
> though, considering this is only about 1/order of the cost of the
> autocorrelation.
>
> Marcello
>
>
> From: KOEN VOS <koenvos74 at gmail.com>
> Date: Friday, 20 June 2014 22:04
> To: "Timothy B. Terriberry" <tterribe at xiph.org>
> Cc: "opus at xiph.org" <opus at xiph.org>, Marcello Caramma <mcaramma at cisco.com>
> Subject: Re: [opus] Alleged bug in Silk codec
>
> Yes, those instructions exist, although they're a bit slower than the basic
> 16x16->32 multiply with 32-bit accumulation (SMLABB). So I'd be surprised if
> the function with 64-bit accumulation ran as fast as the current code. I
> don't know how much we care about 16-bit platforms. And accuracy should not
> matter.
>
> On the other hand, a 64-bit implementation is much cleaner/shorter, which is
> always a good argument :-)
> All in all, no strong preference from my side.
> koen.
>
>
> On Fri, Jun 20, 2014 at 12:40 PM, Timothy B. Terriberry
> <tterribe at xiph.org> wrote:
>
>> KOEN VOS wrote:
>>> The current implementation is faster on a 32- or 16-bit processor. If you
>>> would find the opposite to be true (i.e. that a 64-bit implementation is
>>> faster on, say, a 32-bit ARM CPU) then perhaps we should reconsider.
>>
>> Doesn't ARMv6 have a dual signed 16x16->32 multiply with a 64-bit
>> accumulator (SMLALD)? Even v5E should have a single 16x16->32 with a 64-bit
>> accumulator (SMLALBB). I would think a 64-bit version could be made pretty
>> fast on 32-bit ARM, without even resorting to SIMD.