I'm considering implementing Opus as the codec for an embedded ARM-based battery-powered audio system. In the interest of battery life and board footprint I'd like to specify the smallest CPU that can do the job.

In some quick testing on Cortex-A8 (a very different core, but at least ISA compatible and hopefully fairly similar to M4 for things like cycle counts and code size) I saw promising results -- about 30 MHz of A8 CPU was sufficient to encode an audio stream using the 1.1.1-beta fixed point codec at 48 kHz mono, complexity=5, bitrate=20kbit/sec. Since the target SoCs tend to have an M3 or M4 running up to 100-150 MHz, and power consumption runs nearly linearly with clock speed, this seemed to give us some headroom to run the rest of our application stack and tune for battery life.

However, now that we're doing a first implementation on M4, we're seeing significantly higher cycle counts -- more in the range of 100 MHz of CPU needed to encode with the same parameters. Additionally, compared to 1.0.3, the code size and data size of the Opus codec in 1.1 has increased significantly (which makes it a challenge to fit in the on-SoC SRAM of the M4).

Obviously we need to use the ARM ASM that landed in -beta, and we can decrease the complexity to somewhat reduce the CPU utilization, but I'm wondering if I'm missing any other low-hanging fruit in optimizing Opus for this target CPU. I haven't even started to do code profiling or CPU performance counter analysis.

Does anyone have examples of similar applications? What kinds of CPU occupancy have other people seen on similar CPUs? Do we need to get some NEON asm? Does anybody have spare cycles to take paid work in this space?

-andy
Hi Andy,

On 03/11/14 07:36 PM, Andy Isaacson wrote:
> In some quick testing on Cortex-A8 (a very different core, but at least
> ISA compatible and hopefully fairly similar to M4 for things like cycle
> counts and code size) I saw promising results -- about 30 MHz of A8 CPU
> was sufficient to encode an audio stream using the 1.1.1-beta fixed
> point codec at 48 kHz mono, complexity=5, bitrate=20kbit/sec.

First, I think the big difference between the M4 and the A8 is that the A8 has NEON, which Opus is able to use.

> However now that we're doing a first implementation on M4, we're seeing
> significantly higher cycle counts -- more in the range of 100 MHz of CPU
> needed to encode with the same parameters. Additionally, compared to
> 1.0.3, the code size and data size of the Opus codec in 1.1 has
> increased significantly (which makes it a challenge to fit in the on-SoC
> SRAM of the M4).

I suspect most of the size increase you're seeing is from the new code in src/analysis.c, which you do not need. In fact, if you're operating at 20 kb/s for speech, then you can entirely remove the CELT encoder from your build. You still need the decoder because there's no guarantee what the remote end will send you.

> Obviously we need to use the ARM ASM that landed in -beta, and we can
> decrease the complexity to somewhat reduce the CPU utilization, but I'm
> wondering if I'm missing any other low-hanging fruit in optimizing Opus
> for this target CPU. I haven't even started to do code profiling or CPU
> performance counter analysis.

There are a few things to check. First, make sure that OPUS_ARM_INLINE_EDSP (enabling DSP extensions) is defined in your config.h. Also, check for OPUS_ARM_ASM and OPUS_HAVE_RTCD. If all three are defined, all the asm is enabled. At that point, the best next step is to profile the encoder to see where the CPU time is spent.

Cheers,

Jean-Marc
On 11/03/2014 08:32 PM, Jean-Marc Valin wrote:
> There's a few things to check. First, make sure that
> OPUS_ARM_INLINE_EDSP (enabling DSP extensions) is defined in your
> config.h. Also, check for OPUS_ARM_ASM and OPUS_HAVE_RTCD. That means
> all the asm is enabled. At that point, the best is to run the profiles
> to see where the CPU time is spent.

Incidentally, I think this advice could form part of a larger "Opus for embedded systems" guide that defines use cases and the implementation optimizations useful for each. Andy, you might consider posting whatever you and/or your company are willing to share to the Xiph wiki, for example.

--
Libre Video
http://librevideo.org