On 2022-04-06, Andrew Sonzogni wrote:

> I've used this one but it's not THAT good: (chan1 + chan2 + chan3) / 3
>
> The output signal may peak or be buggy some times.

That is an age-old question. In general, if you assume the channels are
statistically independent, and you want each channel to contribute
equally and maximally, you scale each down by the square root of the
number of channels, i.e. you divide the sum by sqrt(N) rather than by
N. That solution mostly goes through mathematically under the central
limit theorem, and exactly so if each of the sources is Gaussian
distributed. In general you can assume that with sampled real-life
signals, because of reverberation and another application of the
central limit theorem, but in the case of "digital" signals, and in
particular if you sum many channels which are coherent (contain more or
less the same stuff), you'll fall off the ladder.

Personally I'd advocate for a pure sum with no scaling. But that sort
of thing is kind of opinionated: it only really works if you have
dynamics to spare, and also an absolute scaling of the incoming
waveforms over your whole signal chain. That is, each of the source
waveforms at digital full scale is supposed to represent an absolute
amplitude of, say, 96 dB(Z).

When you do this sort of gain architecture, anything coming in goes out
just as it was, unless modified. If it comes in at 24-bit full scale,
it'll literally break not just your ears but also your windows. So
you'll never actually put in anything near full scale. You *might* put
in combinations of sounds which then -- if incoherent with each other --
will usually scale by the square-root law. Which is fine, as long as
you keep your output full scale at something which would break stuff.
Say, do 24 bits and set the digital full scale at something like
130 dB SPL in your monitors.

This, by the way, is how it was and is done in cinematic audio editing.
The full dynamic range shatters your bones and ass, in absolute SPL
terms. Anything lesser is just added linearly into the mix. The added
bonus of this setup is that you actually get to use amplitude as an
extra variable which isn't normalized away by someone turning a knob or
putting an extra compressor in the signal path. You can actually
compose with volume, instead of assuming everybody takes your dynamics
away at will. (Obviously in many cases they will do just that, but
trust you me, composing in absolute amplitude leaves *so* much more
dynamics and nuance for the rest of them idjits to churn up.)

> For your information, I'm using an ARM M4F with Opus configured like
> this (40ms, 16kHz, 16 bitrate, 0 compres).

Just unpack and sum. You really, *really* don't want to see what an
optimized algorithm for something like this would look like. It'd take
a cross-compile into a fully functional language, some heady CS-minded
meta-coding concepts, and a week of computation to find the precise
weaving-in of the functional forms. At best. It's nowhere *near* the
best use of your time, especially since you can already optimize the
simple decode-sum-recode(?) cycle.
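
Something like the following C sketch is all that cycle amounts to,
assuming mono 16 kHz streams, the 40 ms / 640-sample frames from your
setup, and the 1/sqrt(N) scaling from above. mix_frames() and its
calling convention are made up for illustration, and error handling is
mostly left out; it's a sketch, not a drop-in implementation.

    /* Minimal sketch of the simple decode-sum-recode cycle: decode one
     * packet per stream, sum into a wide accumulator, scale by
     * 1/sqrt(N), saturate back to 16 bits.  Mono, 16 kHz, 640-sample
     * (40 ms) frames are assumed. */
    #include <math.h>
    #include <stdint.h>
    #include <opus.h>

    #define FRAME_SAMP 640                  /* 40 ms at 16 kHz         */

    static int mix_frames(OpusDecoder **dec,
                          const unsigned char *const *pkt,
                          const int *pkt_len,
                          int nstreams, opus_int16 *out)
    {
        int32_t    acc[FRAME_SAMP] = {0};   /* wide accumulator        */
        opus_int16 pcm[FRAME_SAMP];
        const float gain = 1.0f / sqrtf((float)nstreams);

        for (int s = 0; s < nstreams; s++) {
            /* data == NULL makes libopus run packet-loss concealment. */
            int n = opus_decode(dec[s], pkt[s], pkt_len[s],
                                pcm, FRAME_SAMP, 0);
            if (n < 0)
                return n;                   /* pass the Opus error up  */
            for (int i = 0; i < n; i++)
                acc[i] += pcm[i];
        }

        for (int i = 0; i < FRAME_SAMP; i++) {
            int32_t v = (int32_t)lrintf(gain * (float)acc[i]);
            if (v >  32767) v =  32767;     /* saturate, don't wrap    */
            if (v < -32768) v = -32768;
            out[i] = (opus_int16)v;
        }
        return FRAME_SAMP;
    }

On the M4F you'd probably trade the float gain for a Q15 multiply, but
the structure stays the same: decode, sum wide, scale, saturate,
re-encode.
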
As an example of the latter kind of optimization: if you have a number
of unsynchronized Opus streams decoding, you can rather easily land the
output in circular buffers of double the block size. The code for
updating such a buffer is classic, as is reading from it with arbitrary
delay up to half the buffer size. The code to SIMD-vectorize arbitrary
byte-spanning reads from such a buffer is also classic, and can be
found even in the better implementations of memcpy(). The only thing
you need to do then is to weave your code together so that it more or
less advances a word/cacheline/whatever-sized unit at a time in each
buffer, linearly, and sums over the inputs synchronously, without
unduly polluting your CPU's data cache. It's nasty even then, but
highly doable, and that's about as fast as you can go with any library
imaginable (any deeper algorithm would require recoding the Opus
library in a functional language, and weaving its internals together
with your buffering algo).
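
In code, the double-length ring buffer and the chunk-at-a-time
summation could look roughly like the sketch below. The sizes and names
(RING, CHUNK, ring_t and friends) are my illustrative assumptions; a
real implementation would likely pad the ring to a power of two so the
modulo becomes a mask, and would leave the inner loop to the compiler's
vectorizer.

    /* Sketch of double-length circular buffers per input, summed
     * linearly one small chunk at a time.  Not a tested design. */
    #include <stdint.h>
    #include <string.h>

    #define FRAME  640                 /* 40 ms at 16 kHz              */
    #define RING   (2 * FRAME)         /* double the block size        */
    #define CHUNK  32                  /* samples per summation step   */

    typedef struct {
        int16_t  buf[RING];
        uint32_t wr;                   /* absolute samples written     */
    } ring_t;

    /* Land one decoded frame in the ring (wraps at RING). */
    static void ring_write(ring_t *r, const int16_t *pcm, int n)
    {
        for (int i = 0; i < n; i++)
            r->buf[(r->wr + i) % RING] = pcm[i];
        r->wr += n;
    }

    /* Advance linearly, CHUNK samples at a time, summing all inputs at
     * the common read position rd (delay must stay <= FRAME samples). */
    static void sum_chunk(const ring_t *in, int nstreams, uint32_t rd,
                          int32_t *acc)
    {
        memset(acc, 0, CHUNK * sizeof(*acc));
        for (int s = 0; s < nstreams; s++)
            for (int i = 0; i < CHUNK; i++)
                acc[i] += in[s].buf[(rd + i) % RING];
    }
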
--
Sampo Syreeni, aka decoy - decoy at iki.fi, http://decoy.iki.fi/front
+358-40-3751464, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2


Thanks for the answer, and thanks for the warning about complex
algorithms, as I was starting to look into how conferencing
applications do it.

My signal will be approximately the same on each channel, as I only
send voice data. Also, I already send each channel on a different
queue, so as to decode and synchronize them correctly and
independently.

So to sum up: it's better to decode, then take the sum of the channels
scaled by the square root of the number of channels. Is that right? Or
is there some kind of trickier and better algorithm that I'm not aware
of?

Sorry if my questions seem childish; I don't have that much experience
with audio algorithms, and it's kind of fascinating how tricky it can
be to adjust some parameters correctly, or even to find the right
algorithm for a given use case.

Kind regards,
Andrew
On 2022-04-06, Sampo Syreeni wrote:

>> For your information, I'm using an ARM M4F with Opus configured like
>> this (40ms, 16kHz, 16 bitrate, 0 compres).
>
> Just unpack and sum.

To reaffirm: what you need to do is find a suitable output-buffer
synchronization algorithm, and then parallelize the fuck out of your
code. In modern embedded and mobile architectures, your performance
will be dictated by how many cores you utilize, consistently, not by
exact optimizations in how you use your libraries.

Since 40 ms at 16 kHz is 640 samples, you'd do a 1280-sample circular
buffer for each input. You'd do a balanced tree of pairwise sums of
those towards a common output buffer, using intermediate buffers,
preferably held in a private and faster memory space for each core.
Preferably the first-level cache, but probably L2. Unlike the input
buffers, you'd probably mostly want to treat the secondary buffers as
reset to the beginning, or otherwise cache-aligned; time-base alignment
can be done with the freely running ring buffers at the first stage,
and isn't much needed further down the line.

There is a cost in latency, but you can easily round it down to a cache
line, so something like 32-64 bytes. That rounds down nicely from 640
samples, to either ten or twenty lines.

Do note that most multiprocessor architectures can do one of two
things: 1) NUMA architectures expose private memory, which doesn't need
any synchronization in-thread, and 2) where uniform, cache-coherent
memory access is available, there are often primitives which allow a
given memory location to be added to coherently. Use the first idea to
add into parts of the addition tree without synchronization, and use
the second to synchronize with the shared-memory part of the tree. Just
add in what you have at linearly increasing offsets, and then let the
cache sort it out. Otherwise, try to maintain a scheduling round which
writes linearly into shared memory and only synchronizes on cache-line
boundaries.

Now you have at most an 80 ms buffer for arbitrary time-base alignment
and a fixed Opus block width. It's rather a lot, but you can scale down
from that in twos if you're willing to take the hit in prediction
efficiency. Your architecture is fully synchronized from the first
layer down the addition tree, and you can use separate cores to
populate the separate input buffers, preferably from separate TCP/UDP
streams down the stack. That's about as good as you can get, presuming
a fixed block width from Opus. Which it has; it isn't a zero-delay
codec after all.
--
Sampo Syreeni, aka decoy - decoy at iki.fi, http://decoy.iki.fi/front
+358-40-3751464, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2
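
For what it's worth, here is a rough C sketch of the pairwise addition
tree described above, with the per-core threading collapsed into a
plain serial recursion to keep it short. The names and sizes (FRAME,
RING, CHUNK, tree_sum) are illustrative assumptions only.

    /* Rough sketch of the balanced pairwise addition tree.  In a real
     * build each recursion level would run on its own core, with its
     * scratch buffers living in that core's private (L1/L2) memory,
     * and only the root writing to shared, cache-line-aligned output. */
    #include <stdint.h>

    #define FRAME  640                 /* 40 ms at 16 kHz              */
    #define RING   (2 * FRAME)         /* per-input circular buffer    */
    #define CHUNK  32                  /* one cache-line-ish work unit */

    /* Sum CHUNK samples, starting at read position rd, over inputs
     * [lo, hi) of the per-stream ring buffers, pairwise, into out[]. */
    static void tree_sum(int16_t ring[][RING], uint32_t rd,
                         int lo, int hi, int32_t *out)
    {
        if (hi - lo == 1) {                 /* leaf: one input stream   */
            for (int i = 0; i < CHUNK; i++)
                out[i] = ring[lo][(rd + i) % RING];
            return;
        }

        int32_t left[CHUNK], right[CHUNK];  /* per-level scratch        */
        int mid = lo + (hi - lo) / 2;

        tree_sum(ring, rd, lo,  mid, left); /* left half of the tree    */
        tree_sum(ring, rd, mid, hi,  right);/* right half               */

        for (int i = 0; i < CHUNK; i++)     /* pairwise combine         */
            out[i] = left[i] + right[i];
    }

You'd call this once per CHUNK as the common read position advances,
then scale and saturate the root back to 16 bits before re-encoding.
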