Hi, thanks for the reply.

On Wed, Feb 4, 2015 at 5:17 AM, Timothy B. Terriberry <tterribe at vt.edu> wrote:

> I don't believe anyone has been working on this for some years. There
> are two basic approaches.
>
> One is threading within a single frame, which does not require any API
> behavior changes. In theory you can scale to a fairly decent number of
> threads everywhere except the final conversion from tokens to VLC codes
> in oc_enc_frame_pack(). However, the units of work are sufficiently
> small and the task dependencies sufficiently involved that this needs
> some kind of lock-free work-stealing queue to have a hope of getting
> more benefit from the parallelism than you pay in synchronization
> overhead. I'd started designing one with the hope that all memory
> allocations could be done up front at encoder initialization (to avoid
> locking contention there), but this turns out to be sufficiently
> different from how most lock-free data structures worked at the time
> that it was a fair amount of work. I've been meaning to look at what
> Mozilla's Servo project is doing for this these days (since they have
> similar challenges).
>
> The other is traditional FFmpeg-style frame threading, which gives each
> thread a separate frame to encode, and merely waits for enough rows of
> the previous frame to be finished so that it can start its motion
> search. This is generally much more effective than threading within a
> frame, but a) requires additional delay (the API supports this in
> theory, but software using that API might not expect it, so it would
> have to be enabled manually through some sort of th_encode_ctl call),
> and b) requires changes to the rate control to deal with the fact that
> statistics from the previous frame are not immediately available. b)
> was the real blocker here.

I have read the Theora Specification (from March 2011) and I have some
more ideas:

1. Each thread deals with the frames from one intra frame up to the
   next intra frame - 1.
2. Each thread deals with 1/n-th of the duration, and all outputs are
   finally concatenated (see the sketch below).
3. Maybe not multithreading as such, but parallel/vector computing:
   encoding one frame divided into small areas, processed with OpenCL
   or CUDA.

I'm aware these are rather naive approaches, mostly because they need to
have enough data up front. For 1., stream encoding would introduce some
latency; and with today's processor power encoding can be done in real
time, so there is no speedup to gain for streamed video. Maybe one could
instead spend the extra time finding better compression.

Well, 2. is totally naive. But if the whole video is available, the
speedup should be almost linear.

About 3.: well, it's vendor lock-in ;-) But hey, better this than
nothing, right? As this is a variation of concept #1 you described, CUDA
and OpenCL have efficient mechanisms to deal with synchronization,
memory sharing, etc. This approach would probably benefit most at higher
resolutions. CUDA and/or OpenCL could also perform concept #2, with the
same limitations, unfortunately.

--
Best regards,
Mateusz Pabis
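For concreteness, idea 2 above might look something like the following
untested sketch against the stock libtheora 1.x API, with one pthread
per segment. read_frame() and write_packet() are hypothetical
application hooks, and the per-segment packet streams would be
concatenated in input order afterwards:

    /* Each worker encodes an independent 1/n-th of the sequence with
     * its own th_enc_ctx, so no locking is needed between workers. */
    #include <pthread.h>
    #include <theora/theoraenc.h>

    extern int  read_frame(int frame_idx, th_ycbcr_buffer ycbcr);
    extern void write_packet(int segment, const ogg_packet *op);

    typedef struct {
      int      segment;     /* index of this 1/n-th of the sequence */
      int      first_frame; /* inclusive */
      int      last_frame;  /* exclusive */
      th_info *info;        /* shared, read-only after setup */
    } segment_job;

    static void *encode_segment(void *arg) {
      segment_job *job = arg;
      th_enc_ctx  *enc = th_encode_alloc(job->info);
      th_comment   tc;
      ogg_packet   op;
      int          i;
      /* Every segment produces its own stream headers; when splicing,
       * only segment 0's headers would normally be kept. */
      th_comment_init(&tc);
      while (th_encode_flushheader(enc, &tc, &op) > 0)
        write_packet(job->segment, &op);
      th_comment_clear(&tc);
      for (i = job->first_frame; i < job->last_frame; i++) {
        th_ycbcr_buffer ycbcr;
        read_frame(i, ycbcr);
        th_encode_ycbcr_in(enc, ycbcr);
        while (th_encode_packetout(enc, i + 1 == job->last_frame, &op) > 0)
          write_packet(job->segment, &op);
      }
      th_encode_free(enc);
      return NULL;
    }

    /* Launch n workers (n <= 16 here for simplicity) and join them;
     * the caller concatenates the segment outputs in order. */
    void encode_parallel(th_info *info, int total_frames, int n) {
      pthread_t   tid[16];
      segment_job job[16];
      int         i;
      for (i = 0; i < n; i++) {
        job[i].segment     = i;
        job[i].first_frame = total_frames * i / n;
        job[i].last_frame  = total_frames * (i + 1) / n;
        job[i].info        = info;
        pthread_create(&tid[i], NULL, encode_segment, &job[i]);
      }
      for (i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    }

Note that the spliced output would still need its granule positions
rewritten at the Ogg layer, which this sketch ignores.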
M. Pabis wrote:
> 1. Each thread deals with the frames from one intra frame up to the
> next intra frame - 1.

This works if you know where the intra frames are. Currently the frame
type decision is made by trying to encode as an inter frame, while
keeping statistics on the expected rate and distortion from using
all-intra modes during mode decision. Then, if it looks like an
all-intra frame is likely to be more efficient, the frame is re-encoded
as a keyframe. There is no lookahead at all. You could certainly do this
in two-pass mode, but the first pass is not very much faster than the
second pass. In fact, I'm pretty sure you could do this without any
modification to libtheora at all.

> 2. Each thread deals with 1/n-th of the duration, and all outputs are
> finally concatenated.

This is pretty similar to 1, except that you can be more relaxed about
picking your partition points (i.e., if you put a keyframe in the wrong
place 4 or 8 times in a whole sequence, the overhead will not be that
large). Again, I think you can do this with no modifications to
libtheora at all.

In both cases the real trick will be rate control, since unless you're
doing average bitrate, the number of bits you want to spend on each
segment can vary quite a lot. If you are doing average bitrate, then
this is easy. This is what sites like YouTube already do to reduce the
latency between a video being uploaded and that video being available,
and you can do it even with an encoder that is itself multithreaded
(i.e., splitting across multiple machines instead of threads). Whether
or not your encoder is multithreaded just controls how many segments you
need to split the sequence into for a desired degree of parallelism.

> 3. Maybe not multithreading as such, but parallel/vector computing:
> encoding one frame divided into small areas, processed with OpenCL
> or CUDA.

Lots of people have tried to do something like this for various codecs,
but I'm not aware of anyone ever getting any real improvements. A lot of
the processing does not work well on a GPU, and the data marshalling
needed to get information back and forth between the CPU and the GPU
tends to wipe out the gains from the parallelism. I would personally
suggest not wasting time on this approach.

> right? As this is a variation of concept #1 you described, CUDA and
> OpenCL have efficient mechanisms to deal with synchronization,
> memory sharing, etc. This approach would probably benefit most at
> higher

Well, they mostly deal with it by not synchronizing. I.e., to get good
performance you need something on the order of 1000-way parallelism
among tasks that do not have to synchronize with each other. They have
grown some synchronization mechanisms, but those carry a huge
performance penalty on these architectures.
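To illustrate the "no modifications to libtheora" route: under average
bitrate, the pieces Tim describes do seem to fall out for free. The
following is an untested sketch; the ctl and th_info fields are the
stock theoraenc.h ones, and alloc_segment_encoder() is a hypothetical
helper name:

    /* One encoder per segment. A fresh th_enc_ctx codes its first
     * frame as a keyframe, so every segment is independently decodable
     * and the outputs can be spliced. With average bitrate, a segment
     * of f frames out of F total gets a budget of total_bits * f / F,
     * i.e. the same target_bitrate everywhere, so no cross-segment
     * rate coordination is needed. */
    #include <theora/theoraenc.h>

    th_enc_ctx *alloc_segment_encoder(th_info *info, long bitrate_bps,
                                      ogg_uint32_t max_kf_spacing) {
      th_enc_ctx *enc;
      info->target_bitrate = bitrate_bps; /* average bitrate mode */
      info->quality        = 0;           /* not constant-quality */
      enc = th_encode_alloc(info);
      if (enc == NULL) return NULL;
      /* Cap the keyframe spacing inside each segment (this sets a
       * maximum distance between keyframes, not an exact cadence). */
      th_encode_ctl(enc, TH_ENCCTL_SET_KEYFRAME_FREQUENCY_FORCE,
                    &max_kf_spacing, sizeof(max_kf_spacing));
      return enc;
    }

Constrained (non-average) bitrate would still need the cross-segment
budgeting described above; this sketch simply sidesteps it.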
On 04.02.2015 at 12:31, Timothy B. Terriberry wrote:
> M. Pabis wrote:
>> 1. Each thread deals with the frames from one intra frame up to the
>> next intra frame - 1.
>
> This works if you know where the intra frames are.

Could this information be gathered by having one thread encode a
downsampled version of the input video sequence, or would this be a bad
predictor? Who knows, perhaps one could also gather data on the relative
bitrate distribution between segments.

Maik
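For anyone who wants to experiment with that, a minimal untested sketch
of such an analysis pass against the stock libtheora API might look as
follows. read_downsampled_frame() is a hypothetical hook supplying a
decimated picture, and whether the small encoder's keyframe choices
transfer to full resolution is exactly the open question:

    /* Encode a small version of the sequence with an ordinary encoder
     * and record where it placed keyframes and how large each frame
     * came out, as a (possibly poor) predictor of partition points and
     * of the relative bitrate between segments. */
    #include <theora/theoradec.h> /* th_packet_iskeyframe() */
    #include <theora/theoraenc.h>

    extern int read_downsampled_frame(int frame_idx,
                                      th_ycbcr_buffer ycbcr);

    void analysis_pass(th_info *small_info, int total_frames,
                       int *is_keyframe, long *frame_bytes) {
      th_enc_ctx *enc = th_encode_alloc(small_info);
      ogg_packet  op;
      int         i;
      for (i = 0; i < total_frames; i++) {
        th_ycbcr_buffer ycbcr;
        read_downsampled_frame(i, ycbcr);
        th_encode_ycbcr_in(enc, ycbcr);
        /* libtheora emits at most one data packet per submitted
         * frame, so indexing by i is safe here. */
        while (th_encode_packetout(enc, i + 1 == total_frames, &op) > 0) {
          is_keyframe[i] = th_packet_iskeyframe(&op);
          frame_bytes[i] = (long)op.bytes;
        }
      }
      th_encode_free(enc);
    }

Segment boundaries could then be chosen at the recorded keyframes, and
frame_bytes[] summed per segment to bias the per-segment bit budgets.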