Hi, thanks for the reply.

On Wed, Feb 4, 2015 at 5:17 AM, Timothy B. Terriberry <tterribe at vt.edu> wrote:

> I don't believe anyone has been working on this for some years. There
> are two basic approaches.
>
> One is threading within a single frame, which does not require any API
> behavior changes. In theory you can scale to a fairly decent number of
> threads everywhere except the final conversion from tokens to VLC codes
> in oc_enc_frame_pack(). However, the units of work are sufficiently
> small and the task dependencies sufficiently involved that this needs
> some kind of lock-free work-stealing queue to have a hope of getting
> more benefit from the parallelism than you pay in synchronization
> overhead. I'd started designing one with the hope that all memory
> allocations could be done up front at encoder initialization (to avoid
> locking contention there), but this turns out to be sufficiently
> different from how most lock-free data structures worked at the time
> that it was a fair amount of work. I've been meaning to look at what
> Mozilla's Servo project is doing for this these days (since they have
> similar challenges).
>
> The other is traditional FFmpeg-style frame threading, which gives each
> thread a separate frame to encode, and merely waits for enough rows of
> the previous frame to be finished so that it can start its motion
> search. This is generally much more effective than threading within a
> frame, but a) requires additional delay (the API supports this in
> theory, but software using that API might not expect it, so it would
> have to be enabled manually through some sort of th_encode_ctl call),
> and b) requires changes to the rate control to deal with the fact that
> statistics from the previous frame are not immediately available. b)
> was the real blocker here.

I have read the Theora Specification (from March 2011) and I have some
more ideas:

1. Each thread deals with the frames from one intra frame up to the
   next intra frame - 1.
2. Each thread deals with 1/n-th of the duration, and all outputs are
   finally concatenated (see the sketch below).
3. Maybe not multithreading as such, but parallel/vector computing:
   encoding one frame divided into small areas, processed with OpenCL
   or CUDA.

I'm aware these are rather naive approaches, mostly because they need to
have enough data up front. For 1., stream encoding would introduce some
latency; and with today's processor power encoding can be done in real
time, so there is no speedup to gain for streamed video. Maybe one could
instead spend the extra time finding better compression.

Well, 2. is totally naive. But if the whole video is available, the
speedup should be almost linear.

About 3.: well, it's vendor lock-in ;-) But hey, better this than
nothing, right? As this is a variation of concept #1 you described, CUDA
and OpenCL have efficient mechanisms to deal with synchronization,
memory sharing, etc. This approach would probably benefit most at higher
resolutions. CUDA and/or OpenCL could also perform concept #2, with the
same limitations, unfortunately.

--
Best regards,
Mateusz Pabis
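For concreteness, idea 2 above might look something like the following
untested sketch against the stock libtheora 1.x API, with one pthread
per segment. read_frame() and write_packet() are hypothetical
application hooks, and the per-segment packet streams would be
concatenated in input order afterwards:

    /* Each worker encodes an independent 1/n-th of the sequence with
     * its own th_enc_ctx, so no locking is needed between workers. */
    #include <pthread.h>
    #include <theora/theoraenc.h>

    extern int  read_frame(int frame_idx, th_ycbcr_buffer ycbcr);
    extern void write_packet(int segment, const ogg_packet *op);

    typedef struct {
      int      segment;     /* index of this 1/n-th of the sequence */
      int      first_frame; /* inclusive */
      int      last_frame;  /* exclusive */
      th_info *info;        /* shared, read-only after setup */
    } segment_job;

    static void *encode_segment(void *arg) {
      segment_job *job = arg;
      th_enc_ctx  *enc = th_encode_alloc(job->info);
      th_comment   tc;
      ogg_packet   op;
      int          i;
      /* Every segment produces its own stream headers; when splicing,
       * only segment 0's headers would normally be kept. */
      th_comment_init(&tc);
      while (th_encode_flushheader(enc, &tc, &op) > 0)
        write_packet(job->segment, &op);
      th_comment_clear(&tc);
      for (i = job->first_frame; i < job->last_frame; i++) {
        th_ycbcr_buffer ycbcr;
        read_frame(i, ycbcr);
        th_encode_ycbcr_in(enc, ycbcr);
        while (th_encode_packetout(enc, i + 1 == job->last_frame, &op) > 0)
          write_packet(job->segment, &op);
      }
      th_encode_free(enc);
      return NULL;
    }

    /* Launch n workers (n <= 16 here for simplicity) and join them;
     * the caller concatenates the segment outputs in order. */
    void encode_parallel(th_info *info, int total_frames, int n) {
      pthread_t   tid[16];
      segment_job job[16];
      int         i;
      for (i = 0; i < n; i++) {
        job[i].segment     = i;
        job[i].first_frame = total_frames * i / n;
        job[i].last_frame  = total_frames * (i + 1) / n;
        job[i].info        = info;
        pthread_create(&tid[i], NULL, encode_segment, &job[i]);
      }
      for (i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    }

Note that the spliced output would still need its granule positions
rewritten at the Ogg layer, which this sketch ignores.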
M. Pabis wrote:
> 1. Each thread deals with the frames from one intra frame up to the
> next intra frame - 1.

This works if you know where the intra frames are. Currently the frame
type decision is made by trying to encode as an inter frame, while
keeping statistics on the expected rate and distortion from using
all-intra modes during mode decision. Then, if it looks like an
all-intra frame is likely to be more efficient, the frame is re-encoded
as a keyframe. There is no lookahead at all. You could certainly do this
in two-pass mode, but the first pass is not very much faster than the
second pass. In fact, I'm pretty sure you could do this without any
modification to libtheora at all.

> 2. Each thread deals with 1/n-th of the duration, and all outputs are
> finally concatenated.

This is pretty similar to 1, except that you can be more relaxed about
picking your partition points (i.e., if you put a keyframe in the wrong
place 4 or 8 times in a whole sequence, the overhead will not be that
large). Again, I think you can do this with no modifications to
libtheora at all.

In both cases the real trick will be rate control, since unless you're
doing average bitrate, the number of bits you want to spend on each
segment can vary quite a lot. If you are doing average bitrate, then
this is easy. This is what sites like YouTube already do to reduce the
latency between a video being uploaded and that video being available,
and you can do it even with an encoder that is itself multithreaded
(i.e., splitting across multiple machines instead of threads). Whether
or not your encoder is multithreaded just controls how many segments you
need to split the sequence into for a desired degree of parallelism.

> 3. Maybe not multithreading as such, but parallel/vector computing:
> encoding one frame divided into small areas, processed with OpenCL
> or CUDA.

Lots of people have tried to do something like this for various codecs,
but I'm not aware of anyone ever getting any real improvements. A lot of
the processing does not work well on a GPU, and the data marshalling
needed to get information back and forth between the CPU and the GPU
tends to wipe out the gains from the parallelism. I would personally
suggest not wasting time on this approach.

> right? As this is a variation of concept #1 you described, CUDA and
> OpenCL have efficient mechanisms to deal with synchronization,
> memory sharing, etc. This approach would probably benefit most at
> higher

Well, they mostly deal with it by not synchronizing. I.e., to get good
performance you need something on the order of 1000-way parallelism
among tasks that do not have to synchronize with each other. They have
grown some synchronization mechanisms, but those carry a huge
performance penalty on these architectures.
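To illustrate the "no modifications to libtheora" route: under average
bitrate, the pieces Tim describes do seem to fall out for free. The
following is an untested sketch; the ctl and th_info fields are the
stock theoraenc.h ones, and alloc_segment_encoder() is a hypothetical
helper name:

    /* One encoder per segment. A fresh th_enc_ctx codes its first
     * frame as a keyframe, so every segment is independently decodable
     * and the outputs can be spliced. With average bitrate, a segment
     * of f frames out of F total gets a budget of total_bits * f / F,
     * i.e. the same target_bitrate everywhere, so no cross-segment
     * rate coordination is needed. */
    #include <theora/theoraenc.h>

    th_enc_ctx *alloc_segment_encoder(th_info *info, long bitrate_bps,
                                      ogg_uint32_t max_kf_spacing) {
      th_enc_ctx *enc;
      info->target_bitrate = bitrate_bps; /* average bitrate mode */
      info->quality        = 0;           /* not constant-quality */
      enc = th_encode_alloc(info);
      if (enc == NULL) return NULL;
      /* Cap the keyframe spacing inside each segment (this sets a
       * maximum distance between keyframes, not an exact cadence). */
      th_encode_ctl(enc, TH_ENCCTL_SET_KEYFRAME_FREQUENCY_FORCE,
                    &max_kf_spacing, sizeof(max_kf_spacing));
      return enc;
    }

Constrained (non-average) bitrate would still need the cross-segment
budgeting described above; this sketch simply sidesteps it.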
On 04.02.2015 at 12:31, Timothy B. Terriberry wrote:
> M. Pabis wrote:
>> 1. Each thread deals with the frames from one intra frame up to the
>> next intra frame - 1.
>
> This works if you know where the intra frames are.

Could this information be gathered by having one thread encode a
downsampled version of the input video sequence, or would this be a bad
predictor? Who knows, perhaps one could also gather data on the relative
bitrate distribution between segments.

Maik
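For anyone who wants to experiment with that, a minimal untested sketch
of such an analysis pass against the stock libtheora API might look as
follows. read_downsampled_frame() is a hypothetical hook supplying a
decimated picture, and whether the small encoder's keyframe choices
transfer to full resolution is exactly the open question:

    /* Encode a small version of the sequence with an ordinary encoder
     * and record where it placed keyframes and how large each frame
     * came out, as a (possibly poor) predictor of partition points and
     * of the relative bitrate between segments. */
    #include <theora/theoradec.h> /* th_packet_iskeyframe() */
    #include <theora/theoraenc.h>

    extern int read_downsampled_frame(int frame_idx,
                                      th_ycbcr_buffer ycbcr);

    void analysis_pass(th_info *small_info, int total_frames,
                       int *is_keyframe, long *frame_bytes) {
      th_enc_ctx *enc = th_encode_alloc(small_info);
      ogg_packet  op;
      int         i;
      for (i = 0; i < total_frames; i++) {
        th_ycbcr_buffer ycbcr;
        read_downsampled_frame(i, ycbcr);
        th_encode_ycbcr_in(enc, ycbcr);
        /* libtheora emits at most one data packet per submitted
         * frame, so indexing by i is safe here. */
        while (th_encode_packetout(enc, i + 1 == total_frames, &op) > 0) {
          is_keyframe[i] = th_packet_iskeyframe(&op);
          frame_bytes[i] = (long)op.bytes;
        }
      }
      th_encode_free(enc);
    }

Segment boundaries could then be chosen at the recorded keyframes, and
frame_bytes[] summed per segment to bias the per-segment bit budgets.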