thr3ads.net - theora dev - [theora-dev] Multithread support [Feb 2015]

M. Pabis

2015-Feb-03 23:37 UTC

[theora-dev] Multithread support

Hi,

I recently had to encode a few hours of desktop stream with Theora, and I
noticed that it used only one core for encoding. Was I missing something? I
did not find any "thread" options.

As I dug, I found there was a multithread patch back in 2007, and some
ffmpeg2theora-multithread commits, but it looks like all of this was dropped.
Am I right?

If the multithreading encoding was dropped out, may I ask why?

I think I could dedicate some of my free time to bring multithreading to
the Theora encoder but I would like to ensure not to be redundant ;-)

-- 
Thanks in advance for answers.
Mateusz Pabis

Timothy B. Terriberry

2015-Feb-04 04:17 UTC

[theora-dev] Multithread support

M. Pabis wrote:
> If the multithreading encoding was dropped out, may I ask why?

IIRC, the commits from 2007 only threaded the motion search, and gave
gains of only 10 to 20%. However, as part of the work to improve the
Theora encoder quality for the HTML5 video tag, the way this search was
done was completely rewritten, and we got significantly more than 10 to
20% speed-ups just by having a smarter single-threaded algorithm.

> I think I could dedicate some of my free time to bring multithreading to
> the Theora encoder but I would like to ensure not to be redundant ;-)

I don't believe anyone has been working on this for some years. There
are two basic approaches.

One is threading within a single frame, which does not require any API 
behavior changes. In theory you can scale to a fairly decent number of 
threads everywhere except the final conversion from tokens to VLC codes 
in oc_enc_frame_pack(). However, the units of work are sufficiently 
small and the task dependencies sufficiently involved that this needs 
some kind of lock-free work-stealing queues to have a hope of getting 
more benefit from the parallelism than you pay in synchronization 
overhead. I'd started designing one with the hope that all memory 
allocations could be done up-front at encoder initialization (to avoid 
locking contention there), but this turns out to be sufficiently 
different from how most lock-free data structures worked at the time 
that it was a fair amount of work. I've been meaning to look at what 
Mozilla's Servo project is doing for this these days (since they have 
similar challenges).

The other is traditional FFmpeg-style frame threading, which gives each 
thread a separate frame to encode, and merely waits for enough rows of 
the previous frame to be finished so that it can start its motion 
search. This is generally much more effective than threading within a 
frame, but a) requires additional delay (the API supports this in 
theory, but software using that API might not expect it, so it would 
have to be enabled manually through some sort of th_encode_ctl call) and 
b) requires changes to the rate control to deal with the fact that 
statistics from the previous frame are not immediately available. b) was 
the real blocker here.

In every encoder I know of (for any format), the second approach is much 
more effective than the first.
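
To illustrate the row dependency in the frame-threading approach, here is a minimal sketch in C with pthreads. It is not libtheora code: `frame_state`, `wait_for_rows`, `publish_row`, `run_frame_threads` and the constants are all hypothetical names and values chosen for the example, and the actual encoding work is left as a comment. Each thread owns one frame and, before handling a row, waits until the previous frame has finished that row plus an assumed vertical search range. The rate-control problem from b) is not modelled at all.

```c
#include <assert.h>
#include <pthread.h>

#define NFRAMES     4   /* frames in flight, one thread each (illustrative) */
#define NROWS       16  /* block rows per frame (illustrative) */
#define SEARCH_ROWS 2   /* assumed vertical motion-search reach, in rows */

/* Hypothetical per-frame progress record; not part of libtheora. */
typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t  cond;
  int             rows_done; /* rows fully finished so far */
  int             index;     /* position of this frame in the sequence */
} frame_state;

static frame_state frames[NFRAMES];

/* Block until frame f has finished at least `need` rows.
   Returns the number of finished rows observed. */
static int wait_for_rows(frame_state *f, int need) {
  int seen;
  pthread_mutex_lock(&f->lock);
  while (f->rows_done < need) pthread_cond_wait(&f->cond, &f->lock);
  seen = f->rows_done;
  pthread_mutex_unlock(&f->lock);
  return seen;
}

/* Mark one more row of frame f as finished and wake any waiters. */
static void publish_row(frame_state *f) {
  pthread_mutex_lock(&f->lock);
  f->rows_done++;
  pthread_cond_broadcast(&f->cond);
  pthread_mutex_unlock(&f->lock);
}

static void *encode_frame(void *arg) {
  frame_state *f = arg;
  int row;
  for (row = 0; row < NROWS; row++) {
    if (f->index > 0) {
      /* The motion search for this row may reference rows up to
         SEARCH_ROWS below it in the previous frame. */
      int need = row + 1 + SEARCH_ROWS;
      int seen;
      if (need > NROWS) need = NROWS;
      seen = wait_for_rows(&frames[f->index - 1], need);
      assert(seen >= need);
      (void)seen;
    }
    /* Motion search, transform and tokenization of this row would go here. */
    publish_row(f);
  }
  return NULL;
}

/* Runs one thread per frame; returns the total number of rows finished,
   or -1 if a thread could not be created. */
int run_frame_threads(void) {
  pthread_t tid[NFRAMES];
  int i, started = 0, total = 0;
  for (i = 0; i < NFRAMES; i++) {
    pthread_mutex_init(&frames[i].lock, NULL);
    pthread_cond_init(&frames[i].cond, NULL);
    frames[i].rows_done = 0;
    frames[i].index = i;
  }
  for (i = 0; i < NFRAMES; i++) {
    if (pthread_create(&tid[i], NULL, encode_frame, &frames[i]) != 0) break;
    started++;
  }
  /* Frame 0 never waits, so any started prefix of the chain can finish. */
  for (i = 0; i < started; i++) pthread_join(tid[i], NULL);
  for (i = 0; i < NFRAMES; i++) {
    total += frames[i].rows_done;
    pthread_cond_destroy(&frames[i].cond);
    pthread_mutex_destroy(&frames[i].lock);
  }
  return started == NFRAMES ? total : -1;
}
```

Calling `run_frame_threads()` from a `main` (built with `-pthread`) should return `NFRAMES * NROWS`. A mutex and condition variable are used only to keep the sketch short; a real encoder would probably want something cheaper per row.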

M. Pabis

2015-Feb-04 10:48 UTC

[theora-dev] Multithread support

Hi, thanks for some

On Wed, Feb 4, 2015 at 5:17 AM, Timothy B. Terriberry <tterribe at vt.edu>
wrote:

> I don't believe anyone has been working on this for some years. There are
> two basic approaches.
>
> One is threading within a single frame, which does not require any API
> behavior changes. In theory you can scale to a fairly decent number of
> threads everywhere except the final conversion from tokens to VLC codes in
> oc_enc_frame_pack(). However, the units of work are sufficiently small and
> the task dependencies sufficiently involved that this needs some kind of
> lock-free work-stealing queues to have a hope of getting more benefit from
> the parallelism than you pay in synchronization overhead. I'd started
> designing one with the hope that all memory allocations could be done
> up-front at encoder initialization (to avoid locking contention there), but
> this turns out to be sufficiently different from how most lock-free data
> structures worked at the time that it was a fair amount of work. I've been
> meaning to look at what Mozilla's Servo project is doing for this these
> days (since they have similar challenges).
>
> The other is traditional FFmpeg-style frame threading, which gives each
> thread a separate frame to encode, and merely waits for enough rows of the
> previous frame to be finished so that it can start its motion search. This
> is generally much more effective than threading within a frame, but a)
> requires additional delay (the API supports this in theory, but software
> using that API might not expect it, so it would have to be enabled manually
> through some sort of th_encode_ctl call) and b) requires changes to the
> rate control to deal with the fact that statistics from the previous frame
> are not immediately available. b) was the real blocker here.

I have read the Theora Specification (from March 2011) and I have some more
ideas.

1. Each thread deals with the frames from one intra frame up to (but not
   including) the next intra frame.
2. Each thread deals with 1/n-th of the duration, and all outputs are
   concatenated at the end.
3. Maybe not multithreading as such, but parallel/vector computing: encoding
   one frame divided into small areas and processed with OpenCL or CUDA.

I'm aware these are rather naive approaches, mostly because they need
enough data up front. For 1., stream encoding would introduce some latency.
And with today's processor power encoding can already be done in real time,
so there would be no speed-up for streamed video. Maybe the extra time could
be spent finding better compression instead.

Well, 2. is totally naive. But if the whole video is available, the speed-up
should be almost linear.
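
A minimal sketch of how ideas 1. and 2. could be combined, in C: the frame range is split into contiguous chunks, one per worker, with every chunk starting on an intra frame so the chunks can be encoded independently. It assumes a fixed keyframe interval (`keyint`), which is a simplification for the example; `frame_range` and `split_into_chunks` are hypothetical names, not part of any Theora API, and the concatenation of the outputs is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical chunk descriptor: first and last frame index, inclusive. */
typedef struct {
  long first;
  long last;
} frame_range;

/* Split total_frames into at most nworkers contiguous chunks, each starting
   on a multiple of keyint (i.e. on an assumed intra frame). Writes the
   chunks to out[] and returns how many were written. */
int split_into_chunks(long total_frames, long keyint, int nworkers,
                      frame_range *out) {
  long ngops, gop = 0;
  int nchunks, i;
  assert(out != NULL);
  if (total_frames <= 0 || keyint <= 0 || nworkers <= 0) return 0;
  /* Number of intra-to-intra groups, counting a short final one. */
  ngops = (total_frames + keyint - 1) / keyint;
  nchunks = ngops < nworkers ? (int)ngops : nworkers;
  for (i = 0; i < nchunks; i++) {
    /* Spread the groups as evenly as possible over the chunks. */
    long gops_here = ngops / nchunks + (i < ngops % nchunks ? 1 : 0);
    long end;
    out[i].first = gop * keyint;
    gop += gops_here;
    end = gop * keyint;
    if (end > total_frames) end = total_frames;
    out[i].last = end - 1;
  }
  return nchunks;
}
```

For example, 1000 frames with `keyint` 64 and 4 workers give the chunks 0-255, 256-511, 512-767 and 768-999. Each chunk would start from a fresh rate-control state, which this sketch ignores.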

About 3.: well, it's vendor lock-in ;-) But hey, better this than nothing,
right? As this is a variation of concept #1 you described, CUDA and OpenCL
have efficient mechanisms to deal with synchronization, memory sharing, etc.
This approach would probably pay off more at higher resolutions. CUDA and/or
OpenCL could also be used for concept #2, with the same limitations,
unfortunately.

-- 
Best regards
Mateusz Pabis
