Hi, thanks for some
On Wed, Feb 4, 2015 at 5:17 AM, Timothy B. Terriberry <tterribe at vt.edu>
wrote:
> I don't believe anyone has been working on this for some years. There are
> two basic approaches.
>
> One is threading within a single frame, which does not require any API
> behavior changes. In theory you can scale to a fairly decent number of
> threads everywhere except the final conversion from tokens to VLC codes in
> oc_enc_frame_pack(). However, the units of work are sufficiently small and
> the task dependencies sufficiently involved that this needs some kind of
> lock-free work-stealing queues to have a hope of getting more benefit from
> the parallelism than you pay in synchronization overhead. I'd started
> designing one with the hope that all memory allocations could be done
> up-front at encoder initialization (to avoid locking contention there), but
> this turns out to be sufficiently different from how most lock-free data
> structures worked at the time that it was a fair amount of work. I've been
> meaning to look at what Mozilla's Servo project is doing for this these
> days (since they have similar challenges).
>
> The other is traditional FFmpeg-style frame threading, which gives each
> thread a separate frame to encode, and merely waits for enough rows of the
> previous frame to be finished so that it can start its motion search. This
> is generally much more effective than threading within a frame, but a)
> requires additional delay (the API supports this in theory, but software
> using that API might not expect it, so it would have to be enabled manually
> through some sort of th_encode_ctl call) and b) requires changes to the
> rate control to deal with the fact that statistics from the previous frame
> are not immediately available. b) was the real blocker here.
>
>
I have read the Theora Specification (from March 2011) and I have some more
ideas.
1. Each thread deals with the frames from one intra frame up to (but not
including) the next intra frame;
2. Each thread deals with 1/n-th of the duration, and all outputs are
finally concatenated.
3. Maybe not multithreading, but parallel/vector computing: encoding one
frame, divided into small areas and processed with OpenCL or CUDA.
I'm aware these are rather naive approaches, mostly because they need to
have enough data up front. For 1., stream encoding would introduce some
latency. And with today's processing power, encoding can already be done in
real time, so there is no speedup for streamed video. Maybe the spare time
could be spent on finding better compression.
Well, 2. is totally naive. But if the whole video is available, the speedup
should be almost linear.
About 3.: well, it's vendor lock-in ;-) But hey, better this than nothing,
right? As this is a variation of concept #1 you described, CUDA and OpenCL
have efficient mechanisms to deal with synchronization, memory sharing etc.
This approach would probably benefit from higher resolutions. CUDA and/or
OpenCL could also perform concept #2, with the same limitations
unfortunately.
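
To make idea 2 a bit more concrete, here is a rough sketch in Python. It is
only an illustration of the splitting and concatenation, not Theora code:
encode_chunk() is a hypothetical stand-in for a real encoder, and the only
thing it models is that every chunk has to start with an intra frame, since
inter frames depend on earlier frames.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(num_frames, num_workers):
    """Return half-open (start, end) ranges covering all frames in order."""
    base, extra = divmod(num_frames, num_workers)
    ranges, start = [], 0
    for i in range(num_workers):
        # The first `extra` chunks get one additional frame each.
        size = base + (1 if i < extra else 0)
        if size:
            ranges.append((start, start + size))
        start += size
    return ranges


def encode_chunk(frames):
    """Hypothetical stand-in for a real encoder (assumption, not libtheora).

    The first frame of each chunk is marked intra ("I"), the rest inter ("P").
    """
    return [("I" if i == 0 else "P", frame) for i, frame in enumerate(frames)]


def encode_parallel(frames, num_workers):
    """Encode each chunk independently, then concatenate in frame order."""
    ranges = split_into_chunks(len(frames), num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() returns results in submission order, so the concatenated
        # output keeps the original frame order.
        parts = pool.map(lambda r: encode_chunk(frames[r[0]:r[1]]), ranges)
        return [packet for part in parts for packet in part]


if __name__ == "__main__":
    for frame_type, frame in encode_parallel(list(range(10)), 3):
        print(frame_type, frame)
```

Threads are used here only to keep the sketch short; a real encoder would
need native threads or separate processes to run in parallel. The extra
intra frame at every chunk boundary also costs some compression.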
--
Best regards
Mateusz Pabis