It shows that as the MCU size increases, the OpenMP extra overhead is
amortized and OpenMP becomes as fast as the pthreads implementation.
The last chart
(http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/systime.png)
shows that both the pthreads and the OpenMP overhead decrease as what
seems to be a logarithmic function of the MCU size.
This was a great experiment, and from what I can conclude, the OpenMP
implementation can be as good as the pthreads one.
Therefore it may be worth working on an OpenMP implementation because
of the ease of code maintenance.
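
Just to illustrate the maintenance argument, here is a rough sketch of
what the OpenMP version of a per-MCU-row loop could look like (the
names decode_mcu_row and nrows are made up for the example, they are
not the real decoder code; a pthreads equivalent needs explicit thread
creation, argument structs and joins):

  /* Sketch only: decode_mcu_row() stands in for the real per-row work
   * (DC unprediction, reconstruction, loop filter).
   * Build with: gcc -fopenmp sketch.c */
  #include <stdio.h>

  static void decode_mcu_row(int row){
    printf("decoded MCU row %d\n", row);
  }

  int main(void){
    int nrows = 64;
    int row;
    /* One pragma distributes the MCU rows across the available cores. */
    #pragma omp parallel for schedule(static)
    for(row = 0; row < nrows; row++) decode_mcu_row(row);
    return 0;
  }
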
What do you guys think?
Cheers,
Felipe
On Mon, Jul 14, 2008 at 1:04 AM, Leonardo de Paula Rosa Piga
<lpiga at terra.com.br> wrote:
> I forgot to send the link for the last graph
> (http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/systime.png)
>
> On Mon, Jul 14, 2008 at 1:03 AM, Leonardo de Paula Rosa Piga
> <lpiga at terra.com.br> wrote:
>>
>> Hi Timothy, below are some new and good results.
>>
>> On Mon, Jul 7, 2008 at 1:52 AM, Timothy B. Terriberry
>> <tterribe at email.unc.edu> wrote:
>>>
>>> Leonardo de Paula Rosa Piga wrote:
>>> > Coarse grain functions are the best functions to parallelize to
>>> > make the overhead worthwhile, but the current implementation has
>>> > one, at most two. The parts that I suggested in my initial plan
>>> > are fine
>>>
>>> The reason the current decoder does this is cache coherency. The idea is
>>> that only a few (16 to 32) rows need to be kept in L1/L2 cache between
>>> each stage of the pipeline, which is a big reason the current decoder is
>>> as fast as it is on high resolution content.
>>>
>>> It's possible to break this pipeline back up into separate stages that
>>> operate on the entire frame at once (e.g., just make the MCU size the
>>> height of the frame). You lose cache coherency, but get coarse-grained
>>> parallelism. Only testing will determine which is the better strategy.
>>
>> You are right! You gave me a great tip. I did some tests for different
>> MCU sizes. The MCU size in the current implementation is 8.
>> For MCU sizes >= 16, the pthread and OpenMP implementations produce the
>> same results, that is, a speedup of 13% on average. The time spent on
>> thread communication was reduced.
>>
>> I plotted three graphs to show these facts.
>> The first is real time vs. MCU size. This graph shows that for MCU
>> sizes >= 16 the pthread and OpenMP implementations are equivalent.
>> (http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/comparison.png)
>>
>> The second graph compares the speedups and shows that for coarse
>> grain functions we can achieve better results.
>> (http://lampiao.lsc.ic.unicamp.br/~piga/gsoc_2008/speedup.png)
>>
>> And to conclude, the third graph plots system time vs. MCU size. The
>> greater the MCU size, the lower the system time, because the thread
>> communication overhead was reduced.
>>
>>>
>>>
>>> > the decoding time should be reduced by 33%, but it was just 10% for
>>> > pthread and 5% for OpenMP.
>>>
>>> The chroma coefficients are usually quantized much more coarsely, so
>>> they very likely don't account for a full 33% of the decode time even on
>>> a uniprocessor. Fewer coded blocks and fewer tokens to unpack in the
>>> blocks that are coded means fewer and smaller iDCTs, fewer invocations
>>> of the loop filter, etc.
>>>
>>> It's sad that OpenMP didn't do better... I was hoping with the option
>>> available to them to do platform-specific tricks, they could cut down on
>>> the overhead of pthreads, but I guess that stuff's just not "there" yet.
>>
>> The results above show that this is not the case. For coarse grain
>> jobs they are equivalent.
>>
>>>
>>>
>>> > This version will run the functions (oc_dec_dc_unpredict_mcu_plane +
>>> > oc_dec_frags_recon_mcu_plane) and
>>> > (oc_state_loop_filter_frag_rows + oc_state_borders_fill_rows) in
>>> > parallel. The upper bound for the gain is 60%, that is, let T2 be the
>>> > time to decode a video with the pipelined implementation. T2 should be
>>> > at most 0.4To.
>>>
>>> I think you mean "at least". Let us know what your test results look
>>> like (good or bad)! Keep in mind that, if possible, the same thread
>>> that does oc_dec_dc_unpredict_mcu_plane+oc_dec_frags_recon_mcu_plane
>>> on a set of blocks should also be the one to do
>>> oc_state_loop_filter_frag_rows+oc_state_borders_fill_rows on the same
>>> set of blocks (and hopefully the scheduler doesn't muck things up by
>>> moving the thread to a different physical CPU inbetween).
>>>
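
(Side note on the affinity point above: on Linux, one way to keep each
worker on the same physical CPU would be pthread_setaffinity_np(). The
code below is only a glibc-specific sketch, not part of the current
implementation; the worker body is a placeholder for the real
per-block-set work.)

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>

  /* Placeholder for oc_dec_dc_unpredict_mcu_plane +
   * oc_dec_frags_recon_mcu_plane followed by the loop filter on the
   * same set of blocks. */
  static void *worker(void *arg){
    printf("worker %d running\n", *(int *)arg);
    return NULL;
  }

  int main(void){
    pthread_t th[2];
    int ids[2] = {0, 1};
    int i;
    for(i = 0; i < 2; i++){
      cpu_set_t cpus;
      CPU_ZERO(&cpus);
      CPU_SET(i, &cpus);
      pthread_create(&th[i], NULL, worker, &ids[i]);
      /* Best effort: errors are ignored in this sketch. */
      pthread_setaffinity_np(th[i], sizeof(cpus), &cpus);
    }
    for(i = 0; i < 2; i++) pthread_join(th[i], NULL);
    return 0;
  }
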
>>
>>
>>
>> --
>> Leonardo de Paula Rosa Piga
>> Undergraduate Computer Engineering Student
>> LSC - IC - UNICAMP
>> http://lampiao.lsc.ic.unicamp.br/~piga
>
>
> --
> Leonardo de Paula Rosa Piga
> Undergraduate Computer Engineering Student
> LSC - IC - UNICAMP
> http://lampiao.lsc.ic.unicamp.br/~piga
>
>
--
________________________________________
Felipe Portavales Goldstein <portavales at gmail>
Undergraduate Student - IC-UNICAMP
Computer Systems Laboratory
http://lampiao.lsc.ic.unicamp.br/~portavales/