thr3ads.net - theora dev - [theora-dev] SSE2 assembly support [Feb 2010]

Kay Tiong Khoo

2010-Feb-10 07:46 UTC

[theora-dev] SSE2 assembly support

Hi,

I've read a thread in the mailing list about theora's poor
encoding speed which makes it an unsuitable candidate for live video
applications.

To improve the codec performance, I would like to volunteer to add SSE2 assembly
support via C/C++ intrinsics, which should also improve portability. As a first
step, I'll be converting the existing MMX and MMXEXT assembly code.

Is anyone else working on this? If this sounds good, I'll start work
soon.

Thanks.
----
Kay Khoo
RotateRight, LLC

Kay Tiong Khoo

2010-Feb-10 09:19 UTC

If the optimizations are done using C/C++ intrinsics, it will be easier to make
the code 64-bit safe as the compiler can account for the difference.

To answer your question - yes, I intend for the optimizations to benefit both
32-bit and 64-bit platforms and work across Windows, Mac OS X and Linux.

----
Kay Khoo
RotateRight, LLC

On Feb 10, 2010, at 4:55 PM, ZikZak wrote:
> Hi,
>
> That sounds great, I'm not a coder only a user and Theora is very
> useful for my needs.
> I'm wondering if your optimization will have any benefit for
> 64bits machine.
> 
> Regards
> --
> ZikZak

Benjamin M. Schwartz

2010-Feb-10 13:29 UTC

Kay Tiong Khoo wrote:
> To improve the codec performance, I would like to volunteer to add SSE2
> assembly support

This would be a great and welcome contribution!

> via C/C++ intrinsics, which should also improve portability.

libtheora doesn't currently use intrinsics, but that's ok. If you provide
a patch with intrinsics that can speed up libtheora, I'm sure it will be
accepted, even if the maintainers decide to recode it as inline assembly.

> As a first step, I'll be converting the existing MMX and MMXEXT
> assembly code.

Converting it to SSE2 would be interesting.

> Is anyone else working on this?

I don't think so.

> If this sounds good, I'll start work soon.

Sounds good to me!

--Ben

Timothy B. Terriberry

2010-Feb-10 23:30 UTC

There is some room for SSE2 optimizations (I just committed some earlier
today), but right now the slowest functions in the encoder are all in C.
  A few of these could benefit from SIMD, but algorithmic optimizations
will be both easier and give bigger performance improvements. Many of
the existing SIMD functions operate on 8x8 blocks, and so MMX is
generally enough to extract the maximum amount of parallelism.
Restructuring things to operate on larger blocks when possible is a good
idea, but a lot more work.

Finally, I am not generally a fan of intrinsics because a) their
portability is overrated and b) last I checked, compilers generate
horrible code from them. The current inline asm already works for 32-bit
and 64-bit platforms, except on Windows, but that is MSVC's fault.
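
To illustrate the register-width point: an 8-pixel row of 8-bit samples fills
exactly one 64-bit MMX register, so a 128-bit SSE2 register is only fully used
when two rows are packed together. The following is a minimal sketch using SSE2
intrinsics, not libtheora code. It computes a plain sum of absolute differences
(SAD) as a simpler stand-in for the SATD functions discussed in this thread, and
the names sad8x8_c and sad8x8_sse2 are hypothetical. It assumes a compiler that
provides <emmintrin.h> and falls back to the scalar version otherwise.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__SSE2__) || defined(_M_X64)
# include <emmintrin.h>
# define SKETCH_HAVE_SSE2 1
#else
# define SKETCH_HAVE_SSE2 0
#endif

/* Scalar reference: sum of absolute differences over an 8x8 block. */
static unsigned sad8x8_c(const uint8_t *src, const uint8_t *ref,
                         ptrdiff_t stride) {
  unsigned sad = 0;
  for (int y = 0; y < 8; y++) {
    for (int x = 0; x < 8; x++) sad += (unsigned)abs(src[x] - ref[x]);
    src += stride;
    ref += stride;
  }
  return sad;
}

#if SKETCH_HAVE_SSE2
/* SSE2 version (hypothetical sketch): handles two 8-byte rows per iteration. */
static unsigned sad8x8_sse2(const uint8_t *src, const uint8_t *ref,
                            ptrdiff_t stride) {
  __m128i acc = _mm_setzero_si128();
  for (int y = 0; y < 8; y += 2) {
    /* A row is only 8 bytes wide, so pack rows y and y+1 into one register. */
    __m128i s = _mm_unpacklo_epi64(
        _mm_loadl_epi64((const __m128i *)src),
        _mm_loadl_epi64((const __m128i *)(src + stride)));
    __m128i r = _mm_unpacklo_epi64(
        _mm_loadl_epi64((const __m128i *)ref),
        _mm_loadl_epi64((const __m128i *)(ref + stride)));
    /* PSADBW yields one partial sum in each 64-bit half. */
    acc = _mm_add_epi32(acc, _mm_sad_epu8(s, r));
    src += 2 * stride;
    ref += 2 * stride;
  }
  /* Fold the upper half onto the lower half and extract the total. */
  acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));
  return (unsigned)_mm_cvtsi128_si32(acc);
}
#else
/* No SSE2 available at compile time: reuse the scalar version. */
static unsigned sad8x8_sse2(const uint8_t *src, const uint8_t *ref,
                            ptrdiff_t stride) {
  return sad8x8_c(src, ref, stride);
}
#endif
```

The loop runs four times instead of eight, but half of each iteration is spent
packing rows, which is one reason a straight MMX-to-SSE2 port of 8x8 code may
gain less than the doubled register width suggests.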

Kay Tiong Khoo

2010-Feb-11 08:58 UTC

Hi,

Thanks for all the info and advice.
 
I profiled example_encoder with a statistical sampler while it encoded the
deadline_cif.y4m media file. Below are the top 10 functions sorted by "Self"
samples. "Self" samples occur in the symbol itself; "Total" samples occur in
the symbol or its children.

OS: CentOS release 5.4 (Final) 2.6.18-164.el5
Processor: 4 x 2.40GHz Intel Core 2

      Self      Total Symbol 
     22.7%      22.7% oc_analyze_mb_mode_luma 
     16.0%      16.0% oc_enc_frag_satd2_thresh_mmxext
     13.0%      13.0% oc_enc_frag_satd_thresh_mmxext 
     12.7%      12.7% oc_enc_tokenize_ac 
      5.7%      22.3% oc_enc_block_transform_quantize 
      5.0%       5.0% oc_analyze_mb_mode_chroma 
      4.0%      95.4% oc_enc_analyze_inter 
      2.7%       7.0% oc_mcenc_search_frame 
      2.6%       2.6% oc_enc_fdct8x8_mmx 
      1.7%      33.4% oc_cost_inter  

The encoder was compiled with:

CFLAGS="-Wall -Wno-parentheses -g -O3 -fforce-addr -fno-omit-frame-pointer
-finline-functions -funroll-loops"

The profile concurs with Timothy's assessment. The optimized MMX
functions account for ~30% of the samples, so the room for improvement by
conversion to SSE2 is limited. I will try some opportunistic optimizations
before starting on the conversion work.
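
The limit can be made concrete with Amdahl's law. The three SIMD functions in
the table above account for 16.0% + 13.0% + 2.6% = 31.6% of the samples. The
sketch below assumes, purely for illustration, that an SSE2 port would make
those functions twice as fast; that factor is hypothetical, not a measurement,
and the function name amdahl_speedup is likewise made up for this example.

```c
/* Amdahl's law: overall speedup when a fraction p of the run time
   (0 <= p <= 1) is accelerated by a factor s (s > 0). */
static double amdahl_speedup(double p, double s) {
  return 1.0 / ((1.0 - p) + p / s);
}
```

With p = 0.316 and the assumed s = 2, amdahl_speedup returns roughly 1.19, so
the whole encode would run about 19% faster. Even if the SIMD functions took
no time at all, the speedup could not exceed 1 / (1 - 0.316), about 1.46.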

----
Kay Khoo
RotateRight, LLC

On Feb 11, 2010, at 7:30 AM, Timothy B. Terriberry wrote:
> There is some room for SSE2 optimizations (I just committed some earlier
> today), but right now the slowest functions in the encoder are all in C.
>  A few of these could benefit from SIMD, but algorithmic optimizations
> will be both easier and give bigger performance improvements. Many of
> the existing SIMD functions operate on 8x8 blocks, and so MMX is
> generally enough to extract the maximum amount of parallelism.
> Restructuring things to operate on larger blocks when possible is a good
> idea, but a lot more work.
> 
> Finally, I am not generally a fan of intrinsics because a) their
> portability is overrated and b) last I checked, compilers generate
> horrible code from them. The current inline asm already works for 32-bit
> and 64-bit platforms, except on Windows, but that is MSVC's fault.

Timothy B. Terriberry

2010-Feb-11 09:10 UTC

Kay Tiong Khoo wrote:
> The profile concurs with Timothy's assessment. The optimized MMX
> functions account for ~30% of the samples, so the room for improvement by
> conversion to SSE2 is limited. I will try some opportunistic optimizations
> before starting on the conversion work.

Make sure you are working from the current 1.2 development branch:
http://svn.xiph.org/experimental/derf/theora-ptalarbvorm/

On x86-64, this should be using SSE2 already for SATD (your profile
shows the MMXEXT versions). It still uses MMXEXT SATD on x86-32 because
the SSE2 versions profiled as slower on an actual 32-bit processor
(where each instruction often requires multiple clocks).
