Hi,

I've read a thread on the mailing list about Theora's poor encoding speed, which makes it an unsuitable candidate for live video applications. To improve the codec's performance, I would like to volunteer to add SSE2 assembly support via C/C++ intrinsics, which should also improve portability. As a first step, I'll be converting the existing MMX and MMXEXT assembly code.

Is anyone else working on this? If this sounds good, I'll start work soon.

Thanks.

----
Kay Khoo
RotateRight, LLC
If the optimizations are done using C/C++ intrinsics, it will be easier to make the code 64-bit safe, as the compiler can account for the difference. To answer your question: yes, I intend for the optimizations to benefit both 32-bit and 64-bit platforms and to work across Windows, Mac OS X and Linux.

----
Kay Khoo
RotateRight, LLC

On Feb 10, 2010, at 4:55 PM, ZikZak wrote:
> Hi,
>
> That sounds great, I'm not a coder only a user and Theora is very useful for my needs.
> I'm wondering if your optimization will have any benefit for 64bits machine.
>
> Regards
> --
> ZikZak
Kay Tiong Khoo wrote:
> To improve the codec performance, I would like to volunteer to add SSE2 assembly support

This would be a great and welcome contribution!

> via C/C++ intrinsics, which should also improve portability.

libtheora doesn't currently use intrinsics, but that's ok. If you provide a patch with intrinsics that can speed up libtheora, I'm sure it will be accepted, even if the maintainers decide to recode it as inline assembly.

> As a first step, I'll be converting the existing MMX and MMXEXT assembly code.

Converting it to SSE2 would be interesting.

> Is anyone else working on this?

I don't think so.

> If this sounds good, I'll start work soon.

Sounds good to me!

--Ben
There is some room for SSE2 optimizations (I just committed some earlier today), but right now the slowest functions in the encoder are all in C. A few of these could benefit from SIMD, but algorithmic optimizations will be both easier and give bigger performance improvements. Many of the existing SIMD functions operate on 8x8 blocks, and so MMX is generally enough to extract the maximum amount of parallelism. Restructuring things to operate on larger blocks when possible is a good idea, but a lot more work.

Finally, I am not generally a fan of intrinsics because a) their portability is overrated and b) last I checked, compilers generate horrible code from them. The current inline asm already works for 32-bit and 64-bit platforms, except on Windows, but that is MSVC's fault.
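To make the point about 8x8 blocks concrete, here is a minimal sketch of an 8x8 sum of absolute differences written with SSE2 intrinsics. It is not code from libtheora; the names sad8x8, sad8x8_c, sad8x8_sse2 and load_two_rows are hypothetical. One row of eight 8-bit pixels already fills a 64-bit MMX register, so the 128-bit version has to pack two rows into each register before it gains anything. The sketch assumes 8-bit samples and falls back to plain C when SSE2 is not enabled at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#if defined(__SSE2__) || defined(_M_X64)
# include <emmintrin.h>
# define SKETCH_HAVE_SSE2 1
#endif

/* Plain C reference: sum of absolute differences over an 8x8 block. */
unsigned sad8x8_c(const unsigned char *src, const unsigned char *ref,
                  ptrdiff_t stride) {
  unsigned sad = 0;
  int x, y;
  for (y = 0; y < 8; y++) {
    for (x = 0; x < 8; x++) sad += (unsigned)abs(src[x] - ref[x]);
    src += stride;
    ref += stride;
  }
  return sad;
}

#if defined(SKETCH_HAVE_SSE2)
/* Packs two consecutive 8-byte rows into one 128-bit register. */
static __m128i load_two_rows(const unsigned char *p, ptrdiff_t stride) {
  __m128i lo = _mm_loadl_epi64((const __m128i *)p);
  __m128i hi = _mm_loadl_epi64((const __m128i *)(p + stride));
  return _mm_unpacklo_epi64(lo, hi);
}

static unsigned sad8x8_sse2(const unsigned char *src,
                            const unsigned char *ref, ptrdiff_t stride) {
  __m128i acc = _mm_setzero_si128();
  int y;
  for (y = 0; y < 8; y += 2) {
    __m128i a = load_two_rows(src, stride);
    __m128i b = load_two_rows(ref, stride);
    /* PSADBW produces one partial sum per 64-bit half (one per row). */
    acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));
    src += 2 * stride;
    ref += 2 * stride;
  }
  /* Fold the high half onto the low half and extract the total. */
  acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
  return (unsigned)_mm_cvtsi128_si32(acc);
}
#endif

/* Hypothetical entry point: uses SSE2 when compiled in, else plain C. */
unsigned sad8x8(const unsigned char *src, const unsigned char *ref,
                ptrdiff_t stride) {
  assert(src != NULL && ref != NULL);
#if defined(SKETCH_HAVE_SSE2)
  return sad8x8_sse2(src, ref, stride);
#else
  return sad8x8_c(src, ref, stride);
#endif
}
```

The SSE2 loop runs four iterations instead of eight, but each iteration needs two loads and an unpack per operand, which is one reason a straight MMX-to-SSE2 conversion of 8x8 code may gain less than the doubled register width suggests.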
Hi,

Thanks for all the info and advice. I took a profile using a statistical sampler of the example_encoder performing an encode of the deadline_cif.y4m media file. Below are the top 10 functions sorted by "Self" samples. "Total" samples occur in the symbol or its children.

OS: CentOS release 5.4 (Final) 2.6.18-164.el5
Processor: 4 x 2.40GHz Intel Core 2

 Self   Total  Symbol
22.7%   22.7%  oc_analyze_mb_mode_luma
16.0%   16.0%  oc_enc_frag_satd2_thresh_mmxext
13.0%   13.0%  oc_enc_frag_satd_thresh_mmxext
12.7%   12.7%  oc_enc_tokenize_ac
 5.7%   22.3%  oc_enc_block_transform_quantize
 5.0%    5.0%  oc_analyze_mb_mode_chroma
 4.0%   95.4%  oc_enc_analyze_inter
 2.7%    7.0%  oc_mcenc_search_frame
 2.6%    2.6%  oc_enc_fdct8x8_mmx
 1.7%   33.4%  oc_cost_inter

The encoder was compiled with:
CFLAGS="-Wall -Wno-parentheses -g -O3 -fforce-addr -fno-omit-frame-pointer -finline-functions -funroll-loops"

The profile concurs with Timothy's assessment. The optimized MMX functions account for ~30% of the samples, so the room for improvement by conversion to SSE2 is limited. I will try some opportunistic optimizations before starting on the conversion work.

----
Kay Khoo
RotateRight, LLC

On Feb 11, 2010, at 7:30 AM, Timothy B. Terriberry wrote:
> There is some room for SSE2 optimizations (I just committed some earlier
> today), but right now the slowest functions in the encoder are all in C.
> A few of these could benefit from SIMD, but algorithmic optimizations
> will be both easier and give bigger performance improvements. Many of
> the existing SIMD functions operate on 8x8 blocks, and so MMX is
> generally enough to extract the maximum amount of parallelism.
> Restructuring things to operate on larger blocks when possible is a good
> idea, but a lot more work.
>
> Finally, I am not generally a fan of intrinsics because a) their
> portability is overrated and b) last I checked, compilers generate
> horrible code from them. The current inline asm already works for 32-bit
> and 64-bit platforms, except on Windows, but that is MSVC's fault.
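The "room for improvement is limited" conclusion can be checked with Amdahl's law. The sketch below is only an illustration: it assumes the ~30% figure is the sum of the three MMX/MMXEXT rows in the profile (16.0% + 13.0% + 2.6% = 31.6%), and the 2x speedup used in the usage note is a hypothetical value, not a measurement. The function names are made up for this example.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when `fraction` of the run time
   (0..1) is made `factor` times faster and the rest is unchanged. */
double overall_speedup(double fraction, double factor) {
  assert(fraction >= 0.0 && fraction <= 1.0);
  assert(factor > 0.0);
  return 1.0 / ((1.0 - fraction) + fraction / factor);
}

/* Upper bound on the overall speedup: the accelerated fraction
   is assumed to cost nothing at all. */
double speedup_limit(double fraction) {
  assert(fraction >= 0.0 && fraction < 1.0);
  return 1.0 / (1.0 - fraction);
}
```

With fraction = 0.316, overall_speedup(0.316, 2.0) comes to about 1.19, and speedup_limit(0.316) to about 1.46. So even if the SIMD functions became free, the encoder as profiled would run less than 1.5x faster, which is consistent with looking at the C functions first.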
Kay Tiong Khoo wrote:
> The profile concurs with Timothy's assessment. The optimized MMX functions account for ~30% of the samples, so the room for improvement by conversion to SSE2 is limited. I will try some opportunistic optimizations before starting on the conversion work.

Make sure you are working from the current 1.2 development branch:
http://svn.xiph.org/experimental/derf/theora-ptalarbvorm/

On x86-64, this should be using SSE2 already for SATD (your profile shows the MMXEXT versions). It still uses MMXEXT SATD on x86-32 because the SSE2 versions profiled as slower on an actual 32-bit processor (where each SSE2 instruction often requires multiple clocks).
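The selection policy described above can be sketched as a small decision function. This is an illustration only, not how libtheora is actually structured: the enum, the function name and the flag parameters are hypothetical, and the plain C fallback for processors without MMXEXT is an assumption of the sketch.

```c
/* Hypothetical labels for the available SATD implementations. */
typedef enum {
  SATD_IMPL_C,      /* portable C fallback (assumed) */
  SATD_IMPL_MMXEXT, /* 64-bit MMX registers */
  SATD_IMPL_SSE2    /* 128-bit XMM registers */
} satd_impl;

/* Picks an implementation from the target word size and CPU features.
   SSE2 is chosen only on x86-64; on x86-32 the MMXEXT version is
   preferred even when SSE2 is present, because it profiled faster. */
satd_impl choose_satd_impl(int is_64bit, int has_mmxext, int has_sse2) {
  if (is_64bit && has_sse2) return SATD_IMPL_SSE2;
  if (has_mmxext) return SATD_IMPL_MMXEXT;
  return SATD_IMPL_C;
}
```

For example, choose_satd_impl(0, 1, 1) yields SATD_IMPL_MMXEXT, which would match a 32-bit build showing the MMXEXT SATD symbols in a profile.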