Sukhomlinov, Vadim
2009-Oct-13 13:14 UTC
[theora-dev] Proposal for replacing asm code with intrinsics
Hi,

I'm new to Theora and would like to propose several performance optimizations using the newer instruction sets in x86 CPUs (SSE2 through SSE4.2).
Several source files in \x86 and \x86_vc were developed using inline assembler. This causes several maintenance problems:
1) The gcc and msvc versions need to be kept in sync
2) Only the 32-bit environment is supported
3) There is no support for instruction sets newer than MMX

My proposal is to replace all the assembly functions with compiler intrinsics, each of which compiles into 1-2 assembly instructions and is much easier to maintain.
For example:
    _mm_sad_epu8(__m128i, __m128i) will be compiled into a PSADBW instruction with compiler-allocated registers.

And code like:
    psadbw mm4,mm5
    paddw mm0,mm4

can be rewritten as:
    __m64 mm0, mm4, mm5, mm6, mm7; //of course using meaningful names
    mm0 = _mm_add_pi16(mm0, _mm_sad_pu8(mm4, mm5));
The compiler will replace the variables with actual registers, ensuring better allocation and scheduling of them.

So, the benefits are:
1) Code that is easier to read and understand, and that can use the same variable names as the generic C version
2) A single source for gcc, msvc and the Intel compiler (all of them support the same syntax)
3) Easier migration to SSE2 (which handles 128 bits vs. 64 with MMX) by replacing __m64 with __m128i
4) 64-bit code generation support
5) The compiler can reschedule instructions for the target CPU to deliver better performance without manual tuning. In the past I ran several tests in which I replaced high-quality, manually optimized assembly with intrinsics, and this resulted in 3-5% better performance with the Intel compiler. In any case, I don't expect any performance issues with it.

It will require some changes to the project structure and makefiles, and I'm not sure if this is OK - at least I don't know how to coordinate work on Theora with other developers. Could you please help me here?
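As a rough illustration of the rewrite proposed above, here is a minimal sketch of a sum-of-absolute-differences routine built on the SSE2 form of PSADBW. It assumes an x86 target with SSE2 (baseline on x86-64); the function names sad_c and sad_sse2 are hypothetical and do not come from the Theora sources. It accumulates with 64-bit adds rather than the paddw of the MMX fragment, simply to keep the sketch free of overflow concerns.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Plain C reference: sum of absolute differences over n bytes. */
static unsigned sad_c(const uint8_t *a, const uint8_t *b, size_t n) {
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) sum += (unsigned)abs(a[i] - b[i]);
    return sum;
}

/* SSE2 version (hypothetical helper); n must be a multiple of 16. */
static unsigned sad_sse2(const uint8_t *a, const uint8_t *b, size_t n) {
    assert(n % 16 == 0);
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 16) {
        /* Unaligned loads; the compiler picks the registers. */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* PSADBW produces two partial sums, one per 64-bit half. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
    }
    /* Fold the high half onto the low half and extract the total. */
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return (unsigned)_mm_cvtsi128_si32(acc);
}
```

The same source should build with gcc, clang, msvc and the Intel compiler, since all of them ship emmintrin.h. Whether the generated code matches hand-written assembly is a separate question that only benchmarking can answer.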
Thanks in advance,
Vadim

-----Original Message-----
From: theora-dev-bounces at xiph.org [mailto:theora-dev-bounces at xiph.org] On Behalf Of theora-dev-request at xiph.org
Sent: Thursday, October 08, 2009 11:00 PM
To: theora-dev at xiph.org
Subject: theora-dev Digest, Vol 65, Issue 2

Today's Topics:

1. Possible inefficiency in encode.c (Chris Cooksey)
2. Re: Possible inefficiency in encode.c (Timothy B. Terriberry)

----------------------------------------------------------------------

Message: 1
Date: Wed, 07 Oct 2009 17:40:43 -0400
From: Chris Cooksey <chriscooksey at gmail.com>
Subject: [theora-dev] Possible inefficiency in encode.c
To: <theora-dev at xiph.org>
Message-ID: <C6F2831B.28AF4%chriscooksey at gmail.com>
Content-Type: text/plain; charset="US-ASCII"

Hi,

I am very new to Theora, having just started working through the code a few weeks ago. I am working on a requantization tool to reduce bit rates, hopefully on the fly, for some video conferencing work.

As I was working through the encoding phase I noticed this line in encode.c:

    for(ti=_enc->dct_token_offs[pli][zzi];ti<ndct_tokens;ti++){

It's around line 804, but I am working with 1.1b3 sources so it may have moved a bit. Anyway, I am thinking that this line might be an adequate substitute:

    for(ti=0;ti<ndct_tokens;ti++){

Because the tokens are now stored in separate per plane arrays instead of all strung together in one big array like they used to be. I presume the point of doing that was to eliminate the need for dct_token_offs altogether.
I see dct_token_offs being used in a couple of other places too. I could be wrong of course. Please don't beat this neophyte up if I am :-)

Thanks,
Chris.

------------------------------

Message: 2
Date: Wed, 07 Oct 2009 23:37:18 -0400
From: "Timothy B. Terriberry" <tterribe at email.unc.edu>
Subject: Re: [theora-dev] Possible inefficiency in encode.c
To: theora-dev at xiph.org
Message-ID: <4ACD5E6E.8040705 at email.unc.edu>
Content-Type: text/plain; charset=ISO-8859-1

Chris Cooksey wrote:
> Because the tokens are now stored in separate per plane arrays instead of
> all strung together in one big array like they used to be. I presume the
> point of doing that was to eliminate the need for dct_token_offs altogether.

The actual point was so that the token lists could be filled in a different order than the one in which they will appear in the bitstream. However, one of the consequences of this is that EOB runs cannot span lists, even though the bitstream allows it. This is fixed up after tokenization, before packing the tokens into the packet, in oc_enc_tokenize_finish(). What this means is that sometimes the first token in the list must be skipped, because it was an EOB run that has actually been merged with the last token in a different list. dct_token_offs[][] marks which lists need to skip such a token (i.e., it's always either 0 or 1).

It would actually probably be faster to keep things in a single contiguous array, with offsets to the individual lists, just because it would remove an extra indirection that C compilers generally do a poor job of optimizing. We did this in the decoder, and it did provide a small speed-up. I just never got around to doing it in the encoder.
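To make the two ideas in the reply above concrete, here is a small sketch of token lists stored back to back in one contiguous array, with per-list start offsets and a 0-or-1 skip flag for a first token that was merged into an earlier list. The type token_store, its fields, and pack_tokens are hypothetical names invented for this illustration; this is not libtheora's actual data layout.

```c
#include <stddef.h>

enum { NLISTS = 3 };  /* hypothetical number of token lists */

typedef struct {
    const unsigned char *tokens;  /* all lists stored back to back */
    size_t start[NLISTS + 1];     /* list i occupies [start[i], start[i+1]) */
    unsigned char skip[NLISTS];   /* 1 if the first token of list i was an
                                     EOB run merged into an earlier list */
} token_store;

/* Copy every token that still has to be written, in list order.
   Returns the number of tokens stored in out. */
static size_t pack_tokens(const token_store *ts, unsigned char *out) {
    size_t n = 0;
    for (int li = 0; li < NLISTS; li++) {
        /* Begin one token later when the skip flag is set. */
        for (size_t ti = ts->start[li] + ts->skip[li]; ti < ts->start[li + 1]; ti++)
            out[n++] = ts->tokens[ti];
    }
    return n;
}
```

With a single base pointer plus offsets, the inner loop indexes one array instead of first fetching a per-list pointer, which is the extra indirection the reply mentions.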
j at v2v.cc
2009-Oct-13 13:59 UTC
[theora-dev] Proposal for replacing asm code with intrinsics
Sukhomlinov, Vadim wrote:
> Hi,
>
> I'm new to Theora and would like to propose several performance optimizations using the newer instruction sets in x86 CPUs (SSE2 through SSE4.2).
> Several source files in \x86 and \x86_vc were developed using inline assembler. This causes several maintenance problems:
> 1) The gcc and msvc versions need to be kept in sync
> 2) Only the 32-bit environment is supported
> 3) There is no support for instruction sets newer than MMX
>
> My proposal is to replace all the assembly functions with compiler intrinsics, each of which compiles into 1-2 assembly instructions and is much easier to maintain.
> For example:
>     _mm_sad_epu8(__m128i, __m128i) will be compiled into a PSADBW instruction with compiler-allocated registers.
>
> And code like:
>     psadbw mm4,mm5
>     paddw mm0,mm4
>
> can be rewritten as:
>     __m64 mm0, mm4, mm5, mm6, mm7; //of course using meaningful names
>     mm0 = _mm_add_pi16(mm0, _mm_sad_pu8(mm4, mm5));
> The compiler will replace the variables with actual registers, ensuring better allocation and scheduling of them.
> So, the benefits are:
> 1) Code that is easier to read and understand, and that can use the same variable names as the generic C version
> 2) A single source for gcc, msvc and the Intel compiler (all of them support the same syntax)
> 3) Easier migration to SSE2 (which handles 128 bits vs. 64 with MMX) by replacing __m64 with __m128i
> 4) 64-bit code generation support
> 5) The compiler can reschedule instructions for the target CPU to deliver better performance without manual tuning. In the past I ran several tests in which I replaced high-quality, manually optimized assembly with intrinsics, and this resulted in 3-5% better performance with the Intel compiler. In any case, I don't expect any performance issues with it.
>
> It will require some changes to the project structure and makefiles, and I'm not sure if this is OK - at least I don't know how to coordinate work on Theora with other developers. Could you please help me here?

Hi,

just some notes: the current code works on 64-bit, at least the gcc version; I'm not sure about msvc.

There was an attempt to use intrinsics some time ago, but it was slower than the asm version (with gcc). Do you think your intrinsic version will be the same speed or faster, or do you expect it to be slower? gcc would be the important compiler here; if it is only faster with Intel compilers, that is a regression for the most common uses.

For Linux distributions it would still be required to detect the CPU at runtime and use the fastest implementation for the current CPU; compile-time optimization alone is not enough.

j
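The runtime detection mentioned above can be sketched with a function pointer that is chosen once at startup. This is only an illustration, not how libtheora does its own detection: the names sad_fn, sad_generic, sad_sse2 and select_sad are hypothetical, and the x86 path assumes a GCC-compatible compiler that provides __builtin_cpu_init and __builtin_cpu_supports. Other compilers and architectures fall back to plain C.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Signature shared by every implementation of the (hypothetical) SAD kernel. */
typedef unsigned (*sad_fn)(const uint8_t *a, const uint8_t *b, size_t n);

/* Portable fallback that runs on any CPU. */
static unsigned sad_generic(const uint8_t *a, const uint8_t *b, size_t n) {
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) sum += (unsigned)abs(a[i] - b[i]);
    return sum;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <emmintrin.h>
#define HAVE_X86_DISPATCH 1

/* Built for SSE2 even if the rest of the file targets a baseline CPU. */
__attribute__((target("sse2")))
static unsigned sad_sse2(const uint8_t *a, const uint8_t *b, size_t n) {
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
    }
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    /* Leftover bytes (n not a multiple of 16) take the scalar path. */
    return (unsigned)_mm_cvtsi128_si32(acc) + sad_generic(a + i, b + i, n - i);
}
#endif

/* Pick the fastest implementation the running CPU supports. */
static sad_fn select_sad(void) {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2")) return sad_sse2;
#endif
    return sad_generic;
}
```

A caller would run select_sad() once during initialization and store the result, so the per-call cost is a single indirect call. A distribution can then ship one binary that still runs on CPUs without the newer instructions.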
Nils Pipenbrinck
2009-Oct-13 18:52 UTC
[theora-dev] Proposal for replacing asm code with intrinsics
Sukhomlinov, Vadim wrote:
> Hi,
>
> I'm new to Theora and would like to propose several performance optimization using advanced instructions in x86 CPUs (SSE2-SSE4.2).
> There are several source files in \x86 and \x86_vc which developed using inline assembler. However this cause several maintenance problems:
> 1) Need to sync gcc & msvc versions
> 2) Only 32bit environment is supported
> 3) No support for newer than MMX instruction sets
>

I did tests on VS.net and GCC half a year ago, when we encountered a strange code-generation bug in the assembler code in the win32 release builds of Firefox (does anyone remember?). As far as I remember, I used GCC 4.2.something for testing.

The performance drop was about 10 to 15%.

IMHO the wins for maintainability alone are worth it. If the code gets rewritten for SSE, I'd expect no performance loss and, with a bit of luck, even a tiny performance win due to the wider registers.

Btw - the reasons why the intrinsics were slower than the hand-written code are:

* The assembler code is hand-scheduled, and the loops have (mostly) been written with modulo scheduling in mind (something GCC can unfortunately only do in theory).

* For some reason the intrinsics generate sub-optimal code. I've seen plenty of useless register moves and spills to memory.

* It also seems that GCC has no idea how to schedule intrinsics. It looks like the input and output registers are ignored and GCC just converts the SSA tree to raw code without moving instructions around. Back when I wrote the assembler code, moving the processing of data as far away as possible from the memory accesses made the biggest difference, because it masked the cache misses.

I still would prefer SSE intrinsics. That would not only make maintenance easier but also allow much easier porting to ARM-NEON, PPC Altivec and MIPS-MDMX.

Cheers,
Nils Pipenbrinck
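The point above about keeping memory accesses far from the code that consumes them can be written out by hand in intrinsics as a software-pipelined loop: the loads for the next block are issued before the arithmetic on the block loaded one iteration earlier. This is a minimal sketch assuming x86 with SSE2, and sad_pipelined is a hypothetical name. As the message notes, the compiler is free to reorder or ignore this arrangement, so the source order expresses intent and guarantees nothing about the final schedule.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* SAD over n bytes; n must be a nonzero multiple of 16. */
static unsigned sad_pipelined(const uint8_t *a, const uint8_t *b, size_t n) {
    assert(n >= 16 && n % 16 == 0);
    __m128i acc = _mm_setzero_si128();

    /* Prologue: load the first block before entering the loop. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    for (size_t i = 16; i < n; i += 16) {
        /* Issue the loads for the next block first... */
        __m128i na = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i nb = _mm_loadu_si128((const __m128i *)(b + i));
        /* ...then do the arithmetic on the block loaded one step earlier. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
        va = na;
        vb = nb;
    }

    /* Epilogue: process the last block that was loaded. */
    acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return (unsigned)_mm_cvtsi128_si32(acc);
}
```

Inspecting the generated assembly (for example with gcc -S) is the only way to tell whether a given compiler version preserves this ordering or collapses it back into load-then-use.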
Ralph Giles
2009-Oct-14 00:58 UTC
[theora-dev] Proposal for replacing asm code with intrinsics
On Tue, Oct 13, 2009 at 6:14 AM, Sukhomlinov, Vadim <vadim.sukhomlinov at intel.com> wrote:
> I'm new to Theora and would like to propose several performance optimization using advanced instructions in x86 CPUs (SSE2-SSE4.2).

Welcome. Several others have commented on various points. My rough memory of this:

1) There are no good options for cross-platform assembly.

2) Last time we tried, intrinsics didn't seem worthwhile, but I'd be happy to see that that's changed.

3) Even if intrinsics unified the gcc and msvc assembly for x86, we still need to add things like NEON for ARM, so I expect we will have multiple asm versions that must be sync'd anyway.

4) Past benchmarking showed SSE2 wasn't much better than MMX. That's why there's only one SSE routine in the 1.1 code.

So, if you can make it go faster, that would be interesting, regardless of which representation we end up trying to maintain.

-r