thr3ads.net - theora dev - [theora-dev] Proposal for replacing asm code with intrinsics [Oct 2009]

Sukhomlinov, Vadim

2009-Oct-13 13:14 UTC

[theora-dev] Proposal for replacing asm code with intrinsics

Hi, 

I'm new to Theora and would like to propose several performance optimizations
using advanced instructions in x86 CPUs (SSE2-SSE4.2).
There are several source files in \x86 and \x86_vc which were developed using
inline assembler. However, this causes several maintenance problems:
1) Need to sync gcc & msvc versions
2) Only 32-bit environment is supported
3) No support for instruction sets newer than MMX

My proposal is to replace all functions written in assembly with compiler
intrinsics, each of which compiles into 1-2 assembly instructions and is much
easier to maintain.
For example:
_mm_sad_epu8(__m128i, __m128i) will be compiled into a PSADBW instruction with
compiler-allocated registers.

And code like:
    psadbw mm4,mm5
    paddw mm0,mm4

Can be rewritten as
__m64 mm0, mm4, mm5, mm6, mm7; //of course using meaningful names
mm0 = _mm_add_pi16(mm0, _mm_sad_pu8(mm4, mm5));
The compiler will replace the variables with actual registers, ensuring better
allocation and scheduling of them.
So, the benefits are:
1) Code that is easier to read & understand and can use the same variable
names as the generic C version
2) A single source for gcc & msvc & the Intel compiler (all of them support
the same syntax)
3) Easier migration to SSE2 (which handles 128 bits vs. 64 with MMX) through
replacement of __m64 with __m128i
4) 64-bit code generation support
5) The compiler can reschedule instructions based on the target CPU to deliver
better performance w/o manual tuning. In the past I ran several tests in which
I replaced high-quality, manually optimized assembly with intrinsics, which
resulted in 3-5% better performance when using the Intel compiler. Anyway, I
don't expect any performance issues with it.
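
As a rough illustration of points 1-3, here is a minimal sketch of a sum of absolute differences written once in plain C and once with SSE2 intrinsics. The function names sad_c and sad_sse2 are hypothetical and are not taken from the Theora sources; the intrinsics path assumes a compiler that defines __SSE2__ and otherwise falls back to the C version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Plain C reference: sum of absolute differences over n bytes. */
static unsigned sad_c(const uint8_t *a, const uint8_t *b, size_t n) {
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) sum += (unsigned)abs((int)a[i] - (int)b[i]);
    return sum;
}

/* Intrinsics version (hypothetical name); n must be a multiple of 16. */
static unsigned sad_sse2(const uint8_t *a, const uint8_t *b, size_t n) {
    assert(n % 16 == 0);
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* PSADBW yields two partial sums, one per 64-bit half. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
    }
    /* Fold the high half onto the low half and extract the total. */
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return (unsigned)_mm_cvtsi128_si32(acc);
#else
    return sad_c(a, b, n); /* no SSE2 at compile time: use the C version */
#endif
}
```

The compiler chooses the XMM registers for va, vb and acc; whether its scheduling matches hand-written assembly is exactly what would need to be benchmarked.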

It will require some changes in the project structure and makefiles, and I'm
not sure if this is ok - at least I don't know how to coordinate work on Theora
with other developers. Could you please help me here?

Thanks in advance,
Vadim

-----Original Message-----
From: theora-dev-bounces at xiph.org [mailto:theora-dev-bounces at xiph.org] On
Behalf Of theora-dev-request at xiph.org
Sent: Thursday, October 08, 2009 11:00 PM
To: theora-dev at xiph.org
Subject: theora-dev Digest, Vol 65, Issue 2

Today's Topics:

   1. Possible inefficiency in encode.c (Chris Cooksey)
   2. Re: Possible inefficiency in encode.c (Timothy B. Terriberry)

----------------------------------------------------------------------

Message: 1
Date: Wed, 07 Oct 2009 17:40:43 -0400
From: Chris Cooksey <chriscooksey at gmail.com>
Subject: [theora-dev] Possible inefficiency in encode.c
To: <theora-dev at xiph.org>
Message-ID: <C6F2831B.28AF4%chriscooksey at gmail.com>
Content-Type: text/plain;	charset="US-ASCII"

Hi,

I am very new to Theora, having just started working through the code a few
weeks ago.

I am working on a requantization tool to reduce bit rates, hopefully on the
fly, for some video conferencing work.

As I was working through the encoding phase I noticed this line in encode.c:

     for(ti=_enc->dct_token_offs[pli][zzi];ti<ndct_tokens;ti++){

It's around line 804, but I am working with 1.1b3 sources so it may have
moved a bit.

Anyway, I am thinking that this line might be an adequate substitute:

     for(ti=0;ti<ndct_tokens;ti++){

Because the tokens are now stored in separate per plane arrays instead of
all strung together in one big array like they used to be. I presume the
point of doing that was to eliminate the need for dct_token_offs altogether.

I see dct_token_offs being used in a couple of other places too.

I could be wrong of course. Please don't beat this neophyte up if I am :-)

Thanks,
Chris.

------------------------------

Message: 2
Date: Wed, 07 Oct 2009 23:37:18 -0400
From: "Timothy B. Terriberry" <tterribe at email.unc.edu>
Subject: Re: [theora-dev] Possible inefficiency in encode.c
To: theora-dev at xiph.org
Message-ID: <4ACD5E6E.8040705 at email.unc.edu>
Content-Type: text/plain; charset=ISO-8859-1

Chris Cooksey wrote:
> Because the tokens are now stored in separate per plane arrays instead of
> all strung together in one big array like they used to be. I presume the
> point of doing that was to eliminate the need for dct_token_offs
> altogether.

The actual point was so that the token lists could be filled in a
different order than the one in which they will appear in the bitstream.
However, one of the consequences of this is that EOB runs cannot span
lists, even though the bitstream allows it.

This is fixed up after tokenization, before packing the tokens into the
packet, in oc_enc_tokenize_finish(). What this means is that sometimes
the first token in the list must be skipped, because it was an EOB run
that has actually been merged with the last token in a different list.
dct_token_offs[][] marks which lists need to skip such a token (i.e.,
it's always either 0 or 1).

It would actually probably be faster to keep things in a single
contiguous array, with offsets to the individual lists, just because it
would remove an extra indirection that C compilers generally do a poor
job of optimizing. We did this in the decoder, and it did provide a
small speed-up. I just never got around to doing it in the encoder.
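
To make the two layouts concrete, here is a minimal sketch. The struct and function names are hypothetical and are not the libtheora data structures; the skip argument stands in for the 0-or-1 value that dct_token_offs[][] is described as holding.

```c
#include <assert.h>
#include <stddef.h>

#define NLISTS 3 /* hypothetical number of token lists */

/* Layout A: one separate array per list (extra pointer load per list). */
struct lists_separate {
    const unsigned char *tokens[NLISTS];
    size_t ntokens[NLISTS];
};

/* Layout B: one contiguous buffer; list li occupies [offs[li], offs[li+1]). */
struct lists_flat {
    const unsigned char *tokens;
    size_t offs[NLISTS + 1];
};

/* Sum the tokens of list li, skipping the first `skip` (0 or 1) tokens. */
static unsigned sum_separate(const struct lists_separate *s, int li, size_t skip) {
    unsigned sum = 0;
    assert(li >= 0 && li < NLISTS);
    for (size_t ti = skip; ti < s->ntokens[li]; ti++) sum += s->tokens[li][ti];
    return sum;
}

/* Same traversal over the flat layout: only index arithmetic, no second pointer. */
static unsigned sum_flat(const struct lists_flat *f, int li, size_t skip) {
    unsigned sum = 0;
    assert(li >= 0 && li < NLISTS);
    for (size_t ti = f->offs[li] + skip; ti < f->offs[li + 1]; ti++) sum += f->tokens[ti];
    return sum;
}
```

Both functions visit the same tokens; the flat layout just replaces the per-list pointer with an offset into one buffer.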

------------------------------

j at v2v.cc

2009-Oct-13 13:59 UTC

[theora-dev] Proposal for replacing asm code with intrinsics

Sukhomlinov, Vadim wrote:
> Hi,
>
> I'm new to Theora and would like to propose several performance
> optimizations using advanced instructions in x86 CPUs (SSE2-SSE4.2).
> There are several source files in \x86 and \x86_vc which were developed
> using inline assembler. However, this causes several maintenance problems:
> 1) Need to sync gcc & msvc versions
> 2) Only 32-bit environment is supported
> 3) No support for instruction sets newer than MMX
>
> My proposal is to replace all functions written in assembly with compiler
> intrinsics, each of which compiles into 1-2 assembly instructions and is
> much easier to maintain.
> For example:
> _mm_sad_epu8(__m128i, __m128i) will be compiled into a PSADBW instruction
> with compiler-allocated registers.
>
> And code like:
>     psadbw mm4,mm5
>     paddw mm0,mm4
>
> Can be rewritten as
> __m64 mm0, mm4, mm5, mm6, mm7; //of course using meaningful names
> mm0 = _mm_add_pi16(mm0, _mm_sad_pu8(mm4, mm5));
> The compiler will replace the variables with actual registers, ensuring
> better allocation and scheduling of them.
> So, the benefits are:
> 1) Code that is easier to read & understand and can use the same variable
> names as the generic C version
> 2) A single source for gcc & msvc & the Intel compiler (all of them support
> the same syntax)
> 3) Easier migration to SSE2 (which handles 128 bits vs. 64 with MMX)
> through replacement of __m64 with __m128i
> 4) 64-bit code generation support
> 5) The compiler can reschedule instructions based on the target CPU to
> deliver better performance w/o manual tuning. In the past I ran several
> tests in which I replaced high-quality, manually optimized assembly with
> intrinsics, which resulted in 3-5% better performance when using the Intel
> compiler. Anyway, I don't expect any performance issues with it.
>
> It will require some changes in the project structure and makefiles, and
> I'm not sure if this is ok - at least I don't know how to coordinate work
> on Theora with other developers. Could you please help me here?
>

Hi,
just some notes: the current code works on 64-bit, at least the gcc version,
not sure about msvc. There was an attempt to use intrinsics some time
ago, but it was slower than the asm version (with gcc). Do you
think your intrinsics version will be the same speed or faster, or do you
expect it to be slower? gcc would be the important compiler here; if it's
only faster with Intel compilers, it is a regression for most common uses.
For Linux distributions it would still be required to detect the CPU at
runtime and use the fastest implementation for the current CPU; compile-time
optimization alone is not enough.
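
A minimal sketch of that kind of runtime selection, assuming GCC or Clang on x86 (it relies on the __builtin_cpu_supports builtin and the target function attribute, neither of which is available in MSVC). All function names here are hypothetical and are not the detection code Theora uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef unsigned (*sad_fn)(const uint8_t *, const uint8_t *, size_t);

/* Portable fallback, always available. */
static unsigned sad_generic(const uint8_t *a, const uint8_t *b, size_t n) {
    unsigned sum = 0;
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) sum += (unsigned)abs((int)a[i] - (int)b[i]);
    return sum;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <emmintrin.h>
#define HAVE_SSE2_PATH 1
/* Built for SSE2 regardless of the baseline -m flags of the file. */
static __attribute__((target("sse2")))
unsigned sad_sse2_rt(const uint8_t *a, const uint8_t *b, size_t n) {
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(va, vb));
    }
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    unsigned sum = (unsigned)_mm_cvtsi128_si32(acc);
    for (; i < n; i++) sum += (unsigned)abs((int)a[i] - (int)b[i]); /* tail */
    return sum;
}
#endif

/* Pick the best implementation for the CPU we are actually running on. */
static sad_fn select_sad(void) {
#ifdef HAVE_SSE2_PATH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2")) return sad_sse2_rt;
#endif
    return sad_generic;
}
```

A library would typically call select_sad() once at initialization and store the result in a function-pointer table, so a single binary can run on CPUs with and without the newer instruction set.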

j

Nils Pipenbrinck

2009-Oct-13 18:52 UTC

[theora-dev] Proposal for replacing asm code with intrinsics

Sukhomlinov, Vadim wrote:
> Hi,
>
> I'm new to Theora and would like to propose several performance
> optimizations using advanced instructions in x86 CPUs (SSE2-SSE4.2).
> There are several source files in \x86 and \x86_vc which were developed
> using inline assembler. However, this causes several maintenance problems:
> 1) Need to sync gcc & msvc versions
> 2) Only 32-bit environment is supported
> 3) No support for instruction sets newer than MMX
>

I did tests on VS.net and GCC half a year ago, when we encountered a
strange code-generation bug in the assembler code for the win32
release builds of Firefox (anyone remember?)

As far as I remember I've used GCC 4.2.something for testing.

The performance drop was about 10 to 15%.

Imho the wins for maintainability alone are worth it. If the code gets
rewritten for SSE I'd expect no performance loss and with a bit of luck
even a tiny performance win due to the wider registers.

Btw - the reasons why the intrinsics were slower than the
hand-written code are:

 * The assembler code is hand-scheduled and the loops have been (mostly)
written with modulo scheduling in mind (something GCC can
unfortunately only do in theory).

 * For some reason the intrinsics generate sub-optimal code. I've seen
plenty of useless register moves and spills to memory.

 * Also it seems like GCC has no idea how to schedule any intrinsics. It
looks like the input and output registers are ignored and GCC just
converts the SSA tree to raw code without moving instructions around.

Back when I wrote the assembler code, moving the processing of data
as far away as possible from the memory accesses made the biggest
difference because it masked the cache misses.
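
A minimal sketch of that idea in plain C, with a hypothetical function name and block size: the loads for the next block are issued before the arithmetic on the block loaded one iteration earlier. A compiler is free to reorder this, so it only illustrates the structure that the hand-scheduled loops impose explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 8 /* hypothetical block size in bytes */

/* Sum of absolute differences over one BLK-byte block. */
static unsigned block_sad(const uint8_t *x, const uint8_t *y) {
    unsigned s = 0;
    for (int i = 0; i < BLK; i++) s += (unsigned)(x[i] > y[i] ? x[i] - y[i] : y[i] - x[i]);
    return s;
}

/* Software-pipelined loop: load block i while computing on block i-1. */
static unsigned sad_pipelined(const uint8_t *a, const uint8_t *b, size_t nblocks) {
    uint8_t cur_a[BLK], cur_b[BLK], nxt_a[BLK], nxt_b[BLK];
    unsigned sum = 0;
    assert(nblocks > 0);
    memcpy(cur_a, a, BLK);               /* prologue: first loads */
    memcpy(cur_b, b, BLK);
    for (size_t i = 1; i < nblocks; i++) {
        memcpy(nxt_a, a + i * BLK, BLK); /* issue the next loads first */
        memcpy(nxt_b, b + i * BLK, BLK);
        sum += block_sad(cur_a, cur_b);  /* work on data loaded earlier */
        memcpy(cur_a, nxt_a, BLK);
        memcpy(cur_b, nxt_b, BLK);
    }
    sum += block_sad(cur_a, cur_b);      /* epilogue: last block */
    return sum;
}
```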

I would still prefer SSE intrinsics. That would not only make
maintenance easier but also allow much easier porting to ARM NEON,
PPC AltiVec and MIPS MDMX.

Cheers,
  Nils Pipenbrinck

Ralph Giles

2009-Oct-14 00:58 UTC

[theora-dev] Proposal for replacing asm code with intrinsics

On Tue, Oct 13, 2009 at 6:14 AM, Sukhomlinov, Vadim
<vadim.sukhomlinov at intel.com> wrote:
> I'm new to Theora and would like to propose several performance
> optimization using advanced instructions in x86 CPUs (SSE2-SSE4.2).

Welcome. Several others have commented on various points. My rough
memory of this:

1) There are no good options for cross-platform assembly.
2) Last time we tried, intrinsics didn't seem worthwhile, but I'd be
happy to see that that's changed.
3) Even if intrinsics unified the gcc and msvc assembly for x86, we
still need to add things like NEON for arm, so I expect we will have
multiple asm versions that must be sync'd anyway.
4) Past benchmarking showed SSE2 wasn't much better than MMX. That's
why there's only one sse routine in the 1.1 code.

So, if you can make it go faster, that would be interesting,
regardless of which representation we end up trying to maintain.

 -r
