On Apr 8, 2011, at 9:56 AM, NAKAMURA Takumi wrote:

>>> 8B C3 mov eax, ebx
>>> 03 C1 add eax, ecx
>>> becomes
>>> 8B C3 03 C1 add eax, ebx, ecx
>
> In my understanding, twoaddr pass tends to emit such a sequence.

Yes, it always does, and the coalescer tries very hard to eliminate the copy.

> Though I don't have sandybridge, I have not measured.
> Prior processors(intel and amd) might spend 1 ALU to execute "mov",
> then mov - add must have dependency.

I think you will find it is more complicated than that. A 'mov' usually doesn't need an ALU resource.

You should read about the 'reservation station' style register renaming.

http://en.wikipedia.org/wiki/Register_renaming
http://www.intel.com/Assets/PDF/manual/248966.pdf

/jakob
Hi Jakob,

As far as I know, an x86 'mov' instruction always uses an ALU resource. According to Agner Fog's documents (http://www.agner.org/optimize/), though, it can execute on port 0, 1 or 5 on recent architectures, so it's not that likely to be resource limited. But it still occupies an instruction slot throughout the entire pipeline, costing power and potentially keeping other, actual arithmetic instructions from being scheduled optimally. It also has a latency of 1 cycle, whereas non-destructive instructions would shorten the latency of dependent instructions.

My immediate concern is getting a reasonable estimate for how often this macro-op fusion could be performed. This could then be used to evaluate whether it's worth the added decoder complexity.

Cheers,
Nicolas
On Apr 17, 2011, at 9:59 AM, Nicolas Capens wrote:

> My immediate concern is getting a reasonable estimate for how often this macro-op fusion could be performed. This could then be used to evaluate whether it's worth the added decoder complexity.

In that case, just look at the generated code. I don't think any pass is inserting instructions between 'mov' and two-address arithmetic instructions.

/jakob
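
One way to "look at the generated code" for such an estimate is a small script over an assembly listing. The following is only a rough sketch, not anything from LLVM itself: it assumes Intel-syntax text (destination operand first, as in the example quoted in this thread), considers only register-to-register 'mov' and a hand-picked set of two-address arithmetic mnemonics, and counts a pair only when the arithmetic instruction directly follows the 'mov' and writes the same destination register. The function name count_fusible_pairs and the register and mnemonic lists are made up for illustration.

```python
import re

# Assumption: Intel-syntax assembly text, one instruction per line,
# '#' starting a comment. Register and mnemonic lists are illustrative only.
GPRS = (
    ["eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"]
    + ["rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp"]
    + [f"r{i}d" for i in range(8, 16)]
    + [f"r{i}" for i in range(8, 16)]
)
REG = "(?:" + "|".join(sorted(GPRS, key=len, reverse=True)) + ")"

# 'mov dst, src' with both operands being general-purpose registers.
MOV_RE = re.compile(rf"^mov\s+({REG})\s*,\s*({REG})$")
# Two-address arithmetic: 'op dst, anything'.
ARITH_RE = re.compile(rf"^(add|sub|and|or|xor)\s+({REG})\s*,\s*\S.*$")


def count_fusible_pairs(asm_lines):
    """Return (pairs, instructions) for an iterable of assembly lines.

    A pair is a register-to-register 'mov dst, src' immediately followed
    by an arithmetic instruction whose destination is the same 'dst'.
    """
    pairs = 0
    instructions = 0
    pending_dst = None  # destination of a reg-reg mov on the previous instruction
    for raw in asm_lines:
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if line.endswith(":"):
            # A label may be a branch target, so do not pair across it.
            pending_dst = None
            continue
        if line.startswith("."):
            continue  # assembler directive, not an instruction
        instructions += 1
        arith = ARITH_RE.match(line)
        if arith and pending_dst is not None and arith.group(2) == pending_dst:
            pairs += 1
        mov = MOV_RE.match(line)
        pending_dst = mov.group(1) if mov else None
    return pairs, instructions


if __name__ == "__main__":
    sample = """
        mov eax, ebx
        add eax, ecx
        mov edx, esi
        sub ecx, edx
        ret
    """.splitlines()
    pairs, total = count_fusible_pairs(sample)
    print(f"{pairs} candidate pair(s) in {total} instruction(s)")
```

Dividing the pair count by the instruction count gives a static frequency only; it says nothing about how often those pairs sit on hot paths, so a dynamic profile would still be needed to judge the real benefit.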