thr3ads.net - search: "interlagos"

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Mar 03

5

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

Hi, Here are some numbers for my version -- also attached is the test code. I found that booting big machines is tediously slow so I lifted the whole lot to userspace. I measure the cycles spent in arch_spin_lock() + arch_spin_unlock(). The machines used are a 4 node (2 socket) AMD Interlagos, and a 2 node (2 socket) Intel Westmere-EP.
AMD (ticket)       AMD (qspinlock + pending + opt)
Local:             Local:
1: 324.425530      1: 324.102142
2: 17141.324050    2: 620.185930
3: 52212.232343    3: 25242.574661
4: 93136.458314    4: 47982.037866
6: 167967.4...

[LLVMdev] X86 FMA4

2012 Jul 25

6

[LLVMdev] X86 FMA4

We're migrating to LLVM 3.1 and trying to use the upstream FMA patterns. Why is VFMADDSD4 defined with vector types? Is this simply because the gcc intrinsic uses vector types? It's quite unnatural if you have a compiler that generates FMAs as opposed to requiring user intrinsics. -Dave

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Mar 04

0

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

...set; when I compile with: gcc (Ubuntu/Linaro 4.7.3-2ubuntu4) 4.7.3 I get the second set; afaict the other locks don't seem to have this problem, but I only just noticed. --- I measure the cycles spent in arch_spin_lock() + arch_spin_unlock(). The machines used are a 4 node (2 socket) AMD Interlagos, a 2 node (2 socket) Intel Westmere-EP and my i7-2600K (SNB) desktop.
(ticket)           (qspinlock + all)   (waiman)
AMD Interlagos
Local:
1: 324.425530      1: 324.102142       1: 323.857834
2: 17141.324050    2: 620.185930       2: 618.737681
3: 52212.232343    3: 252...

Bug#675266: xen-hypervisor-4.0-amd64: Hard reset when starting a DomU on HP DL585 G7 // Opteron 6238

2012 May 30

0

Bug#675266: xen-hypervisor-4.0-amd64: Hard reset when starting a DomU on HP DL585 G7 // Opteron 6238

Package: xen-hypervisor-4.0-amd64 Version: 4.0.1-4 Severity: important *** Please type your report below this line *** Hello Debian Team, I have some strange behavior with a DL585 G7 with only two cpu sockets used (Opteron 6238 Interlagos). I use Debian 6.0 with a XEN Kernel. Every time when I start a DomU, the server makes a hard reset without any kernel panic or output. When I start the same disks in another DL585 G7 with all four cpu sockets used (Opteron 6174 Magny-Cours), everything works normal! Is the Opteron Interlagos cor...

[LLVMdev] X86 FMA4

2012 Jul 26

1

[LLVMdev] X86 FMA4

...ar operands, we end up with... vmovsd fp4_+1056(%rip), %xmm0 # fpppp.f:666 vmovsd %xmm0, 10088(%rsp) # fpppp.f:666 <= 8-byte spill vfmaddsd %xmm3, fp4_+3288(%rip), %xmm0, %xmm3 # fpppp.f:666 I do not know the actual number of cycles offhand, but I believe on Interlagos and Sandybridge, a vmovaps takes roughly 3x as many micro-ops as a vmovsd if it involves memory. -Cameron On Thu, Jul 26, 2012 at 9:41 AM, Jan Sjodin <jan_sjodin at yahoo.com> wrote: > Because the intrinsics uses vector types (same as gcc). > > > - Jan > > > > -----...

[LLVMdev] Generate scalar SSE instructions instead of packed instructions

2013 Feb 21

2

[LLVMdev] Generate scalar SSE instructions instead of packed instructions

...the extra memory traffic of stuffing vectors is more of a performance hit than the partial register updates. Unfortunately, this is counter-intuitive to the documentation available. And, this may only be true for the benchmarks that hold my interest. For completeness, I'm mainly interested in Interlagos and Sandybridge, so this conjecture may not hold for other processors such as Atom. Hope this helps, Cameron

cannot compile R on Cray XE6 HLRS HERMIT

2013 Apr 29

1

cannot compile R on Cray XE6 HLRS HERMIT

...My environment is as follows: 1) modules/3.2.6.7 13) udreg/2.3.2-1.0401.5929.3.3.gem 25) configuration/1.0-1.0401.35391.1.2.gem 2) xtpe-network-gemini 14) ugni/4.0-1.0401.5928.9.5.gem 26) hosts/1.0-1.0401.35364.1.115.gem 3) xtpe-interlagos 15) pmi/4.0.1-1.0000.9421.73.3.gem 27) lbcd/2.1-1.0401.35360.1.2.gem 4) cray-mpich2/5.6.4 16) dmapp/3.2.1-1.0401.5983.4.5.gem 28) nodehealth/5.0-1.0401.38460.12.18.gem 5) eswrap/1.0.9 17) gni-headers/2.1-1...

[LLVMdev] X86 FMA4

2012 Jul 26

0

[LLVMdev] X86 FMA4

...+1056(%rip), %xmm0 # fpppp.f:666 > > vmovsd %xmm0, 10088(%rsp) # fpppp.f:666 <= 8-byte spill > > vfmaddsd %xmm3, fp4_+3288(%rip), %xmm0, %xmm3 # fpppp.f:666 > > > > > >I do not know the actual number of cycles offhand, but I believe on > Interlagos and Sandybridge, a vmovaps takes roughly 3x as many micro-ops as > a vmovsd if it involves memory. > > > > > >-Cameron > > > > > >On Thu, Jul 26, 2012 at 9:41 AM, Jan Sjodin <jan_sjodin at yahoo.com> wrote: > > > >Because the intrinsics uses...

[LLVMdev] X86 FMA4

2012 Jul 27

2

[LLVMdev] X86 FMA4

...fp4_+1056(%rip), %xmm0 # fpppp.f:666 > > vmovsd %xmm0, 10088(%rsp) # fpppp.f:666 <= 8-byte spill > > vfmaddsd %xmm3, fp4_+3288(%rip), %xmm0, %xmm3 # fpppp.f:666 > > > > > >I do not know the actual number of cycles offhand, but I believe on Interlagos and Sandybridge, a vmovaps takes roughly 3x as many micro-ops as a vmovsd if it involves memory. > > > > > >-Cameron > > > > > >On Thu, Jul 26, 2012 at 9:41 AM, Jan Sjodin <jan_sjodin at yahoo.com> wrote: > > > >Because the intrinsics uses vecto...

[LLVMdev] X86 FMA4

2012 Jul 26

0

[LLVMdev] X86 FMA4

Jan Sjodin <jan_sjodin at yahoo.com> writes: > You can't execute FMA4 instructions on Intel processors, so it doesn't > really matter what the impact of the move instructions would be, since > it would end up with an illegal instruction regardless. :) Interlagos? All the world is not Intel. > It does perhaps bring up an issue of tuning for different > architectures, but that is something nobody is really looking into at > the moment afaik. *ahem* :) -Dave

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Mar 04

1

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

...7696 > 6 4942 4434 9876 > 7 6304 5176 11901 > 8 7736 5955 14551 > I'm just not seeing that; with test-4 modified to take the AMD compute units into account: root@interlagos:~/spinlocks# LOCK=./qspinlock-pending-opt ./test-4.sh ; LOCK=./qspinlock-pending-opt2 ./test-4.sh 4: 50783.509653 8: 146295.875715 16: 332942.964709 4: 51033.341441 8: 146320.656285 16: 332586.355194 And the difference between opt and opt2 is that opt2 replaces 2 cmpxchg loops with uncondition...

[LLVMdev] Generate scalar SSE instructions instead of packed instructions

2013 Feb 21

0

[LLVMdev] Generate scalar SSE instructions instead of packed instructions

You can change the input LLVM-IR. On Feb 21, 2013, at 7:16 AM, "Nowicki, Tyler" <tyler.nowicki at intel.com> wrote: > Hi, > > I am interested in evaluating the performance of packed vs scalar double-precision floating point instructions on x86-atom and I was wondering if anyone knows more precisely where to modify llvm to use one or the other. I know I probably need

[LLVMdev] X86 FMA4

2012 Jul 26

0

[LLVMdev] X86 FMA4

Because the intrinsics uses vector types (same as gcc). - Jan ----- Original Message ----- > From: "dag at cray.com" <dag at cray.com> > To: llvmdev at cs.uiuc.edu > Cc: > Sent: Wednesday, July 25, 2012 3:26 PM > Subject: [LLVMdev] X86 FMA4 > > We're migrating to LLVM 3.1 and trying to use the upstream FMA patterns. > > Why is VFMADDSD4

[LLVMdev] Generate scalar SSE instructions instead of packed instructions

2013 Feb 21

2

[LLVMdev] Generate scalar SSE instructions instead of packed instructions

Hi, I am interested in evaluating the performance of packed vs scalar double-precision floating point instructions on x86-atom and I was wondering if anyone knows more precisely where to modify llvm to use one or the other. I know I probably need to change something in the type legalizer. Could anyone provide more details than that? Thanks, Tyler

[LLVMdev] Generate scalar SSE instructions instead of packed instructions

2013 Feb 26

0

[LLVMdev] Generate scalar SSE instructions instead of packed instructions

...the extra memory traffic of stuffing vectors is more of a performance hit than the partial register updates. Unfortunately, this is counter-intuitive to the documentation available. And, this may only be true for the benchmarks that hold my interest. For completeness, I'm mainly interested in Interlagos and Sandybridge, so this conjecture may not hold for other processors such as Atom. Hope this helps, Cameron

[LLVMdev] X86 FMA4

2012 Jul 27

0

[LLVMdev] X86 FMA4

...ppp.f:666 >> > vmovsd %xmm0, 10088(%rsp) # fpppp.f:666 <= 8-byte spill >> > vfmaddsd %xmm3, fp4_+3288(%rip), %xmm0, %xmm3 # fpppp.f:666 >> > >> > >> >I do not know the actual number of cycles offhand, but I believe on >> Interlagos and Sandybridge, a vmovaps takes roughly 3x as many micro-ops as >> a vmovsd if it involves memory. >> > >> > >> >-Cameron >> > >> > >> >On Thu, Jul 26, 2012 at 9:41 AM, Jan Sjodin <jan_sjodin at yahoo.com> >> wrote: >> &...

[LLVMdev] X86 FMA4

2012 Jul 27

3

[LLVMdev] X86 FMA4

...mm0 # fpppp.f:666 >> > vmovsd %xmm0, 10088(%rsp) # fpppp.f:666 <= 8-byte spill >> > vfmaddsd %xmm3, fp4_+3288(%rip), %xmm0, %xmm3 # fpppp.f:666 >> > >> > >> >I do not know the actual number of cycles offhand, but I believe on Interlagos and Sandybridge, a vmovaps takes roughly 3x as many micro-ops as a vmovsd if it involves memory. >> > >> > >> >-Cameron >> > >> > >> >On Thu, Jul 26, 2012 at 9:41 AM, Jan Sjodin <jan_sjodin at yahoo.com> wrote: >> > >> >B...

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014 Feb 28

5

[PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

On Thu, Feb 27, 2014 at 03:42:19PM -0500, Waiman Long wrote: > >>+ old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED); > >>+ > >>+ if (old == 0) { > >>+ /* > >>+ * Got the lock, can clear the waiting bit now > >>+ */ > >>+ smp_u8_store_release(&qlock->wait, 0); > > > >So we just did an
