Bryan J. Smith <b.j.smith@ieee.org>
2005-Jul-08 22:43 UTC
[CentOS] Re: Hot swap CPU -- shared memory (1 NUMA/UPA) v. clustered (4 MCH)
From: Bruno Delbono <bruno.s.delbono at mail.ac>
> I'm really sorry to start this thread again but I found something very
> interesting I thought everyone should ^at least^ have a look at:
> http://uadmin.blogspot.com/2005/06/4-dual-xeon-vs-e4500.html
> This article takes into account a comparison of 4 dual xeon vs. e4500.
> The author (not me!) talks about "A Shootout Between Sun E4500 and a
> Linux Redhat3.0 AS Cluster Using Oracle10g [the cluster walks away limping]"

People can play with the numbers all they want. Shared memory systems
(especially NUMA) with their multi-GBps, native interconnects are
typically going to be faster at anything that is GbE or FC
interconnected. At the same time, lower-cost, clustered systems are
competitive *IF* the application scales _linearly_. In this test
scenario, the operations were geared toward operations that scale
_linearly_.

But even then, I think this blogger pointed out some very good data:
the reality of the age of the systems being compared, as well as the
"less than commodity" cluster configuration of the PC Servers. It is
also a "power hungry" setup, and very costly.

The reality is that a dual-2.8GHz Xeon MCH system does _not_ have an
interconnect comparable to even _yesteryear_ UltraSPARC II hardware,
which actually had a _drastically_lower_ overall cost. And even then,
for its age, price, etc..., the "old" 4-way UltraSPARC II NUMA/UPA was
quite "competitive" against a processor with 7x the CPU clock, because
the latter, "newer" P4 MCH platform has a _far_lower_ total aggregate
interconnect throughput. Had the test included database benchmarks
that were less linear and favored shared memory systems, the results
might have been very different.

Frankly, if they wanted to make the "Linux v. Solaris" game stick,
they should have taken a SunFire V40z and compared _directly_. Or at
least pit a SunFire V40z running Linux against the same cluster, as
the costs are very much near each other.

So, in the end, I think this was a _poor_ test overall, because apples
and oranges are being compared. The clustered setup has _better_
failover; the shared memory system has _better_ "raw interconnect."
And it wasn't fair to use an aged Sun box -- a newer, "cost
equivalent" SPARC (III/IV?) should have been used, especially given
the costs. It's very likely that someone was just "re-using" the only
Sun box they had, which is just a piss-poor show of journalism. The
memory of the Sun box should also have been boosted to the same total
amount to show off the power of a shared memory system with the
appropriate benchmarks. I mean, the shared memory system has an
interconnect measured in multiple GBps, while the cluster is well
under 0.1GBps for the GbE, and under 0.5GBps for the FC-AL, even on
PCI-X 2.0.

Furthermore, I would really like to see how the 4x P4 MCH platforms
would perform versus more of a NetApp setup, beyond just the SunFire
V40z (Opteron). Or, better yet, a new Opteron platform with HTX
InfiniBand (InfiniBand directly on the HyperTransport). That's
_exactly_ what Sun is moving to, and something Intel doesn't have.
Especially considering that InfiniBand on PCI-X 2.0 is only capable of
maybe 0.8GBps in an "ideal" setup, and 1GbE can't get anywhere near
that (let alone the layer-3/4 overhead!), while HTX InfiniBand has
broken 1.8GBps! At 1.8GBps of InfiniBand throughput, you're starting
to blur the difference between clustered and shared memory systems,
especially with the HyperTransport protocol -- and especially versus
traditional GbE.
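[For a rough sense of the interconnect gap being argued here, the
quoted link rates can be lined up in a few lines of Python. This is
only a back-of-envelope sketch built from the figures in this post
(plus an 8b/10b encoding factor for GbE/FC); it is not a benchmark,
and the InfiniBand, HTX and UPA/Fireplane numbers are simply the ones
cited above.]

# Back-of-envelope interconnect comparison using the figures quoted
# in this post. The 8b/10b factor models the data-level encoding on
# GbE/FC; layer-2/3/4 protocol overhead would reduce these further.

def encoded_throughput_GBps(nominal_gbps, encoding=8/10):
    """Data throughput in GBps for a serial link with 8b/10b encoding."""
    return nominal_gbps * encoding / 8  # 8 bits per byte

links = {
    "GbE (1 Gbps)":            encoded_throughput_GBps(1),
    "FC-AL (2 Gbps)":          encoded_throughput_GBps(2),
    "FC-AL (4 Gbps)":          encoded_throughput_GBps(4),
    "InfiniBand on PCI-X 2.0": 0.8,   # "ideal" figure cited above
    "HTX InfiniBand":          1.8,   # end-to-end figure cited above
    "UPA port (per node)":     1.0,   # UltraSPARC II era
    "Fireplane (per node)":    4.8,   # UltraSPARC III/IV era
}

for name, gb_per_s in sorted(links.items(), key=lambda kv: kv[1]):
    print(f"{name:26s} ~{gb_per_s:4.1f} GBps")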
Intel can't even compete at 40% of the power of Opteron HTX in a
cluster configuration.

** SIDE DISCUSSION: SATA and the evolution to SAS (bye SCSI!) **

As far as SATA v. SCSI, they were using 10K SATA disks that basically
roll off the same lines as SCSI. Using the intelligent FC host fabric,
the SATAs not only queue just as well as a SCSI array, the SATA
throughput is _higher_ because of the reduced overhead in the
protocol. There has been study after study after study showing that
if [S]ATA is paired with an intelligent storage host, the host will
queue and the drives will burst. The combination _roasts_ traditional
SCSI -- hence why SCSI is quickly being replaced by Serial Attached
SCSI (SAS), a multi-target, ATA-like point-to-point interconnect, in
new designs.

Anyone who still thinks SCSI is "faster" versus [S]ATA is lost. Yes,
[S]ATA doesn't have queuing, and putting NCQ (Native Command Queuing)
on the Intelligent Drive Electronics (IDE) is still not the same as
having an intelligent Host Adapter (HA), which SCSI has always had.
But when you use an intelligent Host Adapter (HA) for [S]ATA, such as
the ASIC in 3Ware and other products, the game totally changes.
That's when SCSI's command set actually becomes a latency liability
versus ATA -- especially against an ASIC design like 3Ware's, and the
newer Broadcom, Intel and other solutions, SCSI can _not_ compete.
Hence, again, the evolution to SAS. Parallel is dead, because it's
much better to have a number of point-to-point devices with direct
pins through a PHY to a wide ASIC than a wide bus that is shared by
all devices.

And these 10K RPM SATA models are rolling off the _exact_same_ fab
lines as their SCSI equivalents, with the same vibration specs and
MTBF numbers. They are not "commodity" [S]ATA drives -- in fact, many
7,200rpm SCSI drives now come off those commodity lines (and share
the same 0.4Mhr MTBF as commodity [S]ATA).

-- 
Bryan J. Smith   mailto:b.j.smith at ieee.org
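[As a toy illustration of the "host will queue and the drives will
burst" point above: the sketch below reorders a hypothetical backlog
of requests elevator-style, the way an intelligent host adapter ASIC
(3Ware-class, in the post's example) can before issuing them to each
drive. The LBAs and queue depth are made up; this only shows why
host-side reordering shortens head travel, not any vendor's actual
firmware behavior.]

# Toy sketch: host-side queuing/reordering vs. naive in-order issue.
import random

random.seed(0)
requests = [random.randrange(0, 1_000_000) for _ in range(32)]  # queued LBAs

def total_seek_distance(order, start=0):
    """Sum of head movement (in LBA units) when servicing requests in order."""
    distance, pos = 0, start
    for lba in order:
        distance += abs(lba - pos)
        pos = lba
    return distance

fifo     = total_seek_distance(requests)          # dumb in-order issue
elevator = total_seek_distance(sorted(requests))  # host-side reordering

print(f"FIFO issue     : {fifo:>9,} LBA units of head travel")
print(f"Elevator issue : {elevator:>9,} LBA units of head travel")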
Bryan J. Smith
2005-Jul-09 00:03 UTC
[CentOS] Re: [OT] Hot swap CPU -- acronyms/terminology defined ...
Since someone previously complained about my lack of definitions and
expansion ...

On Fri, 2005-07-08 at 17:43 -0500, Bryan J. Smith wrote:

> Shared memory systems (especially NUMA)

NUMA = Non-Uniform Memory Access. Most platforms have adopted it,
including Opteron, Power, SPARC, etc..., although commodity PowerPC
and Intel systems have not (not even Itanium 2). There are
proprietary, extremely costly Xeon and Itanium systems that use NUMA.

> GbE

Gigabit Ethernet -- 1Gbps = 1000Mbps ~ 100MBps = 0.1GBps with the
typical 8b/10b data-level encoding. That's before including the
overhead of the layer-2 frame, let alone typical layer-3 packet and
layer-4 transport parsing.

> or FC interconnected.

Fibre Channel, a storage-oriented networking stack. Typical speeds for
Fibre Channel Arbitrated Loop (FC-AL) are 2-4Gbps = 2000-4000Mbps ~
200-400MBps (0.2-0.4GBps) with the typical 8b/10b data-level encoding.

> The reality is that a dual-2.8GHz Xeon MCH

Memory Controller Hub (MCH), aka the Front Side Bus (FSB) -- slang:
"Front Side Bottleneck." All processors contend for the same GTL logic
"bus" to the MCH, as do all memory and all I/O -- a literal "hub" type
architecture (_all_ nodes receive from a single transmitting node). On
Itanium 2, Intel calls this Scalable Node Architecture (SNA), which is
not true at all. Bandwidth of the latest AGTL+ is up to 8.4GBps in
DDR2/PCIe implementations, although it can be widened to 16.8GBps. But
remember, that is _shared_ by _all_ CPUs, memory _and_ I/O -- and only
one can talk to another at a time because of the MCH. Even if a
proprietary NUMA solution is used, _all_ I/O _still_ goes through that
MCH (to reach the ICH chips).

> NUMA/UPA

Ultra Port Architecture (UPA), Sun's crossbar "switch" interconnect
for UltraSPARC I and II. Most RISC platforms (including Power, but
typically not PowerPC) use a "switch" instead of a "hub" -- including
EV6 (Alpha 264, "32-bit" Athlon), UPA and others. This allows the UPA
"port" to connect to a variety of system "nodes" -- up to even 128
"nodes" -- at 1-2+GBps per "node" in the "partial mesh." Performance
is typically 1GBps per UPA "port," with 2 processors typically on a
daughtercard with local memory (hence NUMA). The "Fireplane" is an
advancement of the UPA for UltraSPARC III and IV which increases
performance to 4.8GBps per "node." Opteron, by comparison, has a
direct 6.4GBps path to DDR memory _plus_ up to 3 HyperTransport links
of 8.0GBps each (6.4GBps in previous versions -- 1 link for the 100
series, 2 for the 200, 3 for the 800).

> sub-0.5GBps for the FC-AL on even PCI-X 2.0.

PCI-X 1.0 is up to 133MHz @ 64-bit = 1.0GBps (1-slot configuration).
PCI-X 2.0 is up to 266MHz @ 64-bit = 2.0GBps (2-slot configuration).
Real "end-to-end" performance is typically much lower. E.g., Intel has
yet to reach 0.8GBps for InfiniBand over PCI-X, which is the "lowest
overhead" communication protocol for clusters. FC introduces far more
overhead, and GbE (~0.1GBps before overhead) even more. They do _not_
bother with 10GbE (~1GBps before overhead) on PCI-X (but instead use
custom 600-1200MHz XScale microcontrollers with direct 10GbE
interfaces short of the PHY). PCIe is supposed to address this, with
up to 4GBps bi-directional in a 16-lane configuration, but the MCH+ICH
design is proving unable to break a sustained 1GBps in many cases,
because it is a peripheral interconnect, not a system interconnect.
Graphics cards are typically just shunting to/from system memory, and
do it without accessing the CPU, and Host Bus Adapters (HBA) do the
same for FC and GbE.
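[The parallel-bus figures in these definitions all come from the same
width-times-clock arithmetic; here is a minimal sketch of that
arithmetic. Small differences from the values quoted above are just
rounding in the quoted figures.]

# bandwidth (GBps) = (bus width in bytes) x (transfer rate)

def bus_GBps(width_bits, mhz, transfers_per_clock=1):
    return (width_bits / 8) * mhz * transfers_per_clock / 1000

print("PCI-X 1.0  (64-bit @ 133MHz)          ~",
      round(bus_GBps(64, 133), 1), "GBps")       # quoted as ~1.0GBps
print("PCI-X 2.0  (64-bit @ 266MHz)          ~",
      round(bus_GBps(64, 266), 1), "GBps")       # quoted as ~2.0GBps
print("Opteron DDR-400 (128-bit, 2x200MHz)   ~",
      round(bus_GBps(128, 200, 2), 1), "GBps")   # the 6.4GBps figure
print("AGTL+ FSB (64-bit, quad-pumped 266MHz) ~",
      round(bus_GBps(64, 266, 4), 1), "GBps")    # ~the 8.4GBps figure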
I.e., there's _no_way_ to talk to the CPU through the MCH+ICH kludge
at those speeds, so the "processing is localized."

> HTX InfiniBand (Infiniband directly on the HT).

HyperTransport eXtension (HTX) is a system (not peripheral)
interconnect that allows clustering of Opterons with _native_
HyperTransport signaling. InfiniBand over HTX is capable of 1.8GBps
end-to-end -- and that's before figuring in the fact that _each_
Opteron can have _multiple_ HyperTransport connections. This is a
_commodity_ design, whereas the few capable Intel Xeon/Itanium 2
clusters are quite proprietary and extremely expensive (like SGI's
Altix).

> fab lines as their SCSI equivalents, with the same vibration specs
> and MTBF numbers. They are not "commodity" [S]ATA drives, of

Mean Time Between Failures (MTBF):
  Commodity Disk:    400,000 hours (50,000 starts, 8 hours operation/day)
  Enterprise Disk: 1,400,000 hours (24x7 operation)

It has _nothing_ to do with the interface. 40, 80, 120, 160, 200, 250,
300, 320 and 400GB drives are "commodity disk" designs; 9, 18, 36, 73
and 146GB drives are "enterprise disk" designs. The former have 3-8x
the vibration (less precise alignment) at 5,400-7,200 RPM than the
latter at 10,000-15,000 RPM. There are SCSI drives coming off
"commodity disk" lines (although they might be tested to higher
tolerances, less vibration), and SATA drives coming off "enterprise
disk" lines.

Until recently, most materials required commodity disks to operate at
40C or less, whereas enterprise disks are rated for 55C. Newer
commodity disks can take 60C operating temps -- hence the return to
3-5 year warranties -- but the specs are still only for 8x5 operation;
some vendors (Hitachi-IBM) only rate 14x5 as "worst case usage,"
beyond which the warranty is voided. Some vendors are introducing
"near-line disk": commodity disks tested to higher tolerances and
rated for 24x7 "network managed" operation -- i.e., the system powers
them down (the disk is not in 24x7 operation, just the system is) and
spins them back up on occasion (you should _never_ let a commodity
disk sit, hence why they are _poor_ for "off-line" backup).

Anyhoo, the point is that I can build an ASIC that interfaces with
4-16 PHY (physical interface) chips that are "point-to-point" (host to
drive electronics) and drive them _directly_ and _independently_.
That's what intelligent SATA Host Adapters do -- especially with
"enterprise" 10,000 RPM SATA devices -- switch fabric + low latency.
SCSI is a "shared bus" (or "multiple shared buses") parallel design --
fine for yesteryear, but massive overhead and cost for today. The next
move is Serial Attached SCSI (SAS), which will replace parallel SCSI
using a combination of "point-to-point" ASICs in a more
distributed/managed design.

-- 
Bryan J. Smith   b.j.smith at ieee.org

---------------------------------------------------------------------
It is mathematically impossible for someone who makes more than you to
be anything but richer than you. Any tax rate that penalizes them will
also penalize you similarly (to those below you, and then below them).
Linear algebra, let alone differential calculus or even elementary
concepts of limits, is mutually exclusive with US journalism. So
forget even attempting to explain how tax cuts work. ;->