Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-21 15:54 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: Maciej Żenczykowski <maze at cela.pl>
> That's a good point - does anyone know what the new Intel
> Virtualization thingamajig in the new dual core Pentium D's is about?

It's all speculation at this point, but there are _several_ factors. I'm sure the first time Intel saw AMD's x86-64/PAE52 presentation, the same thing popped into Intel's mind that popped into mine ...

Virtualization - The 48-bit/256TiB limitation of x86-64 "Long Mode"

There is a "programmer's limit" of 48-bit/256TiB in x86-64 "Long Mode." This limitation is due to how the i386/i486 TLB works -- 16-bit segment, 32-bit offset. Had AMD chosen to ignore such compatibility, it would have been near-impossible for 32-bit/PAE36 programs to run under a kernel of a different model. But "Long Mode" was designed so its PAE52 model could run both 32-bit (and PAE36) programs as well as new 48-bit programs. We'll revisit that in a bit. Now, let's talk about Intel/AMD design lineage.

- Intel IA-32 Complete Design Lineage

IA-32 Gen 1 (1986): i386, including i486
- Non-superscalar: ALU + optional FPU (std. in 486DX), TLB added in i486

IA-32 Gen 2 (1992): i586, Pentium/MMX (defunct, redesigned in i686)
- Superscalar: 2+1 ALU+FPU (pipelined)

IA-32 Gen 3 (1994): i686, Pentium Pro, II, III, 4 (partial refit)
- Superscalar: 2+2 ALU+FPU (pipelined), FPU 1 complex or 2 ADD
- P3 = +1 SSE pipe, P4 = +2 SSE pipes

Intel hasn't revamped its aging i686 architecture in almost 12 years. The Pentium Pro through Pentium III are the _exact_same_ 7-issue (2+2+3 ALU+FPU+control) design (the P3 slaps on one SSE unit), and the Pentium 4 was a quick, 18-month refit that extended the pipes for clock (with an associated reduction in ALU/FPU performance, MHz for MHz) and added a second SSE unit. I'm sure Intel's reasoning for not bothering with a complete generation redesign beyond i686 is that it thought EPIC/Predication would have taken over by now. The reality has been quite the opposite (which I won't get back into).
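Backing up to the 48-bit "Long Mode" limit at the top of this message: it shows up concretely as the canonical-address rule, where only the low 48 bits of a 64-bit virtual address are interpreted and bits 48-63 must be a sign-extension of bit 47. A minimal sketch of that check (mine, not from the post):

```python
# Illustrative sketch: x86-64 "long mode" defines 64-bit virtual
# addresses, but only the low 48 bits are interpreted; bits 48-63
# must sign-extend bit 47 ("canonical form"). That is the
# 48-bit/256TiB programmer's limit discussed above.

VA_BITS = 48

def is_canonical(addr: int) -> bool:
    """True if the 64-bit address is a valid sign-extension of bit 47."""
    sign = (addr >> (VA_BITS - 1)) & 1
    high = addr >> VA_BITS  # bits 48..63
    return high == (0 if sign == 0 else (1 << (64 - VA_BITS)) - 1)

assert is_canonical(0x0000_7FFF_FFFF_FFFF)      # top of lower canonical half
assert is_canonical(0xFFFF_8000_0000_0000)      # bottom of upper canonical half
assert not is_canonical(0x0000_8000_0000_0000)  # inside the non-canonical "hole"
```

Any address in the non-canonical "hole" faults, which is why simply bolting more physical memory onto the platform does not get software past 256TiB of virtual space.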
Since then, Intel has made a number of "hacks" to the i686 architecture. One is HyperThreading, which tries to keep its pipes full by using its control units to virtualize two instruction schedulers, register sets, etc... In a nutshell, it's a nice way to get "out-of-order and register renaming for almost free." Other than the basic coherency checking necessary in silicon, it "passes the buck" to the OS, leveraging its context switching (and the associated overhead) to manage some details. That's why HyperThreading can actually be slower for some applications: they do not thread, and the added overhead in _software_ results in reduced processing time for those applications.

"Yamhill" IA-32e, aka "EM64T," was just a P4 ALU refit for x86-64/PAE52, but it lacks many design considerations that the Athlon has -- especially outside the programmer/software considerations, and definitely more at the core interconnect/platform level. I.e., because Intel continues to use a single-point-of-contention "memory controller hub" (MCH), memory interconnect and I/O management, among other details, are still left to the MCH. This is going to become more and more of a headache. The reality is that the Intel IA-32e platform _must_ get past the "northbridge outside the CPU" attitude to compete with AMD. As such, I have _always_ theorized that "Yamhill" is a 2-part project. Part 2 is the first redesign of an x86 core in almost (now) 12 years, which goes beyond merely adding true register renaming and out-of-order execution (which are largely hacks in the P4/HT) and goes directly to the concept of virtualizing cores. More on that in a bit; now AMD ...

- AMD x86 Complete Design Lineage

AMD Gen 1 (1992*): i386/486 ISA -- 386, 486, 5x86, K5*
- Non-superscalar: ALU + optional FPU (std. in K5)

AMD Gen 2 (1994*): i486/686 ISA -- Nx586+FPU/K5*, Nx686/K6
- Superscalar: 3+1 ALU+FPU (ALUs pipelined, FPU _not_ pipelined)

AMD Gen 3 (1999): i686/x86-64 ISA -- Athlon, Athlon64/Opteron
- Superscalar: 3+3 ALU+FPU (pipelined), FPU 2 complex _and_ 1 ADD/MULT
- Extensions are microcoded and leverage ALU/FPU as appropriate

*NOTE: The NexGen Nx586, released in 1994, forms the basis for the later K5 (i486) and the K6 (i686). AMD had scalability issues with its original non-superscalar K5 design and purchased NexGen.

SIDE NOTE: SSE Comparison
- P4 can do 3 MULT SSE (1 FPU complex + 2 SSE pipes)
- Athlon can do 3 MULT SSE (2 FPU complex + 1 FPU MULT)

Contrary to popular opinion, the Athlon64/Opteron is the _same_core_ design as the 32-bit Athlon platform. It is still the same, ultra-powerful 3+3 ALU+FPU core, with its 2 complex + 1 ADD/MULT FPU able to equal Intel's 1 complex _or_ 2 ADD FPU plus 2 SSE pipes at the majority of matrix transforms (which are MULT -- hence why Intel's FPU can't do 2 simultaneously, and relies heavily on its precision-lacking SSE pipes).

Also contrary to popular opinion, the 40-bit/1TiB Digital Alpha EV6 interconnect forms the basis for _all_ addressing in _all_ Athlon releases, including the 32-bit ones. There are a few mainboards that allow even 32-bit Athlons to safely address above 4GB with _no_ paging or issues (with an OS that offers a supporting kernel, like Linux). The 3-16 point EV6 crossbar (not "hub") architecture forced the Athlon MP to put any I/O coherency logic in the chip, so the AGPgart control is actually on the Athlon MP, and not in the northbridge. This has evolved into a full I/O MMU in the Athlon64/Opteron. Because the Athlon is 5 years newer than Intel's i686, and there was a wealth of talent influx from Digital (even though Intel got some as well, they haven't completely redesigned the i686), the Athlon has some of the latest run-time register renaming and out-of-order execution control in the core itself.
This is why doing something like HyperThreading would benefit AMD _very_little_ and largely introduce self-defeating (and even performance-reducing) overhead.

In addition to the design of PAE52, the #1 reason you can safely assume AMD is moving towards virtualization is the design limits they put on the Athlon64/Opteron. E.g., although the original 32-bit Athlon platform used logic that allowed up to the full EV6 8MB SRAM addressing (cache), the Athlon64/Opteron has been artificially limited to 1MB SRAM (saving many considerations and offering other benefits). This clearly indicates AMD does not consider the Athlon64/Opteron its long-term design.

- The Evolution to Virtual Cores

AMD's adoption of '90s concepts of register renaming and out-of-order execution is great for a single core. And Intel's HyperThreading, with the minor P4 run-time additions, passes the buck decently in lieu of a complete core redesign (which they haven't done since 1994). But the concept of extending the pipes any further for performance has been largely broken in the P4, and Intel is actually falling back to its last rev of the i686 original, the P3.

Multiple _physical_ cores have been the first step. This is little more than slapping in a second set of all the non-SRAM transistors, plus any additional bridging logic, if necessary. AMD HyperTransport requires none -- HyperTransport can "tunnel" anything (EV6 memory/addressing, I/O tunnels/bridges, inter-CPU, etc...) "gluelessly." Intel MCH GTL+ cannot, and requires bridges between the "chipset MCH" and the "multi-core MCH," adding latency. And there are nagging 32-bit limitations with GTL+ as well (long story).

The next logical evolution in microprocessor design is to blur the physical separation between cores. It's the best way forward without tearing down the entire '70s-induced concept of machine code (operator + operand, possibly control, at least microcoded internally) and the resulting instruction sets.
Instead of discrete, superscalar units of a half-dozen to a dozen pipelined units, there will be numerous, independent pipes, possibly with their own registers or a number of generic registers, as a single unit. Other than the controlling firmware and/or OS, this is _not_ what software will use. What the software will use are the virtual instantiations that partition this set of pipes and registers, which may very well be dynamic in nature. Say I boot Windows: I might instantiate a virtual i686/PAE36 core guaranteeing 100% full Win32 compatibility. Depending on what resources the chip physically has, I will likely even instantiate multiple i686 processors. The concept of multi-CPU and multi-threading has evolved into virtual cores with virtual threading. Virtualizing more CPUs, with a total number of pipes/registers greater than is physically present, will allow more registers and pipelines to be executing, instead of the common 40-50% utilization for superscalar CISC or 60-70% for superscalar RISC.

As an "added bonus," this means the 48-bit/256TiB constraint for PAE36 compatibility is _removed_. I.e., you can have a much larger, true memory pool, and any required windowing/segmentation is done with_out_ paging by the "host" memory model, even though the OS is virtually running in a PAE36 or PAE52 model.

This also gives rise to an entirely new platform for virtualization of simultaneous OSes -- be it the same OS, or different OSes. Because cores are virtual, you can have multiple, independent processors with their own registers, memory windows into physical RAM, etc... On the more "consumer" front, this will allow it to work with existing OSes as-is. On the more "load-balancing server" front, this will often be paired with software (think EMC/VMware *SX products) so numerous instances can be dynamically load-balanced across virtual cores -- with far more of the overhead moved onto the chip for increased efficiency.
But it is still managed by software (just with reduced context-switching overhead in the software). Again, it's really just a consolidation of all the run-time optimizations we have now, along with both multi-core and multi-threading approaches, into a general pool of pipes, registers and organization. Additionally, it breaks the physical constraints of the memory model for the physical hardware, which is a very big issue for our future. To ensure x86/PAE36 and x86-64/PAE52 compatibility in the future, such machines will need to be virtualized, or we'll be stuck at 48-bit/256TiB.

> As in is it worth anything?

Yes -- and almost everything to the future of Microsoft being able to sustain much of their existing Win32 codebase, which does _not_ port to PAE52 very easily and definitely _not_ with full compatibility. And we have to break the 48-bit/256TiB limitations of PAE52 while still ensuring PAE52 OSes/applications, as well as some legacy PAE36 OSes/applications, still run. The only way is to virtualize the whole freak'n chip so we can instantiate a processor, registers and its memory model -- even if dynamically assigned/shared.

And that's just for end-users, possibly workstations and entry servers. For load-balancing servers, you'll still need a software solution for management. It's just that the hardware will offer far greater efficiency and reduced context switching. In fact, the next consolidation is these virtual-core chips in blades, where you not only manage the virtual cores in the individual chips/blades, but an entire rack of blades as a single unit with multiple OSes spread across it. This already exists, but this takes it one step further -- because the processors themselves are virtualized, with greatly reduced overhead on the part of the software.

> Will it allow a dual simultaneous boot of Linux+WinXP+MacOS
> under Xen or something along those lines?

Yes.
It will both give more virtualized processors to a single executing OS, as well as create segmented, virtualized processors for independently and simultaneously operating OSes.

> Even on an SMP machine?

First off, remove the Intel-centric notion of "Symmetric" MP (SMP). Secondly, multi-processing and multi-threading are going to merge with traditional register renaming and out-of-order execution. So the traditional concept of "MP" is _dying_. In fact, it really died in general back in the '90s. I know it's hard to think outside the box and traditional thought, but most users don't understand superscalar design in the first place. Those who do understand why AMD has _not_ bothered to adopt Intel SMT (HyperThreading) in the Athlon: it wouldn't benefit, because AMD's cores are 5 years newer in design, and put far more optimizations in the chip to keep pipes full and registers used than to virtualize two sets for the OS to use.

> Anyone have any experience/knowledge about this?

I can only speculate based on the history of the players involved, as well as on AMD's PAE52 design and the limitations of the current Athlon core (which is largely the _same_ between both the 32-bit and newer 64-bit versions). But the concept of adding more pipes with lots of stages for timing only leaves more and more stages in pipes empty, or doing little. There has to be a consolidation of many run-time optimizations inside the chip, and the best way to do that is to create a bank of pipes, registers, etc... and virtually assemble them into virtual cores that are partitioned with memory as a traditional PAE36 or PAE52 processor (or multi-processor). It's going to solve a _lot_ of issues -- both semiconductor and software.

> What level of CPU/hardware(?) does the virt-core support?
> And is the virt-core 32bit?

You can be certain that the "host" OS (possibly firmware-based?) will be able to instantiate multiple PAE36 and/or PAE52 virtual systems with their own -- I'll use legacy terminology here (even if it's not technically correct) -- "Ring 0" access. So, technically, it should be possible to run any PAE36 or PAE52 OS simultaneously on the same hardware as any other PAE36 or PAE52 OS. The larger issues of firmware-OS interoperability, as well as partitioning resources (memory, disk, etc...), are really more of a political/market issue. I.e., AMD and Intel can provide the platform, but people have to work together to use it.

Furthermore, it also means that Intel can continue to best AMD in funding of OEMs and firmware/software vendors, so it still has an advantage in that capacity. I'm sure Apple will be protective of its firmware, and Intel's new, supposedly "open" firmware is rather proprietary. As I've repeatedly commented elsewhere, the 2 "most open" hardware vendors right now are AMD and Sun -- x86-64 and SPARC, respectively. Intel has not only protected non-programmer aspects of IA-64 heavily, but most of their new platform developments for even IA-32e (EM64T) are _very_proprietary_. IBM is partially doing the same with Power in its microelectronics offering, but it is _not_ the same in its branded Power solutions (among others). So this is not going to help vendors who require firmware and data organization that is not open and standardized. We're fine on legacy Win32 platforms, but it's not going to address Mactel, nor solve the problem of existing OSes that don't run under current virtualization solutions because of such proprietary requirements.

-- Bryan J. Smith mailto:b.j.smith at ieee.org
Peter Arremann
2005-Jun-24 05:37 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
On Tuesday 21 June 2005 11:54, Bryan J. Smith <b.j.smith at ieee.org> wrote:
> There are a few mainboards that allow even 32-bit Athlons to safely
> address above 4GB with _no_ paging or issues (with an OS that offers
> such a supporting kernel, like Linux).

How does that work? :-)

Peter.
Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-24 21:42 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: James Olin Oden <james.oden at gmail.com>
> Any idea what the BIOS actually does to enable "linux" memory mode?
> That is what registers are poked in the MCH (I assume it would be the
> MCH)? Do you think this could be done post bios by say a boot loader?

It completely _breaks_ Intel GTL compatibility at the APIC level, IIRC. You can _not_ run Windows on it, or a "normal" Linux kernel.

From: Feizhou <feizhou at graffiti.net>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107759901509280&w=2
> The Linux option in bios is mentioned here.
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107757492125437&w=2
> I think what Bryan here is talking about is called IOMMU by the kernel guys.

No, that was already at the bottom of the link I sent before. I was referring to _another_ thread where a gentleman from AMD (with an amd.com address) was talking about the hack that enables this. It is a little-known feature because only 2-3 mainboards have the BIOS hack as an option.

-- Bryan J. Smith mailto:b.j.smith at ieee.org
Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-24 22:00 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: Peter Arremann <loony at loonybin.org>
> Thanks - but both links again talk about a 32bit/4GB schema and don't
> talk at all about addressing >4GB without the need for paging - that
> was the statement Bryan made when posting.

Correct. The links to the posts he provided were the _same_ as the posts at the bottom of my original link. I'm still trying to find the post I was referring to. As I said, apparently it was _not_ in the 2004Feb17 thread on Intel EM64T.

-- Bryan J. Smith mailto:b.j.smith at ieee.org
Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-24 22:44 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: Peter Arremann <loony at loonybin.org>
> And that's exactly the part I don't get - if you have a 32bit address
> model then you have to use PAE of some sort (compatible to Intel
> PAE36 or not) to get to address more than that memory...

The BIOS hack basically says "don't be stupid" above 4GB. It tells the Athlon MP to treat the PAE36 addresses linearly, instead of paging in GTL-compatible fashion.

With that said, do you mean _beyond_ 36-bit/64GiB? First off, I don't think there ever was an Athlon MP mainboard with more than 64GiB (or more than 32GiB, for that matter). Secondly, programs are written to PAE36, so they can only address 64GiB (if not 32-bit/4GiB). *BUT* is it possible for the Athlon MP to still address more than 64GiB and "window" (not page) PAE36/64GiB programs? That's a good question, and it's the #1 reason I'm trying to find that post in the LKML. In reality, if the hack uses Linux/x86-64, it could very well be that it puts the Athlon MP in "48-bit/256TiB Long Mode" and translates addresses linearly. The Athlon MP wouldn't offer PAE52, no. But it _could_ offer PAE36 windows in a 48-bit (40-bit physical EV6) space. Remember, the Athlon/MP and Athlon64/Opteron are the _exact_same_ core design, including the 40-bit/1TiB EV6 interconnect for addressing outside the chip (even if the latter tunnels over HyperTransport between CPUs and I/O to other memory, instead of using a "crossbar switch").

> again - how do I generate a >32bit address when using a 32bit
> address model without pages? :-)

Remember, in the i386 (and, subsequently, the i486 TLB), the segment register is offset at bit 20. That means the most significant bit (MSB) of the segment register actually has a value of 2^35 (32GiB)!
Segment: 1111 1111 1111 1111 ____ ____ ____ ____ ____
Offset:  ____ 1111 1111 1111 1111 1111 1111 1111 1111

Before PAE36 in the i686, Intel would throw an exception if you set _any_ of the top 4 bits of the segment register, or if the 5th MSB of the segment was set and either the MSB of the offset or the previous bits "overflowed" when "normalized" into the physical address. That would result in a physical address >32-bit/>4GiB. In the i686 with PAE36, Intel now uses those >32-bit/>4GiB addresses as "pages" into <32-bit/<4GiB addresses. In EV6, with the incompatible "BIOS hack" and a supporting kernel, the Athlon MP's TLB doesn't page; it allows _direct_, _linear_ access above 32-bit/4GiB.

Digital EV6 is capable of physically addressing 40-bit. Intel GTL is only capable of 32-bit. When the Athlon is in GTL compatibility mode, it only does 32-bit for full compatibility. But you can tell it to use native, 40-bit EV6 with this BIOS hack. The BIOS hack sets up the crossbar registers, which the OS must then support. I'll confirm how this is being done. I believe it actually leverages the Linux/x86-64 kernel.

> Once I have that address, throwing it out on the bus is
> easy - but how do you generate that address without a PAE?

That's where the hack comes in. Remember, _all_ kernels in _all_ OSes use the TLB -- hence why an i486 ISA compatible is virtually required these days (or they do it in the kernel's software). Normally the Athlon TLB will do the same thing as any Intel GTL part, so any such i686 kernel with PAE36 will normally do the same thing. But the BIOS hack puts the Athlon TLB into its _true_ 40-bit EV6 addressing, and the kernel then knows of this and uses it.

> So you're simply talking about the ability to output an
> address longer than 32bit on the bus?

It's more than just that. It's the combination of presenting PAE36 to _user_ software while using _linear_ addressing physically. Intel i686+GTL already does PAE36, and pages in addresses above 32-bit.
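The segment+offset "normalization" diagrammed above can be sketched numerically. This follows the post's layout (a 16-bit segment placed at bit 20 plus a 32-bit offset, yielding up to a 36-bit result); note that classic real-mode x86 shifts the segment by only 4 bits, so the 20-bit shift here is purely the post's model:

```python
# Sketch of the address "normalization" described above, per the
# post's diagram: a 16-bit segment at bit 20 plus a 32-bit offset.
# (Assumption: this illustrates the post's model, not real-mode x86,
# which shifts the segment by only 4 bits.)

def normalize(segment: int, offset: int) -> int:
    """Combine segment and offset into a single up-to-36-bit address."""
    assert 0 <= segment < (1 << 16) and 0 <= offset < (1 << 32)
    return (segment << 20) + offset

# The MSB of the segment lands at bit 35, i.e. a value of 2^35 = 32GiB:
assert normalize(0x8000, 0) == 1 << 35

# A large segment+offset "overflows" past 32-bit/4GiB -- exactly the
# kind of normalized address PAE36 pages, and the EV6 hack uses linearly:
assert normalize(0xFFFF, 0xFFFFFFFF) > (1 << 32) - 1
```

Everything below 4GiB behaves identically either way; the two designs only diverge on what happens to results whose normalized value crosses the 32-bit boundary.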
Athlon+EV6 normally just emulates that, right down to the TLB. This hack doesn't; it unleashes the Athlon as it was meant to work on a 40-bit interconnect designed for 64-bit Alpha processors. And that includes having the kernel drive the TLB to take those >32-bit "normalized" addresses that would normally be "paged" and just treat them as linear space. As far as above 36-bit, I'm not sure. I'll check on that.

> on AMD64, yes, thats for sure... but you were referring to
> 32bit athlons in the statement I'm trying to understand.

And what I'm saying is that Athlon is Athlon. It's the same core, same 40-bit EV6 interconnect. Everything else is compatibility. On the Athlon64, you can run a PAE52 kernel (on its 40-bit physical platform) and run PAE36 applications, giving linear address space to them all while they think paging is being used. What this hack does is the same thing for the Athlon MP -- marketed as "32-bit," but the _same_ 40-bit capable, _physical_ platform. EV6 is EV6, designed for the 64-bit Alpha, and AMD couldn't "cripple" it. The same board logic and cores are used whether a 32-bit Athlon or a 64-bit Alpha is involved. But doing anything but traditional GTL 32-bit/PAE36 would normally _break_ hardware and OSes -- unless you have a hack to the BIOS which enables this "Linux" memory mode, and a kernel which supports it. Again, I'll find the post, which will lead me to the tech info.

-- Bryan J. Smith mailto:b.j.smith at ieee.org
Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-25 21:46 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: Peter Arremann <loony at loonybin.org>
> Then the first AnandTech benchmark article
> (http://www.anandtech.com/IT/showdoc.aspx?i=2447) is exactly what you
> want to look at. Huge amount of memory (when compared to the size of
> the database running on the system) on a 64bit linux kernel...
> We're doing the same for one of our apps called IPM. Its a PHP app
> running against a quad opteron with 16GB ram. Heavy on network IO
> (during business hours its rare that we don't saturate the main
> 100mbit link) but little disk activity. DB size is about 2.5GB and we
> end up with a couple of gig for disk buffers. CentOS4 of course...
> anything specific you're looking for?

I think I know where you and I are differing. When you talk about "heavy [network] IO," you refer to SQL-based applications over a primary 100mbit link. In reality, the MCH bottleneck isn't much of an issue here. When I talk about "heavy [network] IO," I'm typically referring to less intelligent applications (e.g., NFS or other "raw block" transfer) over one or even multiple GbE, possibly FC-AL, links (possibly direct IP, link-aggregated IP, maybe 1 out-of-band channel, or possibly to a Storage Area Network, SAN) -- although I have also built financial transaction and engineering systems that required GbE.

I guess this is where my terminology really differs. A lot of people are using Linux for Internet services. I've typically been using Linux for high-performance LAN systems -- both "raw block" as well as intelligent applications. My Internet connection is not my bottleneck. BTW, despite popular thought, this can be done quite inexpensively when needed. It really all depends. But in your applications, you're definitely _not_ going to see Opteron doing much for you over Xeon -- let alone when running Linux or Windows (Solaris might be another story, though).

-- Bryan

P.S. Just a follow-up: _never_ assume you're the only EE in a thread.
[ Let alone don't assume I haven't designed memory controllers as part of my job, beyond just my degree option in computer architecture. That's why I had to "give up" in the other thread: every time I try to explain something, you go down a very simplified path and I have to stop and explain it (e.g., the fact that IA-32/x86 _can_ address beyond 4GiB). ]

-- Bryan J. Smith mailto:b.j.smith at ieee.org
Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-25 22:06 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: Peter Arremann <loony at loonybin.org>
> Then enlighten me - if I have 40 address bits - transmit only the
> higher 37 since I don't need the lower end. Timing schemas show only
> one Input hold time per address transfer for all available pins - how
> can that be a 32bit bus?
> (ftp://download.intel.com/design/Xeon/datashts/30675401.pdf)

Once again, stop assuming that a trace and its specifications for board layout mean that memory is directly addressable over those pins. We could go on about that all night and for days, and I can give you countless examples of embedded, PC and other memory controllers, memory technology, etc... where paging and other "swap" is required. Heck, as an EE, I would ass-u-me you would have been exposed to memory controller design. It is a complex mess whereby each new change in addressing exponentially increases the transistor count.

> And, if timing diagrams, pinouts and so on lie about the size of the
> bus, ... cut derogatory nonsense ...

*STOP* This is why I can't even begin. I _never_ said that Intel didn't offer 64GiB (36-bit) memory addressing. The traces _must_ exist so Intel can offer it on [A]GTL+ platforms, but that does not mean there is not some "paging" going on in the memory _logic_. I just said it cannot address it linearly -- directly by 36-bit -- in the GTL design.

It all goes back to Intel's belief, back in the '90s when it decided not to build a new architecture for IA-32, that IA-64/Itanium would have taken over the i686/GTL world by now -- a belief Intel is now paying the price for, quickly retrofitting everything it can ASAP. AMD instead decided, back in 1996+, to switch to EV6 as its foundation for _all_ current processors, and that includes tunneling EV6 over HyperTransport in the Athlon64/Opteron. This has to do with the fact that the Athlon core is a true 40-bit addressing processor, and not 32-bit with PAE36 to page in the "overhang" from a segment+offset that is normalized above 4GB.
Athlon just _emulates_ PAE36 for compatibility. If you hit /proc/cpuinfo on _any_ Athlon, it will show that the PAE36 flag is supported. PAE36 support has _always_ been in _every_ Athlon because the core design is 40-bit EV6. AMD took the time, effort and transistors to emulate PAE36 at the control level, and put the logic in its TLB and memory controller. If they hadn't, then AMD wouldn't be able to run any PAE36 OSes or applications.

And even if you don't have more than 4GB of RAM, there are still memory organization issues. E.g., Red Hat currently ships i386/i686 kernels with the 4G+4G model, which hurts performance compared to the 1G+3G model. Why does it hurt performance? Because it relies either on the PAE36 paging logic, or on a software emulation of it (if the processor does not support PAE36). With the BIOS hack and kernel, the Athlon MP can linearly address directly above 4GB.

Intel is _finally_ just coming out with its first true 40-bit physical interconnect that breaks the limitations of GTL. People should be wary and _not_ go with the majority of Intel's existing platforms because of this, even those with EM64T -- only these newer platforms.

> And if its not, then your whole speech about the pae36
> differences between a gtl+ or ev6 connected device is
> wrong, which then in turn makes the only real difference
> the iommu the newer athlon cores provide (so dma can go
> above the first 4GB rather than having to be bounced)...
> This is also supported by the intel, amd and redhat docs
> (see links posted above and in previous mails), the post
> Feizhou made in this thread covering the LKML references about
> using the agpgart and even microsoft
> (http://www.microsoft.com/whdc/system/platform/server/PAE/pae_os.mspx)

I know -- I also posted it, and there are even _more_ comments in the LKML. There are comments on how AMD has extra GARTs in the Athlon (yes, even the so-called "32-bit" Athlon) to handle _all_ I/O, not just AGP (long story).
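As a quick aside, the /proc/cpuinfo claim above is easy to check by parsing the "flags" line. A small sketch, using sample text so it is self-contained (the model string and flag list are illustrative, not a captured dump):

```python
# Sketch: check for a CPU feature flag the way one would against
# /proc/cpuinfo. The sample text below is made up for illustration;
# on a real Linux box you would read the file itself.

SAMPLE_CPUINFO = """\
model name : AMD Athlon(tm) MP 2800+
flags      : fpu vme de pse tsc msr pae mce cx8 mmx sse syscall
"""

def has_flag(cpuinfo: str, flag: str) -> bool:
    """Return True if `flag` appears in the cpuinfo flags line."""
    for line in cpuinfo.splitlines():
        if line.startswith("flags"):
            return flag in line.split(":", 1)[1].split()
    return False

assert has_flag(SAMPLE_CPUINFO, "pae")      # PAE is advertised
assert not has_flag(SAMPLE_CPUINFO, "lm")   # no 64-bit long mode here
```

Whether the flag is backed by "real" paging hardware or emulated at the control level, as the post argues for the Athlon, is invisible at this interface -- software only sees the flag.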
But I'm not talking just about that. I'm talking about the serious limitations of GTL itself -- even on Socket-604 and Xeon EM64T. It's only the latest platforms that address this.

> *shrugs* Intel, AMD and any other spec sheet you can find down
> to VIA chipset docs agrees with that... But I guess I'm still
> wrong though?

Dude, you are using _board-level_ spec sheets. I'm talking about the _internal_ design of the CPU/interconnect and how it handles the addressing between the systems software and the interconnect. You get that from _neither_ the "board level" spec sheets _nor_ the "programmer" guides. You have to find more eccentric docs; many times, they are not on-line. Intel is not going to boast about how even its AGTL+ chipsets with first-gen EM64T can't directly address above 4GiB because of legacy design limitations in the platform.

-- Bryan J. Smith mailto:b.j.smith at ieee.org
Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-25 22:45 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: Peter Arremann <loony at loonybin.org>
> Actually apps like the one I was referring to showed about 50%
> single thread performance gain when going from a 2.4GHZ Xeon
> to a 2.2GHz Opteron.

That's not a good comparison, because the ALU control of the Athlon is about 50% faster, MHz for MHz, than the P4-Xeon. So that could easily be a computational benefit and not an interconnect one.

> Never assume you've done more than others either :-)

I didn't.

> I've done the more difficult job with finding all the applicable documents

Dude, you did _not_ send me any documentation I have not already seen. I have been using developer.intel.com for years in my semiconductor design career, as well as more of an IT-level system designer. The problem here is that you can't seem to understand that a trace does _not_ indicate how something actually works in the memory logic. E.g., the existence of a trace does not tell you whether a normalized address that results in a PAE36 (>4GiB) address is going to either:

A) Directly drive that trace and fetch for the process to use
B) Be intercepted by the paging logic, referenced in a page table, after which _that_ logic actually fetches the memory, which is then mapped into <4GiB for the process to use

It's not just the simple trace, and it's not just the simple, programmer-level logic. If Intel PAE36 processors didn't have address lines above bit 31, they couldn't address above 4GiB at all! But just because they have those traces doesn't mean they can directly use them. First you looked at it from the "programmer" level, then you looked at it from the "board technician" level. Now I'm telling you: _get_in_the_chip_!

> while you just put out hearsay and "doesn't work like that" statements
> without really backing it up. And yes, IA-32 _can_ address more than
> 4GB - called PAE.

Exactly! What I'm telling you is that _all_ "32-bit" Athlon processors can bypass PAE36, and significantly _increase_ performance at the page table level.
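The A/B distinction above can be made concrete with a toy model. Entirely illustrative (the page-table contents and page size are made up): path A drives the >4GiB address as-is, while path B remaps it through a table into the <4GiB window before the fetch.

```python
# Toy model of the two paths described above for a >4GiB physical
# address: (A) drive it onto the interconnect directly (linear), or
# (B) intercept it, look it up in a page table, and hand back a
# remapped <4GiB address. Table contents here are hypothetical.

FOUR_GIB = 1 << 32

def fetch_linear(phys: int) -> int:
    # Path A: the address goes out on the (40-bit) interconnect as-is.
    return phys

def fetch_paged(phys: int, page_table: dict, page_bits: int = 12) -> int:
    # Path B: the >4GiB page is remapped into the <4GiB window.
    page = phys >> page_bits
    offset = phys & ((1 << page_bits) - 1)
    return (page_table[page] << page_bits) | offset

high = 5 * FOUR_GIB + 0x10                  # a physical address above 4GiB
table = {high >> 12: 0x1234}                # hypothetical remapping entry

assert fetch_linear(high) == high           # path A: address unchanged
assert fetch_paged(high, table) < FOUR_GIB  # path B: lands below 4GiB
```

The pins driven in both cases can be identical; only the table lookup on path B (and the bookkeeping to maintain that table) distinguishes them, which is the performance cost being argued about here.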
But it requires that the platform be configured in a way that is _incompatible_ with all OSes _except_ a hack for Linux. This is rare, but it has been done on a few mainboards.

-- Bryan J. Smith mailto:b.j.smith at ieee.org
Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-28 17:00 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: alex at milivojevic.org

> Well, this is how I interpreted Bryan's emails. He'll probably correct me if
> I'm wrong (yeah, I have EE, but haven't done much EE work since I
> graduated, it just happened to be mostly "software" things for me, so I'm
> rather rusty here :-(

If you haven't noticed, I am _not_ big on "credentials." As far as I'm concerned, as long as someone has the knowledge, I couldn't care less about the paper**. Even most state BoPEs will allow you to replace an ABET-accredited BSE with 8-12 years of experience (plus another 4-5 years of experience post-degree) to qualify to become a PE. There is a semiconductor engineering mindset, and it differs from that of a programmer or even a technologist.

[ **NOTE: Being a consultant, I finally had to "give in" to the paper. I also decided to major in engineering "just in case I needed it" (and, later on, got to use it for 2 years in the semiconductor industry). ]

> I don't think Bryan is talking about "external" width of the address bus,
> how many lines you see printed on the motherboard.

That's because it is _irrelevant_ in many cases. You can mux lines, which is exactly what the original EV6 crossbar Athlon does.

> He's talking how the things are organized internally, and about the way
> the bus logic works. That would be what theoretical implementations
> might use, not what some specific implementation is limiting itself to.

Exactly. Until the new 40-bit Xeon MPs came out, they were pretty much "slap-on" designs with extended ALUs and microcoding to be x86-64 compatible, to a point.

> Sorry for mixing software and hardware from now on,

It's hard not to. First you have to consider the programmer aspects. Then you have to realize those will influence hardware compatibility.
Then you have to remember that with every new addressing model, you _exponentially_ increase the external logic to drive it.

> just trying to make Peter see where his misconception is (the best
> way I can, which might not be good way at all).

It's fairly difficult to do without breaking out a basic memory controller circuit at, at minimum, the transistor level, or possibly the combinational NAND gates it somewhat represents.

> The programming model for 32-bit userland applications is obviously
> limited to 32 bits -- the sizeof(void *) will tell you that. So single
> process can see only 32-bit logical linear address space.

To a point. PAE36 is what allows you to break that, because even in the i386, the 16-bit segment register, offset 4 bits from the 32-bit offset register, results in a 36-bit "normalized" address (it can actually be 37-bit, but that's another story). How the processor handles that in a way that is compatible with the OS is the problem. PAE36 uses "paging" from above 32-bit (up to 36-bit/64GiB) down to below 32-bit (up to 4GiB). PAE36 processors can support this paging, and they have at least 36-bit traces on the platform. GTL+ logic was designed with this "slapped on," and never bothered with direct, linear access until just recently (with the new 40-bit redesign for Xeon MP).

Athlon, on the other hand, has always been 40-bit EV6. AMD decided to support the PAE36 paging, which added logic. But they _always_ had "40-bit linear addressing for free" inherently in the platform. That's where these few BIOS hacks come in, combined with an OS/model that can take advantage of it. If you enable this mode for Windows, it will _not_ work!

> On the other hand, processor (the hardware) doesn't need to have
> such limits, if its internal organization is wider. So, AMD (processor)
> is able to see 40-bit linear physical address space as one single big
> chunk of memory.
> 32-bit applications will have their
> 32-bit *logical* address space mapped into this 40-bit linear *physical*
> address space of processor. I don't know if programming model of
> "32-bit" AMD processors allows you to have wider-than-32-bit pointers
> (even if it did, you would have to have compiler that can generate such
> code, gcc can't do it for sure).

Not true. The model is _still_ PAE36, up to 36-bit/64GiB. But instead of paging, the combination of the Athlon's "inherent 40-bit" addressing with an OS that can use it avoids the paging entirely. Normally PAE36 does paging, which is what PAE36 OSes normally do.

> Obviously, the kernel needs to know how to manage things in this
> wider physical address space, the reason why you need patched Linux
> kernel to take advantage of it.

The actual amount of space doesn't change -- at least not under the PAE36 model. It's just how the OS commands the memory logic to use it, and that also requires the firmware (BIOS) to pre-configure it at POST. Under a nominal PAE36 OS and board, the memory logic and OS always do paging to access above 32-bit/4GiB. But I don't think this hack is able to linearly address the entire 40 bits, because of the limitations of what PAE36 can address.

> Intel (the processor) on the other hand, is able to physically address
> only 32-bit address space. Anything wider than that, it needs to page.
> Dealing with paging will obviously be additional work for OS, hence
> lower performance.

Paging is a significant hit. Anyone running the 4G+4G "HIGHMEM" model of their Linux kernel who recompiles for the 1G+3G model will notice a noticeable performance gain (as long as the machine has only ~960MiB of memory).

> So while both processors will have more than 32 address lines on the
> packaging (and printed on motherboard) (minus a couple of the lowest ones
> that are not needed), as you can see in various specifications that you
> quoted, that doesn't mean the processor's core actually sees all that
> address space.

> How else to try explaining this... Hm...
> Remember Intel 8086? It could
> address only 16-bit address space, but it had more than 16 address lines
> on the packaging. It used segmenting (hopefully the right term) to see
> wider-than-16-bit address space. Now try to make analogy with what was
> discussed so far ;-)

It's somewhat correct, although it gets interesting. It's more like EMS than XMS, because XMS shunted the processor in and out of 286/386 protected mode, and that required a 286/386 to access it. You could run a true 24-bit/32-bit mode, respectively, to avoid that. EMS worked on the old 8088/8086 (as well as the 80286/386), because pages above 1MB could be mapped in. The 80386, and some 80286 chipsets, could emulate this as well, without a special card with special addressing.

This is exactly what the GTL+ bus does. It offers special addressing lines for memory logic that pages, because that's all the OS does. Even the early EM64T processors had to deal with this "limitation" of the GTL+ platform. You want to ensure you get a new EM64T platform that doesn't take that approach.

-- Bryan J. Smith mailto:b.j.smith at ieee.org
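The 8086 segmentation just mentioned is simple enough to model directly. A minimal sketch of the address arithmetic only (not of the bus behavior): a 16-bit segment shifted left 4 bits plus a 16-bit offset yields a 20-bit space, with a small overshoot above 1MiB (the region later gated by the A20 line).

```python
def real_mode_addr(seg: int, off: int) -> int:
    """8086-style address formation: (segment << 4) + offset.

    16-bit segment and 16-bit offset give a 20-bit (1 MiB) space,
    plus a little more: 0xFFFF:0xFFFF reaches 0x10FFEF, just past
    the 1 MiB boundary.
    """
    assert 0 <= seg <= 0xFFFF and 0 <= off <= 0xFFFF
    return (seg << 4) + off
```

For example, the classic CGA/EGA text framebuffer segment 0xB800 with offset 0 lands at physical 0xB8000, and the maximal segment:offset pair overshoots 1MiB by 65,519 bytes.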
Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-28 17:28 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: Steven Vishoot <sir_funzone at yahoo.com>

> As Carol Brady said, "what was all that about?" I don't
> know if anyone else feels this way, but I think this
> subject has been beaten to death and beyond. Not sure
> if I remember right, but wasn't the original post about
> upgrading from 4.0 to 4.1? How did this topic become
> an engineering course? Sorry for being a crab... but
> come on....

If you deploy servers with more than 1GB of RAM, and definitely more than 4GB of RAM, you should be aware of what limitations there are with Intel solutions. That's what this was about.

If someone else wanted to make it into a discussion of board traces and other irrelevant nonsense, then I'm sorry for getting into that with him.

-- Bryan J. Smith mailto:b.j.smith at ieee.org