Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-21 15:54 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: Maciej Żenczykowski <maze at cela.pl>
> That's a good point - does anyone know what the new Intel
> Virtualization thingamajig in the new dual core Pentium D's is about?

It's all speculation at this point, but there are _several_ factors. I'm sure the first time Intel saw AMD's x86-64/PAE52 presentation, the same thing popped into Intel's mind that popped into mine ...

Virtualization - The 48-bit/256TiB limitation of x86-64 "Long Mode"

There is a "programmer's limit" of 48-bit/256TiB in x86-64 "Long Mode." This limitation is due to how the i386/i486 TLB works -- 16-bit segment, 32-bit offset. Had AMD chosen to ignore such compatibility, it would have been near-impossible for 32-bit/PAE36 programs to run under a kernel of a different model. But "Long Mode" was designed so its PAE52 model could run both 32-bit (and PAE36) programs as well as new 48-bit programs. We'll revisit that in a bit. Now, let's talk about Intel/AMD design lineage.

- Intel IA-32 Complete Design Lineage

IA-32 Gen 1 (1986): i386, including i486
- Non-superscalar: ALU + optional FPU (std. in 486DX), TLB added in i486

IA-32 Gen 2 (1992): i586, Pentium/MMX (defunct, redesigned in i686)
- Superscalar: 2+1 ALU+FPU (pipelined)

IA-32 Gen 3 (1994): i686, Pentium Pro, II, III, 4 (partial refit)
- Superscalar: 2+2 ALU+FPU (pipelined), FPU 1 complex or 2 ADD
- P3 = +1 SSE pipe, P4 = +2 SSE pipes

Intel hasn't revamped its aging i686 architecture in almost 12 years. The Pentium Pro through Pentium III are the _exact_same_ 7-issue (2+2+3 ALU+FPU+control) design (the P3 slaps on one SSE unit), and the Pentium 4 was a quick, 18-month refit that extended the pipes for clock (with an associated reduction in ALU/FPU performance, MHz for MHz) and added a second SSE unit. I'm sure Intel's reasoning for not bothering with a complete generation redesign beyond i686 is that it thought EPIC/Predication would have taken over by now. The reality has been quite the opposite (which I won't get back into).
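Backing up to the 48-bit "Long Mode" limit at the top of this message: it shows up concretely as the canonical-address rule, where only the low 48 bits of a 64-bit virtual address are interpreted and bits 48-63 must be a sign-extension of bit 47. A minimal sketch of that check (mine, not from the post):

```python
# Illustrative sketch: x86-64 "long mode" defines 64-bit virtual
# addresses, but only the low 48 bits are interpreted; bits 48-63
# must sign-extend bit 47 ("canonical form"). That is the
# 48-bit/256TiB programmer's limit discussed above.

VA_BITS = 48

def is_canonical(addr: int) -> bool:
    """True if the 64-bit address is a valid sign-extension of bit 47."""
    sign = (addr >> (VA_BITS - 1)) & 1
    high = addr >> VA_BITS  # bits 48..63
    return high == (0 if sign == 0 else (1 << (64 - VA_BITS)) - 1)

assert is_canonical(0x0000_7FFF_FFFF_FFFF)      # top of lower canonical half
assert is_canonical(0xFFFF_8000_0000_0000)      # bottom of upper canonical half
assert not is_canonical(0x0000_8000_0000_0000)  # inside the non-canonical "hole"
```

Any address in the non-canonical "hole" faults, which is why simply bolting more physical memory onto the platform does not get software past 256TiB of virtual space.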
Since then, Intel has made a number of "hacks" to the i686 architecture. One is HyperThreading, which tries to keep its pipes full by using its control units to virtualize two instruction schedulers, register sets, etc... In a nutshell, it's a nice way to get "out-of-order and register renaming for almost free." Other than the basic coherency checking necessary in silicon, it "passes the buck" to the OS, leveraging its context switching (and the associated overhead) to manage some details. That's why HyperThreading can actually be slower for some applications: they do not thread, and the added overhead in _software_ results in reduced processing time for those applications.

"Yamhill" IA-32e, aka "EM64T," was just a P4 ALU refit for x86-64/PAE52, but it lacks many design considerations that the Athlon has -- especially outside the programmer/software considerations, and definitely more at the core interconnect/platform level. I.e., because Intel continues to use a single-point-of-contention "memory controller hub" (MCH), memory interconnect and I/O management, among other details, are still left to the MCH. This is going to become more and more of a headache. The reality is that the Intel IA-32e platform _must_ get past the "northbridge outside the CPU" attitude to compete with AMD. As such, I have _always_ theorized that "Yamhill" is a 2-part project. Part 2 is the first redesign of an x86 core in almost (now) 12 years, which goes beyond merely adding true register renaming and out-of-order execution (which are largely hacks in the P4/HT) and goes directly to the concept of virtualizing cores. More on that in a bit; now AMD ...

- AMD x86 Complete Design Lineage

AMD Gen 1 (1992*): i386/486 ISA -- 386, 486, 5x86, K5*
- Non-superscalar: ALU + optional FPU (std. in K5)

AMD Gen 2 (1994*): i486/686 ISA -- Nx586+FPU/K5*, Nx686/K6
- Superscalar: 3+1 ALU+FPU (ALUs pipelined, FPU _not_ pipelined)

AMD Gen 3 (1999): i686/x86-64 ISA -- Athlon, Athlon64/Opteron
- Superscalar: 3+3 ALU+FPU (pipelined), FPU 2 complex _and_ 1 ADD/MULT
- Extensions are microcoded and leverage ALU/FPU as appropriate

*NOTE: The NexGen Nx586, released in 1994, forms the basis for the later K5 (i486) and the K6 (i686). AMD had scalability issues with its original non-superscalar K5 design and purchased NexGen.

SIDE NOTE: SSE Comparison
- P4 can do 3 MULT SSE (1 FPU complex + 2 SSE pipes)
- Athlon can do 3 MULT SSE (2 FPU complex + 1 FPU MULT)

Contrary to popular opinion, the Athlon64/Opteron is the _same_core_ design as the 32-bit Athlon platform. It is still the same, ultra-powerful 3+3 ALU+FPU core, with its 2 complex + 1 ADD/MULT FPU able to equal Intel's 1 complex _or_ 2 ADD FPU plus 2 SSE pipes at the majority of matrix transforms (which are MULT -- hence why Intel's FPU can't do 2 simultaneously, and relies heavily on its precision-lacking SSE pipes).

Also contrary to popular opinion, the 40-bit/1TiB Digital Alpha EV6 interconnect forms the basis for _all_ addressing in _all_ Athlon releases, including the 32-bit ones. There are a few mainboards that allow even 32-bit Athlons to safely address above 4GB with _no_ paging or issues (with an OS that offers a supporting kernel, like Linux). The 3-16 point EV6 crossbar (not "hub") architecture forced the Athlon MP to put any I/O coherency logic in the chip, so the AGPgart control is actually on the Athlon MP, and not in the northbridge. This has evolved into a full I/O MMU in the Athlon64/Opteron. Because the Athlon is 5 years newer than Intel's i686, and there was a wealth of talent influx from Digital (even though Intel got some as well, they haven't completely redesigned the i686), the Athlon has some of the latest run-time register renaming and out-of-order execution control in the core itself.
This is why doing something like HyperThreading would benefit AMD _very_little_ and largely introduce self-defeating (and even performance-reducing) overhead.

In addition to the design of PAE52, the #1 reason you can safely assume AMD is moving towards virtualization is the design limits they put on the Athlon64/Opteron. E.g., although the original 32-bit Athlon platform used logic that allowed up to the full EV6 8MB SRAM addressing (cache), the Athlon64/Opteron has been artificially limited to 1MB SRAM (saving many considerations and offering other benefits). This clearly indicates AMD does not consider the Athlon64/Opteron its long-term design.

- The Evolution to Virtual Cores

AMD's adoption of '90s concepts of register renaming and out-of-order execution is great for a single core. And Intel's HyperThreading, with the minor P4 run-time additions, passes the buck decently in lieu of a complete core redesign (which they haven't done since 1994). But the concept of extending the pipes any further for performance has been largely broken in the P4, and Intel is actually falling back to its last rev of the i686 original, the P3.

Multiple _physical_ cores have been the first step. This is little more than slapping in a second set of all the non-SRAM transistors, plus any additional bridging logic, if necessary. AMD HyperTransport requires none -- HyperTransport can "tunnel" anything (EV6 memory/addressing, I/O tunnels/bridges, inter-CPU, etc...) "gluelessly." Intel MCH GTL+ cannot, and requires bridges between the "chipset MCH" and the "multi-core MCH," adding latency. And there are nagging 32-bit limitations with GTL+ as well (long story).

The next logical evolution in microprocessor design is to blur the physical separation between cores. It's the best way forward without tearing down the entire '70s-induced concept of machine code (operator + operand, possibly control, at least microcoded internally) and the resulting instruction sets.
Instead of discrete, superscalar units of a half-dozen to a dozen pipelined units, there will be numerous, independent pipes, possibly with their own registers or a number of generic registers, as a single unit. Other than the controlling firmware and/or OS, this is _not_ what software will use. What the software will use are the virtual instantiations that partition this set of pipes and registers, which may very well be dynamic in nature. Say I boot Windows: I might instantiate a virtual i686/PAE36 core guaranteeing 100% full Win32 compatibility. Depending on what resources the chip physically has, I will likely even instantiate multiple i686 processors. The concept of multi-CPU and multi-threading has evolved into virtual cores with virtual threading. Virtualizing more CPUs, with a total number of pipes/registers greater than is physically present, will allow more registers and pipelines to be executing, instead of the common 40-50% utilization for superscalar CISC or 60-70% for superscalar RISC.

As an "added bonus," this means the 48-bit/256TiB constraint for PAE36 compatibility is _removed_. I.e., you can have a much larger, true memory pool, and any required windowing/segmentation is done with_out_ paging by the "host" memory model, even though the OS is virtually running in a PAE36 or PAE52 model.

This also gives rise to an entirely new platform for virtualization of simultaneous OSes -- be it the same OS, or different OSes. Because cores are virtual, you can have multiple, independent processors with their own registers, memory windows into physical RAM, etc... On the more "consumer" front, this will allow it to work with existing OSes as-is. On the more "load-balancing server" front, this will often be paired with software (think EMC/VMware *SX products) so numerous instances can be dynamically load-balanced across virtual cores -- with far more of the overhead moved onto the chip for increased efficiency.
But it is still managed by software (just with reduced context-switching overhead in the software). Again, it's really just a consolidation of all the run-time optimizations we have now, along with both multi-core and multi-threading approaches, into a general pool of pipes, registers and organization. Additionally, it breaks the physical constraints of the memory model for the physical hardware, which is a very big issue for our future. To ensure x86/PAE36 and x86-64/PAE52 compatibility in the future, such machines will need to be virtualized, or we'll be stuck at 48-bit/256TiB.

> As in is it worth anything?

Yes -- and almost everything to the future of Microsoft being able to sustain much of their existing Win32 codebase, which does _not_ port to PAE52 very easily and definitely _not_ with full compatibility. And we have to break the 48-bit/256TiB limitations of PAE52 while still ensuring PAE52 OSes/applications, as well as some legacy PAE36 OSes/applications, still run. The only way is to virtualize the whole freak'n chip so we can instantiate a processor, registers and its memory model -- even if dynamically assigned/shared.

And that's just for end-users, possibly workstations and entry servers. For load-balancing servers, you'll still need a software solution for management. It's just that the hardware will offer far greater efficiency and reduced context switching. In fact, the next consolidation is these virtual-core chips in blades, where you not only manage the virtual cores in the individual chips/blades, but an entire rack of blades as a single unit with multiple OSes spread across it. This already exists, but this takes it one step further -- because the processors themselves are virtualized, with greatly reduced overhead on the part of the software.

> Will it allow a dual simultaneous boot of Linux+WinXP+MacOS
> under Xen or something along those lines?

Yes.
It will both give more virtualized processors to a single executing OS, as well as create segmented, virtualized processors for independently and simultaneously operating OSes.

> Even on an SMP machine?

First off, remove the Intel-centric notion of "Symmetric" MP (SMP). Secondly, multi-processing and multi-threading are going to merge with traditional register renaming and out-of-order execution. So the traditional concept of "MP" is _dying_. In fact, it really died in general back in the '90s. I know it's hard to think outside the box and traditional thought, but most users don't understand superscalar design in the first place. Those who do understand why AMD has _not_ bothered to adopt Intel SMT (HyperThreading) in the Athlon: it wouldn't benefit, because AMD's cores are 5 years newer in design, and put far more optimizations in the chip to keep pipes full and registers used than to virtualize two sets for the OS to use.

> Anyone have any experience/knowledge about this?

I can only speculate based on the history of the players involved, as well as on AMD's PAE52 design and the limitations of the current Athlon core (which is largely the _same_ between both the 32-bit and newer 64-bit versions). But the concept of adding more pipes with lots of stages for timing only leaves more and more stages in pipes empty, or doing little. There has to be a consolidation of many run-time optimizations inside the chip, and the best way to do that is to create a bank of pipes, registers, etc... and virtually assemble them into virtual cores that are partitioned with memory as a traditional PAE36 or PAE52 processor (or multi-processor). It's going to solve a _lot_ of issues -- both semiconductor and software.

> What level of CPU/hardware(?) does the virt-core support?
> And is the virt-core 32bit?

You can be certain that the "host" OS (possibly firmware-based?) will be able to instantiate multiple PAE36 and/or PAE52 virtual systems with their own -- I'll use legacy terminology here (even if it's not technically correct) -- "Ring 0" access. So, technically, it should be possible to run any PAE36 or PAE52 OS simultaneously on the same hardware as any other PAE36 or PAE52 OS. The larger issues of firmware-OS interoperability, as well as partitioning resources (memory, disk, etc...), are really more of a political/market issue. I.e., AMD and Intel can provide the platform, but people have to work together to use it.

Furthermore, it also means that Intel can continue to best AMD in funding of OEMs and firmware/software vendors, so it still has an advantage in that capacity. I'm sure Apple will be protective of its firmware, and Intel's new, supposedly "open" firmware is rather proprietary. As I've repeatedly commented elsewhere, the 2 "most open" hardware vendors right now are AMD and Sun -- x86-64 and SPARC, respectively. Intel has not only protected non-programmer aspects of IA-64 heavily, but most of their new platform developments for even IA-32e (EM64T) are _very_proprietary_. IBM is partially doing the same with Power in its microelectronics offering, but it is _not_ the same in its branded Power solutions (among others). So this is not going to help vendors who require firmware and data organization that is not open and standardized. We're fine on legacy Win32 platforms, but it's not going to address Mactel, nor solve the problem of existing OSes that don't run under current virtualization solutions because of such proprietary requirements.

-- Bryan J. Smith mailto:b.j.smith at ieee.org
Peter Arremann
2005-Jun-24 05:37 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
On Tuesday 21 June 2005 11:54, Bryan J. Smith <b.j.smith at ieee.org> wrote:
> There are a few mainboards that allow even 32-bit Athlons to safely
> address above 4GB with _no_ paging or issues (with an OS that offers
> such a supporting kernel, like Linux).

How does that work? :-)

Peter.
Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-24 21:42 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: James Olin Oden <james.oden at gmail.com>
> Any idea what the BIOS actually does to enable "linux" memory mode?
> That is what registers are poked in the MCH (I assume it would be the
> MCH)? Do you think this could be done post bios by say a boot loader?

It completely _breaks_ Intel GTL compatibility at the APIC level, IIRC. You can _not_ run Windows on it, or a "normal" Linux kernel.

From: Feizhou <feizhou at graffiti.net>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107759901509280&w=2
> The Linux option in bios is mentioned here.
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107757492125437&w=2
> I think what Bryan here is talking about is called IOMMU by the kernel guys.

No, that was already at the bottom of the link I sent before. I was referring to _another_ thread where a gentleman from AMD (with an amd.com address) was talking about the hack that enables this. It is a little-known feature because only 2-3 mainboards have the BIOS hack as an option.

-- Bryan J. Smith mailto:b.j.smith at ieee.org
Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-24 22:00 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: Peter Arremann <loony at loonybin.org>
> Thanks - but both links again talk about a 32bit/4GB schema and don't
> talk at all about addressing >4GB without the need for paging - that
> was the statement Bryan made when posting.

Correct. The links to the posts he provided were the _same_ as the posts at the bottom of my original link. I'm still trying to find the post I was referring to. As I said, apparently it was _not_ in the 2004Feb17 thread on Intel EM64T.

-- Bryan J. Smith mailto:b.j.smith at ieee.org
Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-24 22:44 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: Peter Arremann <loony at loonybin.org>
> And that's exactly the part I don't get - if you have a 32bit address
> model then you have to use PAE of some sort (compatible to Intel
> PAE36 or not) to get to address more than that memory...

The BIOS hack basically says "don't be stupid" above 4GB. It tells the Athlon MP to treat the PAE36 addresses linearly, instead of paging in GTL-compatible fashion.

With that said, do you mean _beyond_ 36-bit/64GiB? First off, I don't think there ever was an Athlon MP mainboard with more than 64GiB (or more than 32GiB, for that matter). Secondly, programs are written to PAE36, so they can only address 64GiB (if not 32-bit/4GiB). *BUT* is it possible for the Athlon MP to still address more than 64GiB and "window" (not page) PAE36/64GiB programs? That's a good question, and it's the #1 reason I'm trying to find that post in the LKML. In reality, if the hack uses Linux/x86-64, it could very well be that it puts the Athlon MP in "48-bit/256TiB Long Mode" and translates addresses linearly. The Athlon MP wouldn't offer PAE52, no. But it _could_ offer PAE36 windows in a 48-bit (40-bit physical EV6) space. Remember, the Athlon/MP and Athlon64/Opteron are the _exact_same_ core design, including the 40-bit/1TiB EV6 interconnect for addressing outside the chip (even if the latter tunnels over HyperTransport between CPUs and I/O to other memory, instead of using a "crossbar switch").

> again - how do I generate a >32bit address when using a 32bit
> address model without pages? :-)

Remember, in the i386 (and, subsequently, the i486 TLB), the segment register is offset at bit 20. That means the most significant bit (MSB) of the segment register actually has a value of 2^35 (32GiB)!
Segment: 1111 1111 1111 1111 ____ ____ ____ ____ ____
Offset:  ____ 1111 1111 1111 1111 1111 1111 1111 1111

Before PAE36 in the i686, Intel would throw an exception if you set _any_ of the top 4 bits of the segment register, or if the 5th MSB of the segment was set and either the MSB of the offset or the previous bits "overflowed" when "normalized" into the physical address. That would result in a physical address >32-bit/>4GiB. In the i686 with PAE36, Intel now uses those >32-bit/>4GiB addresses as "pages" into <32-bit/<4GiB addresses. In EV6, with the incompatible "BIOS hack" and a supporting kernel, the Athlon MP's TLB doesn't page; it allows _direct_, _linear_ access above 32-bit/4GiB.

Digital EV6 is capable of physically addressing 40-bit. Intel GTL is only capable of 32-bit. When the Athlon is in GTL compatibility mode, it only does 32-bit for full compatibility. But you can tell it to use native, 40-bit EV6 with this BIOS hack. The BIOS hack sets up the crossbar registers, which the OS must then support. I'll confirm how this is being done. I believe it actually leverages the Linux/x86-64 kernel.

> Once I have that address, throwing it out on the bus is
> easy - but how do you generate that address without a PAE?

That's where the hack comes in. Remember, _all_ kernels in _all_ OSes use the TLB -- hence why an i486 ISA compatible is virtually required these days (or they do it in the kernel's software). Normally the Athlon TLB will do the same thing as any Intel GTL part, so any such i686 kernel with PAE36 will normally do the same thing. But the BIOS hack puts the Athlon TLB into its _true_ 40-bit EV6 addressing, and the kernel then knows of this and uses it.

> So you're simply talking about the ability to output an
> address longer than 32bit on the bus?

It's more than just that. It's the combination of presenting PAE36 to _user_ software while using _linear_ addressing physically. Intel i686+GTL already does PAE36, and pages in addresses above 32-bit.
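The segment+offset "normalization" diagrammed above can be sketched numerically. This follows the post's layout (a 16-bit segment placed at bit 20 plus a 32-bit offset, yielding up to a 36-bit result); note that classic real-mode x86 shifts the segment by only 4 bits, so the 20-bit shift here is purely the post's model:

```python
# Sketch of the address "normalization" described above, per the
# post's diagram: a 16-bit segment at bit 20 plus a 32-bit offset.
# (Assumption: this illustrates the post's model, not real-mode x86,
# which shifts the segment by only 4 bits.)

def normalize(segment: int, offset: int) -> int:
    """Combine segment and offset into a single up-to-36-bit address."""
    assert 0 <= segment < (1 << 16) and 0 <= offset < (1 << 32)
    return (segment << 20) + offset

# The MSB of the segment lands at bit 35, i.e. a value of 2^35 = 32GiB:
assert normalize(0x8000, 0) == 1 << 35

# A large segment+offset "overflows" past 32-bit/4GiB -- exactly the
# kind of normalized address PAE36 pages, and the EV6 hack uses linearly:
assert normalize(0xFFFF, 0xFFFFFFFF) > (1 << 32) - 1
```

Everything below 4GiB behaves identically either way; the two designs only diverge on what happens to results whose normalized value crosses the 32-bit boundary.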
Athlon+EV6 normally just emulates that, right down to the TLB. This hack doesn't; it unleashes the Athlon as it was meant to work on a 40-bit interconnect designed for 64-bit Alpha processors. And that includes having the kernel drive the TLB to take those >32-bit "normalized" addresses that would normally be "paged" and just treat them as linear space. As far as above 36-bit, I'm not sure. I'll check on that.

> on AMD64, yes, thats for sure... but you were referring to
> 32bit athlons in the statement I'm trying to understand.

And what I'm saying is that Athlon is Athlon. It's the same core, same 40-bit EV6 interconnect. Everything else is compatibility. On the Athlon64, you can run a PAE52 kernel (on its 40-bit physical platform) and run PAE36 applications, giving linear address space to them all while they think paging is being used. What this hack does is the same thing for the Athlon MP -- marketed as "32-bit," but the _same_ 40-bit capable, _physical_ platform. EV6 is EV6, designed for the 64-bit Alpha, and AMD couldn't "cripple" it. The same board logic and cores are used whether a 32-bit Athlon or a 64-bit Alpha is involved. But doing anything but traditional GTL 32-bit/PAE36 would normally _break_ hardware and OSes -- unless you have a hack to the BIOS which enables this "Linux" memory mode, and a kernel which supports it. Again, I'll find the post, which will lead me to the tech info.

-- Bryan J. Smith mailto:b.j.smith at ieee.org
Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-25 21:46 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: Peter Arremann <loony at loonybin.org>
> Then the first AnandTech benchmark article
> (http://www.anandtech.com/IT/showdoc.aspx?i=2447) is exactly what you
> want to look at. Huge amount of memory (when compared to the size of
> the database running on the system) on a 64bit linux kernel...
> We're doing the same for one of our apps called IPM. Its a PHP app
> running against a quad opteron with 16GB ram. Heavy on network IO
> (during business hours its rare that we don't saturate the main
> 100mbit link) but little disk activity. DB size is about 2.5GB and we
> end up with a couple of gig for disk buffers. CentOS4 of course...
> anything specific you're looking for?

I think I know where you and I are differing. When you talk about "heavy [network] IO," you refer to SQL-based applications over a primary 100mbit link. In reality, the MCH bottleneck isn't much of an issue here. When I talk about "heavy [network] IO," I'm typically referring to less intelligent applications (e.g., NFS or other "raw block" transfer) over one or even multiple GbE, possibly FC-AL, links (possibly direct IP, link-aggregated IP, maybe 1 out-of-band channel, or possibly to a Storage Area Network, SAN) -- although I have also built financial transaction and engineering systems that required GbE.

I guess this is where my terminology really differs. A lot of people are using Linux for Internet services. I've typically been using Linux for high-performance LAN systems -- both "raw block" as well as intelligent applications. My Internet connection is not my bottleneck. BTW, despite popular thought, this can be done quite inexpensively when needed. It really all depends. But in your applications, you're definitely _not_ going to see Opteron doing much for you over Xeon -- let alone when running Linux or Windows (Solaris might be another story, though).

-- Bryan

P.S. Just a follow-up: _never_ assume you're the only EE in a thread.
[ Let alone don't assume I haven't designed memory controllers as part of my job, beyond just my degree option in computer architecture. That's why I had to "give up" in the other thread: every time I try to explain something, you go down a very simplified path and I have to stop and explain it (e.g., the fact that IA-32/x86 _can_ address beyond 4GiB). ]

-- Bryan J. Smith mailto:b.j.smith at ieee.org
Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-25 22:06 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: Peter Arremann <loony at loonybin.org>
> Then enlighten me - if I have 40 address bits - transmit only the
> higher 37 since I don't need the lower end. Timing schemas show only
> one Input hold time per address transfer for all available pins - how
> can that be a 32bit bus?
> (ftp://download.intel.com/design/Xeon/datashts/30675401.pdf)

Once again, stop assuming that a trace and its specifications for board layout mean that memory is directly addressable over those pins. We could go on about that all night and for days, and I can give you countless examples of embedded, PC and other memory controllers, memory technology, etc... where paging and other "swap" is required. Heck, as an EE, I would ass-u-me you would have been exposed to memory controller design. It is a complex mess whereby each new change in addressing exponentially increases the transistor count.

> And, if timing diagrams, pinouts and so on lie about the size of the
> bus, ... cut derogatory nonsense ...

*STOP* This is why I can't even begin. I _never_ said that Intel didn't offer 64GiB (36-bit) memory addressing. The traces _must_ exist so Intel can offer it on [A]GTL+ platforms, but that does not mean there is not some "paging" going on in the memory _logic_. I just said it cannot address it linearly -- directly by 36-bit -- in the GTL design.

It all goes back to Intel's belief, back in the '90s when it decided not to build a new architecture for IA-32, that IA-64/Itanium would have taken over the i686/GTL world by now -- a belief Intel is now paying the price for, quickly retrofitting everything it can ASAP. AMD instead decided, back in 1996+, to switch to EV6 as its foundation for _all_ current processors, and that includes tunneling EV6 over HyperTransport in the Athlon64/Opteron. This has to do with the fact that the Athlon core is a true 40-bit addressing processor, and not 32-bit with PAE36 to page in the "overhang" from a segment+offset that is normalized above 4GB.
Athlon just _emulates_ PAE36 for compatibility. If you hit /proc/cpuinfo on _any_ Athlon, it will show that the PAE36 flag is supported. PAE36 support has _always_ been in _every_ Athlon because the core design is 40-bit EV6. AMD took the time, effort and transistors to emulate PAE36 at the control level, and put the logic in its TLB and memory controller. If they hadn't, then AMD wouldn't be able to run any PAE36 OSes or applications.

And even if you don't have more than 4GB of RAM, there are still memory organization issues. E.g., Red Hat currently ships i386/i686 kernels with the 4G+4G model, which hurts performance compared to the 1G+3G model. Why does it hurt performance? Because it relies either on the PAE36 paging logic, or on a software emulation of it (if the processor does not support PAE36). With the BIOS hack and kernel, the Athlon MP can linearly address directly above 4GB.

Intel is _finally_ just coming out with its first true 40-bit physical interconnect that breaks the limitations of GTL. People should be wary and _not_ go with the majority of Intel's existing platforms because of this, even those with EM64T -- only these newer platforms.

> And if its not, then your whole speech about the pae36
> differences between a gtl+ or ev6 connected device is
> wrong, which then in turn makes the only real difference
> the iommu the newer athlon cores provide (so dma can go
> above the first 4GB rather than having to be bounced)...
> This is also supported by the intel, amd and redhat docs
> (see links posted above and in previous mails), the post
> Feizhou made in this thread covering the LKML references about
> using the agpgart and even microsoft
> (http://www.microsoft.com/whdc/system/platform/server/PAE/pae_os.mspx)

I know -- I also posted it, and there are even _more_ comments in the LKML. There are comments on how AMD has extra GARTs in the Athlon (yes, even the so-called "32-bit" Athlon) to handle _all_ I/O, not just AGP (long story).
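As a quick aside, the /proc/cpuinfo claim above is easy to check by parsing the "flags" line. A small sketch, using sample text so it is self-contained (the model string and flag list are illustrative, not a captured dump):

```python
# Sketch: check for a CPU feature flag the way one would against
# /proc/cpuinfo. The sample text below is made up for illustration;
# on a real Linux box you would read the file itself.

SAMPLE_CPUINFO = """\
model name : AMD Athlon(tm) MP 2800+
flags      : fpu vme de pse tsc msr pae mce cx8 mmx sse syscall
"""

def has_flag(cpuinfo: str, flag: str) -> bool:
    """Return True if `flag` appears in the cpuinfo flags line."""
    for line in cpuinfo.splitlines():
        if line.startswith("flags"):
            return flag in line.split(":", 1)[1].split()
    return False

assert has_flag(SAMPLE_CPUINFO, "pae")      # PAE is advertised
assert not has_flag(SAMPLE_CPUINFO, "lm")   # no 64-bit long mode here
```

Whether the flag is backed by "real" paging hardware or emulated at the control level, as the post argues for the Athlon, is invisible at this interface -- software only sees the flag.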
But I'm not talking just about that. I'm talking about the serious limitations of GTL itself -- even on Socket-604 and Xeon EM64T. It's only the latest platforms that address this.

> *shrugs* Intel, AMD and any other spec sheet you can find down
> to VIA chipset docs agrees with that... But I guess I'm still
> wrong though?

Dude, you are using _board-level_ spec sheets. I'm talking about the _internal_ design of the CPU/interconnect and how it handles the addressing between the systems software and the interconnect. You get that from _neither_ the "board level" spec sheets _nor_ the "programmer" guides. You have to find more eccentric docs; many times, they are not on-line. Intel is not going to boast about how even its AGTL+ chipsets with first-gen EM64T can't directly address above 4GiB because of legacy design limitations in the platform.

-- Bryan J. Smith mailto:b.j.smith at ieee.org
Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-25 22:45 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: Peter Arremann <loony at loonybin.org>
> Actually apps like the one I was referring to showed about 50%
> single thread performance gain when going from a 2.4GHZ Xeon
> to a 2.2GHz Opteron.

That's not a good comparison, because the ALU control of the Athlon is about 50% faster, MHz for MHz, than the P4-Xeon. So that could easily be a computational benefit and not an interconnect one.

> Never assume you've done more than others either :-)

I didn't.

> I've done the more difficult job with finding all the applicable documents

Dude, you did _not_ send me any documentation I have not already seen. I have been using developer.intel.com for years in my semiconductor design career, as well as more of an IT-level system designer. The problem here is that you can't seem to understand that a trace does _not_ indicate how something actually works in the memory logic. E.g., the existence of a trace does not tell you whether a normalized address that results in a PAE36 (>4GiB) address is going to either:

A) Directly drive that trace and fetch for the process to use
B) Be intercepted by the paging logic, referenced in a page table, after which _that_ logic actually fetches the memory, which is then mapped into <4GiB for the process to use

It's not just the simple trace, and it's not just the simple, programmer-level logic. If Intel PAE36 processors didn't have address lines above bit 31, they couldn't address above 4GiB at all! But just because they have those traces doesn't mean they can directly use them. First you looked at it from the "programmer" level, then you looked at it from the "board technician" level. Now I'm telling you: _get_in_the_chip_!

> while you just put out hearsay and "doesn't work like that" statements
> without really backing it up. And yes, IA-32 _can_ address more than
> 4GB - called PAE.

Exactly! What I'm telling you is that _all_ "32-bit" Athlon processors can bypass PAE36, and significantly _increase_ performance at the page table level.
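The A/B distinction above can be made concrete with a toy model. Entirely illustrative (the page-table contents and page size are made up): path A drives the >4GiB address as-is, while path B remaps it through a table into the <4GiB window before the fetch.

```python
# Toy model of the two paths described above for a >4GiB physical
# address: (A) drive it onto the interconnect directly (linear), or
# (B) intercept it, look it up in a page table, and hand back a
# remapped <4GiB address. Table contents here are hypothetical.

FOUR_GIB = 1 << 32

def fetch_linear(phys: int) -> int:
    # Path A: the address goes out on the (40-bit) interconnect as-is.
    return phys

def fetch_paged(phys: int, page_table: dict, page_bits: int = 12) -> int:
    # Path B: the >4GiB page is remapped into the <4GiB window.
    page = phys >> page_bits
    offset = phys & ((1 << page_bits) - 1)
    return (page_table[page] << page_bits) | offset

high = 5 * FOUR_GIB + 0x10                  # a physical address above 4GiB
table = {high >> 12: 0x1234}                # hypothetical remapping entry

assert fetch_linear(high) == high           # path A: address unchanged
assert fetch_paged(high, table) < FOUR_GIB  # path B: lands below 4GiB
```

The pins driven in both cases can be identical; only the table lookup on path B (and the bookkeeping to maintain that table) distinguishes them, which is the performance cost being argued about here.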
But it requires that the platform be configured in a way that is _incompatible_ with all OSes _except_ a hack for Linux. This is rare, but it has been done on a few mainboards.

-- Bryan J. Smith mailto:b.j.smith at ieee.org
Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-28 17:00 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: alex at milivojevic.org

> Well, this is how I interpreted Bryan's emails. He'll probably correct me if
> I'm wrong (yeah, I have EE, but haven't done much EE work since I
> graduated, it just happened to be mostly "software" things for me, so I'm
> rather rusty here :-(

If you haven't noticed, I am _not_ big on "credentials." As far as I'm concerned, as long as someone has the knowledge, I couldn't care less about the paper**. Even most state BoPEs will allow you to replace an ABET-accredited BSE with 8-12 years of experience (plus another 4-5 years of experience post-degree) to qualify to become a PE. There is a semiconductor engineering mindset, and it differs from that of a programmer or even a technologist.

[ **NOTE: Being a consultant, I finally had to "give in" to the paper. I also decided to major in engineering "just in case I needed it" (and, later on, got to use it for 2 years in the semiconductor industry). ]

> I don't think Bryan is talking about "external" width of the address bus,
> how many lines you see printed on the motherboard.

That's because it is _irrelevant_ in many cases. You can mux lines, which is exactly what the original EV6 crossbar Athlon does.

> He's talking how the things are organized internally, and about the way
> the bus logic works. That would be what theoretical implementations
> might use, not what some specific implementation is limiting itself to.

Exactly. Until the new 40-bit Xeon MPs came out, they were pretty much "slap-on" designs with extended ALUs and microcoding to be x86-64 compatible, to a point.

> Sorry for mixing software and hardware from now on,

It's hard not to. First you have to consider the programmer aspects. Then you have to realize those will influence hardware compatibility.
Then you have to remember that with every new addressing model, you _exponentially_ increase the external logic to drive it.

> just trying to make Peter see where his misconception is (the best
> way I can, which might not be good way at all).

It's fairly difficult to do without breaking out a basic memory controller circuit at, at minimum, the transistor level, or possibly the combinational NAND gates it somewhat represents.

> The programming model for 32-bit userland applications is obviously
> limited to 32 bits -- the sizeof(void *) will tell you that. So single
> process can see only 32-bit logical linear address space.

To a point. PAE36 is what allows you to break that, because even in the i386, the 16-bit segment register, offset 4 bits from the 32-bit offset register, results in a 36-bit "normalized" address (it can actually be 37-bit, but that's another story). How the processor handles that in a way that is compatible with the OS is the problem. PAE36 uses "paging" from above 32-bit (up to 36-bit/64GiB) down to below 32-bit (up to 4GiB). PAE36 processors can support this paging, and they have at least 36-bit traces on the platform. GTL+ logic was designed with this "slapped on," and never bothered with direct, linear access until just recently (with the new 40-bit redesign for Xeon MP).

Athlon, on the other hand, has always been 40-bit EV6. AMD decided to support the PAE36 paging, which added logic. But they _always_ had "40-bit linear addressing for free" inherently in the platform. That's where these few BIOS hacks come in, combined with an OS/model that can take advantage of it. If you enable this mode for Windows, it will _not_ work!

> On the other hand, processor (the hardware) doesn't need to have
> such limits, if its internal organization is wider. So, AMD (processor)
> is able to see 40-bit linear physical address space as one single big
> chunk of memory.
> 32-bit applications will have their
> 32-bit *logical* address space mapped into this 40-bit linear *physical*
> address space of processor. I don't know if programming model of
> "32-bit" AMD processors allows you to have wider-than-32-bit pointers
> (even if it did, you would have to have compiler that can generate such
> code, gcc can't do it for sure).

Not true. The model is _still_ PAE36, up to 36-bit/64GiB. But instead of paging, the combination of the Athlon's "inherent 40-bit" addressing with an OS that can use it avoids the paging entirely. Normally PAE36 does paging, which is what PAE36 OSes normally do.

> Obviously, the kernel needs to know how to manage things in this
> wider physical address space, the reason why you need patched Linux
> kernel to take advantage of it.

The actual amount of space doesn't change -- at least not under the PAE36 model. It's just how the OS commands the memory logic to use it, and that also requires the firmware (BIOS) to pre-configure it at POST. Under a nominal PAE36 OS and board, the memory logic and OS always do paging to access above 32-bit/4GiB. But I don't think this hack is able to linearly address the entire 40 bits, because of the limitations of what PAE36 can address.

> Intel (the processor) on the other hand, is able to physically address
> only 32-bit address space. Anything wider than that, it needs to page.
> Dealing with paging will obviously be additional work for OS, hence
> lower performance.

Paging is a significant hit. Anyone running the 4G+4G "HIGHMEM" model of their Linux kernel who recompiles for the 1G+3G model will notice a noticeable performance gain (as long as the machine has only ~960MiB of memory).

> So while both processors will have more than 32 address lines on the
> packaging (and printed on motherboard) (minus a couple of the lowest ones
> that are not needed), as you can see in various specifications that you
> quoted, that doesn't mean the processor's core actually sees all that
> address space.

> How else to try explaining this... Hm...
> Remember Intel 8086? It could
> address only 16-bit address space, but it had more than 16 address lines
> on the packaging. It used segmenting (hopefully the right term) to see
> wider-than-16-bit address space. Now try to make analogy with what was
> discussed so far ;-)

It's somewhat correct, although it gets interesting. It's more like EMS than XMS, because XMS shunted the processor in and out of 286/386 protected mode, and that required a 286/386 to access it. You could run a true 24-bit/32-bit mode, respectively, to avoid that. EMS worked on the old 8088/8086 (as well as the 80286/386), because pages above 1MB could be mapped in. The 80386, and some 80286 chipsets, could emulate this as well, without a special card with special addressing.

This is exactly what the GTL+ bus does. It offers special addressing lines for memory logic that pages, because that's all the OS does. Even the early EM64T processors had to deal with this "limitation" of the GTL+ platform. You want to ensure you get a new EM64T platform that doesn't take that approach.

-- Bryan J. Smith mailto:b.j.smith at ieee.org
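The 8086 segmentation just mentioned is simple enough to model directly. A minimal sketch of the address arithmetic only (not of the bus behavior): a 16-bit segment shifted left 4 bits plus a 16-bit offset yields a 20-bit space, with a small overshoot above 1MiB (the region later gated by the A20 line).

```python
def real_mode_addr(seg: int, off: int) -> int:
    """8086-style address formation: (segment << 4) + offset.

    16-bit segment and 16-bit offset give a 20-bit (1 MiB) space,
    plus a little more: 0xFFFF:0xFFFF reaches 0x10FFEF, just past
    the 1 MiB boundary.
    """
    assert 0 <= seg <= 0xFFFF and 0 <= off <= 0xFFFF
    return (seg << 4) + off
```

For example, the classic CGA/EGA text framebuffer segment 0xB800 with offset 0 lands at physical 0xB8000, and the maximal segment:offset pair overshoots 1MiB by 65,519 bytes.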
Bryan J. Smith <b.j.smith@ieee.org>
2005-Jun-28 17:28 UTC
[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing
From: Steven Vishoot <sir_funzone at yahoo.com>

> As Carol Brady said, "what was all that about?" I don't
> know if anyone else feels this way, but I think this
> subject has been beaten to death and beyond. Not sure
> if I remember right, but wasn't the original post about
> upgrading from 4.0 to 4.1? How did this topic become
> an engineering course? Sorry for being a crab... but
> come on....

If you deploy servers with more than 1GB of RAM, and definitely more than 4GB of RAM, you should be aware of what limitations there are with Intel solutions. That's what this was about.

If someone else wanted to make it into a discussion of board traces and other irrelevant nonsense, then I'm sorry for getting into that with him.

-- Bryan J. Smith mailto:b.j.smith at ieee.org