Is there any reason (other than having things inherited this way from Linux) that we cannot call identify_cpu() for the boot CPU at the end of early_cpu_init() rather than explicitly from __start_xen()? And if not, it would seem reasonable to me to at once move the two CR4 twiddling pieces out of __start_xen(), too.

(I'm not asking because I want to beautify the code, but because I want the identification to happen earlier: I want to fully set up the VESA console as early as possible, and for that I'd like to be able to set MTRRs, which in turn depends on identify_cpu() having executed.)

Thanks, Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

On 13/7/07 17:16, "Jan Beulich" <jbeulich@novell.com> wrote:

> Is there any reason (other than having things inherited this way from Linux)
> that we cannot call identify_cpu() for the boot CPU at the end of
> early_cpu_init() rather than explicitly from __start_xen()? And if not, it
> would seem reasonable to me to at once move the two CR4 twiddling pieces out
> of __start_xen, too.
>
> (I'm not asking because I want to beautify the code, but because I want the
> identify to happen earlier, namely I want to fully set up the VESA console as
> early as possible, but there I'd like to be able to set MTRRs, which in turn
> depends on identify_cpu() having executed.

Isn't it a fairly safe bet that the BIOS will have done this for us and, if not, that the penalty is a performance loss (probably using WB or UC instead of WC) rather than a correctness issue? And hence, if we bother to update the MTRRs at all, then it can at least be left until later in the boot?

 -- Keir

>>> Keir Fraser <keir@xensource.com> 13.07.07 18:26 >>>
>On 13/7/07 17:16, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> Is there any reason (other than having things inherited this way from Linux)
>> that we cannot call identify_cpu() for the boot CPU at the end of
>> early_cpu_init() rather than explicitly from __start_xen()? And if not, it
>> would seem reasonable to me to at once move the two CR4 twiddling pieces out
>> of __start_xen, too.
>>
>> (I'm not asking because I want to beautify the code, but because I want the
>> identify to happen earlier, namely I want to fully set up the VESA console as
>> early as possible, but there I'd like to be able to set MTRRs, which in turn
>> depends on identify_cpu() having executed.
>
>Isn't it a fairly safe bet that the BIOS will have done this for us and, if
>not, that the penalty is a performance loss (probably using WB or UC instead
>of WC) rather than a correctness issue? And hence, if we bother to update
>the MTRRs at all, then it can at least be left until later in the boot?

I've never seen a BIOS set the frame buffer to WC (or anything other than uncacheable), presumably not least because the address space may get laid out anew by the OS. And the performance loss in our case is quite significant (though I assume WC would at best help a little; I'm considering other approaches, too): scrolling a 1280x1024x16 screen takes, on the test system I'm primarily trying this out on, on the order of a second. This is because we want to avoid disturbing what dom0 may have written to the screen, and hence we can't simply redraw. And I'm not certain I want to special-case boot time here (although I may have no other option: delays of that order can easily lead to other failures [like CPUs not properly coming up]).

Of course, I could turn the current two-stage vesafb initialization into a three-stage one, doing just the MTRR request in the last stage. But I wanted to avoid making the code more spaghetti-like than necessary just because of this little feature...

Jan

On 16/7/07 07:14, "Jan Beulich" <jbeulich@novell.com> wrote:

> I've never seen a BIOS set the frame buffer to WC (or anything other than
> uncachable), presumably not the least because the address space may get
> laid out anew by the OS. And the performance loss in our case is quite
> significant (though I assume WC would at best help a little, I'm considering
> other approaches, too): Scrolling a 1280x1024x16 screen takes, on the test
> system I'm primarily trying this out on, on the order of a second. This is
> because we will want to not disturb what dom0 may have written to the
> screen, and hence we can't simply redraw. And I'm not certain I want to
> special case boot time here (although I may have no other option - delay
> in that order can easily lead to other failures [like CPUs not properly
> coming up]).
>
> Of course, I could make the two stage vesafb initialization as it is right
> now a three stage one, doing just the MTRR request in the last. But I wanted
> to avoid making the code more spaghetti like than necessary just because of
> this little feature...

I'd rather have a later call (but not that late -- before secondary CPUs come up is fine) than re-order start-of-day cpu detection code. At least in a first patchset! The MTRR update will still happen before scrolling needs to occur for the first time, and I don't see that the code will become spaghetti because of it.

By the way, what makes you think that redrawing the whole screen (presumably re-pasting text characters one-by-one) would be faster than scrolling? Sounds slower to me, or is this because read-plus-write of UC memory sucks?

How much faster is scrolling of WC framebuffer on your test system?

 -- Keir

>I'd rather have a later call (but not that late -- before secondary CPUs
>come up is fine) than re-order start-of-day cpu detection code. At least in
>a first patchset! The MTRR update will still happen before scrolling needs
>to occur for the first time, and I don't see that the code will become
>spaghetti because of it.

Okay, will do it that way then.

>By the way, what makes you think that redrawing the whole screen (presumably
>re-pasting text characters one-by-one) would be faster than scrolling?
>Sounds slower to me, or is this because read-plus-write of UC memory sucks?

Yes, exactly that (really, it's presumably mostly the data dependency of the writes on the reads, which in a write-only scenario is so much smaller).

>How much faster is scrolling of WC framebuffer on your test system?

Haven't tested yet, as I haven't made the adjustments to make WC work so far (finding out why it doesn't work was the last thing I did on Friday)... But as I said, I don't expect much gain from *just* the attribute change, as reads will continue to be done UC. As I said, I'm considering alternatives...

Jan

On 16/7/07 07:41, "Jan Beulich" <jbeulich@novell.com> wrote:

>> By the way, what makes you think that redrawing the whole screen (presumably
>> re-pasting text characters one-by-one) would be faster than scrolling?
>> Sounds slower to me, or is this because read-plus-write of UC memory sucks?
>
> Yes, exactly that (really its presumably mostly the data dependency of the
> writes on the reads, which in a write-only scenario is so much smaller).
>
>> How much faster is scrolling of WC framebuffer on your test system?
>
> Haven't tested yet, as I haven't made the adjustments to make WC work so far
> (finding why it doesn't work was the last thing I did on Friday)... But as I
> said, I don't expect much gain from *just* the attribute change, as reads
> will continue to be done UC. As I said, I'm considering alternatives...

Yes, I'd be surprised if WC is that much of a win, since Linux certainly doesn't appear to mess with MTRRs by default.

The best option would be to re-draw until dom0 starts to boot. After that switch to scroll, but from that point on Xen doesn't write much to the console anyway. Supporting both ways doesn't sound that hard, and there's already a console_endboot() hook to trigger the switch in behaviour.

But if Linux re-draws the whole screen instead of scrolling, won't Xen's output get overwritten anyway? That would limit the usefulness of 'vga=keep', except for crash dumps after which Linux dom0 will not print any more (I suppose that is the main and most useful case, though). And if Linux *does* scroll, how come *its* performance doesn't suck?

 -- Keir

On 16/7/07 07:57, "Keir Fraser" <Keir.Fraser@cl.cam.ac.uk> wrote:

> But if Linux re-draws the whole screen instead of scrolling, won't Xen's
> output get overwritten anyway? That would limit 'vga=keep's usefulness,
> except for crash dumps after which Linux dom0 will not print any more (I
> suppose that is the main most useful case though).

If you only cared about crash dumps you could just always use re-draw behaviour, disable Xen output after console_endboot(), but always *re-enable* output, at coordinate (0,0) top-left of screen, on any call to start_sync_console(). That usually indicates something important is being printed. :-) Starting at top-left means you do not obliterate the most recent dom0 output.

Would that be acceptable?

 -- Keir
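
For illustration only, the policy proposed in the message above might look roughly like the following. This is a minimal sketch, not Xen code: the `sketch_` names and the `struct sketch_console` layout are hypothetical, and glyph rendering is replaced by a counter so the behaviour can be checked in isolation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical console state; not Xen's actual data structure. */
struct sketch_console {
    bool enabled;       /* may the hypervisor draw on the screen? */
    unsigned int x, y;  /* cursor position in character cells */
    size_t drawn;       /* characters drawn (stand-in for glyph rendering) */
};

/* Boot has finished and dom0 owns the screen: stop drawing. */
static void sketch_console_endboot(struct sketch_console *c)
{
    c->enabled = false;
}

/* Something important is about to be printed: draw again from top-left. */
static void sketch_start_sync_console(struct sketch_console *c)
{
    c->enabled = true;
    c->x = 0;
    c->y = 0;
}

/* Emit one character; it is silently dropped while output is disabled. */
static void sketch_putc(struct sketch_console *c, char ch)
{
    if (!c->enabled)
        return;
    if (ch == '\n') {
        c->x = 0;
        c->y++;
        return;
    }
    c->drawn++;  /* a real console would render the glyph at (x, y) here */
    c->x++;
}
```

Restarting at (0,0) means the important output overwrites the oldest part of the screen first, which is why the most recent dom0 lines near the bottom survive.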

>The best option would be to re-draw until dom0 starts to boot. After that
>switch to scroll, but from that point on Xen doesn't write much to the
>console anyway. Supporting both ways doesn't sound that hard, and there's

... depending on the log level. I found it quite helpful to force loglvl=all, and certainly there's stuff being printed with that.

>already a console_endboot() hook to trigger the switch in behaviour.

Sure.

>But if Linux re-draws the whole screen instead of scrolling, won't Xen's
>output get overwritten anyway? That would limit 'vga=keep's usefulness,

Yes, it does. But there's no way to avoid that other than really establishing co-operation between Xen and dom0, which I don't think would be upstreamable.

>except for crash dumps after which Linux dom0 will not print any more (I
>suppose that is the main most useful case though). And if Linux *does*

Yes, that's certainly the main intention (after all, you have to force keeping console output in the first place in order to get into that situation).

>scroll, how come *its* performance doesn't suck?

Again, redraw-based scrolling doesn't suck severely; only moving video memory from one place to another does.

Jan

>>> Keir Fraser <Keir.Fraser@cl.cam.ac.uk> 16.07.07 09:01 >>>
>On 16/7/07 07:57, "Keir Fraser" <Keir.Fraser@cl.cam.ac.uk> wrote:
>
>> But if Linux re-draws the whole screen instead of scrolling, won't Xen's
>> output get overwritten anyway? That would limit 'vga=keep's usefulness,
>> except for crash dumps after which Linux dom0 will not print any more (I
>> suppose that is the main most useful case though).
>
>If you only cared about crash dumps you could just always use re-draw
>behaviour, disable Xen output after console_endboot(), but always
>*re-enable* output, at coordinate (0,0) top-left of screen, on any call to
>start_sync_console(). That usually indicates something important is being
>printed. :-) Starting at top-left means you do not obliterate the most
>recent dom0 output.
>
>Would that be acceptable?

It would be an option, but not my preferred solution.

Jan

>How much faster is scrolling of WC framebuffer on your test system?

That's a win of about 30%, and using movsq in memcpy() is another win of about 50%. Will probably create a vidmemmove()...

Jan
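
As a rough idea of what a vidmemmove()-style helper could look like, here is a minimal sketch that scrolls a frame buffer up by one text row using 64-bit accesses. It is an assumption-laden illustration, not the code from the eventual patch: the `sketch_` names are hypothetical, a plain C loop stands in for a `rep movsq` sequence, and the frame buffer is assumed to be 8-byte aligned with rows that are a whole number of 64-bit words.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical vidmemmove()-style helper: copy nwords 64-bit words from src
 * to dst in ascending order. Ascending order is only safe for overlapping
 * ranges when dst lies below src, which is the scroll-up case.
 */
static void sketch_vidmove_up(volatile uint64_t *dst,
                              const volatile uint64_t *src, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];
}

/* Scroll a frame buffer of `rows` text rows up by one row, blank the last. */
static void sketch_scroll_up(volatile uint64_t *fb, size_t words_per_row,
                             size_t rows)
{
    if (rows == 0)
        return;

    size_t keep = words_per_row * (rows - 1);

    sketch_vidmove_up(fb, fb + words_per_row, keep);
    for (size_t i = 0; i < words_per_row; i++)
        fb[keep + i] = 0;
}
```

Note that every word moved this way still costs one read from video memory, so wider accesses reduce the number of reads but do not remove them.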

> significant (though I assume WC would at best help a little, I'm considering
> other approaches, too): Scrolling a 1280x1024x16 screen takes, on the test
> system I'm primarily trying this out on, on the order of a second. This is

A lot of work on this has been done in X by the X and Gnome developers, including producing some routines which use the largest possible aligned loads to reduce the big cost (which is PCI reads). Having write-combining video memory can help (but some cards also put stuff like command queues there, which X cannot WC).

It is usually much cheaper to simply rewrite a text display onto a framebuffer than to scroll it, as you can avoid any PCI read traffic, so you get streaming WC writes (or async SSE writes even without that). Even more so when you skip unchanged quadwords.

Alan
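
The "rewrite and skip unchanged quadwords" idea can be sketched as follows. This is a minimal illustration under stated assumptions rather than the X routines mentioned above: the `sketch_flush_changed` name is hypothetical, `shadow` is assumed to be a copy in ordinary RAM of what is currently on screen, and `next` is the newly rendered image of the same size.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Write only the 64-bit words that differ from the RAM shadow copy.
 * Video memory is never read: comparisons use the shadow, so the only
 * bus traffic to the card is the writes that are actually needed.
 * Returns the number of words written.
 */
static size_t sketch_flush_changed(volatile uint64_t *fb, uint64_t *shadow,
                                   const uint64_t *next, size_t nwords)
{
    size_t written = 0;

    for (size_t i = 0; i < nwords; i++) {
        if (shadow[i] == next[i])
            continue;        /* unchanged: no access to video memory */
        fb[i] = next[i];     /* write-only access to video memory */
        shadow[i] = next[i]; /* keep the shadow in sync with the screen */
        written++;
    }
    return written;
}
```

The cost is a second copy of the screen contents in normal memory, traded for the removal of all reads from the card.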

On 16/7/07 09:18, "Jan Beulich" <jbeulich@novell.com> wrote:

>> How much faster is scrolling of WC framebuffer on your test system?
>
> That's a win of about 30%, and using movsq in memcpy() is another win of
> about 50%. Will probably create a vidmemmove()...

That still totally sucks then. Paging a full screen of text of, say 60 lines, done line-by-line will still take, say 60*300ms == 20 seconds(ish). Implement this as a command-line option only if you must have it.

 -- Keir

On 16/7/07 10:23, "Alan Cox" <alan@lxorguk.ukuu.org.uk> wrote:

>> significant (though I assume WC would at best help a little, I'm considering
>> other approaches, too): Scrolling a 1280x1024x16 screen takes, on the test
>> system I'm primarily trying this out on, on the order of a second. This is
>
> A lot of work on this has been done in X by the X and Gnome developers
> including producing some routines which use the largest possible aligned
> loads to reduce the big cost (which is PCI reads). Having a write
> combining video memory can help (but some cards also put stuff like
> command queues there which X cannot WC).

Does this mean that defaulting to setting up WC on a power-of-two-sized region starting at the framebuffer address is not really safe on a fair number of systems?

 -- Keir
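
To make the concern concrete: a power-of-two-sized region generally covers more than the visible frame buffer. The following minimal sketch computes how much; the `sketch_` names are hypothetical, and it assumes the base address is already aligned to the rounded-up size.

```c
#include <stdint.h>

/* Smallest power of two >= size (size must be non-zero and <= 2^63). */
static uint64_t sketch_pow2_roundup(uint64_t size)
{
    uint64_t p = 1;

    while (p < size)
        p <<= 1;
    return p;
}

/* Bytes beyond the frame buffer that a power-of-two range also covers. */
static uint64_t sketch_mtrr_overshoot(uint64_t lfb_size)
{
    return sketch_pow2_roundup(lfb_size) - lfb_size;
}
```

For the 1280x1024x16 mode discussed in this thread, the visible frame buffer is 1280 * 1024 * 2 = 2621440 bytes, which rounds up to 4 MiB, so 1.5 MiB of video memory beyond the visible screen would also become WC. Whether anything sensitive such as a command queue lives there depends on the card.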

>>> Keir Fraser <keir@xensource.com> 17.07.07 09:50 >>>
>On 16/7/07 09:18, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>>> How much faster is scrolling of WC framebuffer on your test system?
>>
>> That's a win of about 30%, and using movsq in memcpy() is another win of
>> about 50%. Will probably create a vidmemmove()...
>
>That still totally sucks then. Paging a full screen of text of, say 60
>lines, done line-by-line will still take, say 60*300ms == 20 seconds(ish).

No, you got me wrong: by 'scrolling by one line' I mean scrolling the entire screen up by a line. It's about half as fast as Linux during boot now, and visibly slower post-boot.

Jan

>>> Keir Fraser <keir@xensource.com> 17.07.07 09:55 >>>
>On 16/7/07 10:23, "Alan Cox" <alan@lxorguk.ukuu.org.uk> wrote:
>
>>> significant (though I assume WC would at best help a little, I'm considering
>>> other approaches, too): Scrolling a 1280x1024x16 screen takes, on the test
>>> system I'm primarily trying this out on, on the order of a second. This is
>>
>> A lot of work on this has been done in X by the X and Gnome developers
>> including producing some routines which use the largest possible aligned
>> loads to reduce the big cost (which is PCI reads). Having a write
>> combining video memory can help (but some cards also put stuff like
>> command queues there which X cannot WC).
>
>Does this mean that defaulting to setting up WC on a power-of-two-sized
>region starting at the framebuffer address is not really safe on a fair
>number of systems?

So I suppose, also based on the fact that Linux is defensive here too in defaulting to not touching the MTRRs at all. I implemented this in a similar fashion for Xen now.

Jan

On 17/7/07 10:58, "Jan Beulich" <jbeulich@novell.com> wrote:

>>> That's a win of about 30%, and using movsq in memcpy() is another win of
>>> about 50%. Will probably create a vidmemmove()...
>>
>> That still totally sucks then. Paging a full screen of text of, say 60
>> lines, done line-by-line will still take, say 60*300ms == 20 seconds(ish).
>
> No, you got me wrong - with 'scrolling by one line' I mean the scrolling the
> entire screen up by a line. It's about half as fast as Linux during boot now,
> visibly slower post-boot.

Re-draw is going to be the default, even if I have to implement it on top of your patch. :-) The fact that dom0 will overwrite Xen's output anyway as it re-draws seems to make scrolling-as-default the inferior option because it's significantly slower and can provide benefit really only for crash dumps (where a re-draw based scheme can work just as well). So I don't understand your attachment to scrolling.

 -- Keir

On 17/7/07 11:00, "Jan Beulich" <jbeulich@novell.com> wrote:

>> Does this mean that defaulting to setting up WC on a power-of-two-sized
>> region starting at the framebuffer address is not really safe on a fair
>> number of systems?
>
> So I suppose, also based on the fact that Linux is defensive here too in
> defaulting to not touching the MTRRs at all. I implemented this in a similar
> fashion for Xen now.

How about remapping the framebuffer specifying WC in the PAT bits, if the CPU is detected to support PAT? This might work more safely because we can map at 4kB granularity rather than merely power-of-two. It depends on how close the command queues actually are to the lfb in the VRAM map.

If we just mapped the 4kB-rounded region specified by lfb_base to lfb_base+lfb_size (as determined via the VBE Get Mode Info call) as WC, would that be safe? If so we could use that unconditionally and avoid any MTRR-poking code. PAT has been around for ages now.

 -- Keir
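
The 4kB rounding described above amounts to rounding the start down and the end up to page boundaries. A minimal sketch, assuming `lfb_base` and `lfb_size` are the values taken from the VBE mode information; the `sketch_` names and the range structure are hypothetical:

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL

/* Half-open physical address range [start, end). */
struct sketch_range {
    uint64_t start;
    uint64_t end;
};

/* Smallest page-aligned range covering lfb_base .. lfb_base + lfb_size. */
static struct sketch_range sketch_wc_range(uint64_t lfb_base,
                                           uint64_t lfb_size)
{
    struct sketch_range r;

    r.start = lfb_base & ~(SKETCH_PAGE_SIZE - 1);
    r.end = (lfb_base + lfb_size + SKETCH_PAGE_SIZE - 1)
            & ~(SKETCH_PAGE_SIZE - 1);
    return r;
}
```

With page granularity the range exceeds the frame buffer by less than one page at each end, compared with a possible overshoot of nearly the full frame buffer size for a power-of-two range.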

On 17/7/07 11:15, "Keir Fraser" <keir@xensource.com> wrote:

> If we just mapped the 4kB-rounded region specified by lfb_base to
> lfb_base+lfb_size (as determined via the VBE Get Mode Info call) as WC,
> would that be safe?

Actually this is probably okay because any software that does want to access command queues (e.g., X server) will have its own mapping that will not specify PAT.WC. In contrast, an MTRR type specification applies to all mappings of that physical address range. So long as aliasing of WC and UC is not a problem, this would work fine. But I think it's WC aliasing with WB/WT that is a problem.

 -- Keir