Robin Green
2005-Jan-30 21:09 UTC
[Xen-devel] Reproducable data corruption on xen-unstable
With the xen-unstable snapshot from today (and also the fedora-patched one from the 25th) I am seeing lots of display corruption, weird behaviour and crashes and hangs in X. Here is a reproducible test case (non-deterministic, but it fails every time for me) for crashing or incorrect behaviour, in case this is useful:

Note when I say "crashes", I'm referring to userspace crashes.

To reproduce:

1. Boot into Fedora Core 3 under Xen (see http://www.fedoraproject.org/wiki/FedoraXenQuickstart ) [not sure if this is necessary]
2. Disable X acceleration in Xorg.conf [not sure if this is necessary]
3. Download http://www.greenrd.org/sw/fptest/ which should be 100% deterministic, but running under Xen-unstable, it isn't. It reads no input, and just does lots of floating point tests.
4. Build it with ./build
5. Start up a Konsole and run ./test to run the test 100 times.
   - Note it will NOT fail if you are using an xterm (presumably because they use different rendering techniques, and presumably the technique used by xterm makes this memory corruption or whatever it is much less likely to occur). Nor will it fail on the console. I haven't tried other terminal emulators.

Expected results: The last test should complete with no errors.

Actual results: After a while, one of the test runs either crashes, or detects floating-point errors, or both.

None of the anomalous behaviour occurs under the same Fedora-patched kernel when it is not compiled for Xen.

-- 
Robin

-------------------------------------------------------
This SF.Net email is sponsored by: IntelliVIEW -- Interactive Reporting Tool for open source databases. Create drag-&-drop reports. Save time by over 75%! Publish reports on the web. Export to DOC, XLS, RTF, etc. Download a FREE copy at http://www.intelliview.com/go/osdn_nl
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xen-devel
Rik van Riel
2005-Jan-30 22:30 UTC
[Xen-devel] Re: Reproducable data corruption on xen-unstable
On Sun, 30 Jan 2005, Robin Green wrote:
> With the xen-unstable snapshot from today (and also the fedora-patched
> one from the 25th) I am seeing lots of display corruption, weird
> behaviour and crashes and hangs in X.

I'll:
1) upgrade Fedora rawhide to the latest xen-unstable code
2) apply the AGP patch ;)

I will let the list know when a new Fedora package with these changes is available.

-- 
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." - Brian W. Kernighan
Robin Green
2005-Jan-31 01:03 UTC
Re: [Xen-devel] Re: Reproducable data corruption on xen-unstable
On Sun, 30 Jan 2005, Rik van Riel wrote:
> On Sun, 30 Jan 2005, Robin Green wrote:
>
>> With the xen-unstable snapshot from today (and also the fedora-patched
>> one from the 25th) I am seeing lots of display corruption, weird
>> behaviour and crashes and hangs in X.
>
> I'll:
> 1) upgrade Fedora rawhide to the latest xen-unstable code
> 2) apply the AGP patch ;)

I was under the impression that the Xorg server I'm using (savage) doesn't use AGP yet, at least not in the release that's in FC3. Er, how exactly do I tell if it's using AGP?
Ian Pratt
2005-Feb-01 08:02 UTC
RE: [Xen-devel] Reproducable data corruption on xen-unstable
> 3. Download http://www.greenrd.org/sw/fptest/ which should be 100%
> deterministic, but running under Xen-unstable, it isn't. It reads no
> input, and just does lots of floating point tests.
>
> 4. Build it with ./build
>
> 5. Start up a Konsole and run ./test to run the test 100 times.
>
> - Note it will NOT fail if you are using an xterm (presumably because they
> use different rendering techniques, and presumably the technique used by
> xterm makes this memory corruption or whatever it is much less likely to
> occur).

I've tried running two copies of fptest concurrently to see if there was some rare race with fp save/restore, but they've been running in a loop for a couple of hours now without any problems. I've also tried having two domains on the same CPU use FP, again no problem.

As you surmise, the problem may be due to the drivers for your graphics card. I think it must be fairly specific to your setup, otherwise I think we'd have seen it before.

Are you using agpgart? Can you get your graphics card working without it?

Ian
Rik van Riel
2005-Feb-02 02:54 UTC
Re: [Xen-devel] Re: Reproducable data corruption on xen-unstable
On Sun, 30 Jan 2005, Robin Green wrote:
> I was under the impression that the Xorg server I'm using (savage) doesn't
> use AGP yet, at least not in the release that's in FC3.
> Er, how exactly do I tell if it's using AGP?

Tonight's rawhide kernel (2.6.10-1.1120_FC4) has the Xen agpgart patch, as well as last night's xen-unstable source tree. Could you please try reproducing the bug with the latest kernel?
Robin Green
2005-Feb-03 01:21 UTC
Re: [Xen-devel] Re: Reproducable data corruption on xen-unstable
On Tue, 1 Feb 2005, Rik van Riel wrote:
> On Sun, 30 Jan 2005, Robin Green wrote:
>
>> I was under the impression that the Xorg server I'm using (savage) doesn't
>> use AGP yet, at least not in the release that's in FC3.
>> Er, how exactly do I tell if it's using AGP?
>
> Tonight's rawhide kernel (2.6.10-1.1120_FC4) has the
> Xen agpgart patch, as well as last night's xen-unstable
> source tree. Could you please try reproducing the bug
> with the latest kernel ?

It is still reproducible, with the latest xen as well. I also rebuilt 2.6.10-1.1121_FC4 without any AGP options enabled, and it is still reproducible.

And whether savage or vesa X server is used, or whether NoAccel is on or off, it still occurs. (However, the konsole window should be quite tall or it may not occur - mine is 800 pixels [44 lines] high.)

I also found that

su -c 'renice 19 `/sbin/pidof X`'

makes the probability of the bug occurring higher. Counterintuitive, maybe, but true.

If no-one else can reproduce it, I can attempt to put some assertions in the paranoia.c code to determine if this is main memory corruption or register corruption. (I suspect the latter.) Let me know if you want me to try this.

Just to reiterate, no this is not hardware error, because it does not occur outside of Xen :)

Could this possibly be related to the other bug I found, the "APIC error on CPU0"? That interrupt handler may be still operating, for all I know - my patch doesn't _disable_ it, it just shuts it up.

-- 
Robin
Ian Pratt
2005-Feb-04 00:58 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
> And whether savage or vesa X server is used, or whether NoAccel is on or
> off, it still occurs. (However, the konsole window should be quite tall or
> it may not occur - mine is 800 pixels [44 lines] high.)

If it occurred just under the 'vesa' X server I'd be very suspicious that we had an FP save/restore bug in the vm86 support code. I'm not sure whether the savage server uses vm86 or not. Probably not.

I'd certainly be very interested to hear if anyone else running the vesa X server can reproduce the problem using the fptest/paranoia program.

(vm86 is not widely used, so I can believe we could have lurking bugs on that path).

> Could this possibly be related to the other bug I found, the "APIC error
> on CPU0"? That interrupt handler may be still operating, for all I know -
> my patch doesn't _disable_ it, it just shuts it up.

Xen certainly doesn't sound too happy on your machine....

Ian
Rik van Riel
2005-Feb-04 02:26 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
On Fri, 4 Feb 2005, Ian Pratt wrote:
> (vm86 is not widely used, so I can believe we could have lurking bugs on
> that path).

Confirmed, by running Dave Jones's scrashme program, inside a xenU domain with 32 virtual CPUs:

vm86: could not access userspace vm86_info
Unable to handle kernel paging request at virtual address 000171d3
 printing eip:
0000ec83
*pde = ma 308ac067 pa 01709067
*pte = ma 00000000 pa 55555000
 [<c0108d4b>] syscall_call+0x7/0xb
Oops: 0000 [#2]
SMP
Modules linked in: nfs lockd md5 ipv6 autofs4 sunrpc dm_mod
CPU:    0
EIP:    0855:[<0000ec83>]    Not tainted VLI
EFLAGS: 00030f42   (2.6.10-1.1121_FC4xenU)
EIP is at 0xec83
eax: 64e9000b   ebx: 0038cc83   ecx: 2cd0e800   edx: 1ee9000b
esi: 8dffffff   edi: 0038cc83   ebp: 2cc0e800   esp: c62fdf24
ds: 0000   es: 0000   ss: 0069
Process scrashme (pid: 19875, threadinfo=c62fc000 task=c82fd020)
Stack: 104613c3 00008d00 00000003 00000fc2 00000001 00006693 31ff31ff f05589f6
       0026b48d 8d000000 000027bc fe830000 8b187406 448b084d 453b40b1 890c74f0
       07e82404 8d00047f 4601387c 7e0cfe83 74778ddd e8243489 ffff4162 c789c085
Call Trace:
 [<c0108d4b>] syscall_call+0x7/0xb
Code:  Bad EIP value.
Ian Pratt
2005-Feb-04 02:59 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
> > (vm86 is not widely used, so I can believe we could have lurking bugs on
> > that path).
>
> Confirmed, by running Dave Jones's scrashme program, inside a
> xenU domain with 32 virtual CPUs:

Even with a single vcpu it's easy to cause an Oops with scrashme. However, an fptest running in another process at the same time doesn't seem to experience register corruption or anything else nasty.

(http://www.codemonkey.org.uk/projects/scrashme/scrashme-1.0.tar.gz)

./scrashme -r -c113 >/dev/null    (vm86old)
./scrashme -r -c166 >/dev/null    (vm86)

It shouldn't be too hard to fix the Oops, but I'm not seeing anything that would explain Robin's problem.

Ian
Robin Green
2005-Feb-05 16:52 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
On Fri, 4 Feb 2005, Ian Pratt wrote:
> If it occurred just under the 'vesa' X server I'd be very suspicious
> that we had an FP save/restore bug in the vm86 support code. I'm not
> sure whether the savage server uses vm86 or not. Probably not.

It doesn't, according to strace.

On the assumption that this _is_ an FP save/restore bug, I was trying to find the FP save/restore code in the xen-patched kernel. I'm not familiar with low-level kernel issues like this. What I am trying to find is, where does fxrstor (or whatever) get invoked from, when the kernel is doing a normal user-process-to-user-process context switch?

As far as I can tell, the idea seems to be that you don't bother to restore the FPU state immediately - if and when the new process tries to access the FPU for the first time, the CPU automatically generates a trap, and only then does Linux restore the saved FPU state for that process - apparently in arch/xen/i386/kernel/entry.S under ENTRY(device_not_available), if my guess is right. Is that correct?

If that's the case, then, again still working on the assumption that it's an FPU state bug, looks like it could only be one of the following:

1. Something leaves the FPU in a state where it has bogus data in it, but it won't trap to tell the kernel to restore the old, correct data
2. Something forgets to save the FPU state when context-switching from one userland process to another
3. Something is overwriting the saved fpu state with bogus data (seems unlikely)
4. Something in the kernel or in xen is using the FPU (extremely unlikely, since both appear to now be compiled with soft-math).

Is my reasoning sound? I'm a _little_ out of my depth here!

-- 
Robin
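The lazy restore scheme described in this message, and failure case 1 from the list, can be illustrated with a toy model. This is only a sketch in Python, not kernel code: the class names, the `set_ts` switch and the `demo` helper are all hypothetical, and the FPU is reduced to a plain list. The `ts` attribute stands in for the CPU's TS flag, and `device_not_available` stands in for the trap handler of the same name.

```python
class Task:
    """A process with a private copy of its FPU state (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.saved_fpu = []    # FPU contents saved when the task was switched out
        self.used_fpu = False  # has the task touched the FPU since switch-in?


class Cpu:
    """One CPU whose FPU contents are switched lazily."""

    def __init__(self):
        self.ts = False        # models the TS flag: when set, next FPU use traps
        self.fpu = []          # models the live FPU register stack
        self.current = None

    def context_switch(self, next_task, set_ts=True):
        prev = self.current
        if prev is not None and prev.used_fpu:
            prev.saved_fpu = list(self.fpu)  # save only if prev used the FPU
            prev.used_fpu = False
        self.current = next_task
        if set_ts:                           # set_ts=False models the suspected bug
            self.ts = True

    def device_not_available(self):
        # Trap handler: clear TS and load the current task's own state.
        self.ts = False
        self.fpu = list(self.current.saved_fpu)
        self.current.used_fpu = True

    def fpu_push(self, value):
        if self.ts:
            self.device_not_available()
        self.fpu.append(value)


def demo(set_ts):
    """Task a pushes two values, then task b pushes one; return what b sees."""
    cpu, a, b = Cpu(), Task("a"), Task("b")
    cpu.context_switch(a)
    cpu.fpu_push(1.0)
    cpu.fpu_push(2.0)
    cpu.context_switch(b, set_ts=set_ts)
    cpu.fpu_push(9.0)
    return cpu.fpu
```

With `demo(True)` task b starts from its own empty state. With `demo(False)` no trap fires, so b's value lands on top of a's leftover values, which is the "bogus data, but it won't trap" situation of case 1.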
Ian Pratt
2005-Feb-05 17:47 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
> On the assumption that this _is_ an FP save/restore bug, I was trying to
> find the FP save/restore code in the xen-patched kernel.

It's likely to be quite specific to either your CPU, or other hardware drivers. If it was a general problem it would almost certainly have been spotted by someone else.

What happens if you run two copies of fptest in parallel on the text console? This would fail if it was a general fpsave/restore bug. Your problem seems to be quite specific to also having the framebuffer active.

What kernel modules is your X server using?

Ian
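The parallel run suggested above could be scripted along these lines. This is a minimal sketch: `run_in_parallel` is a hypothetical helper, and it assumes the test program reports failure through a non-zero exit status, which the thread does not state.

```python
import subprocess


def run_in_parallel(command, copies=2):
    """Start `copies` instances of `command` at once; return their exit codes."""
    procs = [subprocess.Popen(command) for _ in range(copies)]
    return [proc.wait() for proc in procs]
```

Calling `run_in_parallel(["./test"])` from the fptest directory would start two copies at the same time, so that they compete for the FPU; any non-zero entry in the returned list marks a copy that failed.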
Robin Green
2005-Feb-05 18:52 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
On Sat, 5 Feb 2005, Ian Pratt wrote:
> What happens if you run two copies of fptest in parallel on the text
> console?

They work perfectly.

> Your problem seems to be quite specific to also having the framebuffer
> active.

If by "framebuffer" you mean "kernel fb driver", it doesn't use that. The X server just talks to the graphics card directly.

> What kernel modules is your X server using?

As far as I know, it isn't using any. I've specifically deselected agp at kernel compile time.

My suspicion is that there is some unusual code path where the FP save/restore doesn't work, and the fact of konsole doing large amounts of text rendering (which, I believe, involves FP calculations) and/or scrolling, makes this code path more likely.

(The window that is rendering doesn't have to be the tty of fptest. Any konsole window that's displaying a large amount of scrolling text is enough to trigger it.)

-- 
Robin
Robin Green
2005-Feb-07 03:05 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
On Sat, 5 Feb 2005, Robin Green wrote:
> On the assumption that this _is_ an FP save/restore bug,

Update: I have narrowed down this bug.

I have confirmed that there IS definitely an FP save/restore bug with this kernel/xen combination (i.e. I've eliminated the possibility that it was just a non-floating-point-related bug)! I identified it using a different test case (running wget -d in a konsole), and I have established that it is case 1 in the list of possible causes I gave, namely:

> 1. Something leaves the FPU in a state where it has bogus data in it,
> but it won't trap to tell the kernel to restore the old, correct data

More specifically, in this particular case, according to my printf's, what happened was:

A syscall was made (connect). Immediately before the syscall, the floating-point stack was empty; immediately after the syscall, the floating-point stack was nonempty, and the TS flag (Task Switch) was _cleared_.

(Source code and output available on request.)

This may not immediately cause problems. But over time, it would tend to lead to floating-point stack overflow, which leads to floating-point calculations generating bogus output.

So, in theory there are two possible algorithms which the kernel could be supposed to be following to avoid this situation.

A. Always set TS on task switch (Seems like the logical choice!)

B. Always set TS on task switch - except when the FPU has not been used by the switched-to process, in which case do an FINIT on task switch. (This seems pointlessly complicated and slow, so I doubt the kernel follows this approach.)

So, it looks like we are looking for a code path in which TS doesn't end up set after a task switch. (And it might be specifically to do with syscalls.) I will look for one - but does anyone have any ideas for what that code path might be, or how I could efficiently debug the kernel (while in X, remember, because this doesn't seem to occur in text mode!) to find out what that code path is?

I don't have a serial console.

-- 
Robin
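The condition reported in this message (TS clear while the floating-point stack is nonempty) and its long-term consequence can be sketched as below. This is an illustrative model in Python, not the test code mentioned in the message, which was not posted. It assumes the x87 tag word is available in its full 16-bit form (two bits per register, 11 meaning empty) and that CR0 is available as an integer with TS in bit 3; how those values are obtained is outside the sketch, and all function names are hypothetical.

```python
CR0_TS = 1 << 3      # bit 3 of CR0 is the TS flag
TAG_EMPTY = 0b11     # x87 tag value meaning "register is empty"
X87_REGISTERS = 8    # the x87 register stack holds eight entries


def fpu_stack_depth(tag_word):
    """Count the non-empty registers described by a full 16-bit tag word."""
    depth = 0
    for i in range(X87_REGISTERS):
        tag = (tag_word >> (2 * i)) & 0b11
        if tag != TAG_EMPTY:
            depth += 1
    return depth


def looks_corrupt(cr0, tag_word):
    """True when TS is clear although stale values sit on the FPU stack."""
    ts_clear = (cr0 & CR0_TS) == 0
    return ts_clear and fpu_stack_depth(tag_word) > 0


def push(stack, value):
    """Model a push with the stack-fault exception masked.

    Pushing onto a full stack overwrites the oldest entry, and the new
    top of stack holds a NaN instead of the value that was pushed.
    """
    if len(stack) >= X87_REGISTERS:
        stack.pop(0)
        stack.append(float("nan"))
    else:
        stack.append(value)
```

A tag word of 0xFFFF, the value after FINIT, gives a depth of zero. Each stale entry left behind by another process reduces the room the current process has before `push` starts producing NaNs, which matches the "bogus output over time" behaviour described above.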
Robin Green
2005-Feb-07 04:05 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
A few minutes ago, I wrote:
> So, it looks like we are looking for a code path in which TS doesn't end
> up set after a task switch.

Aha! Shouldn't the stts macro in xeno-linux be calling __HYPERVISOR_fpu_taskswitch instead of trying to write to CR0 itself? Writing to CR0 directly is impossible in ring 1, isn't it?

I think I may have solved the mystery! I'll have to try that out in the next few days.

stts is called by _mmx_memcpy, which is called by memcpy on Athlons. That _might_ explain why people who aren't using Athlons haven't seen this.

-- 
Robin
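The concern raised in this message can be put in model form. The sketch below is a deliberately simplified Python illustration, not Xen or Linux code: `Cpu`, `GeneralProtectionFault`, `stts` and `stts_with_emulation` are hypothetical names, and the model ignores the fact that reading CR0 is itself privileged. It shows only the general trap-and-emulate idea: a CR0 write from outside ring 0 faults, and the flag ends up set only if something running in ring 0 performs the write on the guest's behalf.

```python
CR0_TS = 1 << 3  # bit 3 of CR0 is the TS flag


class GeneralProtectionFault(Exception):
    """Raised when a privileged operation is attempted outside ring 0."""


class Cpu:
    def __init__(self):
        self.cr0 = 0

    def write_cr0(self, value, ring):
        if ring != 0:
            raise GeneralProtectionFault("CR0 write attempted in ring %d" % ring)
        self.cr0 = value


def stts(cpu, ring):
    """Set TS by writing CR0 directly, as a native kernel would."""
    cpu.write_cr0(cpu.cr0 | CR0_TS, ring)


def stts_with_emulation(cpu, ring):
    """Same request, but a ring 0 layer catches the fault and emulates it."""
    try:
        stts(cpu, ring)
    except GeneralProtectionFault:
        # The privileged layer repeats the write in ring 0 for the guest.
        cpu.write_cr0(cpu.cr0 | CR0_TS, ring=0)
```

In this model a bare `stts(cpu, ring=1)` raises and leaves TS clear, while `stts_with_emulation(cpu, ring=1)` leaves TS set. If the fault were swallowed without being emulated, TS would silently stay clear.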
Ian Pratt
2005-Feb-07 13:05 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
> A few minutes ago, I wrote:
> > So, it looks like we are looking for a code path in which TS doesn't end
> > up set after a task switch.
>
> Aha! Shouldn't the stts macro in xeno-linux be calling
> __HYPERVISOR_fpu_taskswitch instead of trying to write to CR0 itself?
> Writing to CR0 directly is impossible in ring 1, isn't it?

Please can you try the unstable tree: it has an extended instruction emulator that was introduced to avoid some of the code edits that some people on the linux-kernel list were complaining about.

Ian
Robin Green
2005-Feb-07 15:35 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
On Mon, 7 Feb 2005, Ian Pratt wrote:
>> A few minutes ago, I wrote:
>> Aha! Shouldn't the stts macro in xeno-linux be calling
>> __HYPERVISOR_fpu_taskswitch instead of trying to write to CR0 itself?
>> Writing to CR0 directly is impossible in ring 1, isn't it?
>
> Please can you try the unstable tree: it has an extended instruction
> emulator that was introduced to avoid some of the code edits that some
> people on the linux-kernel list were complaining about.

Ah, sorry, I'm already using the unstable tree, but I'd forgotten about the emulation. So, never mind - it's not the use of mov _, cr0 in ring 1 that's the problem; it must be something else. Disregard that post.

-- 
Robin
Ian Pratt
2005-Feb-07 15:41 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
It would be interesting to know whether you can reproduce in 2.0-testing.

Ian
Ian Pratt
2005-Feb-07 15:58 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
> Ah, sorry, I'm already using the unstable tree, but I'd forgotten about
> the emulation. So, never mind - it's not the use of mov _, cr0 in ring 1
> that's the problem; it must be something else. Disregard that post.

It's possible the emulation is broken, so be suspicious...

Ian
Ian Pratt
2005-Feb-07 16:26 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
The vm86 'Oops' should now be fixed, although it's not Robin's real problem.

Ian
Rik van Riel
2005-Feb-07 16:39 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
On Mon, 7 Feb 2005, Ian Pratt wrote:
> The vm86 'Oops' should now be fixed, although it's not Robin's real
> problem.

This fragment ? ;)

@@ -126,7 +129,7 @@
 	if (direct_remap_area_pages(&init_mm, (unsigned long) addr, phys_addr,
 				    size, __pgprot(_PAGE_PRESENT | _PAGE_RW |
 						   _PAGE_DIRTY | _PAGE_ACCESSED
-						   | flags), DOMID_IO)) {
+						   | flags), domid)) {
 		vunmap((void __force *) addr);
 		return NULL;
 	}
Ian Pratt
2005-Feb-07 16:50 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
No, this changeset:

http://xen.bkbits.net:8080/xeno-unstable.bk/cset@1.1550.1.175?nav=index.html|ChangeSet@-1d

Ian
Rik van Riel
2005-Feb-07 17:03 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
On Mon, 7 Feb 2005, Ian Pratt wrote:
> No, this changeset:
>
> http://xen.bkbits.net:8080/xeno-unstable.bk/cset@1.1550.1.175?nav=index.html|ChangeSet@-1d

Doh! No wonder I couldn't find it in the changes I checked out this morning, since it was committed a little bit later.

*bk pulls and restarts test rpm build*
Robin Green
2005-Feb-13 22:12 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
On Sun, 6 Feb 2005, I wrote:
> A syscall was made (connect). Immediately before the syscall, the
> floating-point stack was empty; immediately after the syscall, the
> floating-point stack was nonempty, and the TS flag (Task Switch) was
> _cleared_.

I now have an "easier" way to reproduce this problem. Apply the patch below, which checks the FPU state against TS, to a xen0-kernel. What it basically does is:

    if (TS == 0 && fpu_stack_size > 0) panic ("Corrupt FPU");

An equivalent patch against a non-Xen kernel yields no problems that I can detect, but a xen0-kernel with this patch panics and reboots as soon as it hits the graphical login manager (in my case, kdm). (Of course, it might be specific to kdm, or my hardware, or who knows what.)

*** HELP WANTED! ***
If someone on a machine with a debug console could reproduce this, I'd be most grateful. I don't have a serial console yet, so I'm a bit stuck.
********************

The logic behind this patch is as follows. If there is something on the FPU stack from _another_ process, TS should be 1 to prevent data leakage between processes. If, on the other hand, there is something on the FPU stack from the _same_ process being switched to, TS should still be 1, because nothing should have cleared it since it was set when that process was last switched away from. So, in either case, TS should be 1.

Also, I was wrong in my previous post:

> So, in theory there are two possible algorithms which the kernel could be
> supposed to be following to avoid this situation.
>
> A. Always set TS on task switch (Seems like the logical choice!)
>
> B. Always set TS on task switch - except when the FPU has not been used
> by the switched-to process, in which case do an FINIT on task switch. (This
> seems pointlessly complicated and slow, so I doubt the kernel follows this
> approach.)

The _actual_ algorithm appears to be:

C. Always set TS on task switch - except when the FPU has not been used in the previous timeslice by the switched-FROM process, in which case the kernel assumes that TS must _already_ be set if the FPU is dirty. (That assumption is correct for normal kernels, but incorrect for xen0-kernels!)

-- Robin
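Editor's sketch: the check described in the message above can be modelled in plain userspace C. This is not Robin's patch, which is not reproduced in this archive. It assumes that "fpu_stack_size" means the number of x87 registers whose 2-bit tag is not "empty" in the 16-bit tag word (as stored by FNSTENV/FNSAVE), and it takes the CR0 value as a parameter, because reading CR0 needs kernel privilege. The names `x87_stack_depth` and `fpu_state_suspicious` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR0_TS        0x8u  /* CR0 bit 3: the TS flag */
#define X87_TAG_EMPTY 0x3u  /* 2-bit tag value 11b: register is empty */

/* Count the x87 registers that hold something, given the full 16-bit
 * tag word (two tag bits per physical register, eight registers). */
static int x87_stack_depth(uint16_t tag_word)
{
    int depth = 0;
    for (int reg = 0; reg < 8; reg++) {
        unsigned tag = (tag_word >> (2 * reg)) & 0x3u;
        if (tag != X87_TAG_EMPTY)
            depth++;
    }
    return depth;
}

/* The invariant from the message: a non-empty FPU stack while TS is
 * clear is treated as suspicious. (Hypothetical helper name.) */
static bool fpu_state_suspicious(uint32_t cr0, uint16_t tag_word)
{
    return (cr0 & CR0_TS) == 0 && x87_stack_depth(tag_word) > 0;
}
```

In this model a tag word of 0xFFFF (the state after FINIT) is never reported, whatever the TS value; only a non-empty stack combined with TS == 0 trips the check.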
Robin Green
2005-Feb-14 00:06 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable (SOLVED!)
On Sun, 13 Feb 2005, I wrote (in pseudocode):
> if (TS == 0 && fpu_stack_size > 0) panic ("Corrupt FPU");

And replacing the panic with stts() in my patch fixes the problem completely - and all the display corruption, which must have been a symptom. :-)

This is only a workaround, though - I'm not sure what the underlying problem is.

-- Robin
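Editor's sketch: the workaround above, modelled as a pure function in userspace C. It is an illustration under stated assumptions, not the actual patch: the real stts() sets CR0.TS from kernel context, whereas here CR0 and the x87 tag word are ordinary parameters, and `apply_ts_workaround` is a hypothetical name.

```c
#include <stdint.h>

#define CR0_TS                 0x8u     /* CR0 bit 3: the TS flag */
#define X87_TAG_WORD_ALL_EMPTY 0xFFFFu  /* every register tagged empty */

/* Model of "replace the panic with stts()": if the FPU stack is
 * non-empty while TS is clear, return CR0 with TS set, so that the
 * next FPU instruction would trap instead of silently using stale
 * state. Otherwise CR0 is returned unchanged.
 * (Hypothetical helper; the real code runs in the kernel.) */
static uint32_t apply_ts_workaround(uint32_t cr0, uint16_t tag_word)
{
    if ((cr0 & CR0_TS) == 0 && tag_word != X87_TAG_WORD_ALL_EMPTY)
        cr0 |= CR0_TS;  /* stands in for stts() */
    return cr0;
}
```

As the message says, this only hides the symptom: it forces TS back to the value the invariant expects without explaining how TS came to be clear.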
Ian Pratt
2005-Feb-15 04:08 UTC
RE: [Xen-devel] Re: Reproducable data corruption on xen-unstable
> On Sun, 6 Feb 2005, I wrote:
> > A syscall was made (connect). Immediately before the syscall, the
> > floating-point stack was empty; immediately after the syscall, the
> > floating-point stack was nonempty, and the TS flag (Task Switch) was
> > _cleared_.
>
> I now have an "easier" way to reproduce this problem. Apply the patch
> below to a xen0-kernel, which checks the FPU state against TS. What it
> basically does is:
>
> if (TS == 0 && fpu_stack_size > 0) panic ("Corrupt FPU");
>
> An equivalent patch against a non-xen kernel yields no problems that I can
> detect, but patching a xen0-kernel with this patch, causes it to panic and
> reboot as soon as it hits the graphical login manager (in my case, kdm).
> (Of course, it might be specific to kdm, or my hardware, or who knows
> what.)

The fact that the bug is triggered when the Xserver starts makes me suspect that the vm86 system call may have something to do with this.

Please can you find out whether your Xserver is using the vm86 bios or vesa modules. Also instrument the vm86 syscall in Linux just to make sure. It may be possible to get the Xserver to run without those modules -- you could try moving them off the module search path.

Also, what CPU type do you compile your kernel for? I'm wondering whether this is an AMD-specific issue.

Another place to look is kernel_fpu_begin()/kernel_fpu_end(), to see whether they're correct.

Ian