This strikes me as quite nasty, although i've no idea if it's xen
related at all... i was just doing a 'make world' under dom0, when the
compile crashed out thus:

gcc -D__KERNEL__ -I/usr/src/xeno-unstable.bk/linux-2.4.26-xen0/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer -pipe -mpreferred-stack-boundary=2 -march=i686 -nostdinc -iwithprefix include -DKBUILD_BASENAME=svcsock -c -o svcsock.o svcsock.c
svcsock.c:144:25: warning: null character(s) preserved in literal
svcsock.c:144:25: missing terminating " character
svcsock.c:1176:1: unterminated argument list invoking macro "dprintk"
svcsock.c: In function `svc_sock_enqueue':
svcsock.c:144: error: `dprintk' undeclared (first use in this function)
svcsock.c:144: error: (Each undeclared identifier is reported only once
svcsock.c:144: error: for each function it appears in.)
svcsock.c:144: error: parse error at end of input
svcsock.c:118: warning: unused variable `rqstp'
svcsock.c:66: warning: `svc_setup_socket' declared `static' but never defined
svcsock.c:67: warning: `svc_udp_data_ready' declared `static' but never defined
svcsock.c:68: warning: `svc_udp_recvfrom' declared `static' but never defined
svcsock.c:69: warning: `svc_udp_sendto' declared `static' but never defined
svcsock.c:116: warning: `svc_sock_enqueue' defined but not used

and sure enough svcsock.c had a big lump of corruption right in the
middle of it. My most recent reboot was afaik orderly and so the
corruption shouldn't have come from an unclean halt. My last compile
was right before my last reboot (i think) and the svcsock.c was not
corrupt then.

I copied the file back from a known good source and restarted the
compile, but it aborted pretty quickly with internal gcc errors etc.

I then tried a basic file corruption test - copying large files of
known data back and forth lots and then finally compare to the
original but that started seg faulting, and then the process hung in a
'D' state, so i'm rebooting now.

When it comes back up i'll try to break it again.

My tree is about 30 hours old. I should probably completely refresh it
to ensure there is no other corruption... is there a bk command to do
that? or at least to sum all the files to check them against the
originals?

thanks

James
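
[Editor's sketch] The copy-and-compare test and the "sum all the files" idea from the message above can be approximated without any bk support. The following is only a minimal sketch under stated assumptions: the function names (sha256_of, copy_round_trips, differing_files) and the file names are hypothetical, SHA-256 is an arbitrary choice of checksum, and nothing here is a BitKeeper command. Note also that a file read back straight after being written may be served from the page cache, so a clean result does not prove the on-disk copy is intact.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_round_trips(source, workdir, rounds=10):
    """Copy `source` back and forth inside `workdir`, checking every copy.

    Returns the list of round numbers whose copy did not match the
    digest of the original file.
    """
    expected = sha256_of(source)
    bad_rounds = []
    current = source
    for round_number in range(rounds):
        # Alternate between two scratch files so each copy reads the
        # output of the previous one.
        target = os.path.join(workdir, "copy_%d.bin" % (round_number % 2))
        shutil.copyfile(current, target)
        if sha256_of(target) != expected:
            bad_rounds.append(round_number)
        current = target
    return bad_rounds


def differing_files(reference_root, suspect_root):
    """Yield paths (relative to reference_root) that are missing from,
    or differ in, suspect_root."""
    for dirpath, _dirnames, filenames in os.walk(reference_root):
        for name in filenames:
            reference = os.path.join(dirpath, name)
            relative = os.path.relpath(reference, reference_root)
            suspect = os.path.join(suspect_root, relative)
            if not os.path.isfile(suspect):
                yield relative
            elif sha256_of(reference) != sha256_of(suspect):
                yield relative


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        source = os.path.join(workdir, "original.bin")
        with open(source, "wb") as handle:
            handle.write(os.urandom(4 * 1024 * 1024))
        print("mismatched rounds:", copy_round_trips(source, workdir, rounds=6))
```

To check a suspect source tree, differing_files() would be pointed at a known good checkout and at the suspect one; any path it yields is a candidate for the kind of corruption seen in svcsock.c.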
On Fri, Jul 16, 2004 at 04:06:20PM +1000, James Harper wrote:
> This strikes me as quite nasty, although i've no idea if it's xen
> related at all... i was just doing a 'make world' under dom0, when the
> compile crashed out thus:
>
> svcsock.c:144:25: warning: null character(s) preserved in literal
> svcsock.c:144:25: missing terminating " character
> svcsock.c:1176:1: unterminated argument list invoking macro "dprintk"
[snip]
> and sure enough svcsock.c had a big lump of corruption right in the
> middle of it. My most recent reboot was afaik orderly and so the
> corruption shouldn't have come from an unclean halt. My last compile
> was right before my last reboot (i think) and the svcsock.c was not
> corrupt then.
>
> I copied the file back from a known good source and restarted the
> compile, but it aborted pretty quickly with internal gcc errors etc.
>
> I then tried a basic file corruption test - copying large files of
> known data back and forth lots and then finally compare to the
> original but that started seg faulting, and then the process hung in a
> 'D' state, so i'm rebooting now.
>
> When it comes back up i'll try to break it again.

Sounds like hardware failure. Run memtest86 for an hour or so and let
us know whether it finds any errors.

-andy

-------------------------------------------------------
This SF.Net email is sponsored by BEA Weblogic Workshop
FREE Java Enterprise J2EE developer tools!
Get your free copy of BEA WebLogic Workshop 8.1 today.
http://ads.osdn.com/?ad_id=4721&alloc_id=10040&op=click
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xen-devel
On 17 Jul 2004, at 16:25, Andy Isaacson wrote:
> On Fri, Jul 16, 2004 at 04:06:20PM +1000, James Harper wrote:
>> I copied the file back from a known good source and restarted the
>> compile, but it aborted pretty quickly with internal gcc errors etc.
>>
>> I then tried a basic file corruption test - copying large files of
>> known data back and forth lots and then finally compare to the
>> original but that started seg faulting, and then the process hung in a
>> 'D' state, so i'm rebooting now.
>>
>> When it comes back up i'll try to break it again.
>
> Sounds like hardware failure. Run memtest86 for an hour or so and let
> us know whether it finds any errors.

Well, I also have similar corruption in my domain0 filesystems, and
I've heard of another instance, so there could be something in this.

Running -unstable as of yesterday, but with v1.9 of
linux-2.4.26-xen-sparse/arch/xen/drivers/blkif/backend/vbd.c, as the
v1.11 changes seem to break exporting block devices from domain0. The
filesystems are on devicemapper LVs, but copying to a file and
exporting the loop device didn't help -- the xenU kernel reports 'no
device' for /dev/sda1 when it tries to mount root. The same export
statement worked when booting the kernel with earlier vbd.c.

I've just rebooted into a 2.6.6 kernel and fscked everything, and
domain0's /usr needed plenty of repair. I'm going to let the machine
chug for a bit in 2.6.6 and see if any problems show up. (The machine
is remote, so I can't easily run memtest86.)

I also have a serial console problem - the console connection works
fine with Linux 2.6, but appears to only work in one direction with Xen
- I see the bios output, grub output, then Xen output, but once domain0
is booting I can't send characters, although ^A^A^A does switch between
domain0 and Xen and I do see further console jibber.

I'm booting with this grub stanza:

    title Xen Virtualised Linux
    root (hd0,0)
    kernel /boot/xen.gz dom0_mem=131072 com1=9600,8n1 watchdog console=com1,vga
    module /boot/vmlinuz-2.4.26-xen0 root=/dev/sda1 ro console=tty0 console=ttyS0,9600 panic=30

Cheers,
Chris.
> Well, I also have similar corruption in my domain0 filesystems, and
> I've heard of another instance, so there could be something in this.

That's obviously very worrying to hear -- we just haven't seen
anything like this.

It would be very interesting to hear whether you get the problem
with the 2.6.7 xen linux. It might give us a clue as to whether
the problem is with the backend blk driver or within the domain
itself (the 2.6.7 implementation is completely different).

> Running -unstable as of yesterday, but with v1.9 of
> linux-2.4.26-xen-sparse/arch/xen/drivers/blkif/backend/vbd.c, as the
> v1.11 changes seem to break exporting block devices from domain0. The
> filesystems are on devicemapper LVs, but copying to a file and
> exporting the loop device didn't help -- the xenU kernel reports 'no
> device' for /dev/sda1 when it tries to mount root. The same export
> statement worked when booting the kernel with earlier vbd.c.

I've checked in a fix. It should work fine with LVM and loop
devices again.

> I also have a serial console problem - the console connection works
> fine with Linux 2.6, but appears to only work in one direction with Xen
> - I see the bios output, grub output, then Xen output, but once domain0
> is booting I can't send characters, although ^A^A^A does switch between
> domain0 and Xen and I do see further console jibber.
>
> I'm booting with this grub stanza:
>
> title Xen Virtualised Linux
> root (hd0,0)
> kernel /boot/xen.gz dom0_mem=131072 com1=9600,8n1 watchdog
> console=com1,vga
> module /boot/vmlinuz-2.4.26-xen0 root=/dev/sda1 ro console=tty0
> console=ttyS0,9600 panic=30

Nothing obviously wrong here, though could you try the simpler:

    module /boot/vmlinuz-2.4.26-xen0 root=/dev/sda1 ro console=ttyS0

What happens if you run a getty on /dev/ttyS0 ?

Ian
On 17 Jul 2004, at 21:21, Ian Pratt wrote:
> It would be very interesting to hear whether you get the problem
> with the 2.6.7 xen linux. It might give us a clue as to whether
> the problem is with the backend blk driver or within the domain
> itself (the 2.6.7 implementation is completely different).

I can certainly give the 2.6.7 guest another try. I did have it
booting, but I didn't persist with it long enough to tell if there was
fs corruption -- there seemed to be issues loading modules, and when I
compiled everything in, I got a gpf when racoon tried to use a PF_KEY
socket. I'll try and get some useful dumps for both these problems.

> I've checked in a fix. It should work fine with LVM and loop
> devices again.

All seems well with that fix, with domain 0 using LVs, and exporting
them to other domains. I'll give the machine some load and see what
happens.

> Nothing obviously wrong here, though could you try the simpler:
>
> module /boot/vmlinuz-2.4.26-xen0 root=/dev/sda1 ro console=ttyS0
>
> What happens if you run a getty on /dev/ttyS0 ?

That module line gives the same symptoms. I do have a getty on ttyS0,
and I see the login banner from it, but can't log in.

Actually I do have problems with Linux 2.6.6 on the same system. Once
the kernel initialises the serial driver, the port settings appear to
change -- I get the symptoms of incorrect baud rate. When userspace
starts, it seems to switch back (although I have to reset my terminal).
The hardware is a Dell 1650, with console redirection on, but
redirection after boot disabled.

Domain 0 runs debian testing -- would I need to disable the calls to
setserial in the initscripts, or should they just fail safely?

Cheers,
Chris.
> On 17 Jul 2004, at 21:21, Ian Pratt wrote:
>
> > It would be very interesting to hear whether you get the problem
> > with the 2.6.7 xen linux. It might give us a clue as to whether
> > the problem is with the backend blk driver or within the domain
> > itself (the 2.6.7 implementation is completely different).
>
> I can certainly give the 2.6.7 guest another try. I did have it
> booting, but I didn't persist with it long enough to tell if there was
> fs corruption -- there seemed to be issues loading modules, and when I
> compiled everything in, I got a gpf when racoon tried to use a PF_KEY
> socket. I'll try and get some useful dumps for both these problems.

I haven't tried loading modules, but I can't think why it
wouldn't work (assuming the mechanism is basically the same as
2.4).

BTW: what's racoon, and what's a PF_KEY socket?

> > I've checked in a fix. It should work fine with LVM and loop
> > devices again.
>
> All seems well with that fix, with domain 0 using LVs, and exporting
> them to other domains. I'll give the machine some load and see what
> happens.

Thanks for the confirmation.

> > Nothing obviously wrong here, though could you try the simpler:
> >
> > module /boot/vmlinuz-2.4.26-xen0 root=/dev/sda1 ro console=ttyS0
> >
> > What happens if you run a getty on /dev/ttyS0 ?
>
> That module line gives the same symptoms. I do have a getty on ttyS0,
> and I see the login banner from it, but can't log in.

There was a bug along these lines, but it's believed fixed. If
you're using the latest repo, that's a concern.

> Actually I do have problems with Linux 2.6.6 on the same system. Once
> the kernel initialises the serial driver, the port settings appear to
> change -- I get the symptoms of incorrect baud rate. When userspace
> starts, it seems to switch back (although I have to reset my terminal).
> The hardware is a Dell 1650, with console redirection on, but
> redirection after boot disabled.

With the default configuration, xen owns the serial uart at all
times, so linux shouldn't be able to mess with the baud rate etc.

> Domain 0 runs debian testing -- would I need to disable the calls to
> setserial in the initscripts, or should they just fail safely?

I'm not sure how setserial works -- presumably it tries to do
ioctls on /dev/ttyS0 rather than trying to inb/outb the uart
directly? The former should just be ignored by the xencons ttyS0
driver and do no harm. If the latter, it's possible that it is
messing things up, as the default configuration is to allow dom0
any IO privs it asks for. Disabling them seems safest in the
first instance.

Ian
On 18 Jul 2004, at 17:22, Chris Andrews wrote:
>
>> I've checked in a fix. It should work fine with LVM and loop
>> devices again.
>
> All seems well with that fix, with domain 0 using LVs, and exporting
> them to other domains. I'll give the machine some load and see what
> happens.

So, I've got the following results. The machine has an aacraid
controller, which reports no errors. My test, such as it is, is to
rebuild the 'festival' package -- simply because that's what I was
doing when I first saw corruption.

2.4.26 plus the device-mapper patch and the VFS-locking patch - stable.
Xen, 2.4.26 domain0 plus device-mapper, VFS-locking, domain 0 only - stable.
Xen, 2.4.26 domain0 etc, and a 2.6.7-xenU guest - stable.
Xen, 2.4.26 domain0 and a 2.4.26-xenU guest - corruption in dom0, and an oops.

The oops below happened shortly after I started the 2.4 guest, and
killed the machine.

I'll run memtest86 as soon as I've got it built with serial support
and hooked into grub...

Chris.

Unable to handle kernel paging request at virtual address 258b0619
c0124f32
Oops: 0000
CPU: 0
EIP: 0819:[<c0124f32>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00213202
eax: 258b0601 ebx: 00000001 ecx: 258b0601 edx: 00000000
esi: c66bdaa0 edi: c1a31034 ebp: c66bdaa0 esp: c767dbd8
ds: 0821 es: 0821 ss: 0821
Process find (pid: 2877, stackpage=c767d000)<1>
Stack: c0261d6c 00000000 c66bdaa0 c75eb220 c0261da3 c66bdaa0 c66bdaa0 c0261ed9
       c66bdaa0 00000000 00000030 952163ff c0288163 c66bdaa0 c66bdaa0 00000000
       c66bdaa0 c1189800 00000000 c02ac920 00000001 952163cf c02766a0 007de526
Call Trace: [<c0261d6c>] [<c0261da3>] [<c0261ed9>] [<c0288163>] [<c02ac920>] [<c02766a0>]
  [<c028feb2>] [<c0290353>] [<c02af7e0>] [<c027667e>] [<c02af7e0>] [<c02766a0>]
  [<c0276859>] [<c026e85a>] [<c02b04b7>] [<c026e506>] [<c02766a0>] [<c026e85a>]
  [<c02766a0>] [<c027646e>] [<c02766a0>] [<c02667d7>] [<c0266935>] [<c0266a93>]
  [<c010ddba>] [<c01b07ee>] [<c01b4ffd>] [<c01aefab>] [<c01bf403>] [<c01be83d>]
  [<c01bf23f>] [<c01c0ab7>] [<c01bef3b>] [<c01bc5fa>] [<c01c0920>] [<c012c273>]
  [<c01aee97>] [<c01a0833>]
<0>Kernel panic: Aiee, killing interrupt handler!
Warning (Oops_read): Code line not seen, dumping what data is available

>>EIP; c0124f32 <__free_pages+2/20> <====
>>eax; 258b0601 <__start___xen_guest+258ad3b2/c00fcdb1>
>>ecx; 258b0601 <__start___xen_guest+258ad3b2/c00fcdb1>

Trace; c0261d6c <skb_release_data+7c/a0>
Trace; c0261da3 <kfree_skbmem+13/30>
Trace; c0261ed9 <__kfree_skb+119/190>
Trace; c0288163 <tcp_rcv_established+483/850>
Trace; c02ac920 <br_handle_frame_finish+0/170>
Trace; c02766a0 <ip_rcv_finish+0/210>
Trace; c028feb2 <tcp_v4_do_rcv+122/130>
Trace; c0290353 <tcp_v4_rcv+493/5d0>
Trace; c02af7e0 <br_nf_pre_routing_finish+0/280>
Trace; c027667e <ip_local_deliver_finish+14e/170>
Trace; c02af7e0 <br_nf_pre_routing_finish+0/280>
Trace; c02766a0 <ip_rcv_finish+0/210>
Trace; c0276859 <ip_rcv_finish+1b9/210>
Trace; c026e85a <nf_hook_slow+7a/190>
Trace; c02b04b7 <ipv4_sabotage_in+27/30>
Trace; c026e506 <nf_iterate+76/b0>
Trace; c02766a0 <ip_rcv_finish+0/210>
Trace; c026e85a <nf_hook_slow+7a/190>
Trace; c02766a0 <ip_rcv_finish+0/210>
Trace; c027646e <ip_rcv+19e/260>
Trace; c02766a0 <ip_rcv_finish+0/210>
Trace; c02667d7 <netif_receive_skb+137/210>
Trace; c0266935 <process_backlog+85/160>
Trace; c0266a93 <net_rx_action+83/160>
Trace; c010ddba <do_softirq+da/f0>
Trace; c01b07ee <do_IRQ+9e/a0>
Trace; c01b4ffd <evtchn_do_upcall+ad/110>
Trace; c01aefab <hypervisor_callback+33/49>
Trace; c01bf403 <opost_block+b3/180>
Trace; c01be83d <tty_default_put_char+2d/40>
Trace; c01bf23f <opost+9f/1b0>
Trace; c01c0ab7 <write_chan+197/220>
Trace; c01bef3b <do_tty_write+db/134>
Trace; c01bc5fa <tty_write+13a/170>
Trace; c01c0920 <write_chan+0/220>
Trace; c012c273 <sys_write+a3/140>
Trace; c01aee97 <system_call+2f/33>
Trace; c01a0833 <nlmsvc_decode_notify+23/70>

1 warning and 1 error issued. Results may not be reliable.
On 18 Jul 2004, at 18:48, Ian Pratt wrote:
>>
>> On 17 Jul 2004, at 21:21, Ian Pratt wrote:
>>
>>> It would be very interesting to hear whether you get the problem
>>> with the 2.6.7 xen linux. It might give us a clue as to whether
>>> the problem is with the backend blk driver or within the domain
>>> itself (the 2.6.7 implementation is completely different).
>>
>> I can certainly give the 2.6.7 guest another try. I did have it
>> booting, but I didn't persist with it long enough to tell if there was
>> fs corruption -- there seemed to be issues loading modules, and when I
>> compiled everything in, I got a gpf when racoon tried to use a PF_KEY
>> socket. I'll try and get some useful dumps for both these problems.
>
> I haven't tried loading modules, but I can't think why it
> wouldn't work (assuming the mechanism is basically the same as
> 2.4).

It's different enough to need new userspace tools. The symptoms of
failure are a GPF, and the userspace process stuck in D (be it insmod
or lsmod). The results of feeding the GPF to ksymoops are below (I
hesitate to say it's actually decoded).

> BTW: what's racoon, and what's a PF_KEY socket?

racoon is the ISAKMP daemon used with the 2.6 kernel's KAME IPSec
code. It uses a PF_KEY socket to communicate with the kernel. I've
successfully used it in a 2.4 guest.

Chris.

No modules in ksyms, skipping objects
No ksyms, skipping lsmod
CPU: 0
EIP: 0061:[<c01471a7>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246 (2.6.7-xenU)
eax: 00000600 ebx: c5400000 ecx: 00000001 edx: 00000600
esi: c0102c54 edi: c5089000 ebp: c5087000 esp: c04b1ec4
ds: 0069 es: 0069 ss: 0069
Stack: c0102c50 c5087000 00002000 c122c6a8 c122c6e0 00000001 c01473f8 c122c6a8
       c5087000 fffffffe c0147491 c5087000 00000000 c5055c19 c5084380 c5015000
       fffffffe c5084380 c014753e c5087000 00000001 c012d9c3 c5087000 c5087000
Call Trace:
c04b1ed0: [<c01473f8>]
c04b1ee0: [<c0147491>]
c04b1f00: [<c014753e>]
c04b1f0c: [<c012d9c3>]
c04b1f38: [<c02da440>]
c04b1f94: [<c012dc5d>]
c04b1fb4: [<c010a663>]
Code: 0f 22 e2 0f 20 d9 0f 22 d9 0f 22 e0 83 c4 0c 5b 5e 5f c3 e8

>>EIP; c01471a7 <unmap_vm_area+5d/80> <====
>>ebx; c5400000 <pg0+50c8000/3bcc5000>
>>esi; c0102c54 <swapper_pg_dir+c54/1000>
>>edi; c5089000 <pg0+4d51000/3bcc5000>
>>ebp; c5087000 <pg0+4d4f000/3bcc5000>
>>esp; c04b1ec4 <pg0+179ec4/3bcc5000>

Code; c01471a7 <unmap_vm_area+5d/80>
00000000 <_EIP>:
Code; c01471a7 <unmap_vm_area+5d/80> <====
   0: 0f 22 e2          mov %edx,%cr4 <====
Code; c01471aa <unmap_vm_area+60/80>
   3: 0f 20 d9          mov %cr3,%ecx
Code; c01471ad <unmap_vm_area+63/80>
   6: 0f 22 d9          mov %ecx,%cr3
Code; c01471b0 <unmap_vm_area+66/80>
   9: 0f 22 e0          mov %eax,%cr4
Code; c01471b3 <unmap_vm_area+69/80>
   c: 83 c4 0c          add $0xc,%esp
Code; c01471b6 <unmap_vm_area+6c/80>
   f: 5b                pop %ebx
Code; c01471b7 <unmap_vm_area+6d/80>
  10: 5e                pop %esi
Code; c01471b8 <unmap_vm_area+6e/80>
  11: 5f                pop %edi
Code; c01471b9 <unmap_vm_area+6f/80>
  12: c3                ret
Code; c01471ba <unmap_vm_area+70/80>
  13: e8 00 00 00 00    call 18 <_EIP+0x18>
I'm not in a position to test this, but is it possible that the
corruption problem could manifest itself after an out of memory
condition? When I first noticed the corruption I rebooted as quickly
as possible so it didn't continue and so didn't check, but it's
possible that it ran out of memory first. I guess I could test this
but don't really want to do anything to risk corruption any further :)

speaking of memory, I have 3 domains running currently, 0 + 2U, all
declared with 128mb memory, but xm list shows this:

Dom  Name      Mem(MB)  CPU  State  Time(s)
0    Domain-0      119    0  r----   1293.0
6    gaia          127    1  -b---     81.9
7    mail2         126    0  -b---   1597.9

'free' under mail2 and gaia shows 128124 as the total amount of memory.

I appreciate that maybe something about dom0 means that it shows
something different, but why would the other two report different
amounts of memory when they both have the same amount??? Both are
running identical kernels.

James

From: Chris Andrews
Sent: Mon 19/07/2004 8:43 AM
To: xen-devel@lists.sourceforge.net
Subject: Re: [Xen-devel] file corruption!!!

[snip]
I just tried another bk pull + make world, and it failed because it
couldn't gunzip linux-2.4.26.tar.gz. I tried it manually and sure
enough it failed. 'xm list' etc just seg faulted too.

After a reboot though, the file was fine again, so the corruption in
this case was a read error not a write error. I'm assuming that if I
had done enough io to flush any buffers and then tried to gunzip the
file again it probably would have worked.

Just prior to this I had run a little C program which would just try
and allocate memory in 1mb chunks until it was killed.. After reboot I
tried the same thing again and it appears to be staying up okay now,
unfortunately. It almost seems like I only start to get errors after a
day or so uptime and a fair bit of I/O.

Curiously though, the first time I ran my memory exhausting program,
all my xenU domains restarted...

Since starting this email I have managed to induce corruption again,
i'll reboot and try it again without starting any other domains.

The server is a Compaq ProLiant 1600 2x550mhz P3 with 768mb memory. All
the memory is ECC and up until I acquired it for Linux purposes, it was
running as another company's main Windows server, so I wouldn't have
suspected a hardware issue.

I'll follow up shortly hopefully with some instructions on inducing
the corruption on this server for anyone else to try to see if we have
a general problem.

There haven't been any fixes in the last 2 days that would correct
this problem have there? I'm a few days out of date i think.

James

From: James Harper
Sent: Mon 19/07/2004 9:36 AM
To: xen-devel@lists.sourceforge.net
Subject: RE: [Xen-devel] file corruption!!!

[snip]
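
[Editor's sketch] James's memory-exhausting program was written in C and was not posted to the list. What follows is only a rough Python approximation of the behaviour he describes (grab memory in 1 MB chunks and hold on to it), not his code. The name allocate_chunks and the limit_mb parameter are hypothetical; the cap is there so the example can be run without taking a machine down. Running it uncapped on a box that matters is likely to trigger the OOM killer.

```python
CHUNK_BYTES = 1024 * 1024  # 1 MB per allocation, as in James's description


def allocate_chunks(limit_mb=None):
    """Allocate and hold 1 MB chunks.

    With limit_mb=None the loop only stops when an allocation fails
    (MemoryError) or the process is killed from outside. With a number
    it stops after that many chunks. The chunks are returned so the
    caller keeps the memory alive.
    """
    chunks = []
    try:
        while limit_mb is None or len(chunks) < limit_mb:
            # Fill with a non-zero pattern so the pages are actually
            # written to rather than left as untouched zero pages.
            chunks.append(bytearray(b"\xa5") * CHUNK_BYTES)
    except MemoryError:
        pass
    return chunks


if __name__ == "__main__":
    # Capped demonstration run; pass limit_mb=None to mimic "until killed".
    held = allocate_chunks(limit_mb=16)
    print("holding %d MB" % len(held))
```

Holding the returned list is the point of the exercise: dropping it would let the interpreter free the memory again and relieve the pressure the test is meant to create.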
> I'm not in a position to test this, but is it possible that the
> corruption problem could manifest itself after an out of memory
> condition? When I first noticed the corruption I rebooted as quickly
> as possible so it didn't continue and so didn't check, but it's
> possible that it ran out of memory first. I guess I could test this
> but don't really want to do anything to risk corruption any further :)
>
> speaking of memory, I have 3 domains running currently, 0 + 2U, all
> declared with 128mb memory, but xm list shows this:
> Dom Name Mem(MB) CPU State Time(s)
> 0 Domain-0 119 0 r---- 1293.0
> 6 gaia 127 1 -b--- 81.9
> 7 mail2 126 0 -b--- 1597.9
>
> 'free' under mail2 and gaia shows 128124 as the total amount of memory.
>
> I appreciate that maybe something about dom0 means that it shows
> something different, but why would the other two report different
> amounts of memory when they both have the same amount??? Both are
> running identical kernels.
>
> James

The backend network and blkdev drivers in DOM0 allocate multi-MB
chunks of memory, then free all the pages in that chunk back to
Xen. The chunks are then used for ephemeral mappings of I/O buffers
from frontend drivers in other guest OSes.

The total used for this is about 7 or 8MB, so DOM0's estimate of its
memory allocation will be thrown off by about that amount. It doesn't
realise that those large allocated chunks have had all their memory
released. :-)

 -- Keir
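
[Editor's sketch] The figures in the quoted 'xm list' output can be set against the declared 128 MB with a few lines of arithmetic. This is only a consistency check on the numbers posted in the thread. The function name memory_shortfalls is hypothetical, and the whitespace-separated column layout is assumed from the output as quoted, not taken from any xm documentation.

```python
DECLARED_MB = 128  # per James's report, every domain was declared with 128 MB

XM_LIST_OUTPUT = """\
Dom Name Mem(MB) CPU State Time(s)
0 Domain-0 119 0 r---- 1293.0
6 gaia 127 1 -b--- 81.9
7 mail2 126 0 -b--- 1597.9
"""


def memory_shortfalls(xm_list_text, declared_mb):
    """Map each domain name to declared_mb minus its reported Mem(MB)."""
    shortfalls = {}
    for line in xm_list_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        name, reported_mb = fields[1], int(fields[2])
        shortfalls[name] = declared_mb - reported_mb
    return shortfalls


if __name__ == "__main__":
    for name, missing in memory_shortfalls(XM_LIST_OUTPUT, DECLARED_MB).items():
        print("%-10s short by %d MB" % (name, missing))
    # Domain-0 is short by 9 MB, gaia by 1 MB and mail2 by 2 MB.
```

On these numbers Domain-0 is short by 7 to 8 MB more than either guest, which is in line with the figure Keir gives for the backend drivers. The 1 MB difference between gaia and mail2 is not addressed by that explanation.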