Moritz Möller
2008-Sep-19 11:39 UTC
[Xen-devel] disk io errors possibly caused by high network load?
Hi,

we had a very strange situation yesterday. Within one second, 13 of 25 Xen boxes died with disk errors (domU and dom0, something like end_request: I/O error dev hda sector ...), but worked well again after a reboot.

A few minutes earlier a technician had plugged in a wrong cable, creating a network loop, so the errors could have been caused by high network I/O load. The disks are okay, and the error occurred with both SCSI RAID controllers and plain SATA disks.

Here is some info from a host that crashed:

root/mmoeller@srv002050:/root$ xm info
host                   : srv002050
release                : 2.6.21-2950.fc8xen
version                : #1 SMP Tue Oct 23 12:23:33 EDT 2007
machine                : x86_64
nr_cpus                : 8
nr_nodes               : 1
cores_per_socket       : 4
threads_per_core       : 1
cpu_mhz                : 1866
hw_caps                : bfebfbff:20100800:00000000:00000140:0004e3bd:00000000:00000001
total_memory           : 8190
free_memory            : 12
node_to_cpu            : node0:0-7
xen_major              : 3
xen_minor              : 2
xen_extra              : .0
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
cc_compiler            : gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
cc_compile_by          : root
cc_compile_domain      : office.bigpoint.net
cc_compile_date        : Tue Mar 11 13:57:28 CET 2008
xend_config_format     : 4
root/mmoeller@srv002050:/root$ uname -r
2.6.21-2950.fc8xen

And here from a host that did not crash:

root/mmoeller@srv006215:/root$ xm info
host                   : srv006215
release                : 2.6.21-2950.fc8xen
version                : #1 SMP Tue Oct 23 12:23:33 EDT 2007
machine                : x86_64
nr_cpus                : 4
nr_nodes               : 1
cores_per_socket       : 4
threads_per_core       : 1
cpu_mhz                : 2394
hw_caps                : bfebfbff:20100800:00000000:00000140:0000e3bd:00000000:00000001
total_memory           : 8190
free_memory            : 10
node_to_cpu            : node0:0-3
xen_major              : 3
xen_minor              : 2
xen_extra              : .0
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
cc_compiler            : gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
cc_compile_by          : root
cc_compile_domain      : office.bigpoint.net
cc_compile_date        : Tue Mar 11 13:57:28 CET 2008
xend_config_format     : 4
root/mmoeller@srv006215:/root$ uname -r
2.6.21-2950.fc8xen

Does anyone have an idea how this could happen?

Thanks,
Moritz

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
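Comparing the two `xm info` dumps by eye is error-prone, so here is a minimal sketch of how the fields could be diffed. The function names are made up for illustration, and the sample strings are shortened excerpts of the two outputs above, not the full dumps. A difference between the hosts is only a lead to follow up, not evidence of a cause.

```python
def parse_xm_info(text):
    """Parse 'key : value' lines from `xm info` output into a dict.

    Only the first ':' separates key and value, so values such as
    hw_caps or node_to_cpu that contain ':' themselves stay intact.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def diff_info(a, b):
    """Return {key: (value_in_a, value_in_b)} for every differing key."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Shortened excerpts from the thread (crashed host vs. host that survived).
CRASHED = """\
host                   : srv002050
nr_cpus                : 8
cpu_mhz                : 1866
hw_caps                : bfebfbff:20100800:00000000:00000140:0004e3bd:00000000:00000001
free_memory            : 12
node_to_cpu            : node0:0-7
xen_major              : 3
xen_minor              : 2
"""

SURVIVED = """\
host                   : srv006215
nr_cpus                : 4
cpu_mhz                : 2394
hw_caps                : bfebfbff:20100800:00000000:00000140:0000e3bd:00000000:00000001
free_memory            : 10
node_to_cpu            : node0:0-3
xen_major              : 3
xen_minor              : 2
"""

differences = diff_info(parse_xm_info(CRASHED), parse_xm_info(SURVIVED))
for key, (crashed_value, survived_value) in differences.items():
    print(f"{key}: {crashed_value} | {survived_value}")
```

On these excerpts the Xen version fields match, while the CPU count, clock speed, `hw_caps`, free memory and CPU-to-node mapping differ.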
Ian Pratt
2008-Sep-19 12:44 UTC
RE: [Xen-devel] disk io errors possibly caused by high network load?
> we had a very strange situation yesterday. In one second, 13 of 25 xen
> boxes died with disk errors (domU and dom0, something like end_request:
> I/O error dev hda sector ...), but worked well again after a reboot.
>
> Some minutes before a technician plugged in a wrong cable, creating a
> network loop - so the error could be caused by a high network io load.
> The disks are okay, and the error occurred with both scsi raid
> controllers and plain sata disks.

This is quite remarkable -- I don't think anyone has reported anything similar before, despite there being many large Xen deployments.

Are you saying that IO errors were reported from both dom0 and the domUs? Did you actually track down the specific device major/minor that was reporting the error?

Is there any network storage (e.g. iSCSI, AOE) in your setup?

Ian
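As a rough illustration of Ian's question about which device reported the error, the sketch below tallies `end_request` I/O errors per device from kernel log text. The regular expression is an assumption based on the message quoted in the thread ("end_request: I/O error dev hda sector ...") and tolerates optional commas. The sample log lines are invented for the example, since no logs were saved. `device_numbers` is a hypothetical helper that only makes sense when run on the affected host.

```python
import os
import re
from collections import Counter

# Assumed message shape; commas are optional because the thread only
# quotes the message approximately.
IO_ERROR = re.compile(
    r"end_request: I/O error,?\s+dev\s+([^\s,]+),?\s+sector\s+(\d+)"
)


def count_io_errors(log_text):
    """Count end_request I/O errors per device name in kernel log text."""
    counts = Counter()
    for match in IO_ERROR.finditer(log_text):
        counts[match.group(1)] += 1
    return counts


def device_numbers(name):
    """Return (major, minor) of /dev/<name>; run this on the host itself."""
    rdev = os.stat(f"/dev/{name}").st_rdev
    return os.major(rdev), os.minor(rdev)


# Invented sample lines, NOT taken from the affected machines.
SAMPLE_LOG = """\
end_request: I/O error, dev sda, sector 123456
end_request: I/O error, dev sda, sector 123464
end_request: I/O error dev hda sector 99
"""

io_error_counts = count_io_errors(SAMPLE_LOG)
for device, count in io_error_counts.items():
    print(device, count)
```

Feeding it the output of `dmesg` or a saved kernel log from dom0 and from each domU would show whether the errors point at the physical disks, the virtual block devices, or both.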
Moritz Möller
2008-Sep-19 12:59 UTC
RE: [Xen-devel] disk io errors possibly caused by high network load?
We rebooted the machines really quickly because it was a production system, so I didn't have time to copy the logs, and I see nothing about this in the logfiles on the disks, probably because the IO was already down.

The machines are Supermicro, Intel Xeon quad-core or dual quad-core, 8 to 32 GB RAM. Some have an mdraid setup with two SATA drives on the onboard SATA controller (Intel ICH); others have a dedicated 3ware / AMCC 9660 or similar.

The machines that crashed were on different power lines and connected to different switches, although on the same network segment. There was also no physical interference.

The error was reported by domU and dom0, both saying the disk would give an I/O error, but with no specific information.

Network card is Intel e1000.

Lsmod:
Moritz Möller
2008-Sep-19 13:00 UTC
RE: [Xen-devel] disk io errors possibly caused by high network load?
Okay - wrong key. Message continued:

We rebooted the machines really quickly because it was a production system, so I didn't have time to copy the logs, and I see nothing about this in the logfiles on the disks, probably because the IO was already down.

The machines are Supermicro, Intel Xeon quad-core or dual quad-core, 8 to 32 GB RAM. Some have an mdraid setup with two SATA drives on the onboard SATA controller (Intel ICH); others have a dedicated 3ware / AMCC 9660 or similar.

The machines that crashed were on different power lines and connected to different switches, although on the same network segment. There was also no physical interference.

The error was reported by domU and dom0, both saying the local disk (either sda or sdb on mdraid systems, and sda on RAID systems) reports an I/O error, but with no specific information.

Network card is Intel e1000.

Lsmod:

nfs                   257112  1
w83792d                39320  0
w83781d                44840  0
i2c_isa                14720  1 w83781d
w83793                 46360  0
hwmon_vid              11264  2 w83781d,w83793
hwmon                  12040  3 w83792d,w83781d,w83793
ipmi_devintf           20112  0
ipmi_si                52812  0
ipmi_msghandler        47096  2 ipmi_devintf,ipmi_si
nls_utf8               10624  3
cifs                  228112  3
xt_physdev             11152  4
iptable_filter         11392  1
ip_tables              28648  1 iptable_filter
x_tables               29064  2 xt_physdev,ip_tables
ipv6                  339072  22
bridge                 64936  0
8021q                  29584  0
nfsd                  263848  1
exportfs               14336  1 nfsd
lockd                  74800  2 nfs,nfsd
nfs_acl                12160  2 nfs,nfsd
sunrpc                186344  5 nfs,nfsd,lockd,nfs_acl
blkbk                  30776  0 [permanent]
netbk                 105184  0 [permanent]
loop                   26768  0
8250_pnp               19968  0
sg                     45224  0
sr_mod                 26148  0
cdrom                  44072  1 sr_mod
i2c_i801               17052  0
iTCO_wdt               20432  0
iTCO_vendor_support    12548  1 iTCO_wdt
8250                   50120  1 8250_pnp
serial_core            31616  1 8250
i2c_core               32256  5 w83792d,w83781d,i2c_isa,w83793,i2c_i801
serio_raw              16004  0
pcspkr                 11776  0
joydev                 19584  0
ext3                  141200  2
jbd                    72432  1 ext3
mbcache                18184  1 ext3
dm_mirror              30528  0
dm_snapshot            25416  0
dm_mod                 69520  21 dm_mirror,dm_snapshot
raid1                  32768  3
sd_mod                 35200  8
usb_storage            90304  0
ata_piix               25092  6
ata_generic            17412  0
floppy                 68904  0
ehci_hcd               41100  0
uhci_hcd               32544  0
libata                126896  2 ata_piix,ata_generic
scsi_mod              166968  5 sg,sr_mod,sd_mod,usb_storage,libata
e1000                 130880  0
xenbus_be              12800  2 blkbk,netbk
xennet                 37512  0
xenblk                 26720  0
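Ian asked whether any network storage is involved. One hedged way to check this against an `lsmod` listing like the one above is to look for network filesystem or network block modules with a non-zero use count. The set of module names below is my own assumption of what to look for, not something stated in the thread, and the sample is a short excerpt of the listing rather than the whole thing. A non-zero use count only suggests that something is mounted or attached; it does not show that it played any role in the disk errors.

```python
# Assumed list of modules to look for; extend it for your own setup.
NETWORK_STORAGE_MODULES = {"nfs", "cifs", "iscsi_tcp", "aoe", "nbd"}


def parse_lsmod(text):
    """Parse lsmod rows (name, size, use count, optional users) into a dict."""
    modules = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines and the "Module Size Used by" header if present.
        if len(parts) < 3 or not parts[1].isdigit() or not parts[2].isdigit():
            continue
        name, size, used = parts[0], int(parts[1]), int(parts[2])
        users = []
        # The fourth column lists dependent modules; flags such as
        # [permanent] are not module names and are ignored here.
        if len(parts) > 3 and not parts[3].startswith("["):
            users = parts[3].split(",")
        modules[name] = {"size": size, "used": used, "users": users}
    return modules


def network_storage_in_use(modules):
    """Return {module: use_count} for network storage modules that are in use."""
    return {
        name: entry["used"]
        for name, entry in modules.items()
        if name in NETWORK_STORAGE_MODULES and entry["used"] > 0
    }


# Excerpt of the lsmod listing posted in the thread.
LSMOD_EXCERPT = """\
nfs                   257112  1
cifs                  228112  3
lockd                  74800  2 nfs,nfsd
blkbk                  30776  0 [permanent]
e1000                 130880  0
"""

storage_modules = network_storage_in_use(parse_lsmod(LSMOD_EXCERPT))
print(storage_modules)
```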
James Harper
2008-Sep-19 13:21 UTC
RE: [Xen-devel] disk io errors possibly caused by high network load?
> We rebooted the machines really quickly because it was a productive
> system, so I didn't have the time to copy the logs, and on the disks I
> see nothing about this in the logfiles, probably because the IO was
> already down.
>
> The machines are Supermicro, Intel Xeon Quad or Dual-Quadcore, 8 to 32
> GB RAM, and some have a mdraid setup with two SATA drives with the on
> board sata controller (intel ICH), other have a dedicated 3ware / AMCC
> 9660 or similar.
>
> The machines that crashed were on different power lines and connected to
> different switches, although on the same network segment. Also there
> were no physical interferences.
>
> The error was reported by domU and dom0 - both saying the disk would
> give a I/O error, but no specific information.
>
> Network card is intel e1000.

The error wasn't a timeout, was it?

We had a similar problem under Windows (no Xen involved at all) where the switch the server was plugged into was looped back to itself one evening. Any broadcast packet sent to the switch would just circulate around the switch indefinitely, until there were enough broadcast packets looping around that everything ground to a halt. The server was an HP DL380, so a more than capable machine, but there were enough interrupts occurring due to a completely saturated network that everything was reporting timeouts.

In this case the server didn't require a reboot. It sat in that state the whole night, reporting disk timeouts etc., but the moment we rectified the cabling fault in the morning it instantly bounced back to life.

It could be that Linux treats timeout errors a little more severely? Can anyone say if the layer above blkfront in the Linux kernel will report timeouts? Or would the errors have been coming through from Dom0?

Anyway, do you have a test environment you can reproduce the problem on? If the problem is as simple as a looped switch then it shouldn't be too hard to reproduce...

James
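If the loop is reproduced in a test environment, one way to see whether the interrupt load James describes actually shows up is to compare two snapshots of `/proc/interrupts` taken a known number of seconds apart. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the two snapshots are fabricated numbers for a two-CPU box, not measurements from any machine in this thread. A high rate on the NIC's interrupt line would be consistent with the saturation theory but would not by itself explain the disk errors.

```python
def parse_interrupts(text):
    """Return {irq_label: (total_count_across_cpus, last_column)}.

    Expects /proc/interrupts layout: a header row naming the CPUs,
    then one row per interrupt with one counter per CPU.
    """
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())
    totals = {}
    for line in lines[1:]:
        parts = line.split()
        if not parts or not parts[0].endswith(":"):
            continue
        label = parts[0].rstrip(":")
        counts = []
        for token in parts[1:1 + ncpus]:
            if not token.isdigit():
                break
            counts.append(int(token))
        # The last column is usually the device or driver name.
        device = parts[-1] if len(parts) > 1 + len(counts) else ""
        totals[label] = (sum(counts), device)
    return totals


def interrupt_rates(before, after, seconds):
    """Interrupts per second for each IRQ present in both snapshots."""
    return {
        label: (after[label][0] - before[label][0]) / seconds
        for label in after
        if label in before
    }


# Fabricated snapshots, taken 10 seconds apart in this made-up example.
BEFORE = """\
           CPU0       CPU1
 16:       1000       2000   IO-APIC-fasteoi   eth0
 17:        500        700   IO-APIC-fasteoi   ata_piix
"""

AFTER = """\
           CPU0       CPU1
 16:      51000      52000   IO-APIC-fasteoi   eth0
 17:        520        730   IO-APIC-fasteoi   ata_piix
"""

rates = interrupt_rates(parse_interrupts(BEFORE), parse_interrupts(AFTER), 10.0)
for label, rate in rates.items():
    print(label, rate)
```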
Moritz Möller
2008-Sep-19 13:23 UTC
RE: [Xen-devel] disk io errors possibly caused by high network load?
A strange thing is that not a single non-Xen machine went down.

I will set up two Xen machines on a switch with a loop and see what I'll get ;)