thr3ads.net - Xen devel - [Xen-devel] disk io errors possibly caused by high network load? [Sep 2008]

Moritz Möller

2008-Sep-19 11:39 UTC

[Xen-devel] disk io errors possibly caused by high network load?

Hi,

We had a very strange situation yesterday. Within one second, 13 of 25 Xen
boxes died with disk errors (domU and dom0, something like end_request:
I/O error dev hda sector ...), but worked fine again after a reboot.

A few minutes earlier, a technician had plugged in the wrong cable, creating
a network loop, so the errors could have been caused by high network I/O load.
The disks are okay, and the errors occurred with both SCSI RAID
controllers and plain SATA disks.

Here is some info from a host that crashed:

root/mmoeller@srv002050:/root$ xm info
host                   : srv002050
release                : 2.6.21-2950.fc8xen
version                : #1 SMP Tue Oct 23 12:23:33 EDT 2007
machine                : x86_64
nr_cpus                : 8
nr_nodes               : 1
cores_per_socket       : 4
threads_per_core       : 1
cpu_mhz                : 1866
hw_caps                : bfebfbff:20100800:00000000:00000140:0004e3bd:00000000:00000001
total_memory           : 8190
free_memory            : 12
node_to_cpu            : node0:0-7
xen_major              : 3
xen_minor              : 2
xen_extra              : .0
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
cc_compiler            : gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
cc_compile_by          : root
cc_compile_domain      : office.bigpoint.net
cc_compile_date        : Tue Mar 11 13:57:28 CET 2008
xend_config_format     : 4
root/mmoeller@srv002050:/root$ uname -r
2.6.21-2950.fc8xen

And here is the same info from a host that did not crash:

root/mmoeller@srv006215:/root$ xm info
host                   : srv006215
release                : 2.6.21-2950.fc8xen
version                : #1 SMP Tue Oct 23 12:23:33 EDT 2007
machine                : x86_64
nr_cpus                : 4
nr_nodes               : 1
cores_per_socket       : 4
threads_per_core       : 1
cpu_mhz                : 2394
hw_caps                : bfebfbff:20100800:00000000:00000140:0000e3bd:00000000:00000001
total_memory           : 8190
free_memory            : 10
node_to_cpu            : node0:0-3
xen_major              : 3
xen_minor              : 2
xen_extra              : .0
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
cc_compiler            : gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)
cc_compile_by          : root
cc_compile_domain      : office.bigpoint.net
cc_compile_date        : Tue Mar 11 13:57:28 CET 2008
xend_config_format     : 4
root/mmoeller@srv006215:/root$ uname -r
2.6.21-2950.fc8xen
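
Comparing the two dumps by eye is error-prone, so here is a minimal Python sketch that parses two `xm info` outputs and lists only the fields that differ. The function names `parse_xm_info` and `diff_xm_info` are made up for this illustration, and the parser assumes the "name : value" layout shown above, with mail-wrapped values continuing on the next line.

```python
import re

# Matches a field line such as "nr_cpus                : 8".
# Requiring whitespace before the colon keeps wrapped continuation lines
# (for example the hw_caps value "bfebfbff:2010...") from being read as fields.
FIELD = re.compile(r"^(\w+)\s+:\s*(.*)$")


def parse_xm_info(text):
    """Turn `xm info` output into a dict, re-joining wrapped values."""
    info = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = FIELD.match(line)
        if match:
            last_key, value = match.group(1), match.group(2)
            info[last_key] = value
        elif last_key is not None:
            # Continuation of the previous field's value.
            info[last_key] = (info[last_key] + " " + line).strip()
    return info


def diff_xm_info(text_a, text_b):
    """Return {field: (value_a, value_b)} for every field that differs."""
    a, b = parse_xm_info(text_a), parse_xm_info(text_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

For the two hosts above, the only fields that differ are host, nr_cpus, cpu_mhz, hw_caps, free_memory and node_to_cpu; the kernel, Xen version and build information are identical.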

Does anyone have an idea how this could happen?


Thanks,


Moritz


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

Ian Pratt

2008-Sep-19 12:44 UTC

RE: [Xen-devel] disk io errors possibly caused by high network load?

> we had a very strange situation yesterday. In one second, 13 of 25 xen
> boxes died with disk errors (domU and dom0, something like end_request:
> I/O error dev hda sector ...), but worked well again after a reboot.
> 
> Some minutes before a technician plugged in a wrong cable, creating a
> network loop - so the error could be caused by a high network io load.
> The disks are okay, and the error occurred with both scsi raid
> controllers and plain sata disks.
This is quite remarkable -- I don't think anyone has reported anything
similar before, despite there being many large Xen deployments.

Are you saying that I/O errors were reported from both dom0 and the
domUs?

Did you actually track down the specific device major/minor that was reporting
the error?
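
One way to answer this from saved kernel logs is sketched below in Python. It assumes the error lines look roughly like the "end_request: I/O error dev hda sector ..." message quoted above (with or without commas) and that a copy of /proc/partitions from the same machine is available. The function names are hypothetical, and the exact log format may differ between kernel versions.

```python
import re
from collections import Counter

# Tolerates both "I/O error, dev sda, sector 1234" and the comma-less
# form quoted in the original report.
IO_ERROR = re.compile(
    r"end_request:\s*I/O error,?\s+dev\s+([\w/!-]+),?\s+sector\s+(\d+)"
)


def parse_partitions(text):
    """Map device name -> (major, minor) from /proc/partitions-style text."""
    devices = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have four columns: major, minor, #blocks, name.
        if len(fields) == 4 and fields[0].isdigit() and fields[1].isdigit():
            devices[fields[3]] = (int(fields[0]), int(fields[1]))
    return devices


def failing_devices(log_text, partitions_text):
    """Count I/O errors per device and attach major/minor where known."""
    numbers = parse_partitions(partitions_text)
    counts = Counter(m.group(1) for m in IO_ERROR.finditer(log_text))
    return {
        dev: {"errors": n, "major_minor": numbers.get(dev)}
        for dev, n in counts.items()
    }
```

A device that appears in the log but not in the partitions listing gets `None` as its major/minor, which is itself a hint that the name refers to something other than a local block device on that host.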

Is there any network storage (e.g. iSCSI, AOE) in your setup?

Ian 

Moritz Möller

2008-Sep-19 12:59 UTC

RE: [Xen-devel] disk io errors possibly caused by high network load?

We rebooted the machines really quickly because it was a production
system, so I didn't have time to copy the logs, and I see nothing about
this in the logfiles on the disks, probably because the I/O was already
down.

The machines are Supermicro, Intel Xeon quad-core or dual quad-core, 8 to 32
GB RAM. Some have an mdraid setup with two SATA drives on the onboard
SATA controller (Intel ICH); others have a dedicated 3ware / AMCC
9660 or similar.

The machines that crashed were on different power lines and connected to
different switches, although on the same network segment. There was also
no physical interference.

The error was reported by domU and dom0, both saying the disk gave
an I/O error, but with no specific information.

The network card is an Intel e1000.

Lsmod:




Moritz Möller

2008-Sep-19 13:00 UTC

RE: [Xen-devel] disk io errors possibly caused by high network load?

Okay - wrong key. Message continued

-----Original Message-----
From: Moritz Möller 
Sent: Friday, September 19, 2008 3:00 PM
To: 'Ian Pratt'; xen-devel@lists.xensource.com
Subject: RE: [Xen-devel] disk io errors possibly caused by high network
load?

We rebooted the machines really quickly because it was a production
system, so I didn't have time to copy the logs, and I see nothing about
this in the logfiles on the disks, probably because the I/O was already
down.

The machines are Supermicro, Intel Xeon quad-core or dual quad-core, 8 to 32
GB RAM. Some have an mdraid setup with two SATA drives on the onboard
SATA controller (Intel ICH); others have a dedicated 3ware / AMCC
9660 or similar.

The machines that crashed were on different power lines and connected to
different switches, although on the same network segment. There was also
no physical interference.

The error was reported by domU and dom0, both saying the local disk
(either sda or sdb on mdraid systems, and sda on RAID systems) reported an
I/O error, but with no specific information.

The network card is an Intel e1000.

Lsmod:

nfs                   257112  1
w83792d                39320  0
w83781d                44840  0
i2c_isa                14720  1 w83781d
w83793                 46360  0
hwmon_vid              11264  2 w83781d,w83793
hwmon                  12040  3 w83792d,w83781d,w83793
ipmi_devintf           20112  0
ipmi_si                52812  0
ipmi_msghandler        47096  2 ipmi_devintf,ipmi_si
nls_utf8               10624  3
cifs                  228112  3
xt_physdev             11152  4
iptable_filter         11392  1
ip_tables              28648  1 iptable_filter
x_tables               29064  2 xt_physdev,ip_tables
ipv6                  339072  22
bridge                 64936  0
8021q                  29584  0
nfsd                  263848  1
exportfs               14336  1 nfsd
lockd                  74800  2 nfs,nfsd
nfs_acl                12160  2 nfs,nfsd
sunrpc                186344  5 nfs,nfsd,lockd,nfs_acl
blkbk                  30776  0 [permanent]
netbk                 105184  0 [permanent]
loop                   26768  0
8250_pnp               19968  0
sg                     45224  0
sr_mod                 26148  0
cdrom                  44072  1 sr_mod
i2c_i801               17052  0
iTCO_wdt               20432  0
iTCO_vendor_support    12548  1 iTCO_wdt
8250                   50120  1 8250_pnp
serial_core            31616  1 8250
i2c_core               32256  5 w83792d,w83781d,i2c_isa,w83793,i2c_i801
serio_raw              16004  0
pcspkr                 11776  0
joydev                 19584  0
ext3                  141200  2
jbd                    72432  1 ext3
mbcache                18184  1 ext3
dm_mirror              30528  0
dm_snapshot            25416  0
dm_mod                 69520  21 dm_mirror,dm_snapshot
raid1                  32768  3
sd_mod                 35200  8
usb_storage            90304  0
ata_piix               25092  6
ata_generic            17412  0
floppy                 68904  0
ehci_hcd               41100  0
uhci_hcd               32544  0
libata                126896  2 ata_piix,ata_generic
scsi_mod              166968  5 sg,sr_mod,sd_mod,usb_storage,libata
e1000                 130880  0
xenbus_be              12800  2 blkbk,netbk
xennet                 37512  0
xenblk                 26720  0
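
Regarding the earlier question about network storage: the listing above shows nfs and cifs loaded with non-zero use counts. A small Python sketch for checking this mechanically follows. The set of module names treated as "network storage" is my own illustrative choice, not an authoritative list, and the function names are hypothetical.

```python
# Illustrative, non-exhaustive set of kernel modules that indicate
# storage reached over the network (an assumption for this sketch).
NETWORK_STORAGE_MODULES = {"nfs", "cifs", "iscsi_tcp", "libiscsi", "aoe", "nbd"}


def parse_lsmod(text):
    """Parse lsmod output into {module: {"size", "used", "used_by"}}."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the "Module Size Used by" header and any malformed lines.
        if len(fields) < 3 or not fields[1].isdigit() or not fields[2].isdigit():
            continue
        used_by = []
        # The optional fourth column lists dependants; "[permanent]" is a flag.
        if len(fields) > 3 and not fields[3].startswith("["):
            used_by = fields[3].split(",")
        modules[fields[0]] = {
            "size": int(fields[1]),
            "used": int(fields[2]),
            "used_by": used_by,
        }
    return modules


def network_storage_in_use(text):
    """Return the network-storage modules whose use count is non-zero."""
    modules = parse_lsmod(text)
    return sorted(
        name
        for name, entry in modules.items()
        if name in NETWORK_STORAGE_MODULES and entry["used"] > 0
    )
```

A non-zero use count only shows that something holds a reference to the module; whether the failing disks themselves sat on NFS or CIFS still has to be checked against the domU configurations.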





James Harper

2008-Sep-19 13:21 UTC

RE: [Xen-devel] disk io errors possibly caused by high network load?

> We rebooted the machines really quickly because it was a productive
> system, so I didn't have the time to copy the logs, and on the disks I
> see nothing about this in the logfiles, probably because the IO was
> already down.
> 
> The machines are Supermicro, Intel Xeon Quad or Dual-Quadcore, 8 to 32
> GB RAM, and some have a mdraid setup with two SATA drives with the on
> board sata controller (intel ICH), others have a dedicated 3ware / AMCC
> 9660 or similar.
> 
> The machines that crashed were on different power lines and connected to
> different switches, although on the same network segment. Also there
> were no physical interferences.
> 
> The error was reported by domU and dom0 - both saying the disk would
> give an I/O error, but no specific information.
> 
> Network card is intel e1000.
The error wasn't a timeout, was it? We had a similar problem under
Windows (no Xen involved at all) where the switch the server was plugged into
was looped back to itself one evening. Any broadcast packet sent to the switch
would just circulate around the switch indefinitely, until there were enough
broadcast packets looping around that everything ground to a halt.

The server was an HP DL380, so a more than capable machine, but there were enough
interrupts occurring due to a completely saturated network that everything was
reporting timeouts. In this case the server didn't require a reboot. It
sat in that state the whole night, reporting disk timeouts etc., but the moment we
rectified the cabling fault in the morning it instantly bounced back to life.
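
To check on a Linux host whether network interrupts are swamping the machine during such a storm, one could compare two snapshots of /proc/interrupts. The Python sketch below is only an illustration under that assumption: the function names are hypothetical, and it expects the usual layout of one header row of CPU names followed by one row per interrupt source.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {irq: (total_count, description)}."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        # Per-CPU counters come first; some rows carry fewer than ncpu.
        for token in fields[:ncpu]:
            if not token.isdigit():
                break
            counts.append(int(token))
        description = " ".join(fields[len(counts):])
        totals[label.strip()] = (sum(counts), description)
    return totals


def interrupt_rates(before, after, seconds):
    """Return {irq: (interrupts_per_second, description)} between two snapshots."""
    first, second = parse_interrupts(before), parse_interrupts(after)
    rates = {}
    for irq, (count, description) in second.items():
        if irq in first:
            rates[irq] = ((count - first[irq][0]) / seconds, description)
    return rates
```

Taking the two snapshots a few seconds apart and sorting the result by rate would show whether the line belonging to the network card (eth0 on many systems) dominates while disk timeouts are being reported.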

It could be that Linux treats timeout errors a little more severely?

Can anyone say if the layer above blkfront in the Linux kernel will report
timeouts? Or would the errors have been coming through from Dom0?

Anyway, do you have a test environment you can reproduce the problem on? If the
problem is as simple as a looped switch then it shouldn't be too hard
to reproduce...

James



Moritz Möller

2008-Sep-19 13:23 UTC

RE: [Xen-devel] disk io errors possibly caused by high network load?

A strange thing is that not a single non-Xen machine went down.

I will set up two Xen machines on a switch with a loop and see what
I get ;)
