Andreas Olsowski
2011-Aug-11 13:59 UTC
[Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
Hello xen-devel,

as one of the people using Dell servers I am aware that the LSI megaraid drivers are quite old in the current 2.6.32 pvops tree, but it seems that, once again, I have run into problems that are rarer than the usual "can't find disk" issues (of which I have never had any).

The situation:
--------------
I have 2 dom0 kernels, 2.6.32.44 and 3.0.1, that work fine when booted bare-metal. I can run "stress -m 40 -d 4 -i 1" for hours on end without any error occurring.
The 2.6.32.44 kernels use version 00.00.05.30 megasas modules.

When I boot that kernel on my R610 servers under xen (4.1 and 4.2), the kernels work fine too. I create 10 virtual machines, each running 4 "stress -m 40", and can do disk I/O on my local storage as much as I want to.

But on my Dell R710 system things don't look so good.
Booted bare-metal, both kernels work fine.
When I boot them as dom0 under xen, everything seems to be okay at first.
Then I create my 10 virtual machines that put some load on the memory.
As soon as I do I/O to the local disk (even an "ls /usr/src/" can suffice), I/O freezes and the system stops responding to anything that requires disk access.
After a while the kernel starts spewing out error messages:

#### lots of these
sd 0:2:0:0: [sda] megasas: RESET -83318 cmd=2a retries=0
megaraid_sas: HBA reset handler invoked without an internal reset condition.
megasas: [ 0]waiting for 16 commands to complete
megaraid_sas: no more pending commands remain after reset handling.
megasas: reset successful
###

### then some of these
sd 0:2:0:0: Device offlined - not ready after error recovery
###

### goes on to
sd 0:2:0:0: [sda] Unhandled error code
sd 0:2:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
sd 0:2:0:0: [sda] CDB: Write(10): 2a 00 08 45 6f 00 00 01 88 00
end_request: I/O error, dev sda, sector 138768128
Buffer I/O error on device sda2, logical block 5138912
lost page write due to I/O error on sda2
Buffer I/O error on device sda2, logical block 5138913
###

### and finally these, as often as one tries to access the disk
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device

If a kernel works fine on one set of servers (Dell R610 with LSI Logic / Symbios Logic LSI MegaSAS 9260 (rev 05) raid controllers) and crashes on another server (Dell R710 with a LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04) raid controller), it would seem logical to assume that the kernel does not support the hardware properly.
But when run bare-metal, no errors occur.

I for one have run out of things to try. The R710 worked fine before I upgraded its firmware to the most current versions and went from xen4.0.1 to xen4.1/4.2.

So I put it to you, fine sirs of xen-devel, is it:
A.) a hardware problem, because the software works on different hardware
or
B.) a xen problem, because the hardware runs fine in a non-virtualized scenario with the same kernel

Or is it something else entirely?

Help, input, questions and suggestions are, as always, greatly appreciated.

With best regards

-- 
Andreas Olsowski
Leuphana Universität Lüneburg
Rechen- und Medienzentrum
Scharnhorststraße 1, C7.015
21335 Lüneburg
Tel: ++49 4131 677 1309

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Simon Rowe
2011-Aug-11 16:27 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On Thursday 11 August 2011 14:59:39 Andreas Olsowski wrote:
> I have 2 dom0 kernels, 2.6.32.44 and 3.0.1 that work fine when booted
> bare-metal. I can run stress -m 40 -d 4 -i 1 for hours on end without
> any error occurring.
> The 2.6.32.44 kernels use version 00.00.05.30 megasas modules.

I've had reports of something similar with a Dell R710 and MegaRAID SAS controller. We're using version 00.00.05.33 of megaraid_sas on a classic Xen 2.6.32 kernel.

Simon
Konrad Rzeszutek Wilk
2011-Aug-11 22:51 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On Thu, Aug 11, 2011 at 05:27:25PM +0100, Simon Rowe wrote:
> I've had reports of something similar with a Dell R710 and MegaRAID SAS
> controller. We're using version 00.00.05.33 of megaraid_sas on a classic Xen
> 2.6.32 kernel.

With the same version of hypervisor?

Andreas,

Can you try running without the irqbalance daemon?
Mark Schneider
2011-Aug-12 06:31 UTC
[Xen-devel] xen.frontend flag for higher display resolution (vnc) for HVM domU domains
Good morning Konrad,

As far as I understood, you have added a flag for HVM domU domains to get higher display resolution.

Where can I find more details about the syntax, and how can I check whether my xen packages (4.1.2*) already have this patch? (Boris xen 4.1.2* sources)

Thank you in advance for a short hint.

Best regards, Mark

PS: I tested my last xen debian live images with:
http://www.it-infrastrukturen.com/fileadmin/linux/debian-live-xen/README.xen-live

rsync -avP rsync://www.it-infrastrukturen.ch/ftp/xen411-wheezy-kernel3-amd64-live-gnome-binary-hybrid.iso .

I used kernel 3.0.1 final there with the following kernel config:
http://www.it-infrastrukturen.com/fileadmin/linux/debian-live-xen/config-3.0.1

-- 
ms@it-infrastrukturen.org
Marc - A. Dahlhaus
2011-Aug-12 07:26 UTC
Re: [Xen-devel] xen.frontend flag for higher display resolution (vnc) for HVM domU domains
Mark,

could you please stop hijacking other threads to start your own?

You reuse the unmodified "References" header in doing so, and your mail will thus end up following the hijacked thread. This is bad practice. Other readers of this list who use a threaded view will delete an entire collapsed thread tree (with your mail included) if they have no interest in the topic that you hijacked.

It would be in your own interest to change your habits here.

Thanks,
Marc
Simon Rowe
2011-Aug-12 07:42 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On Thursday 11 August 2011 23:51:19 Konrad Rzeszutek Wilk wrote:
> With the same version of hypervisor?

Xen 4.1.1 (but with a slew of patches),

Simon
Simon Rowe
2011-Aug-12 09:02 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
I've now reproduced this on a Dell R905 (16 cores, 265GB) which also has a MegaRAID SAS 1078,

Simon

INFO: task kjournald:1691 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kjournald D 00003b87 0 1691 2 0x00000000
 eac3fed4 00000246 e92f92b8 00003b87 00000001 00000000 c16cc580 eac3fe74
 7ea17f6b 00003b72 ebd2b954 ebd2b844 ebd2b7b0 ebd2b954 c16cc100 00000001
 d3aad740 c0122a72 c16cc138 0002a01a 00000000 00000005 01038103 00008888
Call Trace:
 [<c0122a72>] ? update_curr+0x72/0xf0
 [<c0142d4c>] ? prepare_to_wait+0x4c/0x60
 [<ed9af3dc>] journal_commit_transaction+0x12c/0x1060 [jbd]
 [<c0382a69>] ? schedule+0x2e9/0x970
 [<c0142b10>] ? autoremove_wake_function+0x0/0x50
 [<c0138234>] ? try_to_del_timer_sync+0x54/0x80
 [<ed9b31e0>] kjournald+0xc0/0x240 [jbd]
 [<c0142b10>] ? autoremove_wake_function+0x0/0x50
 [<ed9b3120>] ? kjournald+0x0/0x240 [jbd]
 [<c0142854>] kthread+0x74/0x80
 [<c01427e0>] ? kthread+0x0/0x80
 [<c010483b>] kernel_thread_helper+0x7/0x10

...

sd 0:2:0:0: [sda] megasas: RESET -46318 cmd=2a retries=0
megaraid_sas: HBA reset handler invoked without an internal reset condition.
megaraid_sas: no more pending commands remain after reset handling.
megasas: reset successful
Andreas Olsowski
2011-Aug-12 09:11 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
(Oops, I accidentally used "reply" instead of "reply-list" ...)

> Andreas,
>
> Can you try running without irqbalance daemon?

I do not have one running:

root@tarballerina:/etc/init.d# ps aux |grep irq |grep -v grep
root 4 0.0 0.0 0 0 ? S Aug09 0:00 [ksoftirqd/0]
...

Nor do I have one installed:

root@tarballerina:/etc/init.d# locate irq |grep balance
root@tarballerina:/etc/init.d# echo $?
1

Should I have one installed and running?

-- 
Andreas Olsowski
Simon Rowe
2011-Aug-12 09:23 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On Friday 12 August 2011 10:11:45 Andreas Olsowski wrote:
> > Can you try running without irqbalance daemon?
>
> Should i have one installed and running?

We run with irqbalance normally but it has no impact on this issue. In general we've seen significant performance improvements since enabling it, particularly in network throughput.

Simon
Pasi Kärkkäinen
2011-Aug-12 16:25 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On Thu, Aug 11, 2011 at 03:59:39PM +0200, Andreas Olsowski wrote:
> Hello xen-devel,

Hello,

> as one of the people using Dell Servers i am aware that the LSI megaraid
> drivers are quite old in the current 2.6.32 pvops tree,
> but it seems that, once again, i have run into problems that are more
> rare than the usual "cant find disk" issues. (Of which i had none, ever)

Btw, did you see this thread about LSI drivers and 2.6.32:
http://lists.xensource.com/archives/html/xen-devel/2010-11/msg00250.html

I've been successfully using version 4.3x megaraid_sas drivers (latest available from LSI's support site).

-- Pasi
Pasi Kärkkäinen
2011-Aug-12 16:26 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On Fri, Aug 12, 2011 at 10:02:38AM +0100, Simon Rowe wrote:
> I've now reproduced this on a Dell R905 (16 cores, 265GB) which also has a
> MegaRAID SAS 1078,

Hey,

Does this help:
http://lists.xensource.com/archives/html/xen-devel/2010-11/msg00250.html

i.e. update to the latest LSI megaraid_sas driver (available from LSI's support site).

-- Pasi
Simon Rowe
2011-Aug-15 07:44 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On Friday 12 August 2011 17:26:21 Pasi Kärkkäinen wrote:
> Does this help:
> http://lists.xensource.com/archives/html/xen-devel/2010-11/msg00250.html
>
> ie. update to latest LSI megaraid_sas driver.. (available from LSI's
> support site).

I'll give it a go, but I know 4.27 still has the issue and I would assume our newer driver (5.33) would contain any relevant fixes,

Simon
Simon Rowe
2011-Aug-15 10:49 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
I've found adding

options megaraid_sas poll_mode_io=1

makes both of the systems we're seeing this on stable.

Have you run your system with Xen 3.4? Did you see the same behaviour?

Simon
Andreas Olsowski
2011-Aug-15 12:52 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On 08/15/2011 12:49 PM, Simon Rowe wrote:
> I've found adding
>
> options megaraid_sas poll_mode_io=1
>
> makes both of the systems we're seeing this on stable.

I've been told to try that one and it works for me too (I have been running test I/O for roughly 5 minutes now).

Driver version:
megasas: 00.00.05.30 Tue. Jan. 4 17:00:00 PDT 2011

-- 
Andreas Olsowski
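For readers who want to keep this workaround across reboots, here is a minimal sketch of one way to persist the option line quoted above. The directory /etc/modprobe.d is the usual modprobe configuration location; the file name megaraid_sas.conf and the helper name write_megasas_poll_conf are arbitrary choices made for this illustration, not something from the thread.

```shell
#!/bin/sh
# Sketch only: write the poll_mode_io workaround into a modprobe.d fragment.
# The helper name and the file name "megaraid_sas.conf" are hypothetical.
# Usage: write_megasas_poll_conf [conf_dir]   (default: /etc/modprobe.d)
write_megasas_poll_conf() {
    conf_dir="${1:-/etc/modprobe.d}"
    conf_file="$conf_dir/megaraid_sas.conf"
    # The option line is exactly the one reported to work in this thread.
    printf 'options megaraid_sas poll_mode_io=1\n' > "$conf_file" || return 1
    # Print the path that was written so the caller can inspect it.
    printf '%s\n' "$conf_file"
}
```

Run as root, `write_megasas_poll_conf` with no argument writes to /etc/modprobe.d. If the driver is loaded from an initramfs, the image probably has to be regenerated before the option takes effect at boot (on Debian-style systems, `update-initramfs -u`); that step is an assumption about the setup, not something confirmed in the thread.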
Andrew Cooper
2011-Aug-19 12:28 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On 15/08/11 13:52, Andreas Olsowski wrote:
> On 08/15/2011 12:49 PM, Simon Rowe wrote:
>> I've found adding
>>
>> options megaraid_sas poll_mode_io=1
>>
>> makes both of the systems we're seeing this on stable.
> ive been told to try that one and it works for me too (been running test
> io for roughly 5 minutes now).
>
> driver version
> megasas: 00.00.05.30 Tue. Jan. 4 17:00:00 PDT 2011

Hello - I am now debugging.

It seems that the megaraid_sas driver will try to use either MSI-X or legacy PCI interrupt mode, but will never try to use MSI. The box we can reproduce the problem on has MSI support but not MSI-X support.

As an experiment, I put a single call to pci_enable_msi() in the megasas_probe_one() function, immediately after pci_set_master(). I now cannot reproduce the problem.

Do any of the boxes you have which reproduce the problem set up MSI-X interrupts for the megasas driver, or are they all using legacy PCI interrupts?

(I am also emailing an LSI contact asking why they do not use MSI interrupts)

~Andrew

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
Andreas Olsowski
2011-Aug-19 14:17 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On 19.08.2011 14:28, Andrew Cooper wrote:
> Do any of the boxes you have which reproduce the problem set up MSI-X
> interrupts for the megasas driver, or are they all using legacy PCI
> interrupts?

No, the affected systems DO NOT use MSI-X.

Below is output from 3 servers: xenturio1 and tarballerina are affected (same old raid controller), whereas netcatarina is not (newer raid controller).

May I add, the 1078 series raid controller isn't listed on the LSI homepage; the 9260 is.

The affected servers are Dell PE2950 and R710.
Unaffected are the R610s.

Hope this helps, below is some output of the servers:

root@xenturio1:~# cat /proc/interrupts |grep mega
  16: 2545 0 0 0 0 0 0 0 xen-pirq-ioapic-level megasas
root@xenturio1:~# lspci |grep LSI
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)

root@tarballerina:~# cat /proc/interrupts |grep mega
  33: 47264 0 0 0 0 0 0 0 xen-pirq-ioapic-level megasas
root@tarballerina:~# lspci |grep LSI
03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)

root@netcatarina:~# cat /proc/interrupts |grep mega
2237: 88684 0 0 0 0 0 0 0 xen-pirq-msi-x megasas
root@netcatarina:~# lspci |grep LSI
03:00.0 RAID bus controller: LSI Logic / Symbios Logic LSI MegaSAS 9260 (rev 05)

-- 
Andreas Olsowski
Andrew Cooper
2011-Aug-19 14:57 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On 19/08/11 15:17, Andreas Olsowski wrote:
> No the affected systems DO NOT use MSI-X
>
> Below is output from 3 Servers, xenturio1 and tarballerina are
> affected (same old raid controller) whereas netcatarina is not (newer
> raid controller).

This further confirms my findings.

Do you mind inserting a call to pci_enable_msi() in the probe function and seeing if that sorts out your two problem cases?

-- 
Andrew Cooper
Andreas Olsowski
2011-Aug-19 16:37 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On 19.08.2011 16:57, Andrew Cooper wrote:
> This further confirms my findings.
>
> Do you mind intserting a call to pci_enable_msi() in the probe function
> and see if that sorts out your two problem cases?

    /* Try to enable MSI-X */
    if ((instance->pdev->device != PCI_DEVICE_ID_LSI_SAS1078R) &&
        (instance->pdev->device != PCI_DEVICE_ID_LSI_SAS1078DE) &&
        (instance->pdev->device != PCI_DEVICE_ID_LSI_VERDE_ZCR) &&
        !msix_disable && !pci_enable_msix(instance->pdev,
        &instance->msixentry, 1))
            instance->msi_flag = 1;

My device is a:
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)

and this is excluded for some reason.
There are more references to this particular type of raid controller.

Can you think of a reason why MSI would not work on some specific hardware?

Anyway, since I don't have much C experience and wouldn't know where to put "pci_enable_msi()" exactly, I modified the above-mentioned code to just do:

    instance->msi_flag = 1;

I also removed "options megaraid_sas poll_mode_io=1" from the module options.

The kernel did not boot properly at that point.
Something about /dev not being mounted due to a missing device, followed by udev not doing much and the system generally doing nothing that would further the bootup process.

So I went back and put that module option in again.

It still did not boot properly, here is some output:

Loading, please wait...
mount: mounting none on /dev failed: No such device
...
megasas: 00.00.05.30 Tue. Jan. 4 17:00:00 PDT 2011
megasas: 0x1000:0x0060:0x1028:0x1f0c: bus 1:slot 0:func 0
megaraid_sas 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
megaraid_sas 0000:01:00.0: setting latency timer to 64
megasas: FW now in Ready state
megasas: cpx is not supported.
megasas: INIT adapter done
...

Then came udev failing to load something, and basically nothing happened after that point.

I guess there really is a good reason not to enable MSI for that type of controller.

The fact that the corresponding problems from not having the module option do not happen on bare-metal isn't very helpful either.
Especially when you can't test the kernel bare-metal due to the fact that it won't boot bare-metal anymore ... but I digress.

I guess an acceptable fix would be to make the module option a default for those raid controllers in the next version of the megasas modules.

Glad to be of service.

With best regards:

-- 
Andreas Olsowski
Andrew Cooper
2011-Aug-19 16:49 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On 19/08/11 17:37, Andreas Olsowski wrote:
> Anyway since i dont have much C-experience and wouldnt know where to
> put " pci_enable_msi() " exactly, i modified the above mentioned code
> to just do:
> instance->msi_flag = 1;

The only change you need to make is in megasas_probe_one() in megaraid_sas_base.c.

Add a call to pci_enable_msi(pdev) immediately after the current call to pci_set_master(pdev);

~Andrew
Andreas Olsowski
2011-Aug-19 18:10 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On 19.08.2011 18:49, Andrew Cooper wrote:
> The only change you need to make is in megasas_probe_one() in
> megaraid_sas_base.c
>
> Add a call to pci_enable_msi(pdev) immediately after the current call to
> pci_set_master(pdev);
>
> ~Andrew

Yep, that works fine. Removed the module option as well.

root@tarballerina:~# cat /proc/interrupts |grep mega
2236: 69010 0 0 0 0 0 0 0 xen-pirq-msi megasas

The same procedure that would have led to almost instant errors before has not made them appear again.

-- 
Andreas Olsowski
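The /proc/interrupts checks used throughout this thread can be wrapped in a small helper. The following is a sketch under stated assumptions: it only recognises the three Xen pirq labels that appear in the outputs above (xen-pirq-ioapic-level, xen-pirq-msi-x, xen-pirq-msi) and treats anything else as legacy; labels on bare-metal kernels differ and are not handled. The function name megasas_irq_mode is made up for this example.

```shell
#!/bin/sh
# Sketch only: report "<irq> <mode>" for every megasas line in an
# interrupts table. Mode is derived from the Xen pirq labels seen in
# this thread; other labels fall through to "legacy".
# Usage: megasas_irq_mode [file]   (default: /proc/interrupts)
megasas_irq_mode() {
    awk '/megasas/ {
        irq = $1
        sub(/:$/, "", irq)            # strip the trailing colon from "16:"
        mode = "legacy"               # e.g. xen-pirq-ioapic-level
        if ($0 ~ /xen-pirq-msi-x/)    # test MSI-X first, since "msi" is a prefix of it
            mode = "msi-x"
        else if ($0 ~ /xen-pirq-msi/)
            mode = "msi"
        print irq, mode
    }' "${1:-/proc/interrupts}"
}
```

Against the outputs posted in this thread, such a helper would be expected to report legacy for the two affected 1078 boxes before the change, msi-x for the 9260 box, and msi for tarballerina after the pci_enable_msi() change.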
Andrew Cooper
2011-Aug-22 09:05 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On 19/08/11 19:10, Andreas Olsowski wrote:
> Yep, that works fine. Removed the module option as well.
>
> root@tarballerina:~# cat /proc/interrupts |grep mega
> 2236: 69010 0 0 0 0 0 0 0 xen-pirq-msi megasas
>
> The same procedure that would have led to almost instant errors has
> not brought them to appear again.

Good. This is what we are seeing as well. I am still awaiting a reply from LSI on this topic.

Unfortunately, this does point to a regression in the way Xen deals with legacy interrupts.
Andrew Cooper
2011-Aug-24 12:06 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On 22/08/11 10:05, Andrew Cooper wrote:
> Good. This is what we are seeing as well. I am still awaiting a reply
> from LSI on this topic.
>
> Unfortunately, this does point to a regression in the way Xen deals with
> legacy interrupts.

Out of interest, on all 3 of your boxes with the megaraid_sas cards, could you gather the io_apic information?

It is the 'z' Xen debug key on the serial console (or alternatively, put apic_verbosity=debug on the Xen command line and the information gets dumped into the dmesg).
Andrew Cooper
2011-Aug-24 16:57 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On 24/08/11 13:06, Andrew Cooper wrote:
> Out of interest, on all 3 of your boxes with the megaraid_sas cards,
> could you gather the io_apic information?
>
> It is the z xen debug key on the serial console (or alternatively put
> apic_verbosity=debug on the xen commandline and the information gets
> dumped into the dmesg)

You can ignore this - it is not relevant.

I have narrowed the problem to a bug in the interrupt migration code.

The bug occurs when the move pending flag is set, and somehow another interrupt comes in on the old pcpu without triggering the move completion code. This leaves the IO_APIC with an ack'd but not EOI'd interrupt from the megaraid_sas device.

This basically locks the server until something (as yet undetermined) triggers the move completion code, at which point the server unlocks itself. When this locked state lasts for more than 2 minutes, the SCSI subsystem decides to kill the megaraid_sas driver, from which dom0 can't recover. I think (although I am not certain) that the megaraid_sas device gets reset by the driver after each of these locked states, causing further IO problems for dom0.

I believe this issue to be some sort of race condition, because I have noticed my debugging printfs significantly altering the rarity of the problem.

To make matters worse, it appears that certain OEM firmware causes a deadlock in the megaraid_sas probe function if you try to enable MSI interrupts, which possibly explains why the driver never tries to enable them in the first place (I have still not had any response from LSI).
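The failure mode described in this message can be illustrated with a toy state machine. This is emphatically not Xen code: the struct, field names and functions are all hypothetical, and the model only captures the sequence as stated in the thread (an interrupt arrives while a move is pending, the completion code does not run, the line stays ack'd without an EOI, and later interrupts on that line cannot be delivered until the move completes).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one level-triggered IO-APIC line (all names hypothetical). */
struct irq_line {
    bool move_pending; /* migration to another pcpu has been requested */
    bool in_service;   /* ack'd by the CPU, EOI not yet sent */
    int  delivered;    /* interrupts that reached the driver */
    int  blocked;      /* interrupts that could not be delivered */
};

/* The device raises its line. 'completion_runs' says whether the move
 * completion code gets triggered for this interrupt. */
static void raise_irq(struct irq_line *l, bool completion_runs)
{
    if (l->in_service) {       /* previous interrupt never got its EOI */
        l->blocked++;
        return;
    }

    l->in_service = true;      /* ack */
    l->delivered++;

    if (l->move_pending && !completion_runs)
        return;                /* stuck: ack'd but not EOI'd */

    l->move_pending = false;
    l->in_service = false;     /* EOI */
}

/* Something eventually triggers the move completion code. */
static void complete_move(struct irq_line *l)
{
    l->move_pending = false;
    l->in_service = false;     /* the outstanding EOI is finally sent */
}
```

In this model, one interrupt that misses the completion path is enough to block every following interrupt on the line until complete_move() runs, which mirrors the "locks until something triggers the move completion code" behaviour reported above.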
Konrad Rzeszutek Wilk
2011-Aug-24 17:09 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On Wed, Aug 24, 2011 at 05:57:06PM +0100, Andrew Cooper wrote:
> You can ignore this - it is not relevant.
>
> I have narrowed the problem to a bug in the interrupt migration code.

Goodies!

> The bug occurs when the move pending flag is set, and somehow another
> interrupt comes in on the old pcpu without triggering the move
> completion code. This leaves the IO_APIC with ack'd but not EOI'd
> interrupt from the megaraid_sas device.

Ah, so the interrupt is delivered to Dom0 on the old per_cpu event, which is ignored. Ignored because we have rebound the event channel to the other CPU, right?

Is there any code in the hypervisor to turn off interrupt migration?
Andrew Cooper
2011-Aug-24 17:20 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On 24/08/11 18:09, Konrad Rzeszutek Wilk wrote:
> On Wed, Aug 24, 2011 at 05:57:06PM +0100, Andrew Cooper wrote:
>> The bug occurs when the move pending flag is set, and somehow another
>> interrupt comes in on the old pcpu without triggering the move
>> completion code. This leaves the IO_APIC with ack'd but not EOI'd
>> interrupt from the megaraid_sas device.
> Ah, so the interrupt is delivered to Dom0 on the old per_cpu
> event which is ignored. Ignored b/c we have rebinded the event channel
> to the other CPU, right?

The interrupt is not ignored - it seems to be being serviced by the device driver in dom0. I will admit that my debugging code may be a bit flaky - I started by trying to match IRQ35 (which is always claimed by PCI INTA on this server - very useful for debugging) between do_IRQ and its related PHYSDEVOP_eoi.

I am currently trying to track the exact order of events around this interrupt which misses the move completion code.

> Is there any code in the Hypervisor to turn off interrupt migration code?

Not that I have found, although playing around with vcpu and task pinning should work. My debugging shows that Xen-4.1.1 is migrating this interrupt between PCPUs on average once every 4 real interrupts when dom0 is under any load whatsoever.
Andrew Cooper
2011-Aug-26 18:16 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On 24/08/11 18:20, Andrew Cooper wrote:
> I am currently trying to track the exact order of events around this
> interrupt which misses the move completion code.
>
>> Is there any code in the Hypervisor to turn off interrupt migration code?
> Not that I have found, although playing around with vcpu and task
> pinning should work. My debugging shows that Xen-4.1.1 is migrating
> this interrupt between PCPUs on average once every 4 real interrupts
> when dom0 is under any load whatsoever.

Please try attached patch. It is a hack, but it works as far as I can test.

(Patch is taken against xen-4.1.1 but should be trivial to port if it doesn't apply cleanly)

~Andrew
Andrew Cooper
2011-Aug-26 18:32 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On 26/08/11 19:16, Andrew Cooper wrote:
> Please try attached patch. It is a hack, but it works as far as I can test.
>
> (Patch is taken against xen-4.1.1 but should be trivial to port if it
> doesn't apply cleanly)
>
> ~Andrew

Apologies - previous patch fails to compile (I forgot to hg qrefresh before sending - it has been a very long day). Try this patch.

~Andrew
Andreas Olsowski
2011-Aug-30 12:02 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
--snip--
> Apologies - previous patch fails to compile (i forgot to hg qrefresh
> before sending - it has been a very long day). Try this patch.

Testing right now; so far it seems to do fine. Patching worked, and so did compilation.

A scenario that previously stopped I/O no longer stops it.

I'll give it a couple more tries and days, but it sure looks good.

Any chance of introducing this patch into xen-4.1-testing and making it a part of the upcoming xen-4.1.2?

with best regards,
Andreas
Andrew Cooper
2011-Aug-30 12:11 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On 30/08/11 13:02, Andreas Olsowski wrote:
> A scenario that previously stopped io does no longer stop it.
>
> Ill give it a couple of more tries and days, but it sure looks good.
>
> Any chance of introducing this patch into xen-4.1-testing and making
> it a part of the upcoming xen-4.1.2?

That is up to Keir. My opinion is that this patch is more of a hack than a solution, especially as it does involve changing the API for interrupt ops, but that does not necessarily prevent it from being included.

I will soon be working on some significant changes to the interrupt code (cleanup of structures, cleanup of logic - specifically the logic which is now false with per-cpu IDTs) with an intention to upstream them, but whether these patches are suitable to backport is an entirely different question.

On a completely different note, we have got in contact with LSI, who are altering their driver to consider MSI interrupts as well as MSI-X interrupts, which will be sensible to take, as MSI interrupts will give you an order of magnitude faster disk IO, irrespective of the line level bug in Xen.

~Andrew
Keir Fraser
2011-Aug-30 12:46 UTC
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2
On 30/08/2011 13:11, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:
> On 30/08/11 13:02, Andreas Olsowski wrote:
>> Any chance of introducing this patch into xen-4.1-testing and making
>> it a part of the upcoming xen-4.1.2?
>
> That is up to Keir. My opinion is that this patch is more of a hack
> than a solution, especially as it does involve changing the API for
> interrupt ops, but that does not necessarily prevent it from being included.

I think this is the right sort of minimal, focused fix that is appropriate for our stable branch, just as it is appropriate for a product patch queue.

> I will soon be working on some significant changes to the interrupt code
> (cleanup of structures, cleanup of logic - specifically the logic which
> is now false with per-cpu IDTs) with an intension to upstream them, but
> whether these patches are suitable to backport is an entirely different
> question.

Yes, it's questionable; it's definitely not going to be ready, let alone tested, in time for 4.1.2.

Please post your hacky fix against 4.1-testing with a Signed-off-by line. I think we should go with it, much preferable to releasing 4.1.2 with a known bug of this type left unfixed. Possibly it is appropriate for 4.0.3 as well? It has the per-cpu IDT logic as well.

Thanks,
Keir