Hello all,

I am getting severe iSCSI connection losses on my dom0 (Gentoo 2.6.20-xen-r6, Xen 3.1.1), happening several times a day. The open-iscsi version is 2.0.865.12 and the iSCSI target is the open-e DSS product.

Here is a snip of my messages log file:

May 5 16:52:50 ying connection226:0: iscsi: detected conn error (1011)
May 5 16:52:51 ying iscsid: connect failed (111)
May 5 16:52:51 ying iscsid: Kernel reported iSCSI connection 226:0 error (1011) state (3)
May 5 16:52:53 ying connection215:0: iscsi: detected conn error (1011)
May 5 16:52:53 ying iscsid: connect failed (111)
May 5 16:52:53 ying iscsid: connect failed (111)
May 5 16:52:53 ying iscsid: connect failed (111)
May 5 16:52:53 ying iscsid: connect failed (111)
[...]

and sometimes:

May 5 16:53:11 ying iscsid: connection227:0 is operational after recovery (6 attempts)
May 5 16:53:11 ying iscsid: connection221:0 is operational after recovery (6 attempts)
May 5 16:53:12 ying iscsid: connection214:0 is operational after recovery (9 attempts)

Usually this means losing my Windows HVM machines; paravirtualized machines seem to handle it OK, oddly (qemu?).

I have read that this could be due to network state changes or asymmetric routing, but I am not sure that applies in my case. I have 4 network interfaces (2 dual-port cards, Intel PRO/1000 MT):

- 1 is dedicated to storage, with jumbo frames enabled
- 1 is for admin tasks (web interface, ssh)
- 2 are for the various VLANs in use

Has anyone experienced this already? Found a solution? Any recommendations?
Any help is much welcome. Thank you.

fred
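For anyone chasing the same symptoms, the affected sessions and the storage NIC's error counters can be inspected with open-iscsi's iscsiadm and standard Linux tools. A rough sketch (ethXX stands in for the storage interface; the -P print level is only available on reasonably recent open-iscsi releases):

    # list active sessions; -P 3 adds per-connection state and attached devices
    iscsiadm -m session
    iscsiadm -m session -P 3 | grep -i state

    # TCP connections to the target's iSCSI port (3260 by default)
    netstat -tn | grep :3260

    # drops/errors on the storage interface
    ip -s link show ethXX
    ethtool -S ethXX | grep -iE 'err|drop'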
Fred Blaise wrote:
> Hello all,
>
> I got some severe iscsi connection loss on my dom0 (Gentoo
> 2.6.20-xen-r6, xen 3.1.1). Happening several times a day.
> open-iscsi version is 2.0.865.12. Target iscsi is the open-e
> DSS product.
>
> [...]
>
> Anyone experienced this already? Found a solution? Any
> recommendations?
> Any help much welcome.

Try disabling jumbo frames. I have seen a lot of cases of jumbo frames causing a stall in the switch ports on some switches. Also, if you are using jumbo frames, make sure flow control isn't a problem, as a lot of switches have inadequate port buffers to handle flow control and jumbo frames together.

To note: jumbo frames on 1GbE aren't necessary and will in fact increase latency, which decreases throughput. Jumbo frames are really meant to reduce interrupts and are a lot more effective with 10GbE than 1GbE. On 1GbE, if interrupts are running too high, I would try interrupt coalescing first to reduce them.

-Ross
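A minimal sketch of what Ross suggests, on the initiator side (ethXX is the storage interface; the coalescing values are only examples, driver support for ethtool -C varies, and the switch ports have to be checked separately):

    # fall back to standard frames on the storage interface
    ip link set dev ethXX mtu 1500

    # show / set flow control (pause frame) negotiation
    ethtool -a ethXX
    ethtool -A ethXX rx on tx on

    # show / set interrupt coalescing, where the driver supports it
    ethtool -c ethXX
    ethtool -C ethXX rx-usecs 125

    # the e1000 driver (Intel PRO/1000) also exposes a module parameter,
    # e.g. in /etc/modprobe.conf (or the distro's modprobe.d equivalent):
    #   options e1000 InterruptThrottleRate=8000,8000,8000,8000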
Fred Blaise schrieb:
> Hello all,
>
> I got some severe iscsi connection loss on my dom0 (Gentoo
> 2.6.20-xen-r6, xen 3.1.1). Happening several times a day.
> open-iscsi version is 2.0.865.12. Target iscsi is the open-e DSS product.
>
> Here is a snip of my messages log file:
> May 5 16:52:50 ying connection226:0: iscsi: detected conn error (1011)
> May 5 16:52:51 ying iscsid: connect failed (111)
> [...]
>
> and sometimes:
> May 5 16:53:11 ying iscsid: connection227:0 is operational after
> recovery (6 attempts)
> [...]

I doubt it's Xen related.

I'm running lots of dom0s and domUs (and non-Xen machines) as iSCSI initiators, mostly without such problems.

When it does happen, it usually means a problem with:

1) the iSCSI target implementation, or
2) the target or the initiator being heavily loaded (or both).

Did you try changing the iSCSI target, either to tgt or SCST? I'm not sure what target you have with open-e; I think they wanted to migrate to SCST, but used the buggy IET before (or still do, I'm not sure).

Any other messages/logs?

2.6.25 has a nice feature with soft lockup detection, i.e. it will print messages like the one below when the machine is severely loaded (it may indicate some problems):

May 3 00:46:33 backup1 kernel: INFO: task sync:4875 blocked for more than 120 seconds.

--
Tomasz Chmielewski
http://wpkg.org
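To see whether load on either side lines up with the drops, a quick sketch using standard tools (iostat comes from the sysstat package; run these on both the dom0 and the target around the time a connection error is logged):

    # look for the blocked-task warnings Tomasz mentions
    grep "blocked for more than" /var/log/messages

    # watch CPU, memory and I/O pressure while the problem reproduces
    vmstat 5
    iostat -x 5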
Tomasz Chmielewski wrote:
> Fred Blaise schrieb:
> > I got some severe iscsi connection loss on my dom0 (Gentoo
> > 2.6.20-xen-r6, xen 3.1.1). Happening several times a day.
> > open-iscsi version is 2.0.865.12. Target iscsi is the open-e DSS product.
> > [...]
>
> I doubt it's Xen related.
> [...]
>
> Did you try changing the iSCSI target, either to tgt or SCST? I'm not
> sure what target you have with open-e; I think they wanted to migrate to
> SCST, but used the buggy IET before (or still do, I'm not sure).

Open-e isn't forthcoming about the exact version of IET it uses, so I don't know if it's running the latest, but they patch it heavily internally, so the code base has diverged. It's kind of like what Red Hat does with their Linux kernels.

> Any other messages/logs?
>
> 2.6.25 has a nice feature with soft lockup detection, i.e. it will
> print messages like the one below when the machine is severely loaded
> (it may indicate some problems):
>
> May 3 00:46:33 backup1 kernel: INFO: task sync:4875 blocked for more
> than 120 seconds.

The OP may want to get a hold of the logs on the Open-e box too, in case there is any hardware failure occurring there.

-Ross
Hey Tomasz,

I could get an interesting and clear trace when things started going south this morning... for no apparent reason (i.e. no load). Load shouldn't be a problem in this environment yet.

It started with nop-outs timing out, then sank from there to failing all I/O.

I have indeed also opened a ticket with open-e, but haven't gotten an answer yet.

I also launched a ping -s 8192 -i 3 -I ethXX to the storage, to see if I am losing ICMP packets when the iSCSI connections are lost.

An upgrade can be an option soon. I also saw Xen 3.1.2 is out, so I may upgrade everything at once in a while if the problem persists and no solution is found.

The switches don't have anything in their logs that would indicate an issue with jumbo frames, or anything else for that matter.

Thanks all,
fred

Tomasz Chmielewski wrote:
> Fred Blaise schrieb:
>> [...]
>
> I doubt it's Xen related.
> [...]
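One caveat about that kind of ping test (a hedged aside, assuming a 9000-byte MTU on the storage path): without the don't-fragment flag, an 8192-byte ICMP payload may simply be fragmented into standard-size frames and keep succeeding even if jumbo frames are broken somewhere along the path. Something like this exercises the actual jumbo path (ethXX and the target address are placeholders):

    # 8972 = 9000-byte MTU - 20-byte IP header - 8-byte ICMP header
    ping -M do -s 8972 -i 3 -I ethXX <storage-target-ip>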
Fred Blaise schrieb:
> Hey Tomasz,
>
> I could get an interesting and clear trace when things started going
> south this morning... for no apparent reason (i.e. no load). Load
> shouldn't be a problem in this environment yet.
>
> It started with nop-outs timing out, then sank from there to failing all I/O.

Does it by chance happen around the time you restart your iSCSI initiators, or when they briefly disconnect?

I'm not sure what open-e uses, but here is some reading about it:

http://blog.wpkg.org/2007/09/09/solving-reliability-and-scalability-problems-with-iscsi/

> I have indeed also opened a ticket with open-e, but haven't gotten an
> answer yet.
>
> I also launched a ping -s 8192 -i 3 -I ethXX to the storage, to see if I
> am losing ICMP packets when the iSCSI connections are lost.

And?

> An upgrade can be an option soon. I also saw Xen 3.1.2 is out, so I may
> upgrade everything at once in a while if the problem persists and no
> solution is found.
>
> The switches don't have anything in their logs that would indicate an
> issue with jumbo frames, or anything else for that matter.

When these timeouts last longer than 120 seconds, the session is dropped.

You could increase the timeout in /etc/iscsi/iscsid.conf (node.session.timeo.replacement_timeout) to a much greater value - in most cases it's a good idea (not only as a workaround for your problem until you find a solution, but often helpful if you want to upgrade the iSCSI target, replace cabling, switches, etc.).

Anyway, the topic is not very Xen-related and should be directed to an iSCSI-specific list (or open-e support, maybe).

--
Tomasz Chmielewski
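For reference, a minimal sketch of the change Tomasz describes, assuming a stock open-iscsi setup (the 7200-second value is only an example; pick whatever fits the environment):

    # /etc/iscsi/iscsid.conf - used for node records created after the change
    node.session.timeo.replacement_timeout = 7200

    # existing node records keep their old value; they can be updated with iscsiadm
    iscsiadm -m node -o update -n node.session.timeo.replacement_timeout -v 7200

The new value is generally only picked up the next time a session logs in, so already-running sessions may need a logout/login before it takes effect.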
Tomasz Chmielewski wrote:
> Fred Blaise schrieb:
>> Hey Tomasz,
>>
>> I could get an interesting and clear trace when things started going
>> south this morning... for no apparent reason (i.e. no load). Load
>> shouldn't be a problem in this environment yet.
>>
>> It started with nop-outs timing out, then sank from there to failing all I/O.
>
> Does it by chance happen around the time you restart your iSCSI initiators,
> or when they briefly disconnect?
> I'm not sure what open-e uses, but here is some reading about it:
>
> http://blog.wpkg.org/2007/09/09/solving-reliability-and-scalability-problems-with-iscsi/

Good read. Thanks for that.

>> I also launched a ping -s 8192 -i 3 -I ethXX to the storage, to see if
>> I am losing ICMP packets when the iSCSI connections are lost.
>
> And?

Still waiting for a timeout... I can report back, even though it will be my last post (at least directly to your email).

>> The switches don't have anything in their logs that would indicate an
>> issue with jumbo frames, or anything else for that matter.
>
> When these timeouts last longer than 120 seconds, the session is dropped.
>
> You could increase the timeout in /etc/iscsi/iscsid.conf
> (node.session.timeo.replacement_timeout) to a much greater value [...]
>
> Anyway, the topic is not very Xen-related and should be directed to an
> iSCSI-specific list (or open-e support, maybe).

... right, as said above. Thanks a lot for the insights, very helpful.

Best,
fred
Hi,

Just for info: bumping up the timeout settings seems to have had a positive effect. I haven't had a machine crash in the last couple of days, whereas before it would happen a few times a day.

Thank you.

fred

Tomasz Chmielewski wrote:
> [...]
>
> You could increase the timeout in /etc/iscsi/iscsid.conf
> (node.session.timeo.replacement_timeout) to a much greater value [...]