Hello all,

I am getting severe iSCSI connection losses on my dom0 (Gentoo 2.6.20-xen-r6, Xen 3.1.1), happening several times a day. The open-iscsi version is 2.0.865.12 and the iSCSI target is the open-e DSS product.

Here is a snip of my messages log file:

May 5 16:52:50 ying connection226:0: iscsi: detected conn error (1011)
May 5 16:52:51 ying iscsid: connect failed (111)
May 5 16:52:51 ying iscsid: Kernel reported iSCSI connection 226:0 error (1011) state (3)
May 5 16:52:53 ying connection215:0: iscsi: detected conn error (1011)
May 5 16:52:53 ying iscsid: connect failed (111)
May 5 16:52:53 ying iscsid: connect failed (111)
May 5 16:52:53 ying iscsid: connect failed (111)
May 5 16:52:53 ying iscsid: connect failed (111)
[...]

and sometimes:

May 5 16:53:11 ying iscsid: connection227:0 is operational after recovery (6 attempts)
May 5 16:53:11 ying iscsid: connection221:0 is operational after recovery (6 attempts)
May 5 16:53:12 ying iscsid: connection214:0 is operational after recovery (9 attempts)

Usually this means losing my Windows HVM machines; paravirtualized machines seem to handle it OK, oddly (qemu?).

I have read that this could be due to network state changes or asymmetric routing, but I am not sure that applies in my case. I have 4 network interfaces (2 dual-port cards, Intel PRO/1000 MT):

- 1 is dedicated to storage, with jumbo frames enabled
- 1 is for admin tasks (web interface, ssh)
- 2 are for the various VLANs in use

Has anyone experienced this already? Found a solution? Any recommendations?
Any help is much welcome. Thank you.

fred
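For anyone chasing the same symptoms, the affected sessions and the storage NIC's error counters can be inspected with open-iscsi's iscsiadm and standard Linux tools. A rough sketch (ethXX stands in for the storage interface; the -P print level is only available on reasonably recent open-iscsi releases):

    # list active sessions; -P 3 adds per-connection state and attached devices
    iscsiadm -m session
    iscsiadm -m session -P 3 | grep -i state

    # TCP connections to the target's iSCSI port (3260 by default)
    netstat -tn | grep :3260

    # drops/errors on the storage interface
    ip -s link show ethXX
    ethtool -S ethXX | grep -iE 'err|drop'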
Fred Blaise wrote:
> Hello all,
>
> I got some severe iscsi connection loss on my dom0 (Gentoo
> 2.6.20-xen-r6, xen 3.1.1). Happening several times a day.
> open-iscsi version is 2.0.865.12. Target iscsi is the open-e
> DSS product.
>
> [...]
>
> Anyone experienced this already? Found a solution? Any
> recommendations?
> Any help much welcome.

Try disabling jumbo frames. I have seen a lot of cases of jumbo frames causing a stall in the switch ports on some switches. Also, if you are using jumbo frames, make sure flow control isn't a problem, as a lot of switches have inadequate port buffers to handle flow control and jumbo frames together.

To note: jumbo frames on 1GbE aren't necessary and will in fact increase latency, which decreases throughput. Jumbo frames are really meant to reduce interrupts and are a lot more effective with 10GbE than 1GbE. On 1GbE, if interrupts are running too high, I would try interrupt coalescing first to reduce them.

-Ross
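A minimal sketch of what Ross suggests, on the initiator side (ethXX is the storage interface; the coalescing values are only examples, driver support for ethtool -C varies, and the switch ports have to be checked separately):

    # fall back to standard frames on the storage interface
    ip link set dev ethXX mtu 1500

    # show / set flow control (pause frame) negotiation
    ethtool -a ethXX
    ethtool -A ethXX rx on tx on

    # show / set interrupt coalescing, where the driver supports it
    ethtool -c ethXX
    ethtool -C ethXX rx-usecs 125

    # the e1000 driver (Intel PRO/1000) also exposes a module parameter,
    # e.g. in /etc/modprobe.conf (or the distro's modprobe.d equivalent):
    #   options e1000 InterruptThrottleRate=8000,8000,8000,8000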
Fred Blaise schrieb:
> Hello all,
>
> I got some severe iscsi connection loss on my dom0 (Gentoo
> 2.6.20-xen-r6, xen 3.1.1). Happening several times a day.
> open-iscsi version is 2.0.865.12. Target iscsi is the open-e DSS product.
>
> Here is a snip of my messages log file:
> May 5 16:52:50 ying connection226:0: iscsi: detected conn error (1011)
> May 5 16:52:51 ying iscsid: connect failed (111)
> [...]
>
> and sometimes:
> May 5 16:53:11 ying iscsid: connection227:0 is operational after
> recovery (6 attempts)
> [...]

I doubt it's Xen related.

I'm running lots of dom0s and domUs (and non-Xen machines) as iSCSI initiators, mostly without such problems.

When it does happen, it usually means a problem with:

1) the iSCSI target implementation, or
2) the target or the initiator being heavily loaded (or both).

Did you try changing the iSCSI target, either to tgt or SCST? I'm not sure what target you have with open-e; I think they wanted to migrate to SCST, but used the buggy IET before (or still do, I'm not sure).

Any other messages/logs?

2.6.25 has a nice feature with soft lockup detection, i.e. it will print messages like the one below when the machine is severely loaded (it may indicate some problems):

May 3 00:46:33 backup1 kernel: INFO: task sync:4875 blocked for more than 120 seconds.

--
Tomasz Chmielewski
http://wpkg.org
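To see whether load on either side lines up with the drops, a quick sketch using standard tools (iostat comes from the sysstat package; run these on both the dom0 and the target around the time a connection error is logged):

    # look for the blocked-task warnings Tomasz mentions
    grep "blocked for more than" /var/log/messages

    # watch CPU, memory and I/O pressure while the problem reproduces
    vmstat 5
    iostat -x 5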
Tomasz Chmielewski wrote:
> Fred Blaise schrieb:
> > I got some severe iscsi connection loss on my dom0 (Gentoo
> > 2.6.20-xen-r6, xen 3.1.1). Happening several times a day.
> > open-iscsi version is 2.0.865.12. Target iscsi is the open-e DSS product.
> > [...]
>
> I doubt it's Xen related.
> [...]
>
> Did you try changing the iSCSI target, either to tgt or SCST? I'm not
> sure what target you have with open-e; I think they wanted to migrate to
> SCST, but used the buggy IET before (or still do, I'm not sure).

Open-e isn't forthcoming about the exact version of IET it uses, so I don't know if it's running the latest, but they patch it heavily internally, so the code base has diverged. It's kind of like what Red Hat does with their Linux kernels.

> Any other messages/logs?
>
> 2.6.25 has a nice feature with soft lockup detection, i.e. it will
> print messages like the one below when the machine is severely loaded
> (it may indicate some problems):
>
> May 3 00:46:33 backup1 kernel: INFO: task sync:4875 blocked for more
> than 120 seconds.

The OP may want to get a hold of the logs on the Open-e box too, in case there is any hardware failure occurring there.

-Ross
Hey Tomasz,

I could get an interesting and clear trace when things started going south this morning... for no apparent reason (i.e. no load). Load shouldn't be a problem in this environment yet.

It started with nop-outs timing out, then sank from there to failing all I/O.

I have indeed also opened a ticket with open-e, but haven't gotten an answer yet.

I also launched a ping -s 8192 -i 3 -I ethXX to the storage, to see if I am losing ICMP packets when the iSCSI connections are lost.

An upgrade can be an option soon. I also saw Xen 3.1.2 is out, so I may upgrade everything at once in a while if the problem persists and no solution is found.

The switches don't have anything in their logs that would indicate an issue with jumbo frames, or anything else for that matter.

Thanks all,
fred

Tomasz Chmielewski wrote:
> Fred Blaise schrieb:
>> [...]
>
> I doubt it's Xen related.
> [...]
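One caveat about that kind of ping test (a hedged aside, assuming a 9000-byte MTU on the storage path): without the don't-fragment flag, an 8192-byte ICMP payload may simply be fragmented into standard-size frames and keep succeeding even if jumbo frames are broken somewhere along the path. Something like this exercises the actual jumbo path (ethXX and the target address are placeholders):

    # 8972 = 9000-byte MTU - 20-byte IP header - 8-byte ICMP header
    ping -M do -s 8972 -i 3 -I ethXX <storage-target-ip>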
Fred Blaise schrieb:
> Hey Tomasz,
>
> I could get an interesting and clear trace when things started going
> south this morning... for no apparent reason (i.e. no load). Load
> shouldn't be a problem in this environment yet.
>
> It started with nop-outs timing out, then sank from there to failing all I/O.

Does it by chance happen around the time you restart your iSCSI initiators, or when they briefly disconnect?

I'm not sure what open-e uses, but here is some reading about it:

http://blog.wpkg.org/2007/09/09/solving-reliability-and-scalability-problems-with-iscsi/

> I have indeed also opened a ticket with open-e, but haven't gotten an
> answer yet.
>
> I also launched a ping -s 8192 -i 3 -I ethXX to the storage, to see if I
> am losing ICMP packets when the iSCSI connections are lost.

And?

> An upgrade can be an option soon. I also saw Xen 3.1.2 is out, so I may
> upgrade everything at once in a while if the problem persists and no
> solution is found.
>
> The switches don't have anything in their logs that would indicate an
> issue with jumbo frames, or anything else for that matter.

When these timeouts last longer than 120 seconds, the session is dropped.

You could increase the timeout in /etc/iscsi/iscsid.conf (node.session.timeo.replacement_timeout) to a much greater value - in most cases it's a good idea (not only as a workaround for your problem until you find a solution, but often helpful if you want to upgrade the iSCSI target, replace cabling, switches, etc.).

Anyway, the topic is not very Xen-related and should be directed to an iSCSI-specific list (or open-e support, maybe).

--
Tomasz Chmielewski
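For reference, a minimal sketch of the change Tomasz describes, assuming a stock open-iscsi setup (the 7200-second value is only an example; pick whatever fits the environment):

    # /etc/iscsi/iscsid.conf - used for node records created after the change
    node.session.timeo.replacement_timeout = 7200

    # existing node records keep their old value; they can be updated with iscsiadm
    iscsiadm -m node -o update -n node.session.timeo.replacement_timeout -v 7200

The new value is generally only picked up the next time a session logs in, so already-running sessions may need a logout/login before it takes effect.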
Tomasz Chmielewski wrote:
> Fred Blaise schrieb:
>> Hey Tomasz,
>>
>> I could get an interesting and clear trace when things started going
>> south this morning... for no apparent reason (i.e. no load). Load
>> shouldn't be a problem in this environment yet.
>>
>> It started with nop-outs timing out, then sank from there to failing all I/O.
>
> Does it by chance happen around the time you restart your iSCSI initiators,
> or when they briefly disconnect?
> I'm not sure what open-e uses, but here is some reading about it:
>
> http://blog.wpkg.org/2007/09/09/solving-reliability-and-scalability-problems-with-iscsi/

Good read. Thanks for that.

>> I also launched a ping -s 8192 -i 3 -I ethXX to the storage, to see if
>> I am losing ICMP packets when the iSCSI connections are lost.
>
> And?

Still waiting for a timeout... I can report back, even though it will be my last post (at least directly to your email).

>> The switches don't have anything in their logs that would indicate an
>> issue with jumbo frames, or anything else for that matter.
>
> When these timeouts last longer than 120 seconds, the session is dropped.
>
> You could increase the timeout in /etc/iscsi/iscsid.conf
> (node.session.timeo.replacement_timeout) to a much greater value [...]
>
> Anyway, the topic is not very Xen-related and should be directed to an
> iSCSI-specific list (or open-e support, maybe).

... right, as said above. Thanks a lot for the insights, very helpful.

Best,
fred
Hi,

Just for info: bumping up the timeout settings seems to have had a positive effect. I haven't had a machine crash in the last couple of days, whereas before it would happen a few times a day.

Thank you.

fred

Tomasz Chmielewski wrote:
> [...]
>
> You could increase the timeout in /etc/iscsi/iscsid.conf
> (node.session.timeo.replacement_timeout) to a much greater value [...]