Arnaud Brand
2010-Feb-15 13:28 UTC
[zfs-discuss] [networking-discuss] Help needed on big transfers failure with e1000g
Hi,

Sending to zfs-discuss too, since this seems to be related to the zfs receive operation.

The following only holds true when the replication stream (i.e. the delta between snap1 and snap2) is more than about 800GB.

If I proceed with this command, the transfer fails after some variable amount of time, usually at around 790GB:

pfexec /sbin/zfs send -DRp -I tank/tsm@snap1 tank/tsm@snap2 | ssh -c arcfour nc-tanktsm pfexec /sbin/zfs recv -F -d tank

If, however, I proceed as follows, I have no problems:

pfexec /sbin/zfs send -DRp -I tank/tsm@snap1 tank/tsm@snap2 | /usr/gnu/bin/dd of=/tank/stream.zfs bs=1M

then manually copy stream.zfs to the remote host through scp, and finally:

/usr/gnu/bin/dd if=/tank/stream.zfs bs=1M | pfexec /sbin/zfs recv -F -d tank

An additional thing to note (perhaps) is that there were no snapshots between snap1 and snap2. I reproduced the behaviour with two different replication streams (one weighing 1.63TB and the other 1.48TB).

I tried to add some buffering to the first procedure by piping (on both ends) through dd or mbuffer. While the global throughput improved slightly, the transfer still failed. I also suspected problems with ssh and used netcat instead, with and without dd, with or without mbuffer. I got even more throughput, but it still failed. (A rough sketch of these pipelines is further down, after the quoted thread.)

I guess asking to transfer a 1TB+ replication stream "live" might just be too much, and I might be the only crazy fool to try it. I've set up zfs-auto-snap and launch my replication more frequently, so as long as zfs-auto-snap doesn't remove intermediate snapshots before the script can send them, this kind of solves it for me. Still, there might be a problem somewhere that others might encounter. Just thought I should report it.

- Arnaud

On 09/02/2010 17:41, Arnaud Brand wrote:
> Sorry for the double-post, I forgot to answer your questions.
>
> Arnaud
>
> On 09/02/2010 17:31, Arnaud Brand wrote:
>> Hi James,
>>
>> Sorry to bother you again, I think I found the problem but need
>> confirmation/advice.
>>
>> It appears that A wants to send more data to B through SSH than B can
>> accept. In fact, B is in the process of committing the zfs recv of a big
>> snapshot, but A still has some snaps to send (zfs send -R).
>> After exactly 6 minutes, B sends a RST to A and both close their
>> connections. Sadly the receive operation goes away with the ssh session.
>>
>> What looks strange is that B doesn't reduce its window; it just
>> keeps ACKing the last byte SSH could eat and leaves the window at 64436.
>> I guess it's related to SSH's channel multiplexing features: it can't
>> reduce the window or other channels won't make progress either. Am I
>> right here?
>>
>> I did some ndd lookups in /dev/tcp and /dev/ip to see if I could find a
>> timeout to raise, but found nothing matching the 6 minutes.
>> The docs related to tcp/ip tunables that I found on Sun's website didn't
>> get me much further; I found nothing that seemed applicable.
>>
>> I agree my case is perhaps a bit overstretched, and I'm going to generate
>> the replication stream in a local file and send it over "by hand".
>> Later I shouldn't have snapshots that are that big, and I wish zfs recv
>> wouldn't block for so long either. Still, I'm asking myself whether the
>> behavior I'm observing is correct or whether it's the sign of some
>> misconfiguration on my part (or perhaps I've once more forgotten how
>> things work).
>>
>> Sorry for my bad English, I hope you understood what I meant.
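
(For reference, the ndd lookups mentioned above were along the following lines. The parameter names are only the ones I remember checking, not an exact transcript:

  ndd /dev/tcp \?                           # list the tunable TCP parameters
  ndd -get /dev/tcp tcp_keepalive_interval  # keepalive timer, in milliseconds
  ndd -get /dev/tcp tcp_time_wait_interval  # how long TIME_WAIT state is held
  ndd /dev/ip \?                            # same listing for IP

None of the values I found corresponded to the 6-minute mark.)
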
>> Just in case, I've attached the tcpdump output of the ssh session,
>> starting at the very last packet that is accepted and acked by B.
>> A is 192.0.2.2 and B is 192.0.2.1.
>>
>> If you could shed some light on this case I'd be very grateful, but I
>> don't want to bother you.
>>
>> Thanks,
>> Arnaud
>>
>>
>> On 09/02/2010 14:39, James Carlson wrote:
>>> Arnaud Brand wrote:
>>>> On 08/02/10 23:18, James Carlson wrote:
>>>>> Causes for RST include:
>>>>>
>>>>>    - peer application is intentionally setting the linger time to zero
>>>>>      and issuing close(2), which results in TCP RST generation.
>>>>>
>>>> Might be possible, but I can't see why the receiving end would do that.
>>> No idea, but a debugger on that side might be able to detect something.
>>>
>>>>>    - bugs in one or both peers (often related to TCP keepalive; key
>>>>>      signature of such a problem is an apparent two-hour time limit).
>>>>>
>>>> That could be it, but I doubt it, since disconnections appeared
>>>> randomly, anywhere in the range of 10 minutes to 13 hours.
>>>> It should be noted that the node sending the RST keeps the connection
>>>> open (netstat -a shows it's still established).
>>>> To be honest, that puzzles me.
>>> That sounds horrible. There's no way a node that still has state for
>>> the connection should be sending RST. Normal procedure is to generate
>>> RST when you do _not_ have state for the connection or (if you're
>>> intentionally aborting the connection) to discard the state at the same
>>> time you send RST.
>>>
>>> That points to either a bug in the peer's TCP/IP implementation or one
>>> of the causes that you've dismissed (particularly either a duplicate IP
>>> address or a firewall/NAT problem).
> It was with netcat, not ssh.
> I was indeed very surprised to see the connection still "established"
> instead of "listening".
>>>>> You (at least) have to analyze the packet sequences to determine what
>>>>> is going wrong. Depending on the nature of the problem, it may also
>>>>> take in-depth kernel debugging on one or both peers to locate the cause.
>>>>>
>>>> I relaunched another transfer and I'm tcpdumping both servers in the
>>>> hope that I find something.
>>>> In the meantime, I've received a beta BIOS from Tyan which provides
>>>> support for IKVM over tagged VLANs.
>>>> Until now the Intel chips (on which the IKVM/IPMI card is piggy-backed)
>>>> are working better than before.
>>>> I can't tell if it's related or not; I'm crossing my fingers.
>>> That could EASILY be related. That was key information to include.
>>>
>>> IPMI, as I recall, hijacks the node's Ethernet controller to provide
>>> low-level node control service. From what I remember out of ARC
>>> reviews, the architecture is pretty brutal^W"clever".
>>>
>>> I wouldn't be surprised in the least if this is the problem. I like the
>>> idea of remote management, but that's the sort of thing I'd never enable
>>> on my systems ...
> Sadly, I can't disable it.
> The VLAN setting didn't change anything.
>>>> Regarding kernel debugging, I thought I would look for dtrace scripts,
>>>> and found some, but nothing that seemed relevant in my case.
>>>> As I am a complete beginner (read: copy-paste) in dtrace, I couldn't yet
>>>> figure out how to write one myself.
>>> Is the remote end that's generating RST also OpenSolaris?
> Yes, they were both snv_132.
> I reproduced the disconnection with snv_131.
> I think I still have a snv_130 BE, so I could give it a try too.
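
Here is the rough sketch of the buffered variants mentioned near the top of this message. The port number and buffer sizes are only examples, and the exact option syntax may differ between mbuffer/netcat versions:

  # ssh with mbuffer on both ends
  pfexec /sbin/zfs send -DRp -I tank/tsm@snap1 tank/tsm@snap2 | mbuffer -s 128k -m 1G \
    | ssh -c arcfour nc-tanktsm 'mbuffer -s 128k -m 1G | pfexec /sbin/zfs recv -F -d tank'

  # netcat instead of ssh: start the receiver first on the remote host ...
  nc -l 9090 | mbuffer -s 128k -m 1G | pfexec /sbin/zfs recv -F -d tank
  # ... then feed it from the sending side
  pfexec /sbin/zfs send -DRp -I tank/tsm@snap1 tank/tsm@snap2 | mbuffer -s 128k -m 1G | nc nc-tanktsm 9090
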
>
> _______________________________________________
> networking-discuss mailing list
> networking-discuss at opensolaris.org