Cyril Servant
2020-Apr-09 15:01 UTC
Parallel transfers with sftp (call for testing / advice)
> On 9 Apr 2020, at 00:34, Nico Kadel-Garcia <nkadel at gmail.com> wrote:
>
> On Wed, Apr 8, 2020 at 11:31 AM Cyril Servant <cyril.servant at gmail.com> wrote:
>>
>> Hello, I'd like to share with you an evolution I made on sftp.
>
> It *sounds* like you should be using parallelized rsync over xargs.
> Partial sftp or scp transfers are almost inevitable in bulk transfers
> over a crowded network, and sftp does not have good support for
> "mirroring", only for copying content.
>
> See https://stackoverflow.com/questions/24058544/speed-up-rsync-with-simultaneous-concurrent-file-transfers

This solution is perfect for sending a lot of files in parallel. But in the
case of sending one really big file, it does not improve transfer speed.

>> I'm working at CEA (Commissariat à l'énergie atomique et aux énergies
>> alternatives) in France. We have a compute cluster complex, and our
>> customers regularly need to transfer big files from and to the cluster.
>> Each of our front nodes has an outgoing bandwidth limit (let's say 1Gb/s
>> each, generally more limited by the CPU than by the network bandwidth),
>> but the total interconnection to the customer is higher (let's say
>> 10Gb/s). Each front node shares a distributed file system on an internal
>> high bandwidth network. So the contention point is the 1Gb/s limit of a
>> connection. If the customer wants to use more than 1Gb/s, he currently
>> uses GridFTP. We want to provide a solution based on ssh to our customers.
>>
>> 2. The solution
>>
>> I made some changes in the sftp client. The new option "-n" (defaults to
>> 0) sets the number of extra channels. There is one main ssh channel, and
>> n extra channels. The main ssh channel does everything, except the put
>> and get commands. Put and get commands are parallelized on the n extra
>> channels. Thanks to this, when the customer uses "-n 5", he can transfer
>> his files up to 5Gb/s. There is no server side change. Everything is made
>> on the client side.
>
> While the option sounds useful for niche cases, I'd be leery of
> partial transfers and being compelled to replicate content to handle
> partial transfers. rsync has been very good, for years, in completing
> partial transfers.

I can fully understand this. In our case, the network is not really crowded,
as customers are generally using research / educational links. Indeed, this
is totally a niche case, but still a need for us. The main use case is
putting data you want to process into the cluster, and when the job is
finished, getting the output of the process. There is rarely the need for
synchronising files, except for the code you want to execute on the cluster,
which is considered small compared to the data. rsync is the obvious choice
for synchronising the code, but not for putting / getting huge amounts of
data.

The only other ssh-based tool that can speed up the transfer of one big file
is lftp, and it only works for get commands, not for put commands.

>> 3. Some details
>>
>> Each extra channel has its own ssh channel, and its own thread. Orders
>> are sent by the main channel to the threads via a queue. When the user
>> sends a get or put request, the main channel checks what to do. If the
>> file is small enough, one simple order is added to the queue. If the file
>> is big, the main channel writes the last block of the file (in order to
>> create a sparse file), then adds multiple orders to the queue. Each of
>> these orders is a put (or get) of a chunk of the file. One notable change
>> is the progress meter (in interactive mode). There is no longer one
>> progress meter for each file; now there is only one progress meter, which
>> shows the name of the last dequeued file and a total of transferred
>> bytes.
>>
>> 4. Any thoughts ?
>>
>> You will find the code here:
>> https://github.com/cea-hpc/openssh-portable/tree/parallel_sftp
>> The branch parallel_sftp is based on the tag V_8_2_P1. There may be a lot
>> of newbie mistakes in the code, I'll gladly take any advice and
>> criticism, I'm open minded. And finally, if there is even the slightest
>> chance for these changes to be merged upstream, please show me the path.

Thank you,
--
Cyril
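For illustration, the chunked approach described above can be approximated
outside of the patch with plain ssh and dd: split the big file into ranges,
push each range over its own connection, and write it at the matching offset
on the remote side. This is only a rough sketch of the idea (host, paths and
chunk size are placeholders, GNU stat is assumed, and key-based
authentication is needed so the parallel connections don't each prompt for a
password); with the patched client this would presumably collapse into
something like a single "sftp -n 5" invocation.

    #!/bin/sh
    # Rough sketch: send one big file as N ranges over N parallel ssh
    # connections, assembling it in place on the remote side.
    FILE=bigfile.dat                # local file (placeholder)
    REMOTE="user@cluster-front"     # remote host (placeholder)
    DEST=/scratch/bigfile.dat       # remote path (placeholder)
    N=5                             # number of parallel connections
    BS=1048576                      # 1 MiB blocks

    SIZE=$(stat -c %s "$FILE")                 # GNU stat
    BLOCKS=$(( (SIZE + BS - 1) / BS ))         # total blocks in the file
    PER=$(( (BLOCKS + N - 1) / N ))            # blocks handled per connection

    i=0
    while [ "$i" -lt "$N" ]; do
        SKIP=$(( i * PER ))
        # Read one range locally, write it at the same offset remotely.
        # conv=notrunc keeps ranges already written by the other jobs intact.
        dd if="$FILE" bs="$BS" skip="$SKIP" count="$PER" 2>/dev/null \
          | ssh "$REMOTE" "dd of=$DEST bs=$BS seek=$SKIP conv=notrunc 2>/dev/null" &
        i=$(( i + 1 ))
    done
    wait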
Nico Kadel-Garcia
2020-Apr-10 22:41 UTC
Parallel transfers with sftp (call for testing / advice)
On Thu, Apr 9, 2020 at 11:01 AM Cyril Servant <cyril.servant at gmail.com> wrote:
>
>> On 9 Apr 2020, at 00:34, Nico Kadel-Garcia <nkadel at gmail.com> wrote:
>>
>> On Wed, Apr 8, 2020 at 11:31 AM Cyril Servant <cyril.servant at gmail.com> wrote:
>>>
>>> Hello, I'd like to share with you an evolution I made on sftp.
>>
>> It *sounds* like you should be using parallelized rsync over xargs.
>> Partial sftp or scp transfers are almost inevitable in bulk transfers
>> over a crowded network, and sftp does not have good support for
>> "mirroring", only for copying content.
>>
>> See https://stackoverflow.com/questions/24058544/speed-up-rsync-with-simultaneous-concurrent-file-transfers
>
> This solution is perfect for sending a lot of files in parallel. But in
> the case of sending one really big file, it does not improve transfer
> speed.

It's helpful because it allows you to retry where the last transmission
failed, and it does not leave a partial upload sitting there tempting
people. It uploads to .filename-hash, and moves the upload in place when the
individual file upload is completed.

>>> I'm working at CEA (Commissariat à l'énergie atomique et aux énergies
>>> alternatives) in France. We have a compute cluster complex, and our
>>> customers regularly need to transfer big files from and to the cluster.
>>> Each of our front nodes has an outgoing bandwidth limit (let's say 1Gb/s
>>> each, generally more limited by the CPU than by the network bandwidth),
>>> but the total interconnection to the customer is higher (let's say
>>> 10Gb/s). Each front node shares a distributed file system on an internal
>>> high bandwidth network. So the contention point is the 1Gb/s limit of a
>>> connection. If the customer wants to use more than 1Gb/s, he currently
>>> uses GridFTP. We want to provide a solution based on ssh to our
>>> customers.
>>>
>>> 2. The solution
>>>
>>> I made some changes in the sftp client. The new option "-n" (defaults to
>>> 0) sets the number of extra channels. There is one main ssh channel, and
>>> n extra channels. The main ssh channel does everything, except the put
>>> and get commands. Put and get commands are parallelized on the n extra
>>> channels. Thanks to this, when the customer uses "-n 5", he can transfer
>>> his files up to 5Gb/s. There is no server side change. Everything is
>>> made on the client side.
>>
>> While the option sounds useful for niche cases, I'd be leery of
>> partial transfers and being compelled to replicate content to handle
>> partial transfers. rsync has been very good, for years, in completing
>> partial transfers.
>
> I can fully understand this. In our case, the network is not really
> crowded, as customers are generally using research / educational links.
> Indeed, this is totally a niche case, but still a need for us. The main
> use case is putting data you want to process into the cluster, and when
> the job is finished, getting the output of the process. There is rarely
> the need for synchronising files, except for the code you want to execute
> on the cluster, which is considered small compared to the data. rsync is
> the obvious choice for synchronising the code, but not for putting /
> getting huge amounts of data.
>
> The only other ssh-based tool that can speed up the transfer of one big
> file is lftp, and it only works for get commands, not for put commands.

Yeah, lftp can also support ftps. ftps is supported by the vsftpd FTP
server, and I use it in places where I do not want OpenSSH server's tendency
to let people with access look around the rest of the filesystem.
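For reference, the xargs-parallelized rsync mentioned above can look roughly
like the sketch below (host and paths are placeholders, and the plain
"ls | xargs" form assumes file names without whitespace). The --partial-dir
option keeps interrupted transfers resumable, and rsync writes each file to
a temporary name and only renames it into place once that file has
completed.

    # One rsync per top-level entry, up to 8 in parallel (placeholder paths/host):
    ls /data/src | xargs -P8 -I{} \
        rsync -a --partial --partial-dir=.rsync-partial \
            /data/src/{} user@cluster:/data/dest/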
Nico Kadel-Garcia wrote:
> in places where I do not want OpenSSH server's tendency to let
> people with access look around the rest of the filesystem.

If you want users to be able to use *only* SFTP, then set a ChrootDirectory
and ForceCommand internal-sftp in a Match block for the user in sshd_config.

//Peter
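A minimal sketch of the sshd_config Peter describes (the user name and
chroot path are placeholders; sshd additionally requires the chroot
directory and every component of its path to be owned by root and not
writable by group or others):

    # In sshd_config: confine one user to SFTP inside a chroot.
    Match User datauser
        ChrootDirectory /srv/sftp/%u
        ForceCommand internal-sftp
        AllowTcpForwarding no
        X11Forwarding no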
Bob Proulx
2020-Apr-12 21:34 UTC
Parallel transfers with sftp (call for testing / advice)

Cyril Servant wrote:
> >> 2. The solution
> >>
> >> I made some changes in the sftp client. The new option "-n" (defaults to
> >> 0) sets the number of extra channels. There is one main ssh channel, and
> >> n extra channels. The main ssh channel does everything, except the put
> >> and get commands. Put and get commands are parallelized on the n extra
> >> channels. Thanks to this, when the customer uses "-n 5", he can transfer
> >> his files up to 5Gb/s. There is no server side change. Everything is
> >> made on the client side.
...
> I can fully understand this. In our case, the network is not really
> crowded, as customers are generally using research / educational links.
> Indeed, this is totally a niche case, but still a need for us. The main
> use case is putting data you want to process into the cluster, and when
> the job is finished, getting the output of the process. There is rarely
> the need for synchronising files, except for the code you want to execute
> on the cluster, which is considered small compared to the data. rsync is
> the obvious choice for synchronising the code, but not for putting /
> getting huge amounts of data.

When I read through the thread, immediately I thought this would have a good
chance of speeding the transfer up by simulating sharing of the bandwidth
through unfair sharing. Which may still be fair. But that was my thought.

With some simplification for discussion... A network connection that is
handling multiple connections will share the bandwidth among those
connections. Let's assume every connection is using whatever the maximum
packet size is for that link and not worry about efficiency of different
sized packets on the link. The effect is that if two connections are
streaming then each will get half of the available bandwidth. And if 10 are
streaming then each will get 1/10th of the bandwidth. And if 100 then each
will get 1/100th.

Which means that *if* a connection is mostly operating near'ish capacity,
then a user can get more of it by using more parallel connections. As
scheduling walks through the connections giving each bandwidth in turn, the
one with more parallel connections will get more of it.

Let's assume a link with 20 connections streaming. If I add my own then I am
number 21 and everyone will get 1/21 of the bandwidth. But if instead I open
50 connections and transfer in parallel, then 20+50=70 and each connection
gets 1/70th of the bandwidth. But since 50 of those are mine, I get 50/70 of
the bandwidth for my own data, and each of the other 20 users gets 1/70.

I say this based upon experience with software that relied upon this
behavior transferring big data in my previous career. It's been a few years,
but we had the same issue and a very similar solution back then. Someone
hacked together a process to chop up big data files and transparently send
them in parallel, assembling them on the remote end. At the time, for our
case, it was leased lines between sites. Using this technique one group was
able to get high priority data transferred between sites faster. Faster by
slowing down the other people in the company who were also transferring data
across the same network links. But this was good because the high priority
data had priority. Whether that is good or bad depends upon all of the
details.

If the connection is my network, or my group's network, and this is the
priority task, then this is a good way to make use of more bandwidth. Think
of it like Quality of Service tuning.

If on the other hand it is me borrowing someone else's network and I am
negatively impacting them, then maybe not so good. For example, if I were
downloading data in my home neighborhood and preventing my neighbors from
streaming video during this time of "stay-at-home let's defeat the COVID-19
virus", then that would be a bad and unfair use.

It's neither good nor bad intrinsically; it's just how it is used that is
important.

I am thinking that this improves bandwidth for one connection by
implementing parallel connections, which allows a greater share of the
bandwidth. I might be completely wrong here. It's just my thought from
sitting back in my comfy chair while staying at home to defeat the virus.
And you asked for any thoughts...

Bob
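As a quick check of the arithmetic above (20 existing streams, and either 1
or 50 streams of my own), the per-user share works out to k/(N+k):

    awk 'BEGIN {
        N = 20                              # streams from other users
        for (k = 1; k <= 50; k += 49)       # my streams: 1, then 50
            printf "my streams=%2d  my share=%.1f%%\n", k, 100 * k / (N + k)
    }'
    # prints roughly 4.8% and 71.4%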
David Newall
2020-Apr-13 03:19 UTC
Parallel transfers with sftp (call for testing / advice)
What I noticed, from the "example" session, is that each connection goes to
a different IP address. I thought that was interesting because it implied
that the destination machines were bandwidth limited (and sending to
multiple interfaces produces greater nett bandwidth) but the source machine
was not. That surprised me.

It seems to me that the proposed solution is for a very specific situation
and unlikely to be of use to anyone else.

I also wonder if a better solution would be to bond all of the NICs in the
target machine instead of giving each a separate IP address. That should
result in a similar nett bandwidth but without needing to hack every high
data-volume protocol.

I think I'm struggling to see the advantage of modifying openssh to open
multiple parallel channels.