EKC
2006-Jul-31 04:47 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
I'm playing around with improving the read/write performance of a lustre 1.5.91 filesystem running on gigabit ethernet, and I am wondering whether anyone has experimented with using client-side block-level compression. Specifically, could LZW be used on the client side to compress/decompress reads and writes to lustre? LZW could be readily modified to make it less compute-intensive, and slightly less space-efficient, were that a concern.

There are some experimental FUSE filesystems with transparent compression:

http://www.miio.net/fusecompress/
http://north.one.pl/~kazik/pub/LZOlayer/

However, it would be nice to have this on Lustre. Is this a feature that ClusterFS has planned?

Thanks

eser
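[Editorial sketch: a minimal illustration of what "client-side block-level compression" means here, with Python's zlib standing in for LZW (which is not in the standard library) and a 1 MiB block size chosen arbitrarily. The data and names are made up for the example; this is not Lustre code.]

```python
import zlib

BLOCK = 1 << 20   # a 1 MiB client-side block, an arbitrary size for the sketch

def compress_block(data):
    """What the client would do before putting a block on the wire."""
    return zlib.compress(data, 1)     # cheap setting; LZW/LZO would slot in here

def decompress_block(payload):
    """What the client would do after fetching a compressed block."""
    return zlib.decompress(payload)

if __name__ == "__main__":
    # Text-like data compresses well; already-compressed or random data would not.
    block = (b"checkpoint record 00042: 3.14159 2.71828 1.41421\n" * 22000)[:BLOCK]
    wire = compress_block(block)
    assert decompress_block(wire) == block
    print(f"{len(block)} bytes -> {len(wire)} bytes "
          f"({len(wire) / len(block):.1%} of the original)")
```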
Oleg Drokin
2006-Jul-31 05:46 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
Hello!

On Mon, Jul 31, 2006 at 03:47:35AM -0700, EKC wrote:
> I'm playing around with improving the read/write performance of a
> lustre 1.5.91 filesystem running on gigabit ethernet, and I am
> wondering whether anyone has experimented with using client-side
> block-level compression?

Client-side means you will only have this for writes, right?

> Specifically, could LZW be used on the client-side to
> compress/decompress reads/writes to lustre? LZW could be readily
> modified to make it less compute intensive, and slightly less space
> efficient, were that a concern.

I remember there were some measurements showing that the tcp/ip overhead itself is so big that you can only saturate a ~1.5 Gbps link with a modern enough Opteron CPU. This does not leave all that much CPU to perform compression, especially taking into account that normally there are other jobs that need CPU too.

Bye,
    Oleg
Scott Atchley
2006-Jul-31 09:09 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
On Jul 31, 2006, at 7:21 AM, Oleg Drokin wrote:
> I remember there were some measurements showing that the tcp/ip
> overhead itself is so big that you can only saturate a ~1.5 Gbps link
> with a modern enough Opteron CPU. This does not leave all that much
> CPU to perform compression, especially taking into account that
> normally there are other jobs that need CPU too.

Hi Oleg,

Using our 10 Gb/s card in Ethernet mode with the SOCKLND driver, the Zero-Copy TCP patch, and Opteron 280s (dual-core, dual-cpu), I can get about 600 MB/s (4.8 Gb/s) from a single client for read and write. With three clients reading, I can nearly saturate the link (1,180 MB/s with 8 threads per client). CPU utilization never exceeded 30%. See this page for the full results:

https://mail.clusterfs.com/wikis/lustre/Myri-10G_Ethernet

The problem I see is that compression is not done in place, which means that the zero-copy performance gains are lost.

Scott
Peter J. Braam
2006-Jul-31 10:52 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
In fact we have talked about this. What we would really like to do is to let the clients compress / decompress and store the compressed data on the servers. There are lots of clients generally, and often, in HPC at least, there isn't much else they can do while they do IO, but the servers should definitely not be given the overhead of compression / decompression.

Andreas has explained to me that we could, for example, store the server files as sparse files and compress on X MB boundaries (X at least 1). There is apparently existing ext2 code that does some of this. How this fits in with the reservation-based allocator we are designing at the moment isn't completely clear to me yet, but Alex perhaps has some ideas about that.

This is particularly promising for one of our most common usage scenarios: checkpoint restart dumps. Although it hasn't been verified, the data in those dumps is slowly varying floating point data that is possibly extremely compressible.

- Peter -
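[Editorial sketch: a minimal picture of the sparse-file layout described above, assuming X = 1 (1 MB chunks), zlib standing in for whatever codec the clients would use, and illustrative file names; it is not what Lustre or the referenced ext2 code actually does. Each compressed chunk is written at its original chunk-aligned offset, so the saved space shows up as holes on the server.]

```python
import os
import zlib

CHUNK = 1 << 20   # "X MB boundaries" with X = 1, an assumed value

def write_compressed_sparse(src_path, dst_path):
    """Compress each CHUNK of src and write it at the chunk's original offset
    in dst; the unwritten tail of each chunk is left as a hole."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        offset = 0
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            payload = zlib.compress(chunk, 1)
            if len(payload) >= len(chunk):     # incompressible: store verbatim
                payload = chunk
            dst.seek(offset)
            dst.write(payload)
            offset += CHUNK
        dst.truncate(offset)   # round the logical size up to a chunk boundary

if __name__ == "__main__":
    with open("plain.dat", "wb") as f:
        f.write(b"0.000123 0.000124 0.000125 checkpoint\n" * 200000)
    write_compressed_sparse("plain.dat", "sparse.dat")
    os.sync()                  # flush so st_blocks reflects real allocation
    st = os.stat("sparse.dat")
    print(f"logical {st.st_size} bytes, allocated ~{st.st_blocks * 512} bytes")
```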
EKC
2006-Aug-01 05:09 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
I've been digging further into the use of compression to increase networked-filesystem throughput, and it appears that Google is doing something similar on their own cluster filesystem (GFS).

There are some lecture notes on this here:
http://andrewhitchcock.org/?post=214

And a video of a lecture on this at the University of Washington here:
http://video.google.com/videoplay?docid=7278544055668715642

On 7/31/06, Peter J. Braam <braam@clusterfs.com> wrote:
> In fact we have talked about this. What we would really like to do is
> to let the clients compress / decompress and store the compressed data
> on the servers.
[snip]
Goswin von Brederlow
2006-Aug-01 08:42 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
"Peter J. Braam" <braam@clusterfs.com> writes:> In fact we have talked about this. What we would really like to do is > to let the clients compress / decompress and store the compressed data > on the servers. There are lots of clients generally, and often, in HPC > at least, there isn''t much else they can do when they do IO, but the > servers should definitely not be given the overhead of compression / > decompression. > > Andreas has explained to me that we could for example store the server > files as sparse files, and compress on X MB boundaries (X at least 1). > There is apparently existing ext2 code that does some of this. How this > fits in with a reservation based allocator we are designing at the > moment isn''t completely clear to me yet, but Alex perhaps has some ideas > about that. > > This is particularly promising for one of our most common usage > scenarios: checkpoint restart dumps. Although it hasn''t been verified, > the data in those dumps is slowly varying floating pointdata that is > possibly extremely compressible. > > - Peter -I would suggest ignoring the fact that compression makes data smaller and would allow storing it more compact. Instead, for each X MB block allocate the full X MB and store the compressed data in that block leaving any remaining space empty. For this to work there has to be one bit somewhere that says if a block is compressed or not so blocks that would grow by compression can be stored verbatim. By always using as much space as the uncompressed data would take the allocator code should remain unchanged and editing existing files should not cause problems, growing or shrinking of compressed blocks have no effect on the disk layout. The client could also abort compression when it detects that a block is quite uncompressable. Say the first 100K of a block don''t compress then there is probably little value in trying the rest. Just send it uncompressed. Same if the block only compresses by 1%. The time to uncompress 990K data on each client is probably longer than sending the extra 10K for a full 1MB block. MfG Goswin
Peter J. Braam
2006-Aug-01 09:21 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
Hi,

This has occurred to me also, but care is needed so that it doesn't eliminate the benefits we are after.

It is mandatory that what is written to the disks is still large enough to realize the full disk bandwidth. For example, writing chunks smaller than 1MB to many RAID arrays costs as much time as writing 1MB chunks. (The network is less sensitive to this, but in principle the same holds.)

So somewhere in the compression path we need to make sure that we continue to send large enough chunks over to the servers, and that these end up contiguously on disk.

This is not contradicting what you write, but it adds a dimension to the problem, namely knowing roughly how much you compress by, to avoid losing bandwidth somewhere along the line.

- Peter -
Goswin von Brederlow
2006-Aug-01 10:10 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
"Peter J. Braam" <braam@clusterfs.com> writes:> Hi > > This has occurred to me also, but it''s not clear that it would take care > to avoid it to eliminate the benefits we are after. > > It is mandatory that what is written to the disks is still large enough > to realize the full disk bandwidth. For example, writing less than 1MB > chunks to many RAID arrays costs equally much time as writing 1MB > chunks. (The network is less sensitive to this, but in principle the > same holds.) > > So somewhere in the compression path we need to make sure that we > continue to send large enough chunks over to the servers, and that these > end up contiguously on disk. > > This is not contradicting what you write but it adds a dimension to the > problem, namely knowing roughly how much you compress by to avoid losing > the bandwidth somewhere along the line. > > - Peter -I don''t see that as a problem. It is easy to get the raid to read/write at 300 MiB/s. Even if you only write 50% pay data on each 1MiB chunk you still have enough for the 1.5x GBit connections that would saturate the cpu. If you have highly compressed data and big read/write requests chunks larger than 1MiB might be better. With my idea you will have a seek after every chunk which might indead be as costly for the raid as writing the full chunk. MfG Goswin
Peter J. Braam
2006-Aug-01 10:16 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
The numbers given yesterday were wrong (they were from an ia32 era).

We are typically targeting servers with 2 Gbytes/sec now. 700 MB/s on a 10 GigE is normal, saturating 3-4 GigE takes, say, 2 64-bit processors, and 2-4 IB NICs will soon be used multirail.

It is absolutely critical to pipeline the IO with 100% efficiency.

- Peter -
Sean Ziegeler, Contractor
2006-Aug-01 11:00 UTC
[Lustre-discuss] IB Multirail (was LZW block-level compression)
On Tue, 2006-08-01 at 11:15, Peter J. Braam wrote:
[snip]
> We are typically targeting servers with 2 Gbytes/sec now. 700 MB/s on
> a 10 GigE is normal, saturating 3-4 GigE takes, say, 2 64-bit
> processors, and 2-4 IB NICs will soon be used multirail.
[snip]

Sorry to diverge from the topic, but I was under the impression that "channel-bonding-like" approaches weren't planned. That is, things like multirail IB wouldn't be supported by Lustre. Or do you mean some sort of driver-level support, transparent to all applications including Lustre? Or did I misunderstand completely?

Thanks,
Sean
Goswin von Brederlow
2006-Aug-02 10:47 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
"Peter J. Braam" <braam@clusterfs.com> writes:> The numbers given yesterday were wrong (they were from an ia32 era). > > We are typically targetting servers with 2 Gbytes / sec now. 700 MB on > a 10 GIGE is normal, saturating 3-4 Gige is wit say 2 64bit processors, > and 2-4 I/B NICs will be used multirail soon. > > It is absolutely critical to pipeline the IO with 100% efficiency. > > - Peter -Sure. With infiniband we get 600-700MiB/s speeds too. But the use case for compression was for "slow" networks. It certainly will not be good enough for fast networks. I wonder though how much penalty you get real live if you write half a MiB, seek half a MiB, write, seek, write, seek,... How much does that actualy cost? What about 2, 4, 8 MiB chunks? At some size it will certainly cost less than writing the full chunk. MfG Goswin
Peter J. Braam
2006-Aug-03 08:13 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
We are interested in using this on the fastest possible networks.

The breakover point of the IO chunks is highly array-dependent. 4MB is likely reasonable, assuming 2-3x compression.

- Peter -
Goswin von Brederlow
2006-Aug-04 08:28 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
"Peter J. Braam" <braam@clusterfs.com> writes:> We are interested in using this on the fastest possible networks.Then you will just waste cpu cycles. With infiniband and rdma you get the data directly into memory without the cpu touching it. Compression/Decompression will be a major slowdown at 300MiB/s (current speed we get for a single client). You would need some hardware compression/decompression module that can handle such speeds. MfG Goswin
Scott Atchley
2006-Aug-04 08:49 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
On Aug 4, 2006, at 10:23 AM, Goswin von Brederlow wrote:
> Then you will just waste cpu cycles. With infiniband and rdma you get
> the data directly into memory without the cpu touching it.
> Compression/decompression will be a major slowdown at 300MiB/s (the
> current speed we get for a single client). You would need some
> hardware compression/decompression module that can handle such speeds.

To use compression at all, you must be able to compress a block in less time than it takes to send the uncompressed block. If not, you will slow down the sends. To really make sense, you must be able to compress the block in less time than it takes to send the _compressed_ block. In this case, it is free (except for the initial N blocks, where N is the number of concurrent large message transfers).

This assumes we have cycles to burn on the client (the servers only handle the compressed or uncompressed data). This assumption is fine if the data is a multi-TB output data set that is being stored at the end of a batch process. This assumption is bad if the process is overlapping computation and storage.

When I was at UT working with the LoCI group on IBP, etc., we were able to store the original length and the compressed length in the inode equivalent (we called it an exNode). Since we were using TCP over relatively slow links (100 Mb/s or gigabit Ethernet), we could afford to compress. I'm not sure that holds at 9.4 Gb/s on Myri-10G using MXLND.

Scott
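[Editorial sketch: a quick way to test Scott's condition on a particular machine is to measure zlib level-1 throughput on a sample block and compare the resulting transfer times against a few link rates. The synthetic sample data and the choice of zlib -1 are stand-ins for real checkpoint data and a real LZW/LZO codec.]

```python
import time
import zlib

BLOCK = 1 << 20                    # 1 MiB block, an arbitrary sketch size
LINK_GBPS = [1, 10, 20]            # gigabit, Myri-10G, and a 2 GB/s-class pipe

def compress_throughput(block, rounds=20):
    """Measure zlib level-1 input throughput (MB/s) and compression ratio."""
    start = time.perf_counter()
    for _ in range(rounds):
        out = zlib.compress(block, 1)
    elapsed = time.perf_counter() - start
    return rounds * len(block) / elapsed / 1e6, len(out) / len(block)

if __name__ == "__main__":
    block = (b"t=0.0125 u=0.993812 v=0.006191 w=0.000301\n" * 25000)[:BLOCK]
    c_mbps, ratio = compress_throughput(block)
    print(f"compress: {c_mbps:.0f} MB/s of input, ratio {ratio:.2f}")
    for gbps in LINK_GBPS:
        wire_mbps = gbps * 1e9 / 8 / 1e6          # link rate in MB/s
        t_plain = len(block) / 1e6 / wire_mbps    # send the block uncompressed
        t_comp = len(block) / 1e6 / c_mbps + ratio * len(block) / 1e6 / wire_mbps
        verdict = "wins" if t_comp < t_plain else "loses"
        print(f"{gbps:2d} Gb/s link: compression {verdict} "
              f"({t_comp * 1e3:.2f} ms vs {t_plain * 1e3:.2f} ms per block)")
```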
Peter J. Braam
2006-Aug-04 13:11 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
Don't forget you have 1000's of clients compressing vs a few servers. Also, we are fighting disk bandwidth typically, not the network.

There is possibly a huge benefit to this for dumping checkpoint-restore data: clients usually aren't doing much else during dumping, and if the data compresses, which is very likely, it can dramatically cut down on the IO bottleneck.

- Peter -
Nikita Danilov
2006-Aug-04 13:21 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
Peter J. Braam writes:
> Don't forget you have 1000's of clients compressing vs a few servers.
> Also, we are fighting disk bandwidth typically, not the network.
>
> There is possibly a huge benefit to this for dumping checkpoint-restore
> data: clients usually aren't doing much else during dumping, and if the
> data compresses, which is very likely, it can dramatically cut down on
> the IO bottleneck.

Reiser4 has (not yet production-ready) compression support, and it proved to be an advantage even for a local file system. Two points of interest are that one wants to compress large chunks of data at once (which, in Lustre's case, probably implies making the per-client data cache larger), and to use computationally cheap --even if sub-optimal-- compression algorithms.

Nikita.
Jean-Marc Saffroy
2006-Aug-04 13:53 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
On Fri, 4 Aug 2006, Peter J. Braam wrote:
> Don't forget you have 1000's of clients compressing vs a few servers.
> Also, we are fighting disk bandwidth typically, not the network.

Agreed. With simple math, I find that if:

  T is the per-processor I/O throughput
  C is the per-processor compression throughput
  R is the compression ratio (i.e. compressed size / original size)

then, if I'm not mistaken, for a write-I/O-bound program (yes, this is not every workload), the time W to write S bytes is:

  - without compression: W = S / T
  - with compression:    W = S / C + S * R / T

Thus, compression can speed things up if:

  T / C + R < 1

For example, on a recent PC I observed with gzip -1:

  - with random data: R = 1,     C = 15 MB/s
  - with zeroes:      R = 1/200, C = 100 MB/s
  - with a vmlinux:   R = 1/2,   C = 10 MB/s

I suspect specialized compression schemes achieve higher speeds and lower compression ratios, but let's keep the vmlinux figures. If a processor can compress at C = 10 MB/s and achieve R = 1/2, the condition above says it makes sense to compress with gzip -1 only if T < C * (1 - R) = 5 MB/s.

If a cluster has, say, 1k processors, this means its global I/O throughput has to be less than 5 GB/s for gzip -1 to be useful when writing vmlinux files.

Chip makers sell their processors for peanuts ;-) but a storage cluster of RAID arrays and servers yielding 5 GB/s is not that cheap.

> There is possibly a huge benefit to this for dumping checkpoint-restore
> data: clients usually aren't doing much else during dumping, and if the
> data compresses, which is very likely, it can dramatically cut down on
> the IO bottleneck.

Isn't it easier to add compression to the user-space checkpoint code?

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net
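[Editorial sketch: a few lines that plug the gzip -1 figures above into this model and print where compression wins. The data points are Jean-Marc's; the break-even column is just T < C * (1 - R), rearranged from the inequality above, and the tested T values are arbitrary.]

```python
# Jean-Marc's model: without compression W = S/T; with it W = S/C + S*R/T.
# Compression pays off when T/C + R < 1, i.e. T < C * (1 - R).

CASES = {                 # gzip -1 figures quoted in the message above
    "random data": (1.0,       15.0),   # (R, C in MB/s)
    "zeroes":      (1.0 / 200, 100.0),
    "vmlinux":     (0.5,       10.0),
}

S = 100.0  # MB written; an arbitrary scale that cancels out of the comparison

for name, (R, C) in CASES.items():
    t_max = C * (1.0 - R)                       # per-processor break-even I/O rate
    for T in (5.0, 20.0, 100.0):                # per-processor I/O throughput, MB/s
        plain = S / T
        comp = S / C + S * R / T
        better = "compress" if comp < plain else "send raw"
        print(f"{name:12s} T={T:5.1f} MB/s -> {better:8s} "
              f"(break-even T < {t_max:.1f} MB/s)")
```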
Goswin von Brederlow
2006-Aug-05 10:11 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
Scott Atchley <atchley@myri.com> writes:
> To use compression at all, you must be able to compress a block in
> less time than it takes to send the uncompressed block. If not, you
> will slow down the sends. To really make sense, you must be able to
> compress the block in less time than it takes to send the
> _compressed_ block. In this case, it is free (except for the initial
> N blocks, where N is the number of concurrent large message transfers).

The time to compress and send the compressed data should be less than the time to send the uncompressed data. In that case it is always free (but still at cpu cost). Otherwise it just helps to fight a bottleneck in the transport layer, e.g. just one slow GBit line.

> When I was at UT working with the LoCI group on IBP, etc., we were
> able to store the original length and the compressed length in the
> inode equivalent (we called it an exNode). Since we were using TCP
> over relatively slow links (100 Mb/s or gigabit Ethernet), we could
> afford to compress. I'm not sure that holds at 9.4 Gb/s on Myri-10G
> using MXLND.

I would say quite the opposite. Getting a higher throughput with compression compared to without will be nearly impossible. And then it still costs you cpu. Not only on save but on read too. Think of what happens if one node saves calculation data compressed and then all the other nodes have to read it for the next run.

MfG
        Goswin

PS: can Lustre do multicast if two or more nodes read the same file?
Goswin von Brederlow
2006-Aug-05 10:22 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
Jean-Marc Saffroy <jean-marc.saffroy@ext.bull.net> writes:
> If a cluster has, say, 1k processors, this means its global I/O
> throughput has to be less than 5 GB/s for gzip -1 to be useful when
> writing vmlinux files.
>
> Chip makers sell their processors for peanuts ;-) but a storage
> cluster of RAID arrays and servers yielding 5 GB/s is not that cheap.

You assume they all write at the same time. A lot of the time you have I/O and computations interleaved, and you also have local caches on the servers and raid boxes allowing for huge spikes when writing data.

Another thing is that previously I assumed the bottleneck is the network and not the disk. I said to store compressed blocks just like uncompressed blocks but with extra unused space at the end.

Advantages:

- All allocations are still blocks. One simple big unit to deal with. No big change for the allocator.
- Rewriting a file will not suddenly run out of disk space. A file will never use more space than the raw data contained in it.
- Rewriting a block does not create (more) gaps in the file or run out of space for that block, requiring a relocation (and fragmentation).

Disadvantage:

- All compressed blocks are followed by a gap. As said, raid speed can be slowed down by that to match the speed of uncompressed blocks.

To get better disk performance for compressed blocks you would probably have to change a ton of code: the layout and allocation functions, add a block reallocator, and somehow handle files growing on rewrites (probably just count free space as if files were uncompressed).

MfG
        Goswin
Jean-Marc Saffroy
2006-Aug-05 19:18 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
On Sat, 5 Aug 2006, Goswin von Brederlow wrote:
> Jean-Marc Saffroy <jean-marc.saffroy@ext.bull.net> writes:
>> If a cluster has, say, 1k processors, this means its global I/O
>> throughput has to be less than 5 GB/s for gzip -1 to be useful when
>> writing vmlinux files.
>>
>> Chip makers sell their processors for peanuts ;-) but a storage
>> cluster of RAID arrays and servers yielding 5 GB/s is not that cheap.
>
> You assume they all write at the same time.

Yes, and that is true of some workloads, such as a checkpoint/restart operation, which is not rare I think.

> A lot of the time you have I/O and computations interleaved, and you
> also have local caches on the servers and raid boxes allowing for huge
> spikes when writing data.

Caches on servers and storage systems are not that big, if you consider that there is often something like one I/O server for 10 compute nodes, which each have loads of RAM for applications.

> Another thing is that previously I assumed the bottleneck is the
> network and not the disk.

This can happen, but if performance is a concern then it's a waste of disk bandwidth, which is awfully expensive compared to network bandwidth.

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net