EKC
2006-Jul-31 04:47 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
I'm playing around with improving the read/write performance of a lustre 1.5.91 filesystem running on gigabit ethernet, and I am wondering whether anyone has experimented with using client-side block-level compression. Specifically, could LZW be used on the client side to compress/decompress reads and writes to lustre? LZW could be readily modified to make it less compute-intensive, and slightly less space-efficient, were that a concern.

There are some experimental FUSE filesystems with transparent compression:

http://www.miio.net/fusecompress/
http://north.one.pl/~kazik/pub/LZOlayer/

However, it would be nice to have this on Lustre. Is this a feature that ClusterFS has planned?

Thanks

eser
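[Editorial sketch: a minimal illustration of what "client-side block-level compression" means here, with Python's zlib standing in for LZW (which is not in the standard library) and a 1 MiB block size chosen arbitrarily. The data and names are made up for the example; this is not Lustre code.]

```python
import zlib

BLOCK = 1 << 20   # a 1 MiB client-side block, an arbitrary size for the sketch

def compress_block(data):
    """What the client would do before putting a block on the wire."""
    return zlib.compress(data, 1)     # cheap setting; LZW/LZO would slot in here

def decompress_block(payload):
    """What the client would do after fetching a compressed block."""
    return zlib.decompress(payload)

if __name__ == "__main__":
    # Text-like data compresses well; already-compressed or random data would not.
    block = (b"checkpoint record 00042: 3.14159 2.71828 1.41421\n" * 22000)[:BLOCK]
    wire = compress_block(block)
    assert decompress_block(wire) == block
    print(f"{len(block)} bytes -> {len(wire)} bytes "
          f"({len(wire) / len(block):.1%} of the original)")
```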
Oleg Drokin
2006-Jul-31 05:46 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
Hello!

On Mon, Jul 31, 2006 at 03:47:35AM -0700, EKC wrote:
> I'm playing around with improving the read/write performance of a
> lustre 1.5.91 filesystem running on gigabit ethernet, and I am
> wondering whether anyone has experimented with using client-side
> block-level compression?

Client-side means you will only have this for writes, right?

> Specifically, could LZW be used on the client-side to
> compress/decompress reads/writes to lustre? LZW could be readily
> modified to make it less compute intensive, and slightly less space
> efficient, were that a concern.

I remember there were some measurements showing that the tcp/ip overhead itself is so big that you can only saturate a ~1.5 Gbps link with a modern enough Opteron CPU. This does not leave all that much CPU to perform compression, especially taking into account that normally there are other jobs that need CPU too.

Bye,
    Oleg
Scott Atchley
2006-Jul-31 09:09 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
On Jul 31, 2006, at 7:21 AM, Oleg Drokin wrote:
> I remember there were some measurements showing that the tcp/ip
> overhead itself is so big that you can only saturate a ~1.5 Gbps link
> with a modern enough Opteron CPU. This does not leave all that much
> CPU to perform compression, especially taking into account that
> normally there are other jobs that need CPU too.

Hi Oleg,

Using our 10 Gb/s card in Ethernet mode with the SOCKLND driver, the Zero-Copy TCP patch, and Opteron 280s (dual-core, dual-cpu), I can get about 600 MB/s (4.8 Gb/s) from a single client for read and write. With three clients reading, I can nearly saturate the link (1,180 MB/s with 8 threads per client). CPU utilization never exceeded 30%. See this page for the full results:

https://mail.clusterfs.com/wikis/lustre/Myri-10G_Ethernet

The problem I see is that compression is not done in place, which means that the zero-copy performance gains are lost.

Scott
Peter J. Braam
2006-Jul-31 10:52 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
In fact we have talked about this. What we would really like to do is to let the clients compress / decompress and store the compressed data on the servers. There are lots of clients generally, and often, in HPC at least, there isn't much else they can do while they do IO, but the servers should definitely not be given the overhead of compression / decompression.

Andreas has explained to me that we could, for example, store the server files as sparse files and compress on X MB boundaries (X at least 1). There is apparently existing ext2 code that does some of this. How this fits in with the reservation-based allocator we are designing at the moment isn't completely clear to me yet, but Alex perhaps has some ideas about that.

This is particularly promising for one of our most common usage scenarios: checkpoint restart dumps. Although it hasn't been verified, the data in those dumps is slowly varying floating point data that is possibly extremely compressible.

- Peter -
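[Editorial sketch: a minimal picture of the sparse-file layout described above, assuming X = 1 (1 MB chunks), zlib standing in for whatever codec the clients would use, and illustrative file names; it is not what Lustre or the referenced ext2 code actually does. Each compressed chunk is written at its original chunk-aligned offset, so the saved space shows up as holes on the server.]

```python
import os
import zlib

CHUNK = 1 << 20   # "X MB boundaries" with X = 1, an assumed value

def write_compressed_sparse(src_path, dst_path):
    """Compress each CHUNK of src and write it at the chunk's original offset
    in dst; the unwritten tail of each chunk is left as a hole."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        offset = 0
        while True:
            chunk = src.read(CHUNK)
            if not chunk:
                break
            payload = zlib.compress(chunk, 1)
            if len(payload) >= len(chunk):     # incompressible: store verbatim
                payload = chunk
            dst.seek(offset)
            dst.write(payload)
            offset += CHUNK
        dst.truncate(offset)   # round the logical size up to a chunk boundary

if __name__ == "__main__":
    with open("plain.dat", "wb") as f:
        f.write(b"0.000123 0.000124 0.000125 checkpoint\n" * 200000)
    write_compressed_sparse("plain.dat", "sparse.dat")
    os.sync()                  # flush so st_blocks reflects real allocation
    st = os.stat("sparse.dat")
    print(f"logical {st.st_size} bytes, allocated ~{st.st_blocks * 512} bytes")
```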
EKC
2006-Aug-01 05:09 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
I've been digging further into the use of compression to increase networked-filesystem throughput, and it appears that Google is doing something similar on their own cluster filesystem (GFS).

There are some lecture notes on this here:
http://andrewhitchcock.org/?post=214

And a video of a lecture on this at the University of Washington here:
http://video.google.com/videoplay?docid=7278544055668715642

On 7/31/06, Peter J. Braam <braam@clusterfs.com> wrote:
> In fact we have talked about this. What we would really like to do is
> to let the clients compress / decompress and store the compressed data
> on the servers.
[snip]
Goswin von Brederlow
2006-Aug-01 08:42 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
"Peter J. Braam" <braam@clusterfs.com> writes:> In fact we have talked about this. What we would really like to do is > to let the clients compress / decompress and store the compressed data > on the servers. There are lots of clients generally, and often, in HPC > at least, there isn''t much else they can do when they do IO, but the > servers should definitely not be given the overhead of compression / > decompression. > > Andreas has explained to me that we could for example store the server > files as sparse files, and compress on X MB boundaries (X at least 1). > There is apparently existing ext2 code that does some of this. How this > fits in with a reservation based allocator we are designing at the > moment isn''t completely clear to me yet, but Alex perhaps has some ideas > about that. > > This is particularly promising for one of our most common usage > scenarios: checkpoint restart dumps. Although it hasn''t been verified, > the data in those dumps is slowly varying floating pointdata that is > possibly extremely compressible. > > - Peter -I would suggest ignoring the fact that compression makes data smaller and would allow storing it more compact. Instead, for each X MB block allocate the full X MB and store the compressed data in that block leaving any remaining space empty. For this to work there has to be one bit somewhere that says if a block is compressed or not so blocks that would grow by compression can be stored verbatim. By always using as much space as the uncompressed data would take the allocator code should remain unchanged and editing existing files should not cause problems, growing or shrinking of compressed blocks have no effect on the disk layout. The client could also abort compression when it detects that a block is quite uncompressable. Say the first 100K of a block don''t compress then there is probably little value in trying the rest. Just send it uncompressed. Same if the block only compresses by 1%. The time to uncompress 990K data on each client is probably longer than sending the extra 10K for a full 1MB block. MfG Goswin
Peter J. Braam
2006-Aug-01 09:21 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
Hi,

This has occurred to me also, but care is needed so that it doesn't eliminate the benefits we are after.

It is mandatory that what is written to the disks is still large enough to realize the full disk bandwidth. For example, writing chunks smaller than 1MB to many RAID arrays costs as much time as writing 1MB chunks. (The network is less sensitive to this, but in principle the same holds.)

So somewhere in the compression path we need to make sure that we continue to send large enough chunks over to the servers, and that these end up contiguously on disk.

This is not contradicting what you write, but it adds a dimension to the problem, namely knowing roughly how much you compress by, to avoid losing bandwidth somewhere along the line.

- Peter -
Goswin von Brederlow
2006-Aug-01 10:10 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
"Peter J. Braam" <braam@clusterfs.com> writes:> Hi > > This has occurred to me also, but it''s not clear that it would take care > to avoid it to eliminate the benefits we are after. > > It is mandatory that what is written to the disks is still large enough > to realize the full disk bandwidth. For example, writing less than 1MB > chunks to many RAID arrays costs equally much time as writing 1MB > chunks. (The network is less sensitive to this, but in principle the > same holds.) > > So somewhere in the compression path we need to make sure that we > continue to send large enough chunks over to the servers, and that these > end up contiguously on disk. > > This is not contradicting what you write but it adds a dimension to the > problem, namely knowing roughly how much you compress by to avoid losing > the bandwidth somewhere along the line. > > - Peter -I don''t see that as a problem. It is easy to get the raid to read/write at 300 MiB/s. Even if you only write 50% pay data on each 1MiB chunk you still have enough for the 1.5x GBit connections that would saturate the cpu. If you have highly compressed data and big read/write requests chunks larger than 1MiB might be better. With my idea you will have a seek after every chunk which might indead be as costly for the raid as writing the full chunk. MfG Goswin
Peter J. Braam
2006-Aug-01 10:16 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
The numbers given yesterday were wrong (they were from an ia32 era).

We are typically targeting servers with 2 Gbytes/sec now. 700 MB/s on a 10 GigE is normal, saturating 3-4 GigE takes, say, 2 64-bit processors, and 2-4 IB NICs will soon be used multirail.

It is absolutely critical to pipeline the IO with 100% efficiency.

- Peter -
Sean Ziegeler, Contractor
2006-Aug-01 11:00 UTC
[Lustre-discuss] IB Multirail (was LZW block-level compression)
On Tue, 2006-08-01 at 11:15, Peter J. Braam wrote:
[snip]
> We are typically targeting servers with 2 Gbytes/sec now. 700 MB/s on
> a 10 GigE is normal, saturating 3-4 GigE takes, say, 2 64-bit
> processors, and 2-4 IB NICs will soon be used multirail.
[snip]

Sorry to diverge from the topic, but I was under the impression that "channel-bonding-like" approaches weren't planned. That is, things like multirail IB wouldn't be supported by Lustre. Or do you mean some sort of driver-level support, transparent to all applications including Lustre? Or did I misunderstand completely?

Thanks,
Sean
Goswin von Brederlow
2006-Aug-02 10:47 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
"Peter J. Braam" <braam@clusterfs.com> writes:> The numbers given yesterday were wrong (they were from an ia32 era). > > We are typically targetting servers with 2 Gbytes / sec now. 700 MB on > a 10 GIGE is normal, saturating 3-4 Gige is wit say 2 64bit processors, > and 2-4 I/B NICs will be used multirail soon. > > It is absolutely critical to pipeline the IO with 100% efficiency. > > - Peter -Sure. With infiniband we get 600-700MiB/s speeds too. But the use case for compression was for "slow" networks. It certainly will not be good enough for fast networks. I wonder though how much penalty you get real live if you write half a MiB, seek half a MiB, write, seek, write, seek,... How much does that actualy cost? What about 2, 4, 8 MiB chunks? At some size it will certainly cost less than writing the full chunk. MfG Goswin
Peter J. Braam
2006-Aug-03 08:13 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
We are interested in using this on the fastest possible networks.

The breakover point of the IO chunks is highly array-dependent. 4MB is likely reasonable, assuming 2-3x compression.

- Peter -
Goswin von Brederlow
2006-Aug-04 08:28 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
"Peter J. Braam" <braam@clusterfs.com> writes:> We are interested in using this on the fastest possible networks.Then you will just waste cpu cycles. With infiniband and rdma you get the data directly into memory without the cpu touching it. Compression/Decompression will be a major slowdown at 300MiB/s (current speed we get for a single client). You would need some hardware compression/decompression module that can handle such speeds. MfG Goswin
Scott Atchley
2006-Aug-04 08:49 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
On Aug 4, 2006, at 10:23 AM, Goswin von Brederlow wrote:
> Then you will just waste cpu cycles. With infiniband and rdma you get
> the data directly into memory without the cpu touching it.
> Compression/decompression will be a major slowdown at 300MiB/s (the
> current speed we get for a single client). You would need some
> hardware compression/decompression module that can handle such speeds.

To use compression at all, you must be able to compress a block in less time than it takes to send the uncompressed block. If not, you will slow down the sends. To really make sense, you must be able to compress the block in less time than it takes to send the _compressed_ block. In this case, it is free (except for the initial N blocks, where N is the number of concurrent large message transfers).

This assumes we have cycles to burn on the client (the servers only handle the compressed or uncompressed data). This assumption is fine if the data is a multi-TB output data set that is being stored at the end of a batch process. This assumption is bad if the process is overlapping computation and storage.

When I was at UT working with the LoCI group on IBP, etc., we were able to store the original length and the compressed length in the inode equivalent (we called it an exNode). Since we were using TCP over relatively slow links (100 Mb/s or gigabit Ethernet), we could afford to compress. I'm not sure that holds at 9.4 Gb/s on Myri-10G using MXLND.

Scott
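[Editorial sketch: a quick way to test Scott's condition on a particular machine is to measure zlib level-1 throughput on a sample block and compare the resulting transfer times against a few link rates. The synthetic sample data and the choice of zlib -1 are stand-ins for real checkpoint data and a real LZW/LZO codec.]

```python
import time
import zlib

BLOCK = 1 << 20                    # 1 MiB block, an arbitrary sketch size
LINK_GBPS = [1, 10, 20]            # gigabit, Myri-10G, and a 2 GB/s-class pipe

def compress_throughput(block, rounds=20):
    """Measure zlib level-1 input throughput (MB/s) and compression ratio."""
    start = time.perf_counter()
    for _ in range(rounds):
        out = zlib.compress(block, 1)
    elapsed = time.perf_counter() - start
    return rounds * len(block) / elapsed / 1e6, len(out) / len(block)

if __name__ == "__main__":
    block = (b"t=0.0125 u=0.993812 v=0.006191 w=0.000301\n" * 25000)[:BLOCK]
    c_mbps, ratio = compress_throughput(block)
    print(f"compress: {c_mbps:.0f} MB/s of input, ratio {ratio:.2f}")
    for gbps in LINK_GBPS:
        wire_mbps = gbps * 1e9 / 8 / 1e6          # link rate in MB/s
        t_plain = len(block) / 1e6 / wire_mbps    # send the block uncompressed
        t_comp = len(block) / 1e6 / c_mbps + ratio * len(block) / 1e6 / wire_mbps
        verdict = "wins" if t_comp < t_plain else "loses"
        print(f"{gbps:2d} Gb/s link: compression {verdict} "
              f"({t_comp * 1e3:.2f} ms vs {t_plain * 1e3:.2f} ms per block)")
```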
Peter J. Braam
2006-Aug-04 13:11 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
Don't forget you have 1000's of clients compressing vs a few servers. Also, we are fighting disk bandwidth typically, not the network.

There is possibly a huge benefit to this for dumping checkpoint-restore data: clients usually aren't doing much else during dumping, and if the data compresses, which is very likely, it can dramatically cut down on the IO bottleneck.

- Peter -
Nikita Danilov
2006-Aug-04 13:21 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
Peter J. Braam writes:
> Don't forget you have 1000's of clients compressing vs a few servers.
> Also, we are fighting disk bandwidth typically, not the network.
>
> There is possibly a huge benefit to this for dumping checkpoint-restore
> data: clients usually aren't doing much else during dumping, and if the
> data compresses, which is very likely, it can dramatically cut down on
> the IO bottleneck.

Reiser4 has (not yet production-ready) compression support, and it proved to be an advantage even for a local file system. Two points of interest are that one wants to compress large chunks of data at once (which, in Lustre's case, probably implies making the per-client data cache larger), and to use computationally cheap --even if sub-optimal-- compression algorithms.

Nikita.
Jean-Marc Saffroy
2006-Aug-04 13:53 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
On Fri, 4 Aug 2006, Peter J. Braam wrote:
> Don't forget you have 1000's of clients compressing vs a few servers.
> Also, we are fighting disk bandwidth typically, not the network.

Agreed. With simple math, I find that if:

  T is the per-processor I/O throughput
  C is the per-processor compression throughput
  R is the compression ratio (i.e. compressed size / original size)

then, if I'm not mistaken, for a write-I/O-bound program (yes, this is not every workload), the time W to write S bytes is:

  - without compression: W = S / T
  - with compression:    W = S / C + S * R / T

Thus, compression can speed things up if:

  T / C + R < 1

For example, on a recent PC I observed with gzip -1:

  - with random data: R = 1,     C = 15 MB/s
  - with zeroes:      R = 1/200, C = 100 MB/s
  - with a vmlinux:   R = 1/2,   C = 10 MB/s

I suspect specialized compression schemes achieve higher speeds and lower compression ratios, but let's keep the vmlinux figures. If a processor can compress at C = 10 MB/s and achieve R = 1/2, the condition above says it makes sense to compress with gzip -1 only if T < C * (1 - R) = 5 MB/s.

If a cluster has, say, 1k processors, this means its global I/O throughput has to be less than 5 GB/s for gzip -1 to be useful when writing vmlinux files.

Chip makers sell their processors for peanuts ;-) but a storage cluster of RAID arrays and servers yielding 5 GB/s is not that cheap.

> There is possibly a huge benefit to this for dumping checkpoint-restore
> data: clients usually aren't doing much else during dumping, and if the
> data compresses, which is very likely, it can dramatically cut down on
> the IO bottleneck.

Isn't it easier to add compression to the user-space checkpoint code?

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net
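[Editorial sketch: a few lines that plug the gzip -1 figures above into this model and print where compression wins. The data points are Jean-Marc's; the break-even column is just T < C * (1 - R), rearranged from the inequality above, and the tested T values are arbitrary.]

```python
# Jean-Marc's model: without compression W = S/T; with it W = S/C + S*R/T.
# Compression pays off when T/C + R < 1, i.e. T < C * (1 - R).

CASES = {                 # gzip -1 figures quoted in the message above
    "random data": (1.0,       15.0),   # (R, C in MB/s)
    "zeroes":      (1.0 / 200, 100.0),
    "vmlinux":     (0.5,       10.0),
}

S = 100.0  # MB written; an arbitrary scale that cancels out of the comparison

for name, (R, C) in CASES.items():
    t_max = C * (1.0 - R)                       # per-processor break-even I/O rate
    for T in (5.0, 20.0, 100.0):                # per-processor I/O throughput, MB/s
        plain = S / T
        comp = S / C + S * R / T
        better = "compress" if comp < plain else "send raw"
        print(f"{name:12s} T={T:5.1f} MB/s -> {better:8s} "
              f"(break-even T < {t_max:.1f} MB/s)")
```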
Goswin von Brederlow
2006-Aug-05 10:11 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
Scott Atchley <atchley@myri.com> writes:
> To use compression at all, you must be able to compress a block in
> less time than it takes to send the uncompressed block. If not, you
> will slow down the sends. To really make sense, you must be able to
> compress the block in less time than it takes to send the
> _compressed_ block. In this case, it is free (except for the initial
> N blocks, where N is the number of concurrent large message transfers).

The time to compress and send the compressed data should be less than the time to send the uncompressed data. In that case it is always free (but still at cpu cost). Otherwise it just helps to fight a bottleneck in the transport layer, e.g. just one slow GBit line.

> When I was at UT working with the LoCI group on IBP, etc., we were
> able to store the original length and the compressed length in the
> inode equivalent (we called it an exNode). Since we were using TCP
> over relatively slow links (100 Mb/s or gigabit Ethernet), we could
> afford to compress. I'm not sure that holds at 9.4 Gb/s on Myri-10G
> using MXLND.

I would say quite the opposite. Getting a higher throughput with compression compared to without will be nearly impossible. And then it still costs you cpu. Not only on save but on read too. Think of what happens if one node saves calculation data compressed and then all the other nodes have to read it for the next run.

MfG
        Goswin

PS: can Lustre do multicast if two or more nodes read the same file?
Goswin von Brederlow
2006-Aug-05 10:22 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
Jean-Marc Saffroy <jean-marc.saffroy@ext.bull.net> writes:
> If a cluster has, say, 1k processors, this means its global I/O
> throughput has to be less than 5 GB/s for gzip -1 to be useful when
> writing vmlinux files.
>
> Chip makers sell their processors for peanuts ;-) but a storage
> cluster of RAID arrays and servers yielding 5 GB/s is not that cheap.

You assume they all write at the same time. A lot of the time you have I/O and computations interleaved, and you also have local caches on the servers and raid boxes allowing for huge spikes when writing data.

Another thing is that previously I assumed the bottleneck is the network and not the disk. I said to store compressed blocks just like uncompressed blocks but with extra unused space at the end.

Advantages:

- All allocations are still blocks. One simple big unit to deal with. No big change for the allocator.
- Rewriting a file will not suddenly run out of disk space. A file will never use more space than the raw data contained in it.
- Rewriting a block does not create (more) gaps in the file or run out of space for that block, requiring a relocation (and fragmentation).

Disadvantage:

- All compressed blocks are followed by a gap. As said, raid speed can be slowed down by that to match the speed of uncompressed blocks.

To get better disk performance for compressed blocks you would probably have to change a ton of code: the layout and allocation functions, add a block reallocator, and somehow handle files growing on rewrites (probably just count free space as if files were uncompressed).

MfG
        Goswin
Jean-Marc Saffroy
2006-Aug-05 19:18 UTC
[Lustre-discuss] LZW block-level compression for improving lustre read/write speeds?
On Sat, 5 Aug 2006, Goswin von Brederlow wrote:
> Jean-Marc Saffroy <jean-marc.saffroy@ext.bull.net> writes:
>> If a cluster has, say, 1k processors, this means its global I/O
>> throughput has to be less than 5 GB/s for gzip -1 to be useful when
>> writing vmlinux files.
>>
>> Chip makers sell their processors for peanuts ;-) but a storage
>> cluster of RAID arrays and servers yielding 5 GB/s is not that cheap.
>
> You assume they all write at the same time.

Yes, and that is true of some workloads, such as a checkpoint/restart operation, which is not rare I think.

> A lot of the time you have I/O and computations interleaved, and you
> also have local caches on the servers and raid boxes allowing for huge
> spikes when writing data.

Caches on servers and storage systems are not that big, if you consider that there is often something like one I/O server for 10 compute nodes, which each have loads of RAM for applications.

> Another thing is that previously I assumed the bottleneck is the
> network and not the disk.

This can happen, but if performance is a concern then it's a waste of disk bandwidth, which is awfully expensive compared to network bandwidth.

--
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net