Hi,

I am dealing with very large datasets and it takes a long time to save a
workspace image.

The options to save compressed data are: "gzip", "bzip2" or "xz", the
default being gzip. I wonder if it's possible to include the pbzip2
(http://compression.ca/pbzip2/) algorithm as an option when saving.

"PBZIP2 is a parallel implementation of the bzip2 block-sorting file
compressor that uses pthreads and achieves near-linear speedup on SMP
machines. The output of this version is fully compatible with bzip2
v1.0.2 or newer"

I tested this as follows with one of my smaller datasets, having only
read in the raw data:

===========
# Dumped an ascii image
save.image(file='test', ascii=TRUE)

# At the shell prompt:
ls -l test
-rw-rw-r--. 1 swmorris swmorris 1794473126 Jan 14 17:33 test

time bzip2 -9 test
364.702u 3.148s 6:14.01 98.3%  0+0k 48+1273976io 1pf+0w

time pbzip2 -9 test
422.080u 18.708s 0:11.49 3836.2%  0+0k 0+1274176io 0pf+0w
===========

As you can see, bzip2 on its own took over 6 minutes whereas pbzip2 took
11 seconds, admittedly on a 64 core machine (running at 50% load). Most
modern machines are multicore so everyone would get some speedup.

Is this feasible/practical? I am not a developer so I'm afraid this
would be down to someone else...

Thoughts?

Cheers,

Stewart

--
Stewart W. Morris
Centre for Genomic and Experimental Medicine
The University of Edinburgh
United Kingdom

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
On 15/01/2015 12:45, Stewart Morris wrote:
> Hi,
>
> I am dealing with very large datasets and it takes a long time to save a
> workspace image.

Sounds like bad practice on your part ... saving images is not
recommended for careful work.

> The options to save compressed data are: "gzip", "bzip2" or "xz", the
> default being gzip. I wonder if it's possible to include the pbzip2
> (http://compression.ca/pbzip2/) algorithm as an option when saving.

It is not an 'algorithm', it is a command-line utility widely available
for Linux at least.

> "PBZIP2 is a parallel implementation of the bzip2 block-sorting file
> compressor that uses pthreads and achieves near-linear speedup on SMP
> machines. The output of this version is fully compatible with bzip2
> v1.0.2 or newer"
>
> I tested this as follows with one of my smaller datasets, having only
> read in the raw data:
>
> ===========
> # Dumped an ascii image
> save.image(file='test', ascii=TRUE)

Why do that if you are at all interested in speed? A pointless (and
inaccurate) binary to decimal conversion is needed.

> # At the shell prompt:
> ls -l test
> -rw-rw-r--. 1 swmorris swmorris 1794473126 Jan 14 17:33 test
>
> time bzip2 -9 test
> 364.702u 3.148s 6:14.01 98.3%  0+0k 48+1273976io 1pf+0w
>
> time pbzip2 -9 test
> 422.080u 18.708s 0:11.49 3836.2%  0+0k 0+1274176io 0pf+0w
> ===========
>
> As you can see, bzip2 on its own took over 6 minutes whereas pbzip2 took
> 11 seconds, admittedly on a 64 core machine (running at 50% load). Most
> modern machines are multicore so everyone would get some speedup.

But R does not by default save bzip2-ed ASCII images ... and gzip is the
default because its speed/compression tradeoffs (see ?save) are best for
the typical R user.

And your last point is a common misunderstanding: that people typically
have lots of spare cores which are zero-price. Even on my 8 (virtual)
core desktop, when I typically do have spare cores, using them has a
price in throttling turbo mode and cache contention. Quite a large
price: an R session may run 1.5-2x slower if 7 other tasks are run in
parallel.

> Is this feasible/practical? I am not a developer so I'm afraid this
> would be down to someone else...

Not in base R. For example, one would need a linkable library, which the
site you quote is not obviously providing.

Nothing is stopping you writing a sensible uncompressed image and
optionally compressing it externally, but note that for some file
systems compressed saves are faster because of reduced I/O.

> Thoughts?
>
> Cheers,
>
> Stewart

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK
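For reference, the "uncompressed save plus external compression" route
described above can be driven entirely from R today through a pipe()
connection. This is only a minimal sketch: pbzip2, its -p/-c/-dc flags and
the file names are assumptions about the local setup, not base-R features.

===========
## Sketch only: save uncompressed and compress externally, or stream the
## serialized image through pbzip2 so no uncompressed copy hits the disk.

# Option 1: uncompressed save, external compression afterwards.
save(list = ls(all.names = TRUE, envir = .GlobalEnv),
     file = "image.RData", envir = .GlobalEnv, compress = FALSE)
system("pbzip2 -p8 image.RData")          # replaces it with image.RData.bz2

# Option 2: stream the serialized image straight through pbzip2.
con <- pipe("pbzip2 -c > image.RData.bz2", "wb")
save(list = ls(all.names = TRUE, envir = .GlobalEnv),
     file = con, envir = .GlobalEnv)
close(con)

# Read it back through a decompressing pipe (pbzip2 writes multiple bzip2
# streams, so decompressing externally is the safe route).
con <- pipe("pbzip2 -dc image.RData.bz2", "rb")
load(con, envir = .GlobalEnv)
close(con)
===========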
On 01/15/2015 01:45 PM, Stewart Morris wrote:
> Hi,
>
> I am dealing with very large datasets and it takes a long time to save a
> workspace image.
>
> [...]
>
> As you can see, bzip2 on its own took over 6 minutes whereas pbzip2 took
> 11 seconds, admittedly on a 64 core machine (running at 50% load). Most
> modern machines are multicore so everyone would get some speedup.
>
> Is this feasible/practical? I am not a developer so I'm afraid this
> would be down to someone else...

Take a look at the gdsfmt package. It supports the superfast LZ4
compression algorithm + it provides highly optimized functions to write
to/read from disk.

  https://github.com/zhengxwen/gdsfmt

> Thoughts?
>
> Cheers,
>
> Stewart
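For anyone who wants to try that route, a rough sketch follows. It uses
gdsfmt's createfn.gds()/add.gdsn() interface, but the exact compression
code ("LZ4_RA" below) depends on the installed gdsfmt version, so treat it
as an assumption and check ?add.gdsn for the options actually supported.

===========
## Sketch only: write and read a vector through gdsfmt with LZ4 compression.
library(gdsfmt)

x <- as.integer(rnorm(1e7))                      # illustrative data only

f <- createfn.gds("test.gds")                    # create a new GDS file
add.gdsn(f, "x", val = x, compress = "LZ4_RA", closezip = TRUE)
closefn.gds(f)

f <- openfn.gds("test.gds")                      # reopen and read back
y <- read.gdsn(index.gdsn(f, "x"))
closefn.gds(f)
===========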
In addition to the major points that others made: if you care about
speed, don't use compression. With today's fast disks it's an order of
magnitude slower to use compression:

> d=lapply(1:10, function(x) as.integer(rnorm(1e7)))
> system.time(saveRDS(d, file="test.rds.gz"))
   user  system elapsed
 17.210   0.148  17.397
> system.time(saveRDS(d, file="test.rds", compress=F))
   user  system elapsed
  0.482   0.355   0.929

The above example is intentionally well compressible; in real life the
differences are actually even bigger. As people who deal with big data
know well, disks are no longer the bottleneck - it's the CPU now.

Cheers,
Simon

BTW: why in the world would you use ascii=TRUE? It's pretty much the
slowest possible serialization you can use - it will even overshadow
compression:

> system.time(saveRDS(d, file="test.rds", compress=F))
   user  system elapsed
  0.459   0.383   0.940
> system.time(saveRDS(d, file="test-a.rds", compress=F, ascii=T))
   user  system elapsed
 36.713   0.140  36.929

and the same goes for reading:

> system.time(readRDS("test-a.rds"))
   user  system elapsed
 27.616   0.275  27.948
> system.time(readRDS("test.rds"))
   user  system elapsed
  0.609   0.184   0.795

> On Jan 15, 2015, at 7:45 AM, Stewart Morris <Stewart.Morris at igmm.ed.ac.uk> wrote:
>
> Hi,
>
> I am dealing with very large datasets and it takes a long time to save
> a workspace image.
>
> [...]
On Thu, Jan 15, 2015 at 11:08 AM, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
> In addition to the major points that others made: if you care about
> speed, don't use compression. With today's fast disks it's an order of
> magnitude slower to use compression:
>
>> d=lapply(1:10, function(x) as.integer(rnorm(1e7)))
>> system.time(saveRDS(d, file="test.rds.gz"))
>    user  system elapsed
>  17.210   0.148  17.397
>> system.time(saveRDS(d, file="test.rds", compress=F))
>    user  system elapsed
>   0.482   0.355   0.929
>
> The above example is intentionally well compressible, in real life the
> differences are actually even bigger. As people that deal with big data
> know well, disks are no longer the bottleneck - it's the CPU now.

Respectfully, while your example would imply this, I don't think it is
correct in the general case. Much faster compression schemes exist, and
using them can improve disk I/O tremendously. Some schemes are so fast
that it is quicker to move compressed data from main RAM to the CPU
cache and decompress it there than to be limited by RAM bandwidth:

  https://github.com/Blosc/c-blosc

Repeating that for emphasis: compressing and uncompressing can actually
be faster than a straight memcpy()!

Really, the issue is that 'gzip' and 'bzip2' are bottlenecks. As Stewart
suggests, this can be mitigated by throwing more cores at the problem.
That isn't a bad solution, as there are often excess underutilized
cores. But it would be much better to choose a faster compression scheme
first, and then parallelize that across cores if still necessary.
Sometimes the tradeoff is between amount of compression and speed, and
sometimes some algorithms are just faster than others.

Here's some sample data for the test file that your example creates:

> d=lapply(1:10, function(x) as.integer(rnorm(1e7)))
> system.time(saveRDS(d, file="test.rds", compress=F))
   user  system elapsed
  0.554   0.336   0.890

nate at ubuntu:~/R/rds$ ls -hs test.rds
382M test.rds

nate at ubuntu:~/R/rds$ time gzip -c test.rds > test.rds.gz
real: 16.207 sec
nate at ubuntu:~/R/rds$ ls -hs test.rds.gz
35M test.rds.gz
nate at ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.330 sec

nate at ubuntu:~/R/rds$ time gzip -c --fast test.rds > test.rds.gz
real: 4.759 sec
nate at ubuntu:~/R/rds$ ls -hs test.rds.gz
56M test.rds.gz
nate at ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.942 sec

nate at ubuntu:~/R/rds$ time pigz -c test.rds > test.rds.gz
real: 2.180 sec
nate at ubuntu:~/R/rds$ ls -hs test.rds.gz
35M test.rds.gz
nate at ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.375 sec

nate at ubuntu:~/R/rds$ time pigz -c --fast test.rds > test.rds.gz
real: 0.739 sec
nate at ubuntu:~/R/rds$ ls -hs test.rds.gz
57M test.rds.gz
nate at ubuntu:~/R/rds$ time gunzip -c test.rds.gz > discard
real: 2.851 sec

nate at ubuntu:~/R/rds$ time lz4c test.rds > test.rds.lz4
Compressed 400000102 bytes into 125584749 bytes ==> 31.40%
real: 1.024 sec
nate at ubuntu:~/R/rds$ ls -hs test.rds.lz4
120M test.rds.lz4
nate at ubuntu:~/R/rds$ time lz4 test.rds.lz4 > discard
Compressed 125584749 bytes into 95430573 bytes ==> 75.99%
real: 0.775 sec

Reading that last one more closely: with single-threaded lz4
compression, we're getting 3x compression at about 400MB/s, and
decompression at about 500MB/s. This is faster than almost any single
disk will be. Multithreaded implementations will make even the fastest
RAID the bottleneck.
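None of the external tools above needs a change to base R to be useful
today: saveRDS() and readRDS() accept connections, so the serialization can
be streamed through whatever compressor is installed. A hedged sketch,
assuming pigz is on the PATH (its -p flag and the file names are
illustrative only):

===========
## Sketch only: stream an uncompressed serialization through pigz
## (parallel gzip). Any compressor that reads stdin would do.
d <- lapply(1:10, function(x) as.integer(rnorm(1e7)))

# Write: saveRDS() to a pipe() connection skips R's built-in compression
# and lets the external tool use as many cores as it likes.
con <- pipe("pigz -p 8 > test.rds.gz", "wb")
saveRDS(d, file = con)
close(con)

# Read: the output is ordinary gzip, so readRDS() can open it directly.
d2 <- readRDS("test.rds.gz")
===========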
It's probably worth noting that the speeds reported in your simple
example for the uncompressed case are likely the speed of writing to
memory, with the actual write to disk happening at some later time.
Sustained throughput will likely be slower than your example would
imply.

If saving data to disk is a bottleneck, I think Stewart is right that
there is a lot of room for improvement.

--nate
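One rough way to see the caching effect described above is to force the
data out of the OS page cache before stopping the clock. This is a sketch
only; the external "sync" call (Linux/macOS) and the file names are
illustrative.

===========
## Sketch only: the elapsed time of an uncompressed saveRDS() mostly
## measures writes into the OS page cache; flushing with "sync" gives a
## rougher but more honest estimate of sustained disk throughput.
d <- lapply(1:10, function(x) as.integer(rnorm(1e7)))

system.time(saveRDS(d, file = "test.rds", compress = FALSE))   # cache speed
system.time({
  saveRDS(d, file = "test2.rds", compress = FALSE)
  system("sync")                                               # flush to disk
})
===========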