Many distros and browsers these days use zstd as the preferred compression method. For example, if you unpack a .deb or .rpm file on Debian or Fedora, there is a zstd archive inside. zstd is claimed to compress better than gzip while, unlike lzma, keeping decompression speed comparable to gzip. Maybe it is interesting to get an estimate of how much R packages would benefit from zstd.

Testing this for source packages and macOS binary packages is easy, as we can gunzip and recompress tar.gz files without having to extract the tarball itself:

    OUTPUT="sizes.txt"
    echo "FILE GZIP ZSTD" > "$OUTPUT"
    for x in *gz; do
      FILE=$(basename "$x")
      # size in bytes of the existing gzip tarball
      GZIP=$(wc -c "$x" | awk '{print $1}')
      # size in bytes after recompressing the same tar stream with zstd -19
      ZSTD=$(gunzip -c "$x" | zstd -19 | wc -c)
      echo "$FILE $GZIP $ZSTD" | tee -a "$OUTPUT"
    done

Attached are results of running this script on the 500 most downloaded CRAN packages. It shows about 16% size reduction for sources, and 19% for binaries.

Zstd is BSD licensed C code that can easily be embedded in any project.

Attachments:
  sources.txt: <stat.ethz.ch/pipermail/r-devel/attachments/20250111/90f91d5e/attachment.txt>
  binaries.txt: <stat.ethz.ch/pipermail/r-devel/attachments/20250111/90f91d5e/attachment-0001.txt>
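The message does not say how the 16% and 19% figures were derived from the per-file sizes. One plausible reading is an aggregate reduction over all files, which could be computed roughly as sketched below. The file name sizes_example.txt, the function name summarise_sizes, and the three rows of numbers are hypothetical placeholders for illustration, not the attached CRAN results.

```shell
#!/bin/sh
# Hypothetical sample in the same "FILE GZIP ZSTD" layout the script writes.
# These rows are made-up placeholder numbers, NOT the attached CRAN results.
cat > sizes_example.txt <<'EOF'
FILE GZIP ZSTD
pkgA_1.0.tar.gz 1000 800
pkgB_2.1.tar.gz 2000 1700
pkgC_0.3.tar.gz 1000 900
EOF

# summarise_sizes (hypothetical helper): skip the header line, sum both
# size columns, and print the aggregate size reduction as a percentage.
summarise_sizes() {
  awk 'NR > 1 { gz += $2; zs += $3 }
       END {
         if (gz > 0)
           printf "gzip=%d zstd=%d reduction=%.1f%%\n", gz, zs, 100 * (gz - zs) / gz
       }' "$1"
}

summarise_sizes sizes_example.txt
```

An aggregate figure weights large packages more heavily than a mean of per-package reductions would, so the two can differ noticeably; which one the quoted percentages refer to is an assumption here.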
zstd is accessible within R using the archive package [1]. I use it all the time when saving large objects, using code I adapted from [2].

Is your suggestion to import the libraries/source code into base?

[1] CRAN.R-project.org/package=archive
[2] coolbutuseless.github.io/2018/10/02/using-lz4-and-zstandard-to-compress-files-with-saverds

On Fri, Jan 10, 2025 at 6:17 PM Jeroen Ooms <jeroenooms at gmail.com> wrote:
>
> Many distros and browsers these days use zstd as the preferred
> compression method. For example if you unpack a .deb or .rpm file on
> Debian or Fedora there is zstd archive inside. It is claimed that zstd
> offers improved compression over gzip, but (unlike lzma) it has
> comparable decompression speed. Maybe it is interesting to get an
> estimate of how much R packages would benefit from zstd.
>
> Testing this for source packages and MacOS binary packages it is easy
> as we can gunzip and recompress tar.gz files without having to extract
> the tarball itself:
>
> OUTPUT="sizes.txt"
> echo "FILE GZIP ZSTD" > $OUTPUT
> for x in *gz; do
>   FILE=$(basename $x)
>   GZIP=$(wc -c "$x" | awk '{print $1}')
>   ZSTD=$(gunzip -c $x | zstd -19 | wc -c)
>   echo "$FILE $GZIP $ZSTD" | tee -a $OUTPUT
> done
>
> Attached are results of running this script on the 500 most downloaded
> CRAN packages. It shows about 16% size reduction for sources, and 19%
> for binaries.
>
> Zstd is BSD licensed C code that can easily be embedded in any project.
> ______________________________________________
> R-devel at r-project.org mailing list
> stat.ethz.ch/mailman/listinfo/r-devel
I think the first step would have to be to add zstd support to R. zstd is a bit controversial (as shown by the community blowback over the changes you mentioned) and their build system (calling it that is being very generous) is a mess, so it would require a bit of testing, but it is doable.

That said, assuming the above is solved, we have been debating a change of compression at CRAN in general for a while, but assumptions about file names are built into today's tools, so there would certainly be some fallout, not just in R but also in the ecosystems around it. As you pointed out, possibly the safest place to start is binaries, since we have tighter control of those and they are used in fewer places.

Personally, I think the higher priority is signing, so as we address that we may just include the compression change with it, since it will require some tool changes anyway. I was thinking of using xz as that is more stable, already supported and less controversial, but I don't think the choice really matters: it just has to be a compression format that R supports (zstd and xz have different benefits, so it's always a trade-off without a clear winner).

Cheers,
Simon

> On 11 Jan 2025, at 12:16, Jeroen Ooms <jeroenooms at gmail.com> wrote:
>
> Many distros and browsers these days use zstd as the preferred
> compression method. For example if you unpack a .deb or .rpm file on
> Debian or Fedora there is zstd archive inside. It is claimed that zstd
> offers improved compression over gzip, but (unlike lzma) it has
> comparable decompression speed. Maybe it is interesting to get an
> estimate of how much R packages would benefit from zstd.
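Since xz was raised as an alternative to zstd, the size side of that trade-off could be checked by extending the original loop with an xz column, roughly as sketched below. This is only a sketch: the helper name compare_one and the choice of xz -9 are assumptions, not anything proposed in the thread, and it measures compressed size only, not decompression time.

```shell
#!/bin/sh
# compare_one (hypothetical helper): print "FILE GZIP XZ ZSTD" sizes in
# bytes for one .gz file. A column is "NA" when that tool is not installed.
compare_one() {
  f=$1
  # size of the existing gzip file; tr strips padding some wc versions add
  gz=$(wc -c < "$f" | tr -d ' ')
  xz_size=NA
  zstd_size=NA
  if command -v xz >/dev/null 2>&1; then
    # recompress the decompressed stream with xz (level 9 is an assumption)
    xz_size=$(gunzip -c "$f" | xz -9 | wc -c | tr -d ' ')
  fi
  if command -v zstd >/dev/null 2>&1; then
    # same level as in the original script
    zstd_size=$(gunzip -c "$f" | zstd -19 | wc -c | tr -d ' ')
  fi
  echo "$(basename "$f") $gz $xz_size $zstd_size"
}

echo "FILE GZIP XZ ZSTD"
for x in *gz; do
  [ -e "$x" ] || continue   # glob matched nothing in this directory
  compare_one "$x"
done
```

Run in a directory of tar.gz files, this prints one row per file; timing decompression would need a separate measurement.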