Sergio Oller
2025-Mar-26 16:47 UTC
[Rd] Suggestion: Install packages on non-appendable file systems (e.g. databricks volumes)
Hello,

I would like to submit a patch to R. Following chapter 5, "Submitting Feature Requests", of the R Development Guide <https://contributor.r-project.org/rdevguide/chapters/submitting_feature_requests.html>, I would like to ask for feedback before proceeding with a "formal" submission on Bugzilla. It's my first attempt at contributing to R and I do not currently have a Bugzilla account.

I am working at a company, and we use R with Databricks. We want to install some packages on a distributed filesystem that is not fully POSIX compliant, as it does not support opening files in append mode. In C terms, `open(filename, "a")` gives an error. I guess other distributed file systems beyond the ones in Databricks may have issues with append mode as well.

Our current workaround is to install all packages in a local folder, and then copy/move the folder to the distributed file system.

If I understand package installation correctly, when a package is installed, the installation happens inside a 00LOCK directory, and then the outcome is moved to the final destination.

The contribution I would like to submit allows users/sysadmins to set an environment variable named PKG_LOCKDIR_PREFIX that defines the location where the "00LOCK-" directories are created. The patch is backwards compatible and it consists of +28,-10 lines, hopefully easy enough to review.

https://github.com/r-devel/r-svn/pull/196.diff

When I use this patch, I can successfully install packages on a distributed file system by setting PKG_LOCKDIR_PREFIX to a directory in my local filesystem (R does all the file append operations in the local file system, and finally copies all the package files to the distributed file system).

This setting makes package installation transparent for all data scientists, since they may not even know that PKG_LOCKDIR_PREFIX has been set. Package installation just works as expected.
I feel the patch has some added value over our workaround: even if we implement the workaround with a simple wrapper over install.packages(), any third-party package that depends on install.packages() (such as renv or others) won't use our workaround. Besides, with this patch merged, any other R user benefits from being able to install packages on those filesystems.

Any feedback is very much appreciated.

Thanks for your time,

Sergio
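The two ideas in this message (a file system that rejects append mode, and the install-locally-then-copy workaround) can be illustrated with a small sketch. It is written in Python rather than R purely for illustration. The function names `supports_append` and `copy_installed_package` are hypothetical, and the code has not been tried on Databricks volumes or any other non-appendable file system.

```python
import os
import shutil
import tempfile
import uuid


def supports_append(directory):
    """Return True if a file inside `directory` can be opened in append mode."""
    probe = os.path.join(directory, f".append-probe-{uuid.uuid4().hex}")
    try:
        # "a" is the mode that the file system described above reportedly rejects.
        with open(probe, "a") as fh:
            fh.write("x")
        return True
    except OSError:
        return False
    finally:
        # Remove the probe file if it was created.
        if os.path.exists(probe):
            os.remove(probe)


def copy_installed_package(local_lib, target_lib, pkg):
    """Copy an already installed package directory from local_lib to target_lib.

    Assumption: the package was installed into local_lib beforehand.
    The copy is done file by file and is not atomic.
    """
    src = os.path.join(local_lib, pkg)
    dst = os.path.join(target_lib, pkg)
    if os.path.exists(dst):
        shutil.rmtree(dst)  # replace any previous copy
    shutil.copytree(src, dst)
    return dst
```

A probe like `supports_append(target_lib)` could decide whether the copy step is needed at all; on an ordinary local file system it is expected to succeed.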
Tomas Kalibera
2025-Mar-26 20:34 UTC
[Rd] Suggestion: Install packages on non-appendable file systems (e.g. databricks volumes)
On 3/26/25 17:47, Sergio Oller wrote:
> Hello,
>
> I would like to submit a patch to R. Following chapter 5, "Submitting Feature Requests", of the R Development Guide <https://contributor.r-project.org/rdevguide/chapters/submitting_feature_requests.html>, I would like to ask for feedback before proceeding with a "formal" submission on bugzilla. It's my first attempt contributing to R and I do not currently have a bugzilla account.
>
> I am working at a company, and we use R with databricks. We want to install some packages on a distributed filesystem that is not fully POSIX compliant, as it does not support opening files in append mode. In C terms, `open(filename, "a")` gives an error. I guess other distributed file systems beyond the ones in databricks may have issues with append mode as well.
>
> Our current workaround is to install all packages on a local folder, and then copy/move the folder to the distributed file system.

This is something we try to keep working in R if possible, to allow users to move installed packages by moving the installation directories. If this practice works for you, it is probably fine.

Currently, installing a binary package just means unpacking it to the target directory. You could probably also do this via binary packages: build binary packages on a local filesystem, and then install them to the non-POSIX filesystem (provided the unpacking/installation would work on such a filesystem). If the installation of a binary package doesn't work but could be (possibly optionally) made to work, that might be of interest.

> If I understand package installation correctly, when a package is installed, the installation happens inside a 00LOCK directory, and then the outcome is moved to the final destination.
>
> The contribution I would like to submit allows users/sysadmins to set an environment variable named PKG_LOCKDIR_PREFIX, that defines the location where the "00LOCK-" directories are created.
> The patch is backwards compatible and it consists of +28,-10 lines, hopefully easy enough to review.
>
> https://github.com/r-devel/r-svn/pull/196.diff
>
> When I use this patch, I can successfully install packages on a distributed file system by setting PKG_LOCKDIR_PREFIX to a directory in my local filesystem (R does all the file append stuff in the local file system, and finally copies all the package files to the distributed file system)

I am not excited about the idea of combining this with the locking mechanism and staged installation in the described way. The current implementation takes advantage of the fact that, on a single filesystem, a move operation is either atomic (POSIX) or at least very fast (Windows). Copying an installed package to a different filesystem is neither. There is a risk that some other R session could see a partial installation of a package. Then, if the library was on a distributed filesystem accessed from different machines, there could even be corruption due to concurrent installation from multiple machines. In principle, this could happen even on a single machine (checking the existence of a directory on one filesystem and creating it on another wouldn't be atomic).

Perhaps the staging/locking could be implemented in some special way on the target filesystem, with some second-level staging and installation, but it is questionable whether it is worth the effort/maintenance in base R. Also keep in mind that this could hardly be tested regularly, as such filesystems are rare.

Best
Tomas

P.S. about staged installation: https://developer.r-project.org/Blog/public/2019/02/14/staged-install/index.html

> This setting makes package installation transparent for all data scientists, since they may not even know that PKG_LOCKDIR_PREFIX has been set. Package installation just works as expected.
>
> I feel the patch has some added value over our workaround: Even if we implement the workaround with a simple wrapper over install.packages(), any third party package that depends on install.packages() (such as renv or others) won't use our workaround. Besides, with this patch merged any other R user benefits from being able to install packages in those filesystems.
>
> Any feedback is very much appreciated.
>
> Thanks for your time,
>
> Sergio
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
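The distinction drawn in the reply, between a move within one filesystem and a copy across filesystems, can be sketched as follows. This is an illustration in Python, not R's actual installation code. The function name `publish` is hypothetical, and the sketch assumes POSIX behaviour, where renaming across filesystems fails with `EXDEV`.

```python
import errno
import os
import shutil
import tempfile


def publish(staging_dir, final_dir):
    """Move a fully built directory into place and report how it was done."""
    try:
        # A single step when both paths are on the same filesystem:
        # other processes see either no final_dir or the complete one.
        os.rename(staging_dir, final_dir)
        return "renamed"
    except OSError as exc:
        # EXDEV means the two paths are on different filesystems.
        if exc.errno != errno.EXDEV:
            raise
    # Fallback: file-by-file copy. While this runs, other processes can
    # observe final_dir in an incomplete state.
    shutil.copytree(staging_dir, final_dir)
    shutil.rmtree(staging_dir)
    return "copied"
```

When the staging directory (here imagined as a "00LOCK-" directory) sits next to the final location, the function is expected to take the rename path. Placing the staging directory on another filesystem forces the copy path, which is where a partially installed package could become visible.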