Tomas Kalibera
2025-Mar-26 20:34 UTC
[Rd] Suggestion: Install packages on non-appendable file systems (e.g. databricks volumes)
On 3/26/25 17:47, Sergio Oller wrote:
> Hello,
>
> I would like to submit a patch to R. Following "5 Submitting Feature
> Requests - R Development Guide"
> <https://contributor.r-project.org/rdevguide/chapters/submitting_feature_requests.html>,
> I would like to ask for feedback before proceeding with a "formal"
> submission on bugzilla. It's my first attempt contributing to R and I do
> not currently have a bugzilla account.
>
> I am working at a company, and we use R with databricks. We want to install
> some packages on a distributed filesystem that is not fully POSIX
> compliant, as it does not support opening files in append mode. In C terms,
> `open(filename, "a")` gives an error. I guess other distributed file
> systems beyond the ones in databricks may have issues with append mode as
> well.
>
> Our current workaround is to install all packages on a local folder, and
> then copy/move the folder to the distributed file system.

This is something we try to keep working in R if possible, to allow users moving installed packages by moving the installation directories. If this practice works for you, it is probably fine.

Currently, installing a binary package just means unpacking it to the target directory. Probably you could do this also via binary packages: build binary packages on a local filesystem, and then install them to the non-POSIX filesystem (provided the unpacking/installation would work on such a filesystem). If the installation of a binary package doesn't work but could be (possibly optionally) made work, that might be of interest.

> If I understand package installation correctly, when a package is
> installed, the installation happens inside a 00LOCK directory, and then the
> outcome is moved to the final destination.
>
> The contribution I would like to submit allows users/sysadmins to set an
> environment variable named PKG_LOCKDIR_PREFIX, that defines the location
> where the "00LOCK-" directories are created.
> The patch is backwards
> compatible and it consists of +28,-10 lines, hopefully easy enough to
> review.
>
> https://github.com/r-devel/r-svn/pull/196.diff
>
> When I use this patch, I can successfully install packages on a distributed
> file system by setting PKG_LOCKDIR_PREFIX to a directory in my local
> filesystem (R does all the file append stuff in the local file system, and
> finally copies all the package files to the distributed file system)

I am not excited about the idea of combining this with the locking mechanism and staged installation in the described way. The current implementation takes advantage of the fact that, on a single filesystem, a move operation is either atomic (POSIX) or at least very fast (Windows). Copying an installed package to a different filesystem isn't. There is a risk that some other R session could see a partial installation of a package. Then, if the library was on a distributed filesystem accessed from different machines, there could even be corruption due to concurrent installation from multiple machines. In principle, this could happen even on a single machine (checking existence of a directory on one filesystem and creating it on another wouldn't be atomic).

Perhaps the staging/locking could be implemented in some special way on the target filesystem, some second-level staging and installation, but it is questionable whether it is worth the effort/maintenance in base R. Also keep in mind this could hardly be regularly tested as such filesystems are rare.

Best
Tomas

P.S. about staged installation: https://developer.r-project.org/Blog/public/2019/02/14/staged-install/index.html

> This setting makes package installation transparent for all data
> scientists, since they may not even know that PKG_LOCKDIR_PREFIX has been
> set. Package installation just works as expected.
>
> I feel the patch has some added value over our workaround: Even if we
> implement the workaround with a simple wrapper over install.packages(), any
> third party package that depends on install.packages() (such as renv or
> others) won't use our workaround. Besides, with this patch merged any other
> R user benefits from being able to install packages in those filesystems.
>
> Any feedback is very much appreciated.
>
> Thanks for your time,
>
> Sergio
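The append-mode limitation described in the quoted message could be checked before choosing a library location. The following is only a minimal sketch in Python (where, as it happens, `open(filename, "a")` is the literal call). The function name `supports_append` and the probe file prefix are made up for illustration. It is also an assumption that the failure surfaces as an `OSError` at open or write time, since the thread does not say where exactly the error appears.

```python
import os
import tempfile


def supports_append(directory):
    """Return True if a file inside `directory` can be opened in append mode.

    Hypothetical helper: creates a small probe file, appends to it, verifies
    the content and removes the probe again.
    """
    fd, path = tempfile.mkstemp(dir=directory, prefix="append-probe-")
    try:
        # Create the file with ordinary (truncating) write mode first.
        with os.fdopen(fd, "w") as handle:
            handle.write("first\n")
        try:
            # Assumption: on the affected volumes this open or write fails.
            with open(path, "a") as handle:
                handle.write("second\n")
        except OSError:
            return False
        # Append "succeeded"; make sure the data really ended up at the end.
        with open(path) as handle:
            return handle.read() == "first\nsecond\n"
    finally:
        os.remove(path)
```

On an ordinary local filesystem `supports_append("/some/library/dir")` should return `True`. A `False` result would suggest that source installs into that directory may hit the problem discussed here.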
Sergio Oller
2025-Mar-27 12:26 UTC
[Rd] Suggestion: Install packages on non-appendable file systems (e.g. databricks volumes)
Hi Tomas,

Thanks for your feedback and the link to your blog post about staged installs.

Message from Tomas Kalibera <tomas.kalibera at gmail.com> on Wed, 26 Mar 2025 at 21:34:
>
> > I am working at a company, and we use R with databricks. We want to install
> > some packages on a distributed filesystem that is not fully POSIX
> > compliant, as it does not support opening files in append mode. In C terms,
> > `open(filename, "a")` gives an error. I guess other distributed file
> > systems beyond the ones in databricks may have issues with append mode as
> > well.
> >
> > Our current workaround is to install all packages on a local folder, and
> > then copy/move the folder to the distributed file system.
>
> This is something we try to keep working in R if possible, to allow
> users moving installed packages by moving the installation directories.
> If this practice works for you, it is probably fine.

Our current workaround kind of works, but users expect renv and other tools that call install.packages() to work as well, and for those our wrapper is not that convenient.

> Currently, installing a binary package just means unpacking it to the
> target directory. Probably you could do this also via binary packages:
> build binary packages on a local filesystem, and then install them to
> the non-POSIX filesystem (provided the unpacking/installation would work
> on such a filesystem). If the installation of a binary package doesn't
> work but could be (possibly optionally) made work, that might be of
> interest.

Yes, binary package installation works. Still, source package installation does not, which is inconvenient, especially when a mixture of binary and source packages is being installed.

> I am not excited about the idea of combining this with the locking
> mechanism and staged installation in the described way.
> The current
> implementation takes advantage of the fact that, on a single filesystem, a move
> operation is either atomic (POSIX) or at least very fast (Windows).
> Copying an installed package to a different filesystem isn't. There is a
> risk that some other R session could see a partial installation of a
> package. Then, if the library was on a distributed filesystem accessed
> from different machines, there could even be corruption due to
> concurrent installation from multiple machines. In principle, this could
> be even on a single machine (checking existence of a directory on one
> filesystem and creating it on another wouldn't be atomic).
>
> Perhaps the staging/locking could be implemented in some special way on
> the target filesystem, some second-level staging and installation - but
> it is questionable whether it is worth the effort/maintenance in base R.
> Also keep in mind this could hardly be regularly tested as such
> filesystems are rare.

The patch I propose does not address concurrent installations on the distributed file system. I fully agree that the lack of an atomic move creates a risk of leaving the library in a corrupted state in case of errors. I believe the best way to address that probably requires improving the distributed filesystem so that it supports append mode and handles atomic moves better; however, that's beyond my abilities. (Or not using that file system in the first place, but that is the place we have for persisting files right now.)

So far I have included detailed documentation, including these caveats, in the R admin manual. Feel free to see the updated patch:

https://patch-diff.githubusercontent.com/raw/r-devel/r-svn/pull/196.diff

I have also submitted this patch to bugzilla:

https://bugs.r-project.org/show_bug.cgi?id=18876

I hope that even if the proposed solution is not perfect, it will be good and simple enough to be considered for merging, since it improves basic support for non-POSIX-compliant filesystems and it does not harm common use.
However, I would fully respect and understand it if you decide to reject the patch.

Thanks for your time and your feedback; it is very much appreciated.

Best,
Sergio
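The atomicity concern raised in this thread (a rename within one filesystem versus a copy across filesystems) can be illustrated with a short Python sketch. This is not how R implements staged installation. The function names and the `00STAGE-` prefix are hypothetical. The sketch assumes that a rename inside the target filesystem is atomic, which may not hold on the volumes discussed here. It also does no locking, so concurrent installers could still race.

```python
import os
import shutil
import tempfile


def same_filesystem(path_a, path_b):
    """True if both existing paths are on the same device (POSIX st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def publish_package(staged_dir, library_dir, name):
    """Move a fully built package from staged_dir to library_dir/name.

    Assumes library_dir/name does not exist yet.
    """
    final_dir = os.path.join(library_dir, name)
    if same_filesystem(staged_dir, library_dir):
        # Same filesystem: a single rename, atomic on POSIX.
        os.rename(staged_dir, final_dir)
        return final_dir
    # Different filesystems: copy under a temporary name inside the library
    # first, so a half-copied tree never appears under the final name.
    temp_dir = tempfile.mkdtemp(dir=library_dir, prefix="00STAGE-" + name + "-")
    try:
        shutil.copytree(staged_dir, temp_dir, dirs_exist_ok=True)
        # Assumption: this rename is atomic on the target filesystem.
        os.rename(temp_dir, final_dir)
    except BaseException:
        shutil.rmtree(temp_dir, ignore_errors=True)
        raise
    shutil.rmtree(staged_dir)
    return final_dir
```

The copy branch is one possible reading of the "second-level staging" idea: readers of the library would see either no package directory or a complete one, provided the final rename behaves atomically on the target.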