Henrik Bengtsson
2012-Sep-15 17:21 UTC
[Rd] Risk of readRDS() not detecting race conditions with parallel saveRDS()?
I hardly know anything about the format used in (non-compressed) serialization/RDS, but I am hoping someone with more knowledge could give me some feedback.

Consider two R processes running in parallel on the same unknown file system. Both of them write to and read from the same RDS file foo.rds (without compression) at random times using saveRDS(object, file="foo.rds", compress=FALSE) and object2 <- readRDS(file="foo.rds"). This happens frequently enough that there is a risk of the two processes writing to the same "foo.rds" file at the same time (here one needs to acknowledge that file updates are neither atomic nor instant).

To simulate the event that two processes write to the same file at the same time (and non-atomically), resulting in an interleaved/appended "foo.rds" file, I manually corrupted "foo.rds" by inserting/dropping/replacing a single random byte. It appears that readRDS() will detect this simple event by throwing an error on "unknown input format", which is what I want. My question is now: is it reasonable to assume that if two or more processes happen to write to the same RDS file at the same time, it is extremely unlikely (*) that they would generate a file that would pass as valid by readRDS()?

(*) extremely unlikely = if all of us would run this toy example we would not end up with a non-detected but still corrupt "foo.rds" file in, say, 10000 years.

Background: The R.cache package allows memoization (caching of results) to file such that the cache is persistent across R sessions. The persistent part is achieved by writing cache files to the same file directory. This is safe when you run a single process, and even if readRDS() would fail to read a cache file it is no big deal; the memoization will just fail and the results will be recalculated and resaved. The question is what happens if you run this in parallel and push it to the extreme: is there a risk that the memoization will properly return but with invalid results? I prefer not having to synchronize this with a mutex/semaphore/common server, but instead rely on this try-and-see approach (cf. the Ethernet protocol on shared medium). My guess (and hope) is that the risk is extremely unlikely (*), but I'd like to hear if someone else thinks otherwise.

Thanks,

Henrik
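The fallback behaviour described in the Background paragraph above can be sketched roughly as follows. This is only an illustration: `read_or_recompute` is a hypothetical helper and not part of the R.cache API, and it only protects against corruption that readRDS() actually reports as an error.

```r
# Hypothetical helper (not R.cache's API): return the cached value if the
# file can be read, otherwise recompute it and save it again.
read_or_recompute <- function(file, compute) {
  if (file.exists(file)) {
    # Wrap the value in a list so that a legitimately cached NULL
    # can be told apart from a failed read.
    result <- tryCatch(list(value = readRDS(file)), error = function(e) NULL)
    if (!is.null(result)) return(result$value)
  }
  value <- compute()
  saveRDS(value, file, compress = FALSE)
  value
}

path <- tempfile(fileext = ".rds")
writeLines("not an RDS file", path)   # simulate a detectably corrupt cache file

# The first call fails to read the file and recomputes; the second call
# reads the resaved file and never invokes its compute function.
first  <- read_or_recompute(path, function() sum(1:10))
second <- read_or_recompute(path, function() stop("should not be called"))
```

A corrupt file that readRDS() accepts without error would be returned as if it were valid, which is exactly the case the question is about.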
Simon Urbanek
2012-Sep-15 19:17 UTC
[Rd] Risk of readRDS() not detecting race conditions with parallel saveRDS()?
On Sep 15, 2012, at 1:21 PM, Henrik Bengtsson wrote:

> [...] My question is now, is it reasonable to assume that if two or
> more processes happen to write to the same RDS file at the same time,
> it is extremely unlikely (*) that they would generate a file that
> would pass as valid by readRDS()?
>
> (*) extremely unlikely = if all of us would run this toy example we
> would not end up with a non-detect but still corrupt "foo.rds" file
> in, say, 10000 years.

It's actually very probable that it will go undetected. In fact, the probability is very high if you have large vectors, because you can corrupt almost the entire file and there will be no sign of corruption: there is no checksum, so you can change the whole vector payload without any consequence. Just try saveRDS(rep(0L,100), "foo.rds", compress=FALSE) and you can mess with anything after byte 21 and it will result in no error.

Cheers,
S
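The point made in the reply above can be checked with a small sketch. It assumes an uncompressed, binary RDS file holding an attribute-free integer vector, so that the last bytes of the file are vector payload; the payload byte is located from the end of the file rather than by a fixed header offset, since the header length differs between serialization versions.

```r
# Write an uncompressed RDS file, change one payload byte, and read it back.
path <- tempfile(fileext = ".rds")
x <- rep(0L, 100)
saveRDS(x, path, compress = FALSE)

bytes <- readBin(path, what = "raw", n = file.size(path))

# The 100 integers are stored as 4 bytes each (big-endian) at the end of
# the file, so the very last byte belongs to the 100th element.
bytes[length(bytes)] <- as.raw(1L)
writeBin(bytes, path)

# readRDS() raises no error: the file is structurally valid, only the data differ.
y <- readRDS(path)
unchanged <- identical(x, y)
```

Here `unchanged` is FALSE and `y[100]` is 1 instead of 0, yet nothing signals that the file was modified.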
William Dunlap
2012-Sep-15 19:44 UTC
[Rd] Risk of readRDS() not detecting race conditions with parallel saveRDS()?
Why not write the RDS file more atomically: write it to a temporary file and rename that file to its final name when it is completely written? E.g.,

    saveRDS.atomically <- function(object, file, ...) {
      # Create the temporary file next to the target so the rename
      # stays within the same directory.
      tfile <- tempfile(basename(file), dirname(file))
      # Remove the temporary file if anything fails before the rename.
      on.exit(if (file.exists(tfile)) unlink(tfile))
      retval <- saveRDS(object, tfile, ...)
      if (!file.rename(tfile, file)) {
        # perhaps want an if (file.exists(file)) unlink(file) first
        stop("Cannot rename temporary file ", tfile, " to ", file)
      }
      invisible(retval)
    }

(The file.rename may be tripped up by an overeager virus checker looking at the newly created tfile. I don't know the best way to deal with that.)

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
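A short usage sketch of the write-then-rename idea follows. `save_rds_atomically` is a hypothetical name that mirrors the function in the message above, and the temporary cache directory is an assumption made for illustration. Whether file.rename() can replace an existing file is platform-dependent, as the message itself hints, so this should be read as a sketch for a POSIX-like file system.

```r
# Hypothetical variant of the function above: write to a temporary file in
# the target directory, then rename it to the final name.
save_rds_atomically <- function(object, file, ...) {
  tfile <- tempfile(basename(file), dirname(file))
  on.exit(if (file.exists(tfile)) unlink(tfile))
  saveRDS(object, tfile, ...)
  if (!file.rename(tfile, file)) {
    stop("Cannot rename temporary file ", tfile, " to ", file)
  }
  invisible(NULL)
}

cache_dir <- tempfile("cache")
dir.create(cache_dir)
target <- file.path(cache_dir, "foo.rds")

save_rds_atomically(1:3, target, compress = FALSE)
save_rds_atomically(letters, target, compress = FALSE)  # replaces the first version

value <- readRDS(target)                                 # the second object
leftovers <- setdiff(list.files(cache_dir), "foo.rds")   # stray temporary files, if any
```

With this scheme a reader opens either the old complete file or the new complete file; it does not add a checksum, so corruption from other causes would still go unnoticed.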