Maybe this is more appropriate for r-devel after all, so I'm forwarding it
here.
---------- Forwarded message ----------
From: Elena Grassi <grassi.e at gmail.com>
Date: Wed, Apr 3, 2013 at 2:19 PM
Subject: Process substitution and read.table/scan
To: r-help at r-project.org
Hello, I asked the same question on Stack Overflow
(http://stackoverflow.com/questions/15784373/process-substitution) but
did not completely understand the issue, so I'm reporting it here:
"
I've looked around for an explanation of what puzzles me and only found this:
http://stackoverflow.com/questions/4274171/do-some-programs-not-accept-process-substitution-for-input-files
which partially helps, but I would really like to understand the
full story. I noticed that some of my R scripts give different (i.e.
wrong) results when I use process substitution.
I tried to pinpoint the problem with a test case:
This script:
#!/usr/bin/Rscript
args <- commandArgs(TRUE)
file <- args[1]
cat(file)
cat("\n")
data <- read.table(file, header=F)
cat(mean(data$V1))
cat("\n")
with an input file generated in this way:
$ for i in `seq 1 10`; do echo $i >> p; done
$ for i in `seq 1 500`; do cat p >> test; done
leads me to this:
$ ./mean.R test
test
5.5
$ ./mean.R <(cat test)
/dev/fd/63
5.501476
Further tests reveal that some lines are lost, but I would like to
understand why. Does read.table (scan gives the same results) use
seek?
P.S. With a smaller test file (100) an error is reported:
$ ./mean.R <(cat test3)
/dev/fd/63
Error in read.table(file, header = F) : no lines available in input
Execution halted
"
Other notes: with a modified script that uses scan the results are the same.
Printing the whole data.frame yields 5001 lines in the first case
(which is correct) and only 3050 with process substitution.
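A quick arithmetic check on those counts, as a minimal sketch. It assumes (hypothetically) that the missing rows form one contiguous block of 4096 bytes dropped from the start of the stream; the 4096 figure is a guess at a buffer size, not something taken from the R sources. The data are rebuilt in memory rather than read from the test file.

```r
# Rebuild the test data in memory: the numbers 1..10 repeated 500 times,
# one per line, as the two shell loops produce them.
bytes <- charToRaw(paste0(rep(1:10, times = 500), "\n", collapse = ""))

# Assumption: the first `lost` bytes never reach the parser.
lost <- 4096
rest <- rawToChar(bytes[-seq_len(lost)])

# scan() splits on whitespace, so the stray newline left at the cut is ignored.
x <- scan(text = rest, quiet = TRUE)
cat(length(x), "values, mean", mean(x), "\n")
```

Under that assumption this gives 3049 values with a mean of 5.501476, which matches the mean reported with process substitution; a data.frame of 3049 rows prints as 3050 lines including the header. A file built from only 100 repetitions would be 2100 bytes, so under the same assumption nothing would be left to read.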
I checked the read.table source code and saw that it moves around in the
file to check column types and so on. I thought this might explain
the problem, but I would prefer an error message
rather than a result computed from partial data. Then someone
on Stack Overflow pointed me to fifo(), which solves the problem (i.e.
the mean is reported correctly even with process substitution), and
so I'm even more puzzled: does fifo() allow seeks and peeks
around a named pipe?
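For reference, the fifo() variant can be sketched as below. This is a minimal sketch and not necessarily the exact code suggested on Stack Overflow: the helper name column_mean is made up for illustration, and blocking = TRUE is my assumption, so that reads wait for the writer instead of returning early.

```r
# Hypothetical helper: read a one-column table through an explicit fifo()
# connection instead of handing the path to read.table() directly.
column_mean <- function(path) {
  # Assumption: blocking = TRUE, so reads wait for the writing process.
  con <- fifo(path, blocking = TRUE)
  # read.table() opens an unopened connection itself and closes it on exit.
  data <- read.table(con, header = FALSE)
  mean(data$V1)
}
```

In the test script this would stand in for the read.table(file, header=F) call, for example cat(column_mean(args[1])).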
I'm willing to read the relevant code to understand what's really
happening (and even help if someone thinks that this issue could
represent a small bug) but I would really appreciate some pointers.
Here are the sessionInfo() output and other possibly relevant
things:
> sessionInfo()
R version 3.0.0 beta (2013-03-23 r62384)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
$ uname -a
Linux femto 3.6-trunk-amd64 #1 SMP Debian 3.6.9-1~experimental.1
x86_64 GNU/Linux
I use the debian R package: r-base-core, 3.0.0~20130324-1
Thanks,
Elena Grassi
P.S.
I started on R-help since I thought this could be of general interest;
sorry if that was a bad call.
--
$ pom