Derek Stephen Elmerick
2008-Jan-09 19:01 UTC
[Rd] read.table: aborting based on a time constraint
Hello –

I am trying to write code that will read in multiple datasets; however, I
would like to skip any dataset where the read-in process takes longer than
some fixed cutoff. A generic version of the function is the following:

    for(k in 1:number.of.datasets)
    {
      X[k] = read.table(…)
    }

The issue is that I cannot find a way to embed logic that will abort the
read-in process of a specific dataset without manual intervention. I scanned
the help manual and other postings, but no luck based on my search. Any
thoughts?

Thanks,
Derek
On Jan 9, 2008 2:01 PM, Derek Stephen Elmerick <delmeric@gmail.com> wrote:

> Hello –
>
> I am trying to write code that will read in multiple datasets;
> however, I would like to skip any dataset where the read-in process
> takes longer than some fixed cutoff. A generic version of the function
> is the following:
>
> for(k in 1:number.of.datasets)
> {
>   X[k] = read.table(…)
> }
>
> The issue is that I cannot find a way to embed logic that will abort
> the read-in process of a specific dataset without manual intervention.
> I scanned the help manual and other postings, but no luck based on my
> search. Any thoughts?

A simple solution is to use nrows=1000000 or so (whatever makes sense).
Then, any dataset larger than that will be truncated. If you use a
connection, you could even check after the read.table completes to see
whether more rows are available -- if so, the entire dataset has not been
read.

A slightly more complicated solution might be to read in 1000 lines or so
(depends a bit on the data) at a time and then rbind the results of
multiple read.table() calls at the end. If you capture the colClasses from
the first read, this can potentially be even faster than standard
read.table() on the whole dataset. You can read from a connection so that
the file does not need to be reopened and the connection need not be
reset. You could check the time after each chunk of lines to see if you
have exceeded your threshold.

There, of course, may be more clever solutions that I haven't thought of.

Sean
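[Editorial note: the chunked approach described above might be sketched
roughly as follows. This is not code from the thread; the function name
read.chunked, the chunk size, the cutoff, and the NULL-on-timeout
convention are all illustrative choices.]

```r
# Sketch of the chunked approach described above: read `chunk` lines at a
# time from a connection, give up once `cutoff` seconds elapse, and rbind
# the pieces at the end. All names here are illustrative, not from the thread.
read.chunked <- function(file, chunk = 1000, cutoff = 60, ...) {
  con <- file(file, open = "r")
  on.exit(close(con))
  start <- Sys.time()
  # The first chunk establishes colClasses, which speeds up later chunks
  first <- read.table(con, nrows = chunk, ...)
  classes <- sapply(first, class)
  pieces <- list(first)
  repeat {
    if (difftime(Sys.time(), start, units = "secs") > cutoff)
      return(NULL)                            # over budget: skip this dataset
    nxt <- tryCatch(read.table(con, nrows = chunk, colClasses = classes, ...),
                    error = function(e) NULL) # read.table errors at end of input
    if (is.null(nxt)) break
    pieces[[length(pieces) + 1L]] <- nxt
  }
  do.call(rbind, pieces)
}
```

A dataset that times out comes back as NULL here, so the caller can test
is.null() and move on to the next file.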
Gabor Grothendieck
2008-Jan-09 20:27 UTC
[Rd] read.table: aborting based on a time constraint
Use file.file()$size to find out how large the file is and skip files
larger than some cutoff.

On Jan 9, 2008 2:01 PM, Derek Stephen Elmerick <delmeric at gmail.com> wrote:

> Hello –
>
> I am trying to write code that will read in multiple datasets;
> however, I would like to skip any dataset where the read-in process
> takes longer than some fixed cutoff. A generic version of the function
> is the following:
>
> for(k in 1:number.of.datasets)
> {
>   X[k] = read.table(…)
> }
>
> The issue is that I cannot find a way to embed logic that will abort
> the read-in process of a specific dataset without manual intervention.
> I scanned the help manual and other postings, but no luck based on my
> search. Any thoughts?
>
> Thanks,
> Derek
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
Gabor Grothendieck
2008-Jan-09 20:27 UTC
[Rd] read.table: aborting based on a time constraint
That was supposed to be file.info()$size

On Jan 9, 2008 3:27 PM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:

> Use file.file()$size to find out how large the file is
> and skip files larger than some cutoff.
>
> On Jan 9, 2008 2:01 PM, Derek Stephen Elmerick <delmeric at gmail.com> wrote:
>
> > Hello –
> >
> > I am trying to write code that will read in multiple datasets;
> > however, I would like to skip any dataset where the read-in process
> > takes longer than some fixed cutoff. A generic version of the function
> > is the following:
> >
> > for(k in 1:number.of.datasets)
> > {
> >   X[k] = read.table(…)
> > }
> >
> > The issue is that I cannot find a way to embed logic that will abort
> > the read-in process of a specific dataset without manual intervention.
> > I scanned the help manual and other postings, but no luck based on my
> > search. Any thoughts?
> >
> > Thanks,
> > Derek
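[Editorial note: the corrected suggestion might look like this in practice.
This is a sketch, not code from the thread; the ".txt" file pattern and the
10 MB cutoff are arbitrary placeholders.]

```r
# Sketch: check file.info()$size up front and skip files over a size cutoff,
# rather than trying to abort read.table() mid-read. The pattern and the
# threshold below are placeholders, not values from the thread.
files <- list.files(pattern = "\\.txt$")
cutoff.bytes <- 10 * 1024^2                    # 10 MB, an arbitrary example
X <- list()
for (f in files) {
  if (file.info(f)$size > cutoff.bytes) next   # too large: skip it
  X[[f]] <- read.table(f)
}
```

File size is only a proxy for read-in time, but it is checked before any
parsing starts, which is what makes this approach simple.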
On Wed, 9 Jan 2008, Derek Stephen Elmerick wrote:

> Hello –
>
> I am trying to write code that will read in multiple datasets;
> however, I would like to skip any dataset where the read-in process
> takes longer than some fixed cutoff. A generic version of the function
> is the following:
>
> for(k in 1:number.of.datasets)
> {
>   X[k] = read.table(…)
> }
>
> The issue is that I cannot find a way to embed logic that will abort
> the read-in process of a specific dataset without manual intervention.
> I scanned the help manual and other postings, but no luck based on my
> search. Any thoughts?

A long time ago, for S, I wrote a timeout(expr, seconds) function that
would return the value of expr if it completed before 'seconds' seconds
went by, but would throw an error otherwise. One could then use
try(timeout(expr, seconds)) to catch the error. It only worked on Unix, as
it spawned another process that would send an interrupt signal back to S
after the allotted time. If the expression were evaluated before the
interrupter process finished, then S would kill the interrupter process.
It relied on S catching the interrupt and turning it into an error
condition.

Translated into R code (unix() -> system(intern=TRUE)), this function
would be:

    timeout <- function(expr, seconds = 60) {
        killer.pid <- system(intern = TRUE,
            paste("(sleep", seconds,
                  "; echo 'Timed out after", seconds, "seconds' 1>&2",
                  "; kill -INT", Sys.getpid(), ") >/dev/null &\n echo $!"))
        on.exit(system(paste("kill", killer.pid, "> /dev/null 2>&1")))
        expr
    }

E.g.,

    > timeout(log(2), seconds=1)
    [1] 0.6931472
    > timeout(while(TRUE)log(2), seconds=1)
    Timed out after 1 seconds
    > # you get the prompt back in 1 second

R's try() doesn't catch interrupts, and I haven't studied tryCatch enough
to use it to catch the interrupt. This is pretty ugly, but I was wondering
if R had the facilities to write such a timeout() function. I used to use
it to automate tests of infinite-loop bugs.
----------------------------------------------------------------------------
Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author and
  do not necessarily reflect Insightful Corporation policy or position."
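[Editorial note: the facility Bill asks about does exist in later versions
of base R. setTimeLimit() raises an error once an elapsed-time budget is
exceeded, and tryCatch() can turn that into a skip. A minimal sketch,
assuming an R version that provides setTimeLimit():]

```r
# Sketch of a timeout using base R's setTimeLimit() plus tryCatch().
# Assumes a modern R with setTimeLimit(); returns NULL if expr times out.
# Note the handler catches *any* error, not just the time limit; a real
# implementation might inspect conditionMessage(e) to distinguish them.
timeout <- function(expr, seconds = 60) {
  setTimeLimit(elapsed = seconds, transient = TRUE)
  on.exit(setTimeLimit(elapsed = Inf))     # clear the limit on the way out
  tryCatch(expr, error = function(e) {
    message("Timed out after ", seconds, " seconds")
    NULL
  })
}

# e.g. X[[k]] <- timeout(read.table(files[k]), seconds = 30)
```

Unlike the fork-and-kill version, this needs no shell and works on any
platform, since the interpreter itself enforces the limit.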
Dirk Eddelbuettel
2008-Jan-09 20:40 UTC
[Rd] read.table: aborting based on a time constraint
On Wed, Jan 09, 2008 at 03:27:31PM -0500, Gabor Grothendieck wrote:

> Use file.file()$size to find out how large the file is
> and skip files larger than some cutoff.

You presumably meant file.info()$size

Dirk

> On Jan 9, 2008 2:01 PM, Derek Stephen Elmerick <delmeric at gmail.com> wrote:
>
> > Hello –
> >
> > I am trying to write code that will read in multiple datasets;
> > however, I would like to skip any dataset where the read-in process
> > takes longer than some fixed cutoff. A generic version of the function
> > is the following:
> >
> > for(k in 1:number.of.datasets)
> > {
> >   X[k] = read.table(…)
> > }
> >
> > The issue is that I cannot find a way to embed logic that will abort
> > the read-in process of a specific dataset without manual intervention.
> > I scanned the help manual and other postings, but no luck based on my
> > search. Any thoughts?
> >
> > Thanks,
> > Derek

--
Three out of two people have difficulties with fractions.