Derek Stephen Elmerick
2008-Jan-09 19:01 UTC
[Rd] read.table: aborting based on a time constraint
Hello –

I am trying to write code that will read in multiple datasets; however, I
would like to skip any dataset where the read-in process takes longer than
some fixed cutoff. A generic version of the function is the following:

    for(k in 1:number.of.datasets)
    {
      X[k] = read.table(…)
    }

The issue is that I cannot find a way to embed logic that will abort the
read-in process of a specific dataset without manual intervention. I scanned
the help manual and other postings, but no luck based on my search. Any
thoughts?

Thanks,
Derek
On Jan 9, 2008 2:01 PM, Derek Stephen Elmerick <delmeric@gmail.com> wrote:

> Hello –
>
> I am trying to write code that will read in multiple datasets;
> however, I would like to skip any dataset where the read-in process
> takes longer than some fixed cutoff. A generic version of the function
> is the following:
>
> for(k in 1:number.of.datasets)
> {
>   X[k] = read.table(…)
> }
>
> The issue is that I cannot find a way to embed logic that will abort
> the read-in process of a specific dataset without manual intervention.
> I scanned the help manual and other postings, but no luck based on my
> search. Any thoughts?

A simple solution is to use nrows=1000000 or so (whatever makes sense).
Then, any dataset larger than that will be truncated. If you use a
connection, you could even check after the read.table completes to see
whether more rows are available -- if so, the entire dataset has not been
read.

A slightly more complicated solution might be to read in 1000 lines or so
(depends a bit on the data) at a time and then rbind the results of
multiple read.table() calls at the end. If you capture the colClasses from
the first read, this can potentially be even faster than standard
read.table() on the whole dataset. You can read from a connection so that
the file does not need to be reopened and the connection need not be
reset. You could check the time after each chunk of lines to see if you
have exceeded your threshold.

There, of course, may be more clever solutions that I haven't thought of.

Sean
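[Editorial note: the chunked approach described above might be sketched
roughly as follows. This is not code from the thread; the function name
read.chunked, the chunk size, the cutoff, and the NULL-on-timeout
convention are all illustrative choices.]

```r
# Sketch of the chunked approach described above: read `chunk` lines at a
# time from a connection, give up once `cutoff` seconds elapse, and rbind
# the pieces at the end. All names here are illustrative, not from the thread.
read.chunked <- function(file, chunk = 1000, cutoff = 60, ...) {
  con <- file(file, open = "r")
  on.exit(close(con))
  start <- Sys.time()
  # The first chunk establishes colClasses, which speeds up later chunks
  first <- read.table(con, nrows = chunk, ...)
  classes <- sapply(first, class)
  pieces <- list(first)
  repeat {
    if (difftime(Sys.time(), start, units = "secs") > cutoff)
      return(NULL)                            # over budget: skip this dataset
    nxt <- tryCatch(read.table(con, nrows = chunk, colClasses = classes, ...),
                    error = function(e) NULL) # read.table errors at end of input
    if (is.null(nxt)) break
    pieces[[length(pieces) + 1L]] <- nxt
  }
  do.call(rbind, pieces)
}
```

A dataset that times out comes back as NULL here, so the caller can test
is.null() and move on to the next file.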
Gabor Grothendieck
2008-Jan-09 20:27 UTC
[Rd] read.table: aborting based on a time constraint
Use file.file()$size to find out how large the file is and skip files
larger than some cutoff.

On Jan 9, 2008 2:01 PM, Derek Stephen Elmerick <delmeric at gmail.com> wrote:

> Hello –
>
> I am trying to write code that will read in multiple datasets;
> however, I would like to skip any dataset where the read-in process
> takes longer than some fixed cutoff. A generic version of the function
> is the following:
>
> for(k in 1:number.of.datasets)
> {
>   X[k] = read.table(…)
> }
>
> The issue is that I cannot find a way to embed logic that will abort
> the read-in process of a specific dataset without manual intervention.
> I scanned the help manual and other postings, but no luck based on my
> search. Any thoughts?
>
> Thanks,
> Derek
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
Gabor Grothendieck
2008-Jan-09 20:27 UTC
[Rd] read.table: aborting based on a time constraint
That was supposed to be file.info()$size

On Jan 9, 2008 3:27 PM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:

> Use file.file()$size to find out how large the file is
> and skip files larger than some cutoff.
>
> On Jan 9, 2008 2:01 PM, Derek Stephen Elmerick <delmeric at gmail.com> wrote:
>
> > Hello –
> >
> > I am trying to write code that will read in multiple datasets;
> > however, I would like to skip any dataset where the read-in process
> > takes longer than some fixed cutoff. A generic version of the function
> > is the following:
> >
> > for(k in 1:number.of.datasets)
> > {
> >   X[k] = read.table(…)
> > }
> >
> > The issue is that I cannot find a way to embed logic that will abort
> > the read-in process of a specific dataset without manual intervention.
> > I scanned the help manual and other postings, but no luck based on my
> > search. Any thoughts?
> >
> > Thanks,
> > Derek
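[Editorial note: the corrected suggestion might look like this in practice.
This is a sketch, not code from the thread; the ".txt" file pattern and the
10 MB cutoff are arbitrary placeholders.]

```r
# Sketch: check file.info()$size up front and skip files over a size cutoff,
# rather than trying to abort read.table() mid-read. The pattern and the
# threshold below are placeholders, not values from the thread.
files <- list.files(pattern = "\\.txt$")
cutoff.bytes <- 10 * 1024^2                    # 10 MB, an arbitrary example
X <- list()
for (f in files) {
  if (file.info(f)$size > cutoff.bytes) next   # too large: skip it
  X[[f]] <- read.table(f)
}
```

File size is only a proxy for read-in time, but it is checked before any
parsing starts, which is what makes this approach simple.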
On Wed, 9 Jan 2008, Derek Stephen Elmerick wrote:

> Hello –
>
> I am trying to write code that will read in multiple datasets;
> however, I would like to skip any dataset where the read-in process
> takes longer than some fixed cutoff. A generic version of the function
> is the following:
>
> for(k in 1:number.of.datasets)
> {
>   X[k] = read.table(…)
> }
>
> The issue is that I cannot find a way to embed logic that will abort
> the read-in process of a specific dataset without manual intervention.
> I scanned the help manual and other postings, but no luck based on my
> search. Any thoughts?

A long time ago, for S, I wrote a timeout(expr, seconds) function that
would return the value of expr if it completed before 'seconds' seconds
went by, but would throw an error otherwise. One could then use
try(timeout(expr, seconds)) to catch the error. It only worked on Unix, as
it spawned another process that would send an interrupt signal back to S
after the allotted time. If the expression were evaluated before the
interrupter process finished, then S would kill the interrupter process.
It relied on S catching the interrupt and turning it into an error
condition.

Translated into R code (unix() -> system(intern=TRUE)), this function
would be:

    timeout <- function(expr, seconds = 60) {
        killer.pid <- system(intern = TRUE,
            paste("(sleep", seconds,
                  "; echo 'Timed out after", seconds, "seconds' 1>&2",
                  "; kill -INT", Sys.getpid(), ") >/dev/null &\n echo $!"))
        on.exit(system(paste("kill", killer.pid, "> /dev/null 2>&1")))
        expr
    }

E.g.,

    > timeout(log(2), seconds=1)
    [1] 0.6931472
    > timeout(while(TRUE)log(2), seconds=1)
    Timed out after 1 seconds
    > # you get the prompt back in 1 second

R's try() doesn't catch interrupts, and I haven't studied tryCatch enough
to use it to catch the interrupt. This is pretty ugly, but I was wondering
if R had the facilities to write such a timeout() function. I used to use
it to automate tests of infinite-loop bugs.
----------------------------------------------------------------------------
Bill Dunlap
Insightful Corporation
bill at insightful dot com
360-428-8146

 "All statements in this message represent the opinions of the author and
  do not necessarily reflect Insightful Corporation policy or position."
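[Editorial note: the facility Bill asks about does exist in later versions
of base R. setTimeLimit() raises an error once an elapsed-time budget is
exceeded, and tryCatch() can turn that into a skip. A minimal sketch,
assuming an R version that provides setTimeLimit():]

```r
# Sketch of a timeout using base R's setTimeLimit() plus tryCatch().
# Assumes a modern R with setTimeLimit(); returns NULL if expr times out.
# Note the handler catches *any* error, not just the time limit; a real
# implementation might inspect conditionMessage(e) to distinguish them.
timeout <- function(expr, seconds = 60) {
  setTimeLimit(elapsed = seconds, transient = TRUE)
  on.exit(setTimeLimit(elapsed = Inf))     # clear the limit on the way out
  tryCatch(expr, error = function(e) {
    message("Timed out after ", seconds, " seconds")
    NULL
  })
}

# e.g. X[[k]] <- timeout(read.table(files[k]), seconds = 30)
```

Unlike the fork-and-kill version, this needs no shell and works on any
platform, since the interpreter itself enforces the limit.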
Dirk Eddelbuettel
2008-Jan-09 20:40 UTC
[Rd] read.table: aborting based on a time constraint
On Wed, Jan 09, 2008 at 03:27:31PM -0500, Gabor Grothendieck wrote:

> Use file.file()$size to find out how large the file is
> and skip files larger than some cutoff.

You presumably meant file.info()$size

Dirk

> On Jan 9, 2008 2:01 PM, Derek Stephen Elmerick <delmeric at gmail.com> wrote:
>
> > Hello –
> >
> > I am trying to write code that will read in multiple datasets;
> > however, I would like to skip any dataset where the read-in process
> > takes longer than some fixed cutoff. A generic version of the function
> > is the following:
> >
> > for(k in 1:number.of.datasets)
> > {
> >   X[k] = read.table(…)
> > }
> >
> > The issue is that I cannot find a way to embed logic that will abort
> > the read-in process of a specific dataset without manual intervention.
> > I scanned the help manual and other postings, but no luck based on my
> > search. Any thoughts?
> >
> > Thanks,
> > Derek

--
Three out of two people have difficulties with fractions.