thr3ads.net - R devel - An idea for something better than read.table [Feb 1999]

Peter Dalgaard BSA

1999-Feb-11 17:46 UTC

An idea for something better than read.table

I was recently converting some datasets for use in an R package and it
occurred to me that there really is no "neat" way to input a data
frame if it is to contain factor variables. 

One can use dput()/source or dump() after massaging data into the
right format, of course, but there isn't really anything which allows
you to store the input instructions with the data beyond the simple
header=T type format. 

So I thought of ways to enhance the header. The best idea I've been
able to come up with so far is to

(a) Write a function - basically an extension of scan() - which allows
    you to specify the column data type in more detail. Let's call it
    data.file() for now. It would pretty much have to deparse all of
    its arguments and interpret things in slightly unusual ways, but R
    can do that, and some functions (notably help() and data())
    already play this kind of game with the parser...

(b) Have a function, say read(), which parses the 1st expression in a
    file and executes it *with the remainder of the file as the
    argument*. (Currently, this is impossible, but it would become
    possible if one just kept track of the line number while parsing. parse()
    could stick it on as an attribute of the parsed expression list if
    asked to do so.)

This would make a file format something like the following possible.

[There's another loose idea in there involving a control item to handle
separators, na.strings, etc. - the intention being that read() plugs
in the file= and skip= arguments for the actual call.]

Would this be an approach worth pursuing?

--- Top of file ---
data.file(control(sep="w",na="."),
        Item = factor(levels=1:4,labels=c("A","B","C","D")),
        Size = numeric(),
        Year = factor(levels=1980:1985)
)
1       0     1980    
1       10    1981    
1       14    1982    
1       20    1983    
1       25    1984    
1       30    1985    
2       0     1980    
2       5     1981    
2       6     1982    
2       8     1984    
3       0     1984    
3       2     1985    
4       0     1980    
4       20    1981    
4       30    1982    
4       30    1984    
4       35    1985    
--- End of file ---
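
[Editor's sketch, not part of the original post.] With R as it is today, the read() idea can be approximated by finding the shortest prefix of the file that parses as a complete expression and treating the rest as data. The name read_annotated() is hypothetical, and the code assumes that sep="w" was meant as "any whitespace". It works on the unevaluated header call, since evaluating factor(levels=1:4, labels=...) directly would throw away the raw codes.

```r
# Hypothetical reader for the proposed format (not an existing R function).
read_annotated <- function(lines) {
  # Find the shortest prefix of the file that parses as one complete expression.
  header <- NULL
  for (k in seq_along(lines)) {
    header <- tryCatch(parse(text = lines[seq_len(k)])[[1]],
                       error = function(e) NULL)
    if (!is.null(header)) break
  }
  if (is.null(header)) stop("no complete header expression found")

  # Work on the unevaluated call, so that levels= still refers to the raw codes.
  args <- as.list(header)[-1]
  is_control <- vapply(args, function(a)
    is.call(a) && identical(a[[1]], as.name("control")), logical(1))
  ctrl  <- if (any(is_control)) as.list(args[[which(is_control)[1]]])[-1] else list()
  specs <- args[!is_control]

  # Assumption: sep="w" means "any whitespace", which read.table spells sep="".
  sep <- if (is.null(ctrl$sep) || identical(ctrl$sep, "w")) "" else ctrl$sep
  na  <- if (is.null(ctrl$na)) "NA" else ctrl$na

  # Everything after the header is data; read it as text and convert below.
  raw <- read.table(text = lines[-seq_len(k)], sep = sep, na.strings = na,
                    col.names = names(specs), colClasses = "character")

  cols <- Map(function(spec, x) {
    if (identical(spec[[1]], as.name("factor"))) {
      spec$x <- x        # plug the column in: factor(x = ..., levels = ...)
      eval(spec)
    } else {
      as.vector(x, mode = typeof(eval(spec)))  # numeric() has type "double"
    }
  }, specs, raw)
  as.data.frame(cols)
}

lines <- c(
  'data.file(control(sep="w",na="."),',
  '        Item = factor(levels=1:4,labels=c("A","B","C","D")),',
  '        Size = numeric(),',
  '        Year = factor(levels=1980:1985)',
  ')',
  '1       0     1980',
  '1       10    1981',
  '2       5     1981',
  '4       35    1985'
)
d <- read_annotated(lines)
str(d)
```

Because the header is evaluated as R code, a reader like this only makes sense for files you trust.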

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To:
r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Kurt Hornik

1999-Feb-11 18:10 UTC

An idea for something better than read.table

>>>>> Peter Dalgaard BSA writes:
> Would this be an approach worth pursuing?
I think so.  However, why can't we extend scan() accordingly?

E.g.,

  scan(FILE,
       what = list(Item = factor(levels=1:4,labels=c("A","B","C","D")),
                   Size = numeric(),
                   Year = factor(levels=1980:1985)))

???
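
[Editor's sketch, not part of the original message.] The call above is a proposal, not something scan() is documented to do. A rough equivalent using only documented behaviour is to scan the columns as plain types and apply the factor definitions afterwards. The sample rows are taken from the example file; the variable names are illustrative.

```r
# A few data rows from the example file, supplied through scan()'s text= argument.
txt <- c("1 0 1980", "1 10 1981", "2 5 1981", "4 35 1985")

# Step 1: scan the columns as plain integer/numeric vectors.
raw <- scan(text = txt,
            what = list(Item = integer(), Size = numeric(), Year = integer()),
            quiet = TRUE)

# Step 2: apply the factor definitions that the proposal would put inside what=.
d <- data.frame(
  Item = factor(raw$Item, levels = 1:4, labels = c("A", "B", "C", "D")),
  Size = raw$Size,
  Year = factor(raw$Year, levels = 1980:1985)
)
str(d)
```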

-k

Kjetil Halvorsen

1999-Feb-24 20:54 UTC

An idea for something better than read.table

It is nice if data files can have formats that are not too heavily
dependent on the package.  What I do to read in data is to keep the
data (with header) in, say, data.dat, and then have data.R hold the
commands for defining factors, levels, contrasts or whatever. That
seems cleaner than mixing data and definitions in one file.
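
[Editor's sketch, not part of the original message.] The two-file layout might look roughly like this. The rows come from the example earlier in the thread; the temporary directory and the choice of contr.sum are assumptions made only so the snippet runs on its own.

```r
# data.dat: plain data with a header line, written to a temporary
# directory here so the example is self-contained.
dir <- tempfile("dataset")
dir.create(dir)
dat_file <- file.path(dir, "data.dat")
writeLines(c("Item Size Year",
             "1 0 1980",
             "1 10 1981",
             "2 5 1981",
             "4 35 1985"), dat_file)

# data.R: the commands that turn the plain columns into factors.
d <- read.table(dat_file, header = TRUE)
d$Item <- factor(d$Item, levels = 1:4, labels = c("A", "B", "C", "D"))
d$Year <- factor(d$Year, levels = 1980:1985)
contrasts(d$Item) <- contr.sum(4)   # illustrative choice of contrasts
str(d)
```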

Kjetil Halvorsen


