I was recently converting some datasets for use in an R package and it
occurred to me that there really is no "neat" way to input a data
frame if it is to contain factor variables.

One can use dput()/source or dump() after massaging data into the
right format, of course, but there isn't really anything which allows
you to store the input instructions with the data beyond the simple
header=T type format.

So I thought of ways to enhance the header. The best idea I've been
able to come up with thus far is to

(a) Write a function - basically an extension of scan() - which allows
    you to specify the column data type in more detail. Let's call it
    data.file() for now. It would pretty much have to deparse all of
    its arguments and interpret things in slightly unusual ways, but R
    can do that, and some of the functions (notably help() and data())
    already play this kind of game with the parser...

(b) Have a function, say read(), which parses the 1st expression in a
    file and executes it *with the remainder of the file as the
    argument*. (Currently, this is impossible, but it would be possible
    if one just kept track of the line number while parsing. parse()
    could stick it on as an attribute of the parsed expression list if
    asked to do so.)

This would make a file format something like the following possible.

[There's another loose idea in there involving a control item to handle
separators, na.strings, etc. - the intention being that read() plugs
in the file= and skip= arguments for the actual call.]

Would this be an approach worth pursuing?

--- Top of file ---
data.file(control(sep="w",na="."),
          Item = factor(levels=1:4,labels=c("A","B","C","D")),
          Size = numeric(),
          Year = factor(levels=1980:1985)
)
1 0 1980
1 10 1981
1 14 1982
1 20 1983
1 25 1984
1 30 1985
2 0 1980
2 5 1981
2 6 1982
2 8 1984
3 0 1984
3 2 1985
4 0 1980
4 20 1981
4 30 1982
4 30 1984
4 35 1985
--- End of file ---

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
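
[Editorial sketch, not part of the original post.] Ideas (a) and (b) can be
approximated without any change to parse(): read the file as lines, grow the
header one line at a time until it parses as one complete call, and treat the
rest as data. Everything below is an assumption made for illustration: the
name read_with_header() is hypothetical, only the sep= and na= items of
control() are honoured, the header's function name (data.file) is ignored
rather than called, and the example header drops sep= so that plain
whitespace separation is used.

```r
# Hypothetical helper: 'lines' holds the whole file, header expression first.
read_with_header <- function(lines) {
  # Grow the header line by line until it parses as exactly one expression.
  header <- NULL
  for (n in seq_along(lines)) {
    header <- tryCatch(parse(text = lines[seq_len(n)], keep.source = FALSE),
                       error = function(e) NULL)
    if (length(header) == 1L) break
  }
  if (length(header) != 1L) stop("no complete header expression found")

  # Unevaluated arguments of the header call: control(...) plus column specs.
  args <- as.list(header[[1]])[-1]
  is_ctl <- vapply(args, function(a)
    is.call(a) && identical(a[[1]], as.name("control")), logical(1))
  ctl <- if (any(is_ctl)) {
    lapply(as.list(args[[which(is_ctl)[1]]])[-1], eval)
  } else list()
  cols <- args[!is_ctl]

  # Read the remaining lines as character columns; convert afterwards.
  raw <- read.table(text = lines[-seq_len(n)],
                    sep = if (is.null(ctl$sep)) "" else ctl$sep,
                    na.strings = if (is.null(ctl$na)) "NA" else ctl$na,
                    colClasses = "character", col.names = names(cols))

  out <- lapply(seq_along(cols), function(i) {
    spec <- cols[[i]]
    x <- raw[[i]]
    if (identical(spec[[1]], as.name("factor"))) {
      spec$x <- x          # factor(levels = ...) gains the column as x
      eval(spec)
    } else {
      # e.g. numeric() evaluates to an empty vector of mode "numeric"
      as.vector(x, mode = mode(eval(spec)))
    }
  })
  names(out) <- names(cols)
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Usage with a shortened version of the file shown above.
lines <- c(
  'data.file(control(na = "."),',
  '  Item = factor(levels = 1:4, labels = c("A", "B", "C", "D")),',
  '  Size = numeric(),',
  '  Year = factor(levels = 1980:1985)',
  ')',
  '1 0 1980',
  '1 10 1981',
  '2 5 1981',
  '4 35 1985')
d <- read_with_header(lines)
str(d)
```

The column specs are kept unevaluated on purpose: evaluating
factor(levels=1:4, labels=...) on its own would yield an empty factor and
lose the original codes 1:4, which is presumably why the post says the
function would have to deparse its arguments.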
>>>>> Peter Dalgaard BSA writes:

> I was recently converting some datasets for use in an R package and it
> occurred to me that there really is no "neat" way to input a data
> frame if it is to contain factor variables.

> One can use dput()/source or dump() after massaging data into the
> right format, of course, but there isn't really anything which allows
> you to store the input instructions with the data beyond the simple
> header=T type format.

> So I thought of ways to enhance the header. The best idea I've been
> able to come up with this far is to

> (a) Write a function - basically an extension of scan() - which allows
>     you to specify the column data type in more detail. Let's call it
>     data.file() for now. It would pretty much have to deparse all of
>     its arguments and interpret things in slightly unusual ways, but R
>     can do that, and some of functions (notably help() and data())
>     already play this kind of game with the parser...

> (b) Have a function, say read(), which parses the 1st expression in a
>     file and executes it *with the remainder of the file as the
>     argument*. (Currently, this is impossible, but it would be if
>     one just kept track of the line number while parsing. parse()
>     could stick it on as an attribute of the parsed expression list if
>     asked to do so.)

> This would make a file format something like the following possible.

> [There's another loose idea in there involving a control item to handle
> separators, na.strings, etc. - the intention being that read() plugs
> in the file= and skip= arguments for the actual call.]

> Would this be an approach worth pursuing?

I think so.  However, why can't we extend scan() accordingly?  E.g.,

    scan(FILE,
         what = list(Item = factor(levels=1:4,labels=c("A","B","C","D")),
                     Size = numeric(),
                     Year = factor(levels=1980:1985)))

???

-k
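
[Editorial sketch, not part of the original reply.] For comparison, this is
roughly what the suggested call has to be split into with scan() as it
stands, assuming its what= list only conveys the storage type of each field:
the columns are read as plain integers and doubles, and the factor
definitions are applied in a second step. The inline text= data is an
assumption standing in for FILE.

```r
# A few rows from the example file, supplied inline instead of via FILE.
txt <- c("1 0 1980", "1 10 1981", "2 5 1981", "4 35 1985")

# Step 1: scan() reads each field using the type of the matching 'what' entry.
cols <- scan(text = txt,
             what = list(Item = 0L, Size = 0, Year = 0L),
             quiet = TRUE)

# Step 2: the factor definitions have to be applied by hand afterwards.
d <- data.frame(
  Item = factor(cols$Item, levels = 1:4, labels = c("A", "B", "C", "D")),
  Size = cols$Size,
  Year = factor(cols$Year, levels = 1980:1985)
)
str(d)
```

The proposed extension would fold step 2 into the what= argument.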
It is nice if data files can have formats that are not too heavily
dependent on the package. What I do to read in data is to keep the data
(with header) in, say, data.dat, and then have data.R with the commands
for defining factors, levels, contrasts or whatever. That seems cleaner
than mixing data and definitions in one file.

Kjetil Halvorsen

Peter Dalgaard BSA wrote:
>
> I was recently converting some datasets for use in an R package and it
> occurred to me that there really is no "neat" way to input a data
> frame if it is to contain factor variables.
>
> One can use dput()/source or dump() after massaging data into the
> right format, of course, but there isn't really anything which allows
> you to store the input instructions with the data beyond the simple
> header=T type format.
>
> So I thought of ways to enhance the header. The best idea I've been
> able to come up with this far is to
>
> (a) Write a function - basically an extension of scan() - which allows
>     you to specify the column data type in more detail. Let's call it
>     data.file() for now. It would pretty much have to deparse all of
>     its arguments and interpret things in slightly unusual ways, but R
>     can do that, and some of functions (notably help() and data())
>     already play this kind of game with the parser...
>
> (b) Have a function, say read(), which parses the 1st expression in a
>     file and executes it *with the remainder of the file as the
>     argument*. (Currently, this is impossible, but it would be if
>     one just kept track of the line number while parsing. parse()
>     could stick it on as an attribute of the parsed expression list if
>     asked to do so.)
>
> This would make a file format something like the following possible.
>
> [There's another loose idea in there involving a control item to handle
> separators, na.strings, etc. - the intention being that read() plugs
> in the file= and skip= arguments for the actual call.]
>
> Would this be an approach worth pursuing?
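
[Editorial sketch, not part of the original reply.] The two-file arrangement
described in this reply might look as follows. The temporary file stands in
for data.dat, the column names in its header line are taken from the example
earlier in the thread, and the choice of sum contrasts is purely an
assumption for illustration.

```r
# Stand-in for data.dat: a plain table with a header line.
dat <- tempfile(fileext = ".dat")
writeLines(c("Item Size Year",
             "1 0 1980",
             "1 10 1981",
             "2 5 1981",
             "4 35 1985"), dat)

# What the companion data.R would contain: read the table, then define
# factors, levels and contrasts separately from the data.
d <- read.table(dat, header = TRUE)
d$Item <- factor(d$Item, levels = 1:4, labels = c("A", "B", "C", "D"))
d$Year <- factor(d$Year, levels = 1980:1985)
contrasts(d$Item) <- contr.sum(4)   # assumed choice of contrasts
str(d)
```

The data file stays readable by any other program, at the cost of keeping
two files in step.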