I have the following problem. It is not of earthshaking importance,
but still I have spent a considerable amount of time thinking about
it.

PROBLEM: Is there any way I can have a single textfile that contains
both

  a) data

  b) programcode

The program should act on the data, if the textfile is source()'ed
into R.


BOUNDARY CONDITION: I want the data written in the textfile in exactly
the same format as I would use, if I had data in a separate textfile,
to be read by read.table(). That is, with 'horizontal inhomogeneity'
and 'vertical homogeneity' in the type of entries. I want to write
something like

  Sex     Respons
  Male    1
  Male    2
  Female  3
  Female  4

In effect, I am asking if there is some way I can convince
read.table(), that the data is contained in the following n lines of
text.


ILLEGAL SOLUTIONS:
I know I can simulate the behaviour by reading the columns of the
dataframe one by one, and using data.frame() to glue them together.
Like in

  data.frame(Sex = c('Male', 'Male', 'Female', 'Female'),
             Respons = c(1, 2, 3, 4))

I do not like this solution, because it represents the data in a
"transposed" way in the textfile, and this transposition makes the
structure of the dataframe less transparent - at least to me. It
becomes even less comprehensible if the Sex-factor above is written
with the help of rep() or gl() or the like.

I know I can make read.table() read from stdin, so I could type the
dataframe at the prompt. That is against the spirit of the problem,
as I describe below.

I know I can make read.table() do the job, if I split the data and the
programcode into different files. But as the purpose of the exercise
is to distribute the data and the code to other people, splitting
into several files is a complication.


MOTIVATION: I frequently find myself distributing small chunks of code
to my students, along with data on which the code can work.

As an example, I might want to demonstrate how model.matrix() treats
interactions, in a certain setting. For that I need a dataframe that
is complex enough to exhibit the behaviour I want, but still so small
that the model.matrix is easily understood. So I make such a
dataframe.

I am trying to distribute this dataframe along with my code, in a way
that is as simple as possible to USE for the students (hence the
one-file boundary condition) and to READ (hence the non-transposition
boundary condition).


Does anybody have any ideas?


Ernst Hansen
Department of Statistics
University of Copenhagen
> BOUNDARY CONDITION: I want the data written in the textfile in exactly
> the same format as I would use, if I had data in a separate textfile,
> to be read by read.table(). That is, with 'horizontal inhomogeneity'
> and 'vertical homogeneity' in the type of entries. I want to write
> something like
>
>   Sex     Respons
>   Male    1
>   Male    2
>   Female  3
>   Female  4
>

something like

  tmpfilename <- tempfile()
  tmpfile <- file(tmpfilename, "w")
  cat(

  ### here comes my data

  "Sex Respons",
  "Male 1",
  "Male 2",
  "Female 3",
  "Female 4",

  ### end of data input

  file = tmpfile, sep="\n")
  close(tmpfile)
  read.table(tmpfilename, header = TRUE)


best,

Torsten

> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
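The temporary-file route can be written a little more compactly. The following is only a sketch of the same idea, assuming that writeLines() is an acceptable substitute for the file()/cat()/close() sequence and that the temporary file should be removed once the data have been read; the name my.data is chosen here for illustration.

```r
# The data, one string per row, in the same layout read.table() expects.
rows <- c(
  "Sex     Respons",
  "Male    1",
  "Male    2",
  "Female  3",
  "Female  4"
)

tmpfilename <- tempfile()
writeLines(rows, tmpfilename)      # opens, writes and closes the file in one call
my.data <- read.table(tmpfilename, header = TRUE)
unlink(tmpfilename)                # remove the temporary file again

print(my.data)
```

writeLines() terminates every line with a newline, so the file ends with a complete final line.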
Following up on Torsten's solution, this one doesn't need a tempfile:

  my.data<-read.table(textConnection(c(

  ### here comes my data

  "Sex Respons",
  "Male 1",
  "Male 2",
  "Female 3",
  "Female 4"

  ### end of data input

  )),header=T)
  print(my.data)

HTH

Thomas

> -----Original Message-----
> From: Torsten Hothorn [mailto:Torsten.Hothorn at rzmail.uni-erlangen.de]
> Sent: 12 June 2003 14:00
> To: Ernst Hansen
> Cc: r-help at stat.math.ethz.ch
> Subject: Re: [R] Programcode and data in the same textfile

---

Thomas Hotz
Research Associate in Medical Statistics
University of Leicester
United Kingdom

Department of Epidemiology and Public Health
22-28 Princess Road West
Leicester
LE1 6TP
Tel +44 116 252-5410
Fax +44 116 252-5423

Division of Medicine for the Elderly
Department of Medicine
The Glenfield Hospital
Leicester
LE3 9QP
Tel +44 116 256-3643
Fax +44 116 232-2976
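To connect the textConnection() approach with the model.matrix() motivation in the original request, here is a small sketch. The formula Respons ~ Sex and the names con and mm are assumptions made for the illustration; they do not come from the thread.

```r
# Data kept in the "untransposed" layout, one string per row.
con <- textConnection(c(
  "Sex     Respons",
  "Male    1",
  "Male    2",
  "Female  3",
  "Female  4"
))
my.data <- read.table(con, header = TRUE)
close(con)   # read.table() does not close a connection that was already open

# The kind of demonstration the data frame is meant for.
mm <- model.matrix(Respons ~ Sex, data = my.data)
print(mm)
```

Keeping the connection in a variable makes it possible to close it explicitly; the nested one-liner leaves that to R.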
On 12-Jun-03 Ernst Hansen wrote:
> In effect, I am asking if there is some way I can convince
> read.table(), that the data is contained in the following n lines of
> text.

A thought which occurs to me, which (as far as I can tell) is not
already implemented (at any rate in read.table() which is where it
could have a natural home) is that, in the same spirit as

  read.table(file="stdin")

one could, if available, use

  read.table(file="<< EOT")

i.e. the "here document" style of redirection that has been a part
of Unix since approximately forever (if you take the origin of time
as 01/01/70 00:00).

Then the above data could be read in from within the source file by

  X<-read.table(header=TRUE,file="<< EOT")
  Sex Respons
  Male 1
  Male 2
  Female 3
  Female 4
  EOT

I.e. this form of the command would take input from the following lines
until "EOT" is encountered on a line by itself. In the Unix setup,
"EOT" could be anything so long as it won't occur on a line by itself
within the data, and is not included in the content which is read in.

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 12-Jun-03                                       Time: 14:21:00
------------------------------ XFMail ------------------------------
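The "<< EOT" syntax above is a proposal, not something read.table() accepts. In R versions where read.table() has a text argument, a multi-line string gives much the same effect; the sketch below assumes such a version, and the closing quote plays the role of the terminator line.

```r
# The data sit between the quotes exactly as they would in a separate file.
X <- read.table(header = TRUE, text = "
Sex     Respons
Male    1
Male    2
Female  3
Female  4
")

print(X)
```

The one restriction is that the data must not contain the quote character used to delimit the string.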
Ernst -

Here's a solution which works for me, and seems to do what
you want. It's a bit of a hack, since it requires you, the
author, to know in advance what file path name the student
will have saved the file as. In my example, this will be
"./r.source.file", and this includes one blank line before
the first assignment statement below. It also requires
knowing how many lines of code precede the data lines.
But it _is_ a one-file solution, as requested.

Put the following 9 or 10 lines into a file named
"r.source.file", then source it.

data.01 <- read.table(file="r.source.file",
                      header=T, skip=4,
                      comment.char="")[-1]
# junk Sex Response
# Male 1
# Male 2
# Female 3
# Female 4

I'm quite surprised no one else has suggested this already.

- tom blackwell - u michigan medical school - ann arbor -

On Thu, 12 Jun 2003, Ernst Hansen wrote:

> PROBLEM: Is there any way I can have a single textfile that contains both
> a) data b) programcode
> The program should act on the data, if the textfile is source()'ed
> into R.
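The dependence on a hard-coded skip count can be relaxed by locating the data through a marker line. This is a sketch of a variant, not the method posted above: the marker "# BEGIN DATA" is invented for the illustration, and the script is written to a temporary file only so that the example is self-contained.

```r
# Stand-in for the distributed file: code first, commented-out data after a marker.
script <- tempfile(fileext = ".R")
writeLines(c(
  "# (the program code would go here)",
  "# BEGIN DATA",
  "# Sex Respons",
  "# Male 1",
  "# Male 2",
  "# Female 3",
  "# Female 4"
), script)

# Count the lines up to and including the marker instead of hard-coding 'skip'.
n.skip <- grep("^# BEGIN DATA$", readLines(script))

# With comment.char = "" the leading "#" is read as an ordinary first column,
# which [-1] then drops again.
MyFrame <- read.table(script, header = TRUE, skip = n.skip, nrows = 4,
                      comment.char = "")[-1]
print(MyFrame)
unlink(script)
```

The header line and the data lines must have the same number of fields after the "#", otherwise read.table() will complain about the column count.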
Hi Ernst.

I have found myself in a similar situation where I want to send code to
someone with annotations that explain the different pieces in richer
ways than comments will permit.

If you want to contain both data and code within a single document,
you will need to have some way to identify which is which so that the
software can distinguish the different elements of the document.
This is precisely what a markup language does. And rather than
inventing ad hoc conventions, why not simply use a real markup
language? XML is the most natural one, and doing something like

<doc>
<data>
Sex Response
Male 1
Male 2
Female 3
Female 4
</data>

<code>
......
</code>
</doc>

Using the XML package, you can read the document into R and do what
you will with it. To read the data,

  library(XML)
  tr = xmlRoot(xmlTreeParse("myFile"))
  read.table(textConnection(xmlValue(tr[["data"]])), header=TRUE)

and to access the code text

  xmlValue(tr[["code"]])

I have a variety of different variants of this style of thing that I
occasionally add to the SXMLDocs package. But, for me at least, it
is easy to write handlers to process the different content but to
leave XML to identify them within the document.

Hope this provides some ideas for thinking about the problem
in a slightly broader light.

D.

--
_______________________________________________________________

Duncan Temple Lang                duncan at research.bell-labs.com
Bell Labs, Lucent Technologies    office: (908)582-3217
700 Mountain Avenue, Room 2C-259  fax:    (908)582-3340
Murray Hill, NJ 07974-2070
        http://cm.bell-labs.com/stat/duncan
My request for a way of having both data and R-code in the same
textfile resulted in a considerable number of very good suggestions,
which I will now summarize.

The boundary conditions for the problem were as follows: the data
should be written in the textfile in a format that was readable to the
human eye. And this ruled out the 'transposed' way of writing the
data that is used in most help-files, e.g. in ?model.matrix.

As the purpose of the exercise is to make the textfile easy to read,
there is a limit to how complicated the extra code should be -
otherwise it would make matters worse. I don't know if any of the
solutions below qualify in this sense - but I surely learned a lot
from them.


The most popular idea was using textConnection() in combination
with read.table(). For instance Thomas Hotz wrote it like

  # Solution by Thomas Hotz
  MyFrame <- read.table(textConnection(c(
    'Sex Respons',
    'Male 1',
    'Male 2',
    'Female 3',
    'Female 4'
    )), header = T)

Gabor Grothendieck had a similar solution. James Holtman provided a
nifty trick to get rid of the strategically placed commas and
quotations, using escaped carriage returns,

  # Solution by James Holtman
  MyFrame <- read.table(textConnection('\
  Sex Respons \
  Male 1 \
  Male 2 \
  Female 3 \
  Female 4 \
  '), header = T, skip = 1)

Duncan Temple Lang suggested that the entire textfile should be
wrapped up as XML, and parsed via the XML package. In the context of
me and my students, I think that this would be overkill, and I also
think it necessarily breaks the one-file boundary condition, but in a
larger context it seems like excellent advice.

  # Solution by Duncan Temple Lang
  # Content of myFile.q

  <doc>
  <data>
  Sex Response
  Male 1
  Male 2
  Female 3
  Female 4
  </data>

  <code>
  ......
  </code>
  </doc>

To read the data,

  tr = xmlRoot(xmlTreeParse("myFile.q"))
  read.table(textConnection(xmlValue(tr[["data"]])), header=TRUE)

and to access the code text

  xmlValue(tr[["code"]])


A number of approaches not based on textConnection() emerged, though.
Torsten Hothorn suggested that the data should be surrounded by some
kind of print-statement, writing it to a temporary file. Then
read.table() could be used to retrieve the data:

  # Torsten Hothorn's solution:
  tmpfilename <- tempfile()
  tmpfile <- file(tmpfilename, 'w')
  cat(
    'Sex Respons',
    'Male 1',
    'Male 2',
    'Female 3',
    'Female 4',
    file = tmpfile, sep='\n')
  close(tmpfile)
  read.table(tmpfilename, header = TRUE)

Barry Rowlingson suggested that the data should be written as a vector
of characters, and then shaped by hand:

  # Barry Rowlingson's solution
  # (no comma after the last element, or c() fails with an empty argument)
  data <- c(
    'Sex', 'Respons',
    'Male', 1,
    'Female', 2,
    'Male', 3,
    'Male', 2
    )
  ncol <- 2
  nrow <- length(data)/ncol
  heads <- data[1:ncol]
  data <- data[-(1:ncol)]
  asDF <- data.frame(matrix(data,ncol=ncol,byrow=T))
  # go via character, so that a factor column yields values, not level codes
  asDF[,2] <- as.numeric(as.character(asDF[,2]))
  names(asDF) <- heads

Finally, Thomas Blackwell and Greg Louis implemented a nice idea,
where the data are commented out in the textfile, but where a call to
read.table() from within the file makes it read exactly those lines,
using a different convention for comments:

  # Greg Louis' solution
  MyFrame <- read.table('myFile.q', header = T, skip = 28, nrows = 4,
                        comment.char="")[-1]
  # Sex Respons
  # Male 1
  # Male 2
  # Female 3
  # Female 4

Exactly how many lines need to be skipped depends on the
circumstances. nrows is the number of cases in the dataframe.


Thank you all for participating.


Ernst Hansen
Department of Statistics
University of Copenhagen
Dear R users,

I am looking for a more efficient way to compute the sum of columns of
a matrix. I am currently using

  apply(data, 2, sum)

However, I am building a data set from another one by summing the
columns of some parts of the matrix. The loop is taking too long
(about 1/2 hour) for a 4462 * 202 matrix.

Thanks,

Jean Eid
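For what it is worth, base R has colSums() for plain column sums and rowsum() for column sums within groups of rows. The sketch below uses random numbers and an invented grouping vector called part, since the post does not say how the parts of the matrix are defined.

```r
set.seed(1)
data <- matrix(rnorm(4462 * 202), nrow = 4462, ncol = 202)

# Plain column sums: colSums() gives the same numbers as apply(data, 2, sum).
s.apply <- apply(data, 2, sum)
s.fast  <- colSums(data)

# Column sums for each part of the matrix in one call, without a loop.
# 'part' is a hypothetical grouping of the rows into three blocks.
part <- rep(1:3, times = c(2000, 2000, 462))
by.part <- rowsum(data, group = part)   # one row of column sums per block

dim(by.part)
```

Whether this removes the half-hour wait depends on what else the loop does, which the post does not show.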
Ernst Hansen (erhansen@math.ku.dk) had asked to combine code and data
in a single text file that can be sourced such that the data is
included in the sourced file in a way that can be read by read.table.

He has already summarized the responses and the solution below builds
on that summary. It has the advantages of allowing multiple embedded
data files and comments within the data. It's also reasonably simple,
requiring only (1) a one line function acting as an alternative to
stdin() and (2) a call to this new function within the read.table,
scan or other read statement.

It would be even cleaner just to use stdin() but unfortunately stdin()
does not work in sourced files (bug?). Thus, this solution can be
regarded as a workaround to that.

To run the example below, place it in a text file called myFile.r in
the top level directory and source it from the R command line:

  source("/myFile.r")

# start of example

myFile <- "/myFile.r"
my.stdin <- function( filename, tag )
  textConnection( sub(tag, "", grep(tag,readLines(filename),value=T)) )

x <- read.table( my.stdin(myFile,"^#x"), header=T )
#x Sex Response # this example has a header
#x Male 1
#x Male 2
#x Female 3
#x Female 4

y <- read.table( my.stdin(myFile,"^#y") )
#y 3.4 4 # this example has no header
#y 3 3
#y 6 6

z <- scan( my.stdin(myFile,"^#z") )
#z 3 5 4 6 7
#z 8
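To see what my.stdin() hands to read.table(), the grep() and sub() steps can be tried on a character vector standing in for the result of readLines(). This is only an illustration; the names script.lines, tagged and payload are made up for it.

```r
# Stand-in for readLines(myFile): one code line plus tagged data lines.
script.lines <- c(
  'x <- read.table( my.stdin(myFile,"^#x"), header=T )',
  "#x Sex Response # this example has a header",
  "#x Male 1",
  "#x Male 2",
  "#x Female 3",
  "#x Female 4",
  "#z 3 5 4 6 7",
  "#z 8"
)

# Step 1: keep only the lines that start with the tag.
tagged <- grep("^#x", script.lines, value = TRUE)
# Step 2: strip the tag, leaving ordinary read.table() input.
payload <- sub("^#x", "", tagged)

con <- textConnection(payload)
x <- read.table(con, header = TRUE)   # the trailing "# ..." is dropped as a comment
close(con)

con <- textConnection(sub("^#z", "", grep("^#z", script.lines, value = TRUE)))
z <- scan(con, quiet = TRUE)
close(con)

print(x)
print(z)
```

Because the pattern is anchored with "^", the code line that mentions "^#x" as a string is not picked up as data.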
Here is a further improvement on sourcing code and data from the same
file, namely, the sourced file no longer needs to specify its name and
location. (Instead, my.stdin grabs this from the environment within
the source command, which is one of its ancestors.)

It also occurred to me that the use of my.stdin() does have one
potential advantage over stdin(), even assuming that the problem with
stdin() not working in sourced files is ultimately addressed in R. In
the case where the data is lengthy, it might be desirable to place the
data at the end of the code so as not to break it up. The data read by
my.stdin() can be placed anywhere in the file.

In the example below, the data for x is placed right after the
statement which reads in x but the data for y and z are placed at the
end of this file. The file and path of the file are no longer
explicitly specified.

# source the following file from R

my.stdin <- function( tag, this.file = eval.parent(quote(file),n=3) )
  textConnection( sub(tag, "", grep(tag,readLines(this.file),value=T)) )

x <- read.table( my.stdin("^#x"), header=T )
#x Sex Response # this data has a header
#x Male 1
#x Male 2
#x Female 3
#x Female 4

y <- read.table( my.stdin("^#y") )

z <- scan( my.stdin("^#z") )

# -- data

#y 3.4 4 # this is first line of y data
#y 3 3
#y 6 6

#z 3 5 4 6 7
#z 8
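The eval.parent() lookup relies on how source() names and keeps its own variables, so it may behave differently from one R version to another. A sketch of a different technique, which hands the path to the script explicitly, is given below. The helper source_with_path() and the variable this.file as used here are inventions for the illustration, and the script is written to a temporary file only to keep the example self-contained.

```r
# Hypothetical helper: source a script in a fresh environment in which the
# script's own path is available as 'this.file'.
source_with_path <- function(path) {
  env <- new.env()
  assign("this.file", path, envir = env)
  source(path, local = env)
  invisible(env)
}

# Stand-in for the distributed file, with the data placed after the code.
script <- tempfile(fileext = ".R")
writeLines(c(
  'my.stdin <- function(tag)',
  '  textConnection(sub(tag, "", grep(tag, readLines(this.file), value = TRUE)))',
  'y <- read.table(my.stdin("^#y"))',
  'z <- scan(my.stdin("^#z"), quiet = TRUE)',
  '# -- data',
  '#y 3.4 4 # this is first line of y data',
  '#y 3 3',
  '#y 6 6',
  '#z 3 5 4 6 7',
  '#z 8'
), script)

env <- source_with_path(script)
print(env$y)
print(env$z)
unlink(script)
```

The price is that the file must be run through the helper rather than through a bare source() call, which is a step away from the one-file condition.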