Mark Connolly
2009-Oct-29 18:45 UTC
[R] What is the best way to efficiently construct a data frame from multiple source files?
I have an arbitrary number of spreadsheets that I want to consolidate into a single data frame. The spreadsheets all have the same structure: location, depth1Reading, depth2reading, depth3reading, depth4reading, depth5reading The spreadsheets have their reading date in their file name. This gets parsed out and added to the data frame as a factor. The file name gets recorded in the data frame as a reference factor. The depth readings are put into a normal form in the data frame. The target data frame structure is str(df) 'data.frame': 23100 obs. of 5 variables: $ measurement.date: Factor w/ 77 levels "2005/01/07","2003/01/08",..: 1 1 1 1 1 1 1 1 1 1 ... $ source : Factor w/ 77 levels "TDRREADINGS010705.xls",..: 1 1 1 1 1 1 1 1 1 1 ... $ location : int 1 1 1 1 1 2 2 2 2 2 ... $ position : num 1 2 3 4 5 1 2 3 4 5 ... $ theta.percent : num 24.8 23.5 30.7 26.6 NA 20.7 28.2 24.3 20.6 10 ... I am successfully using the following (nested) looping-and-rbinding method, but it is very slow. I tried allocating the whole data frame and replacing rows, but ran into issues with new factors. I would like to know if there is a general R approach for efficiently doing this sort of data frame construction (assuming R is generally considered appropriate for data cleanup and restructuring). code (works but slow): require("gdata") # for reading spreadsheet tdrs <- readLines(pipe("ls TDR*.xls")) na.strings=c("n", " n", "n ", " ", "jn", "N", "bent", "bent pin", "skip")) # ugly but not important # allocate empty data frame df <- data.frame(measurement.date=character(0), source=character(0), location=numeric(0), position=numeric(0), theta.percent=numeric(0)) # iterate over spreadsheets for (i in 1:length(tdrs)) { source <- tdrs[i] # slightly optimistic tdr <- read.xls(source, na.strings=na.strings) # standardize column names names(tdr)<-c("probe", "X1", "X2", "X3", "X4", "X5") # create a date that is nicely sortable measurement.date <- paste(paste(20,substr(source,16,17),sep=""), # year substr(source,12,13), # month substr(source,14,15), sep="/") # day for (j in 1:nrow(tdr)) { # iterate over each spreadsheet row tdrrow <- tdr[j,] location <- tdrrow$probe for (pos in 2:6) { # normalize the readings if (is.na(tdrrow[,pos])) { theta.percent<-NA } else theta.percent<-as.numeric(tdrrow[,pos]) position <- pos - 1 df <- rbind(df, data.frame(measurement.date=measurement.date, source=source, location=location, position=position, theta.percent=theta.percent)) } } }