Good morning,

I currently have 63 .csv files, most of which have lines that look like

  01/06/05,23445

though some files have two numbers beside each date. There are missing
values, and currently the longest file has 318 rows.

(merge() is losing the head and doing runaway memory allocation - but
that's another question - I'm still trying to pin that issue down and
make a small reproducible example.)

Currently I'm reading in these files with lines like

  a1 <- read.csv("daft_file_name_1.csv", header=FALSE)
  ...
  a63 <- read.csv("another_silly_filename_63.csv", header=FALSE)

and then I'm naming the columns in these like

  names(a1)[2] <- "silly column name"
  ...
  names(a63)[2] <- "daft column name"

then trying to merge()

  atot <- merge(a1, a2, all=TRUE)

and then using language manipulation to loop

  atot <- merge(atot, a3, all=TRUE)
  ...
  atot <- merge(atot, a63, all=TRUE)

followed by more language manipulation to remove the temporaries, i.e.

  for (i in 2:63) {
    atot <- merge(atot, eval(parse(text=paste("a", i, sep=""))), all=TRUE)
    # eval(parse(text=paste("a", i, "[1] <- NULL", sep="")))

    cat("i is ", i, gc(), "\n")

    # now delete these 63 temporary objects...
    # e.g. should look like rm(a33)
    eval(parse(text=paste("rm(a", i, ")", sep="")))
  }

eventually getting a data frame whose first column is the date and whose
subsequent 63 columns are the data, with missing values coded as NA.

So my question is: is there a better strategy for reading in lots of
small files (only a few kbytes each) like these - time series with
missing data - which avoids the above awkwardness (and language
manipulation) but still ends up with a nice data.frame with NA values
correctly coded?

Many thanks,
Sean O'Riordain
Hi,

if you use a list to collect (append) all your data.frames from
read.csv, you don't have to compute variable names like a1...a63; you
can just iterate over the contents of the list. Using the function
dir() you can read all the files in a directory in a loop.

Michael
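For concreteness, a minimal sketch of that list-based approach
(untested; the file pattern and the naming of each value column after
its file are assumptions, since the actual filenames weren't posted):

  ## Collect all .csv files in the working directory.
  files <- dir(pattern = "\\.csv$")

  ## Read each file into one element of a list - no a1...a63 needed.
  dat <- lapply(files, read.csv, header = FALSE)

  ## Name each data.frame's value column after its file, for example.
  for (i in seq(along = dat)) {
    names(dat[[i]])[2] <- sub("\\.csv$", "", files[i])
  }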
This is what I would try:

  csvlist <- list.files(pattern = "csv$")
  ## ... stands for any additional read.csv arguments, e.g. header=FALSE.
  bigblob <- lapply(csvlist, read.csv, ...)
  ## Get all dates that appear in any one of the files.
  all.dates <- unique(unlist(lapply(bigblob, "[[", 1)))
  bigdata <- matrix(NA, length(all.dates), length(bigblob))
  dimnames(bigdata) <- list(all.dates, whatevercolnamesyouwant)
  ## Loop through bigblob and populate the corresponding column
  ## of bigdata at the matching dates.
  for (i in seq(along = bigblob)) {
    bigdata[as.character(bigblob[[i]][, 1]), i] <-
      bigblob[[i]][, columnwithdata]
  }

(whatevercolnamesyouwant and columnwithdata are placeholders for your
column names and for whichever column holds the data in each file.)

This is obviously untested, so I hope it's of some help.

Andy
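If the goal is the data.frame Sean described (date in the first column,
63 data columns, NA for missing), the populated matrix converts
readily; a minimal sketch, assuming the dates are dd/mm/yy as in the
example line "01/06/05,23445":

  ## Turn the matrix into a data.frame with the date as column 1.
  atot <- data.frame(date = as.Date(rownames(bigdata), format = "%d/%m/%y"),
                     bigdata, row.names = NULL)

  ## Optionally sort by date.
  atot <- atot[order(atot$date), ]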
Assuming:

  my.files <- c("file1.csv", "file2.csv", ..., "filen.csv")

use read.zoo in the zoo package together with merge.zoo (which can do a
multiway merge):

  library(zoo)
  do.call("merge", lapply(my.files, read.zoo, ...any.other.read.zoo.args...))

After loading zoo see:

  vignette("zoo")
  ?read.zoo
  ?merge.zoo
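A concrete version of that call for files shaped like the sample line
(a minimal sketch; the sep, header, and format arguments are
assumptions based on "01/06/05,23445" - read.zoo passes the first two
through to read.table and uses format to parse the date index):

  library(zoo)

  my.files <- list.files(pattern = "\\.csv$")

  ## Read each file as a zoo series indexed by its date column, then
  ## let merge.zoo align all 63 series on the union of dates,
  ## inserting NA wherever a series has no value for a date.
  z <- do.call("merge",
               lapply(my.files, read.zoo,
                      sep = ",", header = FALSE, format = "%d/%m/%y"))

  ## Back to a data.frame with the date as the first column.
  atot <- data.frame(date = index(z), coredata(z))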
If you can show me equivalent Python code in as few lines that performs
much faster, I'd very much appreciate it. I had been trying to find an
"excuse" to learn Python, but so far I have found what I can do in R
quite adequate. Also, it's much easier to keep track of the work flow
when everything is done in one place (R, in my case).

Andy

From: Steve Miller
>
> Why torture yourself and probably get bad performance in the process?
> You should handle the data consolidation in Python or Ruby, which are
> much more suited to this type of task, piping the results to R.
>
> Steve Miller