I am using readLines to read a fairly large ASCII file. readLines reads a fixed number of lines, then other R code processes the data, then readLines reads the same number of lines again, then other R code processes the data, and so on.

Sort of like:

conn <- file('filename', 'r')
for (chunk in 1:100000) {
    Lines <- readLines(conn, n = 25)
    # process "Lines"
}

The code is working, but I notice that it slows down greatly as time progresses. It took 2 seconds to read my first chunk of data, 4 seconds to read the next chunk, 10 after that. The quasi-exponential trend has slowed, thank goodness, but after about a hundred reads, the read time for the next chunk is over a minute. Let me stress that the number of lines read in each chunk of data is absolutely fixed.

The only processing I am doing at this point is to parse the new data and rbind the results to an existing data frame. Processing of new data in no way depends on earlier data.

So, my question is: why is the reading taking longer as time goes on? Is there a way to fix this? Is there a better method than readLines?

Thanks.
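P.S. For concreteness, the processing step is essentially the following (simplified and untested; read.table(text = ...) is just a stand-in for my actual parsing code):

conn <- file('filename', 'r')
results <- data.frame()
for (chunk in 1:100000) {
    Lines <- readLines(conn, n = 25)
    newRows <- read.table(text = Lines)   # stand-in for my real parser
    results <- rbind(results, newRows)    # append to the running data frame
}
close(conn)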
More on my previous question ... I have put in timing statements to try to get a better idea of where the problem is, like so:

conn <- file('filename', 'r')
for (chunk in 1:100000) {
    print(paste('begin read at', date()))
    Lines <- readLines(conn, n = 25)
    print(paste('begin processing at', date()))
    # process "Lines"
    print(paste('end loop at', date()))
}

Every time I go through the loop, all the date() calls return *exactly* the same time! It *looks like* it runs through each iteration very quickly and then takes longer and longer to simply start the next iteration. I don't believe this; I think R must be doing some kind of latency trick or something.

But, anyway, the point is that I was assuming the problem was in the I/O, and now I don't know whether it's the I/O or the processing. Either way, I don't understand it and would really appreciate some wisdom from you guys.

Thanks.
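P.S. If it would help, I can replace the date() calls with something that has sub-second resolution and force each line of output to appear immediately. A rough, untested sketch using Sys.time() and flush.console():

conn <- file('filename', 'r')
for (chunk in 1:100000) {
    t0 <- Sys.time()
    Lines <- readLines(conn, n = 25)
    t1 <- Sys.time()
    # process "Lines"
    t2 <- Sys.time()
    cat('chunk', chunk,
        'read:', round(as.numeric(t1 - t0, units = 'secs'), 3), 'sec,',
        'process:', round(as.numeric(t2 - t1, units = 'secs'), 3), 'sec\n')
    flush.console()   # show the timing line right away, in case the console buffers output
}
close(conn)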
On 03.10.2011 19:19, Cable, Sam B Civ USAF AFMC AFRL/RVBXI wrote:
> I am using readLines to read a fairly large ASCII file. readLines reads
> a fixed number of lines, then other R code processes the data, then
> readLines reads the same number of lines again, and so on.
> [...]
> The only processing I am doing at this point is to parse the new data
> and rbind the results to an existing data frame. Processing of new data
> in no way depends on earlier data.

And that may be the interesting point. Have you tried to allocate the whole data.frame up front and assign into it later? It is probably not readLines() slowing you down. A minute seems to be quite a lot for reasonably sized data. How many columns are we talking about?

Uwe Ligges
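P.S. A rough, untested sketch of what I mean by allocating the whole data.frame up front and assigning into it (the two numeric columns, the fixed 25 rows per chunk, and read.table() as the parser are only placeholders for your real structure):

rowsPerChunk <- 25
nChunks <- 100000
n <- rowsPerChunk * nChunks

# allocate the full data.frame once, then fill rows in place
result <- data.frame(x = numeric(n), y = numeric(n))

conn <- file('filename', 'r')
for (chunk in 1:nChunks) {
    Lines <- readLines(conn, n = rowsPerChunk)
    newRows <- read.table(text = Lines)   # your parsing goes here
    rows <- ((chunk - 1) * rowsPerChunk + 1):(chunk * rowsPerChunk)
    result[rows, ] <- newRows
}
close(conn)

Filling plain pre-allocated vectors, or collecting the per-chunk results in a list and calling do.call(rbind, ...) once at the end, should be faster still than indexing into the data.frame.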