Colleagues,

Using R 2.7.0 in OS X, I am having trouble understanding the command textConnection. My situation is as follows:

1. I am trying to read a lengthy file (45000 lines) that has headers roughly every 1000 lines. read.table (and its variants) fails because of the recurrent headers.
2. My present approach is the following:
   a. use readLines to read the file, save as an array
   b. use grep to find the recurrent headers (not including the first set)
   c. delete the recurrent headers from the array
   d. write the array to a temp file
   e. read the temp file using read.table
   f. delete the temp file
3. My understanding is that textConnection might enable me to replace steps d-f with a single step akin to read.table(textConnection(array)). This appears to work, but it is very slow. I executed code on successively larger chunks of the array:

    for (Each in 1000 * 1:45)
    {
        cat("N lines =", Each, "\t", date(), "\n")
        A <- read.table(textConnection(Z[1:Each]), header=T)
    }

yielding:

    N lines = 1000     Sun Oct 12 07:09:48 2008
    N lines = 2000     Sun Oct 12 07:09:48 2008
    N lines = 3000     Sun Oct 12 07:09:48 2008
    N lines = 4000     Sun Oct 12 07:09:50 2008
    N lines = 5000     Sun Oct 12 07:09:52 2008
    N lines = 6000     Sun Oct 12 07:09:56 2008
    N lines = 7000     Sun Oct 12 07:10:01 2008
    N lines = 8000     Sun Oct 12 07:10:09 2008
    N lines = 9000     Sun Oct 12 07:10:18 2008
    N lines = 10000    Sun Oct 12 07:10:31 2008
    N lines = 11000    Sun Oct 12 07:10:46 2008
    N lines = 12000    Sun Oct 12 07:11:04 2008
    N lines = 13000    Sun Oct 12 07:11:25 2008
    N lines = 14000    Sun Oct 12 07:11:51 2008
    N lines = 15000    Sun Oct 12 07:12:20 2008
    N lines = 16000    Sun Oct 12 07:12:54 2008
    N lines = 17000    Sun Oct 12 07:13:32 2008
    N lines = 18000    Sun Oct 12 07:14:16 2008
    N lines = 19000    Sun Oct 12 07:15:04 2008
    N lines = 20000    Sun Oct 12 07:15:58 2008
    N lines = 21000    Sun Oct 12 07:16:58 2008
    N lines = 22000    Sun Oct 12 07:18:04 2008
    N lines = 23000    Sun Oct 12 07:19:17 2008
    N lines = 24000    Sun Oct 12 07:20:36 2008
    N lines = 25000    Sun Oct 12 07:22:02 2008
    N lines = 26000    Sun Oct 12 07:23:36 2008

Any clever ideas will be greatly appreciated.

Dennis

Dennis Fisher MD
P < (The "P Less Than" Company)
www.PLessThan.com
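For reference, steps a-f above might look roughly like the following. This is only a sketch under stated assumptions: the sample lines, the header pattern "^id value$", and the temporary input file are hypothetical stand-ins, not the poster's actual data.

```r
# Hypothetical sample data: a header line that recurs inside the file.
raw <- c("id value", "1 10", "2 20", "id value", "3 30", "4 40")
infile <- tempfile()
writeLines(raw, infile)

Z   <- readLines(infile)                   # (a) read all lines into a character vector
hdr <- grep("^id value$", Z)               # (b) locate every header line
Z   <- Z[setdiff(seq_along(Z), hdr[-1])]   # (c) drop all headers except the first
tmp <- tempfile()
writeLines(Z, tmp)                         # (d) write the cleaned lines to a temp file
A   <- read.table(tmp, header = TRUE)      # (e) read the temp file
unlink(c(infile, tmp))                     # (f) delete the temp files
```

Using setdiff() rather than a negative index means the vector is left intact when there are no repeated headers to remove.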
Dennis Fisher wrote:
> [...]
> 3. My understanding is to textConnection might enable me to replace
> steps d-f with a single step akin to
> read.table(textConnection(array)). This appears to work but it is
> very slow. I executed code on successively larger chunks of the array:
> for (Each in 1000 * 1:45)
> {
>     cat("N lines =", Each, "\t", date(), "\n")
>     A <- read.table(textConnection(Z[1:Each]), header=T)
> }
> yielding:
> N lines = 1000 Sun Oct 12 07:09:48 2008
> [...]
> N lines = 25000 Sun Oct 12 07:22:02 2008
> N lines = 26000 Sun Oct 12 07:23:36 2008
>
> Any clever ideas will be greatly appreciated.

So you are taking about 1.5 minutes to read a 26000-line part of the file? It's a bit hard to tell whether that is a lot or a little if you don't tell us what those lines contain...

If you're exceeding the amount of available RAM, that could be causing problems. You're not closing the earlier connections though, so

    A <- read.table(con <- textConnection(Z[1:Each]), header=T)
    close(con)

might help. Also notice that the usual tricks for speeding up read.table() still apply (use colClasses, e.g.).

-- 
Peter Dalgaard
Dept. of Biostatistics, University of Copenhagen
Øster Farimagsgade 5, Entr.B, PO Box 2099, 1014 Cph. K, Denmark
(p.dalgaard at biostat.ku.dk)
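The two suggestions in this reply (closing the connection and supplying colClasses) can be combined as in the sketch below. The vector Z, its column names, and the column types are hypothetical stand-ins, since the thread does not say what the real lines contain; nrows is an extra hint that is not mentioned in the reply.

```r
# Hypothetical stand-in for the poster's vector Z: one header plus 2000 data lines.
Z <- c("id value", paste(1:2000, (1:2000) * 0.5))

con <- textConnection(Z)
A <- tryCatch(
  read.table(con, header = TRUE,
             colClasses = c("integer", "numeric"),  # declare types up front (assumed types)
             nrows = length(Z) - 1),                # number of data rows, excluding the header
  finally = close(con)                              # release the connection even on error
)
```

Wrapping the call in tryCatch(..., finally = close(con)) is one way to make sure a loop like the one in the original post does not leave a growing number of open connections behind.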
Try one of these:

    Lines <- readLines("myfile.dat")
    Lines <- Lines[-grep("whatever", Lines)]
    DF <- read.table(textConnection(Lines), ...other.args...)

or

    # use findstr /v instead of grep -v if you are on Windows
    DF <- read.table(pipe("grep -v whatever myfile.dat"), ...other.args...)

On Sun, Oct 12, 2008 at 11:13 AM, Dennis Fisher <fisher at plessthan.com> wrote:
> Colleagues,
>
> Using R2.7.0 in OS X, I am having trouble understanding the command
> textConnection. My situation is as follows:
> 1. I am trying to read a lengthy file (45000 lines) that has headers
> ~ every 1000 lines. read.table (or its variants) fail because of the
> recurrent headers.
> [...]
> Any clever ideas will be greatly appreciated.
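One caveat with the negative-index form: if grep() finds no match it returns integer(0), and Lines[-integer(0)] selects nothing, so every line is lost. The sketch below uses a logical mask from grepl() instead and also keeps the first header so that header = TRUE still works. The sample lines and the pattern "^x y$" are hypothetical stand-ins for the real file and for "whatever".

```r
# Hypothetical sample lines with a header that repeats mid-file.
Lines <- c("x y", "1 2", "3 4", "x y", "5 6")

# Pitfall: with no match, the negative index drops every line.
no_match <- Lines[-grep("zzz", Lines)]

# Logical mask: TRUE for every header line.
is_hdr <- grepl("^x y$", Lines)
# Keep non-header lines plus the first header only.
keep <- !is_hdr | (is_hdr & cumsum(is_hdr) == 1)

con <- textConnection(Lines[keep])
DF <- read.table(con, header = TRUE)
close(con)
```

Because the mask is logical, a file with no repeated headers passes through unchanged.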