I'm wondering if I need to use a function other than sapply as the following line of code runs indefinitely (or > 30 min so far) and uses up all 16Gb of memory on my machine for what seems like a very small dataset (data attached in a txt file, wells.txt <http://r.789695.n4.nabble.com/file/n4656723/wells.txt>). The R code is:

    wells<-read.table("c:/temp/wells.txt",col.names=c("name","plc_hldr"))
    wells2<-wells[sapply(wells[,1],function(x)length(strsplit(as.character(x), "_")[[1]])==2),]

The 2nd line of R code above gets bogged down and takes all my RAM with it: <http://r.789695.n4.nabble.com/file/n4656723/memory_loss.png>

I'm simply trying to extract all of the lines of data that have a single "_" in the first column and place them into a dataset called "wells2". If that were to work, I then want to extract the lines of data that have two "_" and put them into a separate dataset, say "wells3". Is there a better way to do this than the one-liner above?

-Eric

--
View this message in context: http://r.789695.n4.nabble.com/a-function-more-appropriate-than-sapply-tp4656723.html
Sent from the R help mailing list archive at Nabble.com.
Hi,

Maybe this helps:

    wells <- read.table("wells.txt", header=FALSE, stringsAsFactors=FALSE)
    wells2 <- wells[-grep(".*\\_.*\\_", wells[,1]), ]
    head(wells2)
    #     V1 V2
    #1  w7_1  0
    #2 w11_1  0
    #3 w12_1  0
    #4 w13_1  0
    #5 w14_1  0
    #6 w15_1  0

    wellsNew <- wells[grep(".*\\_.*\\_", wells[,1]), ]
    head(wellsNew)
    #            V1 V2
    #851 99_10_4395  0
    #852 99_10_4396  0
    #853 99_10_4400  0
    #854 99_10_4403  0
    #855 99_10_4404  0
    #856 99_10_4606  0

    nrow(wells)
    #[1] 46366
    nrow(wells2)
    #[1] 38080
    nrow(wellsNew)
    #[1] 8286
    38080+8286
    #[1] 46366

A.K.

----- Original Message -----
From: emorway <emorway at usgs.gov>
To: r-help at r-project.org
Cc:
Sent: Saturday, January 26, 2013 1:43 PM
Subject: [R] a function more appropriate than 'sapply'?
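A note on the approach above: the pattern matches names with at least two underscores, and the negative index keeps everything else (names with one underscore, but also names with none). The following is only a sketch of a stricter variant that counts underscores directly; the small data frame is a made-up stand-in for wells.txt, and the name n_us is an assumption, not something from the thread.

```r
# Hypothetical stand-in for wells.txt (the real file has 46366 rows)
wells <- data.frame(
  name     = c("w7_1", "w11_1", "99_10_4395", "99_10_4396", "plain"),
  plc_hldr = 0,
  stringsAsFactors = FALSE
)

# Count the underscores in every name in one vectorised pass
n_us <- lengths(regmatches(wells$name,
                           gregexpr("_", wells$name, fixed = TRUE)))

# Logical indexing stays correct even when nothing matches, whereas
# wells[-grep(...), ] returns zero rows if grep() finds no match
wells2 <- wells[n_us == 1, ]   # exactly one underscore
wells3 <- wells[n_us == 2, ]   # exactly two underscores
```

Because n_us is an ordinary integer vector, split(wells, n_us) would give one data frame per underscore count if more groups are needed.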
On 26-01-2013, at 19:43, emorway <emorway at usgs.gov> wrote:

> I'm simply trying to extract all of the lines of data that have a single "_"
> in the first column and place them into a dataset called "wells2". If that
> were to work, I then want to extract the lines of data that have two "_" and
> put them into a separate dataset, say "wells3". Is there a better way to do
> this than the one-liner above?

Read your file with

    wells <- read.table("wells.txt", col.names=c("name","plc_hldr"), stringsAsFactors=FALSE)

Remove all non-underscores with

    w.sub <- gsub("[^_]+", "", wells[,1])

then select the elements of w.sub with two underscores and with a single underscore:

    u.2 <- which(w.sub=="__")
    u.1 <- which(w.sub=="_")

and use u.1 and u.2 to select the appropriate rows of wells.

I tried to select rows containing 1 or 2 underscores with grep regular expressions, but that appeared to be more difficult than I had expected. The method above is quick.

Berend
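For what it is worth, the gsub() steps above can be tried on a few made-up names, alongside one possible way to write the grep-style selection that proved awkward. This is a sketch under assumptions: the sample vector and the names one_us and two_us are hypothetical, and the anchored patterns are just one way to express "exactly one" and "exactly two" underscores.

```r
# Hypothetical sample names; the real ones come from column 1 of wells.txt
name <- c("w7_1", "w11_1", "99_10_4395", "no-underscore", "a_b_c_d")

# Approach from the reply above: delete everything that is not an underscore,
# then compare what is left
w.sub <- gsub("[^_]+", "", name)
u.1 <- which(w.sub == "_")     # positions with a single underscore
u.2 <- which(w.sub == "__")    # positions with two underscores

# Alternative: anchor the pattern and allow only non-underscore
# characters before, between and after the underscores
one_us <- grepl("^[^_]*_[^_]*$", name)
two_us <- grepl("^[^_]*_[^_]*_[^_]*$", name)
```

On this sample both routes pick the same elements. Either index can then be used as wells[u.1, ] or wells[one_us, ]. Reading the column as character (stringsAsFactors=FALSE) is probably part of why these versions are fast: the original sapply() call looped over a factor, which is a likely cause of the memory use, though that is not confirmed in this thread.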