Dimitri Liakhovitski
2013-Jun-08 01:24 UTC
[R] splitting a string column into multiple columns faster
Hello!

I have a column in my data frame that I have to split: I have to extract the numbers from the text. Below is my example and my solution.

x<-data.frame(x=c("aaa1_bbb1_ccc3","aaa2_bbb3_ccc2","aaa3_bbb2_ccc1"))
x
library(stringr)
out<-as.data.frame(str_split_fixed(x$x,"aaa",2))
out2<-as.data.frame(str_split_fixed(out$V2,"_bbb",2))
out3<-as.data.frame(str_split_fixed(out2$V2,"_ccc",2))
result<-cbind(x,out2[1],out3)
result

My problem is that str_split_fixed is relatively slow. My real data frame has over 80,000 rows, so it takes almost 30 seconds to run just one line (like out<-... above). It is even slower overall because I have to repeat the split step by step, once per delimiter.

Is there a way to specify all 3 delimiters at once ("aaa", "_bbb", "_ccc") and split the column in one swoop into a data frame with several columns?

Thanks a lot for any pointers!

--
Dimitri Liakhovitski
Hi,

Maybe this helps:

res<-data.frame(x=x,read.table(text=gsub("[A-Za-z]","",x[,1]),sep="_",header=FALSE),stringsAsFactors=FALSE)
res
#               x V1 V2 V3
#1 aaa1_bbb1_ccc3  1  1  3
#2 aaa2_bbb3_ccc2  2  3  2
#3 aaa3_bbb2_ccc1  3  2  1

A.K.
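For the "all 3 delimiters at once" part of the question, a minimal base R sketch is to pass the delimiters to strsplit() as one regular expression with alternation. This assumes every row contains all three delimiters in the same order and that the extracted values are whole numbers. The column names aaa, bbb and ccc are chosen here only for illustration, and the approach has not been timed on the 80,000-row data.

```r
# Toy data from the original post; keep the column as character, not factor
x <- data.frame(x = c("aaa1_bbb1_ccc3", "aaa2_bbb3_ccc2", "aaa3_bbb2_ccc1"),
                stringsAsFactors = FALSE)

# Split on any of the three delimiters in a single pass.
# Because each string starts with "aaa", the first piece of every row is "".
parts <- strsplit(as.character(x$x), "aaa|_bbb|_ccc")

# Stack the pieces into a matrix and drop the empty leading column
m <- do.call(rbind, parts)[, -1, drop = FALSE]

# Convert the character matrix to integer (dimensions are preserved)
storage.mode(m) <- "integer"
colnames(m) <- c("aaa", "bbb", "ccc")  # illustrative names, not from the post

result <- cbind(x, as.data.frame(m))
result
```

If some rows might lack a delimiter, the pieces would have unequal lengths and the rbind step would recycle values (with a warning), so checking lengths(parts) first would be prudent.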