Thank you all for the help!

LMH, yes, I would like to see the alternative. I am using this for a large data set, and if the alternative is more efficient than this I would be happy.

On Tue, Sep 22, 2020 at 6:25 PM Bert Gunter <bgunter.4567 at gmail.com> wrote:
>
> To be clear, I think Rui's solution is perfectly fine and probably better than what I offer below. But just for fun, I wanted to do it without the lapply(). Here is one way. I think my comments suffice to explain.
>
> > ## which are the non "_" indices?
> > wh <- grep("_", F1$text, fixed = TRUE, invert = TRUE)
> > ## paste "_." to these
> > F1[wh, "text"] <- paste(F1[wh, "text"], ".", sep = "_")
> > ## Now strsplit() and unlist() them to get a vector
> > z <- unlist(strsplit(F1$text, "_"))
> > ## now cbind() to the data frame
> > F1 <- cbind(F1, matrix(z, ncol = 2, byrow = TRUE))
> > F1
>   ID1 ID2   text    1  2
> 1  A1  B1 NONE_. NONE  .
> 2  A1  B1  cf_12   cf 12
> 3  A1  B1 NONE_. NONE  .
> 4  A2  B2  X2_25   X2 25
> 5  A2  B3  fd_15   fd 15
> > ## You can change the names of the 2 columns yourself
>
> Cheers,
> Bert
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Tue, Sep 22, 2020 at 12:19 PM Rui Barradas <ruipbarradas at sapo.pt> wrote:
>>
>> Hello,
>>
>> A base R solution with strsplit, like in your code.
>>
>> F1$Y1 <- +grepl("_", F1$text)
>>
>> tmp <- strsplit(as.character(F1$text), "_")
>> tmp <- lapply(tmp, function(x) if(length(x) == 1) c(x, ".") else x)
>> tmp <- do.call(rbind, tmp)
>> colnames(tmp) <- c("X1", "X2")
>> F1 <- cbind(F1[-3], tmp) # remove the original column
>> rm(tmp)
>>
>> F1
>> #  ID1 ID2 Y1   X1 X2
>> #1  A1  B1  0 NONE  .
>> #2  A1  B1  1   cf 12
>> #3  A1  B1  0 NONE  .
>> #4  A2  B2  1   X2 25
>> #5  A2  B3  1   fd 15
>>
>>
>> Note that cbind dispatches on F1, an object of class "data.frame".
>> Therefore it's the method cbind.data.frame that is called and the result
>> is also a df, though tmp is a "matrix".
>>
>>
>> Hope this helps,
>>
>> Rui Barradas
>>
>>
>> At 20:07 on 22/09/20, Rui Barradas wrote:
>> > Hello,
>> >
>> > Something like this?
>> >
>> >
>> > F1$Y1 <- +grepl("_", F1$text)
>> > F1 <- F1[c(1, 2, 4, 3)]
>> > F1 <- tidyr::separate(F1, text, into = c("X1", "X2"), sep = "_",
>> >                       fill = "right")
>> > F1
>> >
>> >
>> > Hope this helps,
>> >
>> > Rui Barradas
>> >
>> > At 19:55 on 22/09/20, Val wrote:
>> >> Hi all,
>> >>
>> >> I am trying to create new columns based on the string content of
>> >> another column. First I want to identify rows that contain a
>> >> particular string. If a row contains it, I want to split the string
>> >> and create two variables.
>> >>
>> >> Here is my sample of data.
>> >> F1 <- read.table(text="ID1 ID2 text
>> >> A1 B1 NONE
>> >> A1 B1 cf_12
>> >> A1 B1 NONE
>> >> A2 B2 X2_25
>> >> A2 B3 fd_15 ", header=TRUE, stringsAsFactors=FALSE)
>> >>
>> >> If the variable "text" contains "_", I want to create an indicator
>> >> variable as shown below
>> >>
>> >> F1$Y1 <- ifelse(grepl("_", F1$text), 1, 0)
>> >>
>> >>
>> >> Then I want to split that string into two, before "_" and after "_",
>> >> and create two variables as shown below
>> >> x1 <- strsplit(as.character(F1$text), "_", fixed = TRUE)
>> >>
>> >> My problem is how to combine this with the original data frame. The
>> >> desired output is shown below,
>> >>
>> >>
>> >> ID1 ID2 Y1 X1   X2
>> >> A1  B1  0  NONE .
>> >> A1  B1  1  cf   12
>> >> A1  B1  0  NONE .
>> >> A2  B2  1  X2   25
>> >> A2  B3  1  fd   15
>> >>
>> >> Any help?
>> >> Thank you.
>> >>
>> >> ______________________________________________
>> >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide
>> >> http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
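For comparison with the strsplit()-based answers quoted above, here is a minimal sketch that builds the desired ID1/ID2/Y1/X1/X2 layout with grepl() and sub() only, so the "text" column is never modified in place. It is not taken from any of the posts; the names has_sep and out are made up for illustration, and it assumes that anything after the first "_" belongs in X2.

```r
# Sample data from the original question
F1 <- read.table(text = "ID1 ID2 text
A1 B1 NONE
A1 B1 cf_12
A1 B1 NONE
A2 B2 X2_25
A2 B3 fd_15", header = TRUE, stringsAsFactors = FALSE)

# TRUE where the text column contains a literal underscore
has_sep <- grepl("_", F1$text, fixed = TRUE)

out <- data.frame(
  F1[c("ID1", "ID2")],
  Y1 = as.integer(has_sep),                                  # indicator column
  X1 = sub("_.*$", "", F1$text),                             # part before the first "_"
  X2 = ifelse(has_sep, sub("^[^_]*_", "", F1$text), "."),    # part after it, or "."
  stringsAsFactors = FALSE
)
out
```

Because every step is vectorized over the whole column, there is no per-row function call, which is the same idea that motivates avoiding lapply() in the thread.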
Oh, if efficiency is a consideration, then my code is about 15 times as fast as Rui's:

> F2 <- F1[rep(1:5, 1e6), ]  ## 5 million rows

## Rui's
> system.time({
+   F2$Y1 <- +grepl("_", F2$text)
+   tmp <- strsplit(as.character(F2$text), "_")
+   tmp <- lapply(tmp, function(x) if(length(x) == 1) c(x, ".") else x)
+   tmp <- do.call(rbind, tmp)
+   colnames(tmp) <- c("X1", "X2")
+   F2 <- cbind(F2[-3], tmp) # remove the original column
+ })
   user  system elapsed
 20.072   0.625  20.786

## my version
> system.time({
+   wh <- grep("_", F2$text, fixed = TRUE, invert = TRUE)
+   F2[wh, "text"] <- paste(F2[wh, "text"], ".", sep = "_")
+   z <- unlist(strsplit(F1$text, "_"))  ## sic: F1$text (5 rows), not F2$text
+   F2 <- cbind(F2, matrix(z, ncol = 2, byrow = TRUE))
+   F2
+ })
   user  system elapsed
  1.256   0.019   1.281

Cheers,
Bert

On Tue, Sep 22, 2020 at 5:04 PM Val <valkremk at gmail.com> wrote:
> Thank you all for the help!
>
> LMH, Yes I would like to see the alternative. I am using this for a
> large data set and if the alternative is more efficient than this
> then I would be happy.
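One caveat on the timing above: as posted, the second block calls strsplit() on F1$text (5 rows) rather than F2$text, so the 1.281 s figure does not include splitting the 5 million strings. The following is a rough sketch of a like-for-like comparison, not a rerun of the posted benchmark. The object names (res_lapply, res_paste, t_lapply, t_paste) are invented here, and the data is kept at 100,000 rows so it runs quickly; raise the repeat count toward 1e6 to approach the size used in the post. No timings are quoted because they depend on the machine.

```r
# Sample data from the original question
F1 <- read.table(text = "ID1 ID2 text
A1 B1 NONE
A1 B1 cf_12
A1 B1 NONE
A2 B2 X2_25
A2 B3 fd_15", header = TRUE, stringsAsFactors = FALSE)

# 100,000 rows; both approaches below read the same F2$text
F2 <- F1[rep(seq_len(nrow(F1)), 2e4), ]

# Approach 1: pad short pieces inside lapply(), then rbind
t_lapply <- system.time({
  tmp <- strsplit(F2$text, "_", fixed = TRUE)
  tmp <- lapply(tmp, function(x) if (length(x) == 1) c(x, ".") else x)
  res_lapply <- do.call(rbind, tmp)
})

# Approach 2: append "_." to strings without "_", then split once
t_paste <- system.time({
  txt <- F2$text
  wh <- grep("_", txt, fixed = TRUE, invert = TRUE)
  txt[wh] <- paste(txt[wh], ".", sep = "_")
  res_paste <- matrix(unlist(strsplit(txt, "_", fixed = TRUE)),
                      ncol = 2, byrow = TRUE)
})

# Both approaches should produce the same two-column character matrix
stopifnot(identical(unname(res_lapply), res_paste))

# Elapsed seconds for each approach
c(lapply = t_lapply[["elapsed"]], paste = t_paste[["elapsed"]])
```

Checking that the two results are identical before comparing times guards against timing code that silently does less work.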
What is the delimiter in the input data? Is it tab, space, etc.? Will the output data that you use as R input have the same delimiter?

LMH

Val wrote:
> Thank you all for the help!
>
> LMH, Yes I would like to see the alternative. I am using this for a
> large data set and if the alternative is more efficient than this
> then I would be happy.
Below is a script in bash that uses the awk tokenizer to do the work.

This assumes that your input and output delimiter is a space. The number of consecutive delimiters in the input is not important. This also assumes that the input file does not have a header row. That is easy to modify if you want. I always keep header rows in my data files as I think that removing them is asking for trouble down the road.

I added a NULL for cases where there is no value for the last field. You could use "." if you want.

You should be able to find how to run this from inside R if you want. You will, of course, need a bash environment to run this, so if you are not on Linux you will need Cygwin or something similar.

This should be very fast, but let me know if it needs to be faster. If the X1_X2 variant occurs less frequently than not, then we should switch the order in which the logic evaluates the options.

LMH


#!/bin/bash

# input filename
input_file=$1

# output filename
output_file=$2

# make sure the input file exists
if [ ! -f "$input_file" ]; then
   echo "$input_file cannot be found"
   exit 1
fi

# create the output file
touch "$output_file"

# make sure the output was created
if [ ! -f "$output_file" ]; then
   echo "$output_file was not created"
   exit 1
fi

# write the header row
echo "ID1 ID2 Y1 X1 X2" >> "$output_file"

# character to find in the third token
look_for='_'

# process with awk
#   if the 3rd token contains '_'
#      split the third token on '_' into F[1] and F[2]
#      print the first two tokens, the indicator value of 1, and the split fields F[1] and F[2]
#   otherwise,
#      print the first two tokens, the indicator value of 0, the 3rd token, and NULL

awk -v find_char="$look_for" '{
   if ($3 ~ find_char) {
      split($3, F, find_char)
      print $1, $2, "1", F[1], F[2]
   }
   else {
      print $1, $2, "0", $3, "NULL"
   }
}' "$input_file" >> "$output_file"


Val wrote:
> Thank you all for the help!
>
> LMH, Yes I would like to see the alternative. I am using this for a
> large data set and if the alternative is more efficient than this
> then I would be happy.
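On running the awk step from inside R: a minimal sketch using system2(), assuming awk is on the PATH and R is running on a Unix-like system (or under Cygwin or similar). It inlines the same awk logic as the script instead of calling the script file, and the names infile, prog, out_lines and res are made up for illustration.

```r
# Hypothetical sample input with no header row, as the script assumes
infile <- tempfile(fileext = ".txt")
writeLines(c("A1 B1 NONE", "A1 B1 cf_12", "A1 B1 NONE",
             "A2 B2 X2_25", "A2 B3 fd_15"), infile)

# Same logic as the awk block in the script: split the 3rd token on "_"
prog <- '{ if ($3 ~ /_/) { split($3, F, "_"); print $1, $2, 1, F[1], F[2] } else { print $1, $2, 0, $3, "NULL" } }'

# system2() passes its arguments through a shell, so quote them with shQuote();
# stdout = TRUE returns the output lines as a character vector
out_lines <- system2("awk", c(shQuote(prog), shQuote(infile)), stdout = TRUE)

# Parse the space-delimited output back into a data frame
res <- read.table(text = out_lines,
                  col.names = c("ID1", "ID2", "Y1", "X1", "X2"),
                  stringsAsFactors = FALSE)
res
```

For a large file it may be simpler to let the script write its output file and then read that file with read.table(), since capturing millions of lines through stdout keeps them all in memory as strings first.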
Thank you again for your help and for giving me the opportunity to choose the most efficient method. For a small data set there is no discernible difference between the approaches. I will carry out a comparison using the large data set.

On Wed, Sep 23, 2020 at 11:52 AM LMH <lmh_users-groups at molconn.com> wrote:
>
> Below is a script in bash the uses the awk tokenizer to do the work.