Hi,
In case you want to treat "BC" and "CB" as the same pair:
dat1<- read.table(text="
Seq,Output
A B B C D A C,Yes
B C A C B D A C,Yes
C D A A C D,No
",sep=",",header=TRUE,stringsAsFactors=FALSE)
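library(stringr)
lapply(str_split(str_trim(dat1$Seq)," ")[dat1$Output=="Yes"],function(x) {
  x1<-t(combn(x,2))   # every pair of positions, one pair per row
  # sort the two letters within each pair so that "CB" is counted as "BC"
  x2<-sapply(strsplit(apply(x1,1,paste0,collapse=""),""),function(x) paste(x[order(x)],collapse=""))
  table(x2)})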
#[[1]]
#x2
#AA AB AC AD BB BC BD CC CD
# 1  4  4  2  1  4  2  1  2
#
#[[2]]
#x2
#AA AB AC AD BB BC BD CC CD
# 1  4  6  2  1  6  2  3  3
dat1$MaxCombn<- NA
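# lapply (rather than sapply) keeps a list even when every row has a single maximum
res1<-lapply(str_split(str_trim(dat1$Seq)," ")[dat1$Output=="Yes"],function(x) {
  x1<-t(combn(x,2))
  x2<-sapply(strsplit(apply(x1,1,paste0,collapse=""),""),function(x) paste(x[order(x)],collapse=""))
  x3<-table(x2)
  x3[x3%in% max(x3)]})   # keep every pair tied for the maximum count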
dat1$MaxCombn[dat1$Output=="Yes"]<-lapply(res1,names)
dat1
#                Seq Output   MaxCombn
#1     A B B C D A C    Yes AB, AC, BC
#2   B C A C B D A C    Yes     AC, BC
#3       C D A A C D     No         NA
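The counts above are per row. If what you want is a single answer for the whole column when Output is "Yes", the pairs from all of those rows can be pooled before tabulating. This is only a sketch in base R under that assumption; sorted_pairs, yes_pairs and pooled are names I made up for illustration.

```r
# Self-contained sketch in base R (stringr is not needed here).
dat1 <- data.frame(
  Seq = c("A B B C D A C", "B C A C B D A C", "C D A A C D"),
  Output = c("Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all letter pairs in one sequence, with the two
# letters sorted so that "CB" and "BC" are counted together.
sorted_pairs <- function(s) {
  x <- strsplit(trimws(s), " ", fixed = TRUE)[[1]]
  combn(x, 2, FUN = function(p) paste(sort(p), collapse = ""))
}

# Pool the pairs from every "Yes" row and tabulate them once.
yes_pairs <- unlist(lapply(dat1$Seq[dat1$Output == "Yes"], sorted_pairs))
pooled <- table(yes_pairs)

# Keep every pair tied for the highest pooled count.
pooled[pooled == max(pooled)]
```

For the three example rows this should return AC and BC, each with a count of 10.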
A.K.
----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: R help <r-help at r-project.org>
Cc:
Sent: Friday, April 12, 2013 4:37 PM
Subject: Re: Search for common character strings within a column
Hi,
Maybe this helps.
I am not sure how you wanted to select those two letters, so this pairs each letter with every later letter in the sequence:
dat1<- read.table(text="
Seq,Output
A B B C D A C,Yes
B C A C B D A C,Yes
C D A A C D,No
",sep=",",header=TRUE,stringsAsFactors=FALSE)
library(stringr)
lapply(str_split(str_trim(dat1$Seq)," ")[dat1$Output=="Yes"],function(x) {
  x1<-t(combn(x,2))   # every pair of positions, one pair per row
  apply(x1,1,paste0,collapse="")})
#[[1]]
# [1] "AB" "AB" "AC" "AD" "AA" "AC" "BB" "BC" "BD" "BA" "BC" "BC" "BD" "BA" "BC"
#[16] "CD" "CA" "CC" "DA" "DC" "AC"
#
#[[2]]
# [1] "BC" "BA" "BC" "BB" "BD" "BA" "BC" "CA" "CC" "CB" "CD" "CA" "CC" "AC" "AB"
#[16] "AD" "AA" "AC" "CB" "CD" "CA" "CC" "BD" "BA" "BC" "DA" "DC" "AC"
res<- sapply(str_split(str_trim(dat1$Seq)," ")[dat1$Output=="Yes"],function(x) {
  x1<-t(combn(x,2))
  x2<-table(apply(x1,1,paste0,collapse=""))
  x2[which.max(x2)]})   # which.max keeps only the first pair if several tie
res
#BC BC
# 4  4
dat1$MaxCombn<-NA
dat1$MaxCombn[dat1$Output=="Yes"]<- names(res)
dat1
#                Seq Output MaxCombn
#1     A B B C D A C    Yes       BC
#2   B C A C B D A C    Yes       BC
#3       C D A A C D     No     <NA>
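If instead "2 letter strings" means letters that sit next to each other in the sequence (my assumption; the question does not say), combn() is not the right tool, because it pairs every letter with every later one. A minimal sketch of the adjacent-pair reading, in base R, with adjacent_pairs and adj_counts as made-up names:

```r
# Self-contained sketch in base R.
dat1 <- data.frame(
  Seq = c("A B B C D A C", "B C A C B D A C", "C D A A C D"),
  Output = c("Yes", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: two-letter strings formed by neighbouring letters only.
adjacent_pairs <- function(s) {
  x <- strsplit(trimws(s), " ", fixed = TRUE)[[1]]
  paste0(head(x, -1), tail(x, -1))   # letter i pasted to letter i+1
}

# Pool the adjacent pairs from the "Yes" rows and count them.
adj <- lapply(dat1$Seq[dat1$Output == "Yes"], adjacent_pairs)
adj_counts <- table(unlist(adj))

# Keep every pair tied for the highest count.
adj_counts[adj_counts == max(adj_counts)]
```

Here order is kept, so "AC" and "CA" are counted separately. Sort the two letters first, as in the other version, if they should count as one.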
A.K.
>I have a dataset (data) that consists of two columns: Seq and output.
>Each entry in Seq is a combination of As, Bs, Cs and Ds and ranges from 5-30
>characters in length. Each sequence is associated with an output of
>either yes or no such that:
>
>      Seq               Output
>(1) A B B C D A C       Yes
>(2) B C A C B D A C     Yes
>(3) C D A A C D         No
>
>etc, etc.
>
>I want to find which 2 letter (A B, A C, A D, etc) strings are
>most associated with each output. Essentially I want to find which 2
>letter combinations occur most frequently in the column Seq, when the
>output is Yes. I'm new to R and can't figure out a solution to this
>problem.
>
>Any help greatly appreciated!
>
>Cheers,
>
>AB