Hi,

It is not just 79 triplets. As I said, there are 79 codes. I am making
triplets out of those 79 codes and matching the triplets in the list.

Please find the dput of the data below.

> dput(head(newd,10))
structure(list(uniq_id = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10"), hi = c("11, 22, 84, 85, 108, 111",
"18, 84, 85, 87, 122, 134", "2, 18, 22", "18, 108, 122, 134, 176",
"19, 85, 87, 100, 107", "79, 85, 111", "11, 88, 108", "19, 88, 96",
"19, 85, 96", "19, 100, 103")), .Names = c("uniq_id", "hi"),
row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))

I am trying to count the frequency of the triplets in the above data using
the code below.

# split column into a list
myList <- strsplit(newd$hi, split=",")
# get all triplet combinations
myCombos <- t(combn(unique(unlist(myList)), 3))
# count the records in which the whole triplet is present
myCounts <- sapply(1:nrow(myCombos), FUN=function(i) {
  sum(sapply(myList, function(j) {
    sum(!is.na(match(c(myCombos[i,]), j)))})==3)})
#final matrix
final <- cbind(matrix(as.integer(myCombos), nrow(myCombos)), myCounts)

I hope I made my point clear. Please let me know if I missed anything.

Regards,
Sri

On Wed, Jul 27, 2016 at 11:19 PM, Sarah Goslee <sarah.goslee at gmail.com> wrote:

> You said you had 79 triplets and 8000 records.
>
> When I compared 100 triplets to 10000 records it took 86 seconds.
>
> So obviously there is something you're not telling us about the format
> of your data.
>
> If you use dput() to provide actual examples, you will get better
> results than if we on Rhelp have to guess. Because we tend to guess in
> ways that make the most sense after extensive R experience, and that's
> probably not what you have.
>
> Sarah
>
> On Wed, Jul 27, 2016 at 1:29 PM, sri vathsan <srivibish at gmail.com> wrote:
> > Hi,
> >
> > Thanks for the solution. But I am afraid that after running this code
> > it still takes a long time. It has been an hour and it is still
> > executing. I understand the delay, because each triplet has to be
> > compared against almost 9000 elements.
> >
> > Regards,
> > Sri
> >
> > On Wed, Jul 27, 2016 at 9:02 PM, Sarah Goslee <sarah.goslee at gmail.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> It's really a good idea to use dput() or some other reproducible way
> >> to provide data. I had to guess as to what your data looked like.
> >>
> >> It appears that order doesn't matter?
> >>
> >> Given that, here's one approach:
> >>
> >> combs <- structure(list(V1 = c(65L, 77L, 55L, 23L, 34L), V2 = c(23L, 34L,
> >> 34L, 77L, 65L), V3 = c(77L, 65L, 23L, 34L, 55L)), .Names = c("V1",
> >> "V2", "V3"), class = "data.frame", row.names = c(NA, -5L))
> >>
> >> dat <- list(
> >> c(77,65,34,23,55),
> >> c(65,23,77,65,55,34),
> >> c(77,34,65),
> >> c(55,78,56),
> >> c(98,23,77,65,34))
> >>
> >>
> >> sapply(seq_len(nrow(combs)), function(i)sum(sapply(dat,
> >> function(j)all(combs[i,] %in% j))))
> >>
> >> On a dataset of comparable size to yours, it takes me under a minute
> >> and a half.
> >>
> >> > combs <- combs[rep(1:nrow(combs), length=100), ]
> >> > dat <- dat[rep(1:length(dat), length=10000)]
> >> >
> >> > dim(combs)
> >> [1] 100   3
> >> > length(dat)
> >> [1] 10000
> >> >
> >> > system.time(test <- sapply(seq_len(nrow(combs)),
> >> > function(i)sum(sapply(dat, function(j)all(combs[i,] %in% j)))))
> >>    user  system elapsed
> >>  86.380   0.006  86.391
> >>
> >>
> >> On Wed, Jul 27, 2016 at 10:47 AM, sri vathsan <srivibish at gmail.com> wrote:
> >> > Hi,
> >> >
> >> > Apologies for the limited information.
> >> >
> >> > Basically, myCombos is a matrix with 3 columns; each row is a triplet
> >> > that is a combination of the 79 codes. There are around 3 lakh such
> >> > combinations, and it looks like below.
> >> >
> >> > V1 V2 V3
> >> > 65 23 77
> >> > 77 34 65
> >> > 55 34 23
> >> > 23 77 34
> >> > 34 65 55
> >> >
> >> > Each triplet will be compared against a list (mylist) having 8177
> >> > elements, which looks like below.
> >> >
> >> > 77,65,34,23,55
> >> > 65,23,77,65,55,34
> >> > 77,34,65
> >> > 55,78,56
> >> > 98,23,77,65,34
> >> >
> >> > Now I want to count the number of occurrences of each triplet in the
> >> > above list. I.e., the triplet 65 23 77 is seen 3 times in the list.
> >> > So my output looks like below.
> >> >
> >> > V1 V2 V3 Freq
> >> > 65 23 77 3
> >> > 77 34 65 4
> >> > 55 34 23 2
> >> >
> >> > I hope I made it clear this time.
> >> >
> >> >
> >> > On Wed, Jul 27, 2016 at 7:00 PM, Bert Gunter <bgunter.4567 at gmail.com>
> >> > wrote:
> >> >
> >> >> Not entirely sure I understand, but match() is already vectorized, so
> >> >> you should be able to lose the sapply(). This would speed things up a
> >> >> lot. Please re-read ?match *carefully* .
> >> >>
> >> >> Bert
> >> >>
> >> >> On Jul 27, 2016 6:15 AM, "sri vathsan" <srivibish at gmail.com> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> I created a list of 3-number combinations (mycombos, around 3 lakh
> >> >> combinations) and am counting the occurrences of those combinations in
> >> >> another list. This comparison list (mylist) has around 8000 records.
> >> >> I am using the following code.
> >> >>
> >> >> myCounts <- sapply(1:nrow(myCombos), FUN=function(i) {
> >> >> sum(sapply(myList, function(j) {
> >> >> sum(!is.na(match(c(myCombos[i,]), j)))})==3)})
> >> >>
> >> >> The above code takes a very long time to execute. Is there any other
> >> >> efficient method which will reduce the time?
> >> >> --
> >> >>
> >> >> Regards,
> >> >> Srivathsan.K

--
Regards,
Srivathsan.K
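One thing that may be worth checking in the posted code, offered as a guess rather than a diagnosis: `strsplit(newd$hi, split=",")` keeps the space that follows each comma, so "18" and " 18" end up as different codes in `unique(unlist(myList))`. The sketch below uses only the first three `hi` strings from the dput() output; the names `raw_split`, `clean_split`, `raw_codes` and `clean_codes` are made up for illustration. It also prints how many unordered triplets 79 distinct codes can form.

```r
# First three rows of the `hi` column from the dput() output in the message above
hi <- c("11, 22, 84, 85, 108, 111",
        "18, 84, 85, 87, 122, 134",
        "2, 18, 22")

# Splitting on "," alone keeps the space that follows each comma
raw_split <- strsplit(hi, split = ",")

# Splitting on a comma plus any following whitespace gives clean codes
clean_split <- strsplit(hi, split = ",\\s*")

raw_codes   <- unique(unlist(raw_split))
clean_codes <- unique(unlist(clean_split))

length(raw_codes)    # 12: "18" and " 18" are treated as different codes
length(clean_codes)  # 11

# Number of unordered triplets that 79 distinct codes can form
choose(79, 3)        # 79079
```

If the real data behaves like this sample, the set of "unique" codes is inflated, and the number of triplets fed to `combn()` grows with it. Whether that accounts for the roughly 3 lakh combinations mentioned earlier in the thread is only a possibility; it would need checking against the full data.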
Correction to my code. I created a "doc" variable because I was thinking of
doing something faster, but I never made the change. grep needs to work on
the original source "dat" to be used for counting.

Fixed:

combs = structure(list(V1 = c(65L, 77L, 55L, 23L, 34L), V2 = c(23L, 34L,
34L, 77L, 65L), V3 = c(77L, 65L, 23L, 34L, 55L)), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -5L))

dat = list(
c(77,65,34,23,55, 65,23,77, 44),
c(65,23,77,65,55,34, 77, 34,65, 10),
c(77,34,65),
c(55,78,56),
c(98,23,77,65,34, 65, 23, 77, 34))

# one search string per triplet, e.g. "65 23 77"
words = unlist(apply(combs, 1 , function(d) paste(as.character(d),
collapse=" ")))
# one space-separated string per record
dat = lapply(dat, function(d) paste( as.character(d), collapse= " "))
#doc = paste(dat, collapse = " ## ") # just some arbitrary separator character that isn't in your words
# number of records that contain each triplet as a contiguous run, in the given order
counts = sapply(words, function(w) length(grep(w, dat)))
names(counts) = words
counts
cbind(combs, data.frame(N = counts))

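If order does not matter (as asked earlier in the thread), another way to look at the problem is to go record by record instead of triplet by triplet: list the triplets each record actually contains, then tabulate them once. This is only a sketch of that idea, not anyone's posted solution. It uses the five-record `dat` example from earlier in the thread, and the helper names `record_triplets` and `count_triplet` are made up for illustration.

```r
# Five example records from earlier in the thread
dat <- list(
  c(77, 65, 34, 23, 55),
  c(65, 23, 77, 65, 55, 34),
  c(77, 34, 65),
  c(55, 78, 56),
  c(98, 23, 77, 65, 34))

# For one record: drop repeats, sort, and list every triplet it contains
# as a key such as "23-65-77"
record_triplets <- function(codes) {
  codes <- sort(unique(codes))
  if (length(codes) < 3) return(character(0))
  combn(codes, 3, FUN = paste, collapse = "-")
}

# Tabulate the keys over all records; a record counts a given triplet once
tab <- table(unlist(lapply(dat, record_triplets)))
triplet_counts <- setNames(as.integer(tab), names(tab))

# Look up a triplet regardless of the order it is written in
count_triplet <- function(triplet) {
  key <- paste(sort(triplet), collapse = "-")
  if (key %in% names(triplet_counts)) triplet_counts[[key]] else 0L
}

count_triplet(c(65, 23, 77))  # 3
count_triplet(c(77, 34, 65))  # 4
count_triplet(c(55, 34, 23))  # 2
```

The work here scales with the number of triplets inside each record rather than with every possible triplet times every record, which may help when records are short. Triplets that never occur simply get a count of 0.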
On Wed, Jul 27, 2016 at 11:27 AM, sri vathsan <srivibish at gmail.com> wrote:
> Hi,
>
> It is not just 79 triplets. As I said, there are 79 codes. I am making
> triplets out of those 79 codes and matching the triplets in the list.
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
Hi,

Thanks for the response. Unfortunately this did not solve my problem; maybe
the way I represented my data is the problem. I am not sure whether I can
give a link to the data, which would give a clearer representation. If that
is not a proper way, I will have to follow my original method.

Regards,
Sri

On Thu, Jul 28, 2016 at 12:56 AM, jeremiah rounds <roundsjeremiah at gmail.com> wrote:

> Correction to my code. I created a "doc" variable because I was thinking
> of doing something faster, but I never did the change. grep needed to work
> on the original source "dat" to be used for counting.