Hi,

It is not just 79 triplets. As I said, there are 79 codes. I am making
triplets out of those 79 codes and matching the triplets in the list.

Please find the dput of the data below.

> dput(head(newd,10))
structure(list(uniq_id = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10"), hi = c("11, 22, 84, 85, 108, 111", "18, 84, 85, 87, 122, 134",
"2, 18, 22", "18, 108, 122, 134, 176", "19, 85, 87, 100, 107",
"79, 85, 111", "11, 88, 108", "19, 88, 96", "19, 85, 96",
"19, 100, 103")), .Names = c("uniq_id", "hi"), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
>

I am trying to count the frequency of the triplets in the above data using
the code below.

# split column into a list; the pattern also consumes the space after each
# comma, so that " 22" and "22" are not treated as different codes
myList <- strsplit(newd$hi, split=",\\s*")
# get all triplet combinations
myCombos <- t(combn(unique(unlist(myList)), 3))
# count the records in which all three codes of the triplet are present
myCounts <- sapply(1:nrow(myCombos), FUN=function(i) {
  sum(sapply(myList, function(j) {
    sum(!is.na(match(c(myCombos[i,]), j)))})==3)})
# final matrix
final <- cbind(matrix(as.integer(myCombos), nrow(myCombos)), myCounts)

I hope I made my point clear. Please let me know if I missed anything.

Regards,
Sri

On Wed, Jul 27, 2016 at 11:19 PM, Sarah Goslee <sarah.goslee at gmail.com> wrote:
> You said you had 79 triplets and 8000 records.
>
> When I compared 100 triplets to 10000 records it took 86 seconds.
>
> So obviously there is something you're not telling us about the format
> of your data.
>
> If you use dput() to provide actual examples, you will get better
> results than if we on Rhelp have to guess. Because we tend to guess in
> ways that make the most sense after extensive R experience, and that's
> probably not what you have.
>
> Sarah
>
> On Wed, Jul 27, 2016 at 1:29 PM, sri vathsan <srivibish at gmail.com> wrote:
> > Hi,
> >
> > Thanks for the solution. But I am afraid that even with this code
> > it still takes a long time. It has been an hour and it is still
> > executing. I understand the delay because each triplet has to be
> > compared against almost 9000 elements.
> >
> > Regards,
> > Sri
> >
> > On Wed, Jul 27, 2016 at 9:02 PM, Sarah Goslee <sarah.goslee at gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> It's really a good idea to use dput() or some other reproducible way
> >> to provide data. I had to guess as to what your data looked like.
> >>
> >> It appears that order doesn't matter?
> >>
> >> Given that, here's one approach:
> >>
> >> combs <- structure(list(V1 = c(65L, 77L, 55L, 23L, 34L), V2 = c(23L, 34L,
> >> 34L, 77L, 65L), V3 = c(77L, 65L, 23L, 34L, 55L)), .Names = c("V1",
> >> "V2", "V3"), class = "data.frame", row.names = c(NA, -5L))
> >>
> >> dat <- list(
> >> c(77,65,34,23,55),
> >> c(65,23,77,65,55,34),
> >> c(77,34,65),
> >> c(55,78,56),
> >> c(98,23,77,65,34))
> >>
> >> sapply(seq_len(nrow(combs)), function(i)sum(sapply(dat,
> >> function(j)all(combs[i,] %in% j))))
> >>
> >> On a dataset of comparable size to yours, it takes me under a minute
> >> and a half.
> >>
> >> > combs <- combs[rep(1:nrow(combs), length=100), ]
> >> > dat <- dat[rep(1:length(dat), length=10000)]
> >> >
> >> > dim(combs)
> >> [1] 100 3
> >> > length(dat)
> >> [1] 10000
> >> >
> >> > system.time(test <- sapply(seq_len(nrow(combs)),
> >> > function(i)sum(sapply(dat, function(j)all(combs[i,] %in% j)))))
> >>    user  system elapsed
> >>  86.380   0.006  86.391
> >>
> >> On Wed, Jul 27, 2016 at 10:47 AM, sri vathsan <srivibish at gmail.com> wrote:
> >> > Hi,
> >> >
> >> > Apologies for the lack of information.
> >> >
> >> > Basically, myCombos is a matrix with 3 columns, where each row is a
> >> > triplet built from the 79 codes. There are around 3 lakh such
> >> > combinations, and it looks like below.
> >> >
> >> > V1 V2 V3
> >> > 65 23 77
> >> > 77 34 65
> >> > 55 34 23
> >> > 23 77 34
> >> > 34 65 55
> >> >
> >> > Each triplet is compared against a list (mylist) having 8177 elements,
> >> > which looks like below.
> >> >
> >> > 77,65,34,23,55
> >> > 65,23,77,65,55,34
> >> > 77,34,65
> >> > 55,78,56
> >> > 98,23,77,65,34
> >> >
> >> > Now I want to count the number of occurrences of each triplet in the
> >> > above list. I.e., the triplet 65 23 77 is seen 3 times in the list.
> >> > So my output looks like below.
> >> >
> >> > V1 V2 V3 Freq
> >> > 65 23 77 3
> >> > 77 34 65 4
> >> > 55 34 23 2
> >> >
> >> > I hope I made it clear this time.
> >> >
> >> > On Wed, Jul 27, 2016 at 7:00 PM, Bert Gunter <bgunter.4567 at gmail.com>
> >> > wrote:
> >> >
> >> >> Not entirely sure I understand, but match() is already vectorized, so
> >> >> you should be able to lose the sapply(). This would speed things up a
> >> >> lot. Please re-read ?match *carefully* .
> >> >>
> >> >> Bert
> >> >>
> >> >> On Jul 27, 2016 6:15 AM, "sri vathsan" <srivibish at gmail.com> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> I created a list of 3-number combinations (mycombos, around 3 lakh
> >> >> combinations) and am counting the occurrences of those combinations
> >> >> in another list. This comparison list (mylist) has around 8000
> >> >> records. I am using the following code.
> >> >>
> >> >> myCounts <- sapply(1:nrow(myCombos), FUN=function(i) {
> >> >> sum(sapply(myList, function(j) {
> >> >> sum(!is.na(match(c(myCombos[i,]), j)))})==3)})
> >> >>
> >> >> The above code takes a very long time to execute. Is there any other
> >> >> efficient method which will reduce the time?
> >> >> --
> >> >>
> >> >> Regards,
> >> >> Srivathsan.K

--
Regards,
Srivathsan.K
Phone : 9600165206
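A possible way around the slow double loop, offered as a sketch rather than as anything proposed in the thread: since each record holds only a handful of codes, one could enumerate the triplets inside each record and tabulate them, instead of testing every possible triplet against every record. The helper name triplet_keys and the "a-b-c" key format are made up for illustration; the data is the ten-row sample from the dput() output, and order within a triplet is assumed not to matter.

```r
# Ten-record sample taken from the dput() output in the message above.
hi <- c("11, 22, 84, 85, 108, 111", "18, 84, 85, 87, 122, 134",
        "2, 18, 22", "18, 108, 122, 134, 176", "19, 85, 87, 100, 107",
        "79, 85, 111", "11, 88, 108", "19, 88, 96", "19, 85, 96",
        "19, 100, 103")

# Split on a comma plus optional spaces, convert to integer, and sort,
# so that a given set of three codes always produces the same key.
recs <- lapply(strsplit(hi, ",\\s*"),
               function(x) sort(unique(as.integer(x))))

# Hypothetical helper: all triplets present in one record, as "a-b-c" keys.
triplet_keys <- function(codes) {
  if (length(codes) < 3) return(character(0))
  combn(codes, 3, FUN = paste, collapse = "-")
}

# Tabulate the keys over all records.
counts <- table(unlist(lapply(recs, triplet_keys)))
head(sort(counts, decreasing = TRUE))
```

Triplets that never occur are simply absent from the table, so their count is zero. To look up one specific triplet, sort its codes and build the same key, for example counts["18-122-134"].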
Correction to my code. I created a "doc" variable because I was thinking
of doing something faster, but I never made the change. grep needed to work
on the original source "dat" to be used for counting.

Fixed:

combs = structure(list(V1 = c(65L, 77L, 55L, 23L, 34L), V2 = c(23L, 34L,
34L, 77L, 65L), V3 = c(77L, 65L, 23L, 34L, 55L)), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -5L))

dat = list(
c(77,65,34,23,55, 65,23,77, 44),
c(65,23,77,65,55,34, 77, 34,65, 10),
c(77,34,65),
c(55,78,56),
c(98,23,77,65,34, 65, 23, 77, 34))

words = unlist(apply(combs, 1 , function(d) paste(as.character(d), collapse=" ")))
dat = lapply(dat, function(d) paste( as.character(d), collapse= " "))
#doc = paste(dat, collapse = " ## ") # just some arbitrary separator character that isn't in your words
counts = sapply(words, function(w) length(grep(w, dat)))
names(counts) = words
counts
cbind(combs, data.frame(N = counts))

On Wed, Jul 27, 2016 at 11:27 AM, sri vathsan <srivibish at gmail.com> wrote:
> Hi,
>
> It is not a just 79 triplets. As I said, there are 79 codes. I am making
> triplets out of that 79 codes and matching the triplets in the list.
>
> [...]
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
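One caveat with counting via grep, shown as a small sketch on a subset of the toy data: a pattern such as "77 34 65" matches only when the three codes sit next to each other in exactly that order, whereas the all(... %in% ...) test ignores order and position. A plain substring pattern can also match inside longer codes. The variable names below (triplet, docs, set_count, grep_count, partial_hit) are illustrative only.

```r
dat <- list(c(77, 65, 34, 23, 55),
            c(65, 23, 77, 65, 55, 34),
            c(77, 34, 65))
triplet <- c(77, 34, 65)

# Order-independent count: every code of the triplet occurs in the record.
set_count <- sum(vapply(dat, function(rec) all(triplet %in% rec), logical(1)))

# Substring count: the codes must be adjacent and in this exact order.
docs <- vapply(dat, paste, character(1), collapse = " ")
grep_count <- length(grep(paste(triplet, collapse = " "), docs, fixed = TRUE))

# A substring can also hit inside longer codes: "5 23 7" is found in "65 23 77".
partial_hit <- grepl("5 23 7", "65 23 77", fixed = TRUE)

set_count    # 3: all three records contain 77, 34 and 65
grep_count   # 1: only the third record has them adjacent and in order
partial_hit  # TRUE
```

Which count is wanted depends on whether order matters; the frequencies in the original question (3, 4, 2) correspond to the order-independent version.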
Hi,

Thanks for the response. Unfortunately this did not solve my problem, and
maybe the way I represented my data is the problem. I am not sure whether I
can give a link to the data, which would give a clear representation. If
that is not a proper way, I will have to follow my original method.

Regards,
Sri

On Thu, Jul 28, 2016 at 12:56 AM, jeremiah rounds <roundsjeremiah at gmail.com> wrote:
> Correction to my code. I created a "doc" variable because I was thinking
> of doing something faster, but I never did the change. grep needed to work
> on the original source "dat" to be used for counting.
>
> [...]

--
Regards,
Srivathsan.K
Phone : 9600165206
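If only a given set of triplets needs counting, another option (again only a sketch, not something proposed in the thread) is to turn the records into a logical incidence matrix once and then count each triplet with vectorised column operations. The names codes, inc, cm and idx are illustrative, the data is the toy example from earlier in the thread, and the sketch assumes every code in combs occurs in at least one record.

```r
combs <- data.frame(V1 = c(65L, 77L, 55L, 23L, 34L),
                    V2 = c(23L, 34L, 34L, 77L, 65L),
                    V3 = c(77L, 65L, 23L, 34L, 55L))
dat <- list(c(77, 65, 34, 23, 55), c(65, 23, 77, 65, 55, 34),
            c(77, 34, 65), c(55, 78, 56), c(98, 23, 77, 65, 34))

# Logical incidence matrix: one row per record, one column per code.
codes <- sort(unique(unlist(dat)))
inc <- t(vapply(dat, function(rec) codes %in% rec, logical(length(codes))))

# Column position of each triplet member within `codes`.
cm  <- as.matrix(combs)
idx <- matrix(match(cm, codes), nrow = nrow(cm))

# A record contains a triplet when all three of its columns are TRUE;
# summing down the records gives one count per triplet.
hits <- inc[, idx[, 1], drop = FALSE] &
        inc[, idx[, 2], drop = FALSE] &
        inc[, idx[, 3], drop = FALSE]
combs$Freq <- unname(colSums(hits))
combs
```

The first three counts agree with the 3, 4 and 2 given in the original question. With several hundred thousand triplets the intermediate matrices become large (records times triplets), so in practice idx would probably have to be processed in chunks of rows.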