Hi all, Thanks for the responses. Herve's example is a good small size example of what I wanted.> y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2) > someCoolFunc(-2, y)[1] 3 9> someCoolFunc(15, y)[1] 4 5 8 The requirement is that I want someCoolFunc() to run in O(number of matches) time, instead of O(size of y). This is because y is big. And I don't know all the queries I want to do up-front. And the results of some queries might change the queries I want to do in the future. @David: I hope the above description is more clear. @Enrico, Herve: I want both the functionality provided by one function. - On repeated calls, fmatch() does give O(1) performance, but it does not give all matches. - findMatches() gives all matches, but I need to know the entire vector x beforehand. I don't have that luxury. I do have something that works now, using split and fmatch (package fastmatch). So just posting that in case anyone in the future has the same problem.> y.unique <- unique(y) > > # create a map from the unique elements of y to the locations of all occurrences of the element > y.map <- split(1:length(y), match(y, y.unique)) > > # write a wrapper function that does a look-up on the unique list. and then returns all matches using the map. > someCoolFunc <- function(x) { y.map[[ fmatch(x, y.unique) ]] }On Tue, 7 Apr 2015 at 13:21 Herv? Pag?s <hpages at fredhutch.org> wrote:> > Hi Keshav, > > findMatches() in the S4Vectors/IRanges packages (Bioconductor) I think > does what you want: > > library(IRanges) > y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L) > x <- c(unique(y), 999L) > hits <- findMatches(x, y) > > Then: > > > hits > Hits object with 9 hits and 0 metadata columns: > queryHits subjectHits > <integer> <integer> > [1] 1 1 > [2] 2 2 > [3] 3 3 > [4] 3 9 > [5] 4 4 > [6] 4 5 > [7] 4 8 > [8] 5 6 > [9] 6 7 > ------- > queryLength: 7 > subjectLength: 9 > > The Hits object can be turned into a list with: > > > as.list(hits) > [[1]] > [1] 1 > > [[2]] > [1] 2 > > [[3]] > [1] 3 9 > > [[4]] > [1] 4 5 8 > > [[5]] > [1] 6 > > [[6]] > [1] 7 > > [[7]] > integer(0) > > H. > > > sessionInfo() > R version 3.2.0 beta (2015-04-05 r68151) > Platform: x86_64-unknown-linux-gnu (64-bit) > Running under: Ubuntu 14.04.2 LTS > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats4 stats graphics grDevices utils datasets > [8] methods base > > other attached packages: > [1] IRanges_2.1.43 S4Vectors_0.5.22 BiocGenerics_0.13.11 > > loaded via a namespace (and not attached): > [1] tools_3.2.0 > > On 04/06/2015 01:56 PM, Keshav Dhandhania wrote: > > Hi, > > > > I know that one can find all occurrences of x in a vector v by doing > >> which(x == v). > > > > However, if I need to do this again and again, where v is remaining the > > same, then this is quite inefficient. In my particular case, I need to do > > this millions of times, and length(v) = 100 million. > > > > Does anyone have suggestion on how to go about it? > > I know of a package called fmatch that does the above for the match > > function. But they don't handle multiple matches. > > > > Thanks > > > > [[alternative HTML version deleted]] > > > > ______________________________________________ > > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > > > -- > Herv? Pag?s > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fredhutch.org > Phone: (206) 667-5791 > Fax: (206) 667-1319
You might find the data.table package helpful. It uses an index sorted with a
radix sort and minimizes moving the data around in memory.
---------------------------------------------------------------------------
Jeff Newmiller The ..... ..... Go Live...
DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live
Go...
Live: OO#.. Dead: OO#.. Playing
Research Engineer (Solar/Batteries O.O#. #.O#. with
/Software/Embedded Controllers) .OO#. .OO#. rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
On April 7, 2015 1:50:39 PM PDT, Keshav Dhandhania <kshav.91 at gmail.com>
wrote:>Hi all,
>
>Thanks for the responses.
>Herve's example is a good small size example of what I wanted.
>
>> y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)
>> someCoolFunc(-2, y)
>[1] 3 9
>> someCoolFunc(15, y)
>[1] 4 5 8
>
>The requirement is that I want someCoolFunc() to run in O(number of
>matches) time, instead of O(size of y).
>This is because y is big. And I don't know all the queries I want to
>do up-front. And the results of some queries might change the queries
>I want to do in the future.
>
>@David: I hope the above description is more clear.
>@Enrico, Herve: I want both the functionality provided by one function.
>- On repeated calls, fmatch() does give O(1) performance, but it does
>not give all matches.
>- findMatches() gives all matches, but I need to know the entire
>vector x beforehand. I don't have that luxury.
>
>
>I do have something that works now, using split and fmatch (package
>fastmatch). So just posting that in case anyone in the future has the
>same problem.
>> y.unique <- unique(y)
>>
>> # create a map from the unique elements of y to the locations of all
>occurrences of the element
>> y.map <- split(1:length(y), match(y, y.unique))
>>
>> # write a wrapper function that does a look-up on the unique list.
>and then returns all matches using the map.
>> someCoolFunc <- function(x) { y.map[[ fmatch(x, y.unique) ]] }
>
>
>
>On Tue, 7 Apr 2015 at 13:21 Herv? Pag?s <hpages at fredhutch.org>
wrote:
>>
>> Hi Keshav,
>>
>> findMatches() in the S4Vectors/IRanges packages (Bioconductor) I
>think
>> does what you want:
>>
>> library(IRanges)
>> y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L)
>> x <- c(unique(y), 999L)
>> hits <- findMatches(x, y)
>>
>> Then:
>>
>> > hits
>> Hits object with 9 hits and 0 metadata columns:
>> queryHits subjectHits
>> <integer> <integer>
>> [1] 1 1
>> [2] 2 2
>> [3] 3 3
>> [4] 3 9
>> [5] 4 4
>> [6] 4 5
>> [7] 4 8
>> [8] 5 6
>> [9] 6 7
>> -------
>> queryLength: 7
>> subjectLength: 9
>>
>> The Hits object can be turned into a list with:
>>
>> > as.list(hits)
>> [[1]]
>> [1] 1
>>
>> [[2]]
>> [1] 2
>>
>> [[3]]
>> [1] 3 9
>>
>> [[4]]
>> [1] 4 5 8
>>
>> [[5]]
>> [1] 6
>>
>> [[6]]
>> [1] 7
>>
>> [[7]]
>> integer(0)
>>
>> H.
>>
>> > sessionInfo()
>> R version 3.2.0 beta (2015-04-05 r68151)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> Running under: Ubuntu 14.04.2 LTS
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel stats4 stats graphics grDevices utils
>datasets
>> [8] methods base
>>
>> other attached packages:
>> [1] IRanges_2.1.43 S4Vectors_0.5.22 BiocGenerics_0.13.11
>>
>> loaded via a namespace (and not attached):
>> [1] tools_3.2.0
>>
>> On 04/06/2015 01:56 PM, Keshav Dhandhania wrote:
>> > Hi,
>> >
>> > I know that one can find all occurrences of x in a vector v by
>doing
>> >> which(x == v).
>> >
>> > However, if I need to do this again and again, where v is
remaining
>the
>> > same, then this is quite inefficient. In my particular case, I
need
>to do
>> > this millions of times, and length(v) = 100 million.
>> >
>> > Does anyone have suggestion on how to go about it?
>> > I know of a package called fmatch that does the above for the
match
>> > function. But they don't handle multiple matches.
>> >
>> > Thanks
>> >
>> > [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>> --
>> Herv? Pag?s
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fredhutch.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
>
>______________________________________________
>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
Hi Jeff, Indeed the data.table package does provide a much cleaner way to achieve the same functionality, and a lot of other functionality as bonus. Thanks for letting me know about it. On Tue, 7 Apr 2015 at 15:41 Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:> You might find the data.table package helpful. It uses an index sorted > with a radix sort and minimizes moving the data around in memory. > --------------------------------------------------------------------------- > Jeff Newmiller The ..... ..... Go Live... > DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live > Go... > Live: OO#.. Dead: OO#.. Playing > Research Engineer (Solar/Batteries O.O#. #.O#. with > /Software/Embedded Controllers) .OO#. .OO#. rocks...1k > --------------------------------------------------------------------------- > Sent from my phone. Please excuse my brevity. > > On April 7, 2015 1:50:39 PM PDT, Keshav Dhandhania <kshav.91 at gmail.com> > wrote: > >Hi all, > > > >Thanks for the responses. > >Herve's example is a good small size example of what I wanted. > > > >> y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2) > >> someCoolFunc(-2, y) > >[1] 3 9 > >> someCoolFunc(15, y) > >[1] 4 5 8 > > > >The requirement is that I want someCoolFunc() to run in O(number of > >matches) time, instead of O(size of y). > >This is because y is big. And I don't know all the queries I want to > >do up-front. And the results of some queries might change the queries > >I want to do in the future. > > > >@David: I hope the above description is more clear. > >@Enrico, Herve: I want both the functionality provided by one function. > >- On repeated calls, fmatch() does give O(1) performance, but it does > >not give all matches. > >- findMatches() gives all matches, but I need to know the entire > >vector x beforehand. I don't have that luxury. > > > > > >I do have something that works now, using split and fmatch (package > >fastmatch). So just posting that in case anyone in the future has the > >same problem. > >> y.unique <- unique(y) > >> > >> # create a map from the unique elements of y to the locations of all > >occurrences of the element > >> y.map <- split(1:length(y), match(y, y.unique)) > >> > >> # write a wrapper function that does a look-up on the unique list. > >and then returns all matches using the map. > >> someCoolFunc <- function(x) { y.map[[ fmatch(x, y.unique) ]] } > > > > > > > >On Tue, 7 Apr 2015 at 13:21 Herv? Pag?s <hpages at fredhutch.org> wrote: > >> > >> Hi Keshav, > >> > >> findMatches() in the S4Vectors/IRanges packages (Bioconductor) I > >think > >> does what you want: > >> > >> library(IRanges) > >> y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L) > >> x <- c(unique(y), 999L) > >> hits <- findMatches(x, y) > >> > >> Then: > >> > >> > hits > >> Hits object with 9 hits and 0 metadata columns: > >> queryHits subjectHits > >> <integer> <integer> > >> [1] 1 1 > >> [2] 2 2 > >> [3] 3 3 > >> [4] 3 9 > >> [5] 4 4 > >> [6] 4 5 > >> [7] 4 8 > >> [8] 5 6 > >> [9] 6 7 > >> ------- > >> queryLength: 7 > >> subjectLength: 9 > >> > >> The Hits object can be turned into a list with: > >> > >> > as.list(hits) > >> [[1]] > >> [1] 1 > >> > >> [[2]] > >> [1] 2 > >> > >> [[3]] > >> [1] 3 9 > >> > >> [[4]] > >> [1] 4 5 8 > >> > >> [[5]] > >> [1] 6 > >> > >> [[6]] > >> [1] 7 > >> > >> [[7]] > >> integer(0) > >> > >> H. > >> > >> > sessionInfo() > >> R version 3.2.0 beta (2015-04-05 r68151) > >> Platform: x86_64-unknown-linux-gnu (64-bit) > >> Running under: Ubuntu 14.04.2 LTS > >> > >> locale: > >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > >> [9] LC_ADDRESS=C LC_TELEPHONE=C > >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > >> > >> attached base packages: > >> [1] parallel stats4 stats graphics grDevices utils > >datasets > >> [8] methods base > >> > >> other attached packages: > >> [1] IRanges_2.1.43 S4Vectors_0.5.22 BiocGenerics_0.13.11 > >> > >> loaded via a namespace (and not attached): > >> [1] tools_3.2.0 > >> > >> On 04/06/2015 01:56 PM, Keshav Dhandhania wrote: > >> > Hi, > >> > > >> > I know that one can find all occurrences of x in a vector v by > >doing > >> >> which(x == v). > >> > > >> > However, if I need to do this again and again, where v is remaining > >the > >> > same, then this is quite inefficient. In my particular case, I need > >to do > >> > this millions of times, and length(v) = 100 million. > >> > > >> > Does anyone have suggestion on how to go about it? > >> > I know of a package called fmatch that does the above for the match > >> > function. But they don't handle multiple matches. > >> > > >> > Thanks > >> > > >> > [[alternative HTML version deleted]] > >> > > >> > ______________________________________________ > >> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > >> > https://stat.ethz.ch/mailman/listinfo/r-help > >> > PLEASE do read the posting guide > >http://www.R-project.org/posting-guide.html > >> > and provide commented, minimal, self-contained, reproducible code. > >> > > >> > >> -- > >> Herv? Pag?s > >> > >> Program in Computational Biology > >> Division of Public Health Sciences > >> Fred Hutchinson Cancer Research Center > >> 1100 Fairview Ave. N, M1-B514 > >> P.O. Box 19024 > >> Seattle, WA 98109-1024 > >> > >> E-mail: hpages at fredhutch.org > >> Phone: (206) 667-5791 > >> Fax: (206) 667-1319 > > > >______________________________________________ > >R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > >https://stat.ethz.ch/mailman/listinfo/r-help > >PLEASE do read the posting guide > >http://www.R-project.org/posting-guide.html > >and provide commented, minimal, self-contained, reproducible code. > >[[alternative HTML version deleted]]