thr3ads.net - R help - [R] Fast multiple match function [Apr 2015]

Keshav Dhandhania

2015-Apr-06 20:56 UTC

[R] Fast multiple match function

Hi,

I know that one can find all occurrences of x in a vector v by doing
> which(x == v).
However, if I need to do this again and again while v remains the
same, this is quite inefficient. In my particular case, I need to do
this millions of times, and length(v) = 100 million.

Does anyone have a suggestion on how to go about it?
I know of a package called fastmatch whose fmatch function does the
above for the match function, but it doesn't handle multiple matches.

Thanks

David Winsemius

2015-Apr-06 22:30 UTC

On Apr 6, 2015, at 1:56 PM, Keshav Dhandhania wrote:
> Hi,
> 
> I know that one can find all occurrences of x in a vector v by doing
>> which(x == v).
> 
> However, if I need to do this again and again, where v is remaining the
> same, then this is quite inefficient. In my particular case, I need to do
> this millions of times, and length(v) = 100 million.
> 
> Does anyone have suggestion on how to go about it?
> I know of a package called fmatch that does the above for the match
> function. But they don't handle multiple matches.
> 
You should explain why you need to do it millions of times and you should pose a
small sample problem that presents the level of complexity needed in a minimal
size.
> Thanks
> 
> 	[[alternative HTML version deleted]]
And you should read the Posting Guide where it is strongly advised that you not
post in HTML format. I have used gmail and I do know that it is fairly easy to
post in plain text.

-- 
David Winsemius
Alameda, CA, USA

William Dunlap

2015-Apr-06 22:49 UTC

split() might help, but you should give a more complete
explanation of your problem.
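
As a rough illustration of the split() idea, the sketch below builds a value-to-positions index once and then answers each query from that index. The names `v_index` and `lookup` are hypothetical, the toy vector is borrowed from later in the thread, and integer values are assumed so that converting a query to a name is unambiguous.

```r
# Toy data; v_index and lookup are hypothetical names for this sketch.
v <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L)

# One pass over v: group the positions of v by the value found there.
v_index <- split(seq_along(v), v)

# Later queries index the list by name instead of rescanning v.
# A value that never occurs yields integer(0) rather than NULL.
lookup <- function(x) {
  hit <- v_index[[as.character(x)]]
  if (is.null(hit)) integer(0) else hit
}

lookup(15L)
lookup(-2L)
lookup(999L)
```

Looking up a list element by name still searches the names, so the cost of a query here depends on the number of distinct values rather than on length(v).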

Bill Dunlap
TIBCO Software
wdunlap tibco.com

Enrico Schumann

2015-Apr-07 19:31 UTC

On Mon, 06 Apr 2015, Keshav Dhandhania <kshav.91 at gmail.com> writes:
> Hi,
>
> I know that one can find all occurrences of x in a vector v by doing
>> which(x == v).
>
> However, if I need to do this again and again, where v is remaining the
> same, then this is quite inefficient. In my particular case, I need to do
> this millions of times, and length(v) = 100 million.
>
> Does anyone have suggestion on how to go about it?
> I know of a package called fmatch that does the above for the match
> function. But they don't handle multiple matches.
>
Perhaps 'match(x, v)' is what you want? In which 'x' may be a vector of
length > 1.

In any case, have you actually tried package 'fastmatch'? The function
'fmatch', which that package provides, is very fast for repeated
lookups in a table 'v'.
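
For comparison, a small base-R sketch of how match() differs from the which() idiom in the question: match() reports only the first position of each queried value. The variable names are illustrative only. fmatch() from 'fastmatch' is documented as a drop-in replacement for match(), so the same first-match behaviour should be expected from it.

```r
y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L)

first_only <- match(15L, y)     # position of the first 15 only
all_hits   <- which(y == 15L)   # every position holding 15

first_only
all_hits
```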


-- 
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net

Hervé Pagès

2015-Apr-07 20:21 UTC

Hi Keshav,

findMatches() in the S4Vectors/IRanges packages (Bioconductor) I think
does what you want:

   library(IRanges)
   y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L)
   x <- c(unique(y), 999L)
   hits <- findMatches(x, y)

Then:

   > hits
   Hits object with 9 hits and 0 metadata columns:
         queryHits subjectHits
         <integer>   <integer>
     [1]         1           1
     [2]         2           2
     [3]         3           3
     [4]         3           9
     [5]         4           4
     [6]         4           5
     [7]         4           8
     [8]         5           6
     [9]         6           7
     -------
     queryLength: 7
     subjectLength: 9

The Hits object can be turned into a list with:

   > as.list(hits)
   [[1]]
   [1] 1

   [[2]]
   [1] 2

   [[3]]
   [1] 3 9

   [[4]]
   [1] 4 5 8

   [[5]]
   [1] 6

   [[6]]
   [1] 7

   [[7]]
   integer(0)
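
For readers without Bioconductor, the same list can be approximated in base R. This is only a sketch under the assumption that the query vector x has no duplicated values; `hits_list` is a hypothetical name, and nothing here is part of the IRanges API.

```r
y <- c(16L, -3L, -2L, 15L, 15L, 0L, 8L, 15L, -2L)
x <- c(unique(y), 999L)

# For each element of y, find which query in x it equals, then group the
# positions of y by query. Fixing the factor levels to seq_along(x) keeps
# an empty entry for queries that never occur in y.
groups <- factor(match(y, x), levels = seq_along(x))
hits_list <- unname(split(seq_along(y), groups))

hits_list[[4]]
hits_list[[7]]
```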

H.

 > sessionInfo()
R version 3.2.0 beta (2015-04-05 r68151)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] IRanges_2.1.43       S4Vectors_0.5.22     BiocGenerics_0.13.11

loaded via a namespace (and not attached):
[1] tools_3.2.0

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

Keshav Dhandhania

2015-Apr-07 20:50 UTC

Hi all,

Thanks for the responses.
Herve's example is a good small size example of what I wanted.
> y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)
> someCoolFunc(-2, y)
[1] 3 9
> someCoolFunc(15, y)
[1] 4 5 8

The requirement is that I want someCoolFunc() to run in O(number of
matches) time, instead of O(size of y).
This is because y is big, I don't know all the queries I want to
do up-front, and the results of some queries might change the queries
I want to do in the future.

@David: I hope the above description is clearer.
@Enrico, Hervé: I want the functionality of both in one function.
- On repeated calls, fmatch() does give O(1) performance, but it does
not give all matches.
- findMatches() gives all matches, but I need to know the entire
vector x beforehand. I don't have that luxury.


I do have something that works now, using split and fmatch (package
fastmatch). So just posting that in case anyone in the future has the
same problem.

> library(fastmatch)
> y.unique <- unique(y)
>
> # create a map from the unique elements of y to the locations of all
> # occurrences of the element
> y.map <- split(seq_along(y), match(y, y.unique))
>
> # write a wrapper function that does a look-up on the unique list and then
> # returns all matches using the map
> someCoolFunc <- function(x) { y.map[[ fmatch(x, y.unique) ]] }
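
A self-contained variant of the approach above, offered as a sketch rather than a tested replacement. It wraps the map in a closure and returns integer(0) for values that do not occur in y. `make_lookup` and its `matcher` argument are hypothetical names; base match() is the default so the block runs without extra packages, and passing fastmatch::fmatch instead is an assumption that relies on it being a drop-in replacement for match(). Queries are assumed to be single values.

```r
# make_lookup() and its matcher argument are hypothetical helpers.
make_lookup <- function(y, matcher = match) {
  y.unique <- unique(y)
  # Element i of y.map holds every position of y.unique[i] in y.
  y.map <- unname(split(seq_along(y), match(y, y.unique)))
  function(x) {
    i <- matcher(x, y.unique)          # index into y.unique, or NA
    if (is.na(i)) integer(0) else y.map[[i]]
  }
}

y <- c(16, -3, -2, 15, 15, 0, 8, 15, -2)
someCoolFunc <- make_lookup(y)

someCoolFunc(-2)
someCoolFunc(15)
someCoolFunc(999)
```

Building the map costs one pass over y; after that each query does one look-up in the unique values and returns a stored index vector.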

