thr3ads.net - R help - [R] Faster text search in document database than with grep? [Aug 2015]

Witold E Wolski

2015-Aug-03 09:25 UTC

[R] Faster text search in document database than with grep?

I have a database of text documents (letter sequences): several thousand
documents of approx. 1000-2000 letters each.

I need to find exact matches of short sequences of 3-15 letters in those
documents.

Without any regexp patterns, the search for one 3-15 letter "word" takes
on the order of 1 s.

So for a database with several thousand documents it is on the order of
hours.
The naive approach would be to use mcmapply, but then on standard
hardware I am still in the same order of magnitude, and since R is an
interactive programming environment this isn't a solution I would go for.

But aren't there faster algorithmic solutions? Can anyone please point me
to an implementation available in R?

Thank you
Witold




-- 
Witold Eryk Wolski


Duncan Murdoch

2015-Aug-03 13:13 UTC


On 03/08/2015 5:25 AM, Witold E Wolski wrote:
> I have a database of text documents (letter sequences): several thousand
> documents of approx. 1000-2000 letters each.
>
> I need to find exact matches of short sequences of 3-15 letters in those
> documents.
>
> Without any regexp patterns, the search for one 3-15 letter "word" takes
> on the order of 1 s.
>
> So for a database with several thousand documents it is on the order of
> hours.
> The naive approach would be to use mcmapply, but then on standard
> hardware I am still in the same order of magnitude, and since R is an
> interactive programming environment this isn't a solution I would go for.
>
> But aren't there faster algorithmic solutions? Can anyone please point me
> to an implementation available in R?

You haven't shown us what you did, but it sounds far slower than I'd
expect.  I just used the code below to set up a database of 10000
documents of 2000 letters each, and searching those documents for "abc"
takes about 70 milliseconds:

database <- replicate(10000, paste(sample(letters, 2000, replace = TRUE),
                                   collapse = ""))

grep("abc", database, fixed=TRUE)

Duncan Murdoch
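
A minimal sketch of how the timing quoted above could be measured. It assumes a smaller database than in the message so that it runs quickly; `n_docs`, `doc_len`, `hits` and `has_abc` are illustrative names that do not appear in the thread, and absolute timings will differ from machine to machine.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible

n_docs  <- 1000   # hypothetical size; the message above used 10000
doc_len <- 2000

database <- replicate(n_docs,
                      paste(sample(letters, doc_len, replace = TRUE),
                            collapse = ""))

# fixed = TRUE matches the pattern as a literal string instead of a regex
timing <- system.time(hits <- grep("abc", database, fixed = TRUE))
print(timing[["elapsed"]])

# logical form of the same search: one TRUE/FALSE per document
has_abc <- grepl("abc", database, fixed = TRUE)
```

`grep` returns the indices of the matching documents, while `grepl` returns a logical vector of the same length as `database`; which one is more convenient depends on what is done with the hits afterwards.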

Witold E Wolski

2015-Aug-03 13:45 UTC


Dear Duncan,

This is a model of the data I work with.

database <- replicate(50000, paste(sample(letters,rexp(1,1/500), rep=TRUE),
                                   collapse=""))

words <- replicate(10000,paste(sample(letters,rexp(1,1/70), rep=TRUE),
                                       collapse=""))

NumberOfWords <- 10
system.time(lapply(words[1: NumberOfWords], grep, database))
   user  system elapsed
  5.002   0.003   5.005

The model reproduces the running times I have to cope with.

Using grep in this context is rather naive, and I am wondering if there are
better solutions available in R.





-- 
Witold Eryk Wolski
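
The `lapply` call in the model above does not pass `fixed = TRUE`, so each word is matched as a regular expression rather than as a literal string. A hedged sketch of the comparison follows, assuming scaled-down sizes and words of 3 to 15 letters as described in the first message; the names `n_docs`, `n_words`, `regex_hits` and `fixed_hits` are illustrative, and how large the difference is on the real data is not something the thread reports.

```r
set.seed(42)  # assumption: seed chosen only for reproducibility

n_docs  <- 2000   # hypothetical, scaled down from the 50000 in the model
n_words <- 10

# document lengths drawn as in the model, but forced to be at least 1 letter
database <- replicate(n_docs,
  paste(sample(letters, max(1, ceiling(rexp(1, 1/500))), replace = TRUE),
        collapse = ""))

# words of 3-15 letters, as described in the original question
words <- replicate(n_words,
  paste(sample(letters, sample(3:15, 1), replace = TRUE), collapse = ""))

# regex matching (the default) versus literal matching
t_regex <- system.time(regex_hits <- lapply(words, grep, database))
t_fixed <- system.time(fixed_hits <- lapply(words, grep, database, fixed = TRUE))

# the words contain only letters, so both searches find the same documents
stopifnot(identical(regex_hits, fixed_hits))

print(c(regex = t_regex[["elapsed"]], fixed = t_fixed[["elapsed"]]))
```

Because the search terms contain no regex metacharacters, both calls return identical indices, so switching to `fixed = TRUE` changes only the matching method, not the result.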
