Witold E Wolski
2015-Aug-03 09:25 UTC
[R] Faster text search in document database than with grep?
I have a database of text documents (letter sequences): several thousand documents of approx. 1000-2000 letters each.

I need to find exact matches of short sequences of 3-15 letters in those documents.

Even without any regexp patterns, searching for a single 3-15 letter "word" takes on the order of 1 s. So for a database with several thousand documents it's on the order of hours.

The naive approach would be to use mcmapply, but then on standard hardware I am still in the same order of magnitude, and since R is an interactive programming environment this isn't a solution I would go for.

But aren't there faster algorithmic solutions? Can anyone please point me to an implementation available in R?

Thank you
Witold

--
Witold Eryk Wolski
Duncan Murdoch
2015-Aug-03 13:13 UTC
[R] Faster text search in document database than with grep?
On 03/08/2015 5:25 AM, Witold E Wolski wrote:
> I have a database of text documents (letter sequences). Several thousands
> of documents with approx. 1000-2000 letters each.
>
> I need to find exact matches of short 3-15 letter sequences in those
> documents.
>
> Without any regexp patterns the search of one 3-15 letter "word" takes on
> the order of 1s.
>
> So for a database with several thousand documents it's on the order of
> hours.
> The naive approach would be to use mcmapply, but then on standard
> hardware I am still in the same order and since R is an interactive
> programming environment this isn't a solution I would go for.
>
> But aren't there faster algorithmic solutions? Can anyone point me please
> to an implementation available in R.

You haven't shown us what you did, but it sounds far slower than I'd expect. I just used the code below to set up a database of 10000 documents of 2000 letters each, and searching those documents for "abc" takes about 70 milliseconds:

database <- replicate(10000, paste(sample(letters, 2000, rep=TRUE), collapse=""))
grep("abc", database, fixed=TRUE)

Duncan Murdoch
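To check a timing like the one above on your own machine, a minimal sketch could wrap the call in system.time. The smaller database size (1000 documents), the seed, and the names t_fixed, t_regex, hits_fixed and hits_regex are assumptions made here for illustration. No timings are claimed, since they depend on hardware and R version.

```r
set.seed(1)

# Scaled-down database (1000 documents instead of 10000) so the sketch runs quickly.
database <- replicate(1000, paste(sample(letters, 2000, replace = TRUE), collapse = ""))

# fixed = TRUE treats "abc" as a literal string, not as a regular expression.
t_fixed <- system.time(hits_fixed <- grep("abc", database, fixed = TRUE))

# Default grep interprets the pattern as a regular expression.
t_regex <- system.time(hits_regex <- grep("abc", database))

# "abc" has no regex metacharacters, so both calls return the same document indices.
identical(hits_fixed, hits_regex)

# Compare elapsed times of the two variants.
rbind(fixed = t_fixed["elapsed"], regex = t_regex["elapsed"])
```

For exact-match searches the fixed = TRUE form is the one that matches the stated problem, since no regexp features are needed.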
Witold E Wolski
2015-Aug-03 13:45 UTC
[R] Faster text search in document database than with grep?
Dear Duncan,
This is a model of the data I work with.
database <- replicate(50000, paste(sample(letters,rexp(1,1/500), rep=TRUE),
collapse=""))
words <- replicate(10000,paste(sample(letters,rexp(1,1/70), rep=TRUE),
collapse=""))
NumberOfWords <- 10
system.time(lapply(words[1: NumberOfWords], grep, database))
user system elapsed
5.002 0.003 5.005
The model reproduces the running times I have to cope with.
Using grep in this context is rather naive, and I am wondering if there are
better solutions available in R.
--
Witold Eryk Wolski
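Note that the timing call in the model above uses grep's default regular-expression matching, whereas Duncan's example passes fixed=TRUE. A minimal sketch of literal matching for many words is given below, on a scaled-down model. The sizes, the seed, the fixed document length of 1500 letters, and the helper pick_word (hypothetical, it only draws test words that are known to occur in the database) are assumptions for illustration, not part of the original model.

```r
set.seed(42)

# Scaled-down model: 2000 documents of 1500 letters each (sizes are assumptions).
database <- replicate(2000, paste(sample(letters, 1500, replace = TRUE), collapse = ""))

# Hypothetical helper: cut a random 3-15 letter word out of a random document,
# so every search word is guaranteed to occur at least once.
pick_word <- function() {
  doc <- sample(database, 1)
  len <- sample(3:15, 1)
  start <- sample(nchar(doc) - len + 1, 1)
  substr(doc, start, start + len - 1)
}
words <- replicate(20, pick_word())

# One literal (non-regex) search per word; x and fixed are passed by name,
# so each word becomes the pattern argument of grep.
hits <- lapply(words, grep, x = database, fixed = TRUE)

# Number of matching documents per word.
n_hits <- lengths(hits)
```

Each element of hits holds the indices of the documents that contain the corresponding word, so database[hits[[1]]] returns the documents matching words[1].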