Witold E Wolski
2015-Aug-03 09:25 UTC
[R] Faster text search in document database than with grep?
I have a database of text documents (letter sequences): several thousand documents of approximately 1000-2000 letters each.

I need to find exact matches of short sequences of 3-15 letters in those documents.

Without any regexp patterns, searching for one 3-15 letter "word" takes on the order of 1 s.

So for a database with several thousand documents it is on the order of hours.
The naive approach would be to use mcmapply, but then on standard hardware I am still in the same order of magnitude, and since R is an interactive programming environment this isn't a solution I would go for.

But aren't there faster algorithmic solutions? Can anyone please point me to an implementation available in R?

Thank you
Witold

--
Witold Eryk Wolski
Duncan Murdoch
2015-Aug-03 13:13 UTC
[R] Faster text search in document database than with grep?
On 03/08/2015 5:25 AM, Witold E Wolski wrote:
> I have a database of text documents (letter sequences). Several thousands
> of documents with approx. 1000-2000 letters each.
>
> I need to find exact matches of short 3-15 letters sequences in those
> documents.
>
> Without any regexp patterns the search of one 3-15 letter "words" takes in
> the order of 1s.
>
> So for a database with several thousand documents it's an the order of
> hours.
> The naive approach would be to use mcmapply, but than on a standard
> hardware I am still in the same order and since R is an interactive
> programming environment this isn't a solution I would go for.
>
> But aren't there faster algorithmic solutions? Can anyone point me please
> to an implementation available in R.

You haven't shown us what you did, but it sounds far slower than I'd expect. I just used the code below to set up a database of 10000 documents of 2000 letters each, and searching those documents for "abc" takes about 70 milliseconds:

    database <- replicate(10000, paste(sample(letters, 2000, rep=TRUE), collapse=""))

    grep("abc", database, fixed=TRUE)

Duncan Murdoch
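As a rough illustration of the point about fixed=TRUE, the sketch below compares a literal search with the default regular-expression search on a scaled-down version of the database from the message above. The reduced size (1000 documents) and the variable names hits_fixed and hits_regex are assumptions made for this example. Timings depend on the machine and the R build, so none are quoted here.

```r
set.seed(1)

# Scaled-down version of the example database (1000 documents instead of
# 10000) so that the sketch runs quickly; the size is an assumption.
database <- replicate(1000,
                      paste(sample(letters, 2000, replace = TRUE), collapse = ""))

# fixed = TRUE treats the pattern as a literal string rather than a regexp.
hits_fixed <- grep("abc", database, fixed = TRUE)

# The default interprets the pattern as a regular expression.
hits_regex <- grep("abc", database)

# For a pattern without regexp metacharacters both calls select the same
# documents, so only the running time can differ.
identical(hits_fixed, hits_regex)

# Compare the two on your own hardware.
system.time(grep("abc", database, fixed = TRUE))
system.time(grep("abc", database))
```

If the search terms never contain regexp syntax, the literal search is usually the cheaper of the two, but that is worth measuring on the real data.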
Witold E Wolski
2015-Aug-03 13:45 UTC
[R] Faster text search in document database than with grep?
Dear Duncan,

This is a model of the data I work with.

    database <- replicate(50000, paste(sample(letters, rexp(1, 1/500), rep=TRUE), collapse=""))
    words <- replicate(10000, paste(sample(letters, rexp(1, 1/70), rep=TRUE), collapse=""))

    NumberOfWords <- 10
    system.time(lapply(words[1:NumberOfWords], grep, database))

       user  system elapsed
      5.002   0.003   5.005

The model reproduces the running times I have to cope with: about 5 s for 10 words, i.e. roughly 0.5 s per word.
Using grep in this context is rather naive, and I am wondering whether there are better solutions available in R.

--
Witold Eryk Wolski
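Note that the timed call in the model above passes each word to grep as a regular expression, since fixed=TRUE is not set. A minimal sketch of the same loop with literal matching follows. It is not the poster's code: the database is scaled down, the query words are cut out of the documents so that every word has at least one match, and the names random_string, doc_ids and hits are hypothetical helpers introduced for this example.

```r
set.seed(42)

# Hypothetical helper: a random lower-case string of n letters.
random_string <- function(n) {
  paste(sample(letters, n, replace = TRUE), collapse = "")
}

# Scaled-down stand-in for the document database (sizes are assumptions).
database <- vapply(sample(500:2000, 2000, replace = TRUE),
                   random_string, character(1))

# Cut 3-15 letter query words out of randomly chosen documents, so each
# word is known to occur in at least one document.
doc_ids <- sample(length(database), 20)
words <- vapply(doc_ids, function(i) {
  len   <- sample(3:15, 1)
  start <- sample(nchar(database[i]) - len + 1, 1)
  substr(database[i], start, start + len - 1)
}, character(1))

# One literal (non-regexp) search per word; each list element holds the
# indices of the documents that contain that word.
hits <- lapply(words, grep, x = database, fixed = TRUE)
names(hits) <- words

# Time the whole batch on your own hardware.
system.time(lapply(words, grep, x = database, fixed = TRUE))
```

Whether this is fast enough for 10000 words against 50000 documents would have to be measured on the real data; the sketch only removes the regular-expression overhead and does not change the algorithm.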