thr3ads.net - R help - [R] Bug in agrep computing edit distance? [Nov 2010]

Dickison, Daniel

2010-Nov-16 23:47 UTC

[R] Bug in agrep computing edit distance?

The documentation for agrep says it uses the Levenshtein edit distance,
but it seems to get this wrong in certain cases when there is a
combination of deletions and substitutions.  For example:
> agrep("abcd", "abcxyz", max.distance=1)
[1] 1


That should've been a no-match.  The edit distance between those strings
is 3 (1 substitution, 2 deletions), but agrep matches with max.distance >= 1.
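
For reference, the distance of 3 claimed above can be checked with a small dynamic-programming sketch of the Levenshtein distance. The function name `levenshtein` is made up for illustration; it is not part of base R and is not the code agrep runs.

```r
# Hypothetical helper: plain Levenshtein distance between two whole strings.
levenshtein <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # d[i + 1, j + 1] holds the distance between the first i characters of a
  # and the first j characters of b.
  d <- matrix(0L, n + 1, m + 1)
  d[, 1] <- 0:n  # turn a prefix of a into the empty string
  d[1, ] <- 0:m  # turn the empty string into a prefix of b
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1L,  # delete x[i]
                             d[i + 1, j] + 1L,  # insert y[j]
                             d[i, j] + cost)    # match or substitute
    }
  }
  d[n + 1, m + 1]
}

levenshtein("abcd", "abcxyz")  # 3
```

Under this whole-string definition the two strings are 3 edits apart, so a match at max.distance=1 is unexpected if agrep compares entire strings.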

I didn't find anything in the bug database, so I was wondering if somehow
I'm misinterpreting how agrep works.  If not, should I file this in
Bugzilla?
> sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_2.12.0




Daniel  Dickison
Research Programmer
ddickison at carnegielearning.com
Toll Free: (888) 851-7094 x103
FAX: (412) 690-2444

Revolutionary Math Curricula. Revolutionary Results.

Carnegie Learning, Inc. | 437 Grant St. 20th Floor | Pittsburgh, PA 15219
www.carnegielearning.com

Ben Bolker

2010-Nov-17 15:01 UTC

[R] Bug in agrep computing edit distance?

Dickison, Daniel <ddickison <at> carnegielearning.com> writes:
> 
> The documentation for agrep says it uses the Levenshtein edit distance,
> but it seems to get this wrong in certain cases when there is a
> combination of deletions and substitutions.  For example:
> 
> > agrep("abcd", "abcxyz", max.distance=1)
> [1] 1
> 
> That should've been a no-match.  The edit distance between those strings
> is 3 (1 substitution, 2 deletions), but agrep matches with max.distance
> >= 1.
> 
> I didn't find anything in the bug database, so I was wondering if somehow
> I'm misinterpreting how agrep works.  If not, should I file this in
> Bugzilla?
> 
  Could you re-post this on r-devel?  It definitely sounds like
this is worth following up.  Based on a little bit of playing around,
it's quite clear that I don't understand what's going on.  The examples
show things like

agrep("lasy","lazy",max=list(sub=0))

 which makes sense, but 

agrep("lasy","lazybc",max=1)
agrep("lasy","lazybc",max=0.001)
agrep("lasy","layt",max=list(all=1))

and

agrep("x",c("x","xy","xyz","xyza"),max=list(insertions=2))
agrep("x",c("x","xy","xyz","xyza"),max=list(deletions=2))
agrep("x",c("x","xy","xyz","xyza"),max=list(all=2))

  all give "1 2 3 4" ??

  this makes it clear that I really don't understand what's going on
based on the documentation.  I tried to trace into the C code
(which calls functions from the TRE regexp library) but that didn't
help much ...
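
One reading that would be consistent with all of the results above, offered as an assumption and not something confirmed in this thread, is that agrep looks for the pattern within each string, so a string matches when some substring of it is within max.distance of the pattern. The sketch below computes that best substring distance. `substring_distance` is a made-up helper for illustration, not the TRE code that agrep calls.

```r
# Hypothetical helper: smallest edit distance between `pattern` and any
# substring of `txt` (a match may start and end anywhere in `txt`).
substring_distance <- function(pattern, txt) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(txt, "")[[1]]
  n <- length(p)
  m <- length(s)
  d <- matrix(0L, n + 1, m + 1)
  d[, 1] <- 0:n  # pattern prefix against an empty substring
  # The first row stays 0: starting the match later in txt costs nothing.
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (p[i] == s[j]) 0L else 1L
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1L,  # drop p[i] from the pattern
                             d[i + 1, j] + 1L,  # extra character s[j] in txt
                             d[i, j] + cost)    # match or substitute
    }
  }
  min(d[n + 1, ])  # the match may end at any position in txt
}

substring_distance("abcd", "abcxyz")  # 1
substring_distance("lasy", "lazybc")  # 1
substring_distance("lasy", "layt")    # 1
substring_distance("x", "xyza")       # 0
```

If agrep does behave this way, every example above would fall within a budget of one edit, and the "x" pattern would match all four strings exactly.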
