Frank Chang
2012-Sep-06 17:21 UTC
[Xapian-discuss] editdistance.cc average case and worst case running time compared to dynamic programming algorithm
Good afternoon. I have been running some timing tests with

    retval = edit_distance_unsigned((const unsigned int *)MaryCamsterBuffer, len2,
                                    (const unsigned int *)MaryBuffer, len1, 2);

where k = 2 is the Levenshtein distance cutoff. Our results indicate that after 50,000 string pair comparisons, edit_distance is equal in speed to the conventional O(N^2) Levenshtein distance dynamic programming implementation shown below. I would expect the running time of the Xapian edit distance implementation, based on Hal Berghel and David Roach's linear-running-time algorithm, to be substantially faster. Could someone please explain how I can make this speed improvement happen, if it is possible? Thank you.

double cFuzzyCompareLevenshtein::CalcCompare(char* Str1_, char* Str2_, int Size_)
{
    int i, j;
    int Len1, Len2;
    int *Distances;
    int Cost, Distance;
    int maxLen;

    // The Levenshtein distance is calculated by creating a 2D array that looks
    // like the following (say Str1=GUMBO, Str2=GAMBOL)
    // Step 1 + 2:
    //       G  U  M  B  O
    //    0  1  2  3  4  5
    // G  1  0  1  2  3  4
    // A  2  1  1  2  3  4
    // M  3  2  2  1  2  3
    // B  4  3  3  2  1  2
    // O  5  4  4  3  2  1
    // L  6  5  5  4  3  2  <- Step 3

    if (memcmp(Str1_, Str2_, Size_) == 0)
        return 100.0;

    // Trim trailing spaces to find the effective lengths.
    for (Len1 = Size_ - 1; Len1 > 0 && Str1_[Len1] == ' '; Len1--) ;
    for (Len2 = Size_ - 1; Len2 > 0 && Str2_[Len2] == ' '; Len2--) ;
    Len1++;
    Len2++;
    if (MIN(Len1, Len2) <= 1)
        return 0.0;

    Distances = new int[(Len1 + 1) * (Len2 + 1)];
    maxLen = MAX(Len1, Len2);

    // Step 1: Populate 0th row and column of array:
    for (i = 0; i <= Len1; i++)
        Distances[i] = i;
    for (i = 0; i <= Len2; i++)
        Distances[i * (Len1 + 1)] = i;

    // Step 2: Populate inner elements:
    for (i = 1; i <= Len1; i++) {
        for (j = 1; j <= Len2; j++) {
            Cost = (Str1_[i - 1] == Str2_[j - 1] ? 0 : 1);
            Distances[j * (Len1 + 1) + i] =
                MIN(Distances[(j - 1) * (Len1 + 1) + i] + 1,
                    MIN(Distances[j * (Len1 + 1) + i - 1] + 1,
                        Distances[(j - 1) * (Len1 + 1) + i - 1] + Cost));
        }
    }

    // Step 3: Number of errors is the bottom right-most element; convert it
    // into the number of correct characters and make a percentage out of it:
    Distance = Distances[(Len2 + 1) * (Len1 + 1) - 1];
    delete[] Distances;

    return (100.0 * (double)(maxLen - Distance) / (double)maxLen);
}
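For comparison, here is a minimal sketch (not Xapian's actual implementation, which uses the more sophisticated Berghel-Roach method) of how a cutoff k changes the asymptotics: cells more than k away from the main diagonal of the DP matrix cannot lie on any edit path of cost <= k, so only a band of O(k) cells per row needs computing, giving O(k*n) rather than O(n^2) work. The function name and its clamp-to-k+1 convention are my own choices for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Band-limited Levenshtein distance with cutoff k (illustrative sketch).
// Only cells with |i - j| <= k are computed; everything outside the band
// is treated as "already more than k". Returns k + 1 when the true
// distance exceeds k, otherwise the exact distance.
int edit_distance_cutoff(const char* a, int la, const char* b, int lb, int k) {
    int diff = (la > lb) ? la - lb : lb - la;
    if (diff > k) return k + 1;        // lengths alone force distance > k

    const int INF = k + 1;             // sentinel: "more than the cutoff"
    std::vector<int> prev(lb + 1, INF), cur(lb + 1, INF);
    for (int j = 0; j <= std::min(lb, k); ++j) prev[j] = j;

    for (int i = 1; i <= la; ++i) {
        int lo = std::max(1, i - k), hi = std::min(lb, i + k);
        cur.assign(lb + 1, INF);
        if (lo == 1 && i <= k) cur[0] = i;   // left edge, inside the band
        for (int j = lo; j <= hi; ++j) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            cur[j] = std::min({prev[j] + 1,          // deletion
                               cur[j - 1] + 1,       // insertion
                               prev[j - 1] + cost}); // match / substitution
            cur[j] = std::min(cur[j], INF);          // clamp at cutoff
        }
        prev.swap(cur);
    }
    return prev[lb];
}
```

With k = 2 this computes at most 5 cells per row regardless of string length, which is where the expected speedup over the full O(N^2) table comes from.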
Olly Betts
2012-Oct-09 01:17 UTC
[Xapian-discuss] editdistance.cc average case and worst case running time compared to dynamic programming algorithm
On Thu, Sep 06, 2012 at 01:21:51PM -0400, Frank Chang wrote:
> Good afternoon. I have been running some timing tests with
> retval = edit_distance_unsigned((const unsigned int *)MaryCamsterBuffer, len2,
> (const unsigned int *)MaryBuffer, len1, 2); where k = 2 is the Levenshtein
> distance cutoff.
[...]
> double cFuzzyCompareLevenshtein::CalcCompare(char* Str1_, char* Str2_, int Size_){

What types are MaryBuffer and MaryCamsterBuffer? edit_distance_unsigned() expects pointers to arrays of unsigned int (holding Unicode code points), but I notice your edit distance function takes char *, so I'm wondering if you're simply casting char* to unsigned int* - that's not going to work as you intend, and would likely result in computing the edit distance of two much longer strings of junk data, which would take longer than computing the edit distance of the intended input.

> Our results indicate that after 50,000 string pair comparisons,
> edit_distance is equal in speed to the conventional O(N^2) Levenshtein
> distance dynamic programming implementation shown below. I would expect
> the running time of the Xapian edit distance implementation, based on
> Hal Berghel and David Roach's linear-running-time algorithm, to be
> substantially faster. Could someone please explain how I can make this
> speed improvement happen, if it is possible? Thank you.

If it isn't the issue above, could you post the whole test harness you're using? It's hard to reproduce if we have to recreate that code.

Cheers,
    Olly
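To illustrate Olly's point: each byte needs to be widened into its own unsigned int element before calling edit_distance_unsigned(). A minimal sketch, assuming the buffers hold plain ASCII (real Unicode input would need proper UTF-8 decoding); the helper name to_code_points and the commented usage are hypothetical:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Widen each byte of an ASCII string into one unsigned int code point,
// the element type edit_distance_unsigned() expects. Simply casting
// char* to unsigned int* instead would reinterpret every 4 bytes as a
// single bogus "code point" and read past the end of the real data.
std::vector<unsigned int> to_code_points(const std::string& s) {
    std::vector<unsigned int> out;
    out.reserve(s.size());
    for (unsigned char c : s)
        out.push_back(static_cast<unsigned int>(c));
    return out;
}

// Hypothetical usage, mirroring the call from the original post:
//   std::vector<unsigned int> a = to_code_points(MaryBuffer);
//   std::vector<unsigned int> b = to_code_points(MaryCamsterBuffer);
//   int retval = edit_distance_unsigned(b.data(), b.size(),
//                                       a.data(), a.size(), 2);
```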