Frank Chang
2012-Sep-06 17:21 UTC
[Xapian-discuss] editdistance.cc average case and worst case running time compared to dynamic programming algorithm
Good afternoon. I have been running some timing tests with

    retval = edit_distance_unsigned((const unsigned int *)MaryCamsterBuffer, len2,
                                    (const unsigned int *)MaryBuffer, len1, 2);

where k = 2 is the Levenshtein distance cutoff. Our results indicate that after 50,000 string pair comparisons, edit_distance is equal in speed to the conventional O(N^2) Levenshtein distance dynamic programming implementation shown below. I would expect the running time of the Xapian edit distance implementation, based on Hal Berghel and David Roach's linear-running-time algorithm, to be substantially faster. Could someone please explain how I can make this speed improvement happen, if it is possible? Thank you.

double cFuzzyCompareLevenshtein::CalcCompare(char* Str1_, char* Str2_, int Size_)
{
    int i, j;
    int Len1, Len2;
    int *Distances;
    int Cost, Distance;
    int maxLen;

    // The Levenshtein distance is calculated by creating a 2D array that looks
    // like the following (say Str1=GUMBO, Str2=GAMBOL)
    // Step 1 + 2:
    //       G  U  M  B  O
    //    0  1  2  3  4  5
    // G  1  0  1  2  3  4
    // A  2  1  1  2  3  4
    // M  3  2  2  1  2  3
    // B  4  3  3  2  1  2
    // O  5  4  4  3  2  1
    // L  6  5  5  4  3  2  <- Step 3

    if (memcmp(Str1_, Str2_, Size_) == 0)
        return 100.0;

    // Trim trailing spaces to find the effective lengths.
    for (Len1 = Size_ - 1; Len1 > 0 && Str1_[Len1] == ' '; Len1--) ;
    for (Len2 = Size_ - 1; Len2 > 0 && Str2_[Len2] == ' '; Len2--) ;
    Len1++;
    Len2++;
    if (MIN(Len1, Len2) <= 1)
        return 0.0;

    Distances = new int[(Len1 + 1) * (Len2 + 1)];
    maxLen = MAX(Len1, Len2);

    // Step 1: Populate 0th row and column of array:
    for (i = 0; i <= Len1; i++)
        Distances[i] = i;
    for (i = 0; i <= Len2; i++)
        Distances[i * (Len1 + 1)] = i;

    // Step 2: Populate inner elements:
    for (i = 1; i <= Len1; i++) {
        for (j = 1; j <= Len2; j++) {
            Cost = (Str1_[i - 1] == Str2_[j - 1] ? 0 : 1);
            Distances[j * (Len1 + 1) + i] =
                MIN(Distances[(j - 1) * (Len1 + 1) + i] + 1,
                    MIN(Distances[j * (Len1 + 1) + i - 1] + 1,
                        Distances[(j - 1) * (Len1 + 1) + i - 1] + Cost));
        }
    }

    // Step 3: Number of errors is the bottom right-most element; convert it
    // into the number of correct characters and make a percentage out of it:
    Distance = Distances[(Len2 + 1) * (Len1 + 1) - 1];
    delete[] Distances;

    return (100.0 * (double)(maxLen - Distance) / (double)maxLen);
}
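For comparison, here is a minimal sketch (not Xapian's actual implementation, which uses the more sophisticated Berghel-Roach method) of how a cutoff k changes the asymptotics: cells more than k away from the main diagonal of the DP matrix cannot lie on any edit path of cost <= k, so only a band of O(k) cells per row needs computing, giving O(k*n) rather than O(n^2) work. The function name and its clamp-to-k+1 convention are my own choices for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Band-limited Levenshtein distance with cutoff k (illustrative sketch).
// Only cells with |i - j| <= k are computed; everything outside the band
// is treated as "already more than k". Returns k + 1 when the true
// distance exceeds k, otherwise the exact distance.
int edit_distance_cutoff(const char* a, int la, const char* b, int lb, int k) {
    int diff = (la > lb) ? la - lb : lb - la;
    if (diff > k) return k + 1;        // lengths alone force distance > k

    const int INF = k + 1;             // sentinel: "more than the cutoff"
    std::vector<int> prev(lb + 1, INF), cur(lb + 1, INF);
    for (int j = 0; j <= std::min(lb, k); ++j) prev[j] = j;

    for (int i = 1; i <= la; ++i) {
        int lo = std::max(1, i - k), hi = std::min(lb, i + k);
        cur.assign(lb + 1, INF);
        if (lo == 1 && i <= k) cur[0] = i;   // left edge, inside the band
        for (int j = lo; j <= hi; ++j) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            cur[j] = std::min({prev[j] + 1,          // deletion
                               cur[j - 1] + 1,       // insertion
                               prev[j - 1] + cost}); // match / substitution
            cur[j] = std::min(cur[j], INF);          // clamp at cutoff
        }
        prev.swap(cur);
    }
    return prev[lb];
}
```

With k = 2 this computes at most 5 cells per row regardless of string length, which is where the expected speedup over the full O(N^2) table comes from.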
Olly Betts
2012-Oct-09 01:17 UTC
[Xapian-discuss] editdistance.cc average case and worst case running time compared to dynamic programming algorithm
On Thu, Sep 06, 2012 at 01:21:51PM -0400, Frank Chang wrote:
> Good afternoon. I have been running some timing tests with
> retval = edit_distance_unsigned((const unsigned int *)MaryCamsterBuffer, len2,
> (const unsigned int *)MaryBuffer, len1, 2); where k = 2 is the Levenshtein
> distance cutoff.
[...]
> double cFuzzyCompareLevenshtein::CalcCompare(char* Str1_, char* Str2_, int Size_){

What types are MaryBuffer and MaryCamsterBuffer? edit_distance_unsigned() expects pointers to arrays of unsigned int (holding Unicode code points), but I notice your edit distance function takes char *, so I'm wondering if you're simply casting char* to unsigned int* - that's not going to work as you intend, and would likely result in computing the edit distance of two much longer strings of junk data, which would take longer than computing the edit distance of the intended input.

> Our results indicate that after 50,000 string pair comparisons,
> edit_distance is equal in speed to the conventional O(N^2) Levenshtein
> distance dynamic programming implementation shown below. I would expect
> the running time of the Xapian edit distance implementation, based on
> Hal Berghel and David Roach's linear-running-time algorithm, to be
> substantially faster. Could someone please explain how I can make this
> speed improvement happen, if it is possible? Thank you.

If it isn't the issue above, could you post the whole test harness you're using? It's hard to reproduce if we have to recreate that code.

Cheers,
    Olly
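To illustrate Olly's point: each byte needs to be widened into its own unsigned int element before calling edit_distance_unsigned(). A minimal sketch, assuming the buffers hold plain ASCII (real Unicode input would need proper UTF-8 decoding); the helper name to_code_points and the commented usage are hypothetical:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Widen each byte of an ASCII string into one unsigned int code point,
// the element type edit_distance_unsigned() expects. Simply casting
// char* to unsigned int* instead would reinterpret every 4 bytes as a
// single bogus "code point" and read past the end of the real data.
std::vector<unsigned int> to_code_points(const std::string& s) {
    std::vector<unsigned int> out;
    out.reserve(s.size());
    for (unsigned char c : s)
        out.push_back(static_cast<unsigned int>(c));
    return out;
}

// Hypothetical usage, mirroring the call from the original post:
//   std::vector<unsigned int> a = to_code_points(MaryBuffer);
//   std::vector<unsigned int> b = to_code_points(MaryCamsterBuffer);
//   int retval = edit_distance_unsigned(b.data(), b.size(),
//                                       a.data(), a.size(), 2);
```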