thr3ads.net - R help - [R] AGREP [Feb 2004]

If this information is useful, please help other people find it:
Share via:

Marcos Sanches

2004-Feb-11 17:53 UTC

[R] AGREP

Hi all, I have two questions

1 - I have the version 1.4.1 of R, and it doesn't have the 'agrep'
function in the base library. Is there a way to make this funcion
avaliable in R 1.4.1? I mean, how to 'copy' it from R 1.8.1 and
'paste'
it in R 1.4.1?

2 - The AGREP function doesn't give me the Levenshtein distance (edit
distance). Is there a function in R that does it? Is there a way to use
AGREP to acomplish this task? I've written such a function, but it is so
slow (has so many loops) that it is beeing useless. 

TIA

Marcos Sanches

2004-Feb-12 00:44 UTC

head link

[R] AGREP

Hi listers

If you don't know what is the Edit Distance beetwen two strings, I will
try to explain, in fact it is very simple to understund but not to
calculate througth a program. It is simplilly the minimum number of
operations you must perform to transform string A on string B, by
operations I mean delete letters, insert letters or substitute letter.

If you need to do few operations, it means string A is almost the same
as string B. Otherwise they are more differente as the number of
operations increase.

If you have a idea of how to make a function to calculate this distance,
it would be very important for me. 

Thanks very much,

Marcos

Gabor Grothendieck

2004-Feb-12 06:26 UTC

head link

[R] AGREP

1. The agrep function in 1.8.1 is not written entirely in R so
you would have to move the C code over too.

If you have the agrep command at the operating system level
(for windows you can get find it by searching for agrep.exe in
google) you could try something like this:

   readLines(pipe("agrep -1 pattern myfile"))

where you have written out your lines from R to myfile.  If
pipe was not available that far back you could use the system
command and redirect the output to a file and read it into R.

2. I am not sure if this is practical for you but you could try
running agrep, agrep -1, agrep -2, etc. and then
assign each line the number at which it first appears.

---
Date:   Wed, 11 Feb 2004 14:53:33 -0300 
From:   Marcos Sanches <marcos.sanches at ipsos-opinion.com.br>
[ Add to Address Book | Block Address | Report as Spam ] 
To:   R Help <r-help at stat.math.ethz.ch> 
Subject:   [R] AGREP 

 

     Hi all, I have two questions

1 - I have the version 1.4.1 of R, and it doesn't have the 'agrep'
function in the base library. Is there a way to make this funcion
avaliable in R 1.4.1? I mean, how to 'copy' it from R 1.8.1 and
'paste'
it in R 1.4.1?

2 - The AGREP function doesn't give me the Levenshtein distance (edit
distance). Is there a function in R that does it? Is there a way to use
AGREP to acomplish this task? I've written such a function, but it is so
slow (has so many loops) that it is beeing useless. 

TIA

Duncan Murdoch

2004-Feb-12 11:45 UTC

head link

[R] AGREP

On Wed, 11 Feb 2004 14:53:33 -0300, you wrote:
>
>	Hi all, I have two questions
>
>1 - I have the version 1.4.1 of R, and it doesn't have the
'agrep'
>function in the base library. Is there a way to make this funcion
>avaliable in R 1.4.1? I mean, how to 'copy' it from R 1.8.1 and
'paste'
>it in R 1.4.1?
The easiest way is to install 1.8.1.  There have been a *lot* of
improvements in R since 1.4.1
>2 - The AGREP function doesn't give me the Levenshtein distance (edit
>distance). Is there a function in R that does it? Is there a way to use
>AGREP to acomplish this task? I've written such a function, but it is so
>slow (has so many loops) that it is beeing useless. 
I don't think there's a function, but I imagine this should be
possible.  You'd need to look at src/main/apse.c to find which
function returns this information, and then write code something like
do_agrep (in character.c) to call that function instead of calling
apse_match.

Duncan Murdoch

Rodrigo Abt

2004-Feb-12 14:01 UTC

head link

[R] AGREP

"Marcos Sanches" <marcos.sanches at ipsos-opinion.com.br>
writes:
>Hi listers
>
>If you don't know what is the Edit Distance beetwen two strings, I will
>try to explain, in fact it is very simple to understund but not to
>calculate througth a program. It is simplilly the minimum number of
>operations you must perform to transform string A on string B, by
>operations I mean delete letters, insert letters or substitute letter.
>
>If you need to do few operations, it means string A is almost the same
>as string B. Otherwise they are more differente as the number of
>operations increase.
>
>If you have a idea of how to make a function to calculate this distance,
>it would be very important for me.
>
>Thanks very much,
>
>Marcos
I guess you're looking for Levenshtein distance, so try this:

levenshtein<-function(s1,s2) {
	# Make sure args are strings
	a<-as.character(s1);an=nchar(s1)+1
	b<-as.character(s2);bn=nchar(s2)+1

	# Initialize matrix for calculations
	m<-matrix(nrow=an,ncol=bn)

	# If one arg is an empty string, returns the length of the other
	if (nchar(a)==0)
		return(nchar(b))
	if (nchar(b)==0)
		return(nchar(a))

	# Matrix initialization
	for (i in 1:an) {
		for (j in 1:bn) {
			m[i,j]<-0
			m[1,j]<-j
		}
		m[i,1]<-i
	}

	# Cost calculation
	for (i in 2:an) {
		for (j in 2:bn) {
			if (substr(a,i-1,i-1)==substr(b,j-1,j-1))
				cost<-0
			else
				cost<-1
		m[i,j]=min(m[i-1,j]+1,m[i,j-1]+1,m[i-1,j-1]+cost)
		}
	}
	# Returns the distance
	m[an,bn]-1
}

Examples:
> levenshtein("Great","Grreat")		<-- One addition
[1] 1> levenshtein("mahrcoz","Marcos")         <-- One
substitution,one deletionand one substitution
[1] 3

Note that this function IS case sensitive. If you want to apply this on
vectors of strings you'll have to write the
corresponding wrapper function.

Hope that helps,

Rodrigo Abt B,
Analyst,
Dept. Economic Studies,
SII, Chile.

Gabor Grothendieck

2004-Feb-12 15:06 UTC

head link

[R] AGREP

One could shorten it slightly with these minor improvements.  Unfortunately, the
key performance problem, the double loop
at the end which implements the dynamic programming calculation, 
is still there.

levenshtein<-function(s1,s2) {
     # Make sure args are strings
     a <- as.character(s1); an <- nchar(s1)+1
     b <- as.character(s2); bn <- nchar(s2)+1

     # If one arg is an empty string, returns the length of the other
     if (nchar(a)==0) return(nchar(b))
     if (nchar(b)==0) return(nchar(a))

     # Initialize matrix for calculations
     m <- matrix(0, nrow=an, ncol=bn)
     m[1,] <- 1:bn
     m[,1] <- 1:an 

     # Cost calculation - line beginning (substr... is 0-1 cost f'n
     for (i in 2:an) 
          for (j in 2:bn) 
		  m[i,j] <- min( m[i-1,j]+1, m[i,j-1]+1, m[i-1,j-1]+
		       (substr(a,i-1,i-1)!=substr(b,j-1,j-1)) ) 

     # Returns the distance
     m[an,bn]-1
}

---
Date:   Thu, 12 Feb 2004 11:01:19 -0300 
From:   Rodrigo Abt <rodrigo.abt at sii.cl>
To:   'Lista de Correo de R' <r-help at stat.math.ethz.ch> 
Subject:   Re: [R] AGREP 

 
"Marcos Sanches" <marcos.sanches at ipsos-opinion.com.br>
writes:
>Hi listers
>
>If you don't know what is the Edit Distance beetwen two strings, I will
>try to explain, in fact it is very simple to understund but not to
>calculate througth a program. It is simplilly the minimum number of
>operations you must perform to transform string A on string B, by
>operations I mean delete letters, insert letters or substitute letter.
>
>If you need to do few operations, it means string A is almost the same
>as string B. Otherwise they are more differente as the number of
>operations increase.
>
>If you have a idea of how to make a function to calculate this distance,
>it would be very important for me.
>
>Thanks very much,
>
>Marcos
I guess you're looking for Levenshtein distance, so try this:

levenshtein<-function(s1,s2) {
     # Make sure args are strings
     a<-as.character(s1);an=nchar(s1)+1
     b<-as.character(s2);bn=nchar(s2)+1

     # Initialize matrix for calculations
     m<-matrix(nrow=an,ncol=bn)

     # If one arg is an empty string, returns the length of the other
     if (nchar(a)==0)
          return(nchar(b))
     if (nchar(b)==0)
          return(nchar(a))

     # Matrix initialization
     for (i in 1:an) {
          for (j in 1:bn) {
               m[i,j]<-0
               m[1,j]<-j
          }
          m[i,1]<-i
     }

     # Cost calculation
     for (i in 2:an) {
          for (j in 2:bn) {
               if (substr(a,i-1,i-1)==substr(b,j-1,j-1))
                    cost<-0
               else
                    cost<-1
          m[i,j]=min(m[i-1,j]+1,m[i,j-1]+1,m[i-1,j-1]+cost)
          }
     }
     # Returns the distance
     m[an,bn]-1
}

Examples:
> levenshtein("Great","Grreat")          <-- One
addition
[1] 1> levenshtein("mahrcoz","Marcos") <-- One
substitution,one deletionand one substitution
[1] 3

Note that this function IS case sensitive. If you want to apply this on
vectors of strings you'll have to write the
corresponding wrapper function.

Hope that helps,

Rodrigo Abt B,
Analyst,
Dept. Economic Studies,
SII, Chile.

Michael Friendly

2004-Feb-13 23:18 UTC

head link

[R] AGREP

The problem of calculating levenshtein distances between strings reminds 
me of what I faced
years ago when writing an APL system to control interactive memory 
experiments, where subjects
typed words they could remember from a given list. 

To handle typos, I used the generalized outer product operator, null dot 
equals to create
a binary matrix comparing an input string with each correct word,

input o.= word

giving a 0/1 array whose rows reprepresented the letters in the input 
and whose columns
the letters in the output, from which varous measures of string 
similarity could be calculated
using reduction operators, or even a correlation, based on the positions 
of the 1s.
Thinking in arrays always helps avoid explicit loops.

-- 
Michael Friendly     Email: friendly at yorku.ca 
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA

Reasonably Related Threads

Search for more apparently analagous threads

R help - Feb 2004 - AGREP

[R] AGREP

[R] AGREP

[R] AGREP

[R] AGREP

[R] AGREP

[R] AGREP

[R] AGREP

Reasonably Related Threads