Hi all,

all.equal is generally very useful when you want to find the differences between two objects. It breaks down however, when you have two long strings to compare:

> all.equal(a, b)
[1] "1 string mismatch"

Does any one know of any good text diffing tools implemented in R?

Thanks,

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
There is the stringMatch function in the MiscPsycho package.

> stringMatch('Hadley', 'Hadley Wickham', normalize = 'no')
[1] 8
> stringMatch('Hadley', 'Hadley Wickham', normalize = 'yes')
[1] 0.4285714

It uses Levenshtein distance to tell you how much the two strings differ, either normalized or not. The first call above tells you that the first string differs from the second by 8 insertions/deletions/substitutions. The second call normalizes the comparison so that 1 denotes perfect agreement and values closer to 0 denote greater disagreement. Examples of an exact match are below.

> stringMatch('Hadley Wickham', 'Hadley Wickham', normalize = 'yes')
[1] 1
> stringMatch('Hadley Wickham', 'Hadley Wickham', normalize = 'n')
[1] 0
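For anyone who wants to check the numbers above without installing MiscPsycho, here is a minimal sketch using adist() from the utils package that ships with R, which computes a generalized Levenshtein distance. The normalization formula (1 minus the distance divided by the length of the longer string) is an assumption inferred from the output shown above, not taken from the MiscPsycho source, and the variable names are purely illustrative.

```r
# Sketch only: reproduces the figures quoted above with base R.
# adist() lives in the utils package, which is attached by default.
a <- "Hadley"
b <- "Hadley Wickham"

# Raw Levenshtein distance: adist() returns a 1 x 1 matrix here,
# so drop() reduces it to a plain number.
d <- drop(adist(a, b))
d    # 8

# ASSUMED normalization (inferred from the output above, not from the
# MiscPsycho source): 1 minus distance over the longer string's length.
sim <- 1 - d / max(nchar(a), nchar(b))
sim  # 0.4285714
```

With this formula, identical strings give a distance of 0 and a similarity of 1, which agrees with the exact-match examples.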
On 24-Aug-10 14:16:55, Hadley Wickham wrote:
> Hi all,
> all.equal is generally very useful when you want to find the
> differences between two objects. It breaks down however,
> when you have two long strings to compare:
>
>> all.equal(a, b)
> [1] "1 string mismatch"
>
> Does any one know of any good text diffing tools implemented in R?
>
> Thanks,
> Hadley

Hi Hadley,
I suppose it depends on what you want to find out:

  all.equal(strsplit("abcdefg",split=""),strsplit("aBcDEfg",split=""))
  # [1] "Component 1: 3 string mismatches"

will tell you how many mismatches there are. But, if you want to
find out *what* they are, and/or where, then you would have to do
something like

  X <- "abcdefg" ; Y <- "aBcDEfg"
  X0 <- unlist(strsplit(X,split=""))  ## Nasty but necessary!
  Y0 <- unlist(strsplit(Y,split=""))  ## ...
  ix <- which(X0 != Y0)
  cbind(ix,X0[ix],Y0[ix])
  #      ix
  # [1,] "2" "b" "B"
  # [2,] "4" "d" "D"
  # [3,] "5" "e" "E"

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 24-Aug-10                                       Time: 15:38:22
------------------------------ XFMail ------------------------------
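The position-by-position comparison above assumes both strings have the same number of characters; with unequal lengths, X0 != Y0 would recycle the shorter vector. A possible way to wrap the same idea in a reusable function is sketched below. The name char_diff and its column names are hypothetical (they do not come from the thread or from any package), and padding the shorter string with NA is just one design choice among several.

```r
# Hypothetical helper (not from the thread): position-wise string comparison.
char_diff <- function(x, y) {
  # Split each string into a vector of single characters
  x0 <- strsplit(x, split = "")[[1]]
  y0 <- strsplit(y, split = "")[[1]]

  # Pad the shorter vector with NA so that positions line up
  n <- max(length(x0), length(y0))
  length(x0) <- n
  length(y0) <- n

  # A position differs if either side is missing or the characters differ
  differs <- is.na(x0) | is.na(y0) | x0 != y0
  ix <- which(differs)

  data.frame(pos = ix, x = x0[ix], y = y0[ix], stringsAsFactors = FALSE)
}

char_diff("abcdefg", "aBcDEfg")
#   pos x y
# 1   2 b B
# 2   4 d D
# 3   5 e E
```

Note that this is still a positional comparison, not a true diff: a single inserted character near the start shifts everything after it and shows up as a long run of mismatches.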