I have 2 large data files that I need to compare to find the differences between data file x and data file y in order to correct data entry errors. Theoretically, both data files should be identical. I am trying to figure out a way to do this in R. Any help would be great!
Here are some ways to check whether the two files are identical:

    all.equal(readLines(file1), readLines(file2))

You could also try comparing the md5sums of the files:

    library(tools)
    # md5sum() returns a character vector named by file path,
    # so drop the names before comparing the checksums
    identical(unname(md5sum(file1)), unname(md5sum(file2)))

(In reply to Nicole Brandt, Tue, Oct 19, 2010 at 8:23 PM)

--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
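The checks above only say whether the files match, not where they differ. As a rough sketch of how the differing lines could be located in base R (the helper name `diff_lines` and the temporary example files are made up for illustration, and the comparison is position by position, so an inserted or deleted line will make every later line show up as different):

```r
# Hypothetical helper: report the line numbers at which two text files differ.
diff_lines <- function(file1, file2) {
  x <- readLines(file1)
  y <- readLines(file2)
  n <- max(length(x), length(y))
  # Pad the shorter file with NA so both vectors have the same length
  length(x) <- n
  length(y) <- n
  # A line differs if it is missing on exactly one side or the text is not equal
  differs <- is.na(x) != is.na(y) | (!is.na(x) & !is.na(y) & x != y)
  data.frame(line = which(differs), x = x[differs], y = y[differs],
             stringsAsFactors = FALSE)
}

# Made-up example data written to temporary files
f1 <- tempfile()
f2 <- tempfile()
writeLines(c("id,value", "1,10", "2,20", "3,30"), f1)
writeLines(c("id,value", "1,10", "2,25", "3,30", "4,40"), f2)

res <- diff_lines(f1, f2)
print(res)
```

Each row of the result gives a line number plus the text from both files, with NA where one file has no such line, which may be enough to track down individual data entry errors.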
----------------------------------------
> From: nicolegr at buffalo.edu
> Date: Tue, 19 Oct 2010 18:23:27 -0400
> To: r-help at r-project.org
> Subject: [R] comparing two data files

I'm not sure why you want to use R for this. There may be very good reasons, but I generally use text-processing utilities like "diff" (see the Linux or Cygwin docs) along with grep, sed, awk, and maybe perl. These tools are generally not sophisticated with numbers and just process strings, so if your validation and correction rely on R features, doing it in R may be worthwhile. If you are really just looking for diffs in strings, these other tools could be a good alternative and possibly worth the learning curve, unless your largest motivation for doing this in R is to learn more R.

I guess the next question is, "what do you want to do if they are not equal?"
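To illustrate the point about numbers versus strings: once the files are parsed in R, values such as 10 and 10.0 can be treated as equal, which a plain string diff would flag. The following is only a sketch under assumptions not stated in the thread: that both files are CSV with the same columns and the same row order, and that a small numeric tolerance is acceptable. The helper name `diff_cells`, the `tol` parameter, and the example data are all hypothetical.

```r
# Hypothetical helper: list the cells that differ between two data frames
# with the same dimensions and column names. Returns NULL if nothing differs.
diff_cells <- function(a, b, tol = 1e-8) {
  stopifnot(identical(dim(a), dim(b)), identical(names(a), names(b)))
  out <- list()
  for (col in names(a)) {
    u <- a[[col]]
    v <- b[[col]]
    if (is.numeric(u) && is.numeric(v)) {
      bad <- abs(u - v) > tol                     # numeric comparison with tolerance
    } else {
      bad <- as.character(u) != as.character(v)   # otherwise compare as text
    }
    # A value missing on only one side counts as a difference
    na_pos <- is.na(bad)
    bad[na_pos] <- xor(is.na(u), is.na(v))[na_pos]
    if (any(bad)) {
      out <- c(out, list(data.frame(row = which(bad), column = col,
                                    x = as.character(u[bad]),
                                    y = as.character(v[bad]),
                                    stringsAsFactors = FALSE)))
    }
  }
  do.call(rbind, out)
}

# Made-up example data: "10.0" vs "10" is not a real difference, "b" vs "B" is
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
writeLines(c("id,name,value", "1,a,10.0", "2,b,20", "3,c,30"), f1)
writeLines(c("id,name,value", "1,a,10", "2,B,20", "3,c,35"), f2)

x <- read.csv(f1, stringsAsFactors = FALSE)
y <- read.csv(f2, stringsAsFactors = FALSE)
cells <- diff_cells(x, y)
print(cells)
```

If the rows of the two files are not in the same order, they would first need to be aligned on a key column (for example with `merge()`), which this sketch does not attempt.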