andrewH
2013-Dec-09 21:14 UTC
[R] How can I find nonstandard or control characters in a large file?
I have a humongous csv file containing census data, far too big to read into RAM. I have been trying to extract individual columns from this file using the colbycol package. This works for certain subsets of the columns, but not for others. I have not yet been able to precisely identify the problem columns, as there are 731 columns and running colbycol on the file on my old, slow machine takes about 6 hours.

However, my suspicion is that there are some funky characters, either control characters or characters with some non-standard encoding, somewhere in this 14 gig file. Moreover, I am concerned that these characters may cause me trouble down the road even if I use a different approach to getting columns out of the file.

Is there an R utility that will search through my file without trying to read it all into memory at one time and find non-standard characters or misplaced (non-end-of-line) control characters? Or some R code to the same end? Even if the real problem ultimately proves to be different, it would be helpful to eliminate this possibility. And this is also something I would routinely run on files from external sources if I had it.

I am working in a Windows XP environment, in case that makes a difference.

Any help anyone could offer would be greatly appreciated.

Sincerely,
andrewH
Enrico Schumann
2013-Dec-10 07:11 UTC
[R] How can I find nonstandard or control characters in a large file?
On Mon, 09 Dec 2013, andrewH <ahoerner at rprogress.org> writes:

> However, my suspicion is that there are some funky characters, either
> control characters or characters with some non-standard encoding,
> somewhere in this 14 gig file. [...]
>
> Is there an R utility that will search through my file without trying
> to read it all into memory at one time and find non-standard
> characters or misplaced (non-end-of-line) control characters? Or some
> R code to the same end?

You could process your file in chunks:

f <- file("myfile.csv", open = "r")
repeat {
    lines <- readLines(f, n = 10000)
    if (length(lines) == 0L)
        break
    ## do something with 'lines'
}
close(f)

To find 'non-standard characters' you will need to define what 'non-standard characters' are. But perhaps ?tools:::showNonASCII, which uses ?iconv, can help you. (Please note the warnings and caveats on the functions' help pages.)

--
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net
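Combining the chunked reading with a scan for suspicious bytes gives a minimal sketch of the kind of utility asked for. Here 'non-standard' is assumed to mean anything outside tab and the printable ASCII range x20-x7E; the function name, file name, and chunk size are placeholders, and note that readLines() itself silently drops embedded NUL bytes, so a byte-level scan (see the next reply) is more thorough:

## Sketch: report line numbers of lines containing bytes outside tab
## and printable ASCII (x20-x7E). Names and chunk size are placeholders.
find_funky <- function(filename, chunk_size = 10000) {
    con <- file(filename, open = "r")
    on.exit(close(con))
    offset <- 0L
    repeat {
        lines <- readLines(con, n = chunk_size, warn = FALSE)
        if (length(lines) == 0L)
            break
        ## match any byte that is not tab or in the range x20-x7E
        bad <- grep("[^\t\x20-\x7e]", lines, useBytes = TRUE)
        if (length(bad))
            print(offset + bad)    # line numbers of suspect lines
        offset <- offset + length(lines)
    }
    invisible(NULL)
}

find_funky("myfile.csv")

Because only chunk_size lines are held at a time, this works on files far larger than RAM.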
Earl F Glynn
2013-Dec-10 15:27 UTC
[R] How can I find nonstandard or control characters in a large file?
andrewH wrote:

> However, my suspicion is that there are some funky characters, either
> control characters or characters with some non-standard encoding,
> somewhere in this 14 gig file. Moreover, I am concerned that these
> characters may cause me trouble down the road even if I use a
> different approach to getting columns out of the file.

This is not an R solution, but here's a Windows utility I wrote to produce a table of frequency counts for all hex characters x00 to xFF in a file:

http://www.efg2.com/Lab/OtherProjects/CharCount.ZIP

Normally, you'll want to scrutinize anything below x20 or above x7F, since ASCII printable characters are in the range x20 to x7E. You can see how many tab (x09) characters are in the file, and whether the line endings are from Linux (x0A alone) or Windows (x0D followed by x0A).

The ZIP includes Delphi source code, but provides a Windows executable. I made a change several months ago to allow drag-and-drop, so you can just drop the file on the application to have the characters counted. Just run the EXE after unzipping; no installation is needed.

Once you find problem characters in the file, you can read the file as character data and use sub/gsub or other tools to remove or alter them.

efg

Earl F Glynn
UMKC School of Medicine
Center for Health Insights
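For those who prefer to stay in R, the same frequency-count idea can be sketched with readBin(), reading the file in binary chunks so it never has to fit in memory at once. The function name, file name, and chunk size below are placeholders:

## Sketch: frequency counts for all byte values x00-xFF, read in
## binary chunks. Names and chunk size are placeholders.
byte_counts <- function(filename, chunk_size = 1e6) {
    con <- file(filename, open = "rb")
    on.exit(close(con))
    counts <- integer(256)    # one slot per byte value x00-xFF
    repeat {
        bytes <- readBin(con, what = "raw", n = chunk_size)
        if (length(bytes) == 0L)
            break
        ## tabulate byte values 0..255 (shift by 1 for 1-based bins)
        counts <- counts + tabulate(as.integer(bytes) + 1L, nbins = 256)
    }
    names(counts) <- sprintf("x%02X", 0:255)
    counts
}

## show only the byte values that actually occur
tab <- byte_counts("myfile.csv")
tab[tab > 0]

As above, bytes below x20 (other than x09, x0A, and x0D) or above x7E would be the ones to investigate.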