andrewH
2013-Dec-09 21:14 UTC
[R] How can I find nonstandard or control characters in a large file?
I have a humongous csv file containing census data, far too big to read into RAM. I have been trying to extract individual columns from this file using the colbycol package. This works for certain subsets of the columns, but not for others. I have not yet been able to precisely identify the problem columns, as there are 731 columns and running colbycol on the file on my old, slow machine takes about 6 hours.

However, my suspicion is that there are some funky characters, either control characters or characters with some non-standard encoding, somewhere in this 14 gig file. Moreover, I am concerned that these characters may cause me trouble down the road even if I use a different approach to getting columns out of the file.

Is there an R utility that will search through my file without trying to read it all into memory at one time and find non-standard characters or misplaced (non-end-of-line) control characters? Or some R code to the same end? Even if the real problem ultimately proves to be different, it would be helpful to eliminate this possibility. And this is also something I would routinely run on files from external sources if I had it.

I am working in a Windows XP environment, in case that makes a difference.

Any help anyone could offer would be greatly appreciated.

Sincerely,
andrewH
Enrico Schumann
2013-Dec-10 07:11 UTC
[R] How can I find nonstandard or control characters in a large file?
On Mon, 09 Dec 2013, andrewH <ahoerner at rprogress.org> writes:

> However, my suspicion is that there are some funky characters, either
> control characters or characters with some non-standard encoding,
> somewhere in this 14 gig file. [...]
>
> Is there an R utility that will search through my file without trying
> to read it all into memory at one time and find non-standard
> characters or misplaced (non-end-of-line) control characters? Or some
> R code to the same end?

You could process your file in chunks:

f <- file("myfile.csv", open = "r")
repeat {
    lines <- readLines(f, n = 10000)
    if (length(lines) == 0L)
        break
    ## do something with 'lines'
}
close(f)

To find 'non-standard characters' you will need to define what 'non-standard characters' are. But perhaps ?tools:::showNonASCII, which uses ?iconv, can help you. (Please note the warnings and caveats on the functions' help pages.)

--
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net
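Combining the chunked reading with a scan for suspicious bytes gives a minimal sketch of the kind of utility asked for. Here 'non-standard' is assumed to mean anything outside tab and the printable ASCII range x20-x7E; the function name, file name, and chunk size are placeholders, and note that readLines() itself silently drops embedded NUL bytes, so a byte-level scan (see the next reply) is more thorough:

## Sketch: report line numbers of lines containing bytes outside tab
## and printable ASCII (x20-x7E). Names and chunk size are placeholders.
find_funky <- function(filename, chunk_size = 10000) {
    con <- file(filename, open = "r")
    on.exit(close(con))
    offset <- 0L
    repeat {
        lines <- readLines(con, n = chunk_size, warn = FALSE)
        if (length(lines) == 0L)
            break
        ## match any byte that is not tab or in the range x20-x7E
        bad <- grep("[^\t\x20-\x7e]", lines, useBytes = TRUE)
        if (length(bad))
            print(offset + bad)    # line numbers of suspect lines
        offset <- offset + length(lines)
    }
    invisible(NULL)
}

find_funky("myfile.csv")

Because only chunk_size lines are held at a time, this works on files far larger than RAM.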
Earl F Glynn
2013-Dec-10 15:27 UTC
[R] How can I find nonstandard or control characters in a large file?
andrewH wrote:

> However, my suspicion is that there are some funky characters, either
> control characters or characters with some non-standard encoding,
> somewhere in this 14 gig file. Moreover, I am concerned that these
> characters may cause me trouble down the road even if I use a
> different approach to getting columns out of the file.

This is not an R solution, but here's a Windows utility I wrote to produce a table of frequency counts for all hex characters x00 to xFF in a file:

http://www.efg2.com/Lab/OtherProjects/CharCount.ZIP

Normally, you'll want to scrutinize anything below x20 or above x7F, since ASCII printable characters are in the range x20 to x7E. You can see how many tab (x09) characters are in the file, and whether the line endings are from Linux (x0A alone) or Windows (x0D followed by x0A).

The ZIP includes Delphi source code, but provides a Windows executable. I made a change several months ago to allow drag-and-drop, so you can just drop the file on the application to have the characters counted. Just run the EXE after unzipping; no installation is needed.

Once you find problem characters in the file, you can read the file as character data and use sub/gsub or other tools to remove or alter them.

efg

Earl F Glynn
UMKC School of Medicine
Center for Health Insights
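For those who prefer to stay in R, the same frequency-count idea can be sketched with readBin(), reading the file in binary chunks so it never has to fit in memory at once. The function name, file name, and chunk size below are placeholders:

## Sketch: frequency counts for all byte values x00-xFF, read in
## binary chunks. Names and chunk size are placeholders.
byte_counts <- function(filename, chunk_size = 1e6) {
    con <- file(filename, open = "rb")
    on.exit(close(con))
    counts <- integer(256)    # one slot per byte value x00-xFF
    repeat {
        bytes <- readBin(con, what = "raw", n = chunk_size)
        if (length(bytes) == 0L)
            break
        ## tabulate byte values 0..255 (shift by 1 for 1-based bins)
        counts <- counts + tabulate(as.integer(bytes) + 1L, nbins = 256)
    }
    names(counts) <- sprintf("x%02X", 0:255)
    counts
}

## show only the byte values that actually occur
tab <- byte_counts("myfile.csv")
tab[tab > 0]

As above, bytes below x20 (other than x09, x0A, and x0D) or above x7E would be the ones to investigate.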