Lucas Ferreira Mation
2016-Nov-06 13:36 UTC
[R] How to pre-process fwf or csv files to remove unexpected characters in R?
I have some large .txt files, about ~100GB, containing a dataset in fixed-width format. The files contain some errors:

- alphabetic characters in columns that are supposed to be numeric,
- invalid characters,
- rows with too many characters, possibly due to invalid characters or a missing end-of-line character (so two rows in the original data become one row in the .txt file).

The errors are not very frequent, but they stop me from importing with readr::read_fwf().

Is there some package, or workflow, in R to pre-process the files, separating the valid rows from the invalid rows into different files? This can be done with point-and-click ETL tools, such as Pentaho PDI. Is there equivalent code in R to do this?

I googled it and could not find a solution. I also asked this on StackOverflow and got no answer (here <http://stackoverflow.com/questions/39414886/fix-errors-in-csv-and-fwf-files-corrupted-characters-when-importing-to-r>).

regards
Lucas Mation
IPEA - Brasil
Jeff Newmiller
2016-Nov-06 16:12 UTC
[R] How to pre-process fwf or csv files to remove unexpected characters in R?
?readLines ... given the large size of the file, you may need to process it in chunks by specifying a file connection rather than a character-string file name and using the "n" argument.

?grepl
?Extract
?tools::showNonASCII

There are many ways for data to be corrupted... in particular, when invalid characters appear the possibilities explode, so more specifics are needed if this is not enough.

Of course, reading the Posting Guide, posting in plain text to avoid HTML corruption, and giving reproducible examples will improve the quality of responses to those questions.
--
Sent from my phone. Please excuse my brevity.
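A minimal sketch of the chunked readLines()/grepl() approach described above, assuming (purely for illustration) that valid lines are exactly 80 characters of printable ASCII; the file names, the width, and the chunk size are all placeholders:

## Split a large file into clean and suspect lines, one chunk at a time.
## Assumptions (hypothetical): valid lines are exactly 80 characters and
## contain only printable ASCII; adjust the width and the test as needed.
expected_width <- 80
con  <- file("big.txt",  open = "r")
good <- file("good.txt", open = "w")
bad  <- file("bad.txt",  open = "w")
repeat {
  chunk <- readLines(con, n = 100000)     # read the next chunk of lines
  if (length(chunk) == 0) break           # end of file
  ok <- nchar(chunk) == expected_width &  # expected width, and
        !grepl("[^ -~]", chunk)           # nothing outside printable ASCII
  writeLines(chunk[ok],  good)
  writeLines(chunk[!ok], bad)
}
close(con); close(good); close(bad)

If the file also contains embedded NUL bytes, readLines()'s skipNul=TRUE argument may help.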
David Winsemius
2016-Nov-06 16:16 UTC
[R] How to pre-process fwf or csv files to remove unexpected characters in R?
> I googled it and could not find a solution. I also asked this in StackOverflow and got no answer (here <http://stackoverflow.com/questions/39414886/fix-errors-in-csv-and-fwf-files-corrupted-characters-when-importing-to-r>).

Had I seen it there I would have voted to close (and just did) that SO question as too broad, although it is also too vague, given the lack of a definition of "corrupted characters", and furthermore it is basically a request for a package recommendation (which is also off-topic on SO).

For the csv part of the task on a smaller file (which you didn't repeat here) I would have pointed you to this answer:

http://stackoverflow.com/questions/19082490/how-can-i-use-r-to-find-malformed-rows-and-fields-in-a-file-too-big-to-read-into/19083665#19083665

For the fwf part (in a file that fits into RAM), I would have suggested wrapping table(nchar( . )) around readLines(filename), and then drilling down with which( nchar( . ) == <chosen_line_length> ).

I believe searching Rhelp will bring up examples of how to handle file input in chunks, which should allow you to cobble together a strategy if you insist on using R ... the wrong tool. If you need to narrow your Rhelp archive search, I suggest using the names "Jim Holtman", "William Dunlap", or "Gabor Grothendieck", since they frequently have the most elegant strategies, in my opinion.

Here's a search strategy implemented via MarkMail:
http://markmail.org/search/?q=list%3Aorg.r-project.r-help+file+chunks+readlines

But for files of the size you contemplate I would suggest using databases, awk, or other editing software that is designed for streaming processing from disk. R is not so designed.

--
David.

David Winsemius
Alameda, CA, USA
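Concretely, the line-length diagnostic suggested above might look like this for a file that fits in RAM; the file name and the 240-character expected width are hypothetical placeholders:

## Tabulate line lengths to spot malformed rows.
lines <- readLines("mydata.txt")
table(nchar(lines))                      # most lines share one length
bad_rows <- which(nchar(lines) != 240)   # rows that deviate from it
head(lines[bad_rows])                    # eyeball the first offenders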
Jim Lemon
2016-Nov-06 21:56 UTC
[R] How to pre-process fwf or csv files to remove unexpected characters in R?
Hi Lucas,

This is a rough outline of something I programmed years ago for data cleaning (the original was written in C). The basic idea is to read the file line by line and check for a problem (in the initial application this was a discrepancy between two lines that were supposed to be identical). Here, if the line is the wrong length (optional) or contains an unwanted character (specified either as a set of acceptable characters or a set of unacceptable ones), the line is displayed in an editor in which the user can manually fix it. The old file is written line by line to a new file, which then replaces the old one. For files in which bad lines are uncommon this worked very well, as the user only had to deal with the errors. It is also only useful for files whose lines contain only printable characters. Note that this is only a sketch; I have not tested it.

cleanFile <- function(filename, llength=NA, goodchars="[[:print:]]", badchars=NA) {
  infile <- file(filename, open="r")
  if(inherits(infile, "connection")) {
    outfilename <- paste("cF", filename, sep="")
    outfile <- file(outfilename, open="w")
    done <- FALSE
    while(!done) {
      nextline <- readLines(infile, n=1)
      if(length(nextline) == 0) {
        done <- TRUE                     # end of file
      } else {
        # wrong line length (checked only if llength was supplied)
        if(!is.na(llength) && nchar(nextline) != llength)
          nextline <- edit(nextline)
        # line contains a character outside the acceptable set
        if(!grepl(paste0("^", goodchars, "*$"), nextline))
          nextline <- edit(nextline)
        # line contains an explicitly unacceptable character
        if(!is.na(badchars) && grepl(badchars, nextline))
          nextline <- edit(nextline)
        writeLines(nextline, outfile)
      }
    }
    close(infile)
    close(outfile)
    file.remove(filename)
    file.rename(outfilename, filename)
  } else {
    cat("Cannot open", filename, "\n")
  }
}

Jim
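A hypothetical call to the sketch above, with the file name, line width, and forbidden-character set all made up for illustration:

## Interactively fix lines that are not 80 characters wide or that
## contain "@" or "#", writing the result back over mydata.txt.
cleanFile("mydata.txt", llength = 80, badchars = "[@#]")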
Lucas Ferreira Mation
2016-Nov-06 22:07 UTC
[R] How to pre-process fwf or csv files to remove unexpected characters in R?
Thank you Bert, Jeff and David for the great answers. Let me provide more context to clarify the question:

- I am running this on a large server (512GB of RAM), so the data still fits into memory (and I also know how to process it in chunks if necessary).
- I agree that a DBMS or other software would be better suited, but our team (part of a government body) is mostly comprised of statisticians and economists who know R but not much of other languages, nor do they have admin permissions to set up a DBMS. IT support is available but much scarcer. So a pure R solution may be useful, and I want to understand how much worse or more impractical it is. Performance is a secondary concern.
- I haven't provided more concrete examples because I did not even know how to find the lines that contain the errors. With the suggestions given, I think I will be able to do that.

The mockup of the ideal function I would dream of having is something like:

clean_my_fwf <- function(infile, outfile, col_positions, col_types) {
  # - import the data from infile
  # - in each row, check whether the content at each character position
  #   corresponds to the expected content of that variable, given
  #   col_positions and col_types; this must be true for all columns
  # - write all rows that pass this test to outfile
  # - return vector_of_error_row_numbers: the row numbers (in the
  #   original dataset) of all rows that fail the test
}

where col_positions and col_types follow the syntax of readr::read_fwf. It could be parallelized and C++ based.
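A minimal single-pass sketch of the clean_my_fwf() mockup above. Several assumptions here are not part of the original post: col_positions is a plain data frame with columns start, end, and type ("n" for numeric, "c" for character) rather than readr's fwf_positions() object; the file fits in memory; a numeric field is valid when it holds only blanks and an optionally signed number; a character field is valid when it holds only printable ASCII; and the chunking, parallelism, and C++ parts are left out:

## Sketch of the mockup, under the assumptions stated above.
clean_my_fwf <- function(infile, outfile, col_positions) {
  lines <- readLines(infile)
  expected_width <- max(col_positions$end)
  ok <- nchar(lines) == expected_width           # reject mis-sized rows
  for (i in seq_len(nrow(col_positions))) {
    slice <- substr(lines, col_positions$start[i], col_positions$end[i])
    if (col_positions$type[i] == "n") {
      # numeric column: blanks or a plain (possibly signed) number
      ok <- ok & grepl("^ *[-+]?[0-9]*\\.?[0-9]* *$", slice)
    } else {
      # character column: printable ASCII only
      ok <- ok & !grepl("[^ -~]", slice)
    }
  }
  writeLines(lines[ok], outfile)
  which(!ok)   # row numbers of the rejected lines
}

## Hypothetical usage with a two-column layout:
## pos <- data.frame(start=c(1, 11), end=c(10, 80), type=c("n", "c"))
## bad <- clean_my_fwf("raw.txt", "clean.txt", pos)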