Lucas Ferreira Mation
2016-Nov-06 13:36 UTC
[R] How to pre-process fwf or csv files to remove unexpected characters in R?
I have some large .txt files (about ~100GB) containing a dataset in fixed-width format. The files contain some errors:

- alphabetic characters in columns that are supposed to be numeric,
- invalid characters,
- rows with too many characters, possibly due to invalid characters or a missing end-of-line character (so two rows in the original data become one row in the .txt file).

The errors are not very frequent, but they stop me from importing with readr::read_fwf().

Is there some package, or workflow, in R to pre-process the files, separating the valid from the invalid rows into different files? This can be done with point-and-click ETL tools such as Pentaho PDI. Is there equivalent code in R to do this?

I googled it and could not find a solution. I also asked this on StackOverflow and got no answer (here: <http://stackoverflow.com/questions/39414886/fix-errors-in-csv-and-fwf-files-corrupted-characters-when-importing-to-r>).

regards
Lucas Mation
IPEA - Brasil
Jeff Newmiller
2016-Nov-06 16:12 UTC
[R] How to pre-process fwf or csv files to remove unexpected characters in R?
?readLines ... given the large size of the file you may need to process it in chunks, by specifying a file connection rather than a character-string file name and using the "n" argument.

?grepl
?Extract
?tools::showNonASCII

There are many ways for data to be corrupted... in particular, when invalid characters appear the possibilities explode, so more specifics are needed if this is not enough. Of course, reading the Posting Guide, posting in plain text to avoid HTML corruption, and giving reproducible examples will improve the quality of the responses to those questions.

--
Sent from my phone. Please excuse my brevity.
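A minimal sketch of that chunked approach, splitting valid and invalid lines into separate files. The file names, chunk size, and expected record width of 200 are made-up assumptions:

  ## read in chunks over a connection; flag lines with the wrong length
  ## or with non-printable characters; file names are hypothetical
  infile  <- file("data.txt", open = "r")
  goodcon <- file("good.txt", open = "w")
  badcon  <- file("bad.txt",  open = "w")
  expected_width <- 200                      # assumed record length
  repeat {
    chunk <- readLines(infile, n = 100000)   # 100k lines at a time
    if (length(chunk) == 0) break            # end of file
    bad <- nchar(chunk) != expected_width |  # wrong record length, or
           grepl("[^[:print:]]", chunk)      # a non-printable character
    writeLines(chunk[!bad], goodcon)
    writeLines(chunk[bad],  badcon)
  }
  close(infile); close(goodcon); close(badcon)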
David Winsemius
2016-Nov-06 16:16 UTC
[R] How to pre-process fwf or csv files to remove unexpected characters in R?
Had I seen it there I would have voted to close (and just did) that SO question as too broad; it is also too vague, for lack of a definition of "corrupted characters", and it is basically a request for a package recommendation (which is also off-topic on SO).

For the csv part of the task on a smaller file (which you didn't repeat here) I would have pointed you to this answer:

http://stackoverflow.com/questions/19082490/how-can-i-use-r-to-find-malformed-rows-and-fields-in-a-file-too-big-to-read-into/19083665#19083665

For the fwf part (in a file that fits into RAM), I would have suggested wrapping table(nchar( . )) around readLines(filename), and then drilling down with which( nchar( . ) == <chosen_line_length> ).

I believe searching Rhelp will bring up examples of how to handle file input in chunks, which should allow you to cobble together a strategy if you insist on using R ... the wrong tool. If you need to narrow your Rhelp archive search, I suggest using the names "Jim Holtman", "William Dunlap", or "Gabor Grothendieck", since they frequently have the most elegant strategies, in my opinion.

Here's a search strategy implemented via MarkMail:

http://markmail.org/search/?q=list%3Aorg.r-project.r-help+file+chunks+readlines

But for files of the size you contemplate I would suggest using databases, awk, or other editing software designed for streaming processing from disk. R is not so designed.

--
David Winsemius
Alameda, CA, USA
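A sketch of that line-length diagnostic, assuming the file fits in RAM (the file name and expected width of 200 are made up):

  lines <- readLines("data.txt")          # hypothetical file name
  lens  <- nchar(lines)
  table(lens)                             # the modal length is the valid width
  bad_rows <- which(lens != 200)          # drill down on the rest
  writeLines(lines[bad_rows], "bad.txt")  # inspect the offenders separately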
Jim Lemon
2016-Nov-06 21:56 UTC
[R] How to pre-process fwf or csv files to remove unexpected characters in R?
Hi Lucas,
This is a rough outline of something I programmed years ago (in C) for
data cleaning. The basic idea is to read the file line by line and
check each line for a problem (in the initial application this was a
discrepancy between two lines that were supposed to be identical).
Here, if the line is the wrong length (optional) or contains an
unwanted character (specified either as a set of acceptable characters
or a set of unacceptable ones), the line is displayed in an editor in
which the user can fix it manually. The old file is written line by
line to a new file, which then replaces the old one. For files in
which bad lines are uncommon, this worked very well, as the user only
had to deal with the errors. It is only useful for files whose lines
contain nothing but printable characters. Note that this is only a
sketch and I have not tested it.
cleanFile <- function(filename, llength = NA,
                      goodchars = "[:print:]", badchars = NA) {
  infile <- file(filename, open = "r")
  if (inherits(infile, "connection")) {
    outfilename <- paste("cF", filename, sep = "")
    outfile <- file(outfilename, "w")
    repeat {
      nextline <- readLines(infile, 1)
      if (length(nextline) == 0) break               # end of file
      # wrong line length (check skipped when llength is NA)
      if (!is.na(llength) && nchar(nextline) != llength)
        nextline <- edit(nextline)
      # contains a character outside the acceptable set
      if (grepl(paste0("[^", goodchars, "]"), nextline))
        nextline <- edit(nextline)
      # contains a character from the unacceptable set
      if (!is.na(badchars) && grepl(paste0("[", badchars, "]"), nextline))
        nextline <- edit(nextline)
      writeLines(nextline, outfile)
    }
    close(infile)
    close(outfile)
    file.remove(filename)
    file.rename(outfilename, filename)
  } else {
    cat("Cannot open", filename, "\n")
  }
}
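For example, to fix lines that are not exactly 200 characters wide or that contain non-printable characters (the file name and width are made up):

  cleanFile("mydata.txt", llength = 200)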
Jim
Lucas Ferreira Mation
2016-Nov-06 22:07 UTC
[R] How to pre-process fwf or csv files to remove unexpected characters in R?
Thank you Bert, Jeff and David for the great answers.

Let me provide more context to clarify the question:

- I am running this on a large server (512GB of RAM), so the data still fits into memory (and I also know how to process it in chunks if necessary).
- I agree that a DBMS or other software would be better suited, but our team (part of a government body) is mostly comprised of statisticians and economists who know R but not much of other languages, and who lack the admin permissions to set up a DBMS. IT support is available but much scarcer. So a pure R solution may be useful, and I want to understand how much worse or more impractical it is. Performance is a secondary concern.
- I hadn't provided more concrete examples because I did not even know how to find the lines that contain errors. With the suggestions given, I think I will be able to do that.
The mockup of the ideal function I would dream of having is something like:

clean_my_fwf <- function(infile, outfile, col_positions, col_types) {
  # - import data from infile
  # - in each row, check whether the content at each character position
  #   corresponds to the expected content of that variable, given
  #   col_positions and col_types; this must be true for all columns
  # - write all rows that pass this test to outfile
  # - return a vector containing the row numbers (in the original
  #   dataset) of all rows that fail the test
}

where col_positions and col_types follow the syntax of readr::read_fwf.
Ideally it could be parallelized and C++-based.
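A rough pure-R sketch of that mockup, under stated assumptions: col_positions is a data frame with integer columns begin and end, one row per field (roughly the shape readr::fwf_positions() produces), and col_types is a character vector using a made-up convention of "n" for numeric fields and "c" for character fields. Chunked reading and parallelization are omitted for brevity, and it is untested against real data:

  clean_my_fwf <- function(infile, outfile, col_positions, col_types) {
    # regex each field must match, by declared type ("n" = possibly
    # space-padded number; "c" = any printable characters)
    patterns <- ifelse(col_types == "n",
                       "^ *-?[0-9]+(\\.[0-9]+)? *$",
                       "^[[:print:]]*$")
    lines <- readLines(infile)
    ok <- rep(TRUE, length(lines))
    for (j in seq_len(nrow(col_positions))) {
      field <- substr(lines, col_positions$begin[j], col_positions$end[j])
      ok <- ok & grepl(patterns[j], field)
    }
    writeLines(lines[ok], outfile)   # rows that pass every check
    which(!ok)                       # row numbers of the failures
  }

A production version would read the file in chunks over a connection (as Jeff suggested) and could move the per-field checks to C++ via Rcpp, but the row-filtering logic would stay the same. Note that a row of the wrong total length fails naturally, because its truncated or shifted fields no longer match their per-column patterns.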