Lucas Ferreira Mation
2016-Nov-06 13:36 UTC
[R] How to pre-process fwf or csv files to remove unexpected characters in R?
I have some large .txt files, about ~100GB, containing a dataset in fixed-width format. The files contain some errors:

- alphabetic characters in columns that are supposed to be numeric,
- invalid characters,
- rows with too many characters, possibly due to invalid characters or a missing end-of-line character (so two rows in the original data become one row in the .txt file).

The errors are not very frequent, but they stop me from importing with readr::read_fwf().

Is there some package, or workflow, in R to pre-process the files, separating the valid rows from the invalid rows into different files? This can be done with point-and-click ETL tools, such as Pentaho PDI. Is there equivalent code in R to do this?

I googled it and could not find a solution. I also asked this on StackOverflow and got no answer (here <http://stackoverflow.com/questions/39414886/fix-errors-in-csv-and-fwf-files-corrupted-characters-when-importing-to-r>).

regards
Lucas Mation
IPEA - Brasil
Jeff Newmiller
2016-Nov-06 16:12 UTC
[R] How to pre-process fwf or csv files to remove unexpected characters in R?
?readLines ... given the large size of the file, you may need to process it in chunks by specifying a file connection rather than a character-string file name and using the "n" argument.

?grepl
?Extract
?tools::showNonASCII

There are many ways for data to be corrupted... in particular, when invalid characters appear the possibilities explode, so more specifics are needed if this is not enough.

Of course, reading the Posting Guide, posting in plain text to avoid HTML corruption, and giving reproducible examples will improve the quality of responses to those questions.
--
Sent from my phone. Please excuse my brevity.
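A minimal sketch of the chunked readLines()/grepl() approach described above, assuming (purely for illustration) that valid lines are exactly 80 characters of printable ASCII; the file names, the width, and the chunk size are all placeholders:

## Split a large file into clean and suspect lines, one chunk at a time.
## Assumptions (hypothetical): valid lines are exactly 80 characters and
## contain only printable ASCII; adjust the width and the test as needed.
expected_width <- 80
con  <- file("big.txt",  open = "r")
good <- file("good.txt", open = "w")
bad  <- file("bad.txt",  open = "w")
repeat {
  chunk <- readLines(con, n = 100000)     # read the next chunk of lines
  if (length(chunk) == 0) break           # end of file
  ok <- nchar(chunk) == expected_width &  # expected width, and
        !grepl("[^ -~]", chunk)           # nothing outside printable ASCII
  writeLines(chunk[ok],  good)
  writeLines(chunk[!ok], bad)
}
close(con); close(good); close(bad)

If the file also contains embedded NUL bytes, readLines()'s skipNul=TRUE argument may help.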
David Winsemius
2016-Nov-06 16:16 UTC
[R] How to pre-process fwf or csv files to remove unexpected characters in R?
> I googled it and could not find a solution. I also asked this in StackOverflow and got no answer (here <http://stackoverflow.com/questions/39414886/fix-errors-in-csv-and-fwf-files-corrupted-characters-when-importing-to-r>).

Had I seen it there I would have voted to close (and just did) that SO question as too broad, although it is also too vague, given the lack of a definition of "corrupted characters", and furthermore it is basically a request for a package recommendation (which is also off-topic on SO).

For the csv part of the task on a smaller file (which you didn't repeat here) I would have pointed you to this answer:

http://stackoverflow.com/questions/19082490/how-can-i-use-r-to-find-malformed-rows-and-fields-in-a-file-too-big-to-read-into/19083665#19083665

For the fwf part (in a file that fits into RAM), I would have suggested wrapping table(nchar( . )) around readLines(filename), and then drilling down with which( nchar( . ) == <chosen_line_length> ).

I believe searching Rhelp will bring up examples of how to handle file input in chunks, which should allow you to cobble together a strategy if you insist on using R ... the wrong tool. If you need to narrow your Rhelp archive search, I suggest using the names "Jim Holtman", "William Dunlap", or "Gabor Grothendieck", since they frequently have the most elegant strategies, in my opinion.

Here's a search strategy implemented via MarkMail:
http://markmail.org/search/?q=list%3Aorg.r-project.r-help+file+chunks+readlines

But for files of the size you contemplate I would suggest using databases, awk, or other editing software that is designed for streaming processing from disk. R is not so designed.

--
David.

David Winsemius
Alameda, CA, USA
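Concretely, the line-length diagnostic suggested above might look like this for a file that fits in RAM; the file name and the 240-character expected width are hypothetical placeholders:

## Tabulate line lengths to spot malformed rows.
lines <- readLines("mydata.txt")
table(nchar(lines))                      # most lines share one length
bad_rows <- which(nchar(lines) != 240)   # rows that deviate from it
head(lines[bad_rows])                    # eyeball the first offenders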
Jim Lemon
2016-Nov-06 21:56 UTC
[R] How to pre-process fwf or csv files to remove unexpected characters in R?
Hi Lucas,

This is a rough outline of something I programmed years ago for data cleaning (the original was written in C). The basic idea is to read the file line by line and check for a problem (in the initial application this was a discrepancy between two lines that were supposed to be identical). Here, if the line is the wrong length (optional) or contains an unwanted character (specified either as a set of acceptable characters or a set of unacceptable ones), the line is displayed in an editor in which the user can manually fix it. The old file is written line by line to a new file, which then replaces the old one. For files in which bad lines are uncommon this worked very well, as the user only had to deal with the errors. It is also only useful for files whose lines contain only printable characters. Note that this is only a sketch; I have not tested it.

cleanFile <- function(filename, llength=NA, goodchars="[[:print:]]", badchars=NA) {
  infile <- file(filename, open="r")
  if(inherits(infile, "connection")) {
    outfilename <- paste("cF", filename, sep="")
    outfile <- file(outfilename, open="w")
    done <- FALSE
    while(!done) {
      nextline <- readLines(infile, n=1)
      if(length(nextline) == 0) {
        done <- TRUE                     # end of file
      } else {
        # wrong line length (checked only if llength was supplied)
        if(!is.na(llength) && nchar(nextline) != llength)
          nextline <- edit(nextline)
        # line contains a character outside the acceptable set
        if(!grepl(paste0("^", goodchars, "*$"), nextline))
          nextline <- edit(nextline)
        # line contains an explicitly unacceptable character
        if(!is.na(badchars) && grepl(badchars, nextline))
          nextline <- edit(nextline)
        writeLines(nextline, outfile)
      }
    }
    close(infile)
    close(outfile)
    file.remove(filename)
    file.rename(outfilename, filename)
  } else {
    cat("Cannot open", filename, "\n")
  }
}

Jim
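A hypothetical call to the sketch above, with the file name, line width, and forbidden-character set all made up for illustration:

## Interactively fix lines that are not 80 characters wide or that
## contain "@" or "#", writing the result back over mydata.txt.
cleanFile("mydata.txt", llength = 80, badchars = "[@#]")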
Lucas Ferreira Mation
2016-Nov-06 22:07 UTC
[R] How to pre-process fwf or csv files to remove unexpected characters in R?
Thank you Bert, Jeff and David for the great answers. Let me provide more context to clarify the question:

- I am running this on a large server (512GB of RAM), so the data still fits into memory (and I also know how to process it in chunks if necessary).
- I agree that a DBMS or other software would be better suited, but our team (part of a government body) is mostly comprised of statisticians and economists who know R but not much of other languages, nor do they have admin permissions to set up a DBMS. IT support is available but much scarcer. So a pure R solution may be useful, and I want to understand how much worse or more impractical it is. Performance is a secondary concern.
- I haven't provided more concrete examples because I did not even know how to find the lines that contain the errors. With the suggestions given, I think I will be able to do that.

The mockup of the ideal function I would dream of having is something like:

clean_my_fwf <- function(infile, outfile, col_positions, col_types) {
  # - import the data from infile
  # - in each row, check whether the content at each character position
  #   corresponds to the expected content of that variable, given
  #   col_positions and col_types; this must be true for all columns
  # - write all rows that pass this test to outfile
  # - return vector_of_error_row_numbers: the row numbers (in the
  #   original dataset) of all rows that fail the test
}

where col_positions and col_types follow the syntax of readr::read_fwf. It could be parallelized and C++ based.
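A minimal single-pass sketch of the clean_my_fwf() mockup above. Several assumptions here are not part of the original post: col_positions is a plain data frame with columns start, end, and type ("n" for numeric, "c" for character) rather than readr's fwf_positions() object; the file fits in memory; a numeric field is valid when it holds only blanks and an optionally signed number; a character field is valid when it holds only printable ASCII; and the chunking, parallelism, and C++ parts are left out:

## Sketch of the mockup, under the assumptions stated above.
clean_my_fwf <- function(infile, outfile, col_positions) {
  lines <- readLines(infile)
  expected_width <- max(col_positions$end)
  ok <- nchar(lines) == expected_width           # reject mis-sized rows
  for (i in seq_len(nrow(col_positions))) {
    slice <- substr(lines, col_positions$start[i], col_positions$end[i])
    if (col_positions$type[i] == "n") {
      # numeric column: blanks or a plain (possibly signed) number
      ok <- ok & grepl("^ *[-+]?[0-9]*\\.?[0-9]* *$", slice)
    } else {
      # character column: printable ASCII only
      ok <- ok & !grepl("[^ -~]", slice)
    }
  }
  writeLines(lines[ok], outfile)
  which(!ok)   # row numbers of the rejected lines
}

## Hypothetical usage with a two-column layout:
## pos <- data.frame(start=c(1, 11), end=c(10, 80), type=c("n", "c"))
## bad <- clean_my_fwf("raw.txt", "clean.txt", pos)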