Sebastien Bihorel
2019-Nov-14 15:57 UTC
[R] Can file size affect how na.strings operates in a read.table call?
The data file is a csv file. Some text variables contain spaces.

"Check for extraneous spaces"
Are there specific locations that would be more critical than others?

________________________________
From: Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
Sent: Thursday, November 14, 2019 10:52
To: Sebastien Bihorel <Sebastien.Bihorel at cognigencorp.com>; Sebastien Bihorel via R-help <r-help at r-project.org>; r-help at r-project.org <r-help at r-project.org>
Subject: Re: [R] Can file size affect how na.strings operates in a read.table call?

Check for extraneous spaces. You may need more variations of the na.strings.

On November 14, 2019 7:40:42 AM PST, Sebastien Bihorel via R-help <r-help at r-project.org> wrote:
>Hi,
>
>I have this generic function to read ASCII data files. It is
>essentially a wrapper around the read.table function. My function is
>used in a large variety of situations and has no a priori knowledge
>about the data file it is asked to read. Nothing is known about file
>size, variable types, variable names, or data table dimensions.
>
>One argument of my function is na.strings which is passed down to
>read.table.
>
>Recently, a user tried to read a data file of ~ 80 MB (~ 93000 rows by
>~ 160 columns) using na.strings = c('-99', '.') with the intention of
>interpreting '.' and '-99'
>strings as the internal missing data NA. Dots were converted to NA
>appropriately. However, not all -99 values in the data were interpreted
>as NA. In some variables, -99 was converted to NA, while in others -99
>was read as a number. More surprisingly, when the data file was cut into
>smaller chunks (i.e., by dropping either rows or columns) saved in
>multiple files, the function calls applied on the new data files
>resulted in the correct conversion of the -99 values into NAs.
>
>In all cases, the data frames produced by read.table contained the
>expected number of records.
>
>While, at face value, it appears that file size affects how the
>na.strings argument operates, I am wondering if there is something else
>at play here.
>
>Unfortunately, I cannot share the data file for confidentiality reasons
>but was wondering if you could suggest some checks I could perform to
>get to the bottom of this issue.
>
>Thank you in advance for your help and sorry for the lack of
>reproducible example.
>
>
>______________________________________________
>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

--
Sent from my phone. Please excuse my brevity.
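One check that could be run on the real file is sketched below. It is only an illustration under stated assumptions: the string `s` is a small made-up stand-in for the confidential data, and the variable names `chr` and `padded` are hypothetical. The idea is to read every field as text so that padding survives, then count, per column, the cells that equal "-99" only after trimming.

```r
# Made-up sample standing in for the confidential file (assumption)
s <- "A,B,C\n0,0,0\n1,-99,-99\n2,-99 ,-99\n3, -99, -99\n"

# Read every field as character so leading/trailing blanks are preserved
chr <- read.csv(text = s, header = TRUE, colClasses = "character",
                strip.white = FALSE)

# Per column: number of cells that are "-99" only once whitespace is trimmed
padded <- vapply(chr, function(x) {
  sum(!is.na(x) & x != "-99" & trimws(x) == "-99")
}, integer(1))

# Columns with a non-zero count contain padded "-99" values that
# na.strings = "-99" would not match
padded[padded > 0]
```

On a real file, `text = s` would be replaced by the file path; any column reported here is a candidate for the inconsistent NA conversion described above.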
Jeff Newmiller
2019-Nov-14 16:35 UTC
[R] Can file size affect how na.strings operates in a read.table call?
Consider the following sample:

#####
# Rows 2 and 3 of the data carry trailing and leading spaces around -99
s <- "A,B,C
0,0,0
1,-99,-99
2,-99 ,-99
3, -99, -99
"

# Only exact "-99" fields become NA; padded ones do not
dta_notok <- read.csv( text = s
                     , header=TRUE
                     , na.strings = c( "-99", "" )
                     )

# Listing the padded variants explicitly catches them as well
dta_ok <- read.csv( text = s
                  , header=TRUE
                  , na.strings = c( "-99", " -99"
                                  , "-99 ", ""
                                  )
                  )

library(data.table)

# fread strips whitespace by default, so "-99" alone is enough
fdt_ok <- fread( text = s, na.strings=c( "-99", "" ) )
fdta_ok <- as.data.frame( fdt_ok )
#####

Leading and trailing spaces cause problems. The data.table::fread function has a strip.white argument that defaults to TRUE, but the resulting object is a data.table which has different semantics than a data.frame.

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
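If a plain data.frame is wanted without listing every padded variant in na.strings, one possible base R alternative is to read all fields as text, trim them, recode the missing-value tokens, and only then convert types. The sketch below is an assumption-laden illustration, not something proposed in this thread: `read_trimmed` and `na_tokens` are hypothetical names, and the sample string is made up.

```r
# Hypothetical helper: trim every field, recode NA tokens, then convert types
read_trimmed <- function(text, na_tokens) {
  # Read all columns as character so nothing is converted prematurely
  chr <- read.csv(text = text, header = TRUE, colClasses = "character",
                  strip.white = FALSE)
  chr[] <- lapply(chr, function(x) {
    x <- trimws(x)                 # drop leading/trailing whitespace
    x[x %in% na_tokens] <- NA      # recode the missing-value tokens
    type.convert(x, as.is = TRUE)  # let R pick integer/numeric/character
  })
  chr
}

# Made-up sample with padded -99 values and a "." token (assumption)
s <- "A,B,C\n0,0,.\n1,-99,-99\n2,-99 ,-99\n3, -99, -99\n"
out <- read_trimmed(s, na_tokens = c("-99", "."))
str(out)
```

This reads the data twice in effect (once as text, once through type conversion), so on an ~80 MB file it may be noticeably slower than a single read.table call.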
William Dunlap
2019-Nov-14 16:51 UTC
[R] Can file size affect how na.strings operates in a read.table call?
read.table (and friends) also have the strip.white argument:

> s <- "A,B,C\n0,0,0\n1,-99,-99\n2,-99 ,-99\n3, -99, -99\n"
> read.csv(text=s, header=TRUE, na.strings="-99", strip.white=TRUE)
  A  B  C
1 0  0  0
2 1 NA NA
3 2 NA NA
4 3 NA NA
> read.csv(text=s, header=TRUE, na.strings="-99", strip.white=FALSE)
  A   B   C
1 0   0   0
2 1  NA  NA
3 2 -99  NA
4 3 -99 -99

Bill Dunlap
TIBCO Software
wdunlap tibco.com
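For the generic wrapper described in the original question, the strip.white suggestion could be applied roughly as follows. This is a minimal sketch under assumptions: `read_ascii` is a hypothetical name for the wrapper (its real signature is not shown in the thread), and the sample string is made up. Note that the read.table documentation states strip.white is used only when sep has been specified, so the call passes sep = "," explicitly.

```r
# Hypothetical wrapper: forwards everything to read.table, but strips
# whitespace around fields by default so padded NA tokens still match
read_ascii <- function(..., na.strings = "NA", strip.white = TRUE) {
  read.table(..., na.strings = na.strings, strip.white = strip.white)
}

# Made-up sample with padded -99 values and a "." token (assumption)
s <- "A,B,C\n0,0,.\n1,-99,-99\n2,-99 ,-99\n3, -99, -99\n"

dta <- read_ascii(text = s, header = TRUE, sep = ",",
                  na.strings = c("-99", "."))

# Count of NA values per column
colSums(is.na(dta))
```

Callers who need whitespace preserved in character fields can still pass strip.white = FALSE, at the cost of having to list padded variants in na.strings themselves.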