Sebastien Bihorel
2019-Nov-14 15:57 UTC
[R] Can file size affect how na.strings operates in a read.table call?
The data file is a csv file. Some text variables contain spaces.

"Check for extraneous spaces"
Are there specific locations that would be more critical than others?

________________________________
From: Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
Sent: Thursday, November 14, 2019 10:52
To: Sebastien Bihorel <Sebastien.Bihorel at cognigencorp.com>; Sebastien Bihorel via R-help <r-help at r-project.org>; r-help at r-project.org <r-help at r-project.org>
Subject: Re: [R] Can file size affect how na.strings operates in a read.table call?

Check for extraneous spaces. You may need more variations of the na.strings.

On November 14, 2019 7:40:42 AM PST, Sebastien Bihorel via R-help <r-help at r-project.org> wrote:
>Hi,
>
>I have this generic function to read ASCII data files. It is
>essentially a wrapper around the read.table function. My function is
>used in a large variety of situations and has no a priori knowledge
>about the data file it is asked to read. Nothing is known about file
>size, variable types, variable names, or data table dimensions.
>
>One argument of my function is na.strings which is passed down to
>read.table.
>
>Recently, a user tried to read a data file of ~ 80 MB (~ 93000 rows by
>~ 160 columns) using na.strings = c('-99', '.') with the intention of
>interpreting '.' and '-99' strings as the internal missing data NA.
>Dots were converted to NA appropriately. However, not all -99 values
>in the data were interpreted as NA. In some variables, -99 was
>converted to NA, while in others -99 was read as a number. More
>surprisingly, when the data file was cut into smaller chunks (i.e., by
>dropping either rows or columns) saved in multiple files, the function
>calls applied on the new data files resulted in the correct conversion
>of the -99 values into NAs.
>
>In all cases, the data frames produced by read.table contained the
>expected number of records.
>
>While, at face value, it appears that file size affects how the
>na.strings argument operates, I am wondering if there is something
>else at play here.
>
>Unfortunately, I cannot share the data file for confidentiality reasons
>but was wondering if you could suggest some checks I could perform to
>get to the bottom of this issue.
>
>Thank you in advance for your help and sorry for the lack of
>reproducible example.
Jeff Newmiller
2019-Nov-14 16:35 UTC
[R] Can file size affect how na.strings operates in a read.table call?
Consider the following sample:
#####
s <- "A,B,C
0,0,0
1,-99,-99
2,-99 ,-99
3, -99, -99
"
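# Only fields that are exactly "-99" become NA here;
# the padded ones ("-99 ", " -99") are read as the number -99.
dta_notok <- read.csv( text = s
                     , header=TRUE
                     , na.strings = c( "-99", "" )
                     )
# Listing the padded variants explicitly catches them as well.
dta_ok <- read.csv( text = s
                  , header=TRUE
                  , na.strings = c( "-99", " -99"
                                  , "-99 ", ""
                                  )
                  )
library(data.table)
# fread strips leading and trailing white space by default.
fdt_ok <- fread( text = s, na.strings=c( "-99", "" ) )
# Convert back to a plain data.frame.
fdta_ok <- as.data.frame( fdt_ok )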
#####
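Leading and trailing spaces cause problems: in the read.csv calls above, a
padded field such as "-99 " or " -99" does not match the na.strings entry
"-99" and is read as a number instead. The data.table::fread function
has a strip.white argument that defaults to TRUE, but the resulting object
is a data.table, which has different semantics from a data.frame.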
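One possible check for a file that cannot be shared is to read every field as text and count, per column, the values that equal the missing-value code only after trimming white space. The following is a minimal sketch under stated assumptions: the toy string s stands in for the real file, and count_padded is a hypothetical helper name, not part of any package.

```r
# Toy data standing in for the confidential file (assumption).
s <- "A,B,C\n0,0,0\n1,-99,-99\n2,-99 ,-99\n3, -99, -99\n"

# Read every field as character so that no numeric conversion
# happens before inspection; strip.white = FALSE keeps the padding.
raw <- read.csv(text = s, header = TRUE,
                colClasses = "character", strip.white = FALSE)

# Hypothetical helper: number of fields that match the sentinel
# only once leading/trailing white space has been removed.
count_padded <- function(x, sentinel = "-99") {
  sum(trimws(x) == sentinel & x != sentinel, na.rm = TRUE)
}

padded <- vapply(raw, count_padded, integer(1))
padded
```

Columns with a non-zero count are the ones where a plain na.strings = "-99" would probably miss some values.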
William Dunlap
2019-Nov-14 16:51 UTC
[R] Can file size affect how na.strings operates in a read.table call?
read.table (and friends) also have the strip.white argument:

> s <- "A,B,C\n0,0,0\n1,-99,-99\n2,-99 ,-99\n3, -99, -99\n"
> read.csv(text=s, header=TRUE, na.strings="-99", strip.white=TRUE)
  A  B  C
1 0  0  0
2 1 NA NA
3 2 NA NA
4 3 NA NA
> read.csv(text=s, header=TRUE, na.strings="-99", strip.white=FALSE)
  A   B   C
1 0   0   0
2 1  NA  NA
3 2 -99  NA
4 3 -99 -99

Bill Dunlap
TIBCO Software
wdunlap tibco.com
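Since the original function is described as a wrapper around read.table, one way to apply the strip.white suggestion is to make it a default of the wrapper. This is only a sketch: read_ascii and its defaults are hypothetical and are not the poster's actual function. All other arguments are passed through unchanged via "...".

```r
# Hypothetical wrapper (name and defaults are assumptions).
# Everything in ... is forwarded to read.table as given.
read_ascii <- function(..., na.strings = c("-99", "."),
                       strip.white = TRUE) {
  read.table(..., na.strings = na.strings, strip.white = strip.white)
}

# Same toy data as in the examples in this thread.
s <- "A,B,C\n0,0,0\n1,-99,-99\n2,-99 ,-99\n3, -99, -99\n"
dta <- read_ascii(text = s, header = TRUE, sep = ",")

# Number of NA values per column.
colSums(is.na(dta))
```

Note that strip.white = TRUE also removes leading and trailing white space from unquoted character fields, which may or may not be wanted for the text variables; spaces inside a value are not affected.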