Sebastien Bihorel
2019-Nov-14 15:40 UTC
[R] Can file size affect how na.strings operates in a read.table call?
Hi,

I have a generic function to read ASCII data files. It is essentially a wrapper around the read.table function. My function is used in a large variety of situations and has no a priori knowledge about the data file it is asked to read. Nothing is known about file size, variable types, variable names, or data table dimensions.

One argument of my function is na.strings, which is passed down to read.table.

Recently, a user tried to read a data file of ~80 MB (~93000 rows by ~160 columns) using na.strings = c('-99', '.') with the intention of interpreting the '.' and '-99' strings as the internal missing value NA. Dots were converted to NA appropriately. However, not all -99 values in the data were interpreted as NA. In some variables, -99 was converted to NA, while in others -99 was read as a number. More surprisingly, when the data file was cut into smaller chunks (i.e., by dropping either rows or columns) saved in multiple files, the function calls applied to the new data files resulted in the correct conversion of the -99 values into NAs.

In all cases, the data frames produced by read.table contained the expected number of records.

While, at face value, it appears that file size affects how the na.strings argument operates, I am wondering if there is something else at play here.

Unfortunately, I cannot share the data file for confidentiality reasons, but I was wondering if you could suggest some checks I could perform to get to the bottom of this issue.

Thank you in advance for your help, and sorry for the lack of a reproducible example.
Jeff Newmiller
2019-Nov-14 15:52 UTC
[R] Can file size affect how na.strings operates in a read.table call?
Check for extraneous spaces. You may need more variations of the na.strings.
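As a minimal sketch of what the advice about extraneous spaces refers to (the data below are made up for illustration and are not taken from the original file): when sep is specified and strip.white is left at its default of FALSE, read.table compares each field with na.strings as-is, so a field such as " -99" does not match "-99" and is later converted to a number.

```r
# Made-up CSV: the -99 in column b is preceded by a space, the one in column a is not.
csv_text <- "a,b\n-99, -99\n1,2\n"

# Default strip.white = FALSE: " -99" does not match the na.strings entry "-99".
default_read <- read.table(text = csv_text, header = TRUE, sep = ",",
                           na.strings = c("-99", "."))

# strip.white = TRUE trims unquoted fields before they are compared with na.strings.
stripped_read <- read.table(text = csv_text, header = TRUE, sep = ",",
                            na.strings = c("-99", "."), strip.white = TRUE)

print(default_read)
print(stripped_read)
```

If this is indeed the cause, either passing strip.white = TRUE or adding padded variants such as " -99" to na.strings should make the conversion consistent. Note that strip.white only affects unquoted fields.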
Sebastien Bihorel
2019-Nov-14 15:57 UTC
[R] Can file size affect how na.strings operates in a read.table call?
The data file is a CSV file. Some text variables contain spaces.

> Check for extraneous spaces.

Are there specific locations that would be more critical than others?
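One possible way to look for critical locations, sketched under assumptions: read every column as character so that nothing is converted, then count, per column, the cells that equal the missing-value code only after surrounding white space is trimmed. The helper name count_padded_codes and the example file contents are hypothetical and only serve as illustration.

```r
# Hypothetical helper (not part of base R): counts, per column, the cells that
# equal `code` only after leading/trailing white space is removed.
count_padded_codes <- function(file, code = "-99", sep = ",") {
  # colClasses = "character" keeps every field as text, including its padding.
  raw <- read.table(file, header = TRUE, sep = sep,
                    colClasses = "character", strip.white = FALSE)
  vapply(raw,
         function(x) sum(trimws(x) == code & x != code, na.rm = TRUE),
         integer(1))
}

# Made-up example file with padded -99 values in two columns.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,dose,wt",
             "1,-99, -99",
             "2, -99 ,70",
             "3,5,-99"), tmp)

padded <- count_padded_codes(tmp)
print(padded)
```

Columns with a non-zero count are the ones where a space sits between the separator and the value, which is where a mismatch with na.strings would be expected.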