Hi!

It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
quoted integers as an acceptable value for columns for which
colClasses="integer". But when colClasses is omitted, these columns are
read as integer anyway.

For example, let's consider a file named file.dat, containing:
"1"
"2"

> read.table("file.dat", colClasses="integer")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  scan() expected 'an integer' and got '"1"'

But:
> str(read.table("file.dat"))
'data.frame':	2 obs. of  1 variable:
 $ V1: int  1 2

The latter result is indeed documented in ?read.table:
     Unless 'colClasses' is specified, all columns are read as
     character columns and then converted using 'type.convert' to
     logical, integer, numeric, complex or (depending on 'as.is')
     factor as appropriate. Quotes are (by default) interpreted in all
     fields, so a column of values like '"42"' will result in an
     integer column.

Should the former behavior be considered a bug?

This creates problems when combined with read.table.ffdf from package
ff, since this function tries to guess the column classes by reading
the first rows of the file, and then passes colClasses to read.table to
read the remaining rows by chunks. A column of quoted integers is
correctly detected as integer in the first read, but read.table() fails
in subsequent reads.


Regards
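
A self-contained way to reproduce the above, assuming only that the
working directory is writable:

writeLines(c('"1"', '"2"'), "file.dat")

# Fails: scan() rejects the quoted field when colClasses forces "integer"
try(read.table("file.dat", colClasses="integer"))

# Works: fields are read as character, quotes are stripped, and
# type.convert() then turns the column into an integer vector
str(read.table("file.dat"))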
On Mon, Sep 30, 2013 at 7:33 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
> Hi!
>
> It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
> quoted integers as an acceptable value for columns for which
> colClasses="integer". But when colClasses is omitted, these columns are
> read as integer anyway.
>
> [...]
>
> Should the former behavior be considered a bug?
>
No. If you tell read.table the column is integer and it's actually
character on disk, it should be an error.

> This creates problems when combined with read.table.ffdf from package
> ff, since this function tries to guess the column classes by reading the
> first rows of the file, and then passes colClasses to read.table to read
> the remaining rows by chunks. A column of quoted integers is correctly
> detected as integer in the first read, but read.table() fails in
> subsequent reads.
>
This sounds like an issue with read.table.ffdf. The column of quoted
integers is *incorrectly* detected as integer because the values are
actually character on disk. read.table.ffdf should rely on how the data
are actually stored on disk (via as.is=TRUE), not on how read.table
might convert them once they're read into R.

--
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com
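
To see what the fields actually contain, a minimal sketch against the
same file.dat as above (note that the quote argument matters here as
much as as.is):

# Forcing character succeeds: the quotes are interpreted and stripped,
# and the column comes back as the strings "1" and "2"
str(read.table("file.dat", colClasses="character"))

# With quote interpretation disabled, the quote characters themselves
# survive in the values, showing the on-disk representation is character
str(read.table("file.dat", quote="", as.is=TRUE))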
On Mon, Sep 30, 2013 at 5:33 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
> [...]
>
> Should the former behavior be considered a bug?
>
> This creates problems when combined with read.table.ffdf from package
> ff, since this function tries to guess the column classes by reading the
> first rows of the file, and then passes colClasses to read.table to read
> the remaining rows by chunks. A column of quoted integers is correctly
> detected as integer in the first read, but read.table() fails in
> subsequent reads.

readDataFrame() of the R.filesets package provides the argument
'trimQuotes' for this exact reason, i.e. for the purpose of trimming
quotes from columns for which 'colClasses' specifies a numeric type
before passing on to read.table(). Feel free to borrow from its source
code for a patch to ff::read.table.ffdf().

The workaround is in readDataFrame() for TabularTextFile
[https://r-forge.r-project.org/scm/viewvc.php/pkg/R.filesets/R/TabularTextFile.R?view=markup&root=r-dots];
look for the part that starts with:

  # SPECIAL CASE/WORKAROUND: read.table()/scan() will give an error
  # if a numeric value is quoted and 'colClasses' specifies it as
  # a numeric value. In order to read such values, we need to remove
  # the quotes first. /HB 2011-07-13

/Henrik
(author of R.filesets)
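
The idea behind that workaround, in rough outline (a sketch of the
approach, not the actual R.filesets code; the helper name
readTrimmingQuotes is made up): read the would-be numeric columns as
character, which already strips the quotes, then convert afterwards.

readTrimmingQuotes <- function(file, colClasses, ...) {
  # Downgrade numeric columns to character for the initial read, so
  # read.table() strips the quotes instead of choking on them
  numcols <- colClasses %in% c("integer", "numeric")
  tmp <- colClasses
  tmp[numcols] <- "character"
  d <- read.table(file, colClasses=tmp, ...)
  # Now perform the conversion that colClasses originally asked for
  for (j in which(numcols)) {
    d[[j]] <- as(d[[j]], colClasses[j])
  }
  d
}

str(readTrimmingQuotes("file.dat", colClasses="integer"))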
I agree that quoted integer columns are not the most efficient way of
delivering csv files. However, the sad reality is that one receives
such formats and still needs to read the data. Therefore it is not
helpful to state that one should 'consider "character" to be the
correct colClass in case an integer is surrounded by quotes'.

The philosophy of read.table.ffdf is to delegate the actual csv parsing
to a parse engine parametrized 'similarly' to 'read.table'. It is not
'bad coding practice' - but a conscious design decision - to assume
that the parse engine behaves consistently, which read.table currently
does not: it automatically recognizes a quoted integer column as
'integer', but when asked to explicitly interpret the column as
'integer' it refuses to do so. So there is nothing wrong with
read.table.ffdf (but something can be improved about read.table). It is
*not* the 'best solution [...] to rewrite read.table.ffdf()' given that
it nicely imports such data; see 4+1 ways to do so below.

Jens Oehlschlägel


# --- first create a csv file for demonstration -------------------------------

require(ff)

file <- "test.csv"
path <- "c:/tmp"

n <- 1e2
d <- data.frame(x=1:n, y=shQuote(1:n))
write.csv(d, file=file.path(path, file), row.names=FALSE, quote=FALSE)

# --- how to do it with read.table.ffdf ---------------------------------------

# 1. Let the parse engine ignore colClasses and hope for the best
fixedengine <- function(file, ..., colClasses=NA){
  read.csv(file, ...)
}
df <- read.table.ffdf(file=file.path(path, file), first.rows=10,
                      FUN="fixedengine")
df

# 2. Suspend colClasses (=NA) for the quoted integer column only
df <- read.csv.ffdf(file=file.path(path, file), first.rows=10,
                    colClasses=c("integer", NA))
df

# 3. Do your own type conversion using transFUN,
#    after reading the problematic column as character.
#    Being able to inject regexps is quite powerful, isn't it?
#    Or error handling in case of varying column format!
custominterp <- function(d){
  d[[2]] <- as.integer(gsub('"', '', d[[2]]))
  d
}
df <- read.table.ffdf(file=file.path(path, file), first.rows=10,
                      colClasses=c("integer", "character"),
                      FUN="read.csv", transFUN=custominterp)
df

# 4. Do your own line parsing and type conversion.
#    Here you can even handle non-standard formats
#    such as a varying number of columns.
customengine <- function(file, header=TRUE, col.names, colClasses=NA,
                         nrows=0, skip=0, fileEncoding="",
                         comment.char=""){
  l <- scan(file, what="character", nlines=nrows+header, skip=skip,
            fileEncoding=fileEncoding, comment.char=comment.char)
  s <- do.call("rbind", strsplit(l, ","))
  if (header){
    d <- data.frame(as.integer(s[-1,1]), as.integer(gsub('"', '', s[-1,2])))
    names(d) <- s[1,]
  }else{
    d <- data.frame(as.integer(s[,1]), as.integer(gsub('"', '', s[,2])))
  }
  if (!missing(col.names))
    names(d) <- col.names
  d
}
df <- read.table.ffdf(file=file.path(path, file), first.rows=10,
                      FUN="customengine")
df

# 5. Use a parsing engine that can apply colClasses to quoted integers.
#    Unfortunately Henrik Bengtsson's readDataFrame does not work as a
#    parse engine for read.table.ffdf because read.table.ffdf expects
#    the parse engine to read successive chunks from a file connection,
#    while readDataFrame only accepts a filename as input file spec.
#    Yes it has 'skip', but using that would reread the file from scratch
#    for each chunk (O(N^2) costs).
Milan Bouchet-Valat wrote
> It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
> quoted integers as an acceptable value for columns for which
> colClasses="integer". But when colClasses is omitted, these columns are
> read as integer anyway.
>
> For example, let's consider a file named file.dat, containing:
> "1"
> "2"
>
>> read.table("file.dat", colClasses="integer")
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>   scan() expected 'an integer' and got '"1"'

Hi

I just ran into a variation of this. I'm teaching myself agent-based
modelling from a book that uses NetLogo as the implementation
language [1]. NetLogo has a feature called BehaviorSpace that runs
models over a varying range of parameter values and makes arbitrary
observations at each time step, which it then outputs to a CSV.

One of the exercises involves plotting some graphs of a model run, but
the output needs some processing before it can be graphed. Rather than
hack away at the data by hand each time I run it, I decided to find a
stats package to help, and I chose R. I'm a complete beginner to R, and
I've been using the R in Action early access PDF as a guide [2]. I'm
using R 3.1.0 GUI 1.64 Mavericks build (6734).

The NetLogo CSV writer quotes all values, and mixes integers and
floats. So a column of data might contain, say (with the quotes
actually in the file): "0", "1.25", "1", "2", "3.175".

I tried importing the data like this:

profit <- read.csv("BusinessInvestor1 Profit-table.csv", sep=",",
                   header=TRUE, skip=6)

But then some of the data is read in as factors:

str(profit)
'data.frame':	1560 obs. of  9 variables:
 $ X.run.number.                                                  : int  8 6 2 7 5 1 3 4 6 8 ...
 $ restrict.sensing.radius                                        : Factor w/ 1 level "false": 1 1 1 1 1 1 1 1 1 1 ...
 $ risk.multiplier                                                : int  1 1 1 1 1 1 1 1 1 1 ...
 $ sensing.radius                                                 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ profit.multiplier                                              : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
 $ X.step.                                                        : int  0 0 0 0 0 0 0 0 1 1 ...
 $ mean..wealth..of.turtles                                       : Factor w/ 1501 levels "0","100038.136",..: 1 1 1 1 1 1 1 1 623 550 ...
 $ mean..profit..of.patches.with..any..turtles.here.              : Factor w/ 1547 levels "2503.675","2582.275",..: 1 8 7 6 5 4 3 10 278 230 ...
 $ mean..failure.probability..of.patches.with..any..turtles.here.: Factor w/ 1558 levels "0.026069451281579437",..: 1504 1528 1508 1518 1516 1514 1512 1536 1321 1471 ...

(For reasons I don't understand, the profit.multiplier parameter, which
runs "0.5", "0.6", ..., "1", is imported as a numeric, whereas the
observation values get turned into factors.)

I read about colClasses, but this trips over the "quoted integers
aren't integers" bug:

profit <- read.csv("BusinessInvestor1 Profit-table.csv", sep=",",
                   header=TRUE, skip=6,
                   colClasses=c("integer", "logical", "numeric",
                                "numeric", "numeric", "integer",
                                "numeric", "numeric", "numeric"))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  scan() expected 'an integer', got '"8"'

I created a little script to import the data, and use some casts to
clean it up. At first I thought it was working, until I realised this
line (to process CSV data in the range 0...1):

profit$mean_failure_probability_of_inhabited <-
  as.numeric(profit$mean_failure_probability_of_inhabited)

was producing crazy values:

str(profit)
'data.frame':	1560 obs. of  9 variables:
...
 $ mean_failure_probability_of_inhabited: num  1504 1528 1508 1518 1516 ...
Eventually I figured out to do this (although I haven't yet figured out
why):

profit$mean_failure_probability_of_inhabited <-
  as.numeric(as.character(profit$mean_failure_probability_of_inhabited))

Anyway, for a beginner coming to R, this is all REALLY confusing, and
it's taken me several hours to get my head round it. Although after
reading about it a bit I can see the implementation issues causing this
behaviour, as a noob it just feels like "R can't import CSV data". The
most baffling thing is how telling R what format the data in each
column is in actually *reduces* its ability to read the file! (For a
while I thought it was complaining because "8" is an integer, not a
real, but now I see it's because it's seeing it as a string.)

My understanding of CSV was the same as Peter Meilstrup describes it
later in the thread: that quotes in a CSV are there to allow the
delimiter character in a value, and don't imply anything about the type
of the data (because CSVs are untyped).

Googling the scan() error led to this mailing list thread, so I thought
I'd describe my experience. If there's a more intuitive way for
read.csv / read.table to work, it might save beginners like me a lot of
head-scratching!

Best regards

Ash

[1] http://www.amazon.com/dp/0691136742/
[2] http://www.manning.com/kabacoff2/
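
For anyone else who hits this, the explanation for the "crazy values":
a factor stores its values as integer codes into a table of levels, and
as.numeric() on a factor returns those codes, not the labels. Going
through as.character() first recovers the labels, which can then be
parsed as numbers. A standalone demonstration, not tied to the NetLogo
data:

f <- factor(c("0.5", "0.25", "0.75"))
as.numeric(f)                # 2 1 3  -- the internal level codes
as.numeric(as.character(f))  # 0.50 0.25 0.75  -- the actual values

# Reading with stringsAsFactors=FALSE (or as.is=TRUE) avoids the factor
# step entirely, so as.numeric() can be applied to the columns directly.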