Hi!

It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
quoted integers as an acceptable value for columns for which
colClasses="integer". But when colClasses is omitted, these columns are
read as integer anyway.

For example, let's consider a file named file.dat, containing:
"1"
"2"

> read.table("file.dat", colClasses="integer")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  scan() expected 'an integer' and got '"1"'

But:
> str(read.table("file.dat"))
'data.frame':	2 obs. of  1 variable:
 $ V1: int  1 2

The latter result is indeed documented in ?read.table:
     Unless 'colClasses' is specified, all columns are read as
     character columns and then converted using 'type.convert' to
     logical, integer, numeric, complex or (depending on 'as.is')
     factor as appropriate. Quotes are (by default) interpreted in all
     fields, so a column of values like '"42"' will result in an
     integer column.

Should the former behavior be considered a bug?

This creates problems when combined with read.table.ffdf from package
ff, since this function tries to guess the column classes by reading
the first rows of the file, and then passes colClasses to read.table to
read the remaining rows by chunks. A column of quoted integers is
correctly detected as integer in the first read, but read.table() fails
in subsequent reads.


Regards
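
A self-contained way to reproduce the above, assuming only that the
working directory is writable:

writeLines(c('"1"', '"2"'), "file.dat")

# Fails: scan() rejects the quoted field when colClasses forces "integer"
try(read.table("file.dat", colClasses="integer"))

# Works: fields are read as character, quotes are stripped, and
# type.convert() then turns the column into an integer vector
str(read.table("file.dat"))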
On Mon, Sep 30, 2013 at 7:33 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
> Hi!
>
> It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
> quoted integers as an acceptable value for columns for which
> colClasses="integer". But when colClasses is omitted, these columns are
> read as integer anyway.
>
> [...]
>
> Should the former behavior be considered a bug?
>
No. If you tell read.table the column is integer and it's actually
character on disk, it should be an error.

> This creates problems when combined with read.table.ffdf from package
> ff, since this function tries to guess the column classes by reading the
> first rows of the file, and then passes colClasses to read.table to read
> the remaining rows by chunks. A column of quoted integers is correctly
> detected as integer in the first read, but read.table() fails in
> subsequent reads.
>
This sounds like an issue with read.table.ffdf. The column of quoted
integers is *incorrectly* detected as integer because the values are
actually character on disk. read.table.ffdf should rely on how the data
are actually stored on disk (via as.is=TRUE), not on how read.table
might convert them once they're read into R.

--
Joshua Ulrich  |  about.me/joshuaulrich
FOSS Trading  |  www.fosstrading.com
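
To see what the fields actually contain, a minimal sketch against the
same file.dat as above (note that the quote argument matters here as
much as as.is):

# Forcing character succeeds: the quotes are interpreted and stripped,
# and the column comes back as the strings "1" and "2"
str(read.table("file.dat", colClasses="character"))

# With quote interpretation disabled, the quote characters themselves
# survive in the values, showing the on-disk representation is character
str(read.table("file.dat", quote="", as.is=TRUE))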
On Mon, Sep 30, 2013 at 5:33 AM, Milan Bouchet-Valat <nalimilan at club.fr> wrote:
> [...]
>
> Should the former behavior be considered a bug?
>
> This creates problems when combined with read.table.ffdf from package
> ff, since this function tries to guess the column classes by reading the
> first rows of the file, and then passes colClasses to read.table to read
> the remaining rows by chunks. A column of quoted integers is correctly
> detected as integer in the first read, but read.table() fails in
> subsequent reads.

readDataFrame() of the R.filesets package provides the argument
'trimQuotes' for this exact reason, i.e. for the purpose of trimming
quotes from columns for which 'colClasses' specifies a numeric type
before passing on to read.table(). Feel free to borrow from its source
code for a patch to ff::read.table.ffdf().

The workaround is in readDataFrame() for TabularTextFile
[https://r-forge.r-project.org/scm/viewvc.php/pkg/R.filesets/R/TabularTextFile.R?view=markup&root=r-dots];
look for the part that starts with:

  # SPECIAL CASE/WORKAROUND: read.table()/scan() will give an error
  # if a numeric value is quoted and 'colClasses' specifies it as
  # a numeric value. In order to read such values, we need to remove
  # the quotes first. /HB 2011-07-13

/Henrik
(author of R.filesets)
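
The idea behind that workaround, in rough outline (a sketch of the
approach, not the actual R.filesets code; the helper name
readTrimmingQuotes is made up): read the would-be numeric columns as
character, which already strips the quotes, then convert afterwards.

readTrimmingQuotes <- function(file, colClasses, ...) {
  # Downgrade numeric columns to character for the initial read, so
  # read.table() strips the quotes instead of choking on them
  numcols <- colClasses %in% c("integer", "numeric")
  tmp <- colClasses
  tmp[numcols] <- "character"
  d <- read.table(file, colClasses=tmp, ...)
  # Now perform the conversion that colClasses originally asked for
  for (j in which(numcols)) {
    d[[j]] <- as(d[[j]], colClasses[j])
  }
  d
}

str(readTrimmingQuotes("file.dat", colClasses="integer"))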
I agree that quoted integer columns are not the most efficient way of
delivering csv files. However, the sad reality is that one receives
such formats and still needs to read the data. Therefore it is not
helpful to state that one should 'consider "character" to be the
correct colClass in case an integer is surrounded by quotes'.

The philosophy of read.table.ffdf is to delegate the actual csv parsing
to a parse engine parametrized 'similarly' to 'read.table'. It is not
'bad coding practice' - but a conscious design decision - to assume
that the parse engine behaves consistently, which read.table currently
does not: it automatically recognizes a quoted integer column as
'integer', but when asked to explicitly interpret the column as
'integer' it refuses to do so. So there is nothing wrong with
read.table.ffdf (but something can be improved about read.table). It is
*not* the 'best solution [...] to rewrite read.table.ffdf()' given that
it nicely imports such data; see 4+1 ways to do so below.

Jens Oehlschlägel


# --- first create a csv file for demonstration -------------------------------

require(ff)

file <- "test.csv"
path <- "c:/tmp"

n <- 1e2
d <- data.frame(x=1:n, y=shQuote(1:n))
write.csv(d, file=file.path(path, file), row.names=FALSE, quote=FALSE)

# --- how to do it with read.table.ffdf ---------------------------------------

# 1. Let the parse engine ignore colClasses and hope for the best
fixedengine <- function(file, ..., colClasses=NA){
  read.csv(file, ...)
}
df <- read.table.ffdf(file=file.path(path, file), first.rows=10,
                      FUN="fixedengine")
df

# 2. Suspend colClasses (=NA) for the quoted integer column only
df <- read.csv.ffdf(file=file.path(path, file), first.rows=10,
                    colClasses=c("integer", NA))
df

# 3. Do your own type conversion using transFUN,
#    after reading the problematic column as character.
#    Being able to inject regexps is quite powerful, isn't it?
#    Or error handling in case of varying column format!
custominterp <- function(d){
  d[[2]] <- as.integer(gsub('"', '', d[[2]]))
  d
}
df <- read.table.ffdf(file=file.path(path, file), first.rows=10,
                      colClasses=c("integer", "character"),
                      FUN="read.csv", transFUN=custominterp)
df

# 4. Do your own line parsing and type conversion.
#    Here you can even handle non-standard formats
#    such as a varying number of columns.
customengine <- function(file, header=TRUE, col.names, colClasses=NA,
                         nrows=0, skip=0, fileEncoding="",
                         comment.char=""){
  l <- scan(file, what="character", nlines=nrows+header, skip=skip,
            fileEncoding=fileEncoding, comment.char=comment.char)
  s <- do.call("rbind", strsplit(l, ","))
  if (header){
    d <- data.frame(as.integer(s[-1,1]), as.integer(gsub('"', '', s[-1,2])))
    names(d) <- s[1,]
  }else{
    d <- data.frame(as.integer(s[,1]), as.integer(gsub('"', '', s[,2])))
  }
  if (!missing(col.names))
    names(d) <- col.names
  d
}
df <- read.table.ffdf(file=file.path(path, file), first.rows=10,
                      FUN="customengine")
df

# 5. Use a parsing engine that can apply colClasses to quoted integers.
#    Unfortunately Henrik Bengtsson's readDataFrame does not work as a
#    parse engine for read.table.ffdf because read.table.ffdf expects
#    the parse engine to read successive chunks from a file connection,
#    while readDataFrame only accepts a filename as input file spec.
#    Yes it has 'skip', but using that would reread the file from scratch
#    for each chunk (O(N^2) costs).
Milan Bouchet-Valat wrote
> It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
> quoted integers as an acceptable value for columns for which
> colClasses="integer". But when colClasses is omitted, these columns are
> read as integer anyway.
>
> For example, let's consider a file named file.dat, containing:
> "1"
> "2"
>
>> read.table("file.dat", colClasses="integer")
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>   scan() expected 'an integer' and got '"1"'

Hi

I just ran into a variation of this. I'm teaching myself agent-based
modelling from a book that uses NetLogo as the implementation
language [1]. NetLogo has a feature called BehaviorSpace that runs
models over a varying range of parameter values and makes arbitrary
observations at each time step, which it then outputs to a CSV.

One of the exercises involves plotting some graphs of a model run, but
the output needs some processing before it can be graphed. Rather than
hack away at the data by hand each time I run it, I decided to find a
stats package to help, and I chose R. I'm a complete beginner to R, and
I've been using the R in Action early access PDF as a guide [2]. I'm
using R 3.1.0 GUI 1.64 Mavericks build (6734).

The NetLogo CSV writer quotes all values, and mixes integers and
floats. So a column of data might contain, say (with the quotes
actually in the file): "0", "1.25", "1", "2", "3.175".

I tried importing the data like this:

profit <- read.csv("BusinessInvestor1 Profit-table.csv", sep=",",
                   header=TRUE, skip=6)

But then some of the data is read in as factors:

str(profit)
'data.frame':	1560 obs. of  9 variables:
 $ X.run.number.                                                  : int  8 6 2 7 5 1 3 4 6 8 ...
 $ restrict.sensing.radius                                        : Factor w/ 1 level "false": 1 1 1 1 1 1 1 1 1 1 ...
 $ risk.multiplier                                                : int  1 1 1 1 1 1 1 1 1 1 ...
 $ sensing.radius                                                 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ profit.multiplier                                              : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
 $ X.step.                                                        : int  0 0 0 0 0 0 0 0 1 1 ...
 $ mean..wealth..of.turtles                                       : Factor w/ 1501 levels "0","100038.136",..: 1 1 1 1 1 1 1 1 623 550 ...
 $ mean..profit..of.patches.with..any..turtles.here.              : Factor w/ 1547 levels "2503.675","2582.275",..: 1 8 7 6 5 4 3 10 278 230 ...
 $ mean..failure.probability..of.patches.with..any..turtles.here.: Factor w/ 1558 levels "0.026069451281579437",..: 1504 1528 1508 1518 1516 1514 1512 1536 1321 1471 ...

(For reasons I don't understand, the profit.multiplier parameter, which
runs "0.5", "0.6", ..., "1", is imported as a numeric, whereas the
observation values get turned into factors.)

I read about colClasses, but this trips over the "quoted integers
aren't integers" bug:

profit <- read.csv("BusinessInvestor1 Profit-table.csv", sep=",",
                   header=TRUE, skip=6,
                   colClasses=c("integer", "logical", "numeric",
                                "numeric", "numeric", "integer",
                                "numeric", "numeric", "numeric"))
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  scan() expected 'an integer', got '"8"'

I created a little script to import the data, and use some casts to
clean it up. At first I thought it was working, until I realised this
line (to process CSV data in the range 0...1):

profit$mean_failure_probability_of_inhabited <-
  as.numeric(profit$mean_failure_probability_of_inhabited)

was producing crazy values:

str(profit)
'data.frame':	1560 obs. of  9 variables:
...
 $ mean_failure_probability_of_inhabited: num  1504 1528 1508 1518 1516 ...
Eventually I figured out to do this (although I haven't yet figured out
why):

profit$mean_failure_probability_of_inhabited <-
  as.numeric(as.character(profit$mean_failure_probability_of_inhabited))

Anyway, for a beginner coming to R, this is all REALLY confusing, and
it's taken me several hours to get my head round it. Although after
reading about it a bit I can see the implementation issues causing this
behaviour, as a noob it just feels like "R can't import CSV data". The
most baffling thing is how telling R what format the data in each
column is in actually *reduces* its ability to read the file! (For a
while I thought it was complaining because "8" is an integer, not a
real, but now I see it's because it's seeing it as a string.)

My understanding of CSV was the same as Peter Meilstrup describes it
later in the thread: that quotes in a CSV are there to allow the
delimiter character in a value, and don't imply anything about the type
of the data (because CSVs are untyped).

Googling the scan() error led to this mailing list thread, so I thought
I'd describe my experience. If there's a more intuitive way for
read.csv / read.table to work, it might save beginners like me a lot of
head-scratching!

Best regards

Ash

[1] http://www.amazon.com/dp/0691136742/
[2] http://www.manning.com/kabacoff2/
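
For anyone else who hits this, the explanation for the "crazy values":
a factor stores its values as integer codes into a table of levels, and
as.numeric() on a factor returns those codes, not the labels. Going
through as.character() first recovers the labels, which can then be
parsed as numbers. A standalone demonstration, not tied to the NetLogo
data:

f <- factor(c("0.5", "0.25", "0.75"))
as.numeric(f)                # 2 1 3  -- the internal level codes
as.numeric(as.character(f))  # 0.50 0.25 0.75  -- the actual values

# Reading with stringsAsFactors=FALSE (or as.is=TRUE) avoids the factor
# step entirely, so as.numeric() can be applied to the columns directly.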