Hi,
I understand work is being done to improve read.table(), especially by
Prof. Brian D. Ripley. I offer below a version that I wrote, in the hope some
aspects of it may prove useful or at least inspire discussion.
Be aware that my version differs in a couple of fundamental ways that reflect
my aversion to dataframes and factors. It returns a list of vectors, each of
which is character, numeric, integer, or logical. It also has no row.names
column, and so no issue about what to put in that column's header.
A big advantage of my function is that the user can specify which columns
are wanted, and can force columns to be character, integer, or logical.
Unwanted trailing columns are not even read [they are "flushed" by scan()],
improving speed and memory usage. Note, though, that in S+ one can also
eliminate intermediate columns in scan() with a "what" entry of NULL; my
function was written to take advantage of that, but it doesn't work in R.
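
As a minimal sketch of the flushing behaviour described above (the four-column
data and the names "id" and "n" are made up for illustration, and this is not
the function itself): a "what" list shorter than the number of fields,
combined with flush=TRUE, makes scan() drop the trailing fields of each line.

```r
# Hypothetical tab-delimited data with four fields per line (no header).
txt <- c("a\t1\tx\t10", "b\t2\ty\t20")
tc <- textConnection(txt)

# Only two entries in "what": scan() reads the first two fields of each
# line and, because flush=TRUE, discards the rest of the line.
res <- scan(tc, what = list(id = "", n = ""), sep = "\t",
            flush = TRUE, quiet = TRUE)
close(tc)

str(res)  # a list of two character vectors, "id" and "n"
```

Both components come back as character here, since both "what" entries are
empty strings; conversion to other types is a separate step.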
The first thing it does is some optional external file manipulation
(gunzip'ing, and removing quotes with "tr") that may only work on Unix. Then
it parses a single header line (using scan() on a textConnection), and compares
the columns available with the columns wanted. Everything is scanned as
character, then columns are tested with "type.convert" and converted as
appropriate. Enjoy!
-- David Brahm (a215020 at agate.fmr.com)
############################# Begin code #####################################
new.read.table <-
# Reads a tab-delimited data file into a list. Somewhat like:
# as.list(read.table(file, header=T, sep="\t", as.is=T, row.names=NULL,
# na.strings=na.strings))
function(file, want=items, skip=0, skip2=0, sep="\t",
strip.white=F, rm.quotes=F, integers=NULL, logicals=NULL,
characters=c("cusip","Cusip","cusp","symb","ticker","sector"),
na.strings=c("","-","na","NA","NC","ND","NaN","#N/A","#N/A N Ap",
"#N/A N.A.","@NA","NULL")) {
rm.files <- NULL
on.exit(if (length(rm.files)) unlink(rm.files))
## Gunzip, remove quotes, then make connection object:
if (is.character(file) && !file.exists(file)) {
if (!file.exists(file %&% ".gz")) stop("No file: " %&% file)
rm.files <- c(rm.files, newfile<-tempfile())
system("gunzip -c " %&% file %&% ".gz > " %&% newfile)
file <- newfile
}
if (rm.quotes) { # Won't work with pipe
rm.files <- c(rm.files, newfile<-tempfile())
system("tr -d \\\" < " %&% file %&% " > " %&% newfile)
file <- newfile
}
if (is.character(file)) {
file <- file(file,"r")
on.exit(close(file), add=TRUE) # add=TRUE keeps the unlink() handler above
}
## Get "items" from header row:
tmp <- readLines(file, skip+1)[skip+1]
tc <- textConnection(tmp)
items <- scan(tc, "", sep=sep, strip.white=strip.white, quiet=T)
items <- gsub("_",".",items) # Convert "_" to "."
close(tc)
## Build a "what" list:
wanted <- items %in% want
last <- max(which(wanted)) # Flush the rest
what <- structure(rep(list(""), last), names=items[1:last])
# what[!wanted] <- list(NULL) # Don't read unwanted columns (doesn't work in R)
## Scan the data connection:
obj <- scan(file, what, sep=sep, flush=T, skip=skip2,
strip.white=strip.white, quiet=T)
if (!all(wanted)) obj <- obj[which(wanted)]
## Convert strings to numeric or logical:
for (i in g.except(names(obj), characters)) {
z <- type.convert(obj[[i]], na.strings=na.strings, as.is=T) # public interface
if (is.numeric(z)) obj[[i]] <- if (i %in% integers) as.integer(z) else z
}
for (i in logicals) obj[[i]] <- as.logical(obj[[i]]) # "F","T", or 0,1
obj
}
# Supplementary (and very useful) functions:
"%&%" <- function(a, b) paste(a, b, sep="")
g.except <- function(a, b) unique(a[!match(a, b, 0)])
############################## End code ######################################
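
For anyone who wants to try the header-matching and conversion steps in
isolation, here is a small sketch. The header line, the wanted column names,
and the sample values are all made up, and it calls the exported
type.convert() with the argument names na.strings and as.is, which is an
assumption about your version of R.

```r
# Hypothetical header line and wanted columns (names are illustrative only).
header <- "cusip\tprice_close\tvolume\tnotes"
want   <- c("cusip", "price.close")

# Split the header on tabs via a textConnection, then map "_" to ".".
tc <- textConnection(header)
items <- scan(tc, "", sep = "\t", quiet = TRUE)
close(tc)
items <- gsub("_", ".", items)

# Which columns are wanted, and where the last wanted column sits;
# everything after position "last" could be flushed by scan().
wanted <- items %in% want
last   <- max(which(wanted))

# Convert a character column: entries listed in na.strings become NA,
# and as.is=TRUE prevents conversion to a factor.
x <- c("1.5", "NA", "-", "3")
z <- type.convert(x, na.strings = c("NA", "-"), as.is = TRUE)
```

If the column cannot be read as numbers (or logicals), type.convert() with
as.is=TRUE hands back the character vector, which is why the function above
only replaces a column when is.numeric(z) holds.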