andrewH
2013-Nov-24 03:38 UTC
[R] filehash error in the colbycol method for as.data.frame from a large object
Dear Folks--

I have a 14 gig .csv file with 731 columns. I have read it into a colbycol object (which took overnight, about 16 hours) using the code below, which produced no warnings or error messages. The object, CPS62_12, is 49 gig. After the reading, summary() produced the output below and colnames() successfully returned the names of all 731 columns.

> CPS62_12 <- cbc.read.table(
+ "C:\\R_PROJ\\INEQ_TRENDS\\TESTS\\monofile_ALLVARS\\cps_00078.csv",
+ header = T, sep = "," )
> summary(CPS62_12)
Object of class colbycol with 8093281 rows and 731 columns.
Data for the object is stored at C:\DOCUME~1\ADMINI~2\LOCALS~1\Temp\RtmpMj3LRP\dir1a6d82a1df37.
> nrow(CPS62_12)
[1] 8093281
> ncol(CPS62_12)
[1] 731
> colnames(CPS62_12)
[1] "RECTYPE" "YEAR" "SERIAL" "MISH" <etc.>

I then ran as.data.frame() (code below) and got the following error and warning message:

> income_HH_CPS <-as.data.frame(CPS62_12,
+ c(YEAR, STATEFIP, RECTYPE, SERIAL, HWTSUPP, HHINCOME, NUMPREC))
Error in readSingleKey(con, map, key) :
  unable to obtain value for key 'RECTYPE'
In addition: Warning message:
In readKeyMap(filecon) : NAs introduced by coercion

I tried the command on a number of column name combinations and subsequently always got the error message without the warning. The error is always on RECTYPE, which is the name of the first column in the csv file.

I am as yet not able to reproduce this error on a smaller object. I copied the first 10 lines of my file into an object by using a connection with readLines. I evaluated the object in the console, pasted the result into Notepad, and saved it. Then I manually sliced off all but the first 15 variables. The resulting file sailed through the code above and produced a data frame faultlessly. This undercut my leading theory, which was that the slash double-quotes (/") that bracketed the column names were causing the problem.

I tried running cbc.get.col on the second variable in the file, YEAR.
These two commands:

    yearCPS <-cbc.get.col(CPS62_12, YEAR)
    yearCPS <-cbc.get.col(CPS62_12, 2)

both resulted in the following error message:

Error in readSingleKey(con, map, key) :
  unable to obtain value for key 'YEAR'

Note that numerical indexing still returned an error on the variable name, YEAR. I got the same result for several other variables, with each error naming the variable requested.

I tracked the error message back to the following function in the filehash package:

    readSingleKey <- function(con, map, key) {
        start <- map[[key]]
        if(is.null(start))
            stop(gettextf("unable to obtain value for key '%s'", key))
        seek(con, start, rw = "read")
        unserialize(con)
    }

Now I am at a loss. I see that the element "key" of the list "map" has the value NULL, that any call to as.data.frame uses RECTYPE as the key, and that any call to cbc.get.col() uses the passed variable name as a key, even those that only pass a number. But I don't know much of anything about file hashing, and I have run out of ideas.

Can anyone tell me what I am doing wrong, or whether there is a particular problem with my file that is likely to be causing this problem, or what my next diagnostic step should be? Please be aware that I can only do things I can run on 3 gig of RAM. I am running R under RStudio 0.97.551, on a Windows XP machine with Service Pack 3.

Sincerely, andrewH

--
View this message in context: http://r.789695.n4.nabble.com/filehash-error-in-the-colbycol-method-for-as-data-frame-from-a-large-object-tp4681052.html
Sent from the R help mailing list archive at Nabble.com.
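To make the failure mode in the post concrete, here is a minimal sketch of the lookup done by readSingleKey, run outside colbycol and filehash. The function body is the one quoted in the message. Everything else is an assumption made for illustration: the temporary file and the hand-built `map` list are hypothetical stand-ins for a filehash database and the key map that readKeyMap would normally build, and the values stored are made up. The sketch only shows that the error appears exactly when the requested key has no entry in the map. It does not explain why the entry is missing for the 49 gig object.

```r
# readSingleKey as quoted in the message above
readSingleKey <- function(con, map, key) {
    start <- map[[key]]
    if (is.null(start))
        stop(gettextf("unable to obtain value for key '%s'", key))
    seek(con, start, rw = "read")
    unserialize(con)
}

# Hypothetical stand-in for a database file: one serialized value,
# plus a named list recording the byte offset at which it starts.
path <- tempfile()
con <- file(path, open = "wb")
map <- list()
map[["YEAR"]] <- seek(con, rw = "write")  # current offset, before writing
serialize(1962:1964, con)                 # made-up values for illustration
close(con)

con <- file(path, open = "rb")

# A key that is present in the map is found and unserialized.
year <- readSingleKey(con, map, "YEAR")

# A key that is absent from the map makes map[[key]] NULL,
# which triggers the same error text reported in the message.
msg <- tryCatch(readSingleKey(con, map, "RECTYPE"),
                error = function(e) conditionMessage(e))

close(con)
unlink(path)

print(year)
print(msg)
```

If this reading is right, the error says that the key map read from disk has no entry for the column name, so the contents of the map, rather than the .csv file itself, may be the more useful thing to inspect next.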