David Hall (coding)
2008-Mar-15 00:25 UTC
[R] Appending new values to an existing factor vector
Hello,

I've recently come across a situation where I'm trying to read in [genotype data] files that have around 80,000,000 lines and 4 fields, with a high proportion of repeated strings. Here's a sample:

rsXXXXXXX SAMPLE0001 CG 0.05302
rsXXXXXX SAMPLE0001 CC 0.06817
rsXXXXXXXX SAMPLE0001 CC 0.01369
rsXXXXXXY SAMPLE0001 GG 0.01816
rsXXXXXXZ SAMPLE0001 GG 0.006711
rsXXXXXXX SAMPLE0002 GG 0.05813

[For the purpose of the work I'm doing at the moment, I don't care about the last column]

What's the best way to read in these data? My understanding of what happens when I do read.table on such a file is that it reads the file into a matrix (or perhaps a list) of character strings, then carries out the character conversions [i.e. as.factor(data[[i]])].

infile.df <- read.table(gzfile("large_file.txt.gz"), nrows = 82000000)

Doing this all in one go results in R complaining about not having enough memory to store a data structure of that size [I'm running on Linux, with 1.5GB memory + 2GB swap], so I need to do it piecewise, but I suspect the memory issues will still be present if I do that.

What I'd like is a way to read in, say, a million lines at a time, do the factor conversion, then append to my existing data frame, which has columns of factors. However, something I came across while participating in the ICFP 2007 contest (http://www.icfpcontest.org/) using R was the strange behaviour when adding new/unknown values to a factor vector:

> (a <- factor(c("I","C","I","C","F","I")))
[1] I C I C F I
Levels: C F I
> append(a,"P")
[1] "3" "1" "3" "1" "2" "3" "P"

What would be nice is for unknown levels to be added and encoded as a new value, without having to re-factor the whole vector, as follows:

> factor(append(as.character(a),"P"))
[1] I C I C F I P
Levels: C F I P

Is there a better way to do this that means I don't need to do the character conversion process?
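One possible way to get the behaviour described above is to extend the level set and work directly on the integer codes, so the existing values are never turned back into strings. The following is only a minimal sketch: append_factor is a hypothetical helper name, not a base R function, and it assumes the new values arrive as a character vector.

```r
# Hypothetical helper (not part of base R): append character values to a
# factor by extending its levels and concatenating integer codes.
append_factor <- function(f, new_values) {
  old_levels <- levels(f)
  # Unseen values become new levels, placed after the existing ones so
  # that the codes already stored in f remain valid. NA is not a level.
  seen <- unique(new_values[!is.na(new_values)])
  all_levels <- c(old_levels, setdiff(seen, old_levels))
  # Only the new values are looked up; f contributes its codes as-is.
  codes <- c(as.integer(f), match(new_values, all_levels))
  structure(codes, levels = all_levels, class = "factor")
}

a <- factor(c("I", "C", "I", "C", "F", "I"))
b <- append_factor(a, "P")
b  # a factor with values I C I C F I P and levels C F I P
```

Note that new levels are appended in order of first appearance rather than sorted, so the level order can differ from what factor() would produce on the full character vector.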
The need for this character conversion seems to remove one of the useful features of a factor vector, namely that it substantially reduces space requirements.

Thanks for your help,
David Hall
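The piecewise read described above could be sketched roughly as follows. This is an illustration under assumptions, not a tested solution for an 80,000,000-line file: read_genotypes and combine_factors are hypothetical names, the column names snp, sample and genotype are made up, and the file is assumed to be whitespace-separated with no header. It relies on read.table continuing from the current position of an already open connection, and on colClasses = "NULL" to drop the fourth column.

```r
# Hypothetical helper: join two factors by remapping the codes of g onto
# a shared level set. Only the level strings are matched, not every value.
combine_factors <- function(f, g) {
  all_levels <- union(levels(f), levels(g))
  remap <- match(levels(g), all_levels)
  structure(c(as.integer(f), remap[as.integer(g)]),
            levels = all_levels, class = "factor")
}

# Hypothetical reader: pull the file in chunk by chunk, keeping three
# factor columns and skipping the fourth.
read_genotypes <- function(path, chunk_rows = 1e6) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  cols <- list(factor(character(0)), factor(character(0)),
               factor(character(0)))
  repeat {
    # read.table signals an error once the input is exhausted; this
    # sketch treats any error as end of input, which hides real failures.
    chunk <- tryCatch(
      read.table(con, nrows = chunk_rows,
                 colClasses = c("factor", "factor", "factor", "NULL")),
      error = function(e) NULL)
    if (is.null(chunk) || nrow(chunk) == 0) break
    for (i in seq_along(cols)) {
      cols[[i]] <- combine_factors(cols[[i]], chunk[[i]])
    }
    if (nrow(chunk) < chunk_rows) break  # a short chunk means end of file
  }
  data.frame(snp = cols[[1]], sample = cols[[2]], genotype = cols[[3]])
}
```

A call would look like read_genotypes("large_file.txt.gz"). Each chunk still passes through character strings inside read.table, but only a million rows at a time; the accumulated columns stay as integer codes. Growing the columns with c() copies them on every chunk, so memory may still be tight on a 1.5GB machine.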