Yong Wang
2007-May-21 20:28 UTC
[R] A "subscript out of bounds" and "write.table" problem on manipulating a large size dataset
Dear all,

Described below is a problem with a large data set (more than 2 GB after unzipping, tab-delimited). I know R is not the most appropriate tool for such a task, but I ran it on a server anyway and hit two basic problems.

1. count.fields can count all the rows. However, when I try to remove rows located beyond about 3/5 of the way through the data, R says "subscript out of bounds". Is there any option that limits the maximum size R will read in?

2. Because of careless coding I wrote the original data back out, and I find that the rewritten tab-delimited file does not match the original file.

I tried the code on a small data set (attached at the end), and there was no problem at all with such a small data set. I would appreciate any tips and suggestions on how to remove the unwanted rows from such a large data set. Finally, thanks to everyone who answered the tab-delimited question I raised yesterday.

### code as follows ###
data.mm <- read.table(file, header=T, sep="\t", fill=T);  # read in the large file
cf <- count.fields(file, sep="\t");  # count fields
n <- 23;  # the CORRECT number of fields per row, i.e. the number of variable names
del <- which(cf != n);  # find every row whose number of fields is not equal to 23
del <- del - 1;  # cf also counts the header line, so -1 gives the data rows I want to remove
data.mm <- data.mm[-del, ];  # try to remove the rows whose field count is not 23
### PROBLEM: R says "subscript out of bounds"

write.table(data.mm, file="mm_0206.txt", eol="\n", sep="\t", quote=F, row.names=F);
# since data.mm <- data.mm[-del,] aborted, this writes the original data as mm_0206.txt

### PROBLEM: the following two calls should then give the same output
table(cf);  # maximum number of fields is 23
table(count.fields("mm_0206.txt", sep="\t"));  # maximum number of fields is larger than 23, and other counts are unequal too
# for example, the original data has x rows with 10 fields, while the written
# data has y rows with 10 fields.
# if the original file is not rewritten correctly, a file with equal-length rows
# will probably not be written properly either, supposing
# data.mm <- data.mm[-del,] had executed successfully.

#### experimental data set as follows ###
V1 V2 V3 v4 v5 v6 v7 v8 v9
11 1 desc A 1 34 1-Sep-00 1 first mid last
12 2 desc B 6 56 2-Sep-00 1 First last
13 3 desc A 7 32 3-Sep-00 1 last
14 4 desc 4-Sep-00 0 first mid last
15 5 desc A 2 . 5-Sep-00 1 first mid last
16 6 desc B 9 3 6-Sep-00 0 last
17 7 A 6 65 7-Sep-00 first last
18 8 desc B 2 . 8-Sep-00 0 last
19 9 desc A 8 56 9-Sep-00 1 first last
20 10 desc B 5 89 10-Sep-00 0 first last
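
One possible way to drop the unwanted rows without holding the whole file in memory is to filter the raw lines by field count before calling read.table. The sketch below is only an illustration under stated assumptions: the helper name filter_by_field_count, the chunk size, and the demo file are hypothetical and not part of the code above. It also assumes that every tab is a field separator (no quoting rules), which may differ from how count.fields and read.table split lines with their default quote and comment.char settings.

```r
# Hypothetical helper (not from the post above): copy only the lines that
# contain exactly `n` tab-separated fields, reading the input in chunks.
filter_by_field_count <- function(infile, outfile, n, chunk_size = 100000L) {
  con_in <- file(infile, open = "r")
  on.exit(close(con_in))
  con_out <- file(outfile, open = "w")
  on.exit(close(con_out), add = TRUE)

  kept <- 0L
  repeat {
    lines <- readLines(con_in, n = chunk_size)
    if (length(lines) == 0L) break
    # Fields per line = number of tabs + 1. Assumption: quotes and '#'
    # are treated as ordinary characters, so every tab separates fields.
    no_tabs <- gsub("\t", "", lines, fixed = TRUE, useBytes = TRUE)
    nf <- nchar(lines, type = "bytes") - nchar(no_tabs, type = "bytes") + 1L
    keep <- nf == n
    writeLines(lines[keep], con_out)
    kept <- kept + sum(keep)
  }
  kept  # number of lines written, header included if it has n fields
}

# Small demo on a made-up file with 3 expected fields per line.
demo_in  <- tempfile(fileext = ".txt")
demo_out <- tempfile(fileext = ".txt")
writeLines(c("V1\tV2\tV3",        # header, 3 fields
             "1\ta\tx",           # 3 fields, kept
             "2\tb",              # 2 fields, dropped
             "3\tc\tz\textra",    # 4 fields, dropped
             "4\td\t"),           # 3 fields (last one empty), kept
           demo_in)
kept <- filter_by_field_count(demo_in, demo_out, n = 3L)

# Read the filtered file with quoting and comments switched off so that
# read.table splits lines the same way the filter counted them.
clean <- read.table(demo_out, header = TRUE, sep = "\t",
                    quote = "", comment.char = "", stringsAsFactors = FALSE)
```

For the file described above one would presumably call the helper with n = 23 and then read the filtered file. Whether raw tab counting matches the intended row layout depends on whether the data contain quote characters or '#', so the result is worth checking against table(cf) first.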