Hi,

I have a 2194651 x 135 dataset in which all the values are 0, 1, or 2, and the file is bar-delimited. I used the following approach, which can handle 100,000 lines:

    t  <- scan('fv', sep='|', nlines=100000)
    t1 <- matrix(t, nrow=135, ncol=100000)
    t2 <- t(t1)
    t3 <- as.data.frame(t2)

I have changed my plan to stratified sampling with replacement (column 2 is my class variable: 1 or 2). The class distribution looks like this:

    awk -F\| '{print $2}' fv | sort | uniq -c
    2162792 1
      31859 2

Is it possible to use R to read the whole dataset and do the stratified sampling? Does that really depend on my memory size?

    Mem:  3111736k total, 1023040k used, 2088696k free, 150160k buffers
    Swap: 4008208k total,   19040k used, 3989168k free, 668892k cached

Thanks,

weiwei

--
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III
On Mon, 18 Jul 2005, Weiwei Shi wrote:

> Hi,
> I have a dataset with 2194651x135, in which all the numbers are 0,1,2,
> and is bar-delimited.
> [...]
> Is it possible to use R to read the whole dataset and do the
> stratified sampling? Is it really dependent on my memory size?

You may well not be able to read the whole data set into memory at once: it would take a bit more than 2Gb of memory even to store it. You can use readLines to read it in chunks of, say, 10000 lines.

For the stratified sampling I would suggest Bernoulli sampling of slightly more than you want. For example, if you want 10000 rows from class 1, keeping each element independently with probability 10500/2162792 gives you approximately Poisson(10500) elements, which will exceed 10000 with better than 99.999% probability. You can then choose 10000 at random from those.

I can't think of an approach that is guaranteed to work in one pass over the data, but 99.999% is pretty close.

        -thomas
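A minimal sketch of the chunked-read-plus-Bernoulli approach Thomas describes, assuming the bar-delimited file from the question; the function name `sample_stratified` and the `p1`/`chunk` parameters are illustrative, not from the original thread:

```r
# Sketch: one pass over a bar-delimited file, keeping each class-1 row
# with a small probability p1 (slight oversample of the target count)
# and keeping every class-2 row, since class 2 is rare (31859 rows).
sample_stratified <- function(path, p1, chunk = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  keep1 <- character(0)
  keep2 <- character(0)
  repeat {
    lines <- readLines(con, n = chunk)   # chunked read, as suggested
    if (length(lines) == 0) break
    # Column 2 is the class variable.
    cls <- sapply(strsplit(lines, "|", fixed = TRUE), `[`, 2)
    keep1 <- c(keep1, lines[cls == "1" & runif(length(lines)) < p1])
    keep2 <- c(keep2, lines[cls == "2"])
  }
  list(class1 = keep1, class2 = keep2)
}
```

With `p1 = 10500/2162792`, `length(result$class1)` will exceed 10000 with the probability Thomas quotes, and the oversample can then be thinned with `sample(result$class1, 10000)`; in the rare shortfall case a second pass would be needed.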