Hi All,

I'd like to ask for a few clarifications. I am doing some calculations over some biggish datasets. One has ~23000 rows and 6 columns, the other has ~620000 rows and 6 columns.

I am using these datasets to perform a simulation of haplotype coalescence over a pedigree (the datasets themselves are pedigree information). I created a new dataset (same number of rows as the pedigree dataset, 2 columns) and I use a looping function to assign haplotypes according to a standard biological reproductive process (i.e. meiosis, sexual reproduction).

My code is something like:

off = function(sire, dam){   # simulation of reproduction between two individuals
  sch.toll = round(runif(1, min = 1, max = 2))   # pick haplotype 1 or 2 from the sire
  dch.toll = round(runif(1, min = 1, max = 2))   # pick haplotype 1 or 2 from the dam
  s.gam = sire[, sch.toll]
  d.gam = dam[, dch.toll]
  offspring = cbind(s.gam, d.gam)
  # offspring
}

for (i in 1:dim(new)[1]){
  if(ped[i,3] != 0 & ped[i,5] != 0){   # both parents known
    zz = off(as.matrix(t(new[ped[i,3],])), as.matrix(t(new[ped[i,5],])))
    new[i,1] = zz[1]
    new[i,2] = zz[2]
  }
}

I am also attributing a generation index to each row with a trivial calculation:

for(i in atres){
  genz[i] = (genz[ped[i,3]] + genz[ped[i,5]])/2 + 1
  #print(genz[i])
}

My question then: on the 23000-row dataset the calculations take about 5 minutes. On the 620000-row dataset I kill the process after ~24 hours, and the job is not finished. Why such an immense discrepancy in execution times (the code is the same, the datasets are stored in two separate .RData files)?

Any light would be appreciated.

Federico

PS I am running R 2.1.0 on Debian Sarge, on a dual 3 GHz Xeon machine with 2 GB RAM. The R process uses 99% of the CPU, but hardly any RAM, for what I gather from top.

--
Federico C. F. Calboli
Department of Epidemiology and Public Health
Imperial College, St Mary's Campus
Norfolk Place, London W2 1PG

Tel +44 (0)20 7594 1602   Fax +44 (0)20 7594 3193

f.calboli [.a.t] imperial.ac.uk
f.calboli [.a.t] gmail.com
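A quick way to see how the loop's cost scales is to time it on progressively larger chunks of rows with system.time(). This is only a rough diagnostic sketch using the objects from the post above; the chunk sizes are arbitrary, and each timed run keeps writing into `new`:

for (n in c(1000, 5000, 10000, 20000)) {
  print(system.time(
    for (i in 1:n) {
      if (ped[i, 3] != 0 & ped[i, 5] != 0) {
        zz <- off(as.matrix(t(new[ped[i, 3], ])), as.matrix(t(new[ped[i, 5], ])))
        new[i, 1] <- zz[1]
        new[i, 2] <- zz[2]
      }
    }
  ))
}

If the elapsed times grow much faster than the chunk size, the per-row cost is itself increasing with the size of the object, which would explain why the 620000-row run does not finish in anything like 27 times the 23000-row run.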
My suggestion is that you try to vectorize the computation as much as you can.

From what you've shown, `new' and `ped' need to have the same number of rows, right?

Your `off' function seems to be randomly choosing between columns 1 and 2 from its two input matrices (one row each?). You may want to do the sampling all at once instead of looping over the rows. E.g.,

> (m <- matrix(1:10, ncol=2))
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
> (colSample <- sample(1:2, nrow(m), replace=TRUE))
[1] 1 1 2 1 1
> (x <- m[cbind(1:nrow(m), colSample)])
[1] 1 2 8 4 5

So you might want to do something like (obviously untested):

todo <- ped[,3] * ped[,5] != 0   ## indicator of which rows to work on
n.todo <- sum(todo)              ## how many are there?
sire <- new[ped[todo, 3], ]      ## haplotype pairs of the sires
dam  <- new[ped[todo, 5], ]      ## haplotype pairs of the dams
## pick one column per row, using the matrix-index trick shown above
s.gam <- sire[cbind(1:nrow(sire), sample(1:2, nrow(sire), replace=TRUE))]
d.gam <- dam[cbind(1:nrow(dam), sample(1:2, nrow(dam), replace=TRUE))]
new[todo, 1:2] <- cbind(s.gam, d.gam)

Andy
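For concreteness, here is a self-contained toy run of the same idea. This is a sketch only: the 6-row `ped` and `new` below are made up for illustration, with column 3 holding the sire's row and column 5 the dam's row, as in the original loop:

set.seed(1)

## toy pedigree: rows 1-4 are founders (parents coded 0), rows 5-6 are offspring;
## column 3 holds the row of the sire, column 5 the row of the dam
ped <- matrix(0, nrow = 6, ncol = 6)
ped[5, c(3, 5)] <- c(1, 2)
ped[6, c(3, 5)] <- c(3, 4)

## two haplotype columns per individual; founders get arbitrary labels 1..8
new <- matrix(0, nrow = 6, ncol = 2)
new[1:4, ] <- matrix(1:8, ncol = 2)

todo  <- ped[, 3] != 0 & ped[, 5] != 0        ## rows with both parents known
sire  <- new[ped[todo, 3], , drop = FALSE]
dam   <- new[ped[todo, 5], , drop = FALSE]
s.gam <- sire[cbind(1:nrow(sire), sample(1:2, nrow(sire), replace = TRUE))]
d.gam <- dam[cbind(1:nrow(dam), sample(1:2, nrow(dam), replace = TRUE))]
new[todo, 1:2] <- cbind(s.gam, d.gam)
new

One caveat: a single vectorized pass uses the parents' haplotypes as they stood before the pass, so on a pedigree spanning several generations it would have to be applied one generation at a time (e.g. in increasing order of the generation index computed in the original post).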
In my absentmindedness I'd forgotten to CC this to the list... and BTW, calling gc() inside the loop increases the runtime...

Improving the efficiency of the code is obviously a plus, but the real thing I am mesmerised by is the sheer increase in runtime... how come it is not a linear increase with dataset size?

Cheers,

Federico
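The thread does not say whether `new` is a matrix or a data frame. If it is a data frame, one common cause of loop times that grow much faster than the number of rows is element-wise assignment into a data frame, which does far more work per iteration (dispatch, checking, and copying) than assignment into a matrix. A small self-contained check, with an arbitrary n, might look like this:

n  <- 20000
df <- data.frame(a = numeric(n), b = numeric(n))
mx <- matrix(0, nrow = n, ncol = 2)

system.time(for (i in 1:n) df[i, 1] <- i)   ## data frame: per-iteration cost grows with the object's size
system.time(for (i in 1:n) mx[i, 1] <- i)   ## matrix: cheap, roughly constant per iteration

If the data-frame timing dominates, converting `new` to a matrix with as.matrix() before the loop, or using the vectorized version above, should change the picture considerably.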