Ellis, Alicia M
2016-Aug-08 18:49 UTC
[R] Help with big data and parallel computing: 500,000 x 4 linear models
I have a large dataset with ~500,000 columns and 1264 rows. Each column represents the percent methylation at a given location in the genome. I need to run 500,000 linear models, one for each of 4 predictors of interest, in the form:

Methylation.site1 ~ predictor1 + covariate1 + covariate2 + ... + covariate9

...and save only the p-value for the predictor.

The original methylation data file had methylation sites as row labels and the individuals as columns, so I read the data in chunks and transposed it; I now have 5 csv files (chunks) with columns representing methylation sites and rows representing individuals.

I was able to get results for all of the regressions by running each chunk of methylation data separately on our supercomputer using the code below. However, I'm going to have to do this again for another project, and I would really like to accomplish two things to make the whole process more computationally efficient:

1) Work with data.tables instead of data.frames (reading and manipulating will be much easier and faster).

2) Do the work in parallel using, say, 12 cores at once, and have the program divide the work among the cores rather than me having to split the data and run 5 separate jobs on the supercomputer.

I have some basic knowledge of the data.table package, but I wasn't able to modify the foreach code below to get it to work, and the code using data.frames didn't seem to be using all 12 cores that I created in the cluster.

Can anyone suggest some modifications to the foreach code below that will allow me to do this in parallel with data.tables and avoid having to do it in chunks?

############# Set up cluster
library(foreach)
library(doSNOW)        ## also loads snow, which provides makeCluster()
library(data.table)

clus = makeCluster(12, type = "SOCK")
registerDoSNOW(clus)
getDoParWorkers()
getDoParName()

################### Following code needs to be modified to run the full dataset (batch1-batch5) in parallel
### Currently I read in the following chunks, and run each predictor separately for each chunk of data

############### Methylation data in batches
## Each batch has about 100,000 columns and 1264 rows; want to alter this to:
## batch1 = fread(file= )
batch1 = read.csv(file = "/home/alicia.m.ellis/batch1.csv")
batch2 = read.csv(file = "/home/alicia.m.ellis/batch2.csv")
batch3 = read.csv(file = "/home/alicia.m.ellis/batch3.csv")
batch4 = read.csv(file = "/home/alicia.m.ellis/batch4.csv")
batch5 = read.csv(file = "/home/alicia.m.ellis/batch5.csv")

predictors  ## this is a data.frame with 4 columns and 1264 rows
covariates  ## this is a data.frame with 9 columns and 1264 rows

fits <- as.data.table(batch1)[, list(MyFits = lapply(1:ncol(batch1), function(x)
    summary(lm(batch1[, x] ~ predictors[, 1] +
                   covariates[, 1] + covariates[, 2] + covariates[, 3] +
                   covariates[, 4] + covariates[, 5] + covariates[, 6] +
                   covariates[, 7] + covariates[, 8] + covariates[, 9]
               ))$coefficients[2, 4]
    ))]

###################################### This is what I was trying but wasn't having much luck
#### I'm having trouble getting the data merged as a single data.frame, and the code below
#### doesn't seem to be dividing the work among the 12 cores in the cluster at all.

fits = foreach(j = 1:ncol(predictors), i = 1:ncol(meth1),
               .combine = 'rbind', .inorder = TRUE) %dopar% {
    model = lm(meth[, i] ~ predictors[, j] +
                   covariates[, 1] + covariates[, 2] + covariates[, 3] +
                   covariates[, 4] + covariates[, 5] + covariates[, 6] +
                   covariates[, 7] + covariates[, 8] + covariates[, 9])
    summary(model)$coefficients[2, 4]
}

Alicia Ellis, Ph.D
Biostatistician
Pathology & Laboratory Medicine
Colchester Research Facility
360 South Park Drive, Room 209C
Colchester, VT 05446
802-656-9840
Aaron Mackey
2016-Aug-08 21:33 UTC
[R] Help with big data and parallel computing: 500,000 x 4 linear models
Don't run 500K separate models. Use the limma package to fit one model that can learn the variance parameters jointly; it will run on your laptop. And don't use percent methylation (the Beta value) as your Y variable; use logit(percent), i.e. the M-value.

-Aaron
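A minimal sketch of this limma approach might look like the following. It is untested, and the object names are assumed for illustration: `meth` is the matrix of methylation proportions with sites in rows and samples in columns (the original, untransposed orientation, which is what limma expects), while `predictors` and `covariates` are the 1264-row data frames from the first post, with numeric predictors.

library(limma)   ## Bioconductor package

## Convert Beta values (proportions) to M-values (logit scale), clipping to avoid 0/1
eps  <- 1e-6
beta <- pmin(pmax(meth, eps), 1 - eps)
M    <- log2(beta / (1 - beta))

## One moderated fit per predictor of interest, always adjusting for the 9 covariates.
## lmFit() fits all ~500,000 site-wise models in a single call, and eBayes() shrinks the
## site-wise variances toward a common prior (the "variance learned jointly" part).
pvals <- sapply(seq_len(ncol(predictors)), function(j) {
    design <- model.matrix(~ ., data = cbind(p = predictors[, j], covariates))
    fit    <- eBayes(lmFit(M, design))
    fit$p.value[, "p"]   ## p-value for the predictor ("p") at every site
})
## pvals: sites x 4 matrix of moderated p-values, one column per predictor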
Charles C. Berry
2016-Aug-09 02:03 UTC
[R] Help with big data and parallel computing: 500,000 x 4 linear models
On Mon, 8 Aug 2016, Ellis, Alicia M wrote:

> I have a large dataset with ~500,000 columns and 1264 rows. Each column
> represents the percent methylation at a given location in the genome.
> I need to run 500,000 linear models for each of 4 predictors of interest
> in the form of:
> Methylation.site1 ~ predictor1 + covariate1 + covariate2 + ... covariate9
> ...and save only the p-value for the predictor
>
> The original methylation data file had methylation sites as row labels
> and the individuals as columns so I read the data in chunks and
> transposed it so I now have 5 csv files (chunks) with columns
> representing methylation sites and rows as individuals.
>
> I was able to get results for all of the regressions by running each
> chunk of methylation data separately on our supercomputer using the code
> below.

This sounds like a problem for my old laptop, not a supercomputer.

You might want to review the algebra and geometry of least squares.

In particular, covariate1 ... covariate9 are the same 1264 x 9 matrix for every problem IIUC. So you can compute the QR decomposition of that matrix (plus the unit vector `intercept') *once* and use it in all the problems.

Using that decomposition, find the residuals of the regressands and of the `predictor1' (etc.) regressors. The rest is simple least squares: compute the correlation coefficient between the residuals of a regressand and those of a regressor, for each combination. Make a table of critical values for the p-value(s) you require - remember to get the degrees of freedom right (i.e. account for the covariates).

These correlations of residuals are the partial correlations given the covariates, and a test on one of them is algebraically equivalent to the test of the regression coefficient for the corresponding regressand and regressor in a model that also includes those 9 covariates.

See:

?qr
?lm.fit

HTH,

Chuck
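A rough sketch of this QR / partial-correlation approach (untested; the names `meth`, `predictors`, and `covariates` are assumed as in the original post, with `meth` being one individuals-by-sites chunk, so the same code can be applied chunk by chunk if the full 500,000-column matrix does not fit in memory):

## Shared design for every model: intercept + the 9 covariates.
## model.matrix() handles factor covariates; the predictors are assumed numeric here.
X   <- model.matrix(~ ., data = covariates)
qrX <- qr(X)                                  ## QR decomposition, computed once
n   <- nrow(X)                                ## 1264 individuals

## Residualize every methylation column and every predictor on the covariates
Ry <- qr.resid(qrX, as.matrix(meth))          ## 1264 x n.sites
Rx <- qr.resid(qrX, as.matrix(predictors))    ## 1264 x 4

## Partial correlations of each predictor with each site, given the covariates
num   <- crossprod(Rx, Ry)                                ## 4 x n.sites cross-products
denom <- sqrt(colSums(Rx^2)) %o% sqrt(colSums(Ry^2))      ## 4 x n.sites norms
pcor  <- num / denom

## The t-test on a partial correlation is equivalent to the t-test on the predictor's
## coefficient in the full model; df accounts for intercept, 9 covariates and the predictor.
df   <- n - ncol(X) - 1                       ## 1264 - 10 - 1 = 1253
tval <- pcor * sqrt(df) / sqrt(1 - pcor^2)
pval <- 2 * pt(-abs(tval), df)                ## two-sided p-value, one per predictor-site pair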