thr3ads.net - R help - [R] making code (loop) more efficient [Dec 2020]

If this information is useful, please help other people find it:
Share via:

Ana Marija

2020-Dec-16 00:11 UTC

[R] making code (loop) more efficient

Hello,

I made a terribly inefficient code which runs forever but it does run.

library(dplyr)
library(splitstackshape)

datalist = list()
files <- list.files("/WEIGHTS1/Retina", pattern=".RDat",
ignore.case=T)

for(i in files)
{
a<-get(load(i))
names <- rownames(a)
data <- as.data.frame(cbind(names,a))
rownames(data) <- NULL
dd=na.omit(concat.split.multiple(data = data, split.cols = c("names"),
seps = ":"))
dd=select(dd,names_1,blup,names_3,names_4)
colnames(dd)=c("rsid","weight","ref_allele","eff_allele")
dd$WGT<-i
datalist[[i]] <- dd # add it to your list
}

big_data = do.call(rbind, datalist)

There is 17345 RDat files this loop has to go through. And each file
has approximately 10,000 lines. All RDat files can be downloaded from
here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115828 and
they are compressed in this file: GSE115828_retina_TWAS_wgts.tar.gz .
And subset of 3 of those .RDat files is here:
https://github.com/montenegrina/sample

For one of those files, say i="retina.ENSG00000135776.wgt.RDat"
dd looks like this:
> head(dd)          rsid        weight ref_allele eff_allele
1:  rs72763981  9.376766e-09          C          G
2: rs144383755 -2.093346e-09          A          G
3:   rs1925717  1.511376e-08          T          C
4:  rs61827307 -1.625302e-08          C          A
5:  rs61827308 -1.625302e-08          G          C
6: rs199623136 -9.128354e-10         GC          G
                           WGT
1: retina.ENSG00000135776.wgt.RDat
2: retina.ENSG00000135776.wgt.RDat
3: retina.ENSG00000135776.wgt.RDat
4: retina.ENSG00000135776.wgt.RDat
5: retina.ENSG00000135776.wgt.RDat
6: retina.ENSG00000135776.wgt.RDat

so on attempt to parallelize this I did this:

library(parallel)
library(data.table)
library(foreach)
library(doSNOW)

n <-  parallel::detectCores()
cl <- parallel::makeCluster(n, type = "SOCK")
doSNOW::registerDoSNOW(cl)
files <- list.files("/WEIGHTS1/Retina", pattern=".RDat",
ignore.case=T)

lst_out <- foreach::foreach(i = seq_along(files),
                  .packages = c("data.table") ) %dopar% {

   a <- get(load(files[i]))
   names <- rownames(a)
   data <- data.table(names, a["blup"])
   nm1 <- c("rsid", "ref_allele", "eff_allele")
   data[,  (nm1) := tstrsplit(names, ":")[-2]]
   return(data[, .(rsid, weight = blup, ref_allele, eff_allele)][,
               WGT := files[i]][])
 }
parallel::stopCluster(cl)

big_data <- rbindlist(lst_out)

I am getting this Error:

Error in { : task 7 failed - "object 'blup' not
found"> parallel::stopCluster(cl)
Can you please advise,
Ana

Jim Lemon

2020-Dec-16 01:34 UTC

head link

[R] making code (loop) more efficient

Hi Ana,
My guess is that in your second code fragment you are assigning the
rownames of "a" and the _values_ contained in a$blup to the data.table
"data". As I don't have much experience with data tables I may be
wrong, but I suspect that the column name "blup" may not be visible or
even present in "data". I don't see it in "dd" above
this code
fragment.

Jim

On Wed, Dec 16, 2020 at 11:12 AM Ana Marija <sokovic.anamarija at
gmail.com> wrote:>
> Hello,
>
> I made a terribly inefficient code which runs forever but it does run.
>
> library(dplyr)
> library(splitstackshape)
>
> datalist = list()
> files <- list.files("/WEIGHTS1/Retina",
pattern=".RDat", ignore.case=T)
>
> for(i in files)
> {
> a<-get(load(i))
> names <- rownames(a)
> data <- as.data.frame(cbind(names,a))
> rownames(data) <- NULL
> dd=na.omit(concat.split.multiple(data = data, split.cols =
c("names"),
> seps = ":"))
> dd=select(dd,names_1,blup,names_3,names_4)
>
colnames(dd)=c("rsid","weight","ref_allele","eff_allele")
> dd$WGT<-i
> datalist[[i]] <- dd # add it to your list
> }
>
> big_data = do.call(rbind, datalist)
>
> There is 17345 RDat files this loop has to go through. And each file
> has approximately 10,000 lines. All RDat files can be downloaded from
> here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115828 and
> they are compressed in this file: GSE115828_retina_TWAS_wgts.tar.gz .
> And subset of 3 of those .RDat files is here:
> https://github.com/montenegrina/sample
>
> For one of those files, say i="retina.ENSG00000135776.wgt.RDat"
> dd looks like this:
>
> > head(dd)
>           rsid        weight ref_allele eff_allele
> 1:  rs72763981  9.376766e-09          C          G
> 2: rs144383755 -2.093346e-09          A          G
> 3:   rs1925717  1.511376e-08          T          C
> 4:  rs61827307 -1.625302e-08          C          A
> 5:  rs61827308 -1.625302e-08          G          C
> 6: rs199623136 -9.128354e-10         GC          G
>                            WGT
> 1: retina.ENSG00000135776.wgt.RDat
> 2: retina.ENSG00000135776.wgt.RDat
> 3: retina.ENSG00000135776.wgt.RDat
> 4: retina.ENSG00000135776.wgt.RDat
> 5: retina.ENSG00000135776.wgt.RDat
> 6: retina.ENSG00000135776.wgt.RDat
>
> so on attempt to parallelize this I did this:
>
> library(parallel)
> library(data.table)
> library(foreach)
> library(doSNOW)
>
> n <-  parallel::detectCores()
> cl <- parallel::makeCluster(n, type = "SOCK")
> doSNOW::registerDoSNOW(cl)
> files <- list.files("/WEIGHTS1/Retina",
pattern=".RDat", ignore.case=T)
>
> lst_out <- foreach::foreach(i = seq_along(files),
>                   .packages = c("data.table") ) %dopar% {
>
>    a <- get(load(files[i]))
>    names <- rownames(a)
>    data <- data.table(names, a["blup"])
>    nm1 <- c("rsid", "ref_allele",
"eff_allele")
>    data[,  (nm1) := tstrsplit(names, ":")[-2]]
>    return(data[, .(rsid, weight = blup, ref_allele, eff_allele)][,
>                WGT := files[i]][])
>  }
> parallel::stopCluster(cl)
>
> big_data <- rbindlist(lst_out)
>
> I am getting this Error:
>
> Error in { : task 7 failed - "object 'blup' not found"
> > parallel::stopCluster(cl)
>
> Can you please advise,
> Ana
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

R help - Dec 2020 - making code (loop) more efficient

[R] making code (loop) more efficient

[R] making code (loop) more efficient