Hi Ana,
I would look at "data" in your second example and see if it contains a
column named "blup" or just the values that were extracted from
a$blup. Also, I assume that weight=blup looks for an object named
"blup", which may not be there.
Jim
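
P.S. Here is a rough sketch of the check I mean. The toy matrix below is made up, and I am only assuming that the object loaded from your file is a matrix rather than a data frame, so treat it as an illustration and not a confirmed diagnosis. It shows that a["blup"] on a matrix does not select the "blup" column, and one way to build "data" with an explicitly named column.

```r
library(data.table)

# Toy stand-in for one loaded weight object (hypothetical values and SNP ids).
a <- matrix(c(0.08, 0.62, 0.70, -1.9e-04, 9.9e-04, 1.2e-03),
            nrow = 3,
            dimnames = list(c("rs1:100:C:A", "rs2:200:G:C", "rs3:300:C:T"),
                            c("top1", "blup")))

# On a matrix, a["blup"] looks for an element named "blup", not a column,
# so it returns NA and the resulting table has no column called "blup".
names <- rownames(a)
bad <- data.table(names, a["blup"])
"blup" %in% names(bad)   # FALSE

# Select the column with a[, "blup"] and name it explicitly instead.
good <- data.table(names = rownames(a), blup = a[, "blup"])
nm1 <- c("rsid", "ref_allele", "eff_allele")
good[, (nm1) := tstrsplit(names, ":", fixed = TRUE)[-2]]
out <- good[, .(rsid, weight = blup, ref_allele, eff_allele)]
print(out)
```

The a[, "blup"] form also works if the object turns out to be a data frame, so it may be the safer choice inside the foreach loop either way.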
On Wed, Dec 16, 2020 at 1:20 PM Ana Marija <sokovic.anamarija at gmail.com> wrote:
>
> Hi Jim,
>
> Maybe my post is confusing.
>
> So "dd" came from my slow code, and I don't use it again in the
> parallelized code.
>
> So for example for one of my files:
>
> if
> i="retina.ENSG00000120647.wgt.RDat"
> > a <- get(load(i))
> > head(a)
>                            top1          blup lasso enet
> rs4980905:184404:C:A 0.07692622 -1.881795e-04     0    0
> rs7978751:187541:G:C 0.62411425  9.934994e-04     0    0
> rs2368831:188285:C:T 0.69529158  1.211028e-03     0    0
> ...
>
> I posted the slow code only to show what was running very slowly; it does
> run. I really need help fixing the parallelized version.
>
> On Tue, Dec 15, 2020 at 7:35 PM Jim Lemon <drjimlemon at gmail.com> wrote:
> >
> > Hi Ana,
> > My guess is that in your second code fragment you are assigning the
> > rownames of "a" and the _values_ contained in a$blup to the data.table
> > "data". As I don't have much experience with data tables I may be
> > wrong, but I suspect that the column name "blup" may not be visible or
> > even present in "data". I don't see it in "dd" above this code
> > fragment.
> >
> > Jim
> >
> > On Wed, Dec 16, 2020 at 11:12 AM Ana Marija <sokovic.anamarija at gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > I wrote some terribly inefficient code that runs forever, but it does run.
> > >
> > > library(dplyr)
> > > library(splitstackshape)
> > >
> > > datalist = list()
> > > files <- list.files("/WEIGHTS1/Retina", pattern=".RDat", ignore.case=T)
> > >
> > > for(i in files)
> > > {
> > >   a<-get(load(i))
> > >   names <- rownames(a)
> > >   data <- as.data.frame(cbind(names,a))
> > >   rownames(data) <- NULL
> > >   dd=na.omit(concat.split.multiple(data = data, split.cols = c("names"),
> > >                                    seps = ":"))
> > >   dd=select(dd,names_1,blup,names_3,names_4)
> > >   colnames(dd)=c("rsid","weight","ref_allele","eff_allele")
> > >   dd$WGT<-i
> > >   datalist[[i]] <- dd # add it to your list
> > > }
> > >
> > > big_data = do.call(rbind, datalist)
> > >
> > > There are 17345 RDat files this loop has to go through, and each file
> > > has approximately 10,000 lines. All RDat files can be downloaded from
> > > here:
> > > https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115828
> > > They are compressed in this file: GSE115828_retina_TWAS_wgts.tar.gz
> > > A subset of 3 of those .RDat files is here:
> > > https://github.com/montenegrina/sample
> > >
> > > For one of those files, say i="retina.ENSG00000135776.wgt.RDat",
> > > dd looks like this:
> > >
> > > > head(dd)
> > >           rsid        weight ref_allele eff_allele                             WGT
> > > 1:  rs72763981  9.376766e-09          C          G retina.ENSG00000135776.wgt.RDat
> > > 2: rs144383755 -2.093346e-09          A          G retina.ENSG00000135776.wgt.RDat
> > > 3:   rs1925717  1.511376e-08          T          C retina.ENSG00000135776.wgt.RDat
> > > 4:  rs61827307 -1.625302e-08          C          A retina.ENSG00000135776.wgt.RDat
> > > 5:  rs61827308 -1.625302e-08          G          C retina.ENSG00000135776.wgt.RDat
> > > 6: rs199623136 -9.128354e-10         GC          G retina.ENSG00000135776.wgt.RDat
> > >
> > > In an attempt to parallelize this, I did the following:
> > >
> > > library(parallel)
> > > library(data.table)
> > > library(foreach)
> > > library(doSNOW)
> > >
> > > n <- parallel::detectCores()
> > > cl <- parallel::makeCluster(n, type = "SOCK")
> > > doSNOW::registerDoSNOW(cl)
> > > files <- list.files("/WEIGHTS1/Retina", pattern=".RDat", ignore.case=T)
> > >
> > > lst_out <- foreach::foreach(i = seq_along(files),
> > >                             .packages = c("data.table") ) %dopar% {
> > >
> > >   a <- get(load(files[i]))
> > >   names <- rownames(a)
> > >   data <- data.table(names, a["blup"])
> > >   nm1 <- c("rsid", "ref_allele", "eff_allele")
> > >   data[, (nm1) := tstrsplit(names, ":")[-2]]
> > >   return(data[, .(rsid, weight = blup, ref_allele, eff_allele)][,
> > >               WGT := files[i]][])
> > > }
> > > parallel::stopCluster(cl)
> > >
> > > big_data <- rbindlist(lst_out)
> > >
> > > I am getting this Error:
> > >
> > > Error in { : task 7 failed - "object 'blup' not found"
> > > > parallel::stopCluster(cl)
> > >
> > > Can you please advise,
> > > Ana
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.