I'm pretty new to R, and I've been given a script by a user who wants
some help with it. I know enough about the way R works to know that
this is a very inefficient way to do what the user wants (the
LSB_JOBINDEX stuff is added by me so that this can work on many
hundreds of input data files as LSF jobs - it's the nested loops I'm
really interested in):
gtpfile<-paste("genotype_chunk.", sep="",
Sys.getenv("LSB_JOBINDEX"))
snps<-read.table(gtpfile,header=TRUE)
exp<-read.table("data.TXT",header=TRUE)
# the above two files have columns for individuals and snps (or genes)
for rows
# so the next two lines simply transpose these data matrices
tsnps<-t(snps)
texp<-t(exp)
sink(paste("output.", sep="",
Sys.getenv("LSB_JOBINDEX")))
#loops below are hardwired for 5 gene-expression levels (some genes
have two
#probes, and those are treated as separate genes for now) and 100 SNPs.
for (iexp in 1:5){
for (isnp in 1:100){
genotype<-factor(tsnps[,isnp])
# make sure there is more than one genotypic class before doing
ANOVA
if (length(unique(genotype))>1) {
expression<-texp[,iexp]
stuff<-anova(lm(expression ~ genotype))
qq=c(iexp,isnp)
print (qq)
print(stuff)
# plot(factor(tsnps[,isnp]),texp[,iexp], xlab="Genotype",
ylab="Expression
level")
}
}
}
sink()
Eventually, on the full data set, the size of the two data sets being
related to each other will be much larger - several hundred thousand
rows in snps, and several hundred rows in exp.
Clearly, the genotype<-factor line is being calculated repeatedly, and
in the wrong place; it should be done en masse up front once the data
has been read, and I'm sure both loops could actually be expressed some
other way.
It seems to me that I ought to be able to calculate all the factors in
a statement before the iexp loop, presumably by using apply() or
aggregate(), but apply() seems to coerce the result so the results
aren't factors any more, and I'm clearly missing something with regard
to the way aggregate() works; I really don't understand what I'd need
to set the 'by' parameter to.
I apologise if this is a hopelessly naive question, but I've got both
the "Introduction to R" and Peter's introductory statistics with R
book
here, and I'm having some difficulty finding the information I'm after.
"Introduction to R" doesn't mention the word "aggregate"
once... :-)
Thanks in advance,
Tim
--
Dr Tim Cutts
Informatics Systems Group, Wellcome Trust Sanger Institute
GPG: 1024D/E3134233 FE3D 6C73 BBD6 726A A3F5 860B 3CDD 3F56 E313 4233