Hi all,

I have a SNP data set: the first column is the ID, and each subsequent pair of columns holds the two alleles for SNP1, SNP2, and so on. Each SNP has two columns. Based on the alleles I want to assign a genotype code:

  if the alleles are 1 1, then the genotype is 0
  if the alleles are 2 2, then the genotype is 1
  if the alleles are 1 2 or 2 1, then the genotype is 3

This is a sample data set, but the actual one has 13,000 SNPs (26,000 columns).

Geno data
AB95 1 1 2 2 2 2 2 2 1 1
AB82 2 2 2 2 2 2 2 2 2 2
AB95 2 1 2 2 2 2 2 2 1 1
AB59 1 1 2 2 1 2 1 2 1 2
AB32 2 1 2 2 2 2 2 2 1 2
AB46 2 1 2 2 1 2 1 1 2 2
AB61 1 1 2 2 1 2 1 2 1 2
AB32 2 2 1 2 2 2 2 2 1 2
AB35 2 2 1 2 2 2 2 2 2 2
AB43 2 2 1 2 2 2 2 2 2 2

Desired output
AB95 0 1 1 1 0
AB82 1 1 1 1 1
AB95 3 1 1 1 0
AB59 0 1 3 3 3
AB32 3 1 1 1 3
AB46 3 1 3 0 1
AB61 0 1 3 3 3
AB32 1 3 1 1 3
AB35 1 3 1 1 1
AB43 1 3 1 1 1

I would appreciate it if you could help me out here.
Thank you in advance
Hi Val,

There are probably more elegant ways to do it, but the following is fairly transparent:

# input data arranged as an array:
indat<-cbind(c(1,2,2,1),c(1,2,1,1),c(2,2,2,2),c(2,2,2,2),c(2,2,2,1),
  c(2,2,2,2),c(2,2,2,1),c(2,2,2,2),c(1,2,1,1),c(1,2,1,2))
indat

# output data has same number of rows and half as many columns
outdat<-array(dim=c(dim(indat)[1],dim(indat)[2]/2))
for (i in 1:dim(outdat)[2]){
  # each column of output = sum(two columns of input)
  outdat[,i]<-apply(indat[,(i-1)*2+1:2],MARGIN=1,FUN=sum)
}
outdat[outdat==2]<-0 # allele pairs that sum to 2 are genotype 0
outdat[outdat==4]<-1 # allele pairs that sum to 4 are genotype 1
# allele pairs that sum to 3 are genotype 3, so no need to change anything with them
outdat

# faster but a little more difficult to see what is going on:
# the array is a 0/1 matrix whose column j has 1s in rows 2j-1 and 2j
outdat<-indat %*% array(
  c(rep(c(rep(1,2),rep(0,dim(indat)[2])),dim(indat)[2]/2),1,1),
  dim=c(dim(indat)[2],dim(indat)[2]/2))
outdat[outdat==2]<-0
outdat[outdat==4]<-1
outdat

-Dan

On Thu, Feb 11, 2016 at 2:52 PM, Val <valkremk at gmail.com> wrote:
> [...]

--
Dan Dalthorp, PhD
USGS Forest and Rangeland Ecosystem Science Center
Forest Sciences Lab, Rm 189
3200 SW Jefferson Way
Corvallis, OR 97331
ph: 541-750-0953
ddalthorp at usgs.gov
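The second, faster snippet in the reply above multiplies the allele matrix by a 0/1 indicator matrix, so that each output column is the sum of one pair of input columns. The sketch below builds the same kind of indicator matrix more explicitly for the four-row sample. The use of `kronecker()` and the names `w`, `allele_sum`, and `geno` are assumptions made for illustration; they do not appear in the thread.

```r
# Sample input from the reply: 4 individuals, 5 SNPs (10 allele columns).
indat <- cbind(c(1,2,2,1), c(1,2,1,1), c(2,2,2,2), c(2,2,2,2), c(2,2,2,1),
               c(2,2,2,2), c(2,2,2,1), c(2,2,2,2), c(1,2,1,1), c(1,2,1,2))
n_snp <- ncol(indat) / 2

# Indicator matrix: column j has 1s in rows 2j-1 and 2j, 0s elsewhere.
# kronecker() is an illustrative substitute for the rep()/array() call.
w <- kronecker(diag(n_snp), matrix(1, nrow = 2, ncol = 1))

# Each output column is the sum of one pair of allele columns.
allele_sum <- indat %*% w

# Recode the sums: 2 -> 0 (alleles 1 1), 4 -> 1 (alleles 2 2), 3 stays 3.
geno <- allele_sum
geno[allele_sum == 2] <- 0
geno[allele_sum == 4] <- 1
geno
```

Printing `w` for the small example shows the block pattern directly, which may make the one-line `array(...)` construction easier to follow.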
Thank you very much Dan!

I want to go with the second one, because the data set is very large (> 25,000 columns and > 3,000 rows). The data are loaded as "testdat". Can you help me adapt the following code to it, please?

# faster but a little more difficult to see what is going on:
outdat<-indat %*% array(
  c(rep(c(rep(1,2),rep(0,dim(indat)[2])),dim(indat)[2]/2),1,1),
  dim=c(dim(indat)[2],dim(indat)[2]/2))
outdat[outdat==2]<-0
outdat[outdat==4]<-1
outdat

Thank you!

On Thu, Feb 11, 2016 at 5:58 PM, Dalthorp, Daniel <ddalthorp at usgs.gov> wrote:
> [...]
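One way the request above might be handled is sketched here, assuming `testdat` is a data frame whose first column is the ID and whose remaining columns are numeric alleles; the thread does not confirm that layout. This is not the matrix-product method from the reply: it adds the odd-numbered and even-numbered allele columns directly, because the indicator matrix for 26,000 columns would have 26,000 x 13,000 entries (roughly 2.7 GB as doubles). The small `testdat` built below and the names `odd`, `even`, `allele_sum`, and `result` are hypothetical.

```r
# Hypothetical stand-in for the real "testdat": ID first, then allele pairs.
testdat <- data.frame(
  ID = c("AB95", "AB82", "AB95", "AB59"),
  matrix(c(1,1,2,2,2,2,2,2,1,1,
           2,2,2,2,2,2,2,2,2,2,
           2,1,2,2,2,2,2,2,1,1,
           1,1,2,2,1,2,1,2,1,2), nrow = 4, byrow = TRUE)
)

# Drop the ID column and convert the alleles to a numeric matrix.
indat <- as.matrix(testdat[, -1])
stopifnot(ncol(indat) %% 2 == 0)  # every SNP needs two allele columns

# Add the first-allele columns (odd positions) to the second-allele
# columns (even positions); no large indicator matrix is needed.
odd  <- seq(1, ncol(indat), by = 2)
even <- seq(2, ncol(indat), by = 2)
allele_sum <- indat[, odd, drop = FALSE] + indat[, even, drop = FALSE]

# Recode the sums: 2 -> 0 (alleles 1 1), 4 -> 1 (alleles 2 2), 3 stays 3.
outdat <- allele_sum
outdat[allele_sum == 2] <- 0
outdat[allele_sum == 4] <- 1
colnames(outdat) <- paste0("SNP", seq_along(odd))

# Reattach the IDs.
result <- data.frame(ID = testdat[[1]], outdat)
result
```

If the alleles were read in as character or factor columns, they would need converting to numeric before the addition step.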