Josh B
2009-Nov-09 21:09 UTC
[R] Using something like the "by" command, but on rows instead of columns
Hello R Forum users,

I was hoping someone could help me with the following problem. Consider the following "toy" dataset:

Accession  SNP_CRY2  SNP_FLC  Phenotype
1          NA        A        0.783143079
2          BQ        A        0.881714811
3          BQ        A        0.886619488
4          AQ        B        0.416893034
5          AQ        B        0.621392903
6          AS        B        0.031719125
7          AS        NA       0.652375037

"Accession" = individual plants, arbitrarily identified by unique numbers.
"SNP_" = individual genes.
"SNP_CRY2" = the CRY2 gene. The plants either have the BQ, AQ, or AS genotype at the CRY2 gene. "NA" = missing data.
"SNP_FLC" = the FLC gene. The plants either have the A or B genotype at the FLC gene. "NA" = missing data.
"Phenotype" = a continuous variable of interest.

I have a much larger number of columns corresponding to genes (i.e., more columns with the "SNP_" prefix) in my real dataset. For each gene in turn (i.e., each "SNP_" column), I would like to find the phenotypic variance for all of the plants with non-missing data. Note that the plants with missing genotype data ("NA") differ for each gene (each "SNP_" column).

Would one of you be able to offer some specific code that could do this operation? Please rest assured that I am not a student trying to elicit help with a homework assignment. I am a post-doc with limited R skills, working with a large genetic dataset.

Thanks very much in advance to a wonderful online community.

Sincerely,
Josh
David Freedman
2009-Nov-09 21:52 UTC
[R] Using something like the "by" command, but on rows instead of columns
Some variation of the following might be what you want:

df <- data.frame(sex = sample(1:2, 100, replace = TRUE), snp.1 = rnorm(100), snp.15 = runif(100))
df$snp.1[df$snp.1 > 1.0] <- NA   # put some missing values into the data
x <- grep('^snp', names(df)); x  # indices of the columns whose names begin with 'snp'
apply(df[, x], 2, summary)
# or
apply(df[, x], 2, FUN = function(x) mean(x, na.rm = TRUE))

hth,
david
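Applied to the toy dataset from the question, the same grep-then-loop idea might look like the sketch below. It is only one possible reading of the request: it assumes "phenotypic variance" means the variance of the Phenotype column over the plants whose genotype is not NA for a given SNP_ column. The object names dat, snp_cols, pheno_var and n_used are made up for illustration and do not come from the original posts.

```r
# Toy data from the original question (NA = missing genotype)
dat <- data.frame(
  Accession = 1:7,
  SNP_CRY2  = c(NA, "BQ", "BQ", "AQ", "AQ", "AS", "AS"),
  SNP_FLC   = c("A", "A", "A", "B", "B", "B", NA),
  Phenotype = c(0.783143079, 0.881714811, 0.886619488, 0.416893034,
                0.621392903, 0.031719125, 0.652375037),
  stringsAsFactors = FALSE
)

# Names of all columns that start with the "SNP_" prefix
snp_cols <- grep("^SNP_", names(dat), value = TRUE)

# For each SNP column, variance of Phenotype over plants with a
# non-missing genotype at that gene (assumed interpretation)
pheno_var <- sapply(dat[snp_cols], function(genotype) {
  var(dat$Phenotype[!is.na(genotype)])
})

# Number of plants that contributed to each variance
n_used <- sapply(dat[snp_cols], function(genotype) sum(!is.na(genotype)))

print(pheno_var)
print(n_used)
```

If the variance is wanted separately for each genotype class within a gene instead, something like tapply(dat$Phenotype, dat$SNP_CRY2, var) may be closer to the "by"-style behaviour mentioned in the subject line; tapply leaves out rows whose grouping value is NA.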