kathleen askland
2013-Jul-02 20:46 UTC
[R] Recoding variables based on reference values in data frame
I'm new to R (previously used SAS primarily) and I have a genetics data frame consisting of genotypes for each of 300+ subjects (ID1, ID2, ID3, ...) at 3000+ genetic locations (SNP1, SNP2, SNP3...). A small subset of the data is shown below:

SNP_ID      SNP1  SNP2  SNP3  SNP4
Maj_Allele  C     G     C     A
Min_Allele  T     A     T     G
ID1         CC    GG    CT    AA
ID2         CC    GG    CC    AA
ID3         CC    GG    nc    AA
ID4         _     _     _     _
ID5         CC    GG    CC    AA
ID6         CC    GG    CC    AA
ID7         CC    GG    CT    AA
ID8         _     _     _     _
ID9         CT    GG    CC    AG
ID10        CC    GG    CC    AA
ID11        CC    GG    CT    AA
ID12        _     _     _     _
ID13        CC    GG    CC    AA

The name of the data file is Kgeno.

What I would like to do is recode all of the genotype values to standard integer notation, based on their values relative to the reference rows (Maj_Allele and Min_Allele). Standard notation sums the total of minor alleles in the genotype, so values can be 0, 1 or 2.

Here are the changes I want to make:
1. If the genotype = "nc" or "_" then set equal to NA.
2. If genotype value = a character string comprised of two consecutive major allele values -- c(Maj_Allele, Maj_Allele) -- then set equal to 0.
3. If genotype value = c(Maj_Allele, Min_Allele) then set equal to 1.
4. If genotype value = c(Min_Allele, Min_Allele) then set equal to 2.

I've tried the following ifelse processing but get an error (Warning: Executed script did not end with R session at the top-level prompt. Top-level state will be restored) and can't seem to fix the code properly. I've counted the parentheses. Also, I'm not sure if it would execute properly if I could fix it.

# change 'nc' and '_' to NA, else leave as is:
Kgeno[,2] <- ifelse(Kgeno[,2] == "nc", "NA", Kgeno[,2])
Kgeno[,2] <- ifelse(Kgeno[,2] == "_", "NA", Kgeno[,2])

# convert genotype strings in the first data column to numeric values
# (two major alleles=0, 1 minor and 1 major=1, 2 minor alleles=2), else
# leave as is (to preserve NA values).

Kgeno[,2] <-
ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[1,2]), as.character(Kgeno[1,2]), sep=""), 0,
ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[1,2]), as.character(Kgeno[2,2]), sep=""), 1,
ifelse(Kgeno[,2] == noquote(paste(as.character(Kgeno[2,2]), as.character(Kgeno[2,2]), sep=""), 2,
Kgeno[,2])))

Finally, if the above code were corrected, this would only change the first column of data, but I would like to change all 3000+ columns in the same way.

I would greatly appreciate some suggestions on how to proceed.

Thank you,

Kathleen

---
Kathleen Askland, MD
Assistant Professor
Department of Psychiatry & Human Behavior
The Warren Alpert School of Medicine
Brown University/Butler Hospital
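For reference, the nested ifelse() attempt above can be made to run for a single column. The following is only a minimal sketch on a toy copy of the SNP1 column, not a complete solution: each noquote(paste(...)) call in the original is missing its closing parenthesis, so the sketch uses paste0() and closes every call before the next argument, and it assigns the NA constant rather than the string "NA". The variable names maj, min_allele, geno and coded are introduced here for illustration.

```r
# Toy copy of the first data column; rows 1-2 hold the major and minor allele.
Kgeno <- data.frame(
  SNP_ID = c("Maj_Allele", "Min_Allele", "ID1", "ID3", "ID4", "ID9"),
  SNP1   = c("C", "T", "CC", "nc", "_", "CT"),
  stringsAsFactors = FALSE
)

maj        <- Kgeno[1, 2]          # major allele for this SNP
min_allele <- Kgeno[2, 2]          # minor allele for this SNP
geno       <- Kgeno[-(1:2), 2]     # genotype rows only

# Use the NA constant, not the character string "NA".
geno[geno %in% c("nc", "_")] <- NA

# Every paste0() call is closed before the next ifelse() argument begins.
coded <- ifelse(geno == paste0(maj, maj), 0L,
         ifelse(geno == paste0(maj, min_allele), 1L,
         ifelse(geno == paste0(min_allele, min_allele), 2L, NA_integer_)))
coded
```

Because a comparison with NA yields NA, ifelse() passes the missing genotypes through as NA without a separate branch. This still handles one column only.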
Rui Barradas
2013-Jul-02 21:15 UTC
[R] Recoding variables based on reference values in data frame
Hello,

I'm not sure I understood, but try the following.

Kgeno <- read.table(text = "
SNP_ID SNP1 SNP2 SNP3 SNP4
Maj_Allele C G C A
Min_Allele T A T G
ID1 CC GG CT AA
ID2 CC GG CC AA
ID3 CC GG nc AA
ID4 _ _ _ _
ID5 CC GG CC AA
ID6 CC GG CC AA
ID7 CC GG CT AA
ID8 _ _ _ _
ID9 CT GG CC AG
ID10 CC GG CC AA
ID11 CC GG CT AA
ID12 _ _ _ _
ID13 CC GG CC AA
", header = TRUE, stringsAsFactors = FALSE)

fun <- function(x){
    x[x %in% c("nc", "_")] <- NA
    MM <- paste0(x[1], x[1])  # Major Major
    Mm <- paste0(x[1], x[2])  # Major minor
    mm <- paste0(x[2], x[2])  # minor minor
    x[x == MM] <- 0
    x[x == Mm] <- 1
    x[x == mm] <- 2
    x  # still a character vector: the codes are stored as "0", "1", "2"
}

# the two reference rows (Maj_Allele, Min_Allele) are passed through unchanged
Kgeno[, -1] <- sapply(Kgeno[, -1], fun)
Kgeno

Also, the best way to post data is by using ?dput.

dput(head(Kgeno[, 1:5], 30))  # post the output of this

Hope this helps,

Rui Barradas
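If integer codes without the two reference rows are wanted, one possible variant of fun is sketched below. It is not Rui's method: it swaps the repeated x[x == ...] assignments for a named lookup vector, and it uses vapply() so the result is guaranteed to be an integer matrix. The helper name recode_snp and the toy data are assumptions made for this illustration.

```r
# Toy data in the same layout as the thread: rows 1-2 are the reference alleles.
Kgeno <- data.frame(
  SNP_ID = c("Maj_Allele", "Min_Allele", "ID1", "ID3", "ID4", "ID9"),
  SNP1   = c("C", "T", "CC", "CC", "_", "CT"),
  SNP3   = c("C", "T", "CT", "nc", "_", "CC"),
  stringsAsFactors = FALSE
)

# recode_snp() is a hypothetical helper, not part of any package.
recode_snp <- function(x) {
  maj  <- x[1]
  mnr  <- x[2]
  geno <- x[-(1:2)]                       # drop the two reference rows
  codes <- c(0L, 1L, 2L)
  names(codes) <- c(paste0(maj, maj), paste0(maj, mnr), paste0(mnr, mnr))
  # Indexing by name returns NA for anything not in the table ("nc", "_", ...).
  unname(codes[geno])
}

coded <- vapply(Kgeno[, -1], recode_snp, integer(nrow(Kgeno) - 2))
rownames(coded) <- Kgeno$SNP_ID[-(1:2)]
coded
```

The result has one row per subject and one column per SNP, so it can be passed to numeric routines directly.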
Hi,

Maybe this helps:

Kgeno <- read.table(text="
SNP_ID SNP1 SNP2 SNP3 SNP4
Maj_Allele C G C A
Min_Allele T A T G
ID1 CC GG CT AA
ID2 CC GG CC AA
ID3 CC GG nc AA
ID4 _ _ _ _
ID5 CC GG CC AA
ID6 CC GG CC AA
ID7 CC GG CT AA
ID8 _ _ _ _
ID9 CT GG CC AG
ID10 CC GG CC AA
ID11 CC GG CT AA
ID12 _ _ _ _
ID13 CC GG CC AA
",sep="",header=TRUE,stringsAsFactors=FALSE)

library(stringr)
library(car)

fun1 <- function(x){
    MajMin <- paste0(x[1],x[2])
    MajMaj <- str_dup(x[1],2)
    MinMin <- str_dup(x[2],2)
    recode(x,"'nc'=NA;'_'=NA;MajMaj=0;MajMin=1;MinMin=2")}
sapply(Kgeno[,-1],fun1)

#or
# rows of mat1: MajMaj, MajMin, MinMin for each SNP (used as factor levels)
mat1 <- sapply(Kgeno[1:2,-1],function(x) {c(str_dup(x,2),paste(x,collapse=""))})[c(1,3,2),]
sapply(seq_len(ncol(Kgeno[,-1])),function(i) {x <- Kgeno[-c(1:2),-1][,i]; as.numeric(factor(x,levels=mat1[,i]))-1})

#Speed comparison
KgenoNew <- rbind(Kgeno[c(1:2),-1],sapply(Kgeno[-c(1:2),-1],rep,1e4))
system.time(res1 <- sapply(KgenoNew,fun1))
#   user  system elapsed
#  0.672   0.000   0.674

system.time({
mat1 <- sapply(Kgeno[1:2,-1],function(x) {c(str_dup(x,2),paste(x,collapse=""))})[c(1,3,2),]
res2 <- sapply(seq_len(ncol(KgenoNew)),function(i){
    x <- KgenoNew[-c(1:2),][,i]; as.numeric(factor(x,levels=mat1[,i]))-1})
})
#   user  system elapsed
#  0.212   0.000   0.214

res1New <- res1[-c(1:2),]
res1New1 <- as.numeric(res1New)
dim(res1New1) <- dim(res1New)
identical(res1New1,res2)
#[1] TRUE

A.K.
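The second, faster approach above relies on factor levels: if the levels are given in the order major-major, major-minor, minor-minor, the integer codes of the factor are 1, 2, 3, and subtracting 1 gives the minor-allele count. Any value that is not a level becomes NA. A minimal base-R sketch of just that step for one SNP follows; it uses strrep() (base R 3.3.0 or later) in place of str_dup(), and the "TC" entry is a hypothetical value added here to show what happens to a heterozygote written in the opposite order.

```r
maj <- "C"   # major allele for this SNP
mnr <- "T"   # minor allele for this SNP

# "TC" does not occur in the posted data; it is included only as an illustration.
geno <- c("CC", "CT", "TC", "nc", "_", "TT")

# Levels in the order MajMaj, MajMin, MinMin map to codes 1, 2, 3.
lev <- c(strrep(maj, 2), paste0(maj, mnr), strrep(mnr, 2))

# Values that are not among the levels ("TC", "nc", "_") become NA.
coded <- as.integer(factor(geno, levels = lev)) - 1L
coded
```

If reversed heterozygotes can occur in the real data, they would need to be normalised before this step, since only the exact string MajMin is matched.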