Josh B
2009-Jan-19 04:06 UTC
[R] Deleting columns where the frequency of values are too disparate
Hello R-help community,

I have another question about filtering datasets.

Please consider the following "toy" data matrix example, called "x" for simplicity. There are 20 different individuals ("ID"), with information about the alleles (A, T, G, C) at six different loci ("Locus1" - "Locus6") for each of these 20 individuals. At any single locus (e.g., "Locus1" or "Locus2", ... or "Locus6"), the individuals have one of exactly two alleles, each drawn from the set A, T, C, G. For example, at Locus1 individuals have either the A or T allele only; at Locus2 the individuals can have either C or G only; at Locus3 the individuals can have either T or G only.

ID  Locus1  Locus2  Locus3  Locus4  Locus5  Locus6
1   A       G       T       A       A       C
2   A       G       G       A       C       C
3   A       C       G       G       C       C
4   A       C       G       G       C       C
5   A       G       G       G       A       C
6   T       G       G       G       C       C
7   T       C       G       G       C       C
8   T       C       G       G       A       C
9   T       G       G       G       C       C
10  T       C       G       G       C       C
11  A       G       G       G       A       C
12  A       C       G       G       C       C
13  A       G       G       G       C       C
14  A       G       G       G       A       C
15  A       C       G       G       C       C
16  T       C       G       G       C       C
17  T       G       G       G       A       C
18  T       G       G       G       C       C
19  T       G       G       G       C       C
20  T       C       G       G       A       C

I want to delete any columns from the dataset where the rarer of the two alleles has a frequency of ten percent or less. In other words, I would like to delete Locus3, Locus4, and Locus6 in this data matrix, because the frequency of the rare allele is not greater than ten percent (and conversely, the frequency of the common allele is not less than ninety percent). Please note that the frequency of the rare allele in Locus6 is equal to zero (conversely, the frequency of the common allele is equal to one hundred percent).

Would one of you know of a simple way to write this sort of code? (In my real dataset, there are 1096 loci, so this cannot be done easily "by eye.")

Thanks again in advance for any suggestions!

Josh B.
Richard.Cotton at hsl.gov.uk
2009-Jan-19 12:13 UTC
[R] Deleting columns where the frequency of values are too disparate
> I want to delete any columns from the dataset where the rarer of the
> two alleles has a frequency of ten percent or less. In other words,
> I would like to delete Locus3, Locus4, and Locus6 in this data
> matrix, because the frequency of the rare allele is not greater than
> ten percent (and conversely, the frequency of the common allele is
> not less than ninety percent).
> [...]
> Would one of you know of a simple way to write this sort of code? (In
> my real dataset, there are 1096 loci, so this cannot be done easily
> "by eye.")

Most of the problem is just organising the data into a sensible form.
# read in data (one individual per line)
data <- readLines(tc <- textConnection("1AGTAAC
2AGGACC
3ACGGCC
4ACGGCC
5AGGGAC
6TGGGCC
7TCGGCC
8TCGGAC
9TGGGCC
10TCGGCC
11AGGGAC
12ACGGCC
13AGGGCC
14AGGGAC
15ACGGCC
16TCGGCC
17TGGGAC
18TGGGCC
19TGGGCC
20TCGGAC")); close(tc)

# retrieve the useful bit (strip the leading ID digits)
loci <- sub("^[[:digit:]]{1,2}", "", data)

# you may also want this (the numeric ID at the start of each line;
# grep() would only return line positions, not the IDs themselves)
ID <- as.integer(sub("[^[:digit:]]+$", "", data))

# find out how many of each base occurs at each locus
freqs <- list()
n <- length(ID)
for(i in 1:6)
{
   assign(paste("locus", i, sep=""), factor(substring(loci,i,i),
      levels=c("A","C","G","T")))
   freqs[[i]] <- summary(get(paste("locus", i, sep="")))
}
freqs

# flag loci with 90% or more cases of the same base
# (>= rather than >, so that a rare allele at exactly 10% is also flagged)
loci.to.remove <- sapply(freqs, function(x) any(x >= 0.9*n))

Regards,
Richie.

Mathematical Sciences Unit
HSL
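
The reply above stops at working out which loci to drop. As a rough sketch of the remaining step, assuming the data are held as a character matrix with one column per locus (the names `rows`, `minor.freq`, `maf` and `x.filtered` are made up for this illustration and do not come from the thread), the filtering could look something like this. It also assumes there are no missing genotypes and at most two alleles per locus, as in the toy data.

```r
# Toy genotypes from the original post, one string per individual (IDs 1 to 20)
rows <- c("AGTAAC", "AGGACC", "ACGGCC", "ACGGCC", "AGGGAC",
          "TGGGCC", "TCGGCC", "TCGGAC", "TGGGCC", "TCGGCC",
          "AGGGAC", "ACGGCC", "AGGGCC", "AGGGAC", "ACGGCC",
          "TCGGCC", "TGGGAC", "TGGGCC", "TGGGCC", "TCGGAC")

# Split each string into single characters: a 20 x 6 character matrix
x <- do.call(rbind, strsplit(rows, ""))
colnames(x) <- paste("Locus", 1:6, sep = "")
rownames(x) <- 1:20

# Frequency of the rarer allele at one locus
# (hypothetical helper; returns 0 when only one allele is present)
minor.freq <- function(column) {
  counts <- table(column)
  if (length(counts) < 2) return(0)
  min(counts) / length(column)
}

maf <- apply(x, 2, minor.freq)

# Keep only loci whose rarer allele is strictly above ten percent
x.filtered <- x[, maf > 0.1, drop = FALSE]
colnames(x.filtered)
```

Because the function is applied column by column, the same code should carry over to a matrix with 1096 loci without a hand-written loop. `drop = FALSE` keeps the result a matrix even if only one locus survives the filter.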