thr3ads.net - R help - [R] Deleting columns where the frequency of values are too disparate [Jan 2009]

Josh B

2009-Jan-19 04:06 UTC

[R] Deleting columns where the frequency of values are too disparate

Hello R-help community,

I have another question about filtering datasets.

Please consider the following "toy" data matrix, called "x" for simplicity.
There are 20 individuals ("ID"), with information about the alleles
(A, T, G, C) at six different loci ("Locus1" - "Locus6") for each of them.
At any single locus (e.g., "Locus1" or "Locus2", ... or "Locus6"), each
individual carries one of two alleles from the set A, T, C, G. For example,
at Locus1 individuals have either the A or T allele only; at Locus2 they have
either C or G only; at Locus3 they have either T or G only.

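ID  Locus1  Locus2  Locus3  Locus4  Locus5  Locus6
1   A       G       T       A       A       C
2   A       G       G       A       C       C
3   A       C       G       G       C       C
4   A       C       G       G       C       C
5   A       G       G       G       A       C
6   T       G       G       G       C       C
7   T       C       G       G       C       C
8   T       C       G       G       A       C
9   T       G       G       G       C       C
10  T       C       G       G       C       C
11  A       G       G       G       A       C
12  A       C       G       G       C       C
13  A       G       G       G       C       C
14  A       G       G       G       A       C
15  A       C       G       G       C       C
16  T       C       G       G       C       C
17  T       G       G       G       A       C
18  T       G       G       G       C       C
19  T       G       G       G       C       C
20  T       C       G       G       A       C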

I want to delete any columns from the dataset where the rarer of the two alleles
has a frequency of ten percent or less. In other words, I would like to delete
Locus3, Locus4, and Locus6 in this data matrix, because the frequency of the
rare allele is not greater than ten percent (and conversely, the frequency of
the common allele is not less than ninety percent). Please note that the
frequency of the rare allele in Locus6 is equal to zero (conversely, the
frequency of the common allele is equal to one hundred percent).

Would one of you know of a simple way to write this sort of code? (In my real
dataset, there are 1096 loci, so this cannot be done easily "by eye.")

Thanks again in advance for any suggestions!
Josh B.


      

Richard.Cotton at hsl.gov.uk

2009-Jan-19 12:13 UTC


> Please consider the following "toy" data matrix example, called "x"
> for simplicity. There are 20 different individuals ("ID"), with
> information about the alleles (A, T, G, C) at six different loci
> ("Locus1" - "Locus6") for each of these 20 individuals. At any
> single locus (e.g., "Locus1" or "Locus2", ... or "Locus6"), the
> individuals have either one allele (from the set of A, T, C, G) or one
> other allele (from the set of A, T, C, G). For example, at Locus1
> individuals have either the A or T allele only; at Locus2 the
> individuals can have either C or G only; at Locus3 the individuals
> can have either T or G only.
> 
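> ID  Locus1  Locus2  Locus3  Locus4  Locus5  Locus6
> 1   A       G       T       A       A       C
> 2   A       G       G       A       C       C
> 3   A       C       G       G       C       C
> 4   A       C       G       G       C       C
> 5   A       G       G       G       A       C
> 6   T       G       G       G       C       C
> 7   T       C       G       G       C       C
> 8   T       C       G       G       A       C
> 9   T       G       G       G       C       C
> 10  T       C       G       G       C       C
> 11  A       G       G       G       A       C
> 12  A       C       G       G       C       C
> 13  A       G       G       G       C       C
> 14  A       G       G       G       A       C
> 15  A       C       G       G       C       C
> 16  T       C       G       G       C       C
> 17  T       G       G       G       A       C
> 18  T       G       G       G       C       C
> 19  T       G       G       G       C       C
> 20  T       C       G       G       A       C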
> 
> I want to delete any columns from the dataset where the rarer of the
> two alleles has a frequency of ten percent or less. In other words, 
> I would like to delete Locus3, Locus4, and Locus6 in this data 
> matrix, because the frequency of the rare allele is not greater than
> ten percent (and conversely, the frequency of the common allele is 
> not less than ninety percent). Please note that the frequency of the
> rare allele in Locus6 is equal to zero (conversely, the frequency of
> the common allele is equal to one hundred percent).
> 
> Would one of you know of a simple way to write this sort of code? (In
> my real dataset, there are 1096 loci, so this cannot be done easily
> "by eye.")

Most of the problem is just organising the data into a sensible form.

# read in data
data <- readLines(tc <- textConnection("1AGTAAC
2AGGACC
3ACGGCC
4ACGGCC
5AGGGAC
6TGGGCC
7TCGGCC
8TCGGAC
9TGGGCC
10TCGGCC
11AGGGAC
12ACGGCC
13AGGGCC
14AGGGAC
15ACGGCC
16TCGGCC
17TGGGAC
18TGGGCC
19TGGGCC
20TCGGAC")); close(tc)

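# retrieve the useful bit (strip the leading ID digits)
loci <- sub("^[[:digit:]]+", "", data)

# you may also want this (the numeric ID at the start of each line;
# grep() would only return the positions of the matching lines)
ID <- as.integer(sub("[[:alpha:]]+$", "", data))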

# find out how many of each base occurs at each locus
freqs <- list()
n <- length(ID)
for(i in 1:6)
{
   # i-th character of every string, as a factor over the four bases
   locus <- factor(substring(loci, i, i), levels=c("A","C","G","T"))
   freqs[[i]] <- summary(locus)
}
names(freqs) <- paste("Locus", 1:6, sep="")
freqs

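# flag loci with 90% or more cases of same base
# (>= rather than >, so that Locus4, where G occurs in exactly 18 of 20
# individuals, is flagged as well)
loci.to.remove <- sapply(freqs, function(x) any(x >= 0.9*n))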
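
To go from a logical vector like loci.to.remove to an actual filtered dataset, something along these lines should work. This is only a minimal sketch: it assumes the alleles are held in a data frame with one column per locus, and the names x, major.freq, threshold and x.filtered are illustrative rather than taken from the code above.

```r
# Toy data from the original post: one string of six bases per individual
loci <- c("AGTAAC", "AGGACC", "ACGGCC", "ACGGCC", "AGGGAC",
          "TGGGCC", "TCGGCC", "TCGGAC", "TGGGCC", "TCGGCC",
          "AGGGAC", "ACGGCC", "AGGGCC", "AGGGAC", "ACGGCC",
          "TCGGCC", "TGGGAC", "TGGGCC", "TGGGCC", "TCGGAC")

# Split each string into single bases:
# one row per individual, one column per locus
x <- as.data.frame(do.call(rbind, strsplit(loci, "")),
                   stringsAsFactors = FALSE)
names(x) <- paste("Locus", 1:6, sep = "")

# Proportion of individuals carrying the most common base at each locus
major.freq <- sapply(x, function(column) max(table(column)) / length(column))

# Keep only loci where the common allele is below 90%,
# i.e. the rare allele is above 10% (cutoff assumed from the question)
threshold <- 0.9
x.filtered <- x[, major.freq < threshold, drop = FALSE]

names(x.filtered)
# [1] "Locus1" "Locus2" "Locus5"
```

Because the test is applied column by column with sapply, the same lines should carry over to the full set of 1096 loci without a hard-coded loop bound; drop = FALSE keeps the result a data frame even if only one locus survives.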

Regards,
Richie.

Mathematical Sciences Unit
HSL


