thr3ads.net - R help - [R] Calculating symbol (letter) frequencies [Jan 2005]

Wollenberg, Kurt R

2005-Jan-03 17:01 UTC

[R] Calculating symbol (letter) frequencies

Hello:

I am attempting to use R to analyze amino acid frequencies in aligned
protein sequences and need some help. So far, I have imported my sequence
alignment into a data frame (let's call it "alignment") with each site in
one column, so that I have a data frame consisting of columns of letters
(the 21 amino acid symbols plus "-"), with the row names being the
corresponding protein names. summary(alignment) gives me the counts of
symbols in each column.

Now I would like to convert these counts into frequencies and perform
calculations on these frequencies. Do I need to place the individual
elements of summary(alignment) in a separate data frame to perform
calculations on them? What I've been thinking of doing is creating a data
frame of symbol frequencies, with each column corresponding to a column in
the sequence alignment. If it makes sense to do this, how do I extract
these data into a data frame so that I can perform further analyses on
these frequencies?

I've tried

DF1 <- data.frame(alignment, row.names=AA)

where AA is a character vector of amino acid symbols plus "-", but the
error message tells me that the "row names supplied are of the wrong
length". As not all of the symbols are present in each column of
"alignment", this makes some sense to me, since each
summary(alignment[[i]]) varies in length. Also, I would need to match up
the individual symbol entries in each summary(alignment[[i]]) with the
corresponding row in the new data frame (which I believe can be done
efficiently with indexing, but I can't put my finger on an appropriate
example of how to do this). I have looked at the package Biostrings on the
Bioconductor site, but it doesn't appear to work with amino acid sequence
alignments.

So my questions to the R-help community are:

Can I do various statistical calculations on and using
summary(alignment[[i]]), or do I need a separate data frame?

If I should be using a separate data frame for symbol frequencies, how do
I extract these from the data?

Should I try to extract this from summary, or is there a more efficient
way to calculate symbol frequencies?

Thanks,
Kurt Wollenberg, PhD
Tufts Center for Vision Research 
New England Medical Center
750 Washington St, Box 450 
Boston, MA, USA
kwollenberg at tufts-nemc.org 
617-636-8945 (Fax)
617-636-9028 (Lab)

The most exciting phrase to hear in science, the one that heralds new
discoveries, is not "Eureka!" (I found it!) but "That's funny ..."
--Isaac Asimov



Berton Gunter

2005-Jan-03 18:14 UTC


My best advice:

Yours is a complicated question, but it is probably the wrong question. I
suspect that you are trying to reinvent wheels: I believe there has been a
lot of work in the statistical community on sequence alignment (OK, maybe
for DNA rather than amino acids). Undoubtedly much more has been published
in the biological/bioinformatics literature that I'm unaware of. I think
you should collaborate with a statistician at your institution to help you
access that literature and its methods, which are almost certainly
implementable and probably implemented within R. Perhaps even on
Bioconductor (despite your lack of success there thus far).

My far poorer advice:

See ?lapply and relatives, as well as ?table and the links therein. Also
?factor may be of interest.

lapply(alignment,function(x)table(x)/length(x)) 

will return a list whose length is the number of sites (columns); each
component is a table of the relative frequencies of the letters at the
respective site. You can then further process these results as you like.
Is this the sort of thing that you want?
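
If you want every site to report the same set of symbols, so that the
results line up row by row, one possibility is to fix the factor levels
before tabulating. What follows is only a minimal sketch under stated
assumptions: the toy "alignment" data frame and the object names freqs and
freq_df are made up for illustration, and AA here holds the 20 standard
amino acid letters plus "-", so extend it to whatever symbols your
alignment really uses.

```r
# Hypothetical toy alignment: 4 proteins (rows) by 3 sites (columns).
alignment <- data.frame(
  site1 = c("A", "A", "G", "-"),
  site2 = c("L", "L", "L", "L"),
  site3 = c("K", "R", "K", "-"),
  row.names = c("prot1", "prot2", "prot3", "prot4"),
  stringsAsFactors = FALSE
)

# Assumed symbol set: the 20 standard amino acids plus the gap character.
# Symbols missing from AA would become NA and drop out of the table.
AA <- c("A", "R", "N", "D", "C", "Q", "E", "G", "H", "I",
        "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V", "-")

# Fixing the factor levels makes table() report every symbol, including
# those with zero counts, so each site yields a vector of the same length.
freqs <- sapply(alignment, function(x) {
  counts <- table(factor(x, levels = AA))
  counts / length(x)
})

# freqs is a matrix with one row per symbol and one column per site;
# convert it if a data frame is more convenient for later analyses.
freq_df <- as.data.frame(freqs)
```

Because every column is tabulated against the same levels, the result has
one row per symbol in AA, and something like freqs["A", "site1"] pulls out
a single frequency by name. This also suggests why the data.frame() call
with row.names=AA failed: row.names must have one entry per row of
"alignment" (one per protein), not one per symbol.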

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
 
"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box
 
 
