thr3ads.net - R help - [R] analyze amino acid sequence (composition)of proteins [Jun 2006]

If this information is useful, please help other people find it:
Share via:

ucecgxu at ucl.ac.uk

2006-Jun-18 10:04 UTC

[R] analyze amino acid sequence (composition)of proteins

Dear R-helpers:

thank your for your attention.

i am a newer to R and i am doing some protein category classification based on
the amino acid sequence.while i have some questions urgently.

1. any packages for analysis amino acid sequence

2. given two sequences "AAA" and "BBB",how can i combine
them into "AAABBB"

3. based on "AAABBB",how can i get some statistics of this string such
as how
many letters,how many "A"s in the string.


Thank you very much and i am looking forward to hearing from you soon

have a nice day!

Best Regards

Marshall

stubben

2006-Jun-19 16:48 UTC

head link

[R] analyze amino acid sequence (composition)of proteins

Have you checked the package seqinR at CRAN?  It may help.

Also, if you split a string into single characters using strsplit, then 
use table to count characters.


seq<-"ATGAAC"


table(strsplit(seq, ""))

A C G T
3 1 1 1




 > 3. based on "AAABBB",how can i get some statistics of this
string
such as how
 > many letters,how many "A"s in the string.
-- 

-----------------
Chris Stubben

Los Alamos National Lab
BioScience Division
MS M888
Los Alamos, NM 87545

Jean lobry

2006-Jun-19 18:53 UTC

head link

[R] analyze amino acid sequence (composition)of proteins

>i am a newer to R and i am doing some protein category classification based
on
>the amino acid sequence.while i have some questions urgently.
I'm not sure to understand exactly what you are trying to do, but if
this is a kind of clustering from the amino-acid composition of
proteins beware that the genomic G+C content is a major confounding
factor.
>1. any packages for analysis amino acid sequence
There are some utilities in the seqinR package to retrieve protein
sequences from various protein databases such as SwissProt. The
list of available databases (including nucleic databases) is given by:

   library(seqinr)
   choosebank()
  [1] "genbank"      "embl"         "emblwgs"     
"swissprot"    "ensembl"
  [6] "ensemblnew"   "emglib"       "nrsub"       
"nbrf"         "hobacnucl"
[11] "hobacprot"    "hovernucl"    "hoverprot"   
"hogennucl"    "hogenprot"
[16] "hoverclnu"    "hoverclpr"    "homolensprot"
"homolensnucl" "HAMAPnucl"
[21] "HAMAPprot"    "hoppsigen"    "nurebnucl"   
"nurebprot"    "taxobacgen"
[26] "greview"      "hogendnucl"   "hogendprot"  
"refseq"

More info with the infobank argument set to TRUE:

   choosebank(infobank = TRUE)[1:5,]
        bank status 
info
1   genbank     on             GenBank Rel. 154 (15 June 2006) Last 
Updated: Jun 18, 2006
2      embl     on                                   EMBL Library 
Release 86 (March 2006)
3   emblwgs     on           EMBL Whole Genome Shotgun sequences 
Release 86  (March 2006)
4 swissprot     on UniProt Rel. 7 (SWISS-PROT 49 + TrEMBL 32): Last 
Updated: Jun  6, 2006
5   ensembl     on 
Ensembl databases rel 24

There are also some tools to compute protein indices such as the
Kyte and Doolittle hydrophaty index, or the PI (Isoelectric point
of a protein). See the examples in this document:
http://pbil.univ-lyon1.fr/software/SeqinR/seqinr_1_0-4.pdf

Most protein indices are linear forms on amino acid
relative frequencies whose coefficients are available in the
AAindex (http://www.genome.jp/aaindex/).
>2. given two sequences "AAA" and "BBB",how can i combine
them into "AAABBB"
paste() is your friend:

   paste("AAA", "BBB", sep = "")
[1] "AAABBB"

Or define your own infix binary operator for convenience:

   "%+%" <- function(...) paste(..., sep = "")
   "AAA" %+% "BBB" %+% "CCC"
[1] "AAABBBCCC"

>3. based on "AAABBB",how can i get some statistics of this string
such as how
>many letters,how many "A"s in the string.
  table(s2c("AAABBB"))

A B
3 3

With a real protein sequence:

  prot <- read.fasta(File = system.file("sequences/seqAA.fasta", 
package = "seqinr"), seqtype = "AA")
  table(prot)
prot
  *  A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
  1  8  6  6 18  6  8  1  9 14 29  5  7 10  9 13 16  7  6  3  1

If you want the three letter code instead:

   table(prot) -> tmp
   names(tmp) <- aaa(names(tmp))
   tmp
Stp Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser 
Thr Val Trp Tyr
   1   8   6   6  18   6   8   1   9  14  29   5   7  10   9  13  16 
7   6   3   1

Try also:

AAstat(prot[[1]])

Hope this helps,

-- 
Jean R. Lobry            (lobry at biomserv.univ-lyon1.fr)
Laboratoire BBE-CNRS-UMR-5558, Univ. C. Bernard - LYON I,
43 Bd 11/11/1918, F-69622 VILLEURBANNE CEDEX, FRANCE
allo  : +33 472 43 12 87     fax    : +33 472 43 13 88
http://pbil.univ-lyon1.fr/members/lobry/

Apparently Analagous Threads

Search for more seemingly similar threads

R help - Jun 2006 - analyze amino acid sequence (composition)of proteins

[R] analyze amino acid sequence (composition)of proteins

[R] analyze amino acid sequence (composition)of proteins

[R] analyze amino acid sequence (composition)of proteins

Apparently Analagous Threads