ucecgxu at ucl.ac.uk
2006-Jun-18 10:04 UTC
[R] analyze amino acid sequence (composition)of proteins
Dear R-helpers: thank your for your attention. i am a newer to R and i am doing some protein category classification based on the amino acid sequence.while i have some questions urgently. 1. any packages for analysis amino acid sequence 2. given two sequences "AAA" and "BBB",how can i combine them into "AAABBB" 3. based on "AAABBB",how can i get some statistics of this string such as how many letters,how many "A"s in the string. Thank you very much and i am looking forward to hearing from you soon have a nice day! Best Regards Marshall
Have you checked the package seqinR at CRAN? It may help. Also, if you split a string into single characters using strsplit, then use table to count characters. seq<-"ATGAAC" table(strsplit(seq, "")) A C G T 3 1 1 1 > 3. based on "AAABBB",how can i get some statistics of this string such as how > many letters,how many "A"s in the string. -- ----------------- Chris Stubben Los Alamos National Lab BioScience Division MS M888 Los Alamos, NM 87545
>i am a newer to R and i am doing some protein category classification based on >the amino acid sequence.while i have some questions urgently.I'm not sure to understand exactly what you are trying to do, but if this is a kind of clustering from the amino-acid composition of proteins beware that the genomic G+C content is a major confounding factor.>1. any packages for analysis amino acid sequenceThere are some utilities in the seqinR package to retrieve protein sequences from various protein databases such as SwissProt. The list of available databases (including nucleic databases) is given by: library(seqinr) choosebank() [1] "genbank" "embl" "emblwgs" "swissprot" "ensembl" [6] "ensemblnew" "emglib" "nrsub" "nbrf" "hobacnucl" [11] "hobacprot" "hovernucl" "hoverprot" "hogennucl" "hogenprot" [16] "hoverclnu" "hoverclpr" "homolensprot" "homolensnucl" "HAMAPnucl" [21] "HAMAPprot" "hoppsigen" "nurebnucl" "nurebprot" "taxobacgen" [26] "greview" "hogendnucl" "hogendprot" "refseq" More info with the infobank argument set to TRUE: choosebank(infobank = TRUE)[1:5,] bank status info 1 genbank on GenBank Rel. 154 (15 June 2006) Last Updated: Jun 18, 2006 2 embl on EMBL Library Release 86 (March 2006) 3 emblwgs on EMBL Whole Genome Shotgun sequences Release 86 (March 2006) 4 swissprot on UniProt Rel. 7 (SWISS-PROT 49 + TrEMBL 32): Last Updated: Jun 6, 2006 5 ensembl on Ensembl databases rel 24 There are also some tools to compute protein indices such as the Kyte and Doolittle hydrophaty index, or the PI (Isoelectric point of a protein). See the examples in this document: http://pbil.univ-lyon1.fr/software/SeqinR/seqinr_1_0-4.pdf Most protein indices are linear forms on amino acid relative frequencies whose coefficients are available in the AAindex (http://www.genome.jp/aaindex/).>2. given two sequences "AAA" and "BBB",how can i combine them into "AAABBB"paste() is your friend: paste("AAA", "BBB", sep = "") [1] "AAABBB" Or define your own infix binary operator for convenience: "%+%" <- function(...) paste(..., sep = "") "AAA" %+% "BBB" %+% "CCC" [1] "AAABBBCCC">3. based on "AAABBB",how can i get some statistics of this string such as how >many letters,how many "A"s in the string.table(s2c("AAABBB")) A B 3 3 With a real protein sequence: prot <- read.fasta(File = system.file("sequences/seqAA.fasta", package = "seqinr"), seqtype = "AA") table(prot) prot * A C D E F G H I K L M N P Q R S T V W Y 1 8 6 6 18 6 8 1 9 14 29 5 7 10 9 13 16 7 6 3 1 If you want the three letter code instead: table(prot) -> tmp names(tmp) <- aaa(names(tmp)) tmp Stp Ala Cys Asp Glu Phe Gly His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr 1 8 6 6 18 6 8 1 9 14 29 5 7 10 9 13 16 7 6 3 1 Try also: AAstat(prot[[1]]) Hope this helps, -- Jean R. Lobry (lobry at biomserv.univ-lyon1.fr) Laboratoire BBE-CNRS-UMR-5558, Univ. C. Bernard - LYON I, 43 Bd 11/11/1918, F-69622 VILLEURBANNE CEDEX, FRANCE allo : +33 472 43 12 87 fax : +33 472 43 13 88 http://pbil.univ-lyon1.fr/members/lobry/