On Mon, Sep 21, 2009 at 8:07 AM, Anne-Marie Ternes
<amternes@gmail.com> wrote:
> Dear mailing list,
>
> I'm stuck with a tricky problem here - at least it seems tricky to me,
> being not really talented in pattern matching and regex matters.
>
> I'm analysing amino acid mutations by position and type of mutation.
> E.g. (fictitious example) in position 92, I can find L92V, L92MV,
> L92I... L is in this example the wild-type amino-acid, and everything
> behind the position number is a mutation (single amino acid or
> mixture). I'm only interested in the mutation information, so:
>
> Say I've got this vector:
> bla <- c("V", "MV", "I", "IL", "PT", "M", "E", "OM")
>
> I'd like to count only those elements that are "truly unique"
> mutations, i.e. count "V", "MV" as 1, "I", "IL" as 1, "PT" as 1, "M" as
> 1, "E" as 1, not count "OM".
>
> I could do it iteratively:
> Element 1: V. Keep.
> Element 2: MV. Match Keep vs New -> 1. I got already a V, so don't count.
> Element 3: I. Match Keep vs New -> 0. I is new, keep. Keep = V,I
> Element 4: IL. Match Keep vs New -> 1. I got already an I, so don't count.
> Element 5: PT. Match Keep vs New -> 0. PT is new, keep. Keep = V,I,PT
> Element 6: M: Match Keep vs New -> 0. M is new, keep. Keep = V,I,PT,M
> Element 7: E. Match Keep vs New -> 0. E is new, keep. Keep = V,I,PT,M,E
> Element 8: OM. Match Keep vs New -> 1. I got already M, so don't count.
>
> Keep vector= (V,I,PT,M,E), count =5
>
> OK. There must be a more elegant way to do this! Something with
> vector-wise pattern matching or so?... By the way, I don't care e.g.
> which of "V" or "MV" is counted, what is important is that they are
> only counted as 1.
>
> Thanks for your help!
>
> Anne-Marie
>
>
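For reference, the iterative procedure described in the question can be
written down almost literally. This is only a sketch under one reading of
"Match Keep vs New": an element is skipped when a mutation that was already
kept occurs inside it as a substring. The names acids and keep are chosen
here for illustration.

```r
acids <- c("V", "MV", "I", "IL", "PT", "M", "E", "OM")

keep <- character(0)
for (a in acids) {
  # TRUE if any mutation kept so far occurs literally inside the new element
  seen <- any(vapply(keep, function(k) grepl(k, a, fixed = TRUE), logical(1)))
  if (!seen) {
    keep <- c(keep, a)
  }
}

keep          # the mutations that were kept
length(keep)  # the number of "truly unique" mutations
```

Note that under this reading the result depends on the order of the
elements: if "MV" came before "V", both would be kept.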
I'm on my first cup of caffeinated beverage today, so I don't know how
helpful I will be, but I'll give it a shot. I would approach this problem
by:
1. Creating a function that uses grep to search the vector of acids for the
elements that match a certain letter or combination. This function would
return 1 if any matches are found and 0 if none are. The test for any
matching mutations would be done by the appropriately named any() function.
2. Using an apply function to execute the matching function for each
possibility I want to search for.
Here's an example for your case:
# Your data
acids <- c("V", "MV", "I", "IL", "PT", "M", "E", "OM")
# The letters you are interested in
to.count <- c('V','I','PT','M')
counts <- sapply( to.count, function( to.match ){
  did.match <- grep( to.match, acids )
  if( any( did.match ) ){
    return(1)
  }else{
    return(0)
  }
})
# The result
counts
V I PT M
1 1 1 1
If TRUE/FALSE answers would suffice, you could shorten the above code a
little by just returning the value of any():
counts <- sapply( to.count, function( to.match ){
  did.match <- grep( to.match, acids )
  return( any( did.match ) )
})
counts
V I PT M
TRUE TRUE TRUE TRUE
Actually, you could use as.integer() to achieve the same thing and get 1s
and 0s (sorry, I ramble a lot in the early morning.)
counts <- sapply( to.count, function( to.match ){
  did.match <- grep( to.match, acids )
  return( as.integer( any( did.match ) ) )
})
Here's a function that packs the above code up nicely:
countMutations <- function( acids, to.count ){
  count <- sapply( to.count, function(to.match){
    did.match <- grep(to.match, acids)
    return(as.integer(any(did.match)))
  })
  return(count)
}
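Since the original question asked for a single count, one way to get it is
to sum the 0/1 values. The snippet below is a sketch, not a definitive
version: it restates the helper so it runs on its own, swaps grep() for
grepl() (which returns TRUE/FALSE per element instead of indices), and adds
'E' to the patterns because the question counts it too. The names hits and
total are made up for illustration.

```r
# Restated helper; grepl() is used here instead of grep()
countMutations <- function(acids, to.count) {
  sapply(to.count, function(to.match) {
    as.integer(any(grepl(to.match, acids)))
  })
}

acids    <- c("V", "MV", "I", "IL", "PT", "M", "E", "OM")
to.count <- c("V", "I", "PT", "M", "E")

hits  <- countMutations(acids, to.count)  # one 0/1 value per pattern
total <- sum(hits)                        # number of patterns that were found
total
```

For this data every pattern is found, so total is 5, the same count the
question arrives at by hand.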
There is one problem with the above method that I can think of: if 'M' and
'OM' were missing from your data, 'M' would still be counted, because the
pattern 'M' also matches 'MV':
# Data set without M and OM
acids <- c("V", "MV", "I", "IL", "PT", "E")
countMutations( acids, to.count )
# Doh! M still counted...
V I PT M
1 1 1 1
The remedy is to add a little regex pixie dust to the to.count vector. For
this conflict between 'MV' and 'M', we could impose the following rules: we
only want M to match if it is by itself or preceded by an O, and we only
want V to match if it is by itself or preceded by an M. We indicate this by
changing the contents of to.count to include some regular expression
voodoo:
# New to.count vector- contains voodoo
to.count <- c( '[M]?[V]$', 'I', 'PT', '[O]?[M]$' )
countMutations( acids, to.count )
# Looks funky, but you could fix that by slapping it with names()
[M]?[V]$ I PT [O]?[M]$
1 1 1 0
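As a sketch of the names() remark: sapply() keeps the names of the vector
it iterates over, so one option is to name the patterns up front rather
than relabel the result afterwards. The labels V, I, PT and M are
assumptions chosen for readability, and the helper is restated (with a
length() test in place of any()) so the snippet runs on its own.

```r
countMutations <- function(acids, to.count) {
  sapply(to.count, function(to.match) {
    as.integer(length(grep(to.match, acids)) > 0)
  })
}

# Data set without M and OM
acids <- c("V", "MV", "I", "IL", "PT", "E")

# Named patterns: the names become the labels of the result
to.count <- c(V = "[M]?[V]$", I = "I", PT = "PT", M = "[O]?[M]$")

counts <- countMutations(acids, to.count)
counts
```

The result carries the labels V, I, PT and M instead of the raw patterns;
calling names(counts) <- c('V', 'I', 'PT', 'M') on an unnamed result would
do the same job.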
Basically, what happened with the regexes:
[M]?[V]$
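The [] denote character classes, i.e. sets of characters allowed at that
position; in this case each class only contains one character. The [M]?
means that an M may optionally appear immediately before the V, and the
[V]$ means that the element must end in a V. The [O]?[M]$ expression works
exactly the same way. Note that the pattern has no ^ anchor at the start,
so it is really the $ doing the work: any element ending in V matches,
whatever comes before it. Writing ^[M]?[V]$ restricts the match to exactly
'V' or 'MV'.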
If you had multiple variants for V, such as:
'V'
'MV'
'PV'
you can add more characters to the first set of brackets: [MP]?[V]$ allows
an optional 'M' or 'P' before the final 'V'. To match only these three
variants and nothing else that ends in 'V', anchor the start as well:
^[MP]?[V]$.
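A quick way to see what a pattern really matches is to try it on a handful
of made-up strings with grepl(). The candidates vector below is purely
illustrative ('LV' and 'VM' are not from the original data) and compares
the pattern with and without a leading ^ anchor.

```r
candidates <- c("V", "MV", "PV", "LV", "VM")

# No ^ anchor: only the end of the string is constrained
unanchored <- grepl("[MP]?[V]$", candidates)

# With ^ anchor: the whole string must be V, MV or PV
anchored <- grepl("^[MP]?[V]$", candidates)

data.frame(candidates, unanchored, anchored)
```

The unanchored pattern also accepts 'LV', because it only requires the
string to end in V; the anchored one rejects it. Both reject 'VM'.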
If you need something more advanced, I would suggest investing some time in
studying regular expressions; they are incredibly powerful yet powerfully
cryptic.
Well, coffee break's over. Hope this helps!
-Charlie