thr3ads.net - R help - [R] Pattern Matching within Vector? [Sep 2009]


Anne-Marie Ternes

2009-Sep-21 15:07 UTC

[R] Pattern Matching within Vector?

Dear mailing list,

I'm stuck with a tricky problem here - at least it seems tricky to me,
being not really talented in pattern matching and regex matters.

I'm analysing amino acid mutations by position and type of mutation.
E.g. (fictitious example) in position 92, I can find L92V, L92MV,
L92I... L is in this example the wild-type amino-acid, and everything
behind the position number is a mutation (single amino acid or
mixture). I'm only interested in the mutation information, so:

Say I've got this vector:
bla <- c("V", "MV", "I", "IL", "PT", "M", "E", "OM")

I'd like to count only those elements that are "truly unique"
mutations, i.e. count "V", "MV" as 1, "I", "IL" as 1, "PT" as 1,
"M" as 1, "E" as 1, and not count "OM".

I could do it iteratively:
Element 1: V. Keep.
Element 2: MV. Match Keep vs New -> 1. I got already a V, so don't count.
Element 3: I. Match Keep vs New -> 0. I is new, keep. Keep = V,I
Element 4: IL. Match Keep vs New -> 1. I got already an I, so don't count.
Element 5: PT. Match Keep vs New -> 0. PT is new, keep. Keep = V,I,PT
Element 6: M: Match Keep vs New -> 0. M is new, keep. Keep = V,I,PT,M
Element 7: E. Match Keep vs New -> 0. E is new, keep. Keep = V,I,PT,M,E
Element 8: OM. Match Keep vs New -> 1. I got already M, so don't count.

Keep vector= (V,I,PT,M,E), count =5
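
A minimal sketch of the iterative procedure above, assuming that "match" means an already-kept mutation occurs as a literal substring of the new element. The name keep is only illustrative. The result depends on element order: if "MV" came before "V", both would be kept.

```r
bla <- c("V", "MV", "I", "IL", "PT", "M", "E", "OM")

keep <- character(0)
for (x in bla) {
  # TRUE if any already-kept mutation occurs literally inside x
  seen <- any(vapply(keep, function(k) grepl(k, x, fixed = TRUE), logical(1)))
  if (!seen) keep <- c(keep, x)
}

keep          # "V" "I" "PT" "M" "E"
length(keep)  # 5
```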

OK. There must be a more elegant way to do this! Something with
vector-wise pattern matching or so? By the way, I don't care which of
"V" or "MV" is counted; what is important is that they are only
counted as 1.

Thanks for your help!

Anne-Marie

Charlie Sharpsteen

2009-Sep-21 16:03 UTC

[R] Pattern Matching within Vector?

I'm on my first cup of caffeinated beverage today, so I don't know how
helpful I will be, but I'll give it a shot. I would approach this problem
by:

1. Creating a function that uses grep to search the vector of acids for the
components that match a certain letter or combination. This function would
return 1 if any matches are found and 0 if no matches were found. The test
for any matching mutations would be done by the appropriately-named any()
function.

2. Use an apply function to execute the matching function for each
possibility I want to search for.

Here's an example for your case:

# Your data
acids <- c("V", "MV", "I", "IL", "PT", "M", "E", "OM")

# The letters you are interested in
to.count <- c('V','I','PT','M')

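counts <- sapply( to.count, function( to.match ){

  # grep() returns the indices of the elements of acids that match
  did.match <- grep( to.match, acids )

  # any() is TRUE when at least one index came back
  if( any( did.match ) ){
    return(1)
  }else{
    return(0)
  }

})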

# The result
counts
 V  I PT  M
 1  1  1  1

If TRUE/FALSE answers would suffice, you could shorten the above code a
little by just returning the value of any():

counts <- sapply( to.count, function( to.match ){

  did.match <- grep( to.match, acids )

  return( any( did.match ) )

})

counts
   V    I   PT    M
TRUE TRUE TRUE TRUE

Actually, you could use as.integer() to achieve the same thing and get 1s
and 0s (sorry, I ramble a lot in the early morning.)

counts <- sapply( to.count, function( to.match ){

  did.match <- grep( to.match, acids )

  return( as.integer( any( did.match ) ) )

})

Here's a function that packs the above code up nicely:

countMutations <- function( acids, to.count ){

  count <- sapply( to.count, function(to.match){

    did.match <- grep(to.match, acids)

    return(as.integer(any(did.match)))

  })

  return(count)

}
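
As a possible variant (not part of the code above), the same idea can be written with grepl(), which returns one TRUE/FALSE per element instead of match indices, and vapply(), which fixes the return type. The name countMutations2 is made up for this sketch.

```r
countMutations2 <- function(acids, to.count) {
  # one 0/1 flag per pattern: does it match at least one element of acids?
  vapply(to.count, function(p) as.integer(any(grepl(p, acids))), integer(1))
}

acids <- c("V", "MV", "I", "IL", "PT", "M", "E", "OM")
countMutations2(acids, c("V", "I", "PT", "M"))
```

Because the patterns are passed as an unnamed character vector, the result carries the patterns themselves as names, just like the sapply() version.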

There is one problem with the above method that I can think of: if 'M' and
'OM' were missing from your data, M would still be matched due to the
presence of 'MV':

# Data set without M and OM
acids <- c("V", "MV", "I", "IL", "PT", "E")

countMutations( acids, to.count )

# Doh! M still counted...
 V  I PT  M
 1  1  1  1

The remedy to this is to add a little regex pixie dust to the to.count
vector. For this conflict between 'MV' and 'M', we could impose the
following rules: we only want M to match if it is by itself or preceded
by an O, and we only want V to match if it is by itself or preceded by M.
We indicate this by changing the contents of to.count to include some
regular expression voodoo:

# New to.count vector- contains voodoo
to.count <- c( '[M]?[V]$', 'I', 'PT', '[O]?[M]$' )

countMutations( acids, to.count )

# Looks funky, but you could fix that by slapping it with names()
[M]?[V]$        I       PT [O]?[M]$
       1        1        1        0
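
The names() fix mentioned in the comment might look like the sketch below. It restates countMutations compactly so the block runs on its own; the labels "V", "I", "PT" and "M" are chosen here for illustration.

```r
countMutations <- function(acids, to.count) {
  sapply(to.count, function(to.match) as.integer(any(grepl(to.match, acids))))
}

acids    <- c("V", "MV", "I", "IL", "PT", "E")
to.count <- c('[M]?[V]$', 'I', 'PT', '[O]?[M]$')

counts <- countMutations(acids, to.count)
# replace the regex strings with readable labels
names(counts) <- c("V", "I", "PT", "M")
counts
```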

Basically, what happened with the regexes:

[M]?[V]$

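The [] denote character classes, i.e. sets of possible matching characters;
in this case each class contains only one character. The [M]? means that
the V may optionally be preceded by an M, and the [V]$ means that the
sequence is terminated by a V. Note that the pattern is not anchored at the
start, so strictly speaking it matches any element that ends in V; prefix
it with ^ (as in ^[M]?[V]$) if only 'V' and 'MV' should match. The
[O]?[M]$ expression works exactly the same way.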
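
To see what such a pattern accepts, a quick check with grepl() can help. This is a sketch on a made-up vector x, not the original data; the ^-anchored variant is an addition shown for comparison.

```r
x <- c("V", "MV", "PV", "M", "OM")

# no start anchor: anything ending in V matches, including "PV"
unanchored <- grepl("[M]?[V]$", x)    # TRUE TRUE TRUE FALSE FALSE

# with ^: only "V" or "MV" match
anchored <- grepl("^[M]?[V]$", x)     # TRUE TRUE FALSE FALSE FALSE
```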

If you had multiple variants for V, such as:

'V'
'MV'
'PV'

You can add more characters into the first set of brackets: [MP]?[V]$ will
match anything possibly preceded by 'M' or 'P' and terminated by 'V'.
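
A small sketch of the multi-variant class, using a made-up vector variants. The leading ^ is an assumption added here so that other prefixes such as 'O' are rejected.

```r
variants <- c("V", "MV", "PV", "OV", "VI")

# optional M or P, then V, and nothing else
hits <- grepl("^[MP]?[V]$", variants)
variants[hits]  # "V" "MV" "PV"
```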

If you need something more advanced, I would suggest investing some time in
studying regular expressions-- they are incredibly powerful yet powerfully
cryptic.

Well, coffee break's over. Hope this helps!

-Charlie
