thr3ads.net - R help - [R] row by row similarity [Apr 2008]

Grant Gillis

2008-Apr-06 18:07 UTC

[R] row by row similarity

Hello all, and thanks in advance for any advice.
I am very new to R and have searched for an answer to my question, but have
not come up with anything quite like what I would like to do.

My problem is:

I have a data set of individuals (rows) and values for behaviours
(columns).  I would like to know the proportion of shared behaviours for all
possible pairs of individuals, i.e. the number of shared behaviours divided
by the total number of behaviours.  There are zeros in the data that I would
like treated as meaning that the behaviour does not exist.


example data format:

ind    B1  B2  B3  B4  B5  B6
w       2    1    5    3    4    4
x       1    2    3    4    5    6
y       1    3    5    2    7    6
z       2    3    2    4    2    6


Desired output:

w  x   0
w  y   0.166667
w  z   0
x  y   0.33333
x  z   0.33333
etc.


Thanks

Grant

Simon Anders

2008-Apr-06 19:27 UTC

[R] row by row similarity

Hi Grant,

Grant Gillis wrote:
> My problem is:
> 
> I have a data set for individuals (rows) and values for behaviours
> (columns).  I would like to know the proportion of shared behaviours for all
> possible pairs of individuals.  The sum of shared behaviours divided by the
> total.  There are zeros in the data that I would like treated as the
> behaviour does not exist.
> 
> example data format:
> 
> ind    B1  B2  B3  B4  B5  B6
> w       2    1    5    3    4    4
> x       1    2    3    4    5    6
> y       1    3    5    2    7    6
> z       2    3    2    4    2    6

I hope I understand correctly that the numbers label different
behaviours, hence e.g. individuals 'y' and 'z' have the same level of
behaviour, namely level '3', for the behaviour B2. You may want to look
at R's 'factor's, which allow you to give the levels descriptive names
instead of just numbers.
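
As a small illustration of the factor suggestion, the sketch below attaches descriptive labels to the numeric codes of one behaviour column. The label names ("grooming", "foraging", "resting") and the variable names are made up for this example and do not come from the original data.

```r
# Column B2 of the example data (individuals w, x, y, z)
b2 <- c(1, 2, 3, 3)

# Hypothetical descriptive labels for the codes 1, 2 and 3
b2f <- factor(b2, levels = c(1, 2, 3),
              labels = c("grooming", "foraging", "resting"))

b2f

# Equality tests work on factor values just as they do on numbers:
# here individuals y and z are compared on behaviour B2
b2f[3] == b2f[4]
```

The comparisons used in the rest of this reply should work the same way on factors, as long as both values come from factors with the same set of levels.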

Let us first make a dataframe out of your example:

t <- data.frame(
    B1 = c(2,1,1,2),
    B2 = c(1,2,3,3),
    B3 = c(5,3,5,2),
    B4 = c(3,4,2,4),
    B5 = c(4,5,7,2),
    B6 = c(4,6,6,6) )
rownames(t) <- c("w","x","y","z")
> t
  B1 B2 B3 B4 B5 B6
w  2  1  5  3  4  4
x  1  2  3  4  5  6
y  1  3  5  2  7  6
z  2  3  2  4  2  6

If you now test two rows for equality, this happens element-wise:
> t["w",] == t["y",]
     B1    B2   B3    B4    B5    B6
w FALSE FALSE TRUE FALSE FALSE FALSE

You can call 'sum' on this output to get the number of TRUE values.
> sum( t["w",] == t["y",] )
[1] 1

As you want to do this with all pairings, we need a nested 'sapply':
> sapply( rownames(t), function(ind1)
+    sapply( rownames(t), function(ind2)
+       sum( t[ind1,] == t[ind2,] ) ) )
  w x y z
w 6 0 1 1
x 0 6 2 2
y 1 2 6 2
z 1 2 2 6

This table now contains the desired information. Of course, you have to
divide by the number of behaviours, i.e. by 6, and the format is a bit
different from your suggestion, but I hope that does not matter.
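
If the long pair-by-pair format is wanted after all, one possible way to get it is sketched below. It assumes the example data as posted in the question; the names `beh`, `pairs`, `prop_shared` and `result` are chosen here for illustration only.

```r
# Example data from the question (rows = individuals, columns = behaviours)
beh <- data.frame(
    B1 = c(2, 1, 1, 2),
    B2 = c(1, 2, 3, 3),
    B3 = c(5, 3, 5, 2),
    B4 = c(3, 4, 2, 4),
    B5 = c(4, 5, 7, 2),
    B6 = c(4, 6, 6, 6),
    row.names = c("w", "x", "y", "z"))

# All unordered pairs of individuals, one pair per column
pairs <- combn(rownames(beh), 2)

# mean() of a logical vector is the proportion of TRUE values,
# i.e. the number of matching behaviours divided by the total
prop_shared <- apply(pairs, 2, function(p)
    mean(unlist(beh[p[1], ]) == unlist(beh[p[2], ])))

result <- data.frame(ind1 = pairs[1, ], ind2 = pairs[2, ],
                     shared = prop_shared)
result
```

Note that for this data the pair w and z comes out as 1/6 rather than 0, because both have the value 2 for B1.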
> Desired output:
> 
> w  x   0
> w  y   0.166667
> w  z   0
> x   y   0.33333
> x   z   0.33333
> etc.

To deal with missing behaviours, it is better to use 'NA' instead of 0.
R can then help you with them, as it treats NAs, i.e. values marked as
missing, in a special way.

Assume, for example, that you compare the rows
> r1 <- c( 2, 3, NA, 1, 5 )
> r2 <- c( 1, 3, 4, NA, 4 )
Calling '==' as above on such data yields:
> r1==r2
[1] FALSE  TRUE    NA    NA FALSE

As you can see, the result is NA wherever a behaviour is missing in either
row, because such values cannot be compared. To get the number of TRUE
values, use
> sum( r1==r2, na.rm=TRUE )
[1] 1

And to get the number of comparable observations, i.e. those without NA,
use e.g.
> length( na.omit( r1==r2 ) )
[1] 3
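
Putting these pieces together, one possible approach is to recode the zeros as NA and then divide the number of matches by the number of comparable behaviours for each pair. This is only a sketch: the small data set with zeros, the function name `prop_shared` and the other variable names are made up for illustration, and whether the denominator should be the comparable behaviours or all behaviours depends on what you want.

```r
# Hypothetical data in which 0 means "behaviour does not exist"
beh <- data.frame(
    B1 = c(2, 1, 2),
    B2 = c(3, 3, 3),
    B3 = c(0, 4, 4),
    B4 = c(1, 0, 2),
    row.names = c("a", "b", "c"))

# Recode zeros as missing values
beh[beh == 0] <- NA

# Proportion of shared behaviours among those present in both rows
prop_shared <- function(r1, r2) {
    eq <- unlist(r1) == unlist(r2)   # NA where either value is missing
    sum(eq, na.rm = TRUE) / sum(!is.na(eq))
}

# Apply the function to every unordered pair of individuals
pairs <- combn(rownames(beh), 2)
shared <- apply(pairs, 2, function(p)
    prop_shared(beh[p[1], ], beh[p[2], ]))

data.frame(ind1 = pairs[1, ], ind2 = pairs[2, ], shared = shared)
```

If two individuals have no comparable behaviours at all, the division is 0/0 and the result is NaN, which may need separate handling.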

I hope this helps you to work out your own solution. Otherwise, ask again.

Best
   Simon


+---
| Dr. Simon Anders, Dipl. Phys.
| European Bioinformatics Institute, Hinxton, Cambridgeshire, UK
| preferred (permanent) e-mail: sanders at fs.tum.de
