Hi,

I have a large dataset with info on individuals (B) that have been involved in projects (A) during multiple years (C). The dataset contains three columns: A, B, C. Example:

  A  B  C
1 1  a  1999
2 1  b  1999
3 1  c  1999
4 1  d  1999
5 2  c  2001
6 2  d  2001
7 3  a  2004
8 3  c  2004
9 3  d  2004

I am interested in how well all the individuals in a project know each other. To calculate this team familiarity measure I want to sum the familiarity between all individual pairs in a team. The familiarity between each individual pair in a team is calculated as the summation of each pair's prior co-appearances in a project, each divided by the total number of members of that earlier project.

So the team familiarity in project 3 = (1/4+1/4) + (1/4+1/4+1/2) + (1/4+1/4+1/2) = 2.5. That is: a has been in project 1 (of size 4) with c and d -> 1/4+1/4; c has been in project 1 (of size 4) with a and d, and in project 2 (of size 2) with d -> 1/4+1/4+1/2; and the same holds for d.

I think that the best way to do it is to transform the data into an edgelist (each pair in one row/two columns) and then to create two additional columns for the strength of the familiarity and the year of the project in which the pair was active. The problem is that I am stuck already at the first step. So the question is: how do I go from the current data structure to a list of projects and the familiarity of its team members?

Your help is very much appreciated. Thanks!

--
View this message in context: http://r.789695.n4.nabble.com/Transforming-relational-data-tp3305398p3305398.html
Sent from the R help mailing list archive at Nabble.com.
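[Editor's note: the measure described above can be sketched outside R as well. The following Python snippet, using only the standard library (the function name `team_familiarity` is made up for illustration), computes the same per-project familiarity from the example rows by summing, over each ordered member pair, the pair's prior co-appearances weighted by 1 over the size of the earlier project:]

```python
from itertools import permutations

# the (project, member, year) rows from the example above
rows = [(1, "a", 1999), (1, "b", 1999), (1, "c", 1999), (1, "d", 1999),
        (2, "c", 2001), (2, "d", 2001),
        (3, "a", 2004), (3, "c", 2004), (3, "d", 2004)]

# group members by project, keeping each project's year
projects = {}
for proj, member, year in rows:
    projects.setdefault(proj, (year, []))[1].append(member)

def team_familiarity(proj):
    """Sum, over ordered member pairs of `proj`, the pair's co-appearances
    in strictly earlier projects, each weighted by 1/size of that project."""
    year, members = projects[proj]
    total = 0.0
    for x, y in permutations(members, 2):
        for other_year, other_members in projects.values():
            if other_year < year and x in other_members and y in other_members:
                total += 1.0 / len(other_members)
    return total

for p in sorted(projects):
    print(p, team_familiarity(p))
# 1 -> 0.0, 2 -> 0.5, 3 -> 2.5
```

This reproduces the worked example: project 3 gets (1/4+1/4) + (1/4+1/4+1/2) + (1/4+1/4+1/2) = 2.5.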
Hello. One (of many) solutions might be:

require(data.table)
DT = data.table(read.table(textConnection("
A B C
1 1 a 1999
2 1 b 1999
3 1 c 1999
4 1 d 1999
5 2 c 2001
6 2 d 2001"), head=TRUE, stringsAsFactors=FALSE))

firststep = DT[, cbind(expand.grid(B,B), v=1/length(B)), by=C][Var1 != Var2]
setkey(firststep, Var1, Var2)

grp3 = c("a","c","d")   # the members of project 3
firststep[J(expand.grid(grp3,grp3)), nomatch=0][, sum(v)]
# 2.5

If I guess the bigger picture correctly, this can be extended to make a time series of prior familiarity by including the year in the key.

If you decide to try this, please make sure to grab the latest version of data.table from CRAN (v1.5.3). I suggest that you run it first to confirm it does return 2.5, then break it down and run it step by step to see how each part works. You will need some time to read the vignettes and ?data.table (which has recently been improved), but I hope you think it is worth it. Support is available at maintainer("data.table").

HTH,
Matthew

On Mon, 14 Feb 2011 09:22:12 -0800, mathijsdevaan wrote:
> [snip]
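[Editor's note: for readers following along outside data.table, the `firststep` idea — expanding each project into a weighted edge list of ordered member pairs, then querying the pairs of interest — can be sketched in plain Python (variable names are illustrative, not from the thread):]

```python
from itertools import product

# the six example rows Matthew reads in: projects 1 and 2 only
rows = [(1, "a", 1999), (1, "b", 1999), (1, "c", 1999), (1, "d", 1999),
        (2, "c", 2001), (2, "d", 2001)]

# group members by year (in this example each year is one project)
by_year = {}
for proj, member, year in rows:
    by_year.setdefault(year, []).append(member)

# the analogue of firststep: one row per ordered pair, weighted 1/team size
edges = []
for year, members in sorted(by_year.items()):
    w = 1.0 / len(members)
    for x, y in product(members, repeat=2):
        if x != y:
            edges.append((x, y, w, year))

# the analogue of the grp3 lookup: total familiarity among project 3's members
grp3 = {"a", "c", "d"}
print(sum(w for x, y, w, _ in edges if x in grp3 and y in grp3))  # 2.5
```

The 2.5 comes out because each of the six ordered pairs among {a, c, d} appears in the size-4 project (6 x 1/4) and the two ordered (c, d) pairs also appear in the size-2 project (2 x 1/2).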
OK, for the last step I have tried this (among other things):

library(data.table)
DT = data.table(read.table(textConnection("
A B C
1 1 a 1999
2 1 b 1999
3 1 c 1999
4 1 d 1999
5 2 c 2001
6 2 d 2001
7 3 a 2004
8 3 c 2004
9 3 d 2004"), head=TRUE, stringsAsFactors=FALSE))

firststep = DT[, cbind(expand.grid(B,B), v=1/length(B)), by=C][Var1 != Var2]
setkey(firststep, Var1, Var2)
list1 <- firststep[J(expand.grid(DT$B,DT$B), v=1/length(DT$B)), nomatch=0][, sum(v)]
list1
# 27

What I would like to get:

  list
1 0
2 0.5
3 2.5

Thanks!
Thanks for the attempt and required output. How about this?

firststep = DT[, cbind(expand.grid(B,B), v=1/length(B)), by=C][Var1 != Var2]
setkey(firststep, Var1, Var2, C)
firststep = firststep[, transform(.SD, cv=cumsum(v)), by=list(Var1,Var2)]
setkey(firststep, Var1, Var2, C)
DT[, {x = data.table(expand.grid(B,B), C[1]-1L)
      firststep[x, roll=TRUE, nomatch=0][, sum(cv)]   # prior familiarity
     }, by=C]

        C  V1
[1,] 1999 0.0
[2,] 2001 0.5
[3,] 2004 2.5

I think you may have said you have large data. If so, this method should be fast. Please let us know how you get on.

HTH,
Matthew

On Thu, 17 Feb 2011 23:07:19 -0800, mathijsdevaan wrote:
> [snip]
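[Editor's note: the cumsum-plus-rolling-join idea above can be roughly mimicked in pandas, as a sketch under the example data (column and variable names here are made up for illustration): build the weighted pair list, take a running cumulative sum per pair, then for each project sum each pair's most recent cumulative value strictly before that project's year:]

```python
import pandas as pd
from itertools import permutations

# the example data, with project 3 = {a, c, d} as in the original post
df = pd.DataFrame({"A": [1, 1, 1, 1, 2, 2, 3, 3, 3],
                   "B": list("abcdcdacd"),
                   "C": [1999]*4 + [2001]*2 + [2004]*3})

# one weighted row per ordered member pair per project (the edge list)
pairs = []
for (proj, year), g in df.groupby(["A", "C"]):
    w = 1.0 / len(g)
    for x, y in permutations(g["B"], 2):
        pairs.append({"x": x, "y": y, "C": year, "v": w})
edges = pd.DataFrame(pairs)

# running familiarity per pair over the years (analogue of cumsum(v) by pair)
edges["cv"] = edges.sort_values("C").groupby(["x", "y"])["v"].cumsum()

# per project: sum each pair's latest cumulative value strictly before its year
results = {}
for proj, g in df.groupby("A"):
    year, members = g["C"].iat[0], set(g["B"])
    prior = edges[(edges["C"] < year)
                  & edges["x"].isin(members) & edges["y"].isin(members)]
    latest = prior.sort_values("C").groupby(["x", "y"]).tail(1)
    results[proj] = float(latest["cv"].sum())
print(results)  # {1: 0.0, 2: 0.5, 3: 2.5}
```

The `tail(1)` on the year-sorted groups plays the role of data.table's `roll=TRUE`: each pair contributes its familiarity as accumulated up to, but not including, the project's own year.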
On Mon, Feb 14, 2011 at 12:22 PM, mathijsdevaan <mathijsdevaan at gmail.com> wrote:
> [snip]

First define the data frame, DF. Note we have used column names of proj, pers and year. Then append a size column, producing DF2. Using sqldf, merge DF2 with itself within project, giving one row per pair in M. At the same time we calculate the reciprocal of size for each row, or 0 if the two components of the pair are the same person.
Next we merge M with itself, giving pairs of pairs, zeroing out rows that should not contribute to the sum and aggregating the reciprocal sizes by project. The automatic class assignment heuristic does not work well in this case, so we use method = "raw" to bypass it.

DF <- structure(list(proj = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
    pers = structure(c(1L, 2L, 3L, 4L, 3L, 4L, 1L, 3L, 4L),
        .Label = c("a", "b", "c", "d"), class = "factor"),
    year = c(1999L, 1999L, 1999L, 1999L, 2001L, 2001L, 2004L, 2004L, 2004L)),
    .Names = c("proj", "pers", "year"),
    row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9"),
    class = "data.frame")

library(sqldf)

DF2 <- transform(DF, size = ave(proj, proj, FUN = length))

M <- sqldf("select proj, year, x.pers || ' ' || y.pers pair,
            (x.pers != y.pers) / (size + 0.0) recip
            from DF2 x, DF2 y using(proj, year, size)", method = "raw")

sqldf("select x.proj, sum((x.year > y.year and x.recip > 0) * y.recip) familiarity
       from M x, M y using (pair)
       group by x.proj", method = "raw")

The result of the last statement is:

  proj familiarity
1    1         0.0
2    2         0.5
3    3         2.5

More info on sqldf is available at http://sqldf.googlecode.com

The last statement could be replaced by these two in case you want the intermediate PP:

PP <- sqldf("select *, (x.year > y.year and x.recip > 0) * y.recip familiarity
             from M x, M y using (pair)", method = "raw")
sqldf("select proj, sum(familiarity) as familiarity from PP group by proj")

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
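[Editor's note: as a cross-check of the same join logic outside sqldf, here is a rough pandas translation. The intermediate names DF2, M and PP mirror Gabor's; everything else is an assumption of this sketch:]

```python
import pandas as pd

# Gabor's DF: proj/pers/year, with project 3 = {a, c, d}
DF = pd.DataFrame({"proj": [1, 1, 1, 1, 2, 2, 3, 3, 3],
                   "pers": list("abcdcdacd"),
                   "year": [1999]*4 + [2001]*2 + [2004]*3})

# append the team size, then self-join within project: one row per ordered pair
DF2 = DF.assign(size=DF.groupby("proj")["pers"].transform("size"))
M = DF2.merge(DF2, on=["proj", "year", "size"], suffixes=("_x", "_y"))
M["pair"] = M["pers_x"] + " " + M["pers_y"]
M["recip"] = (M["pers_x"] != M["pers_y"]) / M["size"]   # 0 for self-pairs

# join pairs of pairs; only strictly earlier projects contribute to the sum
PP = M.merge(M, on="pair", suffixes=("_a", "_b"))
PP["fam"] = ((PP["year_a"] > PP["year_b"]) & (PP["recip_a"] > 0)) * PP["recip_b"]
fam = PP.groupby("proj_a")["fam"].sum()
print(fam)  # proj 1 -> 0.0, proj 2 -> 0.5, proj 3 -> 2.5
```

As in the SQL, the `(year_a > year_b) & (recip_a > 0)` mask zeroes out same-or-later projects and self-pairs, so only genuine prior co-appearances are aggregated.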
Gabor, that worked great! I was unaware of the sqldf package, but it is great for data manipulation. Thanks!