Dear R users,

I'm a novice user of R and have absolutely no prior knowledge of social network analysis, so apologies if my question is trivial. I've spent a lot of time trying to solve this on my own but I really can't, so I hope someone here can help me out. Cheers!

The dataset:
I'm trying to predict the existence of links (True or False) in a test set using a training set. Both data sets are in an "edgelist" format, where user IDs represent nodes in both columns, with the 1st column directing to the 2nd column (see Figure 1 below). Using the AUC to evaluate the performance, I am looking for the best algorithm to predict the existence of links in the test data (50% are true and the rest are false).

Figure 1:
> training
Vertices: 1133143
Edges: 999
Directed: TRUE
Edges:

[0] 105 -> 850956
[1] 105 -> 1073420
[2] 105 -> 1102667
[3] 165 -> 888346
[4] 165 -> 579649
[5] 165 -> 136665
etc.

I'm having problems obtaining the probability scores for the links / edges, as most of the scores are for the nodes. Examples of this are the graph.knn and page.rank functions in igraph.

So my questions are:
1) What do I need to do to obtain the scores for the links instead of the nodes (I presume there is a data preparation step that I am missing)?
2) Which R package would be the best for running the various techniques: Jaccard index, Adamic-Adar, common neighbours, PropFlow, etc.?
3) How do I implement a supervised learning method such as random forest (I am guessing I need to obtain a feature list, but again, how can I get the scores for the edges)?

I hope I've explained my questions well, but do let me know if more clarification is needed.

Thanks in advance,
Eu Jin
You could start by having a look at CRAN packages like sna or statnet, or search CRAN for "network" and you will find a lot of packages!

On Wed, Dec 22, 2010 at 12:00 AM, EU JIN LOK <ejlok1 at hotmail.com> wrote:
> Dear R users
> [...]
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
Dear Eu,

On Wed, Dec 22, 2010 at 12:00 AM, EU JIN LOK <ejlok1 at hotmail.com> wrote:
> [...]
> So my questions are:
> 1) What do I need to do to obtain the scores for the links instead of the nodes (I presume there is a data preparation step that I am missing)?

In general, most people are interested in the nodes of the network, so most network indices are node-level. If you want edge-level indices, you can create another graph from yours by transforming the edges into vertices and vice versa (a line graph). Two vertices are connected in the new graph if the corresponding two edges in the old graph share an incident vertex. However, I am sure that some vertex measures don't make sense for edges at all, so you need to be careful with this, especially with the interpretation of the results.
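The edge-to-vertex transformation described above can be sketched roughly as follows. This is only an illustration on the six sample edges from Figure 1, using current igraph function names (`graph_from_data_frame`, `make_line_graph`), not necessarily what was available at the time of this thread. The printed vertex count (1133143 for 999 edges) suggests the numeric user IDs may have been read as vertex indices, which would create many isolated vertices; passing the IDs as character names, as done here, is an assumption about the intended setup. The links are treated as undirected so that "sharing an incident vertex" applies as described; for a directed graph igraph builds the line graph differently (head-to-tail adjacency).

```r
library(igraph)

# Toy edge list in the same two-column layout as the training data
# (IDs copied from the sample shown in Figure 1, stored as character names).
edges <- data.frame(
  from = c("105", "105", "105", "165", "165", "165"),
  to   = c("850956", "1073420", "1102667", "888346", "579649", "136665")
)

# Undirected graph: two edges are "neighbours" if they share an endpoint.
g <- graph_from_data_frame(edges, directed = FALSE)

# Line graph: vertex i of lg corresponds to edge i of g.
lg <- make_line_graph(g)

# Any vertex-level index computed on lg can be read as an edge-level
# score for g; edge_betweenness() is a native edge-level index.
edge_scores <- data.frame(
  v1        = ends(g, E(g))[, 1],
  v2        = ends(g, E(g))[, 2],
  lg_degree = degree(lg),
  lg_prank  = page_rank(lg)$vector,
  edge_btw  = edge_betweenness(g)
)
print(edge_scores)
```

Note that this scores edges that already exist in the graph. For predicting whether a candidate link exists, scores on pairs of vertices are usually more directly useful than line-graph measures.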
Another possibility is to use the few edge-level indices, e.g. edge betweenness, or just define analogous edge measures for the existing vertex measures.

> 2) Which R package would be the best for running the various techniques: Jaccard index, Adamic-Adar, common neighbours, PropFlow, etc.?

The first three are implemented in igraph, if I remember correctly.

> 3) How do I implement a supervised learning method such as random forest (I am guessing I need to obtain a feature list, but again, how can I get the scores for the edges)?

I am not an expert on this, but there are several R packages for supervised methods, random forests as well; look around on CRAN.

I hope this helps,
Best,
Gabor

-- 
Gabor Csardi <Gabor.Csardi at unil.ch> UNIL DGM
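As a rough illustration of the pair-level measures mentioned in this reply, the sketch below scores candidate links with igraph's `similarity()` function (methods `"jaccard"` and `"invlogweighted"`, the latter being igraph's inverse log-weighted measure, which is closely related to Adamic-Adar) and counts common neighbours from the adjacency matrix. The training edges and the `test_pairs` data frame are made up for the example, and direction is ignored (`mode = "all"`), which is an assumption rather than something the thread specifies.

```r
library(igraph)

# Hypothetical training edge list (character IDs).
train <- data.frame(
  from = c("a", "a", "b", "b", "c"),
  to   = c("c", "d", "c", "d", "e")
)
g <- graph_from_data_frame(train, directed = TRUE)

# Hypothetical candidate links whose existence we want to score.
test_pairs <- data.frame(from = c("a", "a"), to = c("b", "e"))

# Map IDs to vertex indices so the matrices below can be indexed safely.
i <- match(test_pairs$from, V(g)$name)
j <- match(test_pairs$to,   V(g)$name)

# Full pairwise similarity matrices, ignoring edge direction.
# On a large graph, restrict the computation with the `vids` argument
# instead of building the full matrix.
jac <- similarity(g, mode = "all", method = "jaccard")
aa  <- similarity(g, mode = "all", method = "invlogweighted")

# Common neighbours: square of the symmetrised 0/1 adjacency matrix.
# Dense matrices are only feasible for small graphs.
A  <- as_adjacency_matrix(g, sparse = FALSE)
Au <- ((A + t(A)) > 0) * 1
cn <- Au %*% Au

# One row per candidate link, one column per score.
scores <- data.frame(
  test_pairs,
  jaccard           = jac[cbind(i, j)],
  adamic_adar       = aa[cbind(i, j)],
  common_neighbours = cn[cbind(i, j)]
)
print(scores)
```

Each row of `scores` is a feature vector for one candidate link, so a table like this could serve as the input to a supervised classifier (for example from the randomForest package on CRAN), with the known True/False labels of the training pairs as the response.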