List account
2005-Jan-15 04:39 UTC
[R] Newbie question regarding graphing of Princomp object
Greetings,

I am working on a stylometric analysis of some Latin texts; one of the latest stylometric techniques involves using principal components analysis. Not being a statistician, I can't really fully rely on PCA as my primary tool, since I don't really understand the statistics behind the PCA technique. Nevertheless, the ability to use PCA and graph the results has been marvelously helpful as a preliminary technique to determine what kinds of stylometric variables are worth pursuing as indicators of authorship.

For instance, I'm doing the following... I have a set of data for approximately 120 different Latin works, about half of which are by St. Thomas Aquinas, and the other half are by various other authors in the Thomistic tradition, some known and some anonymous. My data for frequencies of prepositions looks like the following:

A,AD,CIRCA,CUM,DE, .... (total of 10 variables)
1,0.00967667222531036,0.0208124884194923,0.00142671854734112,0.00486381322957198,0.00758291643505651 ...
2,0.00874917700292081,0.0217315416668508,0.00133005165549453,0.00437900727772451,0.00537323193714733 ....
3,0.0064258378627327,0.0280901956627422,0.00178739176045295,0.00430582309573329,0.00821688482105979 ....
4,0.00706850368364528,0.027446604903448,0.000821141574836712,0.00461761547172807,0.00812783899774761 ....
5,0.010214039424891,0.015409971157808,0.000745993537614122,0.00584650749246416,0.00475787738815518 ....
6,0.00952534711010655,0.0180981595092025,0.00125928317726832,0.00515014530190507,0.00447206974491443 ...
.... (and so on for the rest of the 120 works)

The works are numbered such that works 100 and below are by St. Thomas, those from 101 to 117 are of dubious authenticity, and those from 118 to 179 are by other authors.

When I perform a biplot on the results of the princomp() function, I get a nice graph that plots the 120 works on the two principal component axes (I've figured out how to get rid of the red arrows already).
Given that the data points tend to jumble together, I'd like some way to color the different categories of works in the biplot, so that data points for works 1-100 are red, those from 101-117 are blue, and those from 118 to 179 are green (for instance).

I've included a sample of the output that I'm currently getting, in case it's helpful to anybody. BTW, I am running RAqua (for the Mac), version 1.8.1.

Thanks in advance for any help!

-Erik Norvelle
erik (at) norvelle (dot) org
Facultad de Filosofía y Letras
Universidad de Navarra
Pamplona, Navarra, España

[Attachment: prepositions.pdf (application/pdf, 12639 bytes)
https://stat.ethz.ch/pipermail/r-help/attachments/20050115/3611db92/prepositions.pdf]
Tobias Verbeke
2005-Jan-15 08:47 UTC
[R] Newbie question regarding graphing of Princomp object
On Sat, 15 Jan 2005 05:39:00 +0100 List account <lists at norvelle.org> wrote:

> [...]
> When I perform a biplot, on the results of the princomp() function, I
> get a nice graph that plots the 120 works on the two principal
> component axes (I've figured out how to get rid of the red arrows
> already). Given that the data points tend to jumble together, I'd like
> some way to color the different categories of works in the biplot, so
> that data points for works 1-100 are red, those from 101-117 are blue,
> and those from 118 to 179 are green (for instance).

You can use the `col' argument in the biplot call. In this case, I would do something like

    biplot(mydata, col = c(rep("red", 100), rep("blue", 17), rep("green", 62)))

For a list of built-in color names, you can type

    colors()

at the R prompt. For more information on biplot, type

    ?biplot

VaRiis modis bene fit.

HTH,
Tobias

> [...]
On Sat, 15 Jan 2005 15:53:18 +0100 List account <lists at norvelle.org> wrote:

> Thanks, Tobias for the response.
>
> I tried the suggestion you gave, and apparently (at least according to
> the biplot manpage), only the first two members of the col vector are
> used, the first to plot the first set of values, i.e. the scores, and
> the second color is used for the loadings (I think I have that right).
> At any rate, if I add the clause 'col = c(rep("red", 100), rep("blue",
> 17), rep("green", 62))' I just get a bunch of red points! :(

You're right. I'm sorry I did not read ?biplot, but only checked it had a col argument (Semel in anno licet insanire..).

Anyway, with PCA it is not a good idea to plot both variables and cases on one single plot, because the temptation is too great to interpret proximities between variables and cases. You'd better plot two different graphs, one for the cases and one for the `circle of correlations'.

For plotting the cases, you could make up your own plot using something similar to this:

    library(MASS)                  # for eqscplot
    F1 <- yourpca$scores[, 1]      # scores on the first component
    F2 <- yourpca$scores[, 2]      # scores on the second component
    eqscplot(F1, F2, pch = 20)     # scatterplot with equal axis scales
    text(F1, F2, labels = names(F1),
         col = c(rep("red", 100), rep("blue", 17), rep("green", 62)),
         pos = 3)                  # one label per work, colored by group

Tobias

> Si vales, valeo...
>
> -Erik
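
A note on the color vector: c(rep("red", 100), rep("blue", 17), rep("green", 62)) assumes 179 rows in work-number order, whereas the original post mentions roughly 120 works numbered up to 179. If the numbering has gaps, one option is to derive each color from the work number itself. The following is only a minimal sketch under stated assumptions: `work_id`, `freqs`, the simulated frequencies and the group labels are hypothetical stand-ins for the real data, and base plot() with asp = 1 is used in place of eqscplot() purely to avoid the MASS dependency.

```r
# Hypothetical stand-in data: 120 works whose numbers run from 1 to 179
# with gaps (the real data would be read from the poster's file instead).
set.seed(1)
work_id <- sort(c(sample(1:100, 60), 101:117, sample(118:179, 43)))
freqs <- matrix(runif(length(work_id) * 10, 0, 0.03), ncol = 10,
                dimnames = list(as.character(work_id),
                                paste("P", 1:10, sep = "")))

pca <- princomp(freqs)

# Assign each work to a group from its number, not from its row position:
# (0,100] -> Aquinas, (100,117] -> dubious, (117,179] -> other.
grp <- cut(work_id, breaks = c(0, 100, 117, 179),
           labels = c("Aquinas", "dubious", "other"))
palette3 <- c("red", "blue", "green")
cols <- palette3[as.integer(grp)]

# Plot the cases only, labelled by work number and colored by group.
plot(pca$scores[, 1], pca$scores[, 2], type = "n", asp = 1,
     xlab = "Comp.1", ylab = "Comp.2")
text(pca$scores[, 1], pca$scores[, 2],
     labels = rownames(pca$scores), col = cols)
legend("topright", legend = levels(grp), text.col = palette3, bty = "n")
```

Because the colors are computed per row from `work_id`, the vector always has one entry per plotted work, whatever the gaps in the numbering.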