Hi R-help,

I have a database of 10 students who have written a total of 78 essays. The challenge? I would like to identify who wrote the 79th essay.

Has anybody used R in this context?

Even if not, could you suggest which pattern recognition technique I might apply?

Thanks a lot and regards,
Tom
I assume you know that the usual procedure is to 'score' each essay by a vector giving the frequency of occurrence of commonly used words and phrases (sometimes adding subject-matter-specific ones). These multivariate scores, labelled by author, are then fed as a "training set" into your favorite supervised learning/classification procedure. R has many of these: trees, logistic regression, boosting, random forests, SVMs, LDA, SOMs (whoops, that's an unsupervised one), ... Try RSiteSearch('Classification', restrict = 'functions').

The devil is in the details as to what works best, I believe. With only 78 exemplars in 10 groups, it may be difficult unless there is a lot of separation (disparate styles that you could probably detect manually). It also depends on how large each group is (balance is generally better).

Cheers,
Bert

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Werner Bier
Sent: Sunday, June 12, 2005 12:30 PM
To: r-help at stat.math.ethz.ch
Subject: [R] Essay identification

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
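The scoring-then-classifying procedure Bert describes can be sketched in base R roughly as follows. This is only an illustration under loud assumptions: the six toy "essays", the author labels A/B/C, the short list in `function_words`, and the helper names `tokenize`, `score_essay` and `predict_author` are all made up for the example, and a nearest-centroid rule stands in for the classifiers listed above (trees, LDA, random forests and so on), which would be fitted to the same feature matrix.

```r
# Hypothetical toy corpus: three authors, two short "essays" each.
essays <- c(
  "the dog and the cat and the bird sat in the sun",
  "the rain and the wind and the cold kept the town quiet",
  "a kind of music which speaks of loss of hope",
  "the sort of book which tells of war of peace",
  "i tried to run but i had to stop but not to rest",
  "we want to go but we need to wait but not to sleep"
)
authors <- c("A", "A", "B", "B", "C", "C")

# Assumed list of common, topic-independent words to score on.
function_words <- c("the", "of", "and", "to", "in", "that", "but", "which")

# Split a text into lower-case word tokens.
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# Score one essay: relative frequency of each word in `vocab`.
score_essay <- function(text, vocab) {
  words <- tokenize(text)
  counts <- table(factor(words, levels = vocab))
  as.numeric(counts) / length(words)
}

# Training set: one row per essay, one column per scored word.
features <- t(sapply(essays, score_essay, vocab = function_words,
                     USE.NAMES = FALSE))
colnames(features) <- function_words

# Mean feature vector (centroid) per author.
centroids <- t(sapply(split(as.data.frame(features), authors), colMeans))

# Assign a new essay to the author with the closest centroid
# (Euclidean distance); a simple stand-in for a real classifier.
predict_author <- function(text, centroids, vocab) {
  x <- score_essay(text, vocab)
  dists <- sqrt(rowSums(sweep(centroids, 2, x)^2))
  names(which.min(dists))
}

unknown <- "a sense of place which reminds of home of youth"
predicted <- predict_author(unknown, centroids, function_words)
print(predicted)
```

With 78 real essays one would want a much longer word list and some form of cross-validation (for example leaving one essay out at a time) before trusting any prediction for the 79th.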
On 6/12/05, Werner Bier <aliscla at yahoo.com> wrote:
> Hi R-help,
>
> I have a database of 10 students who have written an overall of 78 essays.
> The challenge? I would like to identify who wrote the 79th essay.
>
> Has anybody used R in this context?
>
> Even if not, would you suggest me which pattern recognition technique I might possibly apply?

Check out http://xxx.uni-augsburg.de/PS_cache/cond-mat/pdf/0108/0108530.pdf for a simple method.
This topic is sometimes called wordprinting or stylometry. The spring 2003 issue of Chance magazine had several articles on the topic.

A colleague of mine and I (along with various graduate students) have been working on a Perl program to extract many of the common statistics used in wordprinting: counts/percentages of non-contextual words, word pattern ratios, and vocabulary richness. The data can then be loaded into R (or any other stats package) to be analyzed.

The program is currently in a beta state (usable, but we may still add features and documentation). I can send a copy to anyone who is interested; please specify whether you have Perl or need a standalone copy (Windows only).

Hope this helps,

Greg Snow, Ph.D.
Statistical Data Center, LDS Hospital
Intermountain Health Care
greg.snow at ihc.com
(801) 408-8111
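For readers who want to see what such statistics look like once they are in R, here is a minimal base-R sketch. It is not the Perl program mentioned above; the helper names `tokenize` and `stylometric_features`, the two-word `function_words` list, and the sample sentence are assumptions made for illustration. Vocabulary richness is represented here by the type-token ratio and by the proportion of word types that occur only once, which are just two of several measures in use.

```r
# Split a text into lower-case word tokens.
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# A few simple stylometric statistics for one text.
stylometric_features <- function(text, function_words) {
  words <- tokenize(text)
  n <- length(words)
  freq <- table(words)  # count of each distinct word (type)
  c(
    n_tokens          = n,
    # distinct words divided by total words
    type_token_ratio  = length(freq) / n,
    # share of distinct words used exactly once
    hapax_proportion  = sum(freq == 1) / length(freq),
    # percentage of tokens that are non-contextual (function) words
    function_word_pct = 100 * sum(words %in% function_words) / n
  )
}

# Assumed, deliberately tiny list of non-contextual words.
function_words <- c("the", "and")

f <- stylometric_features("The cat and the dog and the bird", function_words)
print(f)
```

Applying `stylometric_features` to each essay and binding the results by row would give a data frame, one row per essay, ready for whatever classification method is chosen.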