aeneas24 at priest.com
2010-Sep-29 01:24 UTC
[R] Transforming/appending data (words in IMDB)
Hi everyone, I am doing an analysis of reviews in IMDB and am running into trouble getting my data into the right shape for analysis. Key question: I want to know for each word in the IMDB, whether it is over- or under-represented in a particular category (Rating x Genre). I was figuring on estimating this with a g-test, fwiw. But the basic question I'm asking here is about data transformation/appending. To go from these columns: Film | Genre1 | Genre2 | Genre3 | Reviewer | Rating | Word | Word_ct to these: Word | Genre | Rating | Word_ct | Word_ct_in_genre | Word_ct_in_Rating | Expected_word_ct | G-test-score The actual amount of data is enormous (I have 10 files of ~1.5 GB each) and I suspect I'm going to have to learn how to use the bigmemory package or something like it. But for now, I'd be happy if I could figure out how to do some basic calculations on a toy-sized subset of the data. Broader goal: I am trying to do is see how words pattern in different reviews. For example, if someone uses "gross" in a describing a romance, they probably gave it a low rating. But someone else describing a horror movie with "gross" may think of it as a positive thing. Below I've put what I've tried so far, with annotations about my thinking. ------------------------------------- # Here is what a little portion of the data looks like: mini<-read.csv("miniimdb.csv", header=F) head(mini) Film Genre1 Genre2 Genre3 Reviewer Rating Word Word_ct 1 Up Animation Adventure Comedy Kelly 9 the 22 2 Up Animation Adventure Comedy Kelly 9 dog 5 3 Up Animation Adventure Comedy Kelly 9 can 10 4 Up Animation Adventure Comedy Kelly 9 wow 2 5 Fame Drama Music <NA> Jenn 8 the 15 6 Fame Drama Music <NA> Jenn 8 dance 12 # Each row indicates one word as used by one reviewer for one movie. In the first line, we see that Kelly used "the" 22 times in her review of "Up". The line gives us information about her review: she gave "Up" 9 stars (out of 10), 22/32 people found her review helpful. "Up" is listed as multiple genres, which will become more inconvenient later. It would be easier if each row of "Up" occurred as three rows--identical except for "Genre". So that's one transformation I'm not sure how to do. # I also need to sum up, say, how many times any word appears in a given genre (and later how often it appears in a given rating category so I can do the conditional probabilities and get my observed vs. expected values). Here's one way I thought about doing it: data.by.genre1<-split(mini,mini$Genre1) data.by.genre1$Animation Film Genre1 Genre2 Genre3 Reviewer Rating Word Word_ct 1 Up Animation Adventure Comedy Kelly 9 the 22 2 Up Animation Adventure Comedy Kelly 9 dog 5 3 Up Animation Adventure Comedy Kelly 9 can 10 4 Up Animation Adventure Comedy Kelly 9 wow 2 11 Up Animation Adventure Comedy Eddie 4 the 12 12 Up Animation Adventure Comedy Eddie 4 dog 4 13 Up Animation Adventure Comedy Eddie 4 can 2 14 Up Animation Adventure Comedy Eddie 4 boy 2 data.ani=data.by.genre1$Animation ani.wc<-sum(data.ani$V10) ani.wc [1] 59 miniplus=cbind(mini,ani.wc) head(miniplus) Film Genre1 Genre2 Genre3 Reviewer Rating Word Word_ct ani.wc 1 Up Animation Adventure Comedy Kelly 9 the 22 59 2 Up Animation Adventure Comedy Kelly 9 dog 5 59 3 Up Animation Adventure Comedy Kelly 9 can 10 59 4 Up Animation Adventure Comedy Kelly 9 wow 2 59 5 Fame Drama Music <NA> Jenn 8 the 15 59 6 Fame Drama Music <NA> Jenn 8 dance 12 59 # That's one approach, but it seems REALLY inefficient since I would have to have add a new column specific to each genre, whether it was appropriate for the film or not. Also, sometimes a genre is listed as the first genre, sometimes as the second or third. I actually want to report the total of all "the"'s in Animation, whether Animation is listed as the first, second, or third genre. I'm not sure how to do that. # I've also tried to use the reshape library. That gets me a little closer to the kind of data I want, but I'm not really sure how to append it smartly. library(reshape) # To use "melt", I seem to need to shave off the columns I'm not interested in (film title and reviewer). mininew<-cbind(mini[2:4],mini[6:8]) head(melt(mininew,id=c("Word", "Genre1", "Genre2", "Genre3", "Rating"), measured=c("Word_Ct"))) Word Genre1 Genre2 Genre3 Rating variable value 1 the Animation Adventure Comedy 9 Word_ct 22 2 dog Animation Adventure Comedy 9 Word_ct 5 3 can Animation Adventure Comedy 9 Word_ct 10 4 wow Animation Adventure Comedy 9 Word_ct 2 5 the Drama Music <NA> 8 Word_ct 15 6 dance Drama Music <NA> 8 Word_ct 12 # This almost does what I want it to, except that it is only looking at the first genre tag--I'd like to combine all "the" for Animation, whether "Animation" is listed as the first, second, or third genre tag. # Using cast gets me even closer. cast(mininew, Genre1 ~ Word, sum) Using Word_ct as value column. Use the value argument to cast to override this choice Genre1 bank boy can dance dog great heist shit the wow 1 Animation 0 2 12 0 9 0 0 0 34 2 2 Crime 3 0 2 0 0 0 3 0 12 0 3 Drama 0 0 0 16 0 4 0 4 25 0 # Now that I know that there are 34 "the"'s in Animation, how do I relate that back to my main data set? In order to figure out whether words are over- or under-represented, I need to be able to figure out what we would have expected at chance and then compare the observed to the expected. Again, I was planning on using a g-test for this. And I was planning on reporting it per line, with each line being a word in a particular genre in a particular rating category (1-10 stars). # Again, my ultimate goal is to report observed and expected word counts for each word in each Genre in each Rating (1-10): Word | Genre | Rating | Word_ct | Word_ct_in_genre | Word_ct_in_Rating | Expected_word_ct | G-test-score Any suggestions would be greatly appreciated. Thanks, Tyler [[alternative HTML version deleted]]