aeneas24 at priest.com
2010-Sep-29 01:24 UTC
[R] Transforming/appending data (words in IMDB)
Hi everyone,
I am doing an analysis of reviews in IMDB and am running into trouble getting my
data into the right shape for analysis.
Key question: I want to know for each word in the IMDB, whether it is over- or
under-represented in a particular category (Rating x Genre). I was figuring on
estimating this with a g-test, fwiw. But the basic question I'm asking here
is about data transformation/appending. To go from these columns:
Film | Genre1 | Genre2 | Genre3 | Reviewer | Rating | Word | Word_ct
to these:
Word | Genre | Rating | Word_ct | Word_ct_in_genre | Word_ct_in_Rating |
Expected_word_ct | G-test-score
The actual amount of data is enormous (I have 10 files of ~1.5 GB each) and I
suspect I'm going to have to learn how to use the bigmemory package or
something like it. But for now, I'd be happy if I could figure out how to do
some basic calculations on a toy-sized subset of the data.
Broader goal: I am trying to do is see how words pattern in different reviews.
For example, if someone uses "gross" in a describing a romance, they
probably gave it a low rating. But someone else describing a horror movie with
"gross" may think of it as a positive thing.
Below I've put what I've tried so far, with annotations about my
thinking.
-------------------------------------
# Here is what a little portion of the data looks like:
mini<-read.csv("miniimdb.csv", header=F)
head(mini)
Film Genre1 Genre2 Genre3 Reviewer Rating Word Word_ct
1 Up Animation Adventure Comedy Kelly 9 the 22
2 Up Animation Adventure Comedy Kelly 9 dog 5
3 Up Animation Adventure Comedy Kelly 9 can 10
4 Up Animation Adventure Comedy Kelly 9 wow 2
5 Fame Drama Music <NA> Jenn 8 the 15
6 Fame Drama Music <NA> Jenn 8 dance 12
# Each row indicates one word as used by one reviewer for one movie. In the
first line, we see that Kelly used "the" 22 times in her review of
"Up". The line gives us information about her review: she gave
"Up" 9 stars (out of 10), 22/32 people found her review helpful.
"Up" is listed as multiple genres, which will become more inconvenient
later. It would be easier if each row of "Up" occurred as three
rows--identical except for "Genre". So that's one transformation
I'm not sure how to do.
# I also need to sum up, say, how many times any word appears in a given genre
(and later how often it appears in a given rating category so I can do the
conditional probabilities and get my observed vs. expected values). Here's
one way I thought about doing it:
data.by.genre1<-split(mini,mini$Genre1)
data.by.genre1$Animation
Film Genre1 Genre2 Genre3 Reviewer Rating Word Word_ct
1 Up Animation Adventure Comedy Kelly 9 the 22
2 Up Animation Adventure Comedy Kelly 9 dog 5
3 Up Animation Adventure Comedy Kelly 9 can 10
4 Up Animation Adventure Comedy Kelly 9 wow 2
11 Up Animation Adventure Comedy Eddie 4 the 12
12 Up Animation Adventure Comedy Eddie 4 dog 4
13 Up Animation Adventure Comedy Eddie 4 can 2
14 Up Animation Adventure Comedy Eddie 4 boy 2
data.ani=data.by.genre1$Animation
ani.wc<-sum(data.ani$V10)
ani.wc
[1] 59
miniplus=cbind(mini,ani.wc)
head(miniplus)
Film Genre1 Genre2 Genre3 Reviewer Rating Word Word_ct ani.wc
1 Up Animation Adventure Comedy Kelly 9 the 22 59
2 Up Animation Adventure Comedy Kelly 9 dog 5 59
3 Up Animation Adventure Comedy Kelly 9 can 10 59
4 Up Animation Adventure Comedy Kelly 9 wow 2 59
5 Fame Drama Music <NA> Jenn 8 the 15 59
6 Fame Drama Music <NA> Jenn 8 dance 12 59
# That's one approach, but it seems REALLY inefficient since I would have to
have add a new column specific to each genre, whether it was appropriate for the
film or not. Also, sometimes a genre is listed as the first genre, sometimes as
the second or third. I actually want to report the total of all
"the"'s in Animation, whether Animation is listed as the first,
second, or third genre. I'm not sure how to do that.
# I've also tried to use the reshape library. That gets me a little closer
to the kind of data I want, but I'm not really sure how to append it
smartly.
library(reshape)
# To use "melt", I seem to need to shave off the columns I'm not
interested in (film title and reviewer).
mininew<-cbind(mini[2:4],mini[6:8])
head(melt(mininew,id=c("Word", "Genre1", "Genre2",
"Genre3", "Rating"), measured=c("Word_Ct")))
Word Genre1 Genre2 Genre3 Rating variable value
1 the Animation Adventure Comedy 9 Word_ct 22
2 dog Animation Adventure Comedy 9 Word_ct 5
3 can Animation Adventure Comedy 9 Word_ct 10
4 wow Animation Adventure Comedy 9 Word_ct 2
5 the Drama Music <NA> 8 Word_ct 15
6 dance Drama Music <NA> 8 Word_ct 12
# This almost does what I want it to, except that it is only looking at the
first genre tag--I'd like to combine all "the" for Animation,
whether "Animation" is listed as the first, second, or third genre
tag.
# Using cast gets me even closer.
cast(mininew, Genre1 ~ Word, sum)
Using Word_ct as value column. Use the value argument to cast to override this
choice
Genre1 bank boy can dance dog great heist shit the wow
1 Animation 0 2 12 0 9 0 0 0 34 2
2 Crime 3 0 2 0 0 0 3 0 12 0
3 Drama 0 0 0 16 0 4 0 4 25 0
# Now that I know that there are 34 "the"'s in Animation, how do I
relate that back to my main data set? In order to figure out whether words are
over- or under-represented, I need to be able to figure out what we would have
expected at chance and then compare the observed to the expected. Again, I was
planning on using a g-test for this. And I was planning on reporting it per
line, with each line being a word in a particular genre in a particular rating
category (1-10 stars).
# Again, my ultimate goal is to report observed and expected word counts for
each word in each Genre in each Rating (1-10):
Word | Genre | Rating | Word_ct | Word_ct_in_genre | Word_ct_in_Rating |
Expected_word_ct | G-test-score
Any suggestions would be greatly appreciated.
Thanks,
Tyler
[[alternative HTML version deleted]]
