thr3ads.net - R help - [R] Transforming/appending data (words in IMDB) [Sep 2010]

If this information is useful, please help other people find it:
Share via:
aeneas24 at priest.com
2010-Sep-29 01:24 UTC
[R] Transforming/appending data (words in IMDB)

Hi everyone,

 
I am doing an analysis of reviews in IMDB and am running into trouble getting my
data into the right shape for analysis.
 
Key question: I want to know for each word in the IMDB, whether it is over- or
under-represented in a particular category (Rating x Genre). I was figuring on
estimating this with a g-test, fwiw. But the basic question I'm asking here
is about data transformation/appending. To go from these columns:

Film    | Genre1    | Genre2 | Genre3 | Reviewer | Rating  | Word | Word_ct

to these:

Word | Genre | Rating | Word_ct | Word_ct_in_genre | Word_ct_in_Rating |
Expected_word_ct | G-test-score

The actual amount of data is enormous (I have 10 files of ~1.5 GB each) and I
suspect I'm going to have to learn how to use the bigmemory package or
something like it. But for now, I'd be happy if I could figure out how to do
some basic calculations on a toy-sized subset of the data.
 
Broader goal: I am trying to do is see how words pattern in different reviews.
For example, if someone uses "gross" in a describing a romance, they
probably gave it a low rating. But someone else describing a horror movie with
"gross" may think of it as a positive thing.

Below I've put what I've tried so far, with annotations about my
thinking.
-------------------------------------

# Here is what a little portion of the data looks like:
mini<-read.csv("miniimdb.csv", header=F)
head(mini)
  Film    Genre1    Genre2 Genre3 Reviewer Rating  Word Word_ct
1   Up Animation Adventure Comedy    Kelly      9   the    22
2   Up Animation Adventure Comedy    Kelly      9   dog     5
3   Up Animation Adventure Comedy    Kelly      9   can    10
4   Up Animation Adventure Comedy    Kelly      9   wow     2
5 Fame     Drama     Music   <NA>     Jenn      8   the    15
6 Fame     Drama     Music   <NA>     Jenn      8 dance    12

# Each row indicates one word as used by one reviewer for one movie. In the
first line, we see that Kelly used "the" 22 times in her review of
"Up". The line gives us information about her review: she gave
"Up" 9 stars (out of 10), 22/32 people found her review helpful.
"Up" is listed as multiple genres, which will become more inconvenient
later. It would be easier if each row of "Up" occurred as three
rows--identical except for "Genre". So that's one transformation
I'm not sure how to do.

# I also need to sum up, say, how many times any word appears in a given genre
(and later how often it appears in a given rating category so I can do the
conditional probabilities and get my observed vs. expected values). Here's
one way I thought about doing it:
  
data.by.genre1<-split(mini,mini$Genre1)
data.by.genre1$Animation
   Film    Genre1    Genre2 Genre3 Reviewer Rating Word Word_ct
1    Up Animation Adventure Comedy    Kelly      9  the    22
2    Up Animation Adventure Comedy    Kelly      9  dog     5
3    Up Animation Adventure Comedy    Kelly      9  can    10
4    Up Animation Adventure Comedy    Kelly      9  wow     2
11   Up Animation Adventure Comedy    Eddie      4  the    12
12   Up Animation Adventure Comedy    Eddie      4  dog     4
13   Up Animation Adventure Comedy    Eddie      4  can     2
14   Up Animation Adventure Comedy    Eddie      4  boy     2

data.ani=data.by.genre1$Animation
ani.wc<-sum(data.ani$V10)
ani.wc
[1] 59
miniplus=cbind(mini,ani.wc)
head(miniplus)
  Film    Genre1    Genre2 Genre3 Reviewer Rating  Word Word_ct ani.wc
1   Up Animation Adventure Comedy    Kelly      9   the    22     59
2   Up Animation Adventure Comedy    Kelly      9   dog     5     59
3   Up Animation Adventure Comedy    Kelly      9   can    10     59
4   Up Animation Adventure Comedy    Kelly      9   wow     2     59
5 Fame     Drama     Music   <NA>     Jenn      8   the    15     59
6 Fame     Drama     Music   <NA>     Jenn      8 dance    12     59

# That's one approach, but it seems REALLY inefficient since I would have to
have add a new column specific to each genre, whether it was appropriate for the
film or not. Also, sometimes a genre is listed as the first genre, sometimes as
the second or third. I actually want to report the total of all
"the"'s in Animation, whether Animation is listed as the first,
second, or third genre. I'm not sure how to do that.

# I've also tried to use the reshape library. That gets me a little closer
to the kind of data I want, but I'm not really sure how to append it
smartly.

library(reshape)

# To use "melt", I seem to need to shave off the columns I'm not
interested in (film title and reviewer).

mininew<-cbind(mini[2:4],mini[6:8])
head(melt(mininew,id=c("Word", "Genre1", "Genre2",
"Genre3", "Rating"), measured=c("Word_Ct")))
    Word    Genre1    Genre2   Genre3 Rating variable value
1    the Animation Adventure   Comedy      9    Word_ct 22
2    dog Animation Adventure   Comedy      9    Word_ct 5
3    can Animation Adventure   Comedy      9    Word_ct 10
4    wow Animation Adventure   Comedy      9    Word_ct 2
5    the     Drama     Music     <NA>      8    Word_ct 15
6  dance     Drama     Music     <NA>      8    Word_ct 12

# This almost does what I want it to, except that it is only looking at the
first genre tag--I'd like to combine all "the" for Animation,
whether "Animation" is listed as the first, second, or third genre
tag.

# Using cast gets me even closer.  
cast(mininew, Genre1 ~ Word, sum)
Using Word_ct as value column.  Use the value argument to cast to override this
choice
     Genre1 bank boy can dance dog great heist shit the wow
1 Animation    0   2  12     0   9     0     0    0  34   2
2     Crime    3   0   2     0   0     0     3    0  12   0
3     Drama    0   0   0    16   0     4     0    4  25   0

# Now that I know that there are 34 "the"'s in Animation, how do I
relate that back to my main data set? In order to figure out whether words are
over- or under-represented, I need to be able to figure out what we would have
expected at chance and then compare the observed to the expected. Again, I was
planning on using a g-test for this. And I was planning on reporting it per
line, with each line being a word in a particular genre in a particular rating
category (1-10 stars).

# Again, my ultimate goal is to report observed and expected word counts for
each word in each Genre in each Rating (1-10):
Word | Genre | Rating | Word_ct | Word_ct_in_genre | Word_ct_in_Rating |
Expected_word_ct | G-test-score

Any suggestions would be greatly appreciated. 

Thanks,

Tyler





	[[alternative HTML version deleted]]
Seemingly Similar Threads

Search for more possibly parallel threads
R help - Sep 2010 - Transforming/appending data (words in IMDB)

[R] Transforming/appending data (words in IMDB)

Seemingly Similar Threads

Wisdom of the Ancients