I'm an R newbie but recently discovered the ggplot2 and reshape
packages which seem incredibly useful and much easier to use for a
beginner. Using the data from the IMDB, I'm trying to see how the
average movie rating varies by year. Here is what my data looks like:

> ratings <- read.delim("groomed.list", header = TRUE, sep = "|", comment.char = "")
> ratings <- subset(ratings, VoteCount > 100)
> head(ratings)
   Title                            Histogram  VoteCount VoteMean Year
1  !Huff (2004) (TV)                0000000016       299      8.4 2004
8  'Allo 'Allo! (1982)              0000000125       829      8.6 1982
50 .hack//SIGN (2002)               0000001113       150      7.0 2002
56 1-800-Missing (2003)             0000000103       118      5.4 2003
66 Greatest Artists (2000) (mini)   00..000016       110      7.8 2000
77 00 Scariest Movie (2004) (mini)  00..000115       256      8.6 2004

The above data is not aggregated. So after playing around with basic
R functionality, I stumbled across the 'aggregate' function and was
able to see the information in the manner I desired (average movie
rating by year).

> byYear <- aggregate(ratings$VoteMean, list(Year = ratings$Year), mean)
> plot(byYear)

Having just discovered ggplot2, I wanted to create the same graph but
augment it with a color attribute based on the total number of votes
in a year. So first I tried to see if I could reproduce the above:

> library(ggplot2)
> qplot(Year, x, byYear)

This did not work as expected because the x-axis contained labels for
each and every year, making it impossible to read, whereas the plot
created with basic R had nice x-axis labels. How do I get 'qplot' to
treat the x-axis in a similar manner to 'plot'?

After playing around further, I was able to get 'qplot' to work in a
manner similar to 'plot' with regards to the x-axis labels by using
'melt' and 'cast'. The 'qplot' now behaves correctly:

> mratings <- melt(ratings, id = c("Title", "Year"), measure = c("VoteCount", "VoteMean"))
> byYear2 <- cast(mratings, Year ~ variable, mean, subset = variable == "VoteMean")
> qplot(Year, VoteMean, data = byYear2)

How do 'byYear' and 'byYear2' differ? I am trying to use 'typeof' but
both seem to be lists. However, they are clearly different in some
way because 'qplot' graphs them differently.

Finally, I'd like to use a color attribute to 'qplot' to augment each
point with a color based on the total number of votes for the year.
Using attributes with 'qplot' seems simple, but I'm having a hard time
grooming my data appropriately. I believe this requires aggregation
by summing the VoteCount column. Is there a way to cast the data
using different aggregation functions for various columns? In my
case, I want the mean of the VoteMean column, and the sum of the
VoteCount column. Then I want to produce a graph showing the average
movie rating per year but with each point colored to reflect the total
number of votes for that year. Any pointers?

Thanks,
Pete
hadley wickham
2007-Jul-12 07:35 UTC
[R] ggplot2 / reshape / Question on manipulating data
On 7/12/07, Pete Kazmier <pete-expires-20070910 at kazmier.com> wrote:
> I'm an R newbie but recently discovered the ggplot2 and reshape
> packages which seem incredibly useful and much easier to use for a
> beginner. Using the data from the IMDB, I'm trying to see how the
> average movie rating varies by year. Here is what my data looks like:
>
> > ratings <- read.delim("groomed.list", header = TRUE, sep = "|", comment.char = "")
> > ratings <- subset(ratings, VoteCount > 100)
> > head(ratings)
>    Title                            Histogram  VoteCount VoteMean Year
> 1  !Huff (2004) (TV)                0000000016       299      8.4 2004
> 8  'Allo 'Allo! (1982)              0000000125       829      8.6 1982
> 50 .hack//SIGN (2002)               0000001113       150      7.0 2002
> 56 1-800-Missing (2003)             0000000103       118      5.4 2003
> 66 Greatest Artists (2000) (mini)   00..000016       110      7.8 2000
> 77 00 Scariest Movie (2004) (mini)  00..000115       256      8.6 2004

Have you tried using the movies dataset included in ggplot? Or is
there some data you want that is not in that dataset?

> The above data is not aggregated. So after playing around with basic
> R functionality, I stumbled across the 'aggregate' function and was
> able to see the information in the manner I desired (average movie
> rating by year).
>
> > byYear <- aggregate(ratings$VoteMean, list(Year = ratings$Year), mean)
> > plot(byYear)
>
> Having just discovered ggplot2, I wanted to create the same graph but
> augment it with a color attribute based on the total number of votes
> in a year. So first I tried to see if I could reproduce the above:
>
> > library(ggplot2)
> > qplot(Year, x, byYear)
>
> This did not work as expected because the x-axis contained labels for
> each and every year making it impossible to read whereas the plot
> created with basic R had nice x-axis labels. How do I get 'qplot' to
> treat the x-axis in a similar manner to 'plot'?

The problem is probably that Year is a factor, and factors are
labelled on every level (even if the labels overlap, which is a bug).
There's no terribly easy way to fix this, but the following will
work:

  qplot(as.numeric(as.character(Year)), x, data=byYear)

> After playing around further, I was able to get 'qplot' to work in a
> manner similar to 'plot' with regards to the x-axis labels by using
> 'melt' and 'cast'. The 'qplot' now behaves correctly:
>
> > mratings <- melt(ratings, id = c("Title", "Year"), measure = c("VoteCount", "VoteMean"))
> > byYear2 <- cast(mratings, Year ~ variable, mean, subset = variable == "VoteMean")
> > qplot(Year, VoteMean, data = byYear2)
>
> How do 'byYear' and 'byYear2' differ? I am trying to use 'typeof' but
> both seem to be lists. However, they are clearly different in some
> way because 'qplot' graphs them differently.

Try using str - it's much more helpful, and you should see the
difference quickly.

> Finally, I'd like to use a color attribute to 'qplot' to augment each
> point with a color based on the total number of votes for the year.
> Using attributes with 'qplot' seems simple, but I'm having a hard time
> grooming my data appropriately. I believe this requires aggregation
> by summing the VoteCount column. Is there a way to cast the data
> using different aggregation functions for various columns? In my

Not easily, unfortunately. However, you could do:

  cast(mratings, Year ~ variable, c(mean, sum),
       subset = variable %in% c("VoteMean", "VoteCount"))

which will give you a mean and sum for both.

> case, I want the mean of the VoteMean column, and the sum of the
> VoteCount column. Then I want to produce a graph showing the average
> movie rating per year but with each point colored to reflect the total
> number of votes for that year. Any pointers?

Using the built in movies data:

  mm <- melt(movies, id=1:2, m=c("rating", "votes"))
  msum <- cast(mm, year ~ variable, c(mean, sum))
  qplot(year, rating_mean, data=msum, colour=votes_sum)
  qplot(year, rating_mean, data=msum, colour=votes_sum, geom="line")

Hadley
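
For the mixed aggregation asked about in this thread (mean of VoteMean, sum of VoteCount), one possible alternative that avoids reshape is base R's aggregate plus merge. The following is only a minimal sketch, not the approach suggested in the reply: the toy data frame and the name byYear3 are made up for illustration, and the column names are assumed to match the original ratings data.

```r
# Toy stand-in for the ratings data frame (values are invented for illustration).
ratings <- data.frame(
  Year      = c(2000, 2000, 2001, 2001, 2001),
  VoteMean  = c(7.0, 8.0, 5.0, 6.0, 7.0),
  VoteCount = c(100, 300, 150, 250, 200)
)

# Aggregate each column with its own function, then join the results on Year.
meanByYear <- aggregate(VoteMean ~ Year, data = ratings, FUN = mean)
sumByYear  <- aggregate(VoteCount ~ Year, data = ratings, FUN = sum)
byYear3    <- merge(meanByYear, sumByYear, by = "Year")  # hypothetical name

# Make sure Year is numeric rather than a factor, so a plot treats it
# as a continuous axis; this is a no-op if it is already numeric.
byYear3$Year <- as.numeric(as.character(byYear3$Year))

print(byYear3)
```

With ggplot2 loaded, the result could then be plotted in the same style as the calls above, for example qplot(Year, VoteMean, data = byYear3, colour = VoteCount).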