I have a large data set of airport data and wish to analyze it by hour and day of the week. Hour and day of the week are factors.

I can do something such as:
    histogram(~(Arrival.Val) | DAY*Hour, type="count", breaks=60)
which displays the data the way I want it in principle, but the plots are too small to read. I added layout=c(7,6,4) to the argument list, but then I only get the first page of plots. How do I see the other pages?

And I would like to add a Poisson distribution fit to each of these plots (see below), but am clueless as to how to go about it.

I would like to fit a distribution to the count data for each combination of day and hour, and I am unable to see how to do this in a vector manner. For example, I tried
    density((Arrival.Val | DAY*Hour), na.rm=TRUE)
which does not work.

I think my question boils down to "how do you replace a whole data set by its factored subsets in all of the usual R commands?"

I am climbing up a steep R learning curve, and so would appreciate some help.

Thanks,
Jim
Hi,

On Thu, Dec 24, 2009 at 3:24 PM, James Rome <jamesrome at gmail.com> wrote:
> I think my question boils down to "how do you replace a whole data set
> by its factored subsets in all of the usual R commands?

I think the answer to your question is: I'm not sure that there's a way to do that in "all of the usual R commands", but you can split up your data first, then run "all of the usual R commands" on them :-)

Functions you want to look at are probably:
    ?split
    ?by
and
    ?tapply

You might try to look at the plyr library first, though:
http://cran.r-project.org/web/packages/plyr/index.html

It provides some groovy ways to iterate/manipulate sets/subsets of data.

Hope that helps,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
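As a rough illustration of the split-then-apply idea suggested above, here is a minimal sketch in base R. The data frame `d` is a made-up stand-in for the airport data (only the column names DAY, Hour and Arrival.Val come from the thread), so treat the numbers as placeholders rather than anything resembling the real counts.

```r
set.seed(42)

# Hypothetical stand-in for the airport data: 10 observations per DAY x Hour cell.
d <- expand.grid(DAY = factor(c("Mon", "Tue")), Hour = factor(0:2), rep = 1:10)
d$Arrival.Val <- rpois(nrow(d), lambda = 10)

# tapply(): one summary value per DAY x Hour cell, returned as a matrix.
cell_means <- tapply(d$Arrival.Val, list(d$DAY, d$Hour), mean)

# split(): one vector of counts per cell, named "Mon.0", "Tue.0", ...
pieces <- split(d$Arrival.Val, list(d$DAY, d$Hour))

# Any ordinary function can then be run on each piece, e.g. density().
dens <- lapply(pieces, density, na.rm = TRUE)
```

Here `cell_means["Mon", "0"]` and `mean(pieces[["Mon.0"]])` refer to the same cell, which is a convenient sanity check when moving between the two forms.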
On Dec 24, 2009, at 3:24 PM, James Rome wrote:
> I have a large data set of airport data and wish to analyze it by hour
> and day of the week. hour and day of the week are factors.
>
> I can do something such as:
> histogram(~(Arrival.Val) | DAY*Hour, type="count", breaks=60)
> which displays the data the way I want it in principle, but the plots
> are too small to read. I added layout=c(7,6,4) to the argument list,
> but then I only get the first page of plots. How do I see the other pages?

I was not aware that layout had a paging argument, but that just shows you that there are large gaps in my knowledge. If I munge one of the examples on the xyplot help page I get (ugly) multi-page output:

    pdf("test.pdf")
    xyplot(Sepal.Length + Sepal.Width ~ Petal.Length + Petal.Width | Species,
           data = iris, scales = "free", layout = c(2, 1, 2),
           auto.key = list(x = .6, y = .7, corner = c(0, 0)))
    dev.off()

You may not be getting what you expect, but it may be that your plots are all being created, but too quickly to be seen. Try printing to a more durable "canvas".

> And I would like to add a Poisson Distribution fit to each of these
> plots (see below), but am clueless as to how to go about it.
>
> I would like to fit a distribution to the count data for each
> combination of day and hour, and I am unable to see how to do this
> in a vector manner. For example, I tried
> density((Arrival.Val | DAY*Hour), na.rm=TRUE)
> which does not work.

I should think that this would be informative:

    glm(Arrival.Val ~ DAY*Hour, family="poisson")

Since DAY and Hour are factors you will get a large number of estimates. You can use the typical regression functions, such as predict() and summary(), to get the fitted values.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
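To make the "durable canvas" suggestion concrete, here is a small sketch that writes a multi-page lattice plot to a PDF file. It uses lattice's trellis.device() rather than calling pdf() directly; as I understand the lattice documentation, its color argument defaults to FALSE only for postscript devices, so it is set explicitly here. The output path and the simplified formula are my own choices for illustration, not something from the thread.

```r
library(lattice)

out <- file.path(tempdir(), "test.pdf")   # hypothetical output location

# Open a PDF device through lattice and ask for the colour theme explicitly.
trellis.device(device = "pdf", file = out, color = TRUE)

# layout = c(columns, rows, pages): one panel per page, three pages in total.
p <- xyplot(Sepal.Length ~ Petal.Length | Species, data = iris,
            layout = c(1, 1, 3))

print(p)    # lattice objects only draw when printed (needed inside scripts and functions)
dev.off()   # the file is not complete until the device is closed
```

A file that a viewer reports as damaged is often one that was opened before dev.off() was called, so that is worth ruling out before suspecting a bug.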
Grrrr....

Quote: " I am climbing up a steep R learning curve, and so would appreciate some help."

Please help stamp out technical illiteracy: a steep learning curve is a GOOD thing. Think about what a "learning curve" represents. Time is the x-axis and "knowledge" or "ability" is the y-axis. Your objective is to get as high on the y-axis as soon as possible. Thus, a steep curve is good, and a flat curve is bad.

Thank you,
Carl
Thanks for the help.

I tried making the pdf file as suggested. Acrobat said it was damaged and could not be opened. Is this an R bug? It did make a PostScript file that I was able to distill into PDF, but it was in grayscale. How do I get the color back?

And yes, I did do the layout I wanted so I could see how the days compared for each hour.

On 12/24/09 4:56 PM, David Winsemius wrote:
> pdf("test.pdf")
> xyplot(Sepal.Length + Sepal.Width ~ Petal.Length + Petal.Width |
>        Species, data = iris, scales = "free", layout = c(2, 1, 2),
>        auto.key = list(x = .6, y = .7, corner = c(0, 0)))
> dev.off()
>
> You may not be getting what you expect, but it may be that your plots
> are all being created, but too quickly to be seen. Try printing to a
> more durable "canvas".
>
>> And I would like to add a Poisson Distribution fit to each of these
>> plots (see below), but am clueless as to how to go about it.
>>
>> I would like to fit a distribution to the count data for each
>> combination of day and hour, and I am unable to see how to do this in a
>> vector manner. For example, I tried
>> density((Arrival.Val | DAY*Hour), na.rm=TRUE)
>> which does not work.
>
> I should think that this would be informative:
>
> glm(Arrival.Val ~ DAY*Hour, family="poisson")
>
> Since DAY and Hour are factors you will get a large number of
> estimates. You can use the typical regression functions, such as
> predict() and summary() to get the fitted values.
I tried glm:
---------
    > glm(Arrival.Val ~ DAY*as.factor(Hour), family="poisson")

Call:  glm(formula = Arrival.Val ~ DAY * as.factor(Hour), family = "poisson")

Coefficients:
                           (Intercept)    3.15396
                         DAY[T.Monday]   -0.61348
                       DAY[T.Saturday]   -0.43853
                         DAY[T.Sunday]   -0.93475
                       DAY[T.Thursday]   -0.23109
                        DAY[T.Tuesday]   -0.38137
                      DAY[T.Wednesday]   -0.35715
                  as.factor(Hour)[T.1]   -1.01389
                  as.factor(Hour)[T.2]   -1.07451
                  as.factor(Hour)[T.3]   -0.69315
                  as.factor(Hour)[T.4]   -0.87384
                  as.factor(Hour)[T.5]   -0.57808
                  as.factor(Hour)[T.6]   -0.41122
                  as.factor(Hour)[T.7]    0.26453
                  as.factor(Hour)[T.8]   -0.08802
                  as.factor(Hour)[T.9]   -0.01618
                 as.factor(Hour)[T.10]    0.33495
                 as.factor(Hour)[T.11]    0.40389
                 as.factor(Hour)[T.12]    0.43834
                 as.factor(Hour)[T.13]    0.49019
                 as.factor(Hour)[T.14]    0.56895
                 as.factor(Hour)[T.15]    0.54856
                 as.factor(Hour)[T.16]    0.50895
                 as.factor(Hour)[T.17]    0.49770
                 as.factor(Hour)[T.18]    0.49879
                 as.factor(Hour)[T.19]    0.41296
                 as.factor(Hour)[T.20]    0.37310
                 as.factor(Hour)[T.21]    0.26455
                 as.factor(Hour)[T.22]    0.14955
                 as.factor(Hour)[T.23]    0.07016
    DAY[T.Monday]:as.factor(Hour)[T.1]    1.02978
  DAY[T.Saturday]:as.factor(Hour)[T.1]    0.81973
    DAY[T.Sunday]:as.factor(Hour)[T.1]    0.58645
  DAY[T.Thursday]:as.factor(Hour)[T.1]    0.17046
   DAY[T.Tuesday]:as.factor(Hour)[T.1]    0.66905
 DAY[T.Wednesday]:as.factor(Hour)[T.1]    0.63300
    DAY[T.Monday]:as.factor(Hour)[T.2]    0.61348
  DAY[T.Saturday]:as.factor(Hour)[T.2]         NA
. . . .
  DAY[T.Tuesday]:as.factor(Hour)[T.22]    0.37518
DAY[T.Wednesday]:as.factor(Hour)[T.22]    0.34362
   DAY[T.Monday]:as.factor(Hour)[T.23]    0.52431
 DAY[T.Saturday]:as.factor(Hour)[T.23]    0.04906
   DAY[T.Sunday]:as.factor(Hour)[T.23]    0.68802
 DAY[T.Thursday]:as.factor(Hour)[T.23]    0.39860
  DAY[T.Tuesday]:as.factor(Hour)[T.23]    0.43209
DAY[T.Wednesday]:as.factor(Hour)[T.23]    0.49274

Degrees of Freedom: 8124 Total (i.e. Null);  7963 Residual
  (18 observations deleted due to missingness)
Null Deviance:     40120
Residual Deviance: 17030    AIC: 59170
----------------
I am not sure what to make of this.
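One way to read output like the above: a Poisson glm uses a log link by default, so the coefficients are on the log scale. The intercept, exp(3.15396), is roughly 23.4 expected arrivals for the reference cell (presumably Friday at hour 0, since Friday sorts first alphabetically; that is my inference, not something stated in the thread). With a full DAY*Hour interaction the model has one free parameter per cell, so the fitted mean for each cell is simply that cell's sample mean. The sketch below checks this on made-up data; `d` is a hypothetical stand-in for the airport data.

```r
set.seed(7)

# Hypothetical stand-in data: 20 observations per DAY x Hour cell.
d <- expand.grid(DAY = factor(c("Mon", "Tue", "Wed")), Hour = factor(0:3), rep = 1:20)
d$Arrival.Val <- rpois(nrow(d), lambda = 4 + 2 * as.integer(d$Hour))

fit <- glm(Arrival.Val ~ DAY * Hour, family = poisson, data = d)

# One row per DAY x Hour cell, with the fitted Poisson mean on the count scale.
cells <- unique(d[, c("DAY", "Hour")])
cells$lambda <- predict(fit, newdata = cells, type = "response")

# The same numbers, obtained directly as per-cell sample means.
cell_means <- tapply(d$Arrival.Val, list(d$DAY, d$Hour), mean)
```

The `lambda` column is the per-cell Poisson rate that could later be passed to dpois() when drawing a fitted distribution for each panel.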
So how do I get the fit plotted on top of my histograms? Is there a way to save the bin data from the histogram command?

> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT

Again, thanks for the prompt holiday response.

Jim Rome
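One possible approach to both questions, sketched on made-up data: give histogram() a custom panel function that draws the bars and then overlays a Poisson mass function whose rate is the panel's sample mean (the maximum-likelihood estimate of a Poisson rate). The bins are unit-width and centred on the integers, so with type = "density" the bar heights are directly comparable to dpois(). For the bin data, base R's hist(..., plot = FALSE) returns the breaks and counts without drawing. The data frame `d` and the choice of breaks are assumptions for illustration only.

```r
library(lattice)

set.seed(1)

# Hypothetical stand-in data: 50 observations per DAY x Hour cell.
d <- expand.grid(DAY = factor(c("Mon", "Tue")), Hour = factor(0:2), rep = 1:50)
d$Arrival.Val <- rpois(nrow(d), lambda = 5 + 3 * as.integer(d$Hour))

# Unit-width bins centred on the integers, so density equals probability mass.
brks <- seq(-0.5, max(d$Arrival.Val) + 0.5, by = 1)

p <- histogram(~ Arrival.Val | DAY * Hour, data = d,
               type = "density", breaks = brks,
               panel = function(x, ...) {
                 panel.histogram(x, ...)                  # the usual bars
                 lambda <- mean(x, na.rm = TRUE)          # Poisson rate for this panel
                 k <- 0:max(x, na.rm = TRUE)
                 panel.lines(k, dpois(k, lambda), type = "b", col = "red")
               })

# Bin data for one DAY x Hour subset, computed without plotting.
h <- hist(d$Arrival.Val[d$DAY == "Mon" & d$Hour == "0"],
          breaks = brks, plot = FALSE)
```

Calling print(p) draws the plot on the current device; h$breaks and h$counts hold the bin edges and counts for the chosen subset.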