thr3ads.net - R help - [R] How to separate a data set by its factors [Dec 2009]

James Rome

2009-Dec-24 20:24 UTC

[R] How to separate a data set by its factors

I have a large data set of airport data and wish to analyze it by hour
and day of the week. Hour and day of the week are factors.

I can do something such as:
histogram(~(Arrival.Val) | DAY*Hour, type="count", breaks=60)
which displays the data the way I want it in principle,  but the plots
are too small to read. I added layout=c(7,6,4) to the argument list, but
then I only get the first page of plots. How do I see the other pages?
And I would like to add a Poisson Distribution fit to each of these
plots (see below), but am clueless as to how to go about it.

I would like to fit a distribution to the count data for each
combination of day and hour, and I am unable to see how to do this in a
vector manner.  For example, I tried
density((Arrival.Val | DAY*Hour), na.rm=TRUE)
which does not work.

I think my question boils down to "how do you replace a whole data set
by its factored subsets in all of the usual R commands?"

I am climbing up a steep R learning curve, and so would appreciate some
help.

Thanks,
Jim

Steve Lianoglou

2009-Dec-24 20:34 UTC

Hi,

On Thu, Dec 24, 2009 at 3:24 PM, James Rome <jamesrome at gmail.com>
wrote:
> I think my question boils down to "how do you replace a whole data set
> by its factored subsets in all of the usual R commands?
I think the answer to your question is: I'm not sure that there's a
way to do that in "all of the usual R commands", but you can split up
your data first, then run "all of the usual R commands" on them :-)

Functions you want to look at are probably:

?split
?by and ?tapply

You might try to look at the plyr library first, though:
http://cran.r-project.org/web/packages/plyr/index.html

It provides some groovy ways to iterate/manipulate sets/subsets of data.
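
As a rough sketch of the split-then-apply idea with base R only: the data frame `arrivals` below is a made-up stand-in for the airport data (only the column names `Arrival.Val`, `DAY`, and `Hour` come from this thread), so treat the values as placeholders.

```r
# Hypothetical stand-in for the airport data; only the column names come from the thread.
set.seed(1)
arrivals <- data.frame(
  DAY  = factor(rep(c("Monday", "Tuesday"), each = 40)),
  Hour = factor(rep(rep(0:3, each = 10), times = 2)),
  Arrival.Val = rpois(80, lambda = 20)
)

# split() returns one vector per DAY x Hour combination (names like "Monday.0").
pieces <- split(arrivals$Arrival.Val, list(arrivals$DAY, arrivals$Hour))

# Any ordinary function can then be run on each piece.
piece_means <- sapply(pieces, mean)

# tapply() does the split and the apply in one step and returns a DAY x Hour matrix.
mean_table <- tapply(arrivals$Arrival.Val, list(arrivals$DAY, arrivals$Hour), mean)
```

The same pattern works with any function in place of mean(), for example density or a model fit.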

Hope that helps,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

David Winsemius

2009-Dec-24 21:56 UTC

On Dec 24, 2009, at 3:24 PM, James Rome wrote:
> I have a large data set of airport data and wish to analyze it by hour
> and day of the week. hour and day of the week are factors.
>
> I can do something such as:
> histogram(~(Arrival.Val) | DAY*Hour, type="count", breaks=60)
> which displays the data the way I want it in principle, but the plots
> are too small to read. I added layout=c(7,6,4) to the argument list, but
> then I only get the first page of plots. How do I see the other pages?
I was not aware that layout had a paging argument, but that just shows
you that there are large gaps in my knowledge. If I munge one of the
examples on the xyplot help page I get (ugly) multi-page output:

pdf("test.pdf")
xyplot(Sepal.Length + Sepal.Width ~ Petal.Length + Petal.Width | Species,
       data = iris, scales = "free", layout = c(2, 1, 2),
       auto.key = list(x = .6, y = .7, corner = c(0, 0)))
dev.off()
dev.off()
You may not be getting what you expect, but it may be that your plots  
are all being created, but too quickly to be seen. Try printing to a  
more durable "canvas".
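
A slightly more defensive variant of the same idea, as a sketch rather than a definitive recipe: build the trellis object first, then print() it explicitly to the PDF device. The explicit print() matters when the code runs inside a function, a loop, or a source()d script, where lattice objects are not printed automatically. The output path under tempdir() is just a placeholder.

```r
library(lattice)

# Build the trellis object first; layout = c(columns, rows, pages).
p <- xyplot(Sepal.Length ~ Petal.Length | Species, data = iris,
            layout = c(2, 1, 2))

out_file <- file.path(tempdir(), "test.pdf")  # placeholder output path

pdf(out_file)   # open a PDF device; each lattice page becomes one PDF page
print(p)        # draw the plot explicitly
dev.off()       # close the device so the file is written out completely
```

A PDF viewer may report the file as damaged if it is opened before dev.off() has been called.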
> And I would like to add a Poisson Distribution fit to each of these
> plots (see below), but am clueless as to how to go about it.
>
> I would like to fit a distribution to the count data for each
> combination of day and hour, and I am unable to see how to do this in a
> vector manner.  For example, I tried
> density((Arrival.Val | DAY*Hour), na.rm=TRUE)
> which does not work.
I should think that this would be informative:

glm(Arrival.Val ~ DAY*Hour, family="poisson")

Since DAY and Hour are factors you will get a large number of  
estimates. You can use the typical regression functions, such as  
predict() and summary() to get the fitted values.
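
To make the predict() step concrete, here is a minimal sketch on simulated data. The `arrivals` data frame and its values are invented for illustration; only the column names follow the thread. Because DAY*Hour with both terms as factors gives every day/hour cell its own parameter, the fitted Poisson mean for a cell should simply equal that cell's sample mean.

```r
# Simulated stand-in for the airport data (hypothetical values).
set.seed(42)
arrivals <- expand.grid(DAY  = factor(c("Friday", "Monday", "Tuesday")),
                        Hour = factor(0:3),
                        rep  = 1:30)
arrivals$Arrival.Val <- rpois(nrow(arrivals),
                              lambda = 10 + 3 * as.integer(arrivals$Hour))

fit <- glm(Arrival.Val ~ DAY * Hour, family = poisson, data = arrivals)

# One row per DAY x Hour cell; type = "response" returns the fitted mean count
# instead of the log-scale linear predictor.
cells <- expand.grid(DAY = levels(arrivals$DAY), Hour = levels(arrivals$Hour))
cells$fitted_mean <- predict(fit, newdata = cells, type = "response")

# For comparison: the plain per-cell sample means.
cell_means <- tapply(arrivals$Arrival.Val,
                     list(arrivals$DAY, arrivals$Hour), mean)
```

Read this way, the long coefficient list is just a log-scale encoding of one Poisson rate per day/hour combination.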
>
> I think my question boils down to "how do you replace a whole data set
> by its factored subsets in all of the usual R commands?
>
> I am climbing up a steep R learning curve, and so would appreciate some
> help.
>
> Thanks,

David Winsemius, MD
Heritage Laboratories
West Hartford, CT

Carl Witthoft

2009-Dec-25 01:27 UTC

Grrrr....

Quote:

" I am climbing up a steep R learning curve, and so would appreciate 
some help."


Please help stamp out technical illiteracy: a steep learning curve is a
GOOD thing. Think about what a "learning curve" represents. Time is
the x-axis and "knowledge" or "ability" is the y-axis. Your objective
is to get as high on the y-axis as soon as possible. Thus, a steep
curve is good, and a flat curve is bad.


thank you

Carl

James Rome

2009-Dec-25 14:38 UTC

Thanks for the help.

I tried making the pdf file as suggested. Acrobat said it was damaged
and could not be opened. Is this an R bug?
It did make a PostScript file that I was able to distill into PDF, but
it was in grayscale. How do I get the color back?
And yes, I did do the layout I wanted so I could see how the days
compared for each hour.

On 12/24/09 4:56 PM, David Winsemius wrote:
>
> pdf("test.pdf")
> xyplot(Sepal.Length + Sepal.Width ~ Petal.Length + Petal.Width |
> Species, data = iris, scales = "free", layout = c(2, 1, 2),
> auto.key = list(x = .6, y = .7, corner = c(0, 0)))
> dev.off()
> You may not be getting what you expect, but it may be that your plots
> are all being created, but too quickly to be seen. Try printing to a
> more durable "canvas".
>
>> And I would like to add a Poisson Distribution fit to each of these
>> plots (see below), but am clueless as to how to go about it.
>>
>> I would like to fit a distribution to the count data for each
>> combination of day and hour, and I am unable to see how to do this in a
>> vector manner.  For example, I tried
>> density((Arrival.Val | DAY*Hour), na.rm=TRUE)
>> which does not work.
>
> I should think that this would be informative:
>
> glm(Arrival.Val ~ DAY*Hour, family="poisson")
>
> Since DAY and Hour are factors you will get a large number of
> estimates. You can use the typical regression functions, such as
> predict() and summary() to get the fitted values.
>
I tried glm:
---------
> glm(Arrival.Val ~ DAY*as.factor(Hour), family="poisson")

Call:  glm(formula = Arrival.Val ~ DAY * as.factor(Hour), family = "poisson")

Coefficients:
                           (Intercept)   3.15396
                         DAY[T.Monday]  -0.61348
                       DAY[T.Saturday]  -0.43853
                         DAY[T.Sunday]  -0.93475
                       DAY[T.Thursday]  -0.23109
                        DAY[T.Tuesday]  -0.38137
                      DAY[T.Wednesday]  -0.35715
                  as.factor(Hour)[T.1]  -1.01389
                  as.factor(Hour)[T.2]  -1.07451
                  as.factor(Hour)[T.3]  -0.69315
                  as.factor(Hour)[T.4]  -0.87384
                  as.factor(Hour)[T.5]  -0.57808
                  as.factor(Hour)[T.6]  -0.41122
                  as.factor(Hour)[T.7]   0.26453
                  as.factor(Hour)[T.8]  -0.08802
                  as.factor(Hour)[T.9]  -0.01618
                 as.factor(Hour)[T.10]   0.33495
                 as.factor(Hour)[T.11]   0.40389
                 as.factor(Hour)[T.12]   0.43834
                 as.factor(Hour)[T.13]   0.49019
                 as.factor(Hour)[T.14]   0.56895
                 as.factor(Hour)[T.15]   0.54856
                 as.factor(Hour)[T.16]   0.50895
                 as.factor(Hour)[T.17]   0.49770
                 as.factor(Hour)[T.18]   0.49879
                 as.factor(Hour)[T.19]   0.41296
                 as.factor(Hour)[T.20]   0.37310
                 as.factor(Hour)[T.21]   0.26455
                 as.factor(Hour)[T.22]   0.14955
                 as.factor(Hour)[T.23]   0.07016
    DAY[T.Monday]:as.factor(Hour)[T.1]   1.02978
  DAY[T.Saturday]:as.factor(Hour)[T.1]   0.81973
    DAY[T.Sunday]:as.factor(Hour)[T.1]   0.58645
  DAY[T.Thursday]:as.factor(Hour)[T.1]   0.17046
   DAY[T.Tuesday]:as.factor(Hour)[T.1]   0.66905
 DAY[T.Wednesday]:as.factor(Hour)[T.1]   0.63300
    DAY[T.Monday]:as.factor(Hour)[T.2]   0.61348
  DAY[T.Saturday]:as.factor(Hour)[T.2]        NA

. . . .
  DAY[T.Tuesday]:as.factor(Hour)[T.22]   0.37518
DAY[T.Wednesday]:as.factor(Hour)[T.22]   0.34362
   DAY[T.Monday]:as.factor(Hour)[T.23]   0.52431
 DAY[T.Saturday]:as.factor(Hour)[T.23]   0.04906
   DAY[T.Sunday]:as.factor(Hour)[T.23]   0.68802
 DAY[T.Thursday]:as.factor(Hour)[T.23]   0.39860
  DAY[T.Tuesday]:as.factor(Hour)[T.23]   0.43209
DAY[T.Wednesday]:as.factor(Hour)[T.23]   0.49274

Degrees of Freedom: 8124 Total (i.e. Null);  7963 Residual
  (18 observations deleted due to missingness)
Null Deviance:        40120
Residual Deviance: 17030     AIC: 59170
----------------
I am not sure what to make of this.
So how do I get the fit plotted on top of my histograms?

Is there a way to save the bin data from the histogram command?
>
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
>
Again, thanks for the prompt holiday response.
Jim Rome
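
The thread ends without an answer to the last two questions. One possible approach, offered only as a sketch: use a custom lattice panel function that draws the histogram and then overlays the Poisson probabilities for that panel's own mean, and compute the bin counts separately with hist(plot = FALSE). The `arrivals` data frame, the bin edges `brks`, and the output file name are all assumptions made for illustration.

```r
library(lattice)

# Simulated stand-in for the airport data (hypothetical values).
set.seed(7)
arrivals <- expand.grid(DAY  = factor(c("Monday", "Tuesday")),
                        Hour = factor(8:9),
                        rep  = 1:50)
arrivals$Arrival.Val <- rpois(nrow(arrivals), lambda = 12)

# One bin per integer count, so density-scaled bars are comparable to
# Poisson probabilities.
brks <- seq(-0.5, max(arrivals$Arrival.Val) + 0.5, by = 1)

p <- histogram(~ Arrival.Val | DAY * Hour, data = arrivals,
               type = "density", breaks = brks,
               panel = function(x, ...) {
                 panel.histogram(x, ...)                 # the usual bars
                 k  <- 0:max(x, na.rm = TRUE)
                 pk <- dpois(k, lambda = mean(x, na.rm = TRUE))
                 panel.lines(k, pk, col = "red")         # Poisson overlay
                 panel.points(k, pk, col = "red", pch = 16)
               })

pdf(file.path(tempdir(), "poisson_overlay.pdf"))  # placeholder file name
print(p)
dev.off()

# Bin counts per DAY x Hour cell, computed outside the plot.
bins <- lapply(split(arrivals$Arrival.Val, list(arrivals$DAY, arrivals$Hour)),
               function(v) hist(v, breaks = brks, plot = FALSE)$counts)
```

The per-panel mean used here is the maximum-likelihood Poisson rate for that panel, which matches what the DAY*Hour glm estimates cell by cell.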
