thr3ads.net - R help - [R] Using a by() function to process several regression (lm()) functions [Nov 2009]

If this information is useful, please help other people find it:
Share via:

Marc Los Huertos

2009-Nov-05 19:12 UTC

[R] Using a by() function to process several regression (lm()) functions

Hello,

Thank you very much for looking at this. I have a "seasonal" user for
R. I
teach my undergrads and graduates students statistics using R and often find
myself trying to solve problems to process student collected data in an
efficient way.

In this case, I have a data.frame with multiple observations. These are gas
concentrations in a chamber and are used to measure into rates, usually (not
not always), four or five observations per chamber.

Data are from several sites and multiple dates. I'd like to run a regression
to calculate the rate of emissions, for each date and site.

The regression is using concentration ~ Sample_interval (which is in
seconds).

lm(Concentration ~ Sample_interval, data=df)

The data.frame looks something like this:

Date               Time                     Site
Sample_interval                Concentration
10/03/09          5:00                     Site 1
0                                       334
10/03/09          5:00                     Site 1
566                                   355
10/03/09          5:00                     Site 1
1005                                 367
10/03/09          5:00                     Site 1
1244                                 384
10/03/09          5:00                     Site 2
0                                       434
10/03/09          5:00                     Site 2
466                                   455
10/03/09          5:00                     Site 2
1005                                 437
10/03/09          5:00                     Site 2                 1349
                                474
10/12/09          5:00                     Site 1
0                                       354
10/12/09          5:00                     Site 1
506                                   359
10/12/09          5:00                     Site 1
1065                                 377
10/12/09          5:00                     Site 1
1184                                 394
10/12/09          5:00                     Site 2
0                                       424
10/12/09          5:00                     Site 2
396                                   495
10/12/09          5:00                     Site 2                 1022
                                497
10/12/09          5:00                     Site 2
1304                                 574

I tried this:

bylist.lm <- by(df, list(df$Date, df$Site),
      function(x) lm(Concentration~Sample_interval, data=x))

Then, I get a list of stuff...which seems hard very exact much...

I tried this...

rate <- (sapply(bylist.lm, coef)[2,])

and got, the rates, but I can't figure out how to line them up with the
"by"
variables.

So... In a perfect world, I would like to get a data frame with the
intercept, slope, r.square, p.value for each date, site, so the question
involves extracting from a "by" class, and mode "list".
Granted, I have been
confused for a few years on how to think about classes and modes...so some
insight there would also be quite appreciated...


-- 
Marc Los Huertos
Assistant Professor
Science and Environmental Policy
100 Campus Drive
Seaside CA, 93955

831-582-3209

	[[alternative HTML version deleted]]

Charlie Sharpsteen

2009-Nov-06 03:13 UTC

head link

[R] Using a by() function to process several regression (lm()) functions

On Thu, Nov 5, 2009 at 11:12 AM, Marc Los Huertos <mloshuertos at
csumb.edu> wrote:> Hello,
>
> Thank you very much for looking at this. I have a "seasonal" user
for R. I
> teach my undergrads and graduates students statistics using R and often
find
> myself trying to solve problems to process student collected data in an
> efficient way.
>
> In this case, I have a data.frame with multiple observations. These are gas
> concentrations in a chamber and are used to measure into rates, usually
(not
> not always), four or five observations per chamber.
>
> Data are from several sites and multiple dates. I'd like to run a
regression
> to calculate the rate of emissions, for each date and site.
>
> The regression is using concentration ~ Sample_interval (which is in
> seconds).
>
> lm(Concentration ~ Sample_interval, data=df)
>
> The data.frame looks something like this:

  # The structure of your data got mangled in the email, so I'm
reprinting it here in a
  # more stable form incase anyone else wants to give this problem a try.
  experimentData <-  structure(list(Date = structure(c(-715808,
-715808, -715808,
-715808, -715808, -715808, -715808, -715808, -715533, -715533,
-715533, -715533, -715533, -715533, -715533, -715533), class =
"Date"),
    Time = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L), .Label = "5:00", class =
"factor"),
    Site = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,
    1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Site 1", "Site 2"),
class = "factor"),
    Sample_interval = c(0L, 566L, 1005L, 1244L, 0L, 466L, 1005L,
    1349L, 0L, 506L, 1065L, 1184L, 0L, 396L, 1022L, 1304L),
Concentration = c(334L,
    355L, 367L, 384L, 434L, 455L, 437L, 474L, 354L, 359L, 377L,
    394L, 424L, 495L, 497L, 574L)), .Names = c("Date",
"Time",
"Site", "Sample_interval", "Concentration"),
row.names = c(NA,
-16L), class = "data.frame")

> I tried this:
>
> bylist.lm <- by(df, list(df$Date, df$Site),
> ? ? ?function(x) lm(Concentration~Sample_interval, data=x))
>
> Then, I get a list of stuff...which seems hard very exact much...
>
> I tried this...
>
> rate <- (sapply(bylist.lm, coef)[2,])
>
> and got, the rates, but I can't figure out how to line them up with the
"by"
> variables.
>
> So... In a perfect world, I would like to get a data frame with the
> intercept, slope, r.square, p.value for each date, site, so the question
> involves extracting from a "by" class, and mode "list".
Granted, I have been
> confused for a few years on how to think about classes and modes...so some
> insight there would also be quite appreciated...

Well, the by() function returns what is basically a glorified list of
objects.  In your first case, you got a bunch of "lm" linear model
objects since that is what the function you provided to by() returns.
If you want to return a table of data, you will need to provide a more
sophisticated function that fits the linear model and then extracts
the statistics of interest and returns a vector/list/data.frame
containing the results:


  modelResults <- by( experimentData, list( experimentData[[ 'Date'
]], experimentData[[ 'Site' ]] ), function( dataSlice ){

    linMod <- lm( Concentration ~ Sample_interval, data = dataSlice )

    # Slope and intercept may be recovered from the output of the
coef() function:
    intercept <- coef( linMod )[1]
    slope <- coef( linMod )[2]

    # The R-Squared value is returned by the summary() function:
    rsq <- summary( linMod )[[ 'r.squared' ]]

    # The summary function also provides statistics for the F-distribution,
    # extract them, reformat as a list, rename and feed to pf() using do.call()
    # in order to get the p-value:
    fStats <- as.list( summary( linMod )[[ 'fstatistic' ]] )
    names( fStats ) <- c( 'q', 'df1', 'df2' )
    fStats[[ 'lower.tail' ]] <- FALSE

    pVal <- do.call( pf, fStats )

    return(
      data.frame( slope, intercept, rsq, pVal )
    )

  })

So, now modelResults is a list returned from by(), each component of
which contains a row of the statistics of interest. To collapse it
into a data.frame, use rbind() inside do.call():

  modelResults <- do.call( rbind, modelResults )



The approach outlined above gets the job done-- however some
information is lost or mangled in the translation such as which Site
and Date were used to fit a given model.  A more elegant solution is
provided by the __ply() functions in Hadley Wickham's plyr package.
These do the same job as by(), but eliminate the need to post process
the data with do.call() and preserve information about the parameters
used to subset the data.  Here we use ddply() since we are inputting a
data.frame and expecting a data.frame as output:

  require( plyr )

  modelResults <- ddply( experimentData, c( 'Date', 'Site' ),
function( dataSlice ){

    linMod <- lm( Concentration ~ Sample_interval, data = dataSlice )

    intercept <- coef( linMod )[1]
    slope <- coef( linMod )[2]

    rsq <- summary( linMod )[[ 'r.squared' ]]

    fStats <- as.list( summary( linMod )[[ 'fstatistic' ]] )
    names( fStats ) <- c( 'q', 'df1', 'df2' )
    fStats[[ 'lower.tail' ]] <- FALSE

    pVal <- do.call( pf, fStats )

    return(
      data.frame( slope, intercept, rsq, pVal )
    )

  })

Hadley's ggplot2 package also provides a nice way to efficiently
subset your data and plot it along with the linear model fits:

  require( ggplot2 )

  resultPlot <- qplot( Sample_interval, Concentration, data = experimentData
) +
    # facet_grid() breaks your data into sub-groups based on the values of
    # Date and Site in experimentData.
    facet_grid( Date ~ Site ) +
    # geom_smooth can be used to fit a variety of models, in this case we want a
    # linear model instead of the default LOESS.
    geom_smooth( method = 'lm' ) +
    # theme_bw() changes the default color scheme.
    theme_bw()

  print( resultPlot )


I hope this helps!


-Charlie

Charlie Sharpsteen

2009-Nov-06 16:25 UTC

head link

[R] Using a by() function to process several regression (lm()) functions

On Thu, Nov 5, 2009 at 11:15 PM, Marc Los Huertos <mloshuertos at
csumb.edu> wrote:> Hi Charlie,
>
> Wow, I like this approach and see the problem my list of lm objects.? It
> does not work well. You have created a list of the values of interest,
which
> seems obvious in hindsight, but still extracting the values with the
> do.call(rbind()) bit is certainly outside my experience.
>
> I'll have to look at the do.call() and see if I can backward engineer
what
> it is doing...always more to do!? :-)
do.call() is an incredibly useful function-- I was able to do data
processing much more efficiently after I found it.  Basically, it
takes two arguments-- a function and a list.  The function is called
and the list is used as the arguments.  Since rbind() takes an
arbitrarily long list of objects, using do.call and rbind() or cbind()
is a quick way to collapse a list into a matrix or data.frame.  If the
function takes named arguments, such as pf(), do.call will match the
names in the list with the names of the arguments-- this is the reason
for all the monkey business I pulled by:

  1. Extracting the parameters of the F-distribution from a summary of
the linear model.
  2. Converting them from a vector to a list.
  3. Renaming them so that they matched the arguments to pf()
> Another suggestion include this to extract the p-value,
> anova(linMod)$'Pr(>F)'[1], which seems more straight forward. Do
you see any
> reason why this should be a problem?? It seems to work fine when I inserted
> it into your code.
This looks like a much more efficient method!
> However, the plyr() package seems best to solve the other problem of trying
> to extract my date and site information, which I need to run the rest of
the
> analysis (i.e. the treatment difference, which is what the point is!).? I
am
> disappointed I didn't find it after a few hours of searching, but that
is
> another issue.? Do you have any idea why the function has the syntax that
> include dots for each argument, e.g. .data and .fun. I am sure there is
some
> logic, but I didn't find a reference in the help. Perhaps, this
convention
> is not important, but it does beg the question for me...
I believe this is just the convention that Hadley decided to use in
the plyr package.  Another incredibly useful package of his that you
may want to check out for data processing is 'reshape'.  It is based
on plyr and uses some of the same conventions.  You can find good
documentation, examples and papers concerning his R packages on his
website:

  http://had.co.nz/
> Thank you very much! I appreciate the diverse set of solutions, I am sure
> I'll find use for each of them...
>
> cheers, marc
No problem, have fun busting that data apart!

-Charlie

Maybe Matching Threads

Search for more maybe matching threads

R help - Nov 2009 - Using a by() function to process several regression (lm()) functions

[R] Using a by() function to process several regression (lm()) functions

[R] Using a by() function to process several regression (lm()) functions

[R] Using a by() function to process several regression (lm()) functions

Maybe Matching Threads