Marc Los Huertos
2009-Nov-05 19:12 UTC
[R] Using a by() function to process several regression (lm()) functions
Hello, Thank you very much for looking at this. I have a "seasonal" user for R. I teach my undergrads and graduates students statistics using R and often find myself trying to solve problems to process student collected data in an efficient way. In this case, I have a data.frame with multiple observations. These are gas concentrations in a chamber and are used to measure into rates, usually (not not always), four or five observations per chamber. Data are from several sites and multiple dates. I'd like to run a regression to calculate the rate of emissions, for each date and site. The regression is using concentration ~ Sample_interval (which is in seconds). lm(Concentration ~ Sample_interval, data=df) The data.frame looks something like this: Date Time Site Sample_interval Concentration 10/03/09 5:00 Site 1 0 334 10/03/09 5:00 Site 1 566 355 10/03/09 5:00 Site 1 1005 367 10/03/09 5:00 Site 1 1244 384 10/03/09 5:00 Site 2 0 434 10/03/09 5:00 Site 2 466 455 10/03/09 5:00 Site 2 1005 437 10/03/09 5:00 Site 2 1349 474 10/12/09 5:00 Site 1 0 354 10/12/09 5:00 Site 1 506 359 10/12/09 5:00 Site 1 1065 377 10/12/09 5:00 Site 1 1184 394 10/12/09 5:00 Site 2 0 424 10/12/09 5:00 Site 2 396 495 10/12/09 5:00 Site 2 1022 497 10/12/09 5:00 Site 2 1304 574 I tried this: bylist.lm <- by(df, list(df$Date, df$Site), function(x) lm(Concentration~Sample_interval, data=x)) Then, I get a list of stuff...which seems hard very exact much... I tried this... rate <- (sapply(bylist.lm, coef)[2,]) and got, the rates, but I can't figure out how to line them up with the "by" variables. So... In a perfect world, I would like to get a data frame with the intercept, slope, r.square, p.value for each date, site, so the question involves extracting from a "by" class, and mode "list". Granted, I have been confused for a few years on how to think about classes and modes...so some insight there would also be quite appreciated... -- Marc Los Huertos Assistant Professor Science and Environmental Policy 100 Campus Drive Seaside CA, 93955 831-582-3209 [[alternative HTML version deleted]]
Charlie Sharpsteen
2009-Nov-06 03:13 UTC
[R] Using a by() function to process several regression (lm()) functions
On Thu, Nov 5, 2009 at 11:12 AM, Marc Los Huertos <mloshuertos at csumb.edu> wrote:> Hello, > > Thank you very much for looking at this. I have a "seasonal" user for R. I > teach my undergrads and graduates students statistics using R and often find > myself trying to solve problems to process student collected data in an > efficient way. > > In this case, I have a data.frame with multiple observations. These are gas > concentrations in a chamber and are used to measure into rates, usually (not > not always), four or five observations per chamber. > > Data are from several sites and multiple dates. I'd like to run a regression > to calculate the rate of emissions, for each date and site. > > The regression is using concentration ~ Sample_interval (which is in > seconds). > > lm(Concentration ~ Sample_interval, data=df) > > The data.frame looks something like this:# The structure of your data got mangled in the email, so I'm reprinting it here in a # more stable form incase anyone else wants to give this problem a try. experimentData <- structure(list(Date = structure(c(-715808, -715808, -715808, -715808, -715808, -715808, -715808, -715808, -715533, -715533, -715533, -715533, -715533, -715533, -715533, -715533), class = "Date"), Time = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "5:00", class = "factor"), Site = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Site 1", "Site 2"), class = "factor"), Sample_interval = c(0L, 566L, 1005L, 1244L, 0L, 466L, 1005L, 1349L, 0L, 506L, 1065L, 1184L, 0L, 396L, 1022L, 1304L), Concentration = c(334L, 355L, 367L, 384L, 434L, 455L, 437L, 474L, 354L, 359L, 377L, 394L, 424L, 495L, 497L, 574L)), .Names = c("Date", "Time", "Site", "Sample_interval", "Concentration"), row.names = c(NA, -16L), class = "data.frame")> I tried this: > > bylist.lm <- by(df, list(df$Date, df$Site), > ? ? ?function(x) lm(Concentration~Sample_interval, data=x)) > > Then, I get a list of stuff...which seems hard very exact much... > > I tried this... > > rate <- (sapply(bylist.lm, coef)[2,]) > > and got, the rates, but I can't figure out how to line them up with the "by" > variables. > > So... In a perfect world, I would like to get a data frame with the > intercept, slope, r.square, p.value for each date, site, so the question > involves extracting from a "by" class, and mode "list". Granted, I have been > confused for a few years on how to think about classes and modes...so some > insight there would also be quite appreciated...Well, the by() function returns what is basically a glorified list of objects. In your first case, you got a bunch of "lm" linear model objects since that is what the function you provided to by() returns. If you want to return a table of data, you will need to provide a more sophisticated function that fits the linear model and then extracts the statistics of interest and returns a vector/list/data.frame containing the results: modelResults <- by( experimentData, list( experimentData[[ 'Date' ]], experimentData[[ 'Site' ]] ), function( dataSlice ){ linMod <- lm( Concentration ~ Sample_interval, data = dataSlice ) # Slope and intercept may be recovered from the output of the coef() function: intercept <- coef( linMod )[1] slope <- coef( linMod )[2] # The R-Squared value is returned by the summary() function: rsq <- summary( linMod )[[ 'r.squared' ]] # The summary function also provides statistics for the F-distribution, # extract them, reformat as a list, rename and feed to pf() using do.call() # in order to get the p-value: fStats <- as.list( summary( linMod )[[ 'fstatistic' ]] ) names( fStats ) <- c( 'q', 'df1', 'df2' ) fStats[[ 'lower.tail' ]] <- FALSE pVal <- do.call( pf, fStats ) return( data.frame( slope, intercept, rsq, pVal ) ) }) So, now modelResults is a list returned from by(), each component of which contains a row of the statistics of interest. To collapse it into a data.frame, use rbind() inside do.call(): modelResults <- do.call( rbind, modelResults ) The approach outlined above gets the job done-- however some information is lost or mangled in the translation such as which Site and Date were used to fit a given model. A more elegant solution is provided by the __ply() functions in Hadley Wickham's plyr package. These do the same job as by(), but eliminate the need to post process the data with do.call() and preserve information about the parameters used to subset the data. Here we use ddply() since we are inputting a data.frame and expecting a data.frame as output: require( plyr ) modelResults <- ddply( experimentData, c( 'Date', 'Site' ), function( dataSlice ){ linMod <- lm( Concentration ~ Sample_interval, data = dataSlice ) intercept <- coef( linMod )[1] slope <- coef( linMod )[2] rsq <- summary( linMod )[[ 'r.squared' ]] fStats <- as.list( summary( linMod )[[ 'fstatistic' ]] ) names( fStats ) <- c( 'q', 'df1', 'df2' ) fStats[[ 'lower.tail' ]] <- FALSE pVal <- do.call( pf, fStats ) return( data.frame( slope, intercept, rsq, pVal ) ) }) Hadley's ggplot2 package also provides a nice way to efficiently subset your data and plot it along with the linear model fits: require( ggplot2 ) resultPlot <- qplot( Sample_interval, Concentration, data = experimentData ) + # facet_grid() breaks your data into sub-groups based on the values of # Date and Site in experimentData. facet_grid( Date ~ Site ) + # geom_smooth can be used to fit a variety of models, in this case we want a # linear model instead of the default LOESS. geom_smooth( method = 'lm' ) + # theme_bw() changes the default color scheme. theme_bw() print( resultPlot ) I hope this helps! -Charlie
Charlie Sharpsteen
2009-Nov-06 16:25 UTC
[R] Using a by() function to process several regression (lm()) functions
On Thu, Nov 5, 2009 at 11:15 PM, Marc Los Huertos <mloshuertos at csumb.edu> wrote:> Hi Charlie, > > Wow, I like this approach and see the problem my list of lm objects.? It > does not work well. You have created a list of the values of interest, which > seems obvious in hindsight, but still extracting the values with the > do.call(rbind()) bit is certainly outside my experience. > > I'll have to look at the do.call() and see if I can backward engineer what > it is doing...always more to do!? :-)do.call() is an incredibly useful function-- I was able to do data processing much more efficiently after I found it. Basically, it takes two arguments-- a function and a list. The function is called and the list is used as the arguments. Since rbind() takes an arbitrarily long list of objects, using do.call and rbind() or cbind() is a quick way to collapse a list into a matrix or data.frame. If the function takes named arguments, such as pf(), do.call will match the names in the list with the names of the arguments-- this is the reason for all the monkey business I pulled by: 1. Extracting the parameters of the F-distribution from a summary of the linear model. 2. Converting them from a vector to a list. 3. Renaming them so that they matched the arguments to pf()> Another suggestion include this to extract the p-value, > anova(linMod)$'Pr(>F)'[1], which seems more straight forward. Do you see any > reason why this should be a problem?? It seems to work fine when I inserted > it into your code.This looks like a much more efficient method!> However, the plyr() package seems best to solve the other problem of trying > to extract my date and site information, which I need to run the rest of the > analysis (i.e. the treatment difference, which is what the point is!).? I am > disappointed I didn't find it after a few hours of searching, but that is > another issue.? Do you have any idea why the function has the syntax that > include dots for each argument, e.g. .data and .fun. I am sure there is some > logic, but I didn't find a reference in the help. Perhaps, this convention > is not important, but it does beg the question for me...I believe this is just the convention that Hadley decided to use in the plyr package. Another incredibly useful package of his that you may want to check out for data processing is 'reshape'. It is based on plyr and uses some of the same conventions. You can find good documentation, examples and papers concerning his R packages on his website: http://had.co.nz/> Thank you very much! I appreciate the diverse set of solutions, I am sure > I'll find use for each of them... > > cheers, marcNo problem, have fun busting that data apart! -Charlie