Heather Anne Wright
2012-Feb-14 16:34 UTC
[R] cumsum function to determine plankton phenology
Apologies for the empty email earlier! I have species abundance data sampled at a weekly frequency or sometimes monthly depending on the year. The goal is to identify the dates in an annual cycle in which the cumulative abundance of a species reaches some threshold. Here's an example of the data for 1 species over an annual period: "mc_pheno" is the object created from this data: zoo_date sp 1/8/1988 13.2 1/20/1988 11.6 2/3/1988 8.7 2/17/1988 10.7 3/3/1988 2.5 3/21/1988 2.2 4/12/1988 0.6 5/2/1988 0.3 5/10/1988 0.6 5/25/1988 0.8 6/8/1988 0.8 6/28/1988 3.8 7/5/1988 11.2 7/20/1988 9.3 8/3/1988 31.1 8/17/1988 50.9 9/6/1988 225 9/15/1988 172.6 10/18/1988 752.1 11/3/1988 28 11/15/1988 32.8 11/30/1988 10.6 12/14/1988 2.6 I want to obtain the start, middle and end dates of cumulative abundance (for those ecologically minded folks the reference is in a Greve, et al., 2005 paper) to determine phenology of a species or functional group. I previously applied this method in Excel and want to replicate the formula in R using cumsum. Easy right? Excel method: 1. calculate cumulative sum 2. divide the cumulative sum for each time point by the annual cumulative sum or maximum value 3. Calculate three different percentile thresholds corresponding to the start (when a population reaches 15% of cumulative annual abundance), the middle (50% of cumulative abundance) or end (85%). This is done by taking an average between the original sampling dates that most closely correspond to the threshold values. The function in excel is INDEX and in R is which.min. R method: # convert data zoo_date <- strptime(mc_pheno$zoo_date,"%m/%d/%Y") # check dates str(zoo_date) # prepare data frame data<-na.omit(data.frame(date=zoo_date,sp=mc_pheno[,2]) # start the series datayear<-data[format(data$date,"%Y")==1984,] # calculate cumulative sum datayear$cumsum<-cumsum(datayear$sp) # calculate cumsum standardised datayear$cumsum<-cumsum(datayear$sp)/max(cumsum(datayear$sp)) # take index of minimum value of cumsum idq15<-which.min(mean(.15-datayear$cumsum)) idq50<-which.min(mean(.50-datayear$cumsum)) idq85<-which.min(mean(.85-datayear$cumsum)) #take corresponding date dateq15<-datayear$date[idq15] dateq50<-datayear$date[idq50] dateq85<-datayear$date[idq85] # output results in day in year form print(format(c(dateq15,dateq50,dateq85),"%j") # results should be similar to this: [1] "2010-03-02 CET" "2010-06-29 CEST" "2010-10-27 CEST" with dates formatted as a day in year corresponding to each threshold The problem lies in the comparison between methods between Excel and R. Using the original method in R, the dates obtained are always earlier than with Excel due to the which.min calculation: idq15<-which.min(abs(.15-datayear$cumsum)) idq50<-which.min(abs(.50-datayear$cumsum)) idq85<-which.min(abs(.85-datayear$cumsum)) If I change the formula to idq15<-which.min(mean(.15-datayear$cumsum)) how do I take an average between the dates so each threshold is consistent between the results I obtain in R and excel? Adding the mean to this line instead of abs resulted in obtaining the exact same output dates. Happy to provide more code but this is long enough. Thanks for your input! ------------------------------------------- Heather A. Wright, PhD candidate Ecology and Evolution of Plankton Stazione Zoologica Anton Dohrn Villa Comunale 80121 - Napoli, Italy Lab: +39 081 583 3201 Fax: +39 081 764 1355
Is it necessary to match the Excel results? If not, you could just use approx or approxfun to do a linear interpolation of the cumulative sums at each of the three points. It would be simpler, faster, and easier to understand. Taking mean(.15-datayear$cumsum) is giving you the mean of all the differences, not just the two smallest absolute values. ---------------------------------------------- David L Carlson Associate Professor of Anthropology Texas A&M University College Station, TX 77843-4352 -----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Heather Anne Wright Sent: Tuesday, February 14, 2012 10:35 AM To: r-help at r-project.org Subject: [R] cumsum function to determine plankton phenology Apologies for the empty email earlier! I have species abundance data sampled at a weekly frequency or sometimes monthly depending on the year. The goal is to identify the dates in an annual cycle in which the cumulative abundance of a species reaches some threshold. Here's an example of the data for 1 species over an annual period: "mc_pheno" is the object created from this data: zoo_date sp 1/8/1988 13.2 1/20/1988 11.6 2/3/1988 8.7 2/17/1988 10.7 3/3/1988 2.5 3/21/1988 2.2 4/12/1988 0.6 5/2/1988 0.3 5/10/1988 0.6 5/25/1988 0.8 6/8/1988 0.8 6/28/1988 3.8 7/5/1988 11.2 7/20/1988 9.3 8/3/1988 31.1 8/17/1988 50.9 9/6/1988 225 9/15/1988 172.6 10/18/1988 752.1 11/3/1988 28 11/15/1988 32.8 11/30/1988 10.6 12/14/1988 2.6 I want to obtain the start, middle and end dates of cumulative abundance (for those ecologically minded folks the reference is in a Greve, et al., 2005 paper) to determine phenology of a species or functional group. I previously applied this method in Excel and want to replicate the formula in R using cumsum. Easy right? Excel method: 1. calculate cumulative sum 2. divide the cumulative sum for each time point by the annual cumulative sum or maximum value 3. Calculate three different percentile thresholds corresponding to the start (when a population reaches 15% of cumulative annual abundance), the middle (50% of cumulative abundance) or end (85%). This is done by taking an average between the original sampling dates that most closely correspond to the threshold values. The function in excel is INDEX and in R is which.min. R method: # convert data zoo_date <- strptime(mc_pheno$zoo_date,"%m/%d/%Y") # check dates str(zoo_date) # prepare data frame data<-na.omit(data.frame(date=zoo_date,sp=mc_pheno[,2]) # start the series datayear<-data[format(data$date,"%Y")==1984,] # calculate cumulative sum datayear$cumsum<-cumsum(datayear$sp) # calculate cumsum standardised datayear$cumsum<-cumsum(datayear$sp)/max(cumsum(datayear$sp)) # take index of minimum value of cumsum idq15<-which.min(mean(.15-datayear$cumsum)) idq50<-which.min(mean(.50-datayear$cumsum)) idq85<-which.min(mean(.85-datayear$cumsum)) #take corresponding date dateq15<-datayear$date[idq15] dateq50<-datayear$date[idq50] dateq85<-datayear$date[idq85] # output results in day in year form print(format(c(dateq15,dateq50,dateq85),"%j") # results should be similar to this: [1] "2010-03-02 CET" "2010-06-29 CEST" "2010-10-27 CEST" with dates formatted as a day in year corresponding to each threshold The problem lies in the comparison between methods between Excel and R. Using the original method in R, the dates obtained are always earlier than with Excel due to the which.min calculation: idq15<-which.min(abs(.15-datayear$cumsum)) idq50<-which.min(abs(.50-datayear$cumsum)) idq85<-which.min(abs(.85-datayear$cumsum)) If I change the formula to idq15<-which.min(mean(.15-datayear$cumsum)) how do I take an average between the dates so each threshold is consistent between the results I obtain in R and excel? Adding the mean to this line instead of abs resulted in obtaining the exact same output dates. Happy to provide more code but this is long enough. Thanks for your input! ------------------------------------------- Heather A. Wright, PhD candidate Ecology and Evolution of Plankton Stazione Zoologica Anton Dohrn Villa Comunale 80121 - Napoli, Italy Lab: +39 081 583 3201 Fax: +39 081 764 1355 ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.