Hi,

I am new to R and have not had much exposure to statistics. I have a dataset of percentage cover (so 0-100) for certain species in 3 different shore zones (high, mid and low). The data were recorded for different protected areas as well (17 of them), and my number of observations is large (3358). I'm interested in the difference in percentage cover of species between shore zones as well as between protected areas.

The problem is that my data contain loads of zeros, and I haven't yet learned how to handle such data so as to perform robust tests on them. I previously used Kruskal-Wallis ANOVAs to look at cover differences between shore zones, but I am worried that this is inappropriate because of my large sample size and because my variances are not equal.

I've read a bit about fitting a zero-inflated negative binomial regression to my data, but I'm not sure that will work because it is for count data.

I would very much appreciate it if someone could point me in the correct direction with respect to a transformation that may help, or an appropriate model to fit or test to use. I've searched quite a bit but I'm out of my depth.

PS sorry if I sound like a halfwit

Thanks a lot

Ben
Don't worry, there are plenty of halfwits around here. However, this is about stats theory, and not really about R, so you're better off trying CrossValidated, aka stats.stackexchange.com

-pd

> On 24 Jan 2015, at 14:26, Ben Brooker <awe.ben at googlemail.com> wrote:
>
> PS sorry if I sound like a halfwit

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
On Jan 24, 2015, at 8:37 AM, peter dalgaard wrote:

> Don't worry, there are plenty of halfwits around here. However, this is about stats theory, and not really about R, so you're better off trying CrossValidated, aka stats.stackexchange.com

This is useful and correct advice, but if you want to see an excellent description of some of the R tools for dealing with zero-inflation, the Zeileis, Kleiber & Jackman article on the matter, http://www.jstatsoft.org/v27/i08/paper, was very helpful in advancing my understanding of the subject area.

>> On 24 Jan 2015, at 14:26, Ben Brooker <awe.ben at googlemail.com> wrote:
>>
>> I previously used Kruskal-Wallis ANOVAs to look at cover differences
>> in shore zone but I am worried that it is inappropriate because of the
>> large sample size that I have and because my variances are not equal.

I wonder if the terms Kruskal-Wallis and ANOVA should be adjacent. I do not remember that variances are part of the inference with KW-tests. You might ask that in your question to the very helpful group on CrossValidated.com

-- 
David.
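A small illustration of David's point about Kruskal-Wallis, offered as a sketch rather than as anything posted in the thread: the test statistic is computed from the ranks of the observations, so any strictly increasing transformation of the response leaves it unchanged. The data below are simulated and the variable names (`zone`, `cover`) are hypothetical; only `kruskal.test()` from base R's stats package is used.

```r
# Simulated data only; the variable names are hypothetical.
set.seed(42)
n <- 300
zone <- factor(sample(c("high", "mid", "low"), n, replace = TRUE))

# Percent cover with a large share of exact zeros
cover <- ifelse(runif(n) < 0.5, 0, round(100 * rbeta(n, 1, 3)))

# Kruskal-Wallis rank sum test on the raw and on the square-root scale
kw_raw  <- kruskal.test(cover ~ zone)
kw_sqrt <- kruskal.test(sqrt(cover) ~ zone)

print(kw_raw)

# sqrt() is strictly increasing on non-negative values, so ranks (and ties)
# are preserved and the two statistics agree
print(all.equal(unname(kw_raw$statistic), unname(kw_sqrt$statistic)))
```

This only shows that the statistic is rank-based. Whether the test answers the question of interest for data with many tied zeros is a separate matter, which is the kind of thing to raise on CrossValidated.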
Ben:

You have a statistical problem with a bounded response variable (0 to 100%, or 0.0 to 1.0) and thus might make use of a logistic quantile regression model (see Bottai et al. 2010. Logistic quantile regression for bounded outcomes. Statistics in Medicine 29: 309-317). This requires a logit transformation, log((y - ymin)/(ymax - y)), of your percent cover response variable and then estimation by linear quantile regression with the rq() function in the quantreg package. There are details in Bottai et al. that you will need to understand about back-transforming your estimates, interpretations, etc. But it is fairly easy to use.

Quantile regression, by modeling the conditional cumulative distribution function, readily accommodates the heterogeneous variance patterns that typically occur with bounded outcomes. Depending on the pattern and mass of zero values, you may still have lower regions of the cumulative distribution function about which you are unable to make inferential statements.

Brian

Brian S. Cade, PhD
U. S. Geological Survey
Fort Collins Science Center
2150 Centre Ave., Bldg. C
Fort Collins, CO 80526-8818
email: cadeb at usgs.gov <brian_cade at usgs.gov>
tel: 970 226-9326
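The logistic quantile regression recipe Brian describes might look roughly like the sketch below. It is not code from the thread: the data are simulated, the column names (`zone`, `area`, `cover`) and helper functions are hypothetical, and placing ymin and ymax a small `eps` outside 0 and 100 (so that exact zeros get a finite logit) is an assumption that should be checked against Bottai et al. The model call is skipped if quantreg is not installed.

```r
# Simulated data only; column names and helper names are hypothetical.
set.seed(1)
n <- 300
dat <- data.frame(
  zone = factor(sample(c("high", "mid", "low"), n, replace = TRUE)),
  area = factor(sample(paste0("A", 1:5), n, replace = TRUE))
)
# Percent cover with many exact zeros
dat$cover <- ifelse(runif(n) < 0.4, 0, round(100 * rbeta(n, 1, 3)))

# Assumption: bounds sit slightly outside [0, 100] so 0 and 100 stay finite
eps  <- 0.001
ymin <- 0 - eps
ymax <- 100 + eps

# Logit transform log((y - ymin)/(ymax - y)) and its inverse
logit_cover     <- function(y) log((y - ymin) / (ymax - y))
inv_logit_cover <- function(h) (exp(h) * ymax + ymin) / (1 + exp(h))

dat$h <- logit_cover(dat$cover)

if (requireNamespace("quantreg", quietly = TRUE)) {
  # Linear quantile regression on the logit scale for a few upper quantiles
  fit <- quantreg::rq(h ~ zone + area, tau = c(0.5, 0.75, 0.9), data = dat)

  # Predict for each zone in one area, then back-transform to percent cover
  newd <- expand.grid(zone = levels(dat$zone), area = "A1")
  newd$area <- factor(newd$area, levels = levels(dat$area))
  print(inv_logit_cover(predict(fit, newdata = newd)))
}
```

Back-transforming the fitted quantiles is legitimate because quantiles are preserved under monotone transformations. With this many zeros, rq() may warn about non-unique solutions at the lower quantiles, which matches Brian's caveat that parts of the lower distribution may not support inference.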