thr3ads.net - R help - [R] glm and percentage data with many zero values [Jan 2005]

Christian Kamenik

2005-Jan-20 16:02 UTC

[R] glm and percentage data with many zero values

Dear all,

I am interested in correctly testing effects of continuous environmental 
variables and ordered factors on bacterial abundance. Bacterial 
abundance is derived from counts and expressed as percentage. My problem 
is that the abundance data contain many zero values:
Bacteria <- 
c(2.23,0,0.03,0.71,2.34,0,0.2,0.2,0.02,2.07,0.85,0.12,0,0.59,0.02,2.3,0.29,0.39,1.32,0.07,0.52,1.2,0,0.85,1.09,0,0.5,1.4,0.08,0.11,0.05,0.17,0.31,0,0.12,0,0.99,1.11,1.78,0,0,0,2.33,0.07,0.66,1.03,0.15,0.15,0.59,0,0.03,0.16,2.86,0.2,1.66,0.12,0.09,0.01,0,0.82,0.31,0.2,0.48,0.15)

First I tried transforming the data (e.g., logit) but because of the 
zeros I was not satisfied. Next I converted the percentages into integer 
values by round(Bacteria*10) or ceiling(Bacteria*10) and calculated a 
glm with a Poisson error structure; however, I am not very happy with 
this approach because it changes the original percentage data 
substantially (e.g., 0.03 becomes either 0 or 1). The same is true for 
converting the percentages into factors and calculating a multinomial or 
proportional-odds model (anyway, I do not know if this would be a 
meaningful approach).
I was searching the web and the best answer I could get was 
http://www.biostat.wustl.edu/archives/html/s-news/1998-12/msg00010.html 
in which several persons suggested quasi-likelihood. Would it be 
reasonable to use a glm with quasipoisson? If yes, how can I find the 
appropriate variance function? Any other suggestions?

Many thanks in advance, Christian


===============================

Christian Kamenik
Institute of Plant Sciences
University of Bern
Altenbergrain 21
3013 Bern
Switzerland
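
The concern about rounding can be checked directly on the posted data. The following is a minimal base-R sketch, not part of the original post; the names zero_share, rounded, ceiled and extra_zeros are illustrative only.

```r
# Percentage data as posted in the question
Bacteria <- c(2.23, 0, 0.03, 0.71, 2.34, 0, 0.2, 0.2, 0.02, 2.07, 0.85, 0.12,
              0, 0.59, 0.02, 2.3, 0.29, 0.39, 1.32, 0.07, 0.52, 1.2, 0, 0.85,
              1.09, 0, 0.5, 1.4, 0.08, 0.11, 0.05, 0.17, 0.31, 0, 0.12, 0,
              0.99, 1.11, 1.78, 0, 0, 0, 2.33, 0.07, 0.66, 1.03, 0.15, 0.15,
              0.59, 0, 0.03, 0.16, 2.86, 0.2, 1.66, 0.12, 0.09, 0.01, 0, 0.82,
              0.31, 0.2, 0.48, 0.15)

# Share of exact zeros in the original data
zero_share <- mean(Bacteria == 0)

# The two integer conversions mentioned in the post
rounded <- round(Bacteria * 10)
ceiled  <- ceiling(Bacteria * 10)

# Small positive values are where the two conversions disagree most
small <- Bacteria > 0 & Bacteria < 0.05
print(data.frame(original = Bacteria[small],
                 rounded  = rounded[small],
                 ceiling  = ceiled[small]))

# Number of additional zeros created by round()
extra_zeros <- sum(rounded == 0) - sum(Bacteria == 0)
print(c(zero_share = zero_share, extra_zeros = extra_zeros))
```

round() turns small positive percentages into zeros, while ceiling() pushes every positive value up to at least 1, so either conversion changes the zero/non-zero structure of the data.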

Gregor GORJANC

2005-Jan-21 12:05 UTC

[R] glm and percentage data with many zero values

A hint.

You might try a ZIP, i.e. zero-inflated Poisson, model. I have not used it, 
but I have such data to work on. So if anyone can point out how to 
do this in R, please do. There is also a class of ZINB models, or something like 
that, for zero-inflated negative binomial data.

I just went on the web and found a book by Simonoff, "Analyzing 
Categorical Data", which has some examples for ZIP et al. Look at the 
examples for Sections 4.5 and 5.4:

http://www.stern.nyu.edu/~jsimonof/AnalCatData/Splus/analcatdata.s
http://www.stern.nyu.edu/~jsimonof/AnalCatData/Splus/functions.s

-- 
Lep pozdrav / With regards,
     Gregor GORJANC

---------------------------------------------------------------
University of Ljubljana
Biotechnical Faculty       URI: http://www.bfro.uni-lj.si
Zootechnical Department    mail: gregor.gorjanc <at> bfro.uni-lj.si
Groblje 3                  tel: +386 (0)1 72 17 861
SI-1230 Domzale            fax: +386 (0)1 72 17 888
Slovenia

BXC (Bendix Carstensen)

2005-Jan-21 13:46 UTC

[R] glm and percentage data with many zero values

The ZIP model can be fitted with Jim Lindsey's function fmr 
from his gnlm library, see:

http://popgen0146uns50.unimaas.nl/~jlindsey/rcode.html

Bendix Carstensen
----------------------
Bendix Carstensen
Senior Statistician
Steno Diabetes Center
Niels Steensens Vej 2
DK-2820 Gentofte
Denmark
tel: +45 44 43 87 38
mob: +45 30 75 87 38
fax: +45 44 43 07 06
bxc at steno.dk
www.biostat.ku.dk/~bxc
----------------------
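
For readers who want to see what a ZIP model involves, here is a minimal base-R sketch that maximizes the zero-inflated Poisson log-likelihood with optim(). It does not use fmr or gnlm; the simulated data and the names zip_negloglik, pi0 and est are assumptions made for illustration. Note that ZIP is a model for counts, so it would apply to the underlying counts rather than to the percentages.

```r
set.seed(1)

# Simulated counts (hypothetical): a share pi0 of observations are
# structural zeros, the rest are Poisson with a log-linear mean in x
n      <- 500
x      <- runif(n)
pi0    <- 0.3
lambda <- exp(0.5 + 1.0 * x)
structural_zero <- rbinom(n, 1, pi0) == 1
y <- ifelse(structural_zero, 0L, rpois(n, lambda))

# Negative log-likelihood of the ZIP model
# par[1]: logit of the zero-inflation probability
# par[2], par[3]: intercept and slope of log(lambda)
zip_negloglik <- function(par, y, x) {
  p_zero <- plogis(par[1])
  lam    <- exp(par[2] + par[3] * x)
  ll_zero <- log(p_zero + (1 - p_zero) * exp(-lam))       # y == 0
  ll_pos  <- log1p(-p_zero) + dpois(y, lam, log = TRUE)   # y > 0
  -sum(ifelse(y == 0, ll_zero, ll_pos))
}

fit <- optim(c(0, 0, 0), zip_negloglik, y = y, x = x, method = "BFGS")
est <- c(pi0 = plogis(fit$par[1]), intercept = fit$par[2], slope = fit$par[3])
print(est)
```

The zero-inflation probability is fitted on the logit scale so that the optimizer can work without constraints.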


______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html

Christian Kamenik

2005-Jan-29 12:22 UTC

[R] glm and percentage data with many zero values

Dear R users,

I would like to summarize the answers I got to the following question:

I am interested in correctly testing effects of continuous 
environmental variables and ordered factors on bacterial abundance. 
Bacterial abundance is derived from counts and expressed as percentage. 
My problem is that the abundance data contain many zero values:
Bacteria <- 
c(2.23,0,0.03,0.71,2.34,0,0.2,0.2,0.02,2.07,0.85,0.12,0,0.59,0.02,2.3,
0.29,0.39,1.32,0.07,0.52,1.2,0,0.85,1.09,0,0.5,1.4,0.08,0.11,0.05,0.17,
0.31,0,0.12,0,0.99,1.11,1.78,0,0,0,2.33,0.07,0.66,1.03,0.15,0.15,0.59,0,
0.03,0.16,2.86,0.2,1.66,0.12,0.09,0.01,0,0.82,0.31,0.2,0.48,0.15)

First I tried transforming the data (e.g., logit) but because of the 
zeros I was not satisfied. Next I converted the percentages into 
integer values by round(Bacteria*10) or ceiling(Bacteria*10) and 
calculated a glm with a Poisson error structure; however, I am not very 
happy with this approach because it changes the original percentage 
data substantially (e.g., 0.03 becomes either 0 or 1). The same is true 
for converting the percentages into factors and calculating a 
multinomial or proportional-odds model (anyway, I do not know if this 
would be a meaningful approach).
I was searching the web and the best answer I could get was 
http://www.biostat.wustl.edu/archives/html/s-news/1998-12/msg00010.html 
in which several persons suggested quasi-likelihood. Would it be 
reasonable to use a glm with quasipoisson? If yes, how can I find the 
appropriate variance function? Any other suggestions?

> If you know the totals from which these "percentages" were derived,
> then transform your Bacteria back to the original observations and fit a 
> quasi-Poisson model with log(total) as an offset. That is:
>
> BCount <- round(tot * Bacteria)
> glm(BCount ~ x1 + x2 + offset(log(tot)), family = quasipoisson)
>
> cheers, jari oksanen

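
The offset approach above can be tried on simulated data. This is a sketch only: tot, x1, x2 and pct are hypothetical stand-ins for the real totals, covariates and percentages, and the division by 100 assumes the abundances are true percentages rather than proportions.

```r
set.seed(42)

# Hypothetical data: totals counted per sample and two covariates
n   <- 64
tot <- sample(500:5000, n, replace = TRUE)
x1  <- rnorm(n)
x2  <- runif(n)

# Simulated percentages, low enough that some samples contain no bacteria
pct <- 100 * rpois(n, tot * exp(-6 + 0.5 * x1)) / tot

# Back-transform percentages to the original counts
BCount <- round(tot * pct / 100)

# Quasi-Poisson GLM with log(total) as offset, so that the linear
# predictor models the rate (count per unit total)
fit <- glm(BCount ~ x1 + x2 + offset(log(tot)), family = quasipoisson)

# Estimated dispersion parameter
print(summary(fit)$dispersion)
print(coef(fit))
```

With family = quasipoisson the variance function stays V(mu) = mu, and only the dispersion parameter is estimated from the data, so no separate variance function has to be chosen.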
> I have developed an R library specifically for dealing with positive
> continuous data with exact zeros.  For example, rainfall:  No rain
> means exactly zero is recorded, but when rain falls, a continuous
> amount is recorded (after suitable rounding).
>
> This library--available on CRAN--is called  tweedie.  The distributions
> used are Tweedie models, which belong to the EDM family and so
> can be used in generalized linear models.  The Tweedie models have
> a variance function  V(mu) = mu^p, for p not in the range (0, 1).
> For various values of p, we have:
>
>  Value of p     Distribution
>  p < 0          Defined over the whole real line
>  p = 0          Normal distribution
>  0 < p < 1      No distributions exist
>  p = 1          Poisson distribution (with phi = 1)
>  1 < p < 2      Continuous over positive Y, with positive mass at Y = 0
>  p = 2          Gamma distribution
>  p > 2          Continuous for positive Y
>  p = 3          Inverse Gaussian distribution
>
> Of particular interest are the distributions such that 1 < p < 2, 
> which can be seen as a Poisson sum of gamma random variables. They are 
> continuous for Y>0 with a positive probability that Y=0 exactly. For 
> this reason, the Tweedie densities with 1 < p < 2 have been called the
> compound Poisson, compound gamma and the Poisson-gamma distribution.
>
> In your case, percentages with exact zeros may not exactly fall into
> this category because of the upper limit of 100%.  But provided there are
> very few values near 100%, the Tweedie models might be worth a try.
> (The data above seem to indicate few values near 100%.)
>
> Get the  tweedie  package from CRAN, or from
> http://www.sci.usq.edu.au/staff/dunn/twhtml/home.html
>
> You will also need the  statmod  package, also available on CRAN.
>
> All the best.
>
> P.
>
> -- 
> Dr Peter Dunn          (USQ CRICOS No. 00244B)
>   Web:    http://www.sci.usq.edu.au/staff/dunn
>   Email:  dunn @ usq.edu.au
> Opinions expressed are mine, not those of USQ.  Obviously...
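
The "Poisson sum of gamma random variables" described above can be simulated in base R, and a Tweedie GLM can then be fitted with the tweedie() family from the statmod package. This is a sketch under stated assumptions: the data are simulated, the variance power p = 1.5 is picked arbitrarily (in practice it is usually estimated, for example by profile likelihood with tweedie.profile() from the tweedie package), and the fit is skipped if statmod is not installed.

```r
set.seed(1)

# Simulate compound Poisson-gamma data: N ~ Poisson events per observation,
# each event contributes a gamma-distributed amount; N = 0 gives exactly 0
n <- 200
x <- runif(n)
N <- rpois(n, exp(-0.5 + 0.8 * x))
y <- numeric(n)
pos <- N > 0
# A sum of N gamma(shape = 2, scale = 0.25) variables is gamma(shape = 2 * N)
y[pos] <- rgamma(sum(pos), shape = 2 * N[pos], scale = 0.25)

# Proportion of exact zeros in the simulated response
print(mean(y == 0))

# Tweedie GLM with log link (link.power = 0) and assumed var.power = 1.5
if (requireNamespace("statmod", quietly = TRUE)) {
  fit <- glm(y ~ x, family = statmod::tweedie(var.power = 1.5, link.power = 0))
  print(coef(fit))
}
```

The response is continuous for positive values but contains exact zeros, which is the case 1 < p < 2 in the table above.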

> You might try a ZIP, i.e. zero-inflated Poisson, model. I have not 
> used it, but I have such data to work on. So if anyone can point 
> out how to do this in R, please do. There is also a class of ZINB 
> models, or something like that, for zero-inflated negative binomial data.
>
> I just went on the web and found a book by Simonoff, "Analyzing 
> Categorical Data", which has some examples for ZIP et al. 
> Look at the examples for Sections 4.5 and 5.4:
>
> http://www.stern.nyu.edu/~jsimonof/AnalCatData/Splus/analcatdata.s
> http://www.stern.nyu.edu/~jsimonof/AnalCatData/Splus/functions.s
>
> -- 
> Lep pozdrav / With regards,
>     Gregor GORJANC 
>The ZIP model can be fitted with Jim Lindsey's function fmr 
>from his gnlm library, see:
>
>http://popgen0146uns50.unimaas.nl/~jlindsey/rcode.html
>
>Bendix Carstensen

It turned out that the percentage data were calculated from 
concentrations, resulting in positive continuous data with exact zeros. 
The Tweedie models did a fine job.

Many thanks, Christian Kamenik

Tony Plate

2005-Mar-08 22:18 UTC

[R] glm and percentage data with many zero values

A very quick and easy thing to do with count data is to add 1 (or 0.5) to 
all your counts (I'm sure you can work backwards from abundance data to 
counts and then forward again).  This gets rid of zero problems.  In some 
cases this approximates a Bayesian approach with a low-information prior 
(though I'm not at all sure whether this is the case with a glm with 
Poisson errors).

-- Tony Plate
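
A small illustration of the add-a-constant idea, using made-up counts (the vector counts and the name shifted are hypothetical):

```r
# Hypothetical counts containing zeros
counts <- c(0, 3, 12, 0, 7, 1)

# A plain log transform breaks down at zero
print(log(counts))

# Adding 0.5 before taking logs keeps every value finite
shifted <- log(counts + 0.5)
print(shifted)
```

The choice of constant (0.5 or 1) is arbitrary and can influence results when many counts are small, so it is worth checking how sensitive the conclusions are to it.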


Kenneth Cabrera

2005-Mar-13 13:01 UTC

[R] Big databases

Hi R users:

I am using R 2.0.1 on a Linux (Scientific Linux CERN)
platform and I have the following problem:
It takes too much time (more than 6 hours, at which point I
stopped the process) to read a
270 MB database. I am using the read.table() function.

Is there any workaround to read a big database faster?

Thank you for your help.

Kenneth
