thr3ads.net - R help - [R] Statistical question about logistic regression simulation [Aug 2009]

If this information is useful, please help other people find it:
Share via:

Denis Aydin

2009-Aug-26 14:17 UTC

[R] Statistical question about logistic regression simulation

Hi R help list

I'm simulating logistic regression data with a specified odds ratio 
(beta) and have a problem/unexpected behaviour that occurs.


The datasets includes a lognormal exposure and diseased and healthy 
subjects.

Here is my loop:

ors <- vector()

for(i in 1:200){

# First, I create a vector with a lognormally distributed exposure:

n <- 10000 # number of study subjects
mean <- 6
sd <- 1

expo <- rlnorm(n, mean, sd)

# Then I assign each study subject a probability of disease with a
# specified Odds ratio (or beta coefficient) according to a logistic
# model:

inter <- 0.01 # intercept
or <- log(1.5) # an odds ratio of 1.5 or a beta of ln(1.5)

p <- exp(inter + or * expo)/(1 + exp(inter + or * expo))

# Then I use the probability to decide who is having the disease and who 
# is not:

disease <- rbinom(length(p), 1, p) # 1 = disease, 0 = healthy

# Then I calculate the logistic regression and extract the odds ratio

model <- glm(disease ~ expo, family = binomial)

ors[i] <- exp(summary(model)$coef[2]) # exponentiated beta = OR

}


Now to my questions:

1. I was expecting the mean of the odds ratios over all simulations to 
be close to the specified one (1.5 in this case). This is not the case 
if the mean of the lognormal distribution is, say 6.
If I reduce the mean of the exposure distribution to say 3, the mean of 
the simulated ORs is very close to the specified one. So the simulation 
seems to be quite sensitive to the parameters of the exposure distribution.

2. Is it somehow possible to "stabilize" the simulation so that it is 
not that sensitive to the parameters of the lognormal exposure 
distribution? I can't make up the parameters of the exposure 
distribution, they are estimations from real data.

3. Are there general flaws or errors in my approach?


Thanks a lot for any help on this!

All the best,
Denis

-- 
Denis Aydin
Institute of Social and Preventive Medicine at Swiss Tropical Institute 
Basel
Associated Institute of the University of Basel
Steinengraben 49 ? 4051 Basel ? Switzerland
Phone: +41 (0)61 270 22 04
Fax:   +41 (0)61 270 22 25
denis.aydin at unibas.ch
www.ispm-unibasel.ch

(Ted Harding)

2009-Aug-26 15:03 UTC

head link

[R] Statistical question about logistic regression simulatio

On 26-Aug-09 14:17:40, Denis Aydin wrote:> Hi R help list
> I'm simulating logistic regression data with a specified odds ratio 
> (beta) and have a problem/unexpected behaviour that occurs.
> 
> The datasets includes a lognormal exposure and diseased and healthy 
> subjects.
> 
> Here is my loop:
> 
> ors <- vector()
> for(i in 1:200){
> 
># First, I create a vector with a lognormally distributed exposure:
> n <- 10000 # number of study subjects
> mean <- 6
> sd <- 1
> expo <- rlnorm(n, mean, sd)
> 
># Then I assign each study subject a probability of disease with a
># specified Odds ratio (or beta coefficient) according to a logistic
># model:
> inter <- 0.01 # intercept
> or <- log(1.5) # an odds ratio of 1.5 or a beta of ln(1.5)
> p <- exp(inter + or * expo)/(1 + exp(inter + or * expo))
> 
># Then I use the probability to decide who is having the disease and who
># is not:
> disease <- rbinom(length(p), 1, p) # 1 = disease, 0 = healthy
> 
># Then I calculate the logistic regression and extract the odds ratio
> model <- glm(disease ~ expo, family = binomial)
> ors[i] <- exp(summary(model)$coef[2]) # exponentiated beta = OR
> }
> 
> Now to my questions:
> 
> 1. I was expecting the mean of the odds ratios over all simulations to 
> be close to the specified one (1.5 in this case). This is not the case 
> if the mean of the lognormal distribution is, say 6.
> If I reduce the mean of the exposure distribution to say 3, the mean of
> the simulated ORs is very close to the specified one. So the simulation
> seems to be quite sensitive to the parameters of the exposure
> distribution.
> 
> 2. Is it somehow possible to "stabilize" the simulation so that
it is
> not that sensitive to the parameters of the lognormal exposure 
> distribution? I can't make up the parameters of the exposure 
> distribution, they are estimations from real data.
> 
> 3. Are there general flaws or errors in my approach?
> 
> 
> Thanks a lot for any help on this!
> 
> All the best,
> Denis
> 
> -- 
> Denis Aydin
You need to look at the probabilities 'p' being generated by your code.

Taking first your case "mean <- 6" (and sorting 'expo', and
using
whatever seed my system had current at the time):

  n <- 10000 # number of study subjects
  mean <- 6
  sd <- 1
  expo <- sort(rlnorm(n, mean, sd))
  p <- exp(inter + or * expo)/(1 + exp(inter + or * expo))

  p[1:20]
  #  [1] 0.9763438 0.9918962 0.9924002 0.9965314 0.9980887 0.9984698
  #  [7] 0.9993116 0.9993167 0.9994007 0.9994243 0.9996288 0.9997037
  # [13] 0.9998728 0.9998832 0.9999284 0.9999346 0.9999446 0.9999528
  # [19] 0.9999561 0.9999645

so that almost all of your 'p's are very close to 1.0, which means
that almost all or even all) of your responses will be "1". Indeed,
continuiung from the above:

  disease <- rbinom(length(p), 1, p)
  disease[1:20]
  #  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  sum(disease)
  # [1] NaN
  sum(is.nan(disease))
  # [1] 710

What has happened here is that the higher values of 'expo' are so
large (in the 1000s) that the calculation of 'p' gives NA, because
the value of exp(inter + or * expo) is +Inf, so the calculation of
'p' is in terms of (+Inf)/(+Inf), which is NA.

Now compare with what happens when "mean <- 3":

  mean <- 3
  sd <- 1
  expo <- sort(rlnorm(n, mean, sd))
  p <- exp(inter + or * expo)/(1 + exp(inter + or * expo))
  p[1:20]
  #  [1] 0.5514112 0.5543155 0.5702318 0.5830025 0.5885994 0.5889078
  #  [7] 0.5908860 0.6004657 0.6029123 0.6042805 0.6048688 0.6122290
  # [13] 0.6123407 0.6135233 0.6137499 0.6139299 0.6153900 0.6181017
  # [19] 0.6184093 0.6203757
  sum(is.na(p))
  #  [1] 0
  max(expo)
  #  [1] 728.0519

So now no NAs (max(expo), though large, is now not large enough to
make the calculation of 'p' yield NA).

These smaller probabilities are now well away from 1.0, so a good mix
of "0" and "1" responses can be expected, although a good
number of
the 'p's will still be very close to 1 or will be set equal to 1.

  disease <- rbinom(length(p), 1, p)
  disease[1:20]
  #  [1] 0 1 0 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 0 1

(as expected), and

  sum(disease)
  # [1] 9740

As well as the problem with p = NA ---> disease = NaN, when you
have all the probabiltiies close to 1 and (as above) get all the
'disease' outcomes = 1, the resulting attempt to fit the glm will
yield nonsense.

In summary: do not use silly paramater values for the model you
are simulating. It will almost always not work (for reasons
illustrated above), and even if it appreas to work the result
will be highly unreliable. If in doubt, have a look at what you
are getting, along the line, as illustrated above!

The above reasons almost certainly underlie your finding that the
mean of simulated OR estimates is markedly different from the value
which you set when you run the case "mean <- 6", and the much
better finding when you run the case "mean <- 3".

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 26-Aug-09                                       Time: 16:03:49
------------------------------ XFMail ------------------------------

Ravi Varadhan

2009-Aug-26 15:07 UTC

head link

[R] Statistical question about logistic regression simulation

Your exposure variable has very large values, so all your probabilities are
1. You also get a bunch of NaN's because the `expit' (inverse logit)
function to calculate the probabilities cannot be evaluated. You need to use
values of exposure that will yield some 0's and 1's so that the binomial
model can be estimated.

Ravi.

----------------------------------------------------------------------------
-------

Ravi Varadhan, Ph.D.

Assistant Professor, The Center on Aging and Health

Division of Geriatric Medicine and Gerontology 

Johns Hopkins University

Ph: (410) 502-2619

Fax: (410) 614-9625

Email: rvaradhan at jhmi.edu

Webpage:
http://www.jhsph.edu/agingandhealth/People/Faculty_personal_pages/Varadhan.h
tml

 

----------------------------------------------------------------------------
--------


-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
On
Behalf Of Denis Aydin
Sent: Wednesday, August 26, 2009 10:18 AM
To: r-help at r-project.org
Subject: [R] Statistical question about logistic regression simulation

Hi R help list

I'm simulating logistic regression data with a specified odds ratio 
(beta) and have a problem/unexpected behaviour that occurs.


The datasets includes a lognormal exposure and diseased and healthy 
subjects.

Here is my loop:

ors <- vector()

for(i in 1:200){

# First, I create a vector with a lognormally distributed exposure:

n <- 10000 # number of study subjects
mean <- 6
sd <- 1

expo <- rlnorm(n, mean, sd)

# Then I assign each study subject a probability of disease with a
# specified Odds ratio (or beta coefficient) according to a logistic
# model:

inter <- 0.01 # intercept
or <- log(1.5) # an odds ratio of 1.5 or a beta of ln(1.5)

p <- exp(inter + or * expo)/(1 + exp(inter + or * expo))

# Then I use the probability to decide who is having the disease and who 
# is not:

disease <- rbinom(length(p), 1, p) # 1 = disease, 0 = healthy

# Then I calculate the logistic regression and extract the odds ratio

model <- glm(disease ~ expo, family = binomial)

ors[i] <- exp(summary(model)$coef[2]) # exponentiated beta = OR

}


Now to my questions:

1. I was expecting the mean of the odds ratios over all simulations to 
be close to the specified one (1.5 in this case). This is not the case 
if the mean of the lognormal distribution is, say 6.
If I reduce the mean of the exposure distribution to say 3, the mean of 
the simulated ORs is very close to the specified one. So the simulation 
seems to be quite sensitive to the parameters of the exposure distribution.

2. Is it somehow possible to "stabilize" the simulation so that it is 
not that sensitive to the parameters of the lognormal exposure 
distribution? I can't make up the parameters of the exposure 
distribution, they are estimations from real data.

3. Are there general flaws or errors in my approach?


Thanks a lot for any help on this!

All the best,
Denis

-- 
Denis Aydin
Institute of Social and Preventive Medicine at Swiss Tropical Institute 
Basel
Associated Institute of the University of Basel
Steinengraben 49 - 4051 Basel - Switzerland
Phone: +41 (0)61 270 22 04
Fax:   +41 (0)61 270 22 25
denis.aydin at unibas.ch
www.ispm-unibasel.ch

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Seemingly Similar Threads

Search for more maybe matching threads

R help - Aug 2009 - Statistical question about logistic regression simulation

[R] Statistical question about logistic regression simulation

[R] Statistical question about logistic regression simulatio

[R] Statistical question about logistic regression simulation

Seemingly Similar Threads