Denis Aydin
2009-Aug-26 14:17 UTC
[R] Statistical question about logistic regression simulation
Hi R help list I'm simulating logistic regression data with a specified odds ratio (beta) and have a problem/unexpected behaviour that occurs. The datasets includes a lognormal exposure and diseased and healthy subjects. Here is my loop: ors <- vector() for(i in 1:200){ # First, I create a vector with a lognormally distributed exposure: n <- 10000 # number of study subjects mean <- 6 sd <- 1 expo <- rlnorm(n, mean, sd) # Then I assign each study subject a probability of disease with a # specified Odds ratio (or beta coefficient) according to a logistic # model: inter <- 0.01 # intercept or <- log(1.5) # an odds ratio of 1.5 or a beta of ln(1.5) p <- exp(inter + or * expo)/(1 + exp(inter + or * expo)) # Then I use the probability to decide who is having the disease and who # is not: disease <- rbinom(length(p), 1, p) # 1 = disease, 0 = healthy # Then I calculate the logistic regression and extract the odds ratio model <- glm(disease ~ expo, family = binomial) ors[i] <- exp(summary(model)$coef[2]) # exponentiated beta = OR } Now to my questions: 1. I was expecting the mean of the odds ratios over all simulations to be close to the specified one (1.5 in this case). This is not the case if the mean of the lognormal distribution is, say 6. If I reduce the mean of the exposure distribution to say 3, the mean of the simulated ORs is very close to the specified one. So the simulation seems to be quite sensitive to the parameters of the exposure distribution. 2. Is it somehow possible to "stabilize" the simulation so that it is not that sensitive to the parameters of the lognormal exposure distribution? I can't make up the parameters of the exposure distribution, they are estimations from real data. 3. Are there general flaws or errors in my approach? Thanks a lot for any help on this! All the best, Denis -- Denis Aydin Institute of Social and Preventive Medicine at Swiss Tropical Institute Basel Associated Institute of the University of Basel Steinengraben 49 ? 4051 Basel ? Switzerland Phone: +41 (0)61 270 22 04 Fax: +41 (0)61 270 22 25 denis.aydin at unibas.ch www.ispm-unibasel.ch
(Ted Harding)
2009-Aug-26 15:03 UTC
[R] Statistical question about logistic regression simulatio
On 26-Aug-09 14:17:40, Denis Aydin wrote:> Hi R help list > I'm simulating logistic regression data with a specified odds ratio > (beta) and have a problem/unexpected behaviour that occurs. > > The datasets includes a lognormal exposure and diseased and healthy > subjects. > > Here is my loop: > > ors <- vector() > for(i in 1:200){ > ># First, I create a vector with a lognormally distributed exposure: > n <- 10000 # number of study subjects > mean <- 6 > sd <- 1 > expo <- rlnorm(n, mean, sd) > ># Then I assign each study subject a probability of disease with a ># specified Odds ratio (or beta coefficient) according to a logistic ># model: > inter <- 0.01 # intercept > or <- log(1.5) # an odds ratio of 1.5 or a beta of ln(1.5) > p <- exp(inter + or * expo)/(1 + exp(inter + or * expo)) > ># Then I use the probability to decide who is having the disease and who ># is not: > disease <- rbinom(length(p), 1, p) # 1 = disease, 0 = healthy > ># Then I calculate the logistic regression and extract the odds ratio > model <- glm(disease ~ expo, family = binomial) > ors[i] <- exp(summary(model)$coef[2]) # exponentiated beta = OR > } > > Now to my questions: > > 1. I was expecting the mean of the odds ratios over all simulations to > be close to the specified one (1.5 in this case). This is not the case > if the mean of the lognormal distribution is, say 6. > If I reduce the mean of the exposure distribution to say 3, the mean of > the simulated ORs is very close to the specified one. So the simulation > seems to be quite sensitive to the parameters of the exposure > distribution. > > 2. Is it somehow possible to "stabilize" the simulation so that it is > not that sensitive to the parameters of the lognormal exposure > distribution? I can't make up the parameters of the exposure > distribution, they are estimations from real data. > > 3. Are there general flaws or errors in my approach? > > > Thanks a lot for any help on this! > > All the best, > Denis > > -- > Denis AydinYou need to look at the probabilities 'p' being generated by your code. Taking first your case "mean <- 6" (and sorting 'expo', and using whatever seed my system had current at the time): n <- 10000 # number of study subjects mean <- 6 sd <- 1 expo <- sort(rlnorm(n, mean, sd)) p <- exp(inter + or * expo)/(1 + exp(inter + or * expo)) p[1:20] # [1] 0.9763438 0.9918962 0.9924002 0.9965314 0.9980887 0.9984698 # [7] 0.9993116 0.9993167 0.9994007 0.9994243 0.9996288 0.9997037 # [13] 0.9998728 0.9998832 0.9999284 0.9999346 0.9999446 0.9999528 # [19] 0.9999561 0.9999645 so that almost all of your 'p's are very close to 1.0, which means that almost all or even all) of your responses will be "1". Indeed, continuiung from the above: disease <- rbinom(length(p), 1, p) disease[1:20] # [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 sum(disease) # [1] NaN sum(is.nan(disease)) # [1] 710 What has happened here is that the higher values of 'expo' are so large (in the 1000s) that the calculation of 'p' gives NA, because the value of exp(inter + or * expo) is +Inf, so the calculation of 'p' is in terms of (+Inf)/(+Inf), which is NA. Now compare with what happens when "mean <- 3": mean <- 3 sd <- 1 expo <- sort(rlnorm(n, mean, sd)) p <- exp(inter + or * expo)/(1 + exp(inter + or * expo)) p[1:20] # [1] 0.5514112 0.5543155 0.5702318 0.5830025 0.5885994 0.5889078 # [7] 0.5908860 0.6004657 0.6029123 0.6042805 0.6048688 0.6122290 # [13] 0.6123407 0.6135233 0.6137499 0.6139299 0.6153900 0.6181017 # [19] 0.6184093 0.6203757 sum(is.na(p)) # [1] 0 max(expo) # [1] 728.0519 So now no NAs (max(expo), though large, is now not large enough to make the calculation of 'p' yield NA). These smaller probabilities are now well away from 1.0, so a good mix of "0" and "1" responses can be expected, although a good number of the 'p's will still be very close to 1 or will be set equal to 1. disease <- rbinom(length(p), 1, p) disease[1:20] # [1] 0 1 0 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 0 1 (as expected), and sum(disease) # [1] 9740 As well as the problem with p = NA ---> disease = NaN, when you have all the probabiltiies close to 1 and (as above) get all the 'disease' outcomes = 1, the resulting attempt to fit the glm will yield nonsense. In summary: do not use silly paramater values for the model you are simulating. It will almost always not work (for reasons illustrated above), and even if it appreas to work the result will be highly unreliable. If in doubt, have a look at what you are getting, along the line, as illustrated above! The above reasons almost certainly underlie your finding that the mean of simulated OR estimates is markedly different from the value which you set when you run the case "mean <- 6", and the much better finding when you run the case "mean <- 3". Hoping this helps, Ted. -------------------------------------------------------------------- E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 26-Aug-09 Time: 16:03:49 ------------------------------ XFMail ------------------------------
Ravi Varadhan
2009-Aug-26 15:07 UTC
[R] Statistical question about logistic regression simulation
Your exposure variable has very large values, so all your probabilities are 1. You also get a bunch of NaN's because the `expit' (inverse logit) function to calculate the probabilities cannot be evaluated. You need to use values of exposure that will yield some 0's and 1's so that the binomial model can be estimated. Ravi. ---------------------------------------------------------------------------- ------- Ravi Varadhan, Ph.D. Assistant Professor, The Center on Aging and Health Division of Geriatric Medicine and Gerontology Johns Hopkins University Ph: (410) 502-2619 Fax: (410) 614-9625 Email: rvaradhan at jhmi.edu Webpage: http://www.jhsph.edu/agingandhealth/People/Faculty_personal_pages/Varadhan.h tml ---------------------------------------------------------------------------- -------- -----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Denis Aydin Sent: Wednesday, August 26, 2009 10:18 AM To: r-help at r-project.org Subject: [R] Statistical question about logistic regression simulation Hi R help list I'm simulating logistic regression data with a specified odds ratio (beta) and have a problem/unexpected behaviour that occurs. The datasets includes a lognormal exposure and diseased and healthy subjects. Here is my loop: ors <- vector() for(i in 1:200){ # First, I create a vector with a lognormally distributed exposure: n <- 10000 # number of study subjects mean <- 6 sd <- 1 expo <- rlnorm(n, mean, sd) # Then I assign each study subject a probability of disease with a # specified Odds ratio (or beta coefficient) according to a logistic # model: inter <- 0.01 # intercept or <- log(1.5) # an odds ratio of 1.5 or a beta of ln(1.5) p <- exp(inter + or * expo)/(1 + exp(inter + or * expo)) # Then I use the probability to decide who is having the disease and who # is not: disease <- rbinom(length(p), 1, p) # 1 = disease, 0 = healthy # Then I calculate the logistic regression and extract the odds ratio model <- glm(disease ~ expo, family = binomial) ors[i] <- exp(summary(model)$coef[2]) # exponentiated beta = OR } Now to my questions: 1. I was expecting the mean of the odds ratios over all simulations to be close to the specified one (1.5 in this case). This is not the case if the mean of the lognormal distribution is, say 6. If I reduce the mean of the exposure distribution to say 3, the mean of the simulated ORs is very close to the specified one. So the simulation seems to be quite sensitive to the parameters of the exposure distribution. 2. Is it somehow possible to "stabilize" the simulation so that it is not that sensitive to the parameters of the lognormal exposure distribution? I can't make up the parameters of the exposure distribution, they are estimations from real data. 3. Are there general flaws or errors in my approach? Thanks a lot for any help on this! All the best, Denis -- Denis Aydin Institute of Social and Preventive Medicine at Swiss Tropical Institute Basel Associated Institute of the University of Basel Steinengraben 49 - 4051 Basel - Switzerland Phone: +41 (0)61 270 22 04 Fax: +41 (0)61 270 22 25 denis.aydin at unibas.ch www.ispm-unibasel.ch ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.