Ebert,Timothy Aaron
2022-Jun-15 18:02 UTC
[R] Model Comparision for case control studies in R
The uncorrelated nature of smoking and hypertension would be a major medical breakthrough, and it is in contrast to reports like this: https://pubmed.ncbi.nlm.nih.gov/20550499/ . The literature also indicates the possibility of a relationship between age and hypertension: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4768730/ . Depending on the country, there might be a relationship between smoking and age as government programs against smoking are developed.

Are you looking at different models or different methods? I could have y = w + x + z as one model and y = x + z as another model. Alternatively, I could be comparing ordinary least squares regression versus maximum likelihood versus Bayesian linear regression versus nonlinear regression. The former (comparing models) might use something like the Akaike information criterion. I am not sure the latter (comparing methods) is useful, or even possible. For example, I could approximate an exponential function using a polynomial, but in this context I see no benefit in doing so even if I could compare the models.

I do not quite understand why this is being done. It feels like fishing through statistical methods to get the answer that I already know is correct. Generally, one should understand the system well enough to select an appropriate model rather than try every possible model in the hope that something fits. Of course, one sometimes collects extra data in the hope of not missing an important feature. Then forward/backward/stepwise methods are used to identify the "best" model, but this is looking at similar models that differ only in the list of independent variables.

However the problem is solved, I would start by trying to determine whether any one model is appropriate. Are the model assumptions satisfied? If the answer is no, then try another model until you find one that does satisfy its assumptions. Alternatively, start with an understanding of the biology and use the best model. Choosing between a biologically meaningless statistical model and a biologically meaningful one is an easy choice.

Tim

-----Original Message-----
From: anteneh asmare <hanatezera at gmail.com>
Sent: Wednesday, June 15, 2022 1:10 PM
To: Ebert,Timothy Aaron <tebert at ufl.edu>
Cc: r-help at r-project.org
Subject: Re: [R] Model Comparision for case control studies in R

Dear Tim,
Thanks. The first vector, y<-c(0,1,1,0,0,1,0,0,1,1,1,0,1,1,1,0,0,0,0,1), is the disease status y (1=Case, 0=Control). The covariates age, smoking status, and hypertension are independent (uncorrelated). Unconditional logistic regression will be used. But I need to compare other models with logistic regression rather than only fitting logistic regression directly. There is no matching in the data, so conditional logistic regression does not apply.
Best,
Hana

On 6/15/22, Ebert,Timothy Aaron <tebert at ufl.edu> wrote:
> Disease status is missing from the sample data.
> Are age, disease, smoking, and/or hypertension correlated in any way
> or are they independent (correlation=0)?
> Are the correlations large enough to adversely influence your model?
> Tim
>
> -----Original Message-----
> From: R-help <r-help-bounces at r-project.org> On Behalf Of anteneh asmare
> Sent: Wednesday, June 15, 2022 7:29 AM
> To: r-help at r-project.org
> Subject: [R] Model Comparision for case control studies in R
>
> y<-c(0,1,1,0,0,1,0,0,1,1,1,0,1,1,1,0,0,0,0,1)
> age<-c(45,23,56,67,23,23,28,56,45,47,36,37,33,35,38,39,43,28,39,41)
> smoking<-c(0,1,1,1,0,0,0,0,0,1,1,0,0,1,0,1,1,1,0,1)
> hypertension<-c(1,1,0,1,0,1,0,1,1,0,1,1,1,1,1,1,0,0,1,0)
> data<-data.frame(y,age,smoking,hypertension)
> data
> model<-glm(y~age+factor(smoking)+factor(hypertension), data = data,
>            family = binomial(link = "logit"), na.action = na.omit)
> summary(model)
>
> From the above sample data I want to analyze a case-control study of male
> individuals, with response variable y, disease status (1=Case,
> 0=Control), and covariates age, smoking status (1=Yes, 0=No), and
> hypertension (1=Yes, 0=No). I want to fit a model to predict the
> disease status using at least two different methods, and to make model
> comparisons. I think logistic regression will be the best fit for this
> case-control study. Do we have other options in addition to logistic
> regression?
> My objective is to fit the model to predict the disease status using
> at least two different methods.
> Kind regards,
> Hana
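The two questions raised in this exchange (are the covariates associated with each other, and how would two models with different predictor lists be compared) can be checked directly on the posted data. The following is only a minimal sketch in base R, not the method either poster settled on. The names `dat`, `full`, `reduced`, `covariate_cor`, `aic_table`, and `lrt` are illustrative assumptions, as is the choice of dropping hypertension for the reduced model.

```r
# Sample data copied from the original post (20 subjects).
y            <- c(0,1,1,0,0,1,0,0,1,1,1,0,1,1,1,0,0,0,0,1)
age          <- c(45,23,56,67,23,23,28,56,45,47,36,37,33,35,38,39,43,28,39,41)
smoking      <- c(0,1,1,1,0,0,0,0,0,1,1,0,0,1,0,1,1,1,0,1)
hypertension <- c(1,1,0,1,0,1,0,1,1,0,1,1,1,1,1,1,0,0,1,0)

# Data frame with the binary covariates stored as factors (assumed name: dat).
dat <- data.frame(y, age,
                  smoking      = factor(smoking),
                  hypertension = factor(hypertension))

# Pairwise Pearson correlations among the covariates, using the 0/1 coding
# of the binary variables. This is a rough check, not a formal test.
covariate_cor <- cor(cbind(age, smoking, hypertension))
print(round(covariate_cor, 2))

# Two candidate models fitted by the same method (maximum likelihood) that
# differ only in their list of predictors. Which predictor to drop is an
# arbitrary choice made for illustration.
full    <- glm(y ~ age + smoking + hypertension, data = dat, family = binomial)
reduced <- glm(y ~ age + smoking,                data = dat, family = binomial)

# Akaike information criterion: the model with the lower value is preferred.
aic_table <- AIC(full, reduced)
print(aic_table)

# Because the models are nested, a likelihood-ratio test is also available.
lrt <- anova(reduced, full, test = "Chisq")
print(lrt)
```

With only 20 observations, small AIC differences and the likelihood-ratio p-value carry little weight, so any conclusion from this sketch should be treated as tentative.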
Dear Tim,
Thanks a lot. I am looking for different methods. For each method, I want to select the best predictors and report some measures of accuracy. I will then compare the performance of the models by plotting their ROC curves.
Best,
Hana
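The workflow described in this reply (select predictors, report an accuracy measure, and compare ROC curves across methods) could look roughly like the sketch below. It uses base R only. Probit regression stands in as the second method purely as an assumption, since the thread never names one, and any other method that returns predicted probabilities could be substituted. The helper functions `roc_points` and `auc` are hypothetical names written for this example, not functions from a package.

```r
# Sample data copied from the original post.
y            <- c(0,1,1,0,0,1,0,0,1,1,1,0,1,1,1,0,0,0,0,1)
age          <- c(45,23,56,67,23,23,28,56,45,47,36,37,33,35,38,39,43,28,39,41)
smoking      <- c(0,1,1,1,0,0,0,0,0,1,1,0,0,1,0,1,1,1,0,1)
hypertension <- c(1,1,0,1,0,1,0,1,1,0,1,1,1,1,1,1,0,0,1,0)
dat <- data.frame(y, age,
                  smoking      = factor(smoking),
                  hypertension = factor(hypertension))

# Method 1: logistic regression. Method 2: probit regression (assumed here).
fit_logit  <- glm(y ~ age + smoking + hypertension, data = dat,
                  family = binomial(link = "logit"))
fit_probit <- glm(y ~ age + smoking + hypertension, data = dat,
                  family = binomial(link = "probit"))

# Predictor selection: backward elimination by AIC on the logistic model.
sel_logit <- step(fit_logit, direction = "backward", trace = 0)
print(formula(sel_logit))

# ROC points: true and false positive rates at every distinct threshold.
roc_points <- function(score, truth) {
  thr <- c(Inf, sort(unique(score), decreasing = TRUE))
  tpr <- sapply(thr, function(t) mean(score[truth == 1] >= t))
  fpr <- sapply(thr, function(t) mean(score[truth == 0] >= t))
  data.frame(threshold = thr, fpr = fpr, tpr = tpr)
}

# AUC as the probability that a randomly chosen case scores higher than a
# randomly chosen control, with ties counted as one half.
auc <- function(score, truth) {
  cases    <- score[truth == 1]
  controls <- score[truth == 0]
  mean(outer(cases, controls, ">") + 0.5 * outer(cases, controls, "=="))
}

# In-sample predicted probabilities; these give optimistic accuracy estimates.
p_logit  <- fitted(fit_logit)
p_probit <- fitted(fit_probit)

roc_logit  <- roc_points(p_logit,  dat$y)
roc_probit <- roc_points(p_probit, dat$y)
auc_logit  <- auc(p_logit,  dat$y)
auc_probit <- auc(p_probit, dat$y)
print(c(logit = auc_logit, probit = auc_probit))

# Overlay the two ROC curves; the grey diagonal marks chance performance.
plot(roc_logit$fpr, roc_logit$tpr, type = "l",
     xlab = "1 - specificity", ylab = "Sensitivity", main = "In-sample ROC")
lines(roc_probit$fpr, roc_probit$tpr, lty = 2)
abline(0, 1, col = "grey")
legend("bottomright", legend = c("logit", "probit"), lty = c(1, 2))
```

Because the curves are computed on the same 20 subjects used for fitting, they overstate predictive accuracy. Cross-validation or a held-out sample would give a fairer comparison, though the sample is small for either.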