Hello R-people!

I have a general statistical question about regressions. I just want to describe my case:

I have a dataset of around 150 observations with 1 dependent and 2 independent variables. The dependent variable is metric (in my case meters, ranging from around 0.5 to 10000 m). The first independent variable is also metric (in mm, ranging from 50 to 700 mm) and is assumed to be linearly related to the dependent one. So it is not a problem at all to do a typical linear regression on that.

Now there is the second independent variable. It is also metric and gives information on time (ranging from 1 day to 800 days), but sometimes its value is not known exactly: I know, for example, only a range (1-2 days) or that it is less than x days, etc. So my dataset could look like this:

measured independent variable (time) in days:
1
15
7-9
<2
<9
24
4
4-7

So my question: Is there a general method to include such types of variables in a regression analysis?

Secondly, I assume that the relation is not linear. It is more of a logarithmic nature, so that the influence of time on the dependent variable decreases as time increases.

So in short my questions:
* How can I use variable values like <5 or 4-5 in a regression?
* Is it possible to combine the linear relationship with a logarithmic one in a multiple regression?
* How can that be done in R, are there any special packages you'd recommend?

Thank you very much

best regards
Johannes
You could treat the dependent variable as a nominal variable, and scale the independent variables to have mean 0 and standard deviation 1. Stick all these into a multinomial regression package such as mlogit, or a non-parametric method such as randomForest.

--
View this message in context: http://r.789695.n4.nabble.com/special-question-on-regression-tp3673228p3673377.html
Sent from the R help mailing list archive at Nabble.com.
Johannes:

R is not a statistical tutorial service, although kind and able helpeRs sometimes do reply to such queries. You should try such a service, for example:
http://stackoverflow.com/

FWIW, this is an example of censoring in regression. R has packages for this, but you need to learn more or get help to use them properly, as you, yourself, indicated.

-- Bert

On Sun, Jul 17, 2011 at 3:01 AM, Johannes Radinger <Jradinger at gmx.at> wrote:
> So in short my questions:
> * How can I use variable values like <5 or 4-5 in a regression
> * Is it possible to combine the linear relationship with a logarithmic one
>   in a multiple regression
> * How can that be done in R, are there any special packages you'd recommend?
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
"Men by nature long to get on to the ultimate truths, and will often be impatient with elementary studies or fight shy of them. If it were possible to reach the ultimate truths without the elementary studies usually prefixed to them, these would not be preparatory studies but superfluous diversions."
-- Maimonides (1135-1204)

Bert Gunter
Genentech Nonclinical Biostatistics
467-7374
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
- Please also reply to the original poster who may not be subscribed to the list.
- Please cite the original question (and other relevant parts of the thread) since some readers of this list will delete messages before an answer arrives.

Uwe Ligges

On 17.07.2011 15:22, saskay wrote:
> You could treat the dependent variable as a nominal variable. And scale the
> independent variables to have a Mean:0 and StDev:1. Stick all these in a
> multinomial regression package such as mlogit. Or a non-parametric method
> such as randomForest.
I remember seeing an example using the EM algorithm where one of the variables was the age of a child. They assumed that an age like 16 months was accurate to the month, but that ages like 18 months may have been off by as much as 2 months and ages like 3 years could be off by 6 months (or more), so they used the EM algorithm to estimate the actual ages (I think, but am not sure, that age was used as a predictor in the regression). I think the example may be in Little and Rubin's book on missing data, but I could not find it in a quick skim through my copy. But that is one approach.

If you have an exact transformation of your interval data and you are willing to assume multivariate normality (after the transform), then you could use maximum likelihood with the optim (or other) function; just write the likelihood function so that it takes the intervals into account. This would work with something like a log transform, but I don't know how it would work with something like a spline.

Another approach would be a Bayesian regression (use BRugs or similar) where you put a prior distribution on each of the intervals, e.g. if the data you have is 2-4, then maybe use a uniform prior between 2 and 4, etc. This has a similar feel to me to the EM approach, but is based on very different theory. One advantage is that you also get a posterior distribution for the actual value of each observation for which you only know the interval. Again, this works great with known transforms like log, but I don't know how you would do a spline; you could use a polynomial transformation to start, to at least get a feel for the level and general shape of the nonlinearity.

You might also try multiple imputation on the interval data. There are several packages that do multiple imputation, but I don't know if any of them would take the intervals into account. You could possibly create your own imputations by generating values randomly within the intervals, then use the existing tools to help with the analysis.
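As a rough illustration of that last idea (generating imputations randomly within the intervals), here is a minimal sketch in base R. Everything in it is an assumption made for illustration: the data are simulated, the names (x1, t_true, lower, upper, kind) are made up, "<x" entries are given an arbitrary lower bound of 0.5 days, and values are drawn uniformly within each interval. Because the draws ignore the response, the coefficient on log(time) will tend to be pulled toward zero, so treat this as a first look and not as a proper imputation model. The estimates are pooled with Rubin's rules.

```r
# Hand-rolled multiple imputation for an interval-recorded predictor (base R).
# All data are simulated; names and bounds are illustrative assumptions.
set.seed(42)

n      <- 150
x1     <- runif(n, 50, 700)                 # first predictor (mm)
t_true <- exp(runif(n, 0, log(800)))        # true time in days (unknown in practice)
y      <- 5 + 0.01 * x1 + 1.5 * log(t_true) + rnorm(n, sd = 0.5)

# Store time as [lower, upper]; exactly known values have lower == upper.
kind  <- sample(c("exact", "range", "less"), n, replace = TRUE,
                prob = c(0.6, 0.2, 0.2))
lower <- t_true
upper <- t_true
is_range <- kind == "range"                 # entries like "4-7"
is_less  <- kind == "less"                  # entries like "<9"
lower[is_range] <- 0.8  * t_true[is_range]
upper[is_range] <- 1.25 * t_true[is_range]
lower[is_less]  <- 0.5                      # assumed smallest plausible time
upper[is_less]  <- 1.5 * t_true[is_less]

# One imputed data set: draw uniformly inside each interval.
# runif() returns the bound itself when min == max, so exact values are kept.
m    <- 20
fits <- lapply(seq_len(m), function(i) {
  t_imp <- runif(n, min = lower, max = upper)
  lm(y ~ x1 + log(t_imp))                   # linear in x1, logarithmic in time
})

# Pool with Rubin's rules: total variance = within + (1 + 1/m) * between.
est        <- sapply(fits, coef)                          # 3 x m matrix
within_var <- sapply(fits, function(f) diag(vcov(f)))     # 3 x m matrix
q_bar      <- rowMeans(est)
w_bar      <- rowMeans(within_var)
b_var      <- apply(est, 1, var)
total_var  <- w_bar + (1 + 1 / m) * b_var

pooled <- cbind(estimate = q_bar, std_error = sqrt(total_var))
print(round(pooled, 3))
```

With real data, lower and upper would be parsed from entries such as "7-9" or "<2", and the fitted models could be handed to an existing pooling tool instead of the hand-written pooling step.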
There are a few avenues to investigate. I think I would go with the Bayesian (just don't tell my Bayesian friends that :-), but your preferences could differ.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Johannes Radinger
> Sent: Sunday, July 17, 2011 4:02 AM
> To: r-help at r-project.org
> Subject: [R] special question on regression
>
> So in short my questions:
> * How can I use variable values like <5 or 4-5 in a regression
> * Is it possible to combine the linear relationship with a
>   logarithmic one in a multiple regression
> * How can that be done in R, are there any special packages you'd
>   recommend?