Because of regulatory requirement changes over several decades and weather conditions preventing site access, the variables in my data set have different lengths. I'd like guidance on how to perform linear regressions and other models with these variables. For example, there are 2206 rows for the parameter "TDS" but only 1191 rows for the parameter "Cond." Such discrepancies are common in these data. Is there a reference I can read to learn how to analyze such data?

Rich
Sounds like you are dealing with a missing data problem. By default, lm or glm keeps only observations with complete records (complete case analysis). This can be problematic if many values are missing and they are not missing completely at random (i.e., the missingness depends on other measured or unmeasured variables, or on the missing values themselves). Imputation is a common tool for analyzing incomplete data. In R, you can find packages that perform single or multiple imputation, e.g. randomForest, norm, mice, and mi. There is no easy way out with missing data problems: all imputations rest on strong and untestable assumptions.

Weidong Gu

On Fri, Oct 21, 2011 at 12:13 PM, Rich Shepard <rshepard at appl-ecosys.com> wrote:
> Is there a reference I can read to learn how to analyze such data?
>
> Rich
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
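To make the complete-case behaviour described in Weidong Gu's reply concrete, here is a minimal base-R sketch. The data are simulated, and the data frame `wq`, the sample size, and the 0.65 slope are assumptions for illustration only; they are not taken from Rich's data set.

```r
# Hypothetical data: TDS is recorded on every visit, Cond only on some.
set.seed(1)
n <- 200
cond <- runif(n, 50, 500)                 # simulated conductivity
tds  <- 0.65 * cond + rnorm(n, sd = 10)   # simulated TDS (the 0.65 slope is made up)
cond[sample(n, 80)] <- NA                 # 80 visits with no Cond reading

wq <- data.frame(TDS = tds, Cond = cond)

# How many values are missing in each column?
n_missing <- colSums(is.na(wq))

# lm() takes its default na.action from getOption("na.action"), which is
# normally "na.omit", so rows with an NA in any model variable are dropped.
fit <- lm(TDS ~ Cond, data = wq)

n_complete <- sum(complete.cases(wq))     # rows with no NA at all
n_used     <- nobs(fit)                   # rows actually used in the fit

print(n_missing)
print(c(rows = nrow(wq), complete = n_complete, used_by_lm = n_used))
```

Comparing `nrow(wq)` with `nobs(fit)` before interpreting a model shows how much of the record was discarded. Multiple imputation with a package such as mice is not shown here because it has to be installed separately.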
In my experience, "Cond" (conductivity?) doesn't vary much within a stream except during high-flow events, and I would imagine the same is true for TDS. If these are all low-flow values, you could possibly determine a mean or median value to use for the missing data points. Obviously this is going to be different if you are sampling storm events. If you have stage data and lots of data points, you may be able to model the parameters as a function of stage.

HTH

View this message in context: http://r.789695.n4.nabble.com/Working-With-Variables-Having-Different-Lengths-tp3926023p3926158.html
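The two suggestions in the reply above (fill gaps with a median, or model the parameter as a function of stage) might be sketched in base R as follows. Everything here is simulated: the data frame `site`, the column names, and the linear Cond-stage relationship are assumptions for illustration, not a description of real stream data.

```r
# Hypothetical low-flow data: stage on every visit, Cond on only some.
set.seed(2)
n <- 150
stage <- runif(n, 0.2, 1.5)                    # simulated stage, arbitrary units
cond  <- 300 - 60 * stage + rnorm(n, sd = 8)   # made-up relationship
missing <- sample(n, 50)                       # 50 visits with no Cond reading
cond[missing] <- NA

site <- data.frame(stage = stage, Cond = cond)

# Option 1: fill the gaps with the median of the observed values.
cond_median <- median(site$Cond, na.rm = TRUE)
site$Cond_median <- ifelse(is.na(site$Cond), cond_median, site$Cond)

# Option 2: model Cond as a function of stage and predict the gaps.
stage_fit <- lm(Cond ~ stage, data = site)     # fitted on the observed rows only
site$Cond_stage <- site$Cond
site$Cond_stage[missing] <- predict(stage_fit, newdata = site[missing, ])

print(summary(site[, c("Cond", "Cond_median", "Cond_stage")]))
```

Both filled columns treat the imputed values as if they had been measured, so standard errors from later models will be too small. Whether either fill is defensible depends on the flow conditions mentioned above.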