Bekzod Akhmuratov
2024-Sep-24 05:04 UTC
[R] Help needed! Pre-processing the dataset before splitting - model building - model tuning - performance evaluation
Below is the link to the dataset in question. I want to split it into training and test sets, use the training set to build and tune the model, and use the test set to evaluate performance. Before doing that, I want to check whether the original dataset has noise, collinearity, or major outliers that I need to address, for example by transforming the data with techniques like Box-Cox or by using VIF to eliminate highly correlated predictors.

https://www.kaggle.com/datasets/joaofilipemarques/google-advanced-data-analytics-waze-user-data

When I fit a regression model to the original dataset in Minitab, I get the attached residual plots. The residuals do not look normal. Does that mean there is high correlation among the predictors, or that the relationship between the response and the predictors is nonlinear? How should I approach this? What would my strategy be in Python, Minitab, and R? An explanation for all three packages would be appreciated, if possible.

[Attachment: Residual Plots for Response.png, image/png, 17679 bytes. URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20240924/8046d3c5/attachment.png>]
Rui Barradas
2024-Sep-25 08:00 UTC
[R] Help needed! Pre-processing the dataset before splitting - model building - model tuning - performance evaluation
Hello,

R-Help is a list for questions and answers about R code, not for suggesting analysis strategies.

That said, I suggest you read the Python notebook at the bottom of the Kaggle page; it addresses your main question. If you have trouble translating the Python code to R, ask us more specific questions about those difficulties.

Hope this helps,

Rui Barradas
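For readers following the thread, the workflow described in the original question (split first, check collinearity with VIF on the training set only, then evaluate on the held-out test set) could be sketched roughly as below. This is an illustration only, not code from the thread or from the Kaggle notebook. It assumes Python with NumPy, uses synthetic data in place of the Waze dataset, and the helper names train_test_split_indices and variance_inflation_factors are hypothetical.

```python
import numpy as np


def train_test_split_indices(n_rows, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) from a random permutation of the row indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_fraction))
    return order[n_test:], order[:n_test]


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r_squared = 1.0 - resid.var() / target.var()
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs


# Synthetic stand-in data (an assumption, not the Waze dataset):
# x2 is almost a copy of x1, x3 is independent of both.
rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 0.5 * x3 + rng.normal(scale=0.3, size=n)

# Split before any diagnostics so the test rows never influence decisions.
train_idx, test_idx = train_test_split_indices(n)
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Collinearity check on the training rows only.
vifs = variance_inflation_factors(X_train)
print("VIF per predictor:", np.round(vifs, 1))

# Drop the redundant predictor (x2), fit ordinary least squares on the
# training set, and measure error on the test set.
keep = [0, 2]
A_train = np.column_stack([np.ones(len(train_idx)), X_train[:, keep]])
A_test = np.column_stack([np.ones(len(test_idx)), X_test[:, keep]])
beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
test_rmse = float(np.sqrt(np.mean((y_test - A_test @ beta) ** 2)))
print("Test RMSE:", round(test_rmse, 3))
```

Note that a high VIF points to collinearity among the predictors, which is a separate question from whether the residuals look normal. A curved or skewed residual plot is more often a hint of a nonlinear relationship, or of a response that may need transforming, than of correlated predictors.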