Brian Feeny
2012-Dec-10 22:41 UTC
[R] splitting dataset based on variable and re-combining
I have a dataset and I wish to use two different models to predict. Both models are SVMs. I use two models because the observations are split by sex. I wish to make predictions and have the results be in the same order as my original dataset. To illustrate, I will use iris:

# Take iris and create a data frame of just two Species, setosa and versicolor, then shuffle the rows
data(iris)
iris <- iris[(iris$Species=="setosa" | iris$Species=="versicolor"),]
irisindex <- sample(1:nrow(iris), nrow(iris))
iris <- iris[irisindex,]

# Make predictions on setosa using the mySetosaModel model, and on versicolor using the myVersicolorModel model:
predict(mySetosaModel, iris[iris$Species=="setosa",])
predict(myVersicolorModel, iris[iris$Species=="versicolor",])

The problem is that this gives me one vector of just the setosa results and another of just the versicolor results.

I wish to take the results and have them be in the same order as the original dataset. So if the original dataset had:

Species
setosa
setosa
versicolor
setosa
versicolor
setosa

I wish for my results to have:

<prediction for setosa>
<prediction for setosa>
<prediction for versicolor>
<prediction for setosa>
<prediction for versicolor>
<prediction for setosa>

But instead, what I am ending up with is two result sets, and no way I can think of to combine them. I am sure this comes up a lot where you have a factor you wish to split your models on, say sex (male vs. female), and you need to present the results back so that they match the order of the original dataset.

I have tried to think of ways to use an index to keep things in order, but I can't figure it out.

Any help is greatly appreciated.

Brian
David L Carlson
2012-Dec-11 00:02 UTC
[R] splitting dataset based on variable and re-combining
Package plyr is designed for this sort of thing, but the functions split() and unsplit() will work as well. This example just uses a simple lm() model:

> data(iris)
> iris <- iris[(iris$Species=="setosa" | iris$Species=="versicolor"),]
> set.seed(42)
> irisindex <- sample(1:nrow(iris), nrow(iris))
> iris <- iris[irisindex,]
> iris$Species <- factor(iris$Species) # Eliminate empty level virginica
> iris2 <- split(iris, iris$Species) # List with two data.frames
> results <- lapply(iris2, function(x) lm(Sepal.Length ~ Sepal.Width +
+ Petal.Length + Petal.Width, x))
> fit <- lapply(results, predict)
> iris3 <- lapply(names(iris2), function(x) data.frame(iris2[[x]],fitted=fit[[x]]))
> iris4 <- unsplit(iris3, iris$Species)
> head(iris4)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species   fitted
92          6.1         3.0          4.6         1.4 versicolor 6.283549
93          5.8         2.6          4.0         1.2 versicolor 5.719649
29          5.2         3.4          1.4         0.2     setosa 4.961338
81          5.5         2.4          3.8         1.1 versicolor 5.528532
62          5.9         3.0          4.2         1.5 versicolor 5.852292
50          5.0         3.3          1.4         0.2     setosa 4.895855

----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352
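The split()/unsplit() idea in the reply above also works when you only want a vector of predictions rather than a rebuilt data frame. The following is a minimal sketch, not the poster's actual setup: lm() stands in for the SVM models from the question, and the names dat, groups, models, and preds are made up for illustration.

```r
# Sketch: one model per group, predictions returned in original row order.
# lm() is a stand-in for the SVM models in the question (assumption).
data(iris)
dat <- droplevels(iris[iris$Species != "virginica", ])  # keep two species
set.seed(42)
dat <- dat[sample(nrow(dat)), ]                          # shuffle the rows

groups <- split(dat, dat$Species)                        # one data frame per species
models <- lapply(groups, function(d)
  lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = d))

# Predict within each group; Map() pairs each model with its own group
preds <- Map(function(m, d) unname(predict(m, newdata = d)), models, groups)

# unsplit() reverses split(), putting each prediction back at its original row
dat$fitted <- unsplit(preds, dat$Species)
head(dat)
```

Because unsplit() is driven by the same factor that was passed to split(), no explicit index bookkeeping is needed, and the approach is not limited to two groups.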
Thomas Stewart
2012-Dec-11 03:38 UTC
[R] splitting dataset based on variable and re-combining
Why not use an indicator variable?

P1 <- ... # prediction from model 1 (setosa) for entire dataset
P2 <- ... # prediction from model 2 (versicolor) for entire dataset
I <- Species=="setosa" # indicator: TRUE for setosa rows, FALSE otherwise
Predictions <- P1 * I + P2 * ( 1 - I )
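The multiplication in the indicator suggestion relies on R coercing TRUE/FALSE to 1/0 in arithmetic, so each row keeps exactly one of the two predictions. Here is a small sketch with made-up numbers; the values of P1 and P2 are hypothetical and merely stand in for the output of two models applied to every row.

```r
# Toy predictions standing in for the two models' output (hypothetical values)
species <- c("setosa", "setosa", "versicolor", "setosa", "versicolor", "setosa")
P1 <- c(5.1, 4.9, 6.0, 5.0, 6.2, 4.8)   # model 1 applied to every row
P2 <- c(5.5, 5.3, 6.4, 5.4, 6.6, 5.2)   # model 2 applied to every row

I <- species == "setosa"                # logical; coerced to 1/0 in arithmetic

by_product <- P1 * I + P2 * (1 - I)     # P1 where I is 1, P2 where I is 0
by_ifelse  <- ifelse(I, P1, P2)         # same selection without arithmetic

print(by_product)
# 5.1 4.9 6.4 5.0 6.6 4.8
```

Two caveats: this only works for numeric predictions, and an NA in either vector propagates to the result even on rows where that model is not selected (NA * 0 is NA in R). ifelse() avoids the second problem and also handles non-numeric predictions.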
Brian Feeny
2012-Dec-11 03:46 UTC
[R] splitting dataset based on variable and re-combining
I will look into that, thanks. I am afraid I don't quite understand what is going on there with the multiplication, so I will need to read up.

What I ended up doing was this. For the training data it's easy, as I can subset so that each model is fitted only on the data I want:

rbfSVM_setosa <- train(Sepal.Length~., data = trainset, subset = trainset$Species=="setosa", ...)
rbfSVM_versicolor <- train(Sepal.Length~., data = trainset, subset = trainset$Species=="versicolor", ...)

For my test data (testset), I ended up doing the following, which appears to work:

index_setosa <- which(testset$Species == "setosa")
svmPred <- as.vector(rep(NA,nrow(testset)))
svmPred[index_setosa] <- predict(rbfSVM_setosa, testset[testset$Species == "setosa",])
svmPred[is.na(svmPred)] <- predict(rbfSVM_versicolor, testset[testset$Species == "versicolor",])

The above works when there are just two classes. I am going to read up on some of the other approaches suggested and give them a try.

Brian
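The index-assignment approach above can be extended beyond two classes by looping over the group levels and filling each group's positions explicitly, instead of relying on is.na() to find the remaining rows. This is only a sketch under assumptions: lm() again stands in for the caret/SVM fits, and the names models, pred, sp, and rows are hypothetical.

```r
# Sketch: fill a prediction vector group by group, for any number of groups.
data(iris)
set.seed(1)
idx      <- sample(nrow(iris), 100)
trainset <- iris[idx, ]
testset  <- iris[-idx, ]

# One model per species; lm() stands in for the SVM fits (assumption)
models <- lapply(split(trainset, trainset$Species), function(d)
  lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = d))

pred <- rep(NA_real_, nrow(testset))          # one slot per test row
for (sp in names(models)) {
  rows <- which(testset$Species == sp)        # positions of this group's rows
  if (length(rows) > 0) {
    pred[rows] <- predict(models[[sp]], newdata = testset[rows, ])
  }
}
head(data.frame(Species = testset$Species, pred = pred))
```

Indexing by which() for every group also avoids a subtle issue with the is.na() shortcut: if the first model ever returned NA for one of its own rows, that slot would be treated as belonging to the second group.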