thr3ads.net - R help - [R] splitting dataset based on variable and re-combining [Dec 2012]

Brian Feeny

2012-Dec-10 22:41 UTC

[R] splitting dataset based on variable and re-combining

I have a dataset and I wish to predict with two different models, both of them
SVMs.  I use two models because I want a separate one for each sex of the
observations.  I wish to make predictions and have the results come back in
the same order as my original dataset.  To illustrate, I will use iris:

# Take iris and create a data frame of just two Species, setosa and versicolor,
# then shuffle the rows
data(iris)
iris <- iris[(iris$Species=="setosa" |
iris$Species=="versicolor"),]
irisindex <- sample(1:nrow(iris), nrow(iris))
iris <- iris[irisindex,]

# Make predictions on setosa using the mySetosaModel model, and on versicolor
# using the myVersicolorModel model:

predict(mySetosaModel, iris[iris$Species=="setosa",])
predict(myVersicolorModel, iris[iris$Species=="versicolor",])

The problem is this will give me a vector of just the setosa results, and then
one of just the versicolor results.

I wish to take the results and have them be in the same order as the original
dataset.  So if the original dataset had:


Species
setosa
setosa
versicolor
setosa
versicolor
setosa

I wish for my results to have:
<prediction for setosa>
<prediction for setosa>
<prediction for versicolor>
<prediction for setosa>
<prediction for versicolor>
<prediction for setosa>

But instead, what I end up with is two result sets, and no way I can think of
to combine them.  I am sure this comes up a lot: you have a factor you wish to
split your models on, say sex (male vs. female), and you need to present the
results so that they match the order of the original dataset.

I have tried to think of ways to use an index to keep things in order, but I
can't figure it out.

Any help is greatly appreciated.

Brian

David L Carlson

2012-Dec-11 00:02 UTC

Package plyr is designed for this sort of thing, but functions split() and
unsplit() will work as well. This example just uses a simple lm() model:
> data(iris)
> iris <- iris[(iris$Species=="setosa" | iris$Species=="versicolor"),]
> set.seed(42)
> irisindex <- sample(1:nrow(iris), nrow(iris))
> iris <- iris[irisindex,]
> iris$Species <- factor(iris$Species) # Eliminate empty level virginica
> iris2 <- split(iris, iris$Species)   # List with two data.frames
> results <- lapply(iris2, function(x) lm(Sepal.Length ~ Sepal.Width +
+     Petal.Length + Petal.Width, x))
> fit <- lapply(results, predict)
> iris3 <- lapply(names(iris2), function(x) data.frame(iris2[[x]],
+     fitted=fit[[x]]))
> iris4 <- unsplit(iris3, iris$Species)
> head(iris4)
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species   fitted
92          6.1         3.0          4.6         1.4 versicolor 6.283549
93          5.8         2.6          4.0         1.2 versicolor 5.719649
29          5.2         3.4          1.4         0.2     setosa 4.961338
81          5.5         2.4          3.8         1.1 versicolor 5.528532
62          5.9         3.0          4.2         1.5 versicolor 5.852292
50          5.0         3.3          1.4         0.2     setosa 4.895855
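
The same split()/unsplit() idea also works on the prediction vectors alone, without rebuilding the data frames. The following is only a sketch under assumptions: it uses lm() as a stand-in for the SVM models, and the names dat, groups, models and preds are made up for illustration.

```r
# Sketch: one lm() per species, predict per group, then restore row order.
set.seed(42)
dat <- droplevels(iris[iris$Species %in% c("setosa", "versicolor"), ])
dat <- dat[sample(nrow(dat)), ]            # shuffle the rows

groups <- split(dat, dat$Species)          # one data frame per species
models <- lapply(groups, function(d)
  lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = d))

# Predict each group with its own model; Map() pairs models and groups
# by position, and both lists are in factor-level order.
preds <- Map(function(m, d) unname(predict(m, newdata = d)), models, groups)

# unsplit() reverses split(): each value returns to the row it came from.
dat$fitted <- unsplit(preds, dat$Species)
head(dat)
```

Because unsplit() is driven by the same factor that was passed to split(), this should carry over to more than two groups without any extra bookkeeping.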

----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352

Thomas Stewart

2012-Dec-11 03:38 UTC

Why not use an indicator variable?

P1 <- ... # prediction from model 1 (setosa) for the entire dataset

P2 <- ... # prediction from model 2 (versicolor) for the entire dataset

I <- Species=="setosa" # TRUE (counts as 1) for setosa rows, FALSE (0) otherwise

Predictions <- P1 * I + P2 * ( 1 - I )
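
To make the multiplication concrete: in arithmetic, R treats TRUE as 1 and FALSE as 0, so each row keeps P1 where I is TRUE and P2 where it is FALSE. Here is a minimal sketch, assuming numeric predictions and using lm() in place of the SVM models; the names dat, f, m_setosa, m_versicolor, pred_arith and pred_ifelse are hypothetical.

```r
# Sketch: both models predict for every row, then an indicator picks one.
dat <- droplevels(iris[iris$Species %in% c("setosa", "versicolor"), ])
f <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width

m_setosa     <- lm(f, data = dat, subset = Species == "setosa")
m_versicolor <- lm(f, data = dat, subset = Species == "versicolor")

P1 <- predict(m_setosa, newdata = dat)      # setosa model, all rows
P2 <- predict(m_versicolor, newdata = dat)  # versicolor model, all rows
I  <- dat$Species == "setosa"               # logical indicator

# Arithmetic form: TRUE counts as 1, FALSE as 0.
pred_arith <- P1 * I + P2 * (1 - I)

# Equivalent selection without arithmetic.
pred_ifelse <- ifelse(I, P1, P2)
```

The arithmetic form only makes sense for numeric predictions, and an NA in either P1 or P2 propagates into the result for that row (NA * 0 is NA in R), so ifelse() may be the safer choice. Either way the result is already in the original row order, because nothing was ever subset.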


Brian Feeny

2012-Dec-11 03:46 UTC

I will look into that, thanks.  I am afraid I don't quite understand what is
going on there with the multiplication, so I will need to read up.  What I ended
up doing was this:

For the training data it's easy, as I can subset so that each model only works
off the data I want:

rbfSVM_setosa     <- train(Sepal.Length~., data = trainset,
                           subset = trainset$Species=="setosa", ...)
rbfSVM_versicolor <- train(Sepal.Length~., data = trainset,
                           subset = trainset$Species=="versicolor", ...)

For my test data (testset), I ended up doing the following, which appears to work:

index_setosa <- which(testset$Species == "setosa")

svmPred <- as.vector(rep(NA,nrow(testset)))
svmPred[index_setosa] <- predict(rbfSVM_setosa,
                                 testset[testset$Species == "setosa",])
svmPred[is.na(svmPred)] <- predict(rbfSVM_versicolor,
                                   testset[testset$Species == "versicolor",])

The above works when there are just two classes.  I am going to read up on the
other approaches suggested and give them a try.
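
One way the index approach might be extended past two classes is to loop over the factor levels and fill a preallocated vector by row index. This is a sketch under assumptions, not the caret/SVM setup itself: lm() stands in for train(), the train/test split is arbitrary, and train_idx, models, pred, sp and rows are hypothetical names.

```r
# Sketch: per-class models with index-based reassembly, any number of classes.
set.seed(1)
train_idx <- c(1:35, 51:85, 101:135)       # 35 rows of each species
trainset  <- iris[train_idx, ]
testset   <- iris[-train_idx, ]
testset   <- testset[sample(nrow(testset)), ]   # shuffle the test rows

f <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width
models <- lapply(split(trainset, trainset$Species),
                 function(d) lm(f, data = d))

pred <- rep(NA_real_, nrow(testset))       # one slot per test row
for (sp in names(models)) {
  rows <- which(testset$Species == sp)     # positions of this class
  if (length(rows) > 0) {
    pred[rows] <- predict(models[[sp]], newdata = testset[rows, ])
  }
}
```

Indexing every class explicitly, rather than filling whatever is still NA, avoids overwriting a slot whose prediction legitimately came back as NA.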

Brian


            

