On 08.01.2013 21:14, Claus O'Rourke wrote:
> Hi all,
> I've encountered an issue using svm (e1071) in the specific case of
> supplying new data which may not have the full range of levels that
> were present in the training data.
>
> I've constructed this really primitive example to illustrate the point:
>
>> library(e1071)
>> training.data <- data.frame(x = c("yellow","red","yellow","red"), a = c("alpha","alpha","beta","beta"), b = c("a", "b", "a", "c"))
>> my.model <- svm(x ~ .,data=training.data)
>> test.data <- data.frame(x = c("yellow","red"), a = c("alpha","beta"), b = c("a", "b"))
>> predict(my.model,test.data)
> Error in predict.svm(my.model, test.data) :
> test data does not match model !
>>
>> levels(test.data$b) <- levels(training.data$b)
>> predict(my.model,test.data)
> 1 2
> yellow red
> Levels: red yellow
>
> In the first case test.data$b does not have the level "c" and this
> results in the input data being rejected. I've debugged this down to
> the point of model matrix creation in the SVM R code. Once I fill up
> the levels in the test data with the levels from the original data,
> then there is no problem at all.
>
> Assuming my test data has to come from another source where the number
> of category levels seen might not always be as large as those for the
> original training data, is there a better way I should be handling
> this?
You have to tell the factor about all possible levels; the data do not
necessarily contain an example of each one.
That means:
levels(test.data$b) <- c("a", "b", "c")
predict(my.model,test.data)
will help.
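One caveat with assigning levels directly: `levels(f) <- ...` relabels the existing levels by position, so it only gives the right answer when the levels present in the test data happen to be the leading ones of the training set, in the same order (as they are here). If the test data contained only "b" and "c", the same assignment would silently turn "b" into "a" and "c" into "b". A more robust variant is to rebuild the factor with `factor(..., levels = ...)`. The following is only a sketch under some assumptions: `align_levels` is a hypothetical helper name, not part of e1071, and `stringsAsFactors = TRUE` is set explicitly because current versions of R no longer convert strings to factors by default.

```r
# Data from the original post, with factor conversion made explicit.
training.data <- data.frame(
  x = c("yellow", "red", "yellow", "red"),
  a = c("alpha", "alpha", "beta", "beta"),
  b = c("a", "b", "a", "c"),
  stringsAsFactors = TRUE
)
test.data <- data.frame(
  x = c("yellow", "red"),
  a = c("alpha", "beta"),
  b = c("a", "b"),
  stringsAsFactors = TRUE
)

# Hypothetical helper (not an e1071 function): re-encode each column of
# `new` that is a factor in `ref`, using the level set and level order
# of the matching column in `ref`. Matching is done on the labels, not
# on position, so no value is relabelled.
align_levels <- function(new, ref) {
  for (col in intersect(names(new), names(ref))) {
    if (is.factor(ref[[col]])) {
      new[[col]] <- factor(as.character(new[[col]]),
                           levels = levels(ref[[col]]))
    }
  }
  new
}

test.aligned <- align_levels(test.data, training.data)
levels(test.aligned$b)

# With e1071 loaded and my.model fitted as in the original post:
# predict(my.model, test.aligned)
```

Note that any value in the test data that was never seen in training becomes NA with this approach, which is usually what you want to find out about before calling predict().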
Best,
Uwe Ligges
> Thanks