Björn Fisseler
2017-Feb-17 12:54 UTC
[R] Feature space problem regarding text classification using SVM
Dear list members,

I'm currently working on text classification of students' essays, trying to identify whether or not a text belongs to a certain class. I use texts from one semester (A) for training and texts from another semester (B) for testing the classifier. My workflow is as follows:

* read all texts from A, build a DTM(A) with about 1387 terms
* read all texts from B, build a DTM(B) with about 626 terms
* train the classifier with DTM(A), using an SVM (package e1071)

Now I want to classify all texts in DTM(B) using the classifier. But when I try to use predict(), I always get the error message:

Error in eval(expr, envir, enclos) : object 'XY' not found

As I found out, the reason for this is that DTM(A) and DTM(B) contain different sets of terms (and a different number of them), so not every term used for training the model is available in DTM(B).

My question is: how should I deal with this? Should I match the terms used in DTM(A) and DTM(B) in order to get an identical feature space? This could be achieved either by reducing the number of terms in DTM(A) or by adding several empty/NA columns to DTM(B). Or is there another solution to my problem?

Kind regards
Björn
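
The second option mentioned above (bringing DTM(B) into the feature space of DTM(A)) might be sketched roughly as follows. This is only an illustration under several assumptions: `align_to_vocab` is a hypothetical helper, not part of e1071 or any other package; the two small matrices are made-up stand-ins for DTM(A) and DTM(B); the DTMs are assumed to be convertible with `as.matrix()`; and missing terms are filled with 0 rather than NA, on the assumption that a term absent from a text simply has a count of zero. Terms that occur only in DTM(B) are dropped, since the model was never trained on them.

```r
# Hypothetical helper: rebuild a test DTM so that it has exactly the
# columns (terms) of the training DTM, in the same order.
align_to_vocab <- function(dtm_test, vocab) {
  dtm_test <- as.matrix(dtm_test)
  # Start from an all-zero matrix with one column per training term.
  aligned <- matrix(0, nrow = nrow(dtm_test), ncol = length(vocab),
                    dimnames = list(rownames(dtm_test), vocab))
  # Copy over the counts for terms that both DTMs share.
  shared <- intersect(vocab, colnames(dtm_test))
  if (length(shared) > 0) {
    aligned[, shared] <- dtm_test[, shared, drop = FALSE]
  }
  aligned
}

# Made-up toy data standing in for DTM(A) and DTM(B).
dtm_a <- matrix(c(2, 0, 1,
                  0, 3, 0,
                  1, 1, 0,
                  0, 0, 4),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("a", 1:4),
                                c("essay", "learn", "theory")))
dtm_b <- matrix(c(1, 2,
                  0, 5),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("b1", "b2"), c("learn", "unseen")))

# DTM(B) now has the same terms as DTM(A); "unseen" is dropped.
dtm_b_aligned <- align_to_vocab(dtm_b, colnames(dtm_a))

# Train and predict via the matrix interface of e1071::svm, if installed.
# The labels are invented for illustration only.
if (requireNamespace("e1071", quietly = TRUE)) {
  labels_a <- factor(c("fit", "other", "fit", "other"))
  model <- e1071::svm(x = dtm_a, y = labels_a,
                      kernel = "linear", scale = FALSE)
  pred <- predict(model, dtm_b_aligned)
}
```

The same helper could in principle be used for the first option as well, by passing the intersection of both term sets as `vocab` and aligning DTM(A) and DTM(B) to it before training.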