Alekseiy Beloshitskiy
2012-Mar-23 14:34 UTC
[R] How do you scale variables which consist of tokens
Dear All,

Suppose you want to make a prediction using a range of variables, some of which are represented as sets of words (tokens). For example, take a training set:

x1, x2, ..., x7, y

where y is the value to be predicted (regardless of the model used for prediction), and x4 is a variable made up of the words from a Google search query (the number of words may differ between observations). For example:

x4=(how,grow,tree)

which can also be represented in hashed form:

x4=(11111,22222,33333)

I need to scale this variable (x4) to be able to use it in a model. I was thinking about scaling it with TF-IDF. That way I can represent each observation of x4 as a scaled vector with N elements, like:

x4=(0.0175105020782697,...0.019135397913606) //scaled with TF-IDF

However, I think it still isn't scaled properly (please correct me if I'm wrong), since I need x4 to be represented as a single INTEGRAL value for each observation to be able to use it in a model. I assume the result of scaling should look like:

x4=0.06789324432 //integral value

Do you have any ideas how to do this? I would appreciate any suggestions.

-Aleksei
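
P.S. A minimal sketch of the idea described above, written in Python for illustration only. The toy queries, the helper name tf_idf, the particular TF-IDF variant (term frequency times log(N/df)), and the use of the mean as the step that collapses each vector to one number are all assumptions made for this example, not an established answer to the question. It also assumes every query contains at least one token.

```python
import math
from collections import Counter

# Hypothetical toy data: each observation of x4 is a tokenized search query.
queries = [
    ["how", "grow", "tree"],
    ["how", "plant", "tree"],
    ["buy", "tree"],
]

def tf_idf(docs):
    """Return one {token: weight} dict per document (assumes non-empty docs)."""
    n_docs = len(docs)
    # Document frequency: the number of queries in which each token appears.
    df = Counter(token for doc in docs for token in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Term frequency within the query times inverse document frequency.
            tok: (cnt / len(doc)) * math.log(n_docs / df[tok])
            for tok, cnt in counts.items()
        })
    return weights

weights = tf_idf(queries)

# One possible (assumed) way to get a single value per observation:
# the mean TF-IDF weight over the distinct tokens of the query.
x4_scalar = [sum(w.values()) / len(w) for w in weights]

for query, value in zip(queries, x4_scalar):
    print(query, value)
```

Note that a single number of this kind no longer says which tokens were in the query, only how rare they were on average, so whether it is an acceptable "scaling" depends on the model.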