Hi everyone,

I am working on text categorization (my first such project, so I am still learning), and my dataset has several text columns (details about the data are pasted at the bottom). I have worked with numeric data before, but my advisor has now asked me to perform predictive modeling on this text data. I am doing some preprocessing such as tokenizing, lowercasing, stemming, etc.

The following code is used for tokenization:

    train.tokens <- tokens(train$DESCRIPTION, what = "word",
                           remove_numbers = TRUE, remove_punct = TRUE,
                           remove_symbols = TRUE, remove_hyphens = TRUE)

then

    train.tokens <- tokens_tolower(train.tokens)

then

    train.tokens <- tokens_wordstem(train.tokens, language = "english")

I have two questions.

(1) If we have more text features (apart from DESCRIPTION), do I need to repeat these steps for each feature? I tried the following, but it does not work:

    train.tokens <- tokens(c(train$DESCRIPTION,,train$NAME), what = "word",
                           remove_numbers = TRUE, remove_punct = TRUE,
                           remove_symbols = TRUE, remove_hyphens = TRUE)

(2) After we preprocess the data and create our bag-of-words model as below,

    train.tokens.dfm <- dfm(train.tokens, tolower = FALSE)
    train.tokens.matrix <- as.matrix(train.tokens.dfm)

are we ready to train our model and perform prediction?
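On question (1), one possible approach (a sketch, not necessarily the only way) is to join the text columns row by row with paste() before tokenizing, so that each row remains a single document. By contrast, c() stacks the two columns into one vector twice as long, and the doubled comma inside c() is an empty argument, which R rejects. The toy data frame below is invented for illustration, the TEXT column name is hypothetical, and split_hyphens assumes quanteda 2.0 or later (older versions use remove_hyphens, as in the code above).

```r
library(quanteda)

# Toy stand-in for `train`; these rows are invented for illustration only.
train <- data.frame(
  DESCRIPTION = c("An issue is created on a file.",
                  "Lines should be covered by unit-tests."),
  NAME = c("Branches should have sufficient coverage",
           "Lines should have sufficient coverage"),
  stringsAsFactors = FALSE
)

# Join the text columns row by row, so each row stays one document.
# TEXT is a hypothetical column name.
train$TEXT <- paste(train$DESCRIPTION, train$NAME, sep = " ")

# Same preprocessing steps as before, applied once to the combined text.
# Assumption: quanteda >= 2.0, where split_hyphens replaces remove_hyphens.
train.tokens <- tokens(train$TEXT, what = "word",
                       remove_numbers = TRUE, remove_punct = TRUE,
                       remove_symbols = TRUE, split_hyphens = TRUE)
train.tokens <- tokens_tolower(train.tokens)
train.tokens <- tokens_wordstem(train.tokens, language = "english")

# One row per original data row, one column per stemmed term.
train.tokens.dfm <- dfm(train.tokens, tolower = FALSE)
```

If the columns should stay distinguishable as features, an alternative would be to build one DFM per column and combine them, but that is beyond this sketch.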
My data as mentioned also in previous emails:

Rows: 1,819
Columns: 14
$ PLUGIN_RULE_KEY             <chr> "InsufficientBranchCoverage", "InsufficientLin~
$ PLUGIN_CONFIG_KEY           <chr> "", "", "", "", "", "", "", "", "", "", "S1120~
$ PLUGIN_NAME                 <chr> "common-java", "common-java", "common-java", "~
$ DESCRIPTION                 <chr> "An issue is created on a file as soon as the ~
$ SEVERITY                    <chr> "MAJOR", "MAJOR", "MAJOR", "MAJOR", "MAJOR", "~
$ NAME                        <chr> "Branches should have sufficient coverage by t~
$ DEF_REMEDIATION_FUNCTION    <chr> "LINEAR", "LINEAR", "LINEAR", "LINEAR_OFFSET",~
$ REMEDIATION_GAP_MULT        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ DEF_REMEDIATION_BASE_EFFORT <chr> "", "", "", "10min", "", "", "5min", "5min", "~
$ GAP_DESCRIPTION             <chr> "number of uncovered conditions", "number of l~
$ SYSTEM_TAGS                 <chr> "bad-practice", "bad-practice", "convention", ~
$ IS_TEMPLATE                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ DESCRIPTION_FORMAT          <chr> "HTML", "HTML", "HTML", "HTML", "HTML", "HTML"~
$ TYPE                        <chr> "CODE_SMELL", "CODE_SMELL", "CODE_SMELL", "COD~
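On question (2), a rough sketch of what the training and prediction step might look like once a DFM exists. Several things here are assumptions rather than facts about my data: TYPE is taken as the label purely for illustration, the rows and the "BUG" class are invented, the TEXT column is hypothetical, and naive Bayes from the quanteda.textmodels package is just one possible classifier. The one step that seems hard to skip is aligning the test DFM to the training features with dfm_match(), since the model only knows the terms it was trained on.

```r
library(quanteda)
library(quanteda.textmodels)  # assumption: this companion package is installed

# Invented toy data; TYPE as the label is an assumption for illustration.
train <- data.frame(
  TEXT = c("branches should have sufficient coverage",
           "lines should have sufficient coverage",
           "null pointers should not be dereferenced",
           "resources should be closed"),
  TYPE = c("CODE_SMELL", "CODE_SMELL", "BUG", "BUG"),
  stringsAsFactors = FALSE
)
test <- data.frame(TEXT = "files should have sufficient coverage",
                   stringsAsFactors = FALSE)

# Bag-of-words for the training set, then fit a naive Bayes classifier.
train.dfm <- dfm(tokens(train$TEXT))
nb <- textmodel_nb(train.dfm, y = train$TYPE)

# The test DFM must have exactly the training features, in the same order:
# unseen terms are dropped and missing terms are filled with zero counts.
test.dfm <- dfm(tokens(test$TEXT))
test.dfm <- dfm_match(test.dfm, features = featnames(train.dfm))

# Predicted class for each test document.
pred <- predict(nb, newdata = test.dfm)
```

In practice the same preprocessing (lowercasing, stemming, and so on) would have to be applied to the test text before dfm_match().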