James C Schopf
2023-Aug-11 10:20 UTC
[R] Different TFIDF settings in test set prevent testing model
Hello, I'd be very grateful for your help.

I randomly separated a .csv file with 1287 documents 75%/25% into 2 csv files, one for training an algorithm and the other for testing the algorithm. I applied similar preprocessing, including TFIDF transformation, to both sets, but R won't let me make predictions on the test set due to a different TFIDF matrix.

I get the error message:

Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type "nmatrix.27118" was supplied

I'd greatly appreciate a suggestion to overcome this problem. Thanks!

Here's my R code:

> library(tidyverse)
> library(tidytext)
> library(caret)
> library(kernlab)
> library(tokenizers)
> library(tm)
> library(e1071)
***LOAD TRAINING SET/959 rows with text in column1 and yes/no in column2 (labelled M2)
> url <- "D:/test/M2_75.csv"
> d <- read_csv(url)
***CREATE TEXT CORPUS FROM TEXT COLUMN
> train_text_corpus <- Corpus(VectorSource(d$Text))
***DEFINE TOKENS FOR EACH DOCUMENT IN CORPUS AND COMBINE THEM
> tokenize_document <- function(doc) {
+   doc_tokens <- unlist(tokenize_words(doc))
+   doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
+   doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
+   all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
+   return(all_tokens)
+ }
***APPLY TOKENS TO DOCUMENTS
> all_train_tokens <- lapply(train_text_corpus, tokenize_document)
***CREATE A DTM FROM THE TOKENS
> train_text_dtm <- DocumentTermMatrix(Corpus(VectorSource(all_train_tokens)))
***TRANSFORM THE DTM INTO A TF-IDF MATRIX
> train_text_tfidf <- weightTfIdf(train_text_dtm)
***CREATE A NEW DATA FRAME WITH M2 COLUMN FROM ORIGINAL DATA
> trainData <- data.frame(M2 = d$M2)
***ADD NEW TFIDF transformed TEXT COLUMN NEXT TO DATA FRAME
> trainData$text_tfidf <- I(as.matrix(train_text_tfidf))
***DEFINE THE ML MODEL
> ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2, classProbs = TRUE)
***TRAIN SVM
> model_svmRadial <- train(M2 ~ ., data = trainData, method = "svmRadial", trControl = ctrl)
***SAVE SVM
> saveRDS(model_svmRadial, file = "D:/SML/model_M23_svmRadial_UP.RDS")

R code on my test set, which didn't work at the last step:

***LOAD TEST SET/ 309 rows with text in column1 and yes/no in column2 (labelled M2)
> url <- "D:/test/M2_25.csv"
> d <- read_csv(url)
***CREATE TEXT CORPUS FROM TEXT COLUMN
> test_text_corpus <- Corpus(VectorSource(d$Text))
***DEFINE TOKENS FOR EACH DOCUMENT IN CORPUS AND COMBINE THEM
> tokenize_document <- function(doc) {
+   doc_tokens <- unlist(tokenize_words(doc))
+   doc_bigrams <- unlist(tokenize_ngrams(doc, n = 2))
+   doc_trigrams <- unlist(tokenize_ngrams(doc, n = 3))
+   all_tokens <- c(doc_tokens, doc_bigrams, doc_trigrams)
+   return(all_tokens)
+ }
***APPLY TOKEN TO DOCUMENTS
> all_test_tokens <- lapply(test_text_corpus, tokenize_document)
***CREATE A DTM FROM THE TOKENS
> test_text_dtm <- DocumentTermMatrix(Corpus(VectorSource(all_test_tokens)))
***TRANSFORM THE DTM INTO A TF-IDF MATRIX
> test_text_tfidf <- weightTfIdf(test_text_dtm)
***CREATE A NEW DATA WITH M2 COLUMN FROM ORIGINAL TEST DATA
> testData <- data.frame(M2 = d$M2)
***ADD NEW TFIDF transformed TEXT COLUMN NEXT TO TEST DATA
> testData$text_tfidf <- I(as.matrix(test_text_tfidf))
***LOAD OLD MODEL
> model_svmRadial <- readRDS("D:/SML/model_M2_75_svmRadial.RDS")
***MAKE PREDICTIONS
> predictions <- predict(model_svmRadial, newdata = testData)

This last line produces the error message:

Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type "nmatrix.27118" was supplied

Please help. Thanks!
Bert Gunter
2023-Aug-11 15:09 UTC
[R] Different TFIDF settings in test set prevent testing model
I know nothing about tf, etc., but can you not simply read the whole file into R and then randomly split it using R? The training and test sets would simply be defined by a single random sample of subscripts, each of which is either chosen or not. E.g. (simplified example; you would be subsetting the rows of your full dataset):

> x <- 1:10
> samp <- sort(sample(x, 5))
> x[samp] ## training
[1] 3 4 6 7 8
> x[-samp] ## test
[1]  1  2  5  9 10

Apologies if my ignorance means this can't work.

Cheers,
Bert

On Fri, Aug 11, 2023 at 7:17 AM James C Schopf <jcschopf at hotmail.com> wrote:
> Hello, I'd be very grateful for your help.
>
> I randomly separated a .csv file with 1287 documents 75%/25% into 2 csv
> files, one for training an algorithm and the other for testing the
> algorithm. I applied similar preprocessing, including TFIDF
> transformation, to both sets, but R won't let me make predictions on the
> test set due to a different TFIDF matrix.
> I get the error message:
>
> Error: variable 'text_tfidf' was fitted with type "nmatrix.67503" but type
> "nmatrix.27118" was supplied
>
> I'd greatly appreciate a suggestion to overcome this problem.
> Thanks!
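A minimal sketch of the index-based split suggested in this reply, applied to the rows of a data frame rather than a plain vector. The data frame `full`, its toy contents, and the seed are hypothetical stand-ins, not the poster's data; only the column names `Text` and `M2` come from the original post.

```r
# Hypothetical stand-in for the full data set (the real one has 1287 rows).
set.seed(42)  # arbitrary seed, only so the split is reproducible
full <- data.frame(
  Text = paste("document", 1:20),
  M2   = rep(c("yes", "no"), times = 10),
  stringsAsFactors = FALSE
)

# Draw 75% of the row indices for training; the rest become the test set.
n_train   <- floor(0.75 * nrow(full))
train_idx <- sort(sample(nrow(full), n_train))

train_set <- full[train_idx, ]
test_set  <- full[-train_idx, ]
```

The same `train_idx` could also subset the rows of a TF-IDF matrix computed once on the whole corpus, in which case both subsets would share one set of columns.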
Ivan Krylov
2023-Aug-11 15:49 UTC
[R] Different TFIDF settings in test set prevent testing model
On Fri, 11 Aug 2023 10:20:27 +0000, James C Schopf <jcschopf at hotmail.com> wrote:

> > train_text_dtm <-
> > DocumentTermMatrix(Corpus(VectorSource(all_train_tokens)))

> > test_text_dtm <-
> > DocumentTermMatrix(Corpus(VectorSource(all_test_tokens)))

I understand the need to prepare the test dataset separately (e.g. in order to be able to work with text that doesn't exist at the time when the model is trained), but since the model has no representation for tokens it (well, the tokeniser) hasn't seen during the training process, you have to ensure that test_text_dtm references exactly the same tokens as train_text_dtm, in the same order of the columns. That is what the error is reporting: the model was fitted on a matrix with 67503 columns, while the test matrix has 27118.

Also, it probably makes sense to reuse the term weights (the inverse document frequencies) learned on the training document set; otherwise you may be importance-weighting different tokens than the ones your SVM has learned as important if your test set has a significantly different distribution from that of the training set.

Bert is probably right: with the API given by the tm package, it seems easiest to tokenise and weight the document-term matrices first, then split them into the train and test subsets. It may be worth asking the maintainer about applying previously "learned" transformations to new corpora.

--
Best regards,
Ivan
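One possible way to get a test matrix with the same columns as the training matrix, and to weight it with the training IDF, is sketched below. It is a sketch under assumptions, not the poster's pipeline: the toy documents and the helper `tfidf_with()` are hypothetical, it uses `VCorpus` with tm's default tokenisation instead of the custom n-gram tokeniser, and the IDF is computed by hand in a form intended to mirror `weightTfIdf()` (length-normalised term frequency times log2 of documents over document frequency). The `dictionary` control option of `DocumentTermMatrix()` is what restricts the test matrix to the training vocabulary.

```r
library(tm)

# Hypothetical toy documents standing in for the training and test text.
train_docs <- c("the cat sat on the mat", "the dog chased the cat")
test_docs  <- c("a dog sat on a log", "the bird sang")

train_dtm <- DocumentTermMatrix(VCorpus(VectorSource(train_docs)))

# Restrict the test matrix to the training vocabulary: unseen test tokens
# are dropped, and training tokens absent from the test documents become
# all-zero columns.
test_dtm <- DocumentTermMatrix(
  VCorpus(VectorSource(test_docs)),
  control = list(dictionary = Terms(train_dtm))
)

train_counts <- as.matrix(train_dtm)
# Index by name so the column order matches the training matrix.
test_counts <- as.matrix(test_dtm)[, colnames(train_counts), drop = FALSE]

# IDF estimated from the training documents only.
idf <- log2(nrow(train_counts) / colSums(train_counts > 0))

# Hypothetical helper: length-normalised term frequency times a given IDF.
tfidf_with <- function(counts, idf) {
  doc_len <- pmax(rowSums(counts), 1)  # avoid dividing by zero for empty rows
  sweep(counts / doc_len, 2, idf, `*`)
}

train_tfidf <- tfidf_with(train_counts, idf)
test_tfidf  <- tfidf_with(test_counts, idf)
```

Here `test_tfidf` has the same columns, in the same order, as `train_tfidf`, so a model fitted on the latter would at least receive a matrix of the expected shape. Whether caret's formula interface accepts it in the same `I(as.matrix(...))` form as in the original code is not checked by this sketch.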