nguy2952 University of Minnesota
2018-Jun-06 02:35 UTC
[R] Decision Tree Issue: Why does tree() not pick all variables for the nodes?
I am working on a project at my workplace and I am running into some issues with my decision tree analysis. THIS IS NOT A HOMEWORK ASSIGNMENT.

Sample dataset:

PRODUCT_SUB_LINE_DESCR  MAJOR_CATEGORY_DESCR  CUST_REGION_DESCR     Sales   QtySold  MFGCOST  MarginDollars  new_ProductName
SUNDRY                  SMALL EQUIP           NORTH EAST REGION     209.97    3      134.55    72.72         no
SUNDRY                  SMALL EQUIP           SOUTH EAST REGION     -76.15   -1      -44.85   -30.4          no
SUNDRY                  SMALL EQUIP           SOUTH EAST REGION     275.6     2      162.5    109.84         no
SUNDRY                  SMALL EQUIP           NORTH EAST REGION     138.7     1       81.25    55.82         no
SUNDRY                  PREVENTIVE            SOUTH CENTRAL REGION  226       2      136       87.28         no
SUNDRY                  PREVENTIVE            SOUTH EAST REGION     115       1       68       45.64         no
SUNDRY                  PREVENTIVE            SOUTH EAST REGION     210.7     2      136       71.98         no
SUNDRY                  SMALL EQUIP           NORTH CENTRAL REGION   29       1       18.85     9.77         no
SUNDRY                  SMALL EQUIP           MOUNTAIN WEST REGION   29       1       18.85     9.77         no
SUNDRY                  SMALL EQUIP           MOUNTAIN WEST REGION   46.32    2       37.7      7.86         no
SUNDRY                  COMPOSITE             NORTH CENTRAL REGION  159.86    1      132.4     24.81         no
SUNDRY                  COMPOSITE             NORTH CENTRAL REGION  441.3     2      264.8    171.2          no
SUNDRY                  COMPOSITE             OHIO VALLEY REGION    209.62    1      132.4     74.57         no
SUNDRY                  COMPOSITE             NORTH EAST REGION     209.62    1      132.4     74.57         no

1) My tree has only two terminal nodes, and only one variable is used; here is the output:

> summary(tree_model)

Classification tree:
tree(formula = new_ProductName ~ ., data = training_data)
Variables actually used in tree construction:
[1] "PRODUCT_SUB_LINE_DESCR"
Number of terminal nodes:  2
Residual mean deviance:  0 = 0 / 41140
Misclassification error rate: 0 = 0 / 41146

2) I created a new data frame that keeps only factors with fewer than 22 levels. There is one factor with 25 levels, but tree() does not give an error, so I think the algorithm accepts 25 levels.

> str(new_Dataset)
'data.frame':   51433 obs. of  7 variables:
 $ PRODUCT_SUB_LINE_DESCR: Factor w/ 3 levels "Handpieces","PRIVATE LABEL",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ MAJOR_CATEGORY_DESCR  : Factor w/ 25 levels "AIR ABRASION",..: 23 23 23 23 21 21 21 23 23 23 ...
 $ CUST_REGION_DESCR     : Factor w/ 7 levels "MOUNTAIN WEST REGION",..: 3 6 6 3 5 6 6 2 1 1 ...
 $ Sales                 : num  210 -76.2 275.6 138.7 226 ...
 $ QtySold               : int  3 -1 2 1 2 1 2 1 1 2 ...
 $ MFGCOST               : num  134.6 -44.9 162.5 81.2 136 ...
 $ MarginDollars         : num  72.7 -30.4 109.8 55.8 87.3 ...

3) Here is how I set up my analysis:

library(tree)

# I chose product name as my main attribute (maybe that is why it appears at the root node?)
new_ProductName = ifelse(new_Dataset$PRODUCT_SUB_LINE_DESCR == "PRIVATE LABEL", "yes", "no")
data = data.frame(new_Dataset, new_ProductName)

set.seed(100)
train = sample(1:nrow(data), 0.8 * nrow(data))  # training row indices
training_data = data[train, ]                   # training data
testing_data  = data[-train, ]                  # testing data

# fit the tree model using the training data
tree_model = tree(new_ProductName ~ ., data = training_data)
summary(tree_model)
plot(tree_model)
text(tree_model, pretty = 0)

out = predict(tree_model)  # predict on the training data
# actuals
input.newproduct = as.character(training_data$new_ProductName)
# predicted
pred.newproduct = colnames(out)[max.col(out, ties.method = "first")]
mean(input.newproduct != pred.newproduct)  # misclassification rate

# cross-validation to see how much we need to prune the tree
set.seed(400)
cv_Tree = cv.tree(tree_model, FUN = prune.misclass)  # run cross-validation
plot(cv_Tree)
plot(cv_Tree$size, cv_Tree$dev, type = "b")
# set size corresponding to the lowest deviance in the plot above
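(Side note: instead of eyeballing the plot, I think the best size can also be read straight out of the cv.tree() result. This is only a rough sketch using the cv_Tree object from above; best_size is just a placeholder name I made up:

  # smallest tree size that achieves the lowest cross-validated deviance
  best_size = min(cv_Tree$size[cv_Tree$dev == min(cv_Tree$dev)])
  best_size

and that number would then go into the best argument of prune.misclass() below. I am not sure whether that is the recommended way to do it.)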
treePruneMod = prune.misclass(tree_model, best = 9)
plot(treePruneMod)
text(treePruneMod, pretty = 0)

out = predict(treePruneMod)  # predict on the training data with the pruned tree
# predicted
pred.newproduct = colnames(out)[max.col(out, ties.method = "random")]
# calculate the misclassification error
mean(training_data$new_ProductName != pred.newproduct)

# predict testing_data with the pruned tree
out = predict(treePruneMod, testing_data, type = "class")

4) I have never done this before. I watched a couple of YouTube videos and started from there. I welcome advice, explanation, and criticism, and I would appreciate help working through this process. It has been challenging for me.

> table(data$PRODUCT_SUB_LINE_DESCR, data$new_ProductName)

                    no   yes
  Handpieces       164     0
  PRIVATE LABEL      0 14802
  SUNDRY         36467     0

Best,
Hugh N
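P.S. For completeness, this is roughly how I was planning to score the pruned tree on the held-out data (a sketch using the objects defined above; conf is just a name I made up):

  # confusion matrix and misclassification rate on the test set
  out  = predict(treePruneMod, testing_data, type = "class")
  conf = table(predicted = out, actual = testing_data$new_ProductName)
  conf
  mean(out != testing_data$new_ProductName)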