Preetam Pal
2016-Apr-30 22:34 UTC
[R] Data Issues with ctree/glm and controlling classification parameters
Hi,

I have a dataset obtained as:

mydata <- read.csv("data.csv", header = TRUE)

which contains the binary response 'y' (0 or 1) and another variable 'weight' (a numerical variable taking fractional values between 0 and 1).

1> I want to first apply ctree() to mydata, but I don't want to use the 'weight' variable in the tree-building process. Can you please suggest how to do this? Please note, I *don't* want to delete/remove this variable from mydata.

2> Another question: say I split mydata into train (80%) and test (20%) as:

d <- sort(sample(nrow(mydata), nrow(mydata) * 0.8))
train <- mydata[d, ]
test <- mydata[-d, ]

Then I fit a weighted glm (essentially, logistic regression) on train:

# Build GLM model on train data
model <- glm(y ~ ., data = train, weights = train$weight, family = binomial)   # ---- (A)

# Apply model to test
score <- predict(model, type = "response", test)                               # ---- (B)

# Get classification for each observation in test as 'positive' or 'negative'
classify <- performance(score, "tpr", "fpr")                                   # ---- (C)

My questions here are:

2a> Again, how do I proceed if I don't want to use the variable 'weight' as a regressor in the glm() call in (A) above, while still using all the other variables in train?

2b> In steps (B) and (C), how do I control the classification rule? For example, R might classify observations with model-fitted probability > 0.5 as 'positive' and <= 0.5 as 'negative'. Is there a way to change this threshold to, say, 0.75 instead of whatever R might be using (0.5 is just an example)?

Thank you in advance for your help.

-Preetam

--
Preetam Pal
(+91)-9432212774
M-Stat 2nd Year, Room No. N-114
Statistics Division, C.V. Raman Hall
Indian Statistical Institute, B.H.O.S.
Kolkata.
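[For 1> and 2a>, one possible approach is to keep 'weight' in the data frame but leave it out of the model formula, rather than using y ~ . everywhere. A minimal, untested sketch follows; it assumes ctree() comes from the party package and that the column names 'y' and 'weight' are as described above.]

library(party)

# Build a formula from every column except the response and 'weight'
predictors <- setdiff(names(mydata), c("y", "weight"))
fml <- reformulate(predictors, response = "y")   # e.g. y ~ x1 + x2 + ...

# Tree on all variables except 'weight'; 'weight' stays in mydata untouched
# (note: ctree() treats a numeric 0/1 y as a regression target; use factor(y)
#  in the data if a classification tree is wanted)
tree_fit <- ctree(fml, data = mydata)

# The same formula can be reused for the glm() call in (A)
model <- glm(fml, data = train, weights = train$weight, family = binomial)

# For glm() alone, the shorter form  y ~ . - weight  also drops that column
model2 <- glm(y ~ . - weight, data = train, weights = train$weight,
              family = binomial)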
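[For 2b>, predict(..., type = "response") returns fitted probabilities rather than class labels, so no 0.5 rule is applied by predict() itself; the cutoff is whatever you apply to the scores. If performance() in (C) is ROCR's, it expects a prediction object and computes tpr/fpr across all cutoffs rather than classifying at one. A rough sketch under those assumptions, reusing the model/test objects and the illustrative labels 'positive'/'negative' from above:]

library(ROCR)

# (B) fitted probabilities for the test rows
score <- predict(model, newdata = test, type = "response")

# Classify with a chosen cutoff (0.75 here); change 'cutoff' to move the rule
cutoff <- 0.75
classify <- ifelse(score > cutoff, "positive", "negative")

# ROCR's performance() works on a prediction object and sweeps tpr/fpr
# over all possible cutoffs, which can help in choosing one
pred <- prediction(score, test$y)
roc  <- performance(pred, "tpr", "fpr")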