So there are a couple of parts to this question. I am trying to implement the rpart/random forest algorithms on transaction lists. That is to say, I am trying to train models to deduce which transactions within a customer's history are the most predictive, in order to apply the model to future test data and identify accounting irregularities (i.e. this account had x and y, so it should also have had z). I have used the arules package with some success, but the output does not deduce which independent transactions are most telling, just rates of co-occurrence (i.e. x appears with y 75% of the time, not that x and y OR a and b should also have z). These independent transaction groupings are potentially very meaningful.

OK, now to the actual R questions. I can load my transaction lists using the read.transactions function in arules from data like

    Customer-ID | Item-ID
    cust1 | 2
    cust1 | 3
    cust1 | 5
    cust2 | 5
    cust2 | 3
    cust3 | 2
    ...

    #read in data to a sparse binary transaction matrix
    txn = read.transactions(file="tranaction_list.txt", rm.duplicates= TRUE,
                            format="single",sep="|",cols =c(1,2));

    #transaction matrix to matrix
    a<-as(txn, "matrix")

    #matrix to data.frame
    b<-as.data.frame(a)

I end up with a data.frame like:

    X X.1 X.2 X.3 X.4 X.5 ...
    cust1 0 1 1 0 1
    cust2 0 0 1 0 1
    cust3 0 1 0 0 0
    ...

However, as.data.frame(a) transforms the matrix into a numeric data.frame, so when I implement the rpart algorithm it automatically returns a regression classification tree.
I am calling rpart like

    names<-colnames(b)
    tree_X.9911 <- rpart(X.9911 ~ ., data=b[, c(names)], method="class")

and it returns:

    1) root 20000 625 0 (0.96875000 0.03125000)
      2) X.9342< 0.5 19598 311 0 (0.98413103 0.01586897) *
      3) X.9342>=0.5 402 88 1 (0.21890547 0.78109453)
        6) X.9984>=0.5 81 7 0 (0.91358025 0.08641975) *
        7) X.9984< 0.5 321 14 1 (0.04361371 0.95638629)
          14) X.9983>=0.5 14 0 0 (1.00000000 0.00000000) *
          15) X.9983< 0.5 307 0 1 (0.00000000 1.00000000)

I understand that it would approach the numeric columns with a regression approach, but is there any way to force it to view them as logical (yes/no or T/F) codes? I can't successfully transform the data.frame to a factor. I tried:

    b_factor<-as.factor(b)
    Error in sort.list(y) :
      'x' must be atomic for 'sort.list'
    Have you called 'sort' on a list?

Furthermore, I am fearful that if I am ever successful I will compound my memory problems. 20,000 rows by 3,000 columns (a substantial subset of the total training data) is already causing my 8 GB Linux box to moan. Does a factor/logical column take up more room than a numeric column populated by 1s and 0s? I remember reading that R stores its factors as numeric anyway. I know that more precise variable selection could reduce my memory usage, but that is what rpart is good at: highlighting the most meaningful/predictive variables, so I may be able to work with larger numbers of customers by removing uninformative variables.

My final question: is there a better approach to loading the data? rpart only works on a data.frame (to the best of my knowledge). How can one coerce a list to a form where predictive models can be applied?

I am new to the world of R, and to data mining for that matter. I am loving the diversity of applications and would appreciate any help.

Many Thanks,

John
On 06/03/11 22:34, John Dennison wrote:
> [...]
> from data like
>
> Customer-ID | Item-ID
> cust1 | 2
> cust1 | 3
> cust1 | 5
> cust2 | 5
> cust2 | 3
> cust3 | 2
> ...
>
> #read in data to a sparse binary transaction matrix
> txn = read.transactions(file="tranaction_list.txt", rm.duplicates= TRUE,
> format="single",sep="|",cols =c(1,2));
>
> #tranaction matrix to matrix
> a<-as(txn, "matrix")
>
> #matrix to data.frame
> b<-as.data.frame(a)
>
> I end up with a data.frame like:
>
> X X.1 X.2 X.3 X.4 X.5 ...
> cust1 0 1 1 0 1
> cust2 0 0 1 0 1
> cust3 0 1 0 0 0
> ...
>
> However the as.data.frame(a) transforms the matrix into a numeric
> data.frame so when I implement the rpart algorithm it automatically returns
> a regression classification tree.

I am not sure your approach with rpart is going to give you what you are looking for, but on to your R question:

> [...] I can't successfully transform the data.frame to a factor. i
> tried:
>
> b_factor<-as.factor(b)
> Error in sort.list(y) :
> 'x' must be atomic for 'sort.list'
> Have you called 'sort' on a list?

You need to do each column individually, i.e.

    b_factor$X.1 <- as.factor(b$X.1)

or

    > str( as.data.frame(lapply(b, as.factor)) )
    'data.frame': 4 obs. of 4 variables:
     $ X.2      : Factor w/ 2 levels "0","1": 2 1 2 1
     $ X.3      : Factor w/ 2 levels "0","1": 2 2 1 1
     $ X.5      : Factor w/ 2 levels "0","1": 2 2 1 1
     $ X.Item.ID: Factor w/ 2 levels "0","1": 1 1 1 2

Also have a look at as(txn, "data.frame") for a different format that may (with some clean up) be easier to use.

    > as(txn, "data.frame")
      transactionID items
    1 cust1 { 2, 3, 5}
    2 cust2 { 3, 5}
    3 cust3 { 2}
    4 Customer-ID { Item-ID}

Hope this helps a little.

Allan
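
A minimal sketch of the per-column conversion suggested in the reply, together with a rough check on the memory question from the original post. The small matrix `a` below is a hand-built stand-in for `as(txn, "matrix")` (an assumption, so that the example runs without arules or the transaction file), and `x_num`, `x_fac` and `x_lgl` are hypothetical names used only for the size comparison. Assigning into `b_factor[]` is one way to convert every column while keeping the customer IDs as row names; fixing `levels = c(0, 1)` is a design choice here so that every column gets both levels even if an item never occurs in a given subset.

```r
# Toy stand-in for as(txn, "matrix"): one row per customer, one 0/1 column per item
a <- matrix(c(1, 1, 1,
              0, 1, 1,
              1, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("cust1", "cust2", "cust3"),
                            c("X.2", "X.3", "X.5")))
b <- as.data.frame(a)

# Convert every column to a two-level factor; b_factor[] <- keeps the
# data.frame structure (including row names) and only replaces the columns
b_factor <- b
b_factor[] <- lapply(b, factor, levels = c(0, 1))
str(b_factor)

# Rough memory comparison on a longer 0/1 vector:
# doubles use 8 bytes per element, while factor codes (integers)
# and logicals use 4 bytes per element
x_num <- rep(c(0, 1), 5e4)
x_fac <- factor(x_num, levels = c(0, 1))
x_lgl <- as.logical(x_num)
print(object.size(x_num))
print(object.size(x_fac))
print(object.size(x_lgl))
```

On that basis a factor or logical column should need roughly half the storage of a double column of 0s and 1s, although the conversion itself makes a temporary copy of the data, so peak memory use during the conversion can still be higher.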