thr3ads.net - R help - [R] transaction list transformation to use rpart. [Mar 2011]

John Dennison

2011-Mar-06 22:34 UTC

[R] transaction list transformation to use rpart.

So there are a couple of parts to this question. I am trying to apply the
rpart/random forest algorithms to transaction lists. That is to say, I am
trying to train models that deduce which transactions in a customer's
history are most predictive, in order to apply those models to future test
data and identify accounting irregularities (i.e. this account had x and y,
so it should also have had z). I have used the arules package with some
success, but its output does not show which independent transactions are
most telling, just rates of co-occurrence (i.e. x appears with y 75% of the
time, not that x and y OR a and b should also have z). These independent
transaction groupings are potentially very meaningful.

OK, now to the actual R questions.

I can load my transaction lists using the read.transactions function in
arules from data like

Customer-ID | Item-ID
cust1           | 2
cust1           | 3
cust1           | 5
cust2          | 5
cust2          | 3
cust3         | 2
...

# read the data into a sparse binary transaction matrix
txn <- read.transactions(file = "tranaction_list.txt", rm.duplicates = TRUE,
                         format = "single", sep = "|", cols = c(1, 2))

# transaction matrix to matrix
a <- as(txn, "matrix")

# matrix to data.frame
b <- as.data.frame(a)

I end up with a data.frame like:

X       X.1 X.2  X.3 X.4 X.5 ...
cust1  0    1   1    0    1
cust2  0    0   1    0    1
cust3  0    1   0    0    0
...

However, as.data.frame(a) transforms the matrix into a numeric data.frame,
so when I run the rpart algorithm it automatically returns a
regression-style classification tree.

Calling rpart like

names<-colnames(b)

tree_X.9911 <- rpart(X.9911 ~ .,
    data=b[, c(names)],
    method="class")

returns:

1) root 20000 625 0 (0.96875000 0.03125000)
   2) X.9342< 0.5 19598 311 0 (0.98413103 0.01586897) *
   3) X.9342>=0.5 402  88 1 (0.21890547 0.78109453)
     6) X.9984>=0.5 81   7 0 (0.91358025 0.08641975) *
     7) X.9984< 0.5 321  14 1 (0.04361371 0.95638629)
      14) X.9983>=0.5 14   0 0 (1.00000000 0.00000000) *
      15) X.9983< 0.5 307   0 1 (0.00000000 1.00000000)

I understand that it would treat the numeric columns with a regression
approach, but is there any way to force it to view them as logical (yes/no
or T/F) codes? I can't successfully transform the data.frame to a factor. I
tried:

b_factor<-as.factor(b)
Error in sort.list(y) :
  'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?

Furthermore, I am fearful that if I am ever successful I will compound my
memory problems. 20,000 rows by 3,000 columns (a substantial subset of the
total training data) is already causing my 8 GB Linux box to moan. Does a
factor/logical column take up more room than a numeric column populated by
1s and 0s? I remember reading that R stores its factors as numeric anyway.
I know that more precise variable selection could reduce my memory usage,
but that is what rpart is good at: highlighting the most
meaningful/predictive variables, so that I may work with larger numbers of
customers by removing uninformative variables.
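
As a rough check on the memory question, the sketch below compares the size of the same 0/1 data stored as double, integer, logical and factor vectors. R stores doubles in 8 bytes per element, while integers, logicals and factor codes each take 4 bytes, so 20,000 x 3,000 values need about 480 MB as doubles and about 240 MB in the other types. The vector names and the length `n` are made up for illustration.

```r
# Compare storage for the same 0/1 values in four representations.
n <- 1e5
x_double  <- as.numeric(rep(c(0, 1), length.out = n))  # 8 bytes per element
x_integer <- as.integer(x_double)                      # 4 bytes per element
x_logical <- x_double == 1                             # 4 bytes per element
x_factor  <- factor(x_double, levels = c(0, 1))        # integer codes plus levels

sizes <- sapply(
  list(double = x_double, integer = x_integer,
       logical = x_logical, factor = x_factor),
  function(x) as.numeric(object.size(x))
)
print(sizes)

# Back-of-envelope size of a 20,000 x 3,000 table, in bytes
rows <- 20000
cols <- 3000
bytes_as_double <- rows * cols * 8
bytes_as_factor <- rows * cols * 4
```

So a factor or logical column should not take more room than a double column holding 1s and 0s; if anything it takes roughly half.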

My final question: is there a better approach to loading the data? rpart
only works on a data.frame (to the best of my knowledge). How can one coerce
a list into a form where predictive models can be applied?
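
One possible alternative to going through arules, sketched here with base R only: read the long customer/item list into a data.frame and cross-tabulate it into one row per customer and one indicator column per item. This is a swapped-in technique, not what read.transactions does internally. The object `txn_long`, its column names and the `item_` prefix are hypothetical, and a dense table like this may not scale to very large item sets.

```r
# Hypothetical long-format data mirroring the Customer-ID | Item-ID example
txn_long <- data.frame(
  customer = c("cust1", "cust1", "cust1", "cust2", "cust2", "cust3"),
  item     = c(2, 3, 5, 5, 3, 2)
)

# Cross-tabulate: rows are customers, columns are items, cells are counts
counts <- table(txn_long$customer, txn_long$item)
indicator <- as.data.frame.matrix(counts)

# Collapse counts to present/absent and store each column as a 0/1 factor
indicator[] <- lapply(indicator, function(col) {
  factor(as.integer(col > 0), levels = c(0, 1))
})

# Prefix the item IDs so the column names are valid in a model formula
names(indicator) <- paste0("item_", names(indicator))
print(indicator)
```

The result is an ordinary data.frame of two-level factors, which is the shape rpart expects for its `data` argument.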

I am new to the world of R, and to data mining for that matter. I am loving
the diversity of applications and would appreciate any help.

Many Thanks,

John


Allan Engelhardt

2011-Mar-07 09:29 UTC


[R] transaction list transformation to use rpart.

On 06/03/11 22:34, John Dennison wrote:
> [...]
> from data like
>
> Customer-ID | Item-ID
> cust1           | 2
> cust1           | 3
> cust1           | 5
> cust2          | 5
> cust2          | 3
> cust3         | 2
> ...
>
> #read in data to a sparse binary transaction matrix
> txn = read.transactions(file="tranaction_list.txt", rm.duplicates= TRUE,
> format="single",sep="|",cols =c(1,2));
>
> #tranaction matrix to matrix
> a<-as(txn, "matrix")
>
> #matrix to data.frame
> b<-as.data.frame(a)
>
> I end up with a data.frame like:
>
> X       X.1 X.2  X.3 X.4 X.5 ...
> cust1  0    1   1    0    1
> cust2  0    0   1    0    1
> cust3  0    1   0    0    0
> ...
>
>   However the as.data.frame(a) transforms the matrix into a numeric
> data.frame so when I implement the rpart algorithm it automatically returns
> a regression classification tree.
I am not sure your approach with rpart is going to give you what you are 
looking for, but on to your R question:
> [...] I can't successfully transform the data.frame to a factor. i
> tried:
>
> b_factor<-as.factor(b)
> Error in sort.list(y) :
>    'x' must be atomic for 'sort.list'
> Have you called 'sort' on a list?
You need to do each column individually, i.e. b_factor$X.1 <-
as.factor(b$X.1), or

> str( as.data.frame(lapply(b, as.factor)) )
'data.frame':    4 obs. of  4 variables:
  $ X.2      : Factor w/ 2 levels "0","1": 2 1 2 1
  $ X.3      : Factor w/ 2 levels "0","1": 2 2 1 1
  $ X.5      : Factor w/ 2 levels "0","1": 2 2 1 1
  $ X.Item.ID: Factor w/ 2 levels "0","1": 1 1 1 2
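
A minimal end-to-end sketch of the column-wise conversion followed by a classification fit, assuming the rpart package is installed. The toy data stands in for `b` and its values are invented; only the column names echo the example above. Note that with method="class" rpart treats the response as a class label, and a split such as X.2 < 0.5 on a numeric 0/1 column partitions the rows exactly as a split on the equivalent two-level factor would.

```r
library(rpart)

set.seed(1)
# Toy 0/1 indicator data standing in for `b` (hypothetical values)
x2 <- rbinom(200, 1, 0.5)
x3 <- rbinom(200, 1, 0.5)
b <- data.frame(
  X.2 = x2,
  X.3 = x3,
  X.5 = as.integer(x2 == 1 & x3 == 1)  # made-up rule: X.5 occurs with X.2 and X.3
)

# Convert every column to a two-level factor in one pass
b_factor <- as.data.frame(lapply(b, factor, levels = c(0, 1)))
str(b_factor)

# Fit a classification tree for one item against all the others
fit <- rpart(X.5 ~ ., data = b_factor, method = "class")
print(fit)
```

Fixing `levels = c(0, 1)` keeps both levels present even in a column where only one value happens to occur.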


Also have a look at as(txn, "data.frame") for a different format that 
may (with some clean up) be easier to use.

> as(txn, "data.frame")
  transactionID      items
1         cust1 { 2, 3, 5}
2         cust2    { 3, 5}
3         cust3       { 2}
4   Customer-ID { Item-ID}


Hope this helps a little.

Allan
