thr3ads.net - R help - [R] coding to generate a matrix to prepare for chi-sqr test for text mining [Jun 2005]

Huntsinger, Reid

2005-Jun-15 22:27 UTC

[R] coding to generate a matrix to prepare for chi-sqr test for text mining

I would compile a table of all the words in the dataset (maybe you have it
already), then create a list where each component is an integer vector of
indices of words. That is, replace words by their positions in the table.
From that sparse form you could create binary features to use with standard
classification methods, or for example compute the X'X matrix for linear
regression directly (you would probably want to throw out infrequently
occurring words to keep the matrix small enough to work with in memory). For
your specific question, say "words" is the list of integer vectors as above,
and "class" is the vector of class labels (1 or 2 to make it a valid index)
corresponding to a given vector. Then you can fill in the "present" (==1)
parts of the table class x presence x word via


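# Third dimension is indexed by word, so it needs the number of distinct
# words (the length of your word table), not the number of observations.
nwords <- max(unlist(words))
tab <- array(0L, dim = c(2, 2, nwords))

for (i in seq_along(words)) {
  # unique() so a word repeated within one observation is counted once
  for (word in unique(words[[i]])) {
    tab[class[i], 1, word] <- tab[class[i], 1, word] + 1L
  }
}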

and the "absent" (==2) parts are then easy:

tab[1,2,] <- sum(class == 1) - tab[1,1,]
tab[2,2,] <- sum(class == 2) - tab[2,1,] 

so now you can use chisq.test on each of the 2 x 2 tables tab[,,i], for i a
word index, or on all of them at once using apply() over the third dimension
if convenient.
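
As a rough end-to-end sketch of the steps above (not tested on the full
dataset), the following parses the sample records from the original
question, builds "words" and "class", fills the table and collects one
p-value per word. The names lines, fields, tokens, vocab and pvals are
made up for illustration, and the 0/1 labels are shifted to 1/2 so they can
be used as indices. With so few observations chisq.test warns that the
approximation may be poor, so the warnings are suppressed here for the toy
data only.

```r
# Sample records copied from the original question: id | words... | class label
lines <- c(
  "1412|WINDOW|SHATTER|TORN|SOFT|TOP|WATER|RAIN|LAB|AI|BOLL|CAMP|0",
  "1413|PARK|IV|STRUCK|PARK|PUSH|COD|POLICI|CIA|TB|SIC|0",
  "2412|ACCID|REAREND|MULTI|EH|IV|MIDDL|FAN|DUAL|LOSS|CALM|1",
  "2414|IV|REAREND|CD|COG|LAB|ADVERS|1",
  "2415|ACCID|SINGL|VEHICL|IV|SWERV|AVOID|OBJECT|STRUCK|PHONE|POLE|FAN|0",
  "2417|ACCID|SINGL|VEHICL|ROLL|DUE|FATAL|FAN|DUAL|LOSS|CALM|1",
  "2418|AI|FELL|ASLEEP|WHEEL|VEHICL|RETENT|POND|LAB|ADVERS|1",
  "2419|ACCID|SINGL|VEHICL|TREE|FELL|IV|LIGHTN|STORM|IV|CAMP|CALM|AD|1",
  "2422|THEFT|RECOV|TOTAL|THEFT|0"
)

fields <- strsplit(lines, "|", fixed = TRUE)

# Last field is the class label; map 0/1 to 1/2 so it is a valid index.
class <- vapply(fields, function(f) as.integer(f[length(f)]), integer(1)) + 1L

# Everything between the id and the label is a word; keep each word once per record.
tokens <- lapply(fields, function(f) unique(f[-c(1, length(f))]))

# Word table, and each record rewritten as positions in that table.
vocab <- sort(unique(unlist(tokens)))
words <- lapply(tokens, match, table = vocab)

# class x presence x word table, as in the reply above.
tab <- array(0L, dim = c(2, 2, length(vocab)),
             dimnames = list(class = c("0", "1"),
                             presence = c("present", "absent"),
                             word = vocab))
for (i in seq_along(words)) {
  for (w in words[[i]]) tab[class[i], 1, w] <- tab[class[i], 1, w] + 1L
}
tab[1, 2, ] <- sum(class == 1) - tab[1, 1, ]
tab[2, 2, ] <- sum(class == 2) - tab[2, 1, ]

# One chi-square test per 2 x 2 slice; warnings suppressed for the toy data only.
pvals <- suppressWarnings(apply(tab, 3, function(m) chisq.test(m)$p.value))

tab[, , "ACCID"]
head(sort(pvals))
```

On real data you would probably drop suppressWarnings() and look at which
words trigger the small-count warning before trusting their p-values.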

Reid Huntsinger

-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Weiwei Shi
Sent: Wednesday, June 15, 2005 5:10 PM
To: R-help at stat.math.ethz.ch
Subject: [R] coding to generate a matrix to prepare for chi-sqr test for
text mining


Hi, there:
I have a dataset like the following:

1412|WINDOW|SHATTER|TORN|SOFT|TOP|WATER|RAIN|LAB|AI|BOLL|CAMP|0
1413|PARK|IV|STRUCK|PARK|PUSH|COD|POLICI|CIA|TB|SIC|0
2412|ACCID|REAREND|MULTI|EH|IV|MIDDL|FAN|DUAL|LOSS|CALM|1
2414|IV|REAREND|CD|COG|LAB|ADVERS|1
2415|ACCID|SINGL|VEHICL|IV|SWERV|AVOID|OBJECT|STRUCK|PHONE|POLE|FAN|0
2417|ACCID|SINGL|VEHICL|ROLL|DUE|FATAL|FAN|DUAL|LOSS|CALM|1
2418|AI|FELL|ASLEEP|WHEEL|VEHICL|RETENT|POND|LAB|ADVERS|1
2419|ACCID|SINGL|VEHICL|TREE|FELL|IV|LIGHTN|STORM|IV|CAMP|CALM|AD|1
2422|THEFT|RECOV|TOTAL|THEFT|0
...

The first column is always id_num, the last one is the class label. I want
to do a chi-square test of the dependency between a word (or,
further, a word combination) and the class label.

for example, my goal is to build a table like the following, ready for
chi-square test
class label    ACCID (Yes)    ACCID (No)
     1              10            15
     0               5             9
 
The numbers are counts of lines (observations).
Later I want to do the same for word combinations like ACCID & WINDOW (this
result was generated by association analysis in another program of
mine) instead of ACCID only.

My first question is how to build, automatically in R, a data
structure (data frame) representing the table above for each word,
since I am learning R programming and I don't want to do it using
Python. (Don't worry if a word appears twice in one observation;
I have another version of the data set which lists only unique words.)

My target is to find a p-value for each word/class label from a
chi-square test and evaluate the significance of each feature for later
text mining. I am not sure if this is a good idea and I am reading
some papers on this.

Thanks,

--  
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html

Huntsinger, Reid

2005-Jun-15 23:05 UTC

[R] coding to generate a matrix to prepare for chi-sqr test for text mining

The data structure I described is a sparse (binary) matrix, and I should
have said you could use the SparseM or Matrix package's sparse matrix
classes and methods to do what I suggested below. There are linear algebra
routines available for them but for other things I'm not sure. 
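
A minimal sketch of that idea, assuming the Matrix package and a small
made-up "words"/"class" pair (the toy values and the names X, G, present
and absent are illustrative, not from the dataset). It stores the
observation x word indicators as a sparse matrix and gets the "present"
counts per class with a single cross-product; it also assumes each word
index appears at most once per observation, since sparseMatrix() sums
repeated entries.

```r
library(Matrix)

# Toy input: 4 observations over a 4-word table, classes coded 1 or 2.
words <- list(c(1L, 2L), c(2L, 3L), c(1L, 3L, 4L), 4L)
class <- c(1L, 1L, 2L, 2L)
nwords <- 4L

# Sparse binary observation x word matrix: X[i, j] = 1 if word j occurs in observation i.
X <- sparseMatrix(i = rep(seq_along(words), lengths(words)),
                  j = unlist(words), x = 1,
                  dims = c(length(words), nwords))

# Sparse observation x class indicator matrix.
G <- sparseMatrix(i = seq_along(class), j = class, x = 1,
                  dims = c(length(class), 2L))

# t(G) %*% X: row = class, column = word, entry = observations containing the word.
present <- as.matrix(crossprod(G, X))

# Class totals minus "present" gives the "absent" counts (recycled down the rows).
absent <- tabulate(class, nbins = 2L) - present

# The X'X matrix mentioned earlier: word-by-word co-occurrence counts.
XtX <- crossprod(X)
```

The 2 x 2 table for word j is then cbind(present[, j], absent[, j]), which
can be passed to chisq.test as before.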

Reid Huntsinger
