Hi all,
I am currently working on code that calculates the occurrence rate of each
word across a set of tweets.
First, I have 'tweets', which contains all the tweets I grabbed, and from it
I build 'words', which contains every unique word in 'tweets'. After that I
use sapply to calculate, for each word, the probability that it appears in a
tweet.
The main problem is speed. Before using sapply I used a simple for loop,
which took a really long time to finish, but inside the loop I could at
least print a rough ETA.
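For context, the loop version was essentially the following (a minimal
sketch; I have used base R's txtProgressBar here to stand in for my
hand-rolled ETA printout, and it assumes the 'words', 'tweets', 'words.num',
and 'tweets.num' objects built in the code below):

prob <- numeric(words.num)
pb <- txtProgressBar(min=0, max=words.num, style=3)
for (i in 1:words.num) {
  prob[i] <- sum(grepl(sprintf('\\b%s\\b', words[i]), tweets,
                       ignore.case=TRUE, perl=TRUE)) / tweets.num
  setTxtProgressBar(pb, i)  # visible progress, so time remaining can be estimated
}
close(pb)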
After I learned to use sapply and reworked the code around it, the speed
improved greatly, but I no longer have an ETA, so I just wait for the result
to appear.
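What I would really like is the sapply version below but with visible
progress. As far as I can tell, the pbapply package provides that kind of
wrapper around sapply (a sketch, assuming pbapply is installed and using the
same objects built in the code below; I believe its default timer bar also
estimates time remaining):

library(pbapply)
result$prob <- pbsapply(1:words.num, function(i)
  sum(grepl(sprintf('\\b%s\\b', words[i]), tweets,
            ignore.case=TRUE, perl=TRUE)) / tweets.num)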
Using just 5% of the data, I have waited for hours and R is still busy with
no output.
Is there a faster solution, or a useful package that would help with my
problem?
Here is my code:
sample.num <- 100000
tweets <- read.csv('data_conv.csv', sep=',', header=TRUE,
                   stringsAsFactors=FALSE)
tweets.num <- nrow(tweets)
tweets <- tweets[sample(1:tweets.num, sample.num, replace=FALSE), 1]  # sample rows; column 1 holds the tweet text
tweets.num <- length(tweets)
words <- paste(tweets, collapse=' ')
words <- gsub("[\\r\\n]+", " ", words, perl=TRUE)   # remove newlines
words <- gsub(" *\\d+ *", " ", words, perl=TRUE)    # remove digits
words <- gsub("[^\\w@]+", " ", words, perl=TRUE)    # remove non-word characters
words <- unique(strsplit(tolower(words), split=' ')[[1]])  # unique words
words <- sort(words[nzchar(words)])                 # drop empty strings, sort
words.num <- length(words)
result <- data.frame(words, stringsAsFactors=FALSE)
result$prob <- sapply(1:words.num, function(i)
  sum(grepl(sprintf('\\b%s\\b', words[i]), tweets,
            ignore.case=TRUE, perl=TRUE)) / tweets.num)  # Loooong time here
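For what it's worth, the only faster idea I have so far is to avoid running
grepl once per word and instead tokenize every tweet a single time, then
tabulate how many tweets contain each word. A rough sketch (assuming
'tweets' is the sampled character vector from above; the cleaning is
simplified and I have not checked that it matches grepl's \b word-boundary
semantics exactly):

tok <- strsplit(tolower(gsub("[^\\w@]+", " ", tweets, perl=TRUE)),
                " ", fixed=TRUE)
tok <- lapply(tok, function(x) unique(x[nzchar(x)]))  # each word counted once per tweet
counts <- table(unlist(tok))                          # number of tweets containing each word
prob2 <- as.numeric(counts) / tweets.num              # occurrence rate per word
names(prob2) <- names(counts)

This makes one pass over the data instead of words.num passes, so it should
scale much better.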
Thank you,
Bembi