Hi all,
I am currently working on code that calculates the occurrence rate of each
word across a set of tweets.
First, I have 'tweets', which contains all the tweets I grabbed, and from it
I build 'words', which contains every unique word in 'tweets'. After that I
use sapply to calculate, for each word, the probability that it appears in a
tweet.
The main problem is speed. Before using sapply I used a simple for loop,
which took a really long time to finish, but inside the loop I could at
least print a rough ETA.
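For context, the loop version was essentially the following (a minimal
sketch; I have used base R's txtProgressBar here to stand in for my
hand-rolled ETA printout, and it assumes the 'words', 'tweets', 'words.num',
and 'tweets.num' objects built in the code below):

prob <- numeric(words.num)
pb <- txtProgressBar(min=0, max=words.num, style=3)
for (i in 1:words.num) {
  prob[i] <- sum(grepl(sprintf('\\b%s\\b', words[i]), tweets,
                       ignore.case=TRUE, perl=TRUE)) / tweets.num
  setTxtProgressBar(pb, i)  # visible progress, so time remaining can be estimated
}
close(pb)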
After I learned to use sapply and reworked the code around it, the speed
improved greatly, but I no longer have an ETA, so I just wait for the result
to appear.
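What I would really like is the sapply version below but with visible
progress. As far as I can tell, the pbapply package provides that kind of
wrapper around sapply (a sketch, assuming pbapply is installed and using the
same objects built in the code below; I believe its default timer bar also
estimates time remaining):

library(pbapply)
result$prob <- pbsapply(1:words.num, function(i)
  sum(grepl(sprintf('\\b%s\\b', words[i]), tweets,
            ignore.case=TRUE, perl=TRUE)) / tweets.num)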
Using just 5% of the data, I have waited for hours and R is still busy with
no output.
Is there a faster solution, or a useful package that would help with my
problem?
Here is my code:
sample.num <- 100000
tweets <- read.csv('data_conv.csv', sep=',', header=TRUE,
                   stringsAsFactors=FALSE)
tweets.num <- nrow(tweets)
tweets <- tweets[sample(1:tweets.num, sample.num, replace=FALSE), 1]  # sample rows; column 1 holds the tweet text
tweets.num <- length(tweets)
words <- paste(tweets, collapse=' ')
words <- gsub("[\\r\\n]+", " ", words, perl=TRUE)   # remove newlines
words <- gsub(" *\\d+ *", " ", words, perl=TRUE)    # remove digits
words <- gsub("[^\\w@]+", " ", words, perl=TRUE)    # remove non-word characters
words <- unique(strsplit(tolower(words), split=' ')[[1]])  # unique words
words <- sort(words[nzchar(words)])                 # drop empty strings, sort
words.num <- length(words)
result <- data.frame(words, stringsAsFactors=FALSE)
result$prob <- sapply(1:words.num, function(i)
  sum(grepl(sprintf('\\b%s\\b', words[i]), tweets,
            ignore.case=TRUE, perl=TRUE)) / tweets.num)  # Loooong time here
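For what it's worth, the only faster idea I have so far is to avoid running
grepl once per word and instead tokenize every tweet a single time, then
tabulate how many tweets contain each word. A rough sketch (assuming
'tweets' is the sampled character vector from above; the cleaning is
simplified and I have not checked that it matches grepl's \b word-boundary
semantics exactly):

tok <- strsplit(tolower(gsub("[^\\w@]+", " ", tweets, perl=TRUE)),
                " ", fixed=TRUE)
tok <- lapply(tok, function(x) unique(x[nzchar(x)]))  # each word counted once per tweet
counts <- table(unlist(tok))                          # number of tweets containing each word
prob2 <- as.numeric(counts) / tweets.num              # occurrence rate per word
names(prob2) <- names(counts)

This makes one pass over the data instead of words.num passes, so it should
scale much better.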
Thank you,
Bembi