Hi all,

I am currently working on code that calculates the occurrence rate of each word across a set of tweets. First I have 'tweets', which holds all the tweets I grabbed, and from it I build 'words', which holds every unique word in 'tweets'. Then I use sapply to calculate, for each word, the probability that it appears in a tweet.

The main problem is speed. Before using sapply I used a plain for loop; it took a very long time to finish, but at least I could print a rough ETA inside the loop. After I learned about sapply and rewrote the code with it, the speed improved greatly, but now there is no ETA, so I just wait for the result to appear. Using only 5% of my data, I have been waiting for hours and R is still busy with no output. Is there a faster solution, or a useful package, for this problem?

Here is my code:

sample.num <- 100000
tweets <- read.csv('data_conv.csv', sep = ',', header = TRUE,
                   stringsAsFactors = FALSE)
tweets.num <- dim(tweets)[1]
tweets <- tweets[sample(1:tweets.num, sample.num, replace = FALSE), 1]  # sample rows; tweet text assumed to be in column 1
tweets.num <- length(tweets)

words <- paste(tweets, collapse = ' ')
words <- gsub("\\r\\n", " ", words, perl = TRUE)    # remove newlines
words <- gsub(" *\\d+ *", " ", words, perl = TRUE)  # remove digits
words <- gsub("[^\\w@]+", " ", words, perl = TRUE)  # remove non-word characters
words <- sort(unique(unlist(strsplit(tolower(words), split = ' '))))  # unique words, sorted
words.num <- length(words)

result <- data.frame(words, stringsAsFactors = FALSE)
result$prob <- sapply(1:words.num, function(i)
  sum(grepl(sprintf('\\b%s\\b', words[i]), tweets,
            ignore.case = TRUE, perl = TRUE)) / tweets.num)  # Loooong time here
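For reference, below is a minimal sketch of a single-pass alternative (untested on this data, and assuming the same cleaning rules as above): instead of running one grepl over all tweets per word, tokenise each tweet once, keep each word at most once per tweet, and let table() count how many tweets contain each word. It is not an exact drop-in for the \b word-boundary matching, which can behave differently around tokens such as @mentions; the name 'result2' is just illustrative.

## One-pass sketch: clean and split each tweet, then tabulate.
clean <- tolower(tweets)
clean <- gsub("\\r\\n", " ", clean, perl = TRUE)      # remove newlines
clean <- gsub(" *\\d+ *", " ", clean, perl = TRUE)    # remove digits
clean <- gsub("[^\\w@]+", " ", clean, perl = TRUE)    # remove non-word characters
tok <- strsplit(clean, " ", fixed = TRUE)             # words of each tweet
tok <- lapply(tok, function(w) unique(w[nzchar(w)]))  # count each word once per tweet
counts <- table(unlist(tok))                          # tweets containing each word
result2 <- data.frame(words = names(counts),          # hypothetical output frame
                      prob = as.numeric(counts) / length(tweets),
                      stringsAsFactors = FALSE)

This makes one pass over the data instead of words.num passes, so the cost should scale roughly with the total number of tokens rather than with (unique words) times (tweets).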
Thank you,
Bembi
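P.S. On the missing ETA: the pbapply package offers pbsapply(), which behaves like sapply() but prints a progress bar while it runs. A minimal sketch (untested here), keeping the original computation unchanged:

library(pbapply)  # install.packages("pbapply") if needed
result$prob <- pbsapply(1:words.num, function(i)
  sum(grepl(sprintf('\\b%s\\b', words[i]), tweets,
            ignore.case = TRUE, perl = TRUE)) / tweets.num)

This keeps the slow algorithm but at least restores visible progress while it runs.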