I'm using the text mining package ("tm") to process a large number of blog
and message board postings (about 245,000). Does anyone have advice on how
to efficiently extract the metadata from a corpus of this size?
tm does a great job of using MPI for many functions (e.g. tmMap), which
greatly speeds up the processing. However, the "meta" function that I need
does not take advantage of MPI.
I have two ideas:
1) Find a way of running the meta function in parallel. Specifically, the
code I'm running is:
urllist <- lapply(workingcorpus, meta, tag = "FeedUrl")
Unfortunately, I receive the following error message when I try to use
parLapply instead:
"Error in checkCluster(cl) : not a valid cluster
Calls: parLapply ... is.vector -> clusterApply -> staticClusterApply ->
checkCluster"
2) Alternatively, I wonder if there might be a way of extracting all of the
metadata into a data.frame that would be faster to process?
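
For example, something along these lines, looping over the tags once and
binding the results into columns (the tag names are illustrative, and this
assumes every document actually carries each tag):

    tags <- c("FeedUrl", "Author")  # illustrative tag names
    cols <- lapply(tags, function(tg)
        sapply(workingcorpus, function(d) as.character(meta(d, tag = tg))))
    metadf <- data.frame(setNames(cols, tags), stringsAsFactors = FALSE)

But this still calls meta() once per document per tag, so I'm not sure it
would actually be any faster.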
Thanks for any suggestions or ideas!
Shad
shad thomas | president | glass box research company | +1 (312) 451-3611 tel |
shad.thomas@glassboxresearch.com | www.glassboxresearch.com