Jim Holtman
2015-Mar-03 17:04 UTC
[R] Using a text file as a removeWord dictionary in tm_map
Send me a copy of your file so I can see what it looks like and what the output should be.

-------- Original message --------
From: Sun Shine <phaedrusv at gmail.com>
Date: 03/03/2015 09:43 (GMT-05:00)
To: jim holtman <jholtman at gmail.com>
Cc: r-help <r-help at r-project.org>
Subject: Re: [R] Using a text file as a removeWord dictionary in tm_map

Hi again

I've now had the chance to try this out, and using scan() doesn't seem to work either.

This is what I used:

1) I generated a plain text file called stopDict.txt. This file is of the format: "a, bunch, of, words, to, use"

2) I invoked scan(), like this:

> userStopList <- scan(text = '~/path/to/stopDict.txt', what = " ", sep= ",")

3) Then I used the externally generated list as stop words:

> docs <- tm_map(docs, removeWords, userStopList)

4) When I go to inspect the document, at least two of the user-defined stop words are still in the text.

Is there a further argument I should be passing to scan(), or is the stopDict.txt file not set up the correct way? I tried each term separated by ' ' and ',' (e.g. 'all', 'the', 'text') but that didn't work, and neither does it seem to work when the whole list is enclosed within quotes (e.g. "all, the, text").

While it is not critical to have the capacity to read in an externally generated list, it sure would be helpful.

Thanks.

Sun

On 02/03/15 07:36, Sun Shine wrote:
> Thanks Jim.
>
> I thought that I was passing a vector, not realising I had converted
> this to a list object.
>
> I haven't come across the scan() function so far, so this is good to
> know.
>
> Good explanation - I'll give this a go when I can get back to that
> piece of work later today.
>
> Thanks again.
>
> Regards,
>
> Sun
>
>
> On 01/03/15 21:13, jim holtman wrote:
>> The 'read.table' was creating a data.frame (not a vector) and applying
>> 'c' to it converted it to a list. You should always look at the object
>> you are creating. You probably want to use 'scan'.
>>
>> =====================
>> > testFile <- "Although,this,query,applies,specifically,to,the,tm,package"
>> > # read in with read.table create a data.frame
>> > df_words <- read.table(text = testFile, sep = ',')
>> > df_words  # not a vector
>>         V1   V2    V3      V4           V5 V6  V7 V8      V9
>> 1 Although this query applies specifically to the tm package
>> > c(df_words)  # this results in a list
>> $V1
>> [1] Although
>> Levels: Although
>> $V2
>> [1] this
>> Levels: this
>> $V3
>> [1] query
>> Levels: query
>> $V4
>> [1] applies
>> Levels: applies
>> $V5
>> [1] specifically
>> Levels: specifically
>> $V6
>> [1] to
>> Levels: to
>> $V7
>> [1] the
>> Levels: the
>> $V8
>> [1] tm
>> Levels: tm
>> $V9
>> [1] package
>> Levels: package
>> > # now read with 'scan'
>> > scan_words <- scan(text = testFile, what = '', sep = ',')
>> Read 9 items
>> > scan_words
>> [1] "Although"     "this"         "query"        "applies"      "specifically" "to"
>> [7] "the"          "tm"           "package"
>> >
>> Jim Holtman
>> Data Munger Guru
>>
>> What is the problem that you are trying to solve?
>> Tell me what you want to do, not how you want to do it.
>>
>>
>> On Sat, Feb 28, 2015 at 8:46 AM, Sun Shine <phaedrusv at gmail.com> wrote:
>>> Hi list
>>>
>>> Although this query applies specifically to the tm package, perhaps it's
>>> something that others might be able to lend a thought to.
>>>
>>> Using tm to do some initial text mining, I want to include an external (to
>>> R) generated dictionary of words that I want removed from the corpus.
>>>
>>> I have created a comma separated list of terms in " " marks in a
>>> stopList.txt plain UTF-8 file. I want to read this into R, so do:
>>>
>>> > stopDict <- read.table('~/path/to/file/stopList.txt', sep=',')
>>>
>>> When I want to load it as part of the removeWords function in tm, I do:
>>>
>>> > docs <- tm_map(docs, removeWords, stopDict)
>>>
>>> which has no effect. Neither does:
>>>
>>> > docs <- tm_map(docs, removeWords, c(stopDict))
>>>
>>> What am I not seeing/doing?
>>>
>>> How do I pass a text file with pre-defined terms to the removeWords
>>> transform of tm?
>>>
>>> Thanks for any ideas.
>>>
>>> Cheers
>>>
>>> Sun
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
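
A note on the scan() call quoted in this thread: the text= argument of scan() takes the data itself as a character string, so handing it a path makes scan() read the path string, not the file. The sketch below is an illustration only and not part of the original exchange. It reads the file through the file= argument instead and adds strip.white = TRUE to drop the space that follows each comma. The temporary file and its contents are stand-ins modelled on the stopDict.txt format described in the thread; the real file was never posted, so this may not be the whole explanation.

```r
# Hypothetical stand-in for the stopDict.txt file described in the thread
stop_file <- tempfile(fileext = ".txt")
writeLines("a, bunch, of, words, to, use", stop_file)

# text= treats its argument as the data, so this scans the path string itself
wrong <- scan(text = stop_file, what = "", sep = ",", quiet = TRUE)
print(wrong)

# file= reads the file contents; strip.white = TRUE removes the blank
# that follows each comma, which would otherwise stay attached to the word
userStopList <- scan(file = stop_file, what = "", sep = ",",
                     strip.white = TRUE, quiet = TRUE)
print(userStopList)

# Optional check with tm, run only if the package is installed.
# removeWords() also accepts a plain character vector as its first argument.
if (requireNamespace("tm", quietly = TRUE)) {
  print(tm::removeWords("a bunch of grapes to use later", userStopList))
}
```

With a corpus, the resulting character vector would be passed exactly as in the thread: docs <- tm_map(docs, removeWords, userStopList). If the whole list in the file is wrapped in one pair of double quotes, scan() will treat it as a single quoted field, so the file is best left unquoted or quoted term by term.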