Sun Shine
2015-Feb-28 13:46 UTC
[R] Using a text file as a removeWord dictionary in tm_map
Hi list

Although this query applies specifically to the tm package, perhaps it's something that others might be able to lend a thought to.

Using tm to do some initial text mining, I want to include an external (to R) generated dictionary of words that I want removed from the corpus.

I have created a comma separated list of terms in " " marks in a stopList.txt plain UTF-8 file. I want to read this into R, so do:

> stopDict <- read.table('~/path/to/file/stopList.txt', sep=',')

When I want to load it as part of the removeWords function in tm, I do:

> docs <- tm_map(docs, removeWords, stopDict)

which has no effect. Neither does:

> docs <- tm_map(docs, removeWords, c(stopDict))

What am I not seeing/ doing?

How do I pass a text file with pre-defined terms to the removeWords transform of tm?

Thanks for any ideas.

Cheers

Sun
jim holtman
2015-Mar-01 21:13 UTC
[R] Using a text file as a removeWord dictionary in tm_map
The 'read.table' was creating a data.frame (not a vector) and applying 'c' to it converted it to a list. You should always look at the object you are creating. You probably want to use 'scan'.

=====================
> testFile <- "Although,this,query,applies,specifically,to,the,tm,package"
> # read in with read.table create a data.frame
> df_words <- read.table(text = testFile, sep = ',')
> df_words  # not a vector
        V1   V2    V3      V4           V5 V6  V7 V8      V9
1 Although this query applies specifically to the tm package
> c(df_words)  # this results in a list
$V1
[1] Although
Levels: Although

$V2
[1] this
Levels: this

$V3
[1] query
Levels: query

$V4
[1] applies
Levels: applies

$V5
[1] specifically
Levels: specifically

$V6
[1] to
Levels: to

$V7
[1] the
Levels: the

$V8
[1] tm
Levels: tm

$V9
[1] package
Levels: package

>
> # now read with 'scan'
> scan_words <- scan(text = testFile, what = '', sep = ',')
Read 9 items
> scan_words
[1] "Although"     "this"         "query"        "applies"      "specifically" "to"
[7] "the"          "tm"           "package"
>

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
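For the original question, the 'scan' approach above might be applied to the file itself and the result handed to tm_map. This is only a sketch: the temporary file and its four terms are made-up stand-ins for stopList.txt, the one-document corpus is invented for illustration, and the tm part is skipped when the package is not installed.

```r
# Hypothetical stand-in for stopList.txt: quoted terms separated by commas.
stop_file <- tempfile(fileext = ".txt")
writeLines('"although","this","query","applies"', stop_file)

# scan() returns a plain character vector; with sep = "," the default
# quote setting strips the surrounding double quotes.
stop_words <- scan(stop_file, what = "", sep = ",", quiet = TRUE)

# Check the object before passing it on: a character vector, not a data.frame.
stopifnot(is.character(stop_words), is.null(dim(stop_words)))

# Only run the tm part when the package is available.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  # Invented one-document corpus, for illustration only.
  docs <- VCorpus(VectorSource("although this query applies to the tm package"))
  docs <- tm_map(docs, removeWords, stop_words)
  print(content(docs[[1]]))
}
```

As far as I know, removeWords matches case-sensitively, so it may be worth lower-casing the corpus first (for example with tm_map(docs, content_transformer(tolower))) if the stop list is all lower case.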
Sun Shine
2015-Mar-02 07:36 UTC
[R] Using a text file as a removeWord dictionary in tm_map
Thanks Jim.

I thought that I was passing a vector, not realising that read.table had given me a data.frame and that c() had then turned it into a list. I haven't come across the scan() function so far, so this is good to know.

Good explanation - I'll give this a go when I can get back to that piece of work later today.

Thanks again.

Regards,

Sun
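A quick way to see the mix-up described here is to check the class of each object before passing it to removeWords. The sketch below uses a made-up four-term string in place of the real stopList.txt, and also shows one possible way (not the one suggested in the thread) to flatten an existing read.table result into a character vector.

```r
# Made-up sample terms standing in for the contents of stopList.txt.
sample_text <- "although,this,query,applies"

df_words <- read.table(text = sample_text, sep = ",")

class(df_words)      # "data.frame"
class(c(df_words))   # "list"

# Flatten the one-row data frame into a character vector.
# as.character() also covers R versions where the columns come back as factors.
word_vec <- as.character(unlist(df_words, use.names = FALSE))
class(word_vec)      # "character"
```

If the file is only ever used as a word list, reading it with scan() in the first place avoids the conversion step altogether.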