Ventseslav Kozarev, MPP
2013-Apr-09 07:10 UTC
[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text
Hi,

I bumped into a serious issue while trying to analyse some texts in Bulgarian (with the tm package). I import a tab-separated csv file, which holds a total of 22 variables, most of which are text cells (not factors), using the read.delim function:

data <- read.delim("bigcompanies_ascii.csv",
                   header=TRUE,
                   quote="'",
                   sep="\t",
                   as.is=TRUE,
                   encoding='CP1251',
                   fileEncoding='CP1251')

(I also tried the above with UTF-8 encoding on a UTF-8-saved file.)

I have my list of stop words in a separate text file, one word per line, which I read into R using the scan function:

stoplist <- scan(file='stoplist_ascii.txt',
                 what='character',
                 strip.white=TRUE,
                 blank.lines.skip=TRUE,
                 fileEncoding='CP1251',
                 encoding='CP1251')

(also tried with UTF-8 here on a correspondingly encoded file)

I currently only test with a corpus based on the contents of just one variable, and I construct the corpus from a VectorSource. When I run inspect, all seems fine and I can see the text properly, with Unicode characters present:

data.corpus <- Corpus(VectorSource(data$variable, encoding='UTF-8'),
                      readerControl=list(language='bulgarian'))

However, no matter which encoding I select - UTF-8 or CP1251, which is the typical code page for Bulgarian texts - I cannot get the stop words removed from my corpus. The issue is present on both Linux and Windows, and across the computers I use R on, so I don't think it is related to bad configuration. Removal of punctuation, white space, and numbers is flawless, but the inability to remove stop words prevents me from analysing the texts any further.

Does anybody have experience with languages other than English for which there is no predefined stop list available through the stopwords function? I will highly appreciate any tips and advice!

Thanks in advance,
Vince
Milan Bouchet-Valat
2013-Apr-09 19:55 UTC
[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text
On Tuesday, 9 April 2013 at 10:10 +0300, Ventseslav Kozarev, MPP wrote:
> [...]
> Does anybody have experience with languages other than English for which
> there is no predefined stop list available through the stopwords function?
> I will highly appreciate any tips and advice!

Well, at least show us the code that you use to remove stopwords... Can you provide a reproducible example with a toy corpus?
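[The reproducible example sent in reply is not preserved in this archive. A minimal sketch of what such a toy example might look like is given below; the Bulgarian sentences, the stop word list, and the use of tm_map with removeWords are assumptions for illustration, not code from the original posts.]

library(tm)

# Toy corpus with two short Bulgarian sentences (hypothetical text)
docs <- c("аз съм тук", "ти не си тук")
corpus <- Corpus(VectorSource(docs),
                 readerControl = list(language = "bulgarian"))

# Toy stop word list (hypothetical)
stoplist <- c("аз", "съм", "не")

# The removal step that reportedly leaves the corpus unchanged
corpus <- tm_map(corpus, removeWords, stoplist)
inspect(corpus)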
Milan Bouchet-Valat
2013-Apr-10 18:43 UTC
[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text
On Wednesday, 10 April 2013 at 13:17 +0200, Ingo Feinerer wrote:
> On Wed, Apr 10, 2013 at 10:29:27AM +0200, Milan Bouchet-Valat wrote:
> > Thanks for the reproducible example. Indeed, it does not work here
> > either (Linux with UTF-8 locale). The problem seems to be in the call to
> > gsub() in removeWords: the pattern "\\b" does not match anything when
> > perl=TRUE. With perl=FALSE, it works.
>
> The \b versus perl versus UTF-8 issue seems to be known, and it is
> advised to use perl = TRUE with \b. See e.g. the warning in the gsub
> help page (?gsub):
>
> ---8<--------------------------------------------------------------------------
> Warning:
>
> POSIX 1003.2 mode of 'gsub' and 'gregexpr' does not work correctly with
> repeated word-boundaries (e.g. 'pattern = "\b"'). Use 'perl = TRUE' for
> such matches (but that may not work as expected with non-ASCII inputs,
> as the meaning of 'word' is system-dependent).
> ---8<--------------------------------------------------------------------------

Thanks for the pointer. Indeed, this allowed me to discover the existence of the PCRE_UCP (Unicode Character Properties) flag, which changes matching behavior so that Unicode alphanumerics are treated as word characters and thus no longer create word boundaries.

This flag should probably be used by R when calling pcre_compile() in gsub() and friends. At the moment, R's behavior is inconsistent across platforms:

- on Fedora 18, R 2.15.3:
gsub("\\bt\\b", "", "télégramme", perl=TRUE)
[1] "élégramme"

- on Windows 2008, R 2.15.1 and 3.0.0:
gsub("\\bt\\b", "", "télégramme", perl=TRUE)
[1] "télégramme"

Luckily, the bug can be fixed at tm's level by adding (*UCP) at the beginning of the pattern. This works for our examples:

> gsub(sprintf("\\b(%s)\\b", "?????"), "", "?????", perl=TRUE)
[1] "?????"
> gsub(sprintf("(*UCP)\\b(%s)\\b", "?????"), "", "?????", perl=TRUE)
[1] ""

gsub("\\bt\\b", "", "télégramme", perl=TRUE)
[1] "élégramme"
gsub("(*UCP)\\bt\\b", "", "télégramme", perl=TRUE)
[1] "télégramme"

Regards
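[The Cyrillic word in the examples above was lost in the archive. A sketch of the same comparison with a common Bulgarian stop word, "или", chosen purely for illustration, would look like this on a UTF-8 Linux system; the expected output is shown as comments.]

## Without (*UCP): Cyrillic letters are not word characters for \b,
## so the pattern never matches and the stop word is left in place.
gsub(sprintf("\\b(%s)\\b", "или"), "", "или", perl = TRUE)
# expected: [1] "или"

## With (*UCP): Cyrillic letters count as word characters, the
## boundaries fall at the start and end of the word, and it is removed.
gsub(sprintf("(*UCP)\\b(%s)\\b", "или"), "", "или", perl = TRUE)
# expected: [1] ""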
Ventseslav Kozarev
2013-Apr-10 20:26 UTC
[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text
I just wanted to confirm that Milan's suggestion of adding (*UCP), as in the example below:

gsub(sprintf("(*UCP)\\b(%s)\\b", "?????"), "", "?????", perl=TRUE)

solved all problems (under openSUSE Linux 12.3 64-bit, R 2.15.2). I re-encoded the input files and the stop word list in UTF-8, and now stop words are properly removed using the suggested syntax:

sme.corpus <- tm_map(sme.corpus, removeWords.PlainTextDocument, stoplist)

where:

removeWords.PlainTextDocument <- function(x, words)
    gsub(sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|")),
         "", x, perl = TRUE)

and stoplist is a character vector of stop words. The wordcloud function now also accepts the preprocessed corpus without warnings or errors.

Now, if only I could do stemming in Bulgarian, that would be priceless!

Thanks again, this has been tremendous help indeed!
Vince

On Wednesday, 10 April 2013 at 20:43:27, Milan Bouchet-Valat wrote:
> [...]
> Luckily, the bug can be fixed at tm's level by adding (*UCP) at the
> beginning of the pattern.
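[A consolidated sketch of the working pipeline described across the thread follows. The UTF-8 file names, the column name data$variable, and the choice of tm transformations are assumptions, not verbatim from the posts; with tm versions newer than those used in 2013, custom functions may additionally need to be wrapped in content_transformer().]

library(tm)

## Read the data and the stop word list, both re-saved as UTF-8 (assumed file names)
data <- read.delim("bigcompanies_utf8.csv", header = TRUE, quote = "'",
                   sep = "\t", as.is = TRUE, fileEncoding = "UTF-8")
stoplist <- scan(file = "stoplist_utf8.txt", what = "character",
                 strip.white = TRUE, blank.lines.skip = TRUE,
                 fileEncoding = "UTF-8")

## Unicode-aware stop word removal, as posted above
removeWords.PlainTextDocument <- function(x, words)
    gsub(sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|")),
         "", x, perl = TRUE)

## Build the corpus from one text column and apply the transformations
sme.corpus <- Corpus(VectorSource(data$variable),
                     readerControl = list(language = "bulgarian"))
sme.corpus <- tm_map(sme.corpus, removePunctuation)
sme.corpus <- tm_map(sme.corpus, removeNumbers)
sme.corpus <- tm_map(sme.corpus, stripWhitespace)
sme.corpus <- tm_map(sme.corpus, removeWords.PlainTextDocument, stoplist)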