Ventseslav Kozarev, MPP
2013-Apr-09 07:10 UTC
[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text
Hi,

I bumped into a serious issue while trying to analyse some texts in Bulgarian (with the tm package). I import a tab-separated csv file, which holds a total of 22 variables, most of which are text cells (not factors), using the read.delim function:

data <- read.delim("bigcompanies_ascii.csv",
                   header=TRUE,
                   quote="'",
                   sep="\t",
                   as.is=TRUE,
                   encoding='CP1251',
                   fileEncoding='CP1251')

(I also tried the above with UTF-8 encoding on a UTF-8-saved file.)

I have my list of stop words in a separate text file, one word per line, which I read into R using the scan function:

stoplist <- scan(file='stoplist_ascii.txt',
                 what='character',
                 strip.white=TRUE,
                 blank.lines.skip=TRUE,
                 fileEncoding='CP1251',
                 encoding='CP1251')

(also tried with UTF-8 here on a correspondingly encoded file)

I currently only test with a corpus based on the contents of just one variable, and I construct the corpus from a VectorSource. When I run inspect, all seems fine and I can see the text properly, with Unicode characters present:

data.corpus <- Corpus(VectorSource(data$variable, encoding='UTF-8'),
                      readerControl=list(language='bulgarian'))

However, no matter which encoding I select - UTF-8 or CP1251, which is the typical code page for Bulgarian texts - I cannot get the stop words removed from my corpus. The issue is present on both Linux and Windows, and across the computers I use R on, so I don't think it is related to bad configuration. Removal of punctuation, white space, and numbers is flawless, but the inability to remove stop words prevents me from analysing the texts any further.

Does anybody have experience with languages other than English for which there is no predefined stop list available through the stopwords function? I will highly appreciate any tips and advice!

Thanks in advance,
Vince
Milan Bouchet-Valat
2013-Apr-09 19:55 UTC
[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text
On Tuesday, 9 April 2013 at 10:10 +0300, Ventseslav Kozarev, MPP wrote:
> [...]
> Does anybody have experience with languages other than English for which
> there is no predefined stop list available through the stopwords function?
> I will highly appreciate any tips and advice!

Well, at least show us the code that you use to remove stopwords... Can you provide a reproducible example with a toy corpus?
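[The reproducible example sent in reply is not preserved in this archive. A minimal sketch of what such a toy example might look like is given below; the Bulgarian sentences, the stop word list, and the use of tm_map with removeWords are assumptions for illustration, not code from the original posts.]

library(tm)

# Toy corpus with two short Bulgarian sentences (hypothetical text)
docs <- c("аз съм тук", "ти не си тук")
corpus <- Corpus(VectorSource(docs),
                 readerControl = list(language = "bulgarian"))

# Toy stop word list (hypothetical)
stoplist <- c("аз", "съм", "не")

# The removal step that reportedly leaves the corpus unchanged
corpus <- tm_map(corpus, removeWords, stoplist)
inspect(corpus)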
Milan Bouchet-Valat
2013-Apr-10 18:43 UTC
[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text
On Wednesday, 10 April 2013 at 13:17 +0200, Ingo Feinerer wrote:
> On Wed, Apr 10, 2013 at 10:29:27AM +0200, Milan Bouchet-Valat wrote:
> > Thanks for the reproducible example. Indeed, it does not work here
> > either (Linux with UTF-8 locale). The problem seems to be in the call to
> > gsub() in removeWords: the pattern "\\b" does not match anything when
> > perl=TRUE. With perl=FALSE, it works.
>
> The \b versus perl versus UTF-8 issue seems to be known, and it is
> advised to use perl = TRUE with \b. See e.g. the warning in the gsub
> help page (?gsub):
>
> ---8<--------------------------------------------------------------------------
> Warning:
>
> POSIX 1003.2 mode of 'gsub' and 'gregexpr' does not work correctly with
> repeated word-boundaries (e.g. 'pattern = "\b"'). Use 'perl = TRUE' for
> such matches (but that may not work as expected with non-ASCII inputs,
> as the meaning of 'word' is system-dependent).
> ---8<--------------------------------------------------------------------------

Thanks for the pointer. Indeed, this allowed me to discover the existence of the PCRE_UCP (Unicode Character Properties) flag, which changes matching behavior so that Unicode alphanumerics are treated as word characters and thus no longer create word boundaries.

This flag should probably be used by R when calling pcre_compile() in gsub() and friends. At the moment, R's behavior is inconsistent across platforms:

- on Fedora 18, R 2.15.3:
gsub("\\bt\\b", "", "télégramme", perl=TRUE)
[1] "élégramme"

- on Windows 2008, R 2.15.1 and 3.0.0:
gsub("\\bt\\b", "", "télégramme", perl=TRUE)
[1] "télégramme"

Luckily, the bug can be fixed at tm's level by adding (*UCP) at the beginning of the pattern. This works for our examples:

> gsub(sprintf("\\b(%s)\\b", "?????"), "", "?????", perl=TRUE)
[1] "?????"
> gsub(sprintf("(*UCP)\\b(%s)\\b", "?????"), "", "?????", perl=TRUE)
[1] ""

gsub("\\bt\\b", "", "télégramme", perl=TRUE)
[1] "élégramme"
gsub("(*UCP)\\bt\\b", "", "télégramme", perl=TRUE)
[1] "télégramme"

Regards
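[The Cyrillic word in the examples above was lost in the archive. A sketch of the same comparison with a common Bulgarian stop word, "или", chosen purely for illustration, would look like this on a UTF-8 Linux system; the expected output is shown as comments.]

## Without (*UCP): Cyrillic letters are not word characters for \b,
## so the pattern never matches and the stop word is left in place.
gsub(sprintf("\\b(%s)\\b", "или"), "", "или", perl = TRUE)
# expected: [1] "или"

## With (*UCP): Cyrillic letters count as word characters, the
## boundaries fall at the start and end of the word, and it is removed.
gsub(sprintf("(*UCP)\\b(%s)\\b", "или"), "", "или", perl = TRUE)
# expected: [1] ""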
Ventseslav Kozarev
2013-Apr-10 20:26 UTC
[R] Question on Stopword Removal from a Cyrillic (Bulgarian) Text
I just wanted to confirm that Milan's suggestion of adding (*UCP), as in the example below:

gsub(sprintf("(*UCP)\\b(%s)\\b", "?????"), "", "?????", perl=TRUE)

solved all problems (under openSUSE Linux 12.3 64-bit, R 2.15.2). I re-encoded the input files and the stop word list in UTF-8, and now stop words are properly removed using the suggested syntax:

sme.corpus <- tm_map(sme.corpus, removeWords.PlainTextDocument, stoplist)

where:

removeWords.PlainTextDocument <- function(x, words)
    gsub(sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|")),
         "", x, perl = TRUE)

and stoplist is a character vector of stop words. The wordcloud function now also accepts the preprocessed corpus without warnings or errors.

Now, if only I could do stemming in Bulgarian, that would be priceless!

Thanks again, this has been tremendous help indeed!
Vince

On Wednesday, 10 April 2013 at 20:43:27, Milan Bouchet-Valat wrote:
> [...]
> Luckily, the bug can be fixed at tm's level by adding (*UCP) at the
> beginning of the pattern.
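[A consolidated sketch of the working pipeline described across the thread follows. The UTF-8 file names, the column name data$variable, and the choice of tm transformations are assumptions, not verbatim from the posts; with tm versions newer than those used in 2013, custom functions may additionally need to be wrapped in content_transformer().]

library(tm)

## Read the data and the stop word list, both re-saved as UTF-8 (assumed file names)
data <- read.delim("bigcompanies_utf8.csv", header = TRUE, quote = "'",
                   sep = "\t", as.is = TRUE, fileEncoding = "UTF-8")
stoplist <- scan(file = "stoplist_utf8.txt", what = "character",
                 strip.white = TRUE, blank.lines.skip = TRUE,
                 fileEncoding = "UTF-8")

## Unicode-aware stop word removal, as posted above
removeWords.PlainTextDocument <- function(x, words)
    gsub(sprintf("(*UCP)\\b(%s)\\b", paste(words, collapse = "|")),
         "", x, perl = TRUE)

## Build the corpus from one text column and apply the transformations
sme.corpus <- Corpus(VectorSource(data$variable),
                     readerControl = list(language = "bulgarian"))
sme.corpus <- tm_map(sme.corpus, removePunctuation)
sme.corpus <- tm_map(sme.corpus, removeNumbers)
sme.corpus <- tm_map(sme.corpus, stripWhitespace)
sme.corpus <- tm_map(sme.corpus, removeWords.PlainTextDocument, stoplist)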