Hello everyone,

I've been using R in my doctoral research and would really appreciate it if any of you could help me find a solution to a problem. I have a file with texts in 3 languages - Portuguese, Spanish, and English - but I am interested only in the Portuguese words. Is there a function I can use to retrieve only the words of the language I want?

Thanks a lot!

Júlia
Hello,

I'm Portuguese and I don't know of any such function. Maybe a dictionary-based search?

Rui Barradas

On 28-05-2013 11:31, Júlia Zara wrote:
> I have a file with texts in 3 languages - Portuguese, Spanish, and English - but I am
> interested only in the Portuguese words. Is there a function I can use to
> retrieve only the words of the language I want?
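A dictionary-based search of the kind suggested here could be sketched in base R as follows. This is only a minimal illustration, not an existing R function: `pt_dictionary` is a hypothetical toy word list (a real one would be read from a Portuguese word-list file), and `tokenize()` and `keep_dictionary_words()` are names made up for this example.

```r
# Hypothetical toy dictionary (an assumption for this sketch); in practice
# this would be read from a Portuguese word-list file, e.g. with readLines().
pt_dictionary <- c("o", "gato", "animal", "casa", "livro")

# Split text into lowercase word tokens.
# Note: how [[:alpha:]] treats accented letters depends on the locale.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^[:alpha:]]+"))
  words[nzchar(words)]  # drop empty strings left by leading separators
}

# Keep only the tokens that appear in the dictionary.
keep_dictionary_words <- function(text, dictionary) {
  words <- tokenize(text)
  words[words %in% dictionary]
}

mixed <- "O gato is an animal, el perro duerme."
kept <- keep_dictionary_words(mixed, pt_dictionary)
print(kept)
```

With this toy dictionary the call keeps "o", "gato" and "animal". A lookup like this can only say that a word is in the Portuguese dictionary; it cannot tell whether a word shared with Spanish or English was actually written in Portuguese.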
Barry Rowlingson
2013-May-28 16:12 UTC
[R] Help retrieving only Portuguese words from a file
On Tue, May 28, 2013 at 5:02 PM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
> Hello,
>
> And some words exist in Portuguese, Spanish and English, the three
> languages of the problem. For instance, "animal". I don't think this
> problem can be solved, but a dictionary search would tell if it is a
> Portuguese word, which it is.

Is there any structure to the text? If it has complete paragraphs in one of the three languages then you can probably make a better guess at the language of each paragraph from the presence of key words. I wonder if some of the code for detecting spam can help you here... Train it on some known Portuguese, Spanish, and English text...

If it's just a stream of words in one of the languages in a random order then it is difficult or impossible.

Barry
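The key-word idea for whole paragraphs could look roughly like the base R sketch below. It is a simple count of common function words, not the trained spam-style classifier mentioned above, and everything named in it is an assumption for illustration: the tiny `key_words` lists, the helper `guess_language()`, and the three sample paragraphs.

```r
# Hypothetical, very short key-word lists (assumptions for this sketch);
# real lists of common function words would be much longer.
key_words <- list(
  portuguese = c("o", "os", "as", "um", "uma", "em", "com", "do", "da", "para"),
  spanish    = c("el", "los", "las", "un", "una", "en", "con", "del", "y", "para"),
  english    = c("the", "of", "and", "to", "in", "is", "that", "with", "for", "this")
)

# Split text into lowercase word tokens.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^[:alpha:]]+"))
  words[nzchar(words)]
}

# Guess the language of one paragraph by counting key-word hits per language.
# Returns NA if no key word is found; ties go to the first language listed.
guess_language <- function(paragraph, key_words) {
  words <- tokenize(paragraph)
  scores <- vapply(key_words, function(kw) sum(words %in% kw), integer(1))
  if (all(scores == 0L)) return(NA_character_)
  names(scores)[which.max(scores)]
}

paragraphs <- c(
  "O gato dorme em uma casa com os livros do professor.",
  "El gato duerme en una casa con los libros del profesor.",
  "The cat sleeps in a house with the books of the teacher."
)

guesses <- vapply(paragraphs, guess_language, character(1),
                  key_words = key_words, USE.NAMES = FALSE)

# Keep only the paragraphs guessed to be Portuguese.
pt_paragraphs <- paragraphs[guesses %in% "portuguese"]
print(pt_paragraphs)
```

This only works when each paragraph is written in a single language and is long enough to contain a few function words. For a stream of isolated words in random order the counts carry almost no information.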