Hi list

Using the tm package, part of the pre-processing work is to remove words, etc. from the corpus.

I wish to remove people's names and also their initials, which are peppered throughout the corpus. But some people's initials are the same as parts of common words, so removing them mangles those words: e.g. removing 'am' turns 'became' into 'bec e', removing 'ec' turns 'because' into 'b ause', and removing 'ar' turns 'arrival' into 'rival' (which has a completely different meaning).

Is there any way of doing this without leaving a trail of nonsense half-terms behind? I suspect that it might have something to do with regular expressions, but to be honest, I'm (currently) pretty crap with those.

Would it make a difference if I removed initials and names *prior* to converting all text to lower case? That way I would remove 'AM', and because 'became' is lower case, it should remain unaffected.

Any recommendations on how best to proceed with this?

Thanks as always.
Sun
Hi Sun,

In fact, case sensitivity is the default in functions like "sub". The problem may then become separating initials from acronyms if they are present in the corpus:

gsub("NM","","An NMR was performed on NM Jones")
[1] "An R was performed on  Jones"

How you are going to deal with names like York may also be tricky:

gsub("York","","Reginald York took a holiday in New York.")
[1] "Reginald  took a holiday in New ."

Jim

On Fri, Apr 10, 2015 at 8:19 PM, Sun Shine <phaedrusv at gmail.com> wrote:
> Would it make a difference if I removed initials and names *prior* to
> converting all text to lower case, so I remove 'AM' and because 'became' is
> lower case, it should remain unaffected?
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
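One way to avoid the substring damage discussed above is to anchor each pattern at word boundaries with \\b, so that an initial is removed only when it stands alone as a whole word. The following is a minimal sketch in base R, not something proposed in the thread itself; the function name remove_initials and the example sentences are hypothetical.

```r
# Hypothetical example sentences; "NM" and "AM" stand in for initials.
txt <- c("An NMR was performed on NM Jones",
         "AM said it became clear because of the arrival")

# Remove each initial only where it appears as a whole word.
remove_initials <- function(x, initials) {
  # \\b is a word boundary, so "NM" inside "NMR" does not match.
  pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)  # case-sensitive by default
  x <- gsub("\\s+", " ", x)               # collapse the gaps left behind
  trimws(x)
}

remove_initials(txt, c("NM", "AM"))
# [1] "An NMR was performed on Jones"
# [2] "said it became clear because of the arrival"
```

Word boundaries protect "became", "because" and "arrival" even after the text has been lower-cased, but a genuine word such as "am" would still be removed at that point, which is one reason to strip initials while the case distinction is still available. They do not help with the York example: a surname that is also a place name is a whole word in both uses.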
Hey Jim

So far I've re-run the process. *Before* converting to lower case, I replaced initials and proper names with blank space, and changed other names (including acronyms) to something less tricky (the NMR in your first example therefore becomes "NucMagRes", etc.). By and large, that seems to cut it, at least for my present purposes.

I don't have a workaround for your second example though! One really has to have a reasonably decent handle on the scope of the variations and the text content first. I'm not sure how one would do this kind of thing effectively on a large and unseen corpus.

Anyway, thanks for your reply and thoughts.

Sun

On 10/04/15 11:38, Jim Lemon wrote:
> How you are going to deal with names like York may also be tricky:
>
> gsub("York","","Reginald York took a holiday in New York.")
> [1] "Reginald  took a holiday in New ."
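The order of operations described in the reply above (rename acronyms, blank out initials and names, and only then convert to lower case) can be sketched in base R as follows. This is an illustration under assumptions, not the code actually used: the function name clean_text and both lookup tables are hypothetical, and the word-boundary anchors (\\b) are an addition that the message does not mention.

```r
# Hypothetical lookup tables; a real corpus would need its own lists.
acronyms <- c(NMR = "NucMagRes")   # acronym -> safer replacement
remove   <- c("NM", "Jones")       # initials and names to blank out

clean_text <- function(x, acronyms, remove) {
  # 1. Rename acronyms first so they cannot be mistaken for initials.
  for (a in names(acronyms)) {
    x <- gsub(paste0("\\b", a, "\\b"), acronyms[[a]], x, perl = TRUE)
  }
  # 2. Blank out initials and names while case still distinguishes them.
  pattern <- paste0("\\b(", paste(remove, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)
  # 3. Only now fold to lower case and tidy the whitespace.
  x <- tolower(x)
  trimws(gsub("\\s+", " ", x))
}

clean_text("An NMR was performed on NM Jones", acronyms, remove)
# [1] "an nucmagres was performed on"
```

As the message notes, this depends on knowing the names, initials and acronyms in advance, so it does not carry over directly to a large, unseen corpus.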
"I suspect that it might have something to do with regular expressions, but to be honest, I'm (currently) pretty crap with those."

I cannot think of a better incentive to take action on this hole in your education and buckle down to learn regular expressions. There are many books and tutorials available.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.

On April 10, 2015 3:19:51 AM PDT, Sun Shine <phaedrusv at gmail.com> wrote:
>Is there any way of doing this without leaving a trail of nonsense
>half-terms behind? I suspect that it might have something to do with
>regular expressions, but to be honest, I'm (currently) pretty crap with
>those.
Thanks Jeff.

I'll add that to the ever-growing list my current studies are generating daily. :-)

Cheers
S

On 10/04/15 14:32, Jeff Newmiller wrote:
> I cannot think of a better incentive to take action on this hole in your education and buckle down to learn regular expressions. There are many books and tutorials available.