Thanks Jeff.

I'll add that to the ever-growing list my current studies are generating
daily. :-)

Cheers
S

On 10/04/15 14:32, Jeff Newmiller wrote:
> "I suspect that it might have something to do with regular expressions,
> but to be honest, I'm (currently) pretty crap with those."
>
> I cannot think of a better incentive to take action on this hole in your
> education and buckle down to learn regular expressions. There are many
> books and tutorials available.
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                       Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------
> Sent from my phone. Please excuse my brevity.
>
> On April 10, 2015 3:19:51 AM PDT, Sun Shine <phaedrusv at gmail.com> wrote:
>> Hi list
>>
>> Using the tm package, part of the pre-processing work is to remove
>> words, etc. from the corpus.
>>
>> I wish to remove people's names and also their initials, which are
>> peppered throughout the corpus. But some people's initials are the same
>> as parts of common words, so removing them mangles those words - e.g.
>> 'am' = 'became' => 'bec e' or 'ec' = 'because' => 'b ause' or
>> 'ar' = 'arrival' => 'rival' (which has a completely different meaning).
>>
>> Is there any way of doing this without leaving a trail of nonsense
>> half-terms behind? I suspect that it might have something to do with
>> regular expressions, but to be honest, I'm (currently) pretty crap with
>> those.
>>
>> Would it make a difference if I removed initials and names *prior* to
>> converting all text to lower case, so that I remove 'AM' and, because
>> 'became' is lower case, it remains unaffected?
>>
>> Any recommendations on how best to proceed with this?
>>
>> Thanks as always.
>> Sun
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

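A minimal sketch of the regular-expression idea raised in the message above,
assuming base R only (no tm calls): wrapping the initials in word boundaries
(\b) removes them only where they stand alone as whole words, and matching
upper-case initials *before* lower-casing is one way to leave ordinary
lower-case words untouched. The sample sentences, the initials vector and the
helper squish() are made up for illustration and are not from the thread.

```r
# Hypothetical sample data (not from the original corpus)
initials <- c("am", "ec", "ar")
text <- c("AM said it became clear",
          "the arrival of EC and AR",
          "I am here because AM left")

# Helper (hypothetical name): collapse runs of whitespace and trim the ends
squish <- function(x) trimws(gsub("\\s+", " ", x))

# Naive substring removal, for contrast: this is what mangles 'became' etc.
naive <- gsub(paste(initials, collapse = "|"), "", tolower(text))

# Whole-word removal: \b(am|ec|ar)\b matches only stand-alone tokens
pattern <- sprintf("\\b(%s)\\b", paste(initials, collapse = "|"))
whole <- squish(gsub(pattern, "", tolower(text), perl = TRUE))

# Case-sensitive removal of upper-case initials, then lower-casing
upper_pattern <- sprintf("\\b(%s)\\b",
                         paste(toupper(initials), collapse = "|"))
before_lower <- squish(tolower(gsub(upper_pattern, "", text, perl = TRUE)))

print(whole)
print(before_lower)
```

On this sample, the whole-word version keeps 'became', 'arrival' and
'because' intact but still drops the verb 'am' from the third sentence; the
case-sensitive version keeps it, at the cost of relying on initials always
being written in upper case.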
Hi Sun,
No, I was thinking of something like hunspell, which seems to fit into
the sort of work that you are doing.

Jim

On Fri, Apr 10, 2015 at 11:42 PM, Sun Shine <phaedrusv at gmail.com> wrote:
> Thanks Jeff.
>
> I'll add that to the ever-growing list my current studies are generating
> daily. :-)
>
> Cheers
> S

Hi Jim

The name's come up on my radar, but that's about it. I'll look into it.

Thanks for the reference.

All the best
S

On 10/04/15 23:36, Jim Lemon wrote:
> Hi Sun,
> No, I was thinking of something like hunspell, which seems to fit into
> the sort of work that you are doing.
>
> Jim