Hey Jim

So far I've re-run the process and sub'bed initials and proper names with
blank space, and changed other names (including acronyms) to something less
tricky (your e.g. #1 NMR is therefore "NucMagRes", etc.) *before* I
converted to lower case. By and large, that seems to cut it, at least for
my present purposes.

I don't have a workaround for your e.g. #2 though!

One really has to have a relatively decent handle on the scope of the
variations and text content first. I'm not sure how one would do this kind
of thing effectively on a large and unseen corpus.

Anyway, thanks for your reply and thoughts.

Sun

On 10/04/15 11:38, Jim Lemon wrote:
> Hi Sun,
> In fact, case sensitivity is the default in functions like "sub". The
> problem may then become separating initials from acronyms if they are
> present in the corpus:
>
> gsub("NM","","An NMR was performed on NM Jones")
> [1] "An R was performed on  Jones"
>
> How you are going to deal with names like York may also be tricky:
>
> gsub("York","","Reginald York took a holiday in New York.")
> [1] "Reginald  took a holiday in New ."
>
> Jim
>
> On Fri, Apr 10, 2015 at 8:19 PM, Sun Shine <phaedrusv at gmail.com> wrote:
>
>> Hi list
>>
>> Using the tm package, part of the pre-processing work is to remove
>> words, etc. from the corpus.
>>
>> I wish to remove people's names and also their initials, which are
>> peppered throughout the corpus. But some people's initials are the
>> same as parts of common words - e.g. 'am' = 'became' => 'bec e' or
>> 'ec' = 'because' => 'b ause' or 'ar' = 'arrival' => 'rival' (which
>> has a completely different meaning).
>>
>> Is there any way of doing this without leaving a trail of nonsense
>> half-terms behind? I suspect that it might have something to do
>> with regular expressions, but to be honest, I'm (currently) pretty
>> crap with those.
>>
>> Would it make a difference if I removed initials and names *prior*
>> to converting all text to lower case, so I remove 'AM' and, because
>> 'became' is lower case, it should remain unaffected?
>>
>> Any recommendations on how best to proceed with this?
>>
>> Thanks as always.
>> Sun
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

[[alternative HTML version deleted]]
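On the regular-expression question raised in this exchange, here is a minimal sketch of one possible approach. It assumes the initials are known in advance and are removed while the text is still in its original case. The `initials` vector and the sample sentence are made up for illustration and do not come from the thread's corpus. `\\b` is the regex word-boundary anchor, and `perl = TRUE` asks `gsub` to use the PCRE engine.

```r
# Hypothetical initials to strip (illustrative only, not from the real corpus)
initials <- c("NM", "AM", "EC", "AR")

# \\b anchors each alternative at word boundaries, so the "NM" inside "NMR"
# and the lower-case "am" inside "became" are both left alone.
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")

x <- "An NMR was performed on NM Jones, who became AM Smith's colleague"

cleaned <- gsub(pattern, "", x, perl = TRUE)

# Collapse the runs of spaces left behind by the removals
cleaned <- gsub("\\s+", " ", trimws(cleaned))

cleaned
# [1] "An NMR was performed on Jones, who became Smith's colleague"
```

This does not help with the "York" example: a word boundary cannot tell a surname from a place name, so that case still needs knowledge of the corpus.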
Hi Sun,
Good thinking. Looking at your reply, I realized that you may be able to run
a spell checker over the output to pick up mangled words.

Jim

On Fri, Apr 10, 2015 at 9:17 PM, Sun Shine <phaedrusv at gmail.com> wrote:
> Hey Jim
>
> So far I've re-run the process and sub'bed initials and proper names with
> blank space, and changed other names (including acronyms) to something less
> tricky (your e.g. #1 NMR is therefore "NucMagRes", etc.) *before* I
> converted to lower case. By and large, that seems to cut it, at least for
> my present purposes.
>
> I don't have a workaround for your e.g. #2 though!
>
> One really has to have a relatively decent handle on the scope of the
> variations and text content first. I'm not sure how one would do this kind
> of thing effectively on a large and unseen corpus.
>
> Anyway, thanks for your reply and thoughts.
>
> Sun
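The idea of checking the output for mangled words can be approximated without a dictionary. The sketch below is a swapped-in technique, not a real spell checker: it flags any token that appears after cleaning but never occurred in the original text. The sample sentence, the deliberately naive pattern, and the `tokenize` helper are all assumptions made for illustration.

```r
original <- "I am sure she became ill because of the arrival"

# Deliberately naive removal, reproducing the problem described in the thread
mangled <- gsub("am|ec|ar", " ", original)

# Hypothetical helper: lower-case, split on anything that is not a letter or
# an apostrophe, and drop empty strings
tokenize <- function(s) {
  words <- unlist(strsplit(tolower(s), "[^a-z']+"))
  words[nzchar(words)]
}

# Tokens present after cleaning that were never in the original text
suspects <- setdiff(tokenize(mangled), tokenize(original))

suspects
# [1] "b"     "e"     "ause"  "rival"
```

A check like this only catches fragments that are new to the text. A fragment that happens to be a word already used elsewhere in the corpus would slip through, which a dictionary-based checker would not catch either.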
Thanks Jim

Can you say more about an R spell checker, or were you thinking of opening
the parsed documents in a word processor, e.g. LibreOffice?

After stemming the documents, most of the words are mangled, e.g. 'people'
becomes 'peopl', so I think the spell checker would go crazy!

I think a lot of this comes down to the order in which one runs the
different transformations.

Cheers
Sun

On 10/04/15 12:30, Jim Lemon wrote:
> Hi Sun,
> Good thinking. Looking at your reply, I realized that you may be able
> to run a spell checker over the output to pick up mangled words.
>
> Jim
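The point about ordering can be illustrated with a small base R sketch. It assumes a made-up sentence and a single hypothetical initial, "AM"; the `squash` helper is also an assumption, used only to tidy whitespace. Stemming is not shown. Given the 'peopl' observation, it would presumably belong after any step that relies on whole words.

```r
text <- "AM Jones said I am sure"

# Hypothetical helper: trim the ends and collapse repeated spaces
squash <- function(s) gsub("\\s+", " ", trimws(s))

# Order A: remove the initials while case still distinguishes them from
# ordinary words, then convert to lower case.
order_a <- squash(tolower(gsub("\\bAM\\b", "", text, perl = TRUE)))

# Order B: convert to lower case first. The initials are now "am", so
# removing them also removes the genuine word "am".
order_b <- squash(gsub("\\bam\\b", "", tolower(text), perl = TRUE))

order_a
# [1] "jones said i am sure"
order_b
# [1] "jones said i sure"
```

With the tm package the same ordering concern would apply to the sequence of transformations applied to the corpus; the base R version is used here only to keep the example self-contained.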