thr3ads.net - R help - [R] Removing words and initials with tm [Apr 2015]

Sun Shine

2015-Apr-10 11:17 UTC

[R] Removing words and initials with tm

Hey Jim

So far I've re-run the process and sub'bed initials and proper names 
with blank space, and changed other names (including acronyms) to 
something less tricky (your e.g. #1 NMR is therefore "NucMagRes", etc.) 
*before* I converted to lower case. By and large, that seems to cut it, 
at least for my present purposes.

I don't have a workaround for your e.g. #2 though!

One really has to have a relatively decent handle on the scope of the 
variations and text content first. I'm not sure how one would do this 
kind of thing effectively on a large and unseen corpus.

Anyway, thanks for your reply and thoughts.

Sun
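
The ordering described above can be sketched in base R as follows. This is only an illustration under assumptions: the lookup tables (`acronyms`, `initials`, `names_to_drop`) and the helper `preprocess()` are hypothetical, and the `\\b` word boundaries are an extra safeguard that the post itself does not mention.

```r
# Hypothetical lookup tables; in practice they come from knowing the corpus.
acronyms      <- c(NMR = "NucMagRes")
initials      <- c("NM", "AM")
names_to_drop <- c("Jones")

preprocess <- function(text) {
  # 1. Swap each acronym for a placeholder that cannot be mistaken for initials.
  for (a in names(acronyms)) {
    text <- gsub(paste0("\\b", a, "\\b"), acronyms[[a]], text, perl = TRUE)
  }
  # 2. Replace initials and proper names with a blank while case still
  #    distinguishes them from ordinary lower-case words.
  drop <- paste0("\\b(", paste(c(initials, names_to_drop), collapse = "|"), ")\\b")
  text <- gsub(drop, " ", text, perl = TRUE)
  # 3. Only now fold to lower case and squeeze the leftover whitespace.
  trimws(gsub("\\s+", " ", tolower(text)))
}

preprocess("An NMR was performed on NM Jones who became ill")
# [1] "an nucmagres was performed on who became ill"
```

Because the removal runs before `tolower()`, 'AM' is dropped while 'became' is never touched.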

On 10/04/15 11:38, Jim Lemon wrote:
> Hi Sun,
> In fact, case sensitivity is the default in functions like "sub". The
> problem may then become separating initials from acronyms if they are 
> present in the corpus:
>
> gsub("NM","","An NMR was performed on NM Jones")
> [1] "An R was performed on  Jones"
>
> How you are going to deal with names like York may also be tricky:
>
> gsub("York","","Reginald York took a holiday in New York.")
> [1] "Reginald  took a holiday in New ."
>
> Jim
>
>
> On Fri, Apr 10, 2015 at 8:19 PM, Sun Shine <phaedrusv at gmail.com> wrote:
>
>     Hi list
>
>     Using the tm package, part of the pre-processing work is to remove
>     words, etc. from the corpus.
>
>     I wish to remove people's names and also their initials which are
>     peppered throughout the corpus. But some people's initials are the
>     same as parts of common words, so removing them mangles those
>     words - e.g. 'am' = 'became' => 'bec e' or 'ec' = 'because' =>
>     'b ause' or 'ar' = 'arrival' => 'rival' (which has a completely
>     different meaning).
>
>     Is there any way of doing this without leaving a trail of nonsense
>     half-terms behind? I suspect that it might have something to do
>     with regular expressions, but to be honest, I'm (currently) pretty
>     crap with those.
>
>     Would it make a difference if I removed initials and names *prior*
>     to converting all text to lower case, so I remove 'AM' and because
>     'became' is lower case, it should remain unaffected?
>
>     Any recommendations on how best to proceed with this?
>
>     Thanks as always.
>     Sun
>
>     ______________________________________________
>     R-help at r-project.org mailing list --
>     To UNSUBSCRIBE and more, see
>     https://stat.ethz.ch/mailman/listinfo/r-help
>     PLEASE do read the posting guide
>     http://www.R-project.org/posting-guide.html
>     and provide commented, minimal, self-contained, reproducible code.

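
One regular-expression answer to the original question is to match initials only as whole words. The sketch below uses the `\\b` word-boundary anchor with `perl = TRUE`. The helper name `remove_tokens()` is hypothetical, and it assumes the tokens contain no regex metacharacters.

```r
# Hypothetical helper: remove whole-word, case-sensitive matches only.
remove_tokens <- function(text, tokens) {
  # \\b anchors each alternative at word boundaries, so "NM" cannot match
  # inside "NMR" and "AR" cannot match inside "arrival".
  pattern <- paste0("\\b(", paste(tokens, collapse = "|"), ")\\b")
  out <- gsub(pattern, "", text, perl = TRUE)
  # Squeeze the gaps left behind by the removed tokens.
  trimws(gsub("\\s+", " ", out))
}

remove_tokens("An NMR was performed on NM Jones", "NM")
# [1] "An NMR was performed on Jones"

remove_tokens("AM became ill because of the arrival", c("AM", "EC", "AR"))
# [1] "became ill because of the arrival"
```

Word boundaries do not help with the York example: "York" is a whole word both in "Reginald York" and in "New York", so telling them apart needs context, not just a pattern.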

Jim Lemon

2015-Apr-10 11:30 UTC

[R] Removing words and initials with tm

Hi Sun,
Good thinking. Looking at your reply, I realized that you may be able to
run a spell checker over the output to pick up mangled words.

Jim
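
A rough stand-in for the spell-checking idea, which needs no dictionary, is to flag any token in the cleaned text that never occurred in the original text. This is a swapped-in technique, not a real spell checker, and the helper names `tokenize()` and `find_mangled()` are hypothetical. A dictionary-based check (for example `utils::aspell()`, which relies on an external spell-checking program, or the hunspell package) would be the closer match to what is suggested above.

```r
# Split text into lower-case word tokens.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  words[nzchar(words)]
}

# Tokens in the cleaned text that never occurred in the original are suspects.
find_mangled <- function(original, cleaned) {
  setdiff(tokenize(cleaned), tokenize(original))
}

original <- "AM became ill because of the arrival"
# The naive, substring-level removal that caused the trouble in this thread.
cleaned  <- gsub("am|ec|ar", " ", tolower(original))

find_mangled(original, cleaned)
# [1] "b"     "e"     "ause"  "rival"
```

The check has a blind spot: a fragment such as 'rival' goes unnoticed if the genuine word 'rival' also appears somewhere in the original corpus.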



Sun Shine

2015-Apr-10 11:37 UTC

[R] Removing words and initials with tm

Thanks Jim

Can you say more about an R spell checker, or were you thinking of 
opening the parsed documents in a word processor, e.g. LibreOffice?

After stemming the documents, most of the words are mangled, e.g. 
'people' becomes 'peopl' so I think the spell checker would go crazy! I 
think a lot of this comes down to which sequence one runs the different 
transformations in.

Cheers
Sun
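
To make the sequencing point concrete, here is one possible ordering as a tm pipeline. It is a sketch under assumptions: the two sample documents and the patterns are made up for illustration, the `swap` transformer is a hypothetical helper, and `stemDocument()` additionally needs the SnowballC package. Any spell check runs on the copy taken before stemming, so stems such as 'peopl' never reach it.

```r
library(tm)  # stemDocument() additionally needs the SnowballC package

# Made-up sample documents, for illustration only.
docs <- c("An NMR was performed on NM Jones",
          "AM said the people became ill because of the arrival")
corpus <- VCorpus(VectorSource(docs))

# Case-sensitive pattern replacement, wrapped for use with tm_map().
swap <- content_transformer(function(x, pattern, replacement) {
  gsub(pattern, replacement, x, perl = TRUE)
})

corpus <- tm_map(corpus, swap, "\\bNMR\\b", "NucMagRes")    # 1. protect acronyms
corpus <- tm_map(corpus, swap, "\\b(NM|AM|Jones)\\b", " ")  # 2. drop initials and names
corpus <- tm_map(corpus, content_transformer(tolower))      # 3. fold case
corpus <- tm_map(corpus, stripWhitespace)                   # 4. tidy gaps
checked <- corpus                                           # spell check this copy
corpus <- tm_map(corpus, stemDocument)                      # 5. stem last

content(checked[[2]])  # unstemmed text, suitable for a spell check
content(corpus[[2]])   # stemmed text, for the document-term matrix
```

As far as I can tell, tm's own `removeWords()` also matches on word boundaries, so it may be an alternative to step 2 once the text is already in lower case. In that case upper case no longer separates 'AM' from the ordinary word 'am'.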

