thr3ads.net - R help - [R] Removing words and initials with tm [Apr 2015]

Sun Shine

2015-Apr-10 10:19 UTC

[R] Removing words and initials with tm

Hi list

Using the tm package, part of the pre-processing work is to remove 
words, etc. from the corpus.

I wish to remove people's names and also their initials, which are 
peppered throughout the corpus. But some people's initials are the same 
as parts of common words, so removing them mangles those words - e.g. 
'am': 'became' => 'bec e', or 'ec': 'because' => 'b ause', or 
'ar': 'arrival' => 'rival' (which has a completely different meaning).

Is there any way of doing this without leaving a trail of nonsense 
half-terms behind? I suspect that it might have something to do with 
regular expressions, but to be honest, I'm (currently) pretty crap with 
those.

Would it make a difference if I removed initials and names *prior* to 
converting all text to lower case, so that I remove 'AM' and, because 
'became' is lower case, it remains unaffected?

Any recommendations on how best to proceed with this?

Thanks as always.
Sun

Jim Lemon

2015-Apr-10 10:38 UTC

Hi Sun,
In fact, case sensitivity is the default in functions like "sub". The
problem may then become separating initials from acronyms if they are
present in the corpus:

gsub("NM","","An NMR was performed on NM Jones")
[1] "An R was performed on  Jones"
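
One possible way around the acronym problem is to anchor the pattern to word boundaries with `\\b`, so that only whole tokens are removed. The following is a minimal base R sketch, not something posted in the thread; the `initials` vector and the whitespace clean-up step are illustrative assumptions.

```r
# Hypothetical list of initials to strip (assumption for illustration).
initials <- c("NM")

# Build one pattern that matches any listed initial as a whole word only.
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")

text <- "An NMR was performed on NM Jones"

# "NM" inside "NMR" has no word boundary after it, so it is left alone.
cleaned <- gsub(pattern, "", text, perl = TRUE)

# Collapse the double space left behind by the removal.
cleaned <- gsub("\\s+", " ", trimws(cleaned))
print(cleaned)

# The same idea applies to lower-cased text: "am" inside "became" survives.
lower_cleaned <- trimws(gsub("\\bam\\b", "", "he became ill at 9 am", perl = TRUE))
print(lower_cleaned)
```

With word boundaries in place, the order of lower-casing and removal matters less, although short initials such as "am" would still collide with genuine whole words like "am" or "AM".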

How you are going to deal with names like York may also be tricky:

gsub("York","","Reginald York took a holiday in New York.")
[1] "Reginald  took a holiday in New ."
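
Word boundaries alone do not help here, because both occurrences of "York" are whole words. One hedged option, when the place names to protect are known in advance, is a negative lookbehind; this sketch assumes "New York" is the only protected phrase and is not a general solution for an unseen corpus.

```r
text <- "Reginald York took a holiday in New York."

# Negative lookbehind (requires perl = TRUE): match "York" as a whole word
# only when it is NOT immediately preceded by "New ".
cleaned <- gsub("(?<!New )\\bYork\\b", "", text, perl = TRUE)

# Collapse the double space left where the surname was removed.
cleaned <- gsub("\\s+", " ", cleaned)
print(cleaned)
```

Each additional protected phrase would need its own lookbehind or a placeholder substitution, so this approach relies on knowing the corpus reasonably well.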

Jim



Sun Shine

2015-Apr-10 11:17 UTC

Hey Jim

So far I've re-run the process and sub'bed initials and proper names 
with blank space, and changed other names (including acronyms) to 
something less tricky (your e.g. #1 NMR is therefore "NucMagRes", etc.) 
*before* I converted to lower case. By and large, that seems to cut it, 
at least for my present purposes.
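
The order of operations described above could be sketched roughly as follows in base R. The `protect` and `remove` vectors are hypothetical stand-ins for a hand-built list; only the "NMR" to "NucMagRes" substitution comes from the post itself.

```r
text <- "An NMR was performed on NM Jones"

# Step 1: swap tricky acronyms for placeholders that no initial can match.
protect <- c("NMR" = "NucMagRes")
for (acronym in names(protect)) {
  text <- gsub(acronym, protect[[acronym]], text, fixed = TRUE)
}

# Step 2: remove initials and proper names while case still distinguishes them.
remove <- c("NM", "Jones")  # hypothetical removal list
for (term in remove) {
  text <- gsub(term, "", text, fixed = TRUE)
}

# Step 3: tidy whitespace, then convert to lower case last.
result <- tolower(gsub("\\s+", " ", trimws(text)))
print(result)
```

Doing the case-sensitive removals before `tolower()` is what keeps lower-case words such as 'became' intact when 'AM' is removed.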

I don't have a workaround for your e.g. #2 though!

One really has to have a relatively decent handle on the scope of the 
variations and text content first. I'm not sure how one would do this 
kind of thing effectively on a large and unseen corpus.

Anyway, thanks for your reply and thoughts.

Sun


Jeff Newmiller

2015-Apr-10 13:32 UTC

"I suspect that it might have something to do with regular expressions, but
to be honest, I'm (currently) pretty crap with those."

I cannot think of a better incentive to take action on this hole in your
education and buckle down to learn regular expressions. There are many books and
tutorials available.


Sun Shine

2015-Apr-10 13:42 UTC

Thanks Jeff.

I'll add that to the ever-growing list my current studies are generating 
daily. :-)

Cheers
S


