thr3ads.net - R help - [R] Removing words and initials with tm [Apr 2015]

Sun Shine

2015-Apr-10 13:42 UTC

[R] Removing words and initials with tm

Thanks Jeff.

I'll add that to the ever-growing list my current studies are generating 
daily. :-)

Cheers
S


On 10/04/15 14:32, Jeff Newmiller wrote:
> "I suspect that it might have something to do with regular
> expressions, but to be honest, I'm (currently) pretty crap with those."
>
> I cannot think of a better incentive to take action on this hole in your
> education and buckle down to learn regular expressions. There are many
> books and tutorials available.
> ---------------------------------------------------------------------------
> Jeff Newmiller                        The     .....       .....  Go Live...
> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>                                        Live:   OO#.. Dead: OO#..  Playing
> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> ---------------------------------------------------------------------------
> Sent from my phone. Please excuse my brevity.
>
> On April 10, 2015 3:19:51 AM PDT, Sun Shine <phaedrusv at gmail.com> wrote:
>> Hi list
>>
>> Using the tm package, part of the pre-processing work is to remove
>> words, etc. from the corpus.
>>
>> I wish to remove people's names and also their initials, which are
>> peppered throughout the corpus. The problem is that some people's
>> initials are the same as parts of common words, so removing them
>> mangles those words: removing 'am' turns 'became' into 'bec e',
>> removing 'ec' turns 'because' into 'b ause', and removing 'ar' turns
>> 'arrival' into 'rival' (which has a completely different meaning).
>>
>> Is there any way of doing this without leaving a trail of nonsense
>> half-terms behind? I suspect that it might have something to do with
>> regular expressions, but to be honest, I'm (currently) pretty crap
>> with those.
>>
>> Would it make a difference if I removed initials and names *prior* to
>> converting all text to lower case, so that I remove 'AM' while 'became',
>> being lower case, remains unaffected?
>>
>> Any recommendations on how best to proceed with this?
>>
>> Thanks as always.
>> Sun
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
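
A minimal sketch of the regular-expression route suggested above, not a tested recipe for any particular corpus. It assumes the text is a plain character vector that still has its original case and that the initials appear in upper case; the sample sentences and the helper name `remove_initials` are hypothetical. The `\\b` word-boundary anchors let `gsub()` match an initial only when it stands alone as a whole word, so 'became', 'because' and 'arrival' are left intact, and matching case-sensitively before lower-casing also spares the ordinary word 'am'.

```r
# Hypothetical helper: strip initials only where they occur as whole words.
remove_initials <- function(text, initials) {
  # Builds a pattern such as \b(AM|EC|AR)\b; \b marks a word boundary.
  pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")
  # Case-sensitive removal, so lower-case words like "am" are not touched.
  without <- gsub(pattern, "", text, perl = TRUE)
  # Collapse the gaps left behind and trim leading/trailing spaces.
  trimws(gsub("[[:space:]]+", " ", without))
}

# Made-up sample text; the real corpus is not shown in the thread.
docs <- c("AM became chair because EC left before the arrival of AR",
          "I am sure AM agreed")

# Remove the initials first, then convert to lower case.
cleaned <- tolower(remove_initials(docs, c("AM", "EC", "AR")))
print(cleaned)
```

With a tm corpus, a function like this could presumably be applied through `tm_map()` wrapped in `content_transformer()`, before the lower-casing step.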

Jim Lemon

2015-Apr-10 22:36 UTC

[R] Removing words and initials with tm

Hi Sun,
No, I was thinking of something like hunspell, which seems to fit into the
sort of work that you are doing.

Jim



Sun Shine

2015-Apr-11 06:21 UTC

[R] Removing words and initials with tm

Hi Jim

The name's come up on my radar, but that's about it. I'll look into it.

Thanks for the reference.

All the best
S

