Richter-Dumke, Jonas
2011-Nov-08 12:48 UTC
[R] match first consecutive list of capitalized words in string
Dear R-Helpers,
this is my first post ever to a mailing list, so please feel free to point out
any misunderstandings on my side regarding the conventions of this mailing
list.
My problem:
Assuming the following character vector is given:
names <- c("filia Maria",
           "vidua Joh Dirck Kleve (oo 02.02.1732)",
           "Bernardus Engelb Franciscus Linde j.u.Doktor referendarius sereniss Judex et gograven Rheinensis")
Is there a regular expression matching the first consecutive list of capitalized
words in a single character string ("Maria", "Joh Dirck Kleve",
"Bernardus Engelb Franciscus Linde")?
Such an expression would very reliably separate the person names from the
additional information in my historic church register transcription.
Thank you very much for your effort,
Jonas
Peter Alspach
2011-Nov-10 02:24 UTC
[R] match first consecutive list of capitalized words in string
Tena koe Jonas
Something like the following may help, although you should probably read the
help on regexpr regarding locales.
Names <- c("filia Maria",
           "vidua Joh Dirck Kleve (oo 02.02.1732)",
           "Bernardus Engelb Franciscus Linde j.u.Doktor referendarius sereniss Judex et gograven Rheinensis")
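# Drop a leading word made up only of digits and lower-case letters
Names1 <- sub('^[0-9a-z]* ', '', Names)
Names1
# Position of the first space followed by a non-capital character (-1 if none)
ttReg <- regexpr(' [^A-Z]', Names1)
# Keep everything before that position; leave the string as is when there is none
ifelse(ttReg > 0, substring(Names1, 1, ttReg - 1), Names1)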
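Regarding the locale remark: in some locales a range such as [a-z] or [A-Z] may not cover exactly the letters one expects. A minimal sketch of the same two steps using POSIX character classes instead of ranges is given below. The object names `stripped`, `cut` and `result` are made up for illustration, and the sketch has only been checked against the three example strings.

```r
Names <- c("filia Maria",
           "vidua Joh Dirck Kleve (oo 02.02.1732)",
           "Bernardus Engelb Franciscus Linde j.u.Doktor referendarius sereniss Judex et gograven Rheinensis")

# Step 1: drop a leading word consisting only of lower-case letters or digits
stripped <- sub("^[[:lower:][:digit:]]* ", "", Names)

# Step 2: find the first space that is followed by a non-upper-case character
cut <- regexpr(" [^[:upper:]]", stripped)

# Keep the text before that space; keep the whole string if there is no such space
result <- ifelse(cut > 0, substring(stripped, 1, cut - 1), stripped)
result
```

Like the original, this removes at most one leading lower-case word, so an entry starting with two such words would need a different first step.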
Incidentally, it is not good practice to call your objects 'names' since
that is a function in R.
HTH ....
Peter Alspach
Gabor Grothendieck
2011-Nov-10 04:58 UTC
[R] match first consecutive list of capitalized words in string
On Tue, Nov 8, 2011 at 7:48 AM, Richter-Dumke, Jonas <Richter at demogr.mpg.de> wrote:
> Is there a regular expression matching the first consecutive list of
> capitalized words in a single character string ("Maria", "Joh Dirck
> Kleve", "Bernardus Engelb Franciscus Linde")?
> This expression would very reliably separate the person names from the
> additional information in my historic church register transcription.

Try this. It matches a word boundary followed by zero or more of the
parenthesized expression. That expression is an upper case letter followed by
zero or more lower case letters followed by one or more spaces. Finally we
match the last word, which consists of an upper case letter followed by zero or
more lower case letters and a word boundary. Note that it assumes R 2.14.0 or
later:

> re <- "\\b([[:upper:]][[:lower:]]* +)*[[:upper:]][[:lower:]]*\\b"
> regmatches(names, regexpr(re, names))
[1] "Maria"                             "Joh Dirck Kleve"
[3] "Bernardus Engelb Franciscus Linde"

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
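One caveat when applying the regmatches() approach to a whole register: regmatches() combined with regexpr() returns only the elements that matched, so the result can be shorter than the input and no longer line up with it. The sketch below keeps the alignment by filling in NA for entries without a capitalized word. The helper name `first_cap_run` and the third test entry are hypothetical and not from the thread.

```r
# Hypothetical helper: extract the first run of capitalized words from each
# element, returning NA where an element contains no capitalized word.
first_cap_run <- function(x) {
  re <- "\\b([[:upper:]][[:lower:]]* +)*[[:upper:]][[:lower:]]*\\b"
  m <- regexpr(re, x)                    # -1 marks elements without a match
  hit <- !is.na(m) & m > -1L
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)           # regmatches() drops the non-matches
  out
}

entries <- c("filia Maria",
             "vidua Joh Dirck Kleve (oo 02.02.1732)",
             "sine nomine")              # made-up entry without any capital
first_cap_run(entries)
```

The result has the same length as `entries`, so it can be stored as a new column next to the original transcription.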