thr3ads.net - R help - [R] extracting characters from a string [Jan 2013]

Biau David

2013-Jan-23 17:38 UTC

[R] extracting characters from a string

Dear All,

I have a data frame of vectors of publication names such as 'pub':

pub1 <- c('Brown DK, Santos R, Rome DF, Don Juan X')
pub2 <- c('Benigni D')
pub3 <- c('Arstra SD, Van den Hoops DD, lamarque D')

pub <- rbind(pub1, pub2, pub3)


I would like to construct a data frame containing only the authors' last names,
with each last name in its own column and each publication in its own row.
Basically, I want to get rid of the initials (max 2, always before a comma) and
the spaces surrounding each last name. I would like to avoid a loop.

PS: A short explanation of the code that extracts the values from the character
string would also be great!

 
David


arun

2013-Jan-23 17:58 UTC


Hi,
You could try this:
dat1 <- read.table(text = pub, sep = ",", fill = TRUE, stringsAsFactors = FALSE)
dat2 <- as.data.frame(do.call(cbind, lapply(dat1, function(x) gsub(" $", "", gsub("^ |\\w+$", "", x)))), stringsAsFactors = FALSE)

dat2
#        V1            V2       V3       V4
# 1   Brown        Santos     Rome Don Juan
# 2 Benigni
# 3  Arstra Van den Hoops lamarque
A.K.


Bert Gunter

2013-Jan-23 18:01 UTC


1. Study a regular expression tutorial on the web to learn how to do this.

2. ?regex in R summarizes (tersely, but clearly) R's regular expressions.

3. ?grep tells you about R's regular expression manipulation functions.
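
As a small illustration of the functions those help pages describe, the sketch below applies grepl(), sub(), gsub() and regexpr()/regmatches() to strings shaped like the ones in the question. The sample vector x and the patterns are assumptions chosen for illustration, not code posted in this thread. [[:upper:]] is used rather than A-Z because ?regex advises that character ranges depend on the locale.

```r
# Sample strings shaped like the comma-separated pieces in the question (assumed)
x <- c("Brown DK", " Santos R", " Don Juan X")

# grepl(): does the pattern occur? Here: does the string end in 1-2 capitals?
has_initials <- grepl("[[:upper:]]{1,2}$", x)

# sub(): replace the first match; here the trailing initials and the space before them
no_initials <- sub(" [[:upper:]]{1,2}$", "", x)

# gsub(): replace every match; here a single space at either end of the string
trimmed <- gsub("^ | $", "", no_initials)

# regexpr() locates the first match in each string, regmatches() extracts it
initials <- regmatches(x, regexpr("[[:upper:]]{1,2}$", x))

trimmed   # "Brown" "Santos" "Don Juan"
initials  # "DK" "R" "X"
```

sub() is enough for the initials because each string has at most one match at its end, while gsub() is needed for the spaces because a string may have one at each end.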

-- Bert

-- 

Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm

Rui Barradas

2013-Jan-23 18:33 UTC


Hello,

Try the following.

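fun <- function(x, sep = ", "){
	# split all publication strings into one flat vector of "Name Initials" pieces
	s <- unlist(strsplit(x, sep))
	# keep only the first run of letters in each piece
	regmatches(s, regexpr("[[:alpha:]]*", s))
}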

fun(pub)
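
Note that fun() returns a single flat vector and keeps only the first run of letters, so "Don Juan X" yields "Don" and the grouping by publication is lost. A possible variant, sketched here under assumptions, keeps multi-word last names and pads each publication to the same width so the result can become a data frame. The helper name last_names() is hypothetical, pub is assumed to be a plain character vector, and lengths() requires R 3.2.0 or later.

```r
pub <- c("Brown DK, Santos R, Rome DF, Don Juan X",
         "Benigni D",
         "Arstra SD, Van den Hoops DD, lamarque D")

# Hypothetical helper, not from the thread
last_names <- function(x, sep = ",") {
  # one character vector of "Name Initials" pieces per publication
  parts <- strsplit(x, sep, fixed = TRUE)
  parts <- lapply(parts, function(p) {
    p <- gsub("^\\s+|\\s+$", "", p)         # trim surrounding whitespace
    sub("\\s+[[:upper:]]{1,2}$", "", p)     # drop the trailing 1-2 initials
  })
  n <- max(lengths(parts))                  # author count of the longest publication
  # pad shorter publications with empty strings so every row has n entries
  padded <- lapply(parts, function(p) c(p, rep("", n - length(p))))
  as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
}

res <- last_names(pub)
res$V4  # "Don Juan" "" ""
```

The lapply() calls iterate internally, so no explicit for loop is written, which is the usual reading of "avoid a loop" in R.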


Hope this helps,

Rui Barradas

arun

2013-Jan-24 06:37 UTC


Hi David,


It could be related to spaces in the data, or to something else.
Suppose the data have some spaces at the beginning or the end, as pub3 does here:
pub1 <- c('Brown DK, Santos R, Rome DF, Don Juan X')
pub2 <- c('Benigni D')
pub3 <- c('Arstra SD, Van den Hoops DD, lamarque D ')

pubnew <- rbind(pub1, pub2, pub3)
# assumes dat1 is rebuilt from pubnew, as the output below implies
dat1 <- read.table(text = pubnew, sep = ",", fill = TRUE, stringsAsFactors = FALSE)
res <- as.data.frame(do.call(cbind, lapply(dat1, function(x) gsub("^ | $", "", gsub("[A-Za-z]+$", "", gsub(" $", "", x))))), stringsAsFactors = FALSE)
str(res)
#'data.frame':	3 obs. of  4 variables:
# $ V1: chr  "Brown" "Benigni" "Arstra"
# $ V2: chr  "Santos" "" "Van den Hoops"
# $ V3: chr  "Rome" "" "lamarque"
# $ V4: chr  "Don Juan" "" ""



# If I used the previous solution:
as.data.frame(do.call(cbind, lapply(dat1, function(x) gsub(" $", "", gsub("^ |\\w+$", "", x)))), stringsAsFactors = FALSE)
#        V1            V2         V3       V4
# 1   Brown        Santos       Rome Don Juan
# 2 Benigni
# 3  Arstra Van den Hoops lamarque D            # initial present.

I tried this case with Rui's solution:
fun2(pubnew)
#[[1]]
#[1] " Brown"   "Santos"   "Rome"     "Don Juan"
#
#[[2]]
#[1] "Benigni"
#
#[[3]]
#[1] "Arstra"        "Van den Hoops" "lamarque D"   # initials present.

As Rui's solution works for you, the problem might be something else.
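
One way to make the read.table() approach less sensitive to stray spaces, sketched here as an assumption rather than as something tested on the real data, is to let read.table() trim each field with strip.white = TRUE and then remove the initials with a single sub(). The names dat_clean and res_clean are made up for this example, and the pattern assumes the initials are one or two capital letters at the end of a field.

```r
pubnew <- c("Brown DK, Santos R, Rome DF, Don Juan X",
            "Benigni D",
            "Arstra SD, Van den Hoops DD, lamarque D ")  # note the trailing space

# strip.white = TRUE trims leading and trailing whitespace from each
# comma-separated field while reading; fill = TRUE pads the short rows
dat_clean <- read.table(text = pubnew, sep = ",", fill = TRUE,
                        strip.white = TRUE, stringsAsFactors = FALSE)

# remove one or two capital letters at the end of a field, plus the space before them
res_clean <- as.data.frame(
  lapply(dat_clean, function(x) sub(" [[:upper:]]{1,2}$", "", x)),
  stringsAsFactors = FALSE
)

res_clean$V3  # "Rome" "" "lamarque"
```

Because the fields are already trimmed when they reach sub(), the trailing space in the third line no longer hides the final initial from the end-of-string anchor.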
A.K.





________________________________
From: Biau David <djmbiau at yahoo.fr>
To: arun <smartpink111 at yahoo.com> 
Sent: Thursday, January 24, 2013 12:40 AM
Subject: Re: [R] extracting characters from a string


Thanks a lot. It doesn't entirely work yet, probably because of the
format of the data I import. I have to look into it, and thanks to your
explanation I should be able to find the problem in the data.



David

>________________________________
> From: arun <smartpink111 at yahoo.com>
> To: Biau David <djmbiau at yahoo.fr>
> Sent: Wednesday, 23 January 2013, 19:06
> Subject: Re: [R] extracting characters from a string
> 
>Hi David,
>
>I forgot about the explanation part.
>
>dat1 <- read.table(text = pub, sep = ",", fill = TRUE, stringsAsFactors = FALSE)
># Here I converted the text to a data frame, with fields delimited by ",".
># I used fill = TRUE because the lines have unequal numbers of authors.
>
>as.data.frame(do.call(cbind, lapply(dat1, function(x) gsub(" $", "", gsub("^ |\\w+$", "", x)))), stringsAsFactors = FALSE)
>
># Splitting the code into smaller pieces:
>lapply(dat1, function(x) gsub("^ |\\w+$", "", x))
># lapply() applies the function to each column of the data frame and returns
># a list with one element per column. The pattern in the first pair of double
># quotes matches a space at the start of the string, or the last run of word
># characters at the end of the string, and gsub() removes whatever matches
># (the second pair of double quotes, the replacement, is empty).
>$V1
>[1] "Brown "   "Benigni " "Arstra "
>
>$V2
>[1] "Santos "        ""               "Van den Hoops "
>
>$V3
>[1] "Rome "     ""          "lamarque "
>
>$V4
>[1] "Don Juan " ""          ""
>
>lapply(dat1, function(x) gsub(" $", "", gsub("^ |\\w+$", "", x)))
># I used a second gsub() because some spaces remain at the end, e.g. "Brown ".
>$V1
>[1] "Brown"   "Benigni" "Arstra"
>
>$V2
>[1] "Santos"        ""              "Van den Hoops"
>
>$V3
>[1] "Rome"     ""         "lamarque"
>
>$V4
>[1] "Don Juan" ""         ""
>
>do.call(cbind, lapply(dat1, function(x) gsub(" $", "", gsub("^ |\\w+$", "", x))))  # bind by columns
>     V1        V2              V3         V4
>[1,] "Brown"   "Santos"        "Rome"     "Don Juan"
>[2,] "Benigni" ""              ""         ""
>[3,] "Arstra"  "Van den Hoops" "lamarque" ""
>
>Hope it helps.
>A.K.
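
The two gsub() steps explained above can also be checked on a single string, which makes it easier to see what each part of the pattern removes. The sample strings and variable names below are assumptions chosen for illustration.

```r
x <- " Don Juan X"

# "^ " matches one space at the start of the string, "\\w+$" matches the last
# run of word characters at the end, and "|" means "either of these".
# gsub() removes every match.
step1 <- gsub("^ |\\w+$", "", x)   # "Don Juan "

# " $" matches one space at the very end of the string
step2 <- gsub(" $", "", step1)     # "Don Juan"

# With a trailing space, "\\w+$" finds nothing: the string now ends in a
# space rather than a word character, so the initial survives both steps.
y <- " lamarque D "
y_out <- gsub(" $", "", gsub("^ |\\w+$", "", y))   # "lamarque D"
```

The last case is the situation described in the follow-up message, where a trailing space in the input left an initial in place.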