thr3ads.net - R help - [R] extracting character values [Jan 2013]

Biau David

2013-Jan-13 08:53 UTC

[R] extracting character values

Dear all,

I have a data frame of names (netw) in which each cell holds the last name and
initials of an author; some cells are NA. I would like to extract only the last
name from each cell into a new data frame called 'res'.


Here is what I do:

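res <- data.frame(matrix(NA, nrow=dim(netw)[1], ncol=dim(netw)[2]))

for (i in 1:dim(netw)[2])   # loop over the columns of netw
{
  # position of the first run of 3 or more lowercase letters in each cell
  wh <- regexpr('[a-z]{3,}', as.character(netw[,i]))
  res[i] <- substring(as.character(netw[,i]), wh,
                      wh + attr(wh,'match.length') - 1)
}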

 
The problem is that I cannot extract 'complex' names properly. For
' van der hoops bf  ', for example, I only get 'van', whereas the real
last name is 'van der hoops' and 'bf' are the initials.
The last name always contains at least 3 consecutive letters, but it may
consist of several groups of 3 or more letters separated by one or more
spaces; the cell may also start with a space; initials never have more than
2 letters.

Does anyone have a good idea for this? Thanks,


David
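
One way to read the rules stated above (optional leading space, a last name of one or more words, then initials of at most 2 letters) is to trim the cell and drop the trailing 1-2 letter token. The following is only a sketch under that reading; `extract_last_name` is a hypothetical helper name, not something from the thread, and it assumes every non-NA cell really ends with initials.

```r
# Hypothetical helper (not from the thread): keep everything except the initials.
# Assumption: each non-NA cell ends with initials of 1 or 2 letters.
extract_last_name <- function(x) {
  x <- as.character(x)                             # factors -> strings
  x <- gsub("^[[:space:]]+|[[:space:]]+$", "", x)  # trim both ends
  sub("[[:space:]]+[[:alpha:]]{1,2}$", "", x)      # drop the trailing initials
}

extract_last_name(c(" van der hoops bf  ", "biau dj", "riad s", NA))
# [1] "van der hoops" "biau"          "riad"          NA
```

Note that a last word of only 1 or 2 letters would be treated as initials and removed, so this only fits data that follow the rules given in the question.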


Biau David

2013-Jan-13 17:02 UTC

[R] extracting character values

OK,

here is a minimal working example:

au1 <- c('biau dj', 'jones kb', 'van den hoofs j', ' biau dj', 'biau dj',
         'campagna r', 'biau dj', 'weiss kr', 'verdegaal sh', 'riad s')
au2 <- c('weiss kr', 'ferguson pc', ' greidanus nv', ' porcher r',
         'ferguson pc', 'pessis e', 'leclerc p', 'biau dj', 'bovee jv',
         'biau d')
au3 <- c('bhumbra rs', 'lam b', 'garbuz ds', NA, 'chung p', ' biau dj',
         'marmor s', 'bhumbra r', 'pansuriya tc', NA)

netw <- data.frame(au1, au2, au3)
res <- data.frame(matrix(NA, nrow=dim(netw)[1], ncol=dim(netw)[2]))

for (i in 1:dim(netw)[2])
{
  wh <- regexpr('[a-z]{3,}', as.character(netw[,i]))
  res[i] <- substring(as.character(netw[,i]), wh,
                      wh + attr(wh,'match.length') - 1)
}

The problem is with the author "van den hoofs j", for whom only 'van' is
retrieved.

Thanks,


David Biau
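
The loop returns only 'van' because regexpr() reports the first match and the pattern '[a-z]{3,}' cannot extend across a space. A possible variation, shown here only as a sketch, lets the match continue over further words of 3 or more letters. The pattern below is an assumption that fits the example data; a surname containing a word of fewer than 3 letters (for instance a particle such as 'de') would still be cut short.

```r
# Sketch: allow further words of 3+ letters after the first one.
au <- c("van den hoofs j", " biau dj", "riad s", NA)

pattern <- "[a-z]{3,}( +[a-z]{3,})*"   # assumed pattern, not from the thread
wh <- regexpr(pattern, au)
last_names <- substring(au, wh, wh + attr(wh, "match.length") - 1)
last_names
# [1] "van den hoofs" "biau"          "riad"          NA
```

Keeping substring() rather than regmatches() preserves the length of the input, so NA cells stay NA and the result can still be assigned column by column.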

>________________________________
> From: arun <smartpink111@yahoo.com>
>To: Biau David <djmbiau@yahoo.fr> 
>Sent: Sunday, 13 January 2013 17:38
>Subject: Re: [R] extracting character values
> 
>Hi,
>
>
> res <- data.frame(matrix(NA, nrow=dim(netw)[1], ncol=dim(netw)[2]))
>#Error in matrix(NA, nrow = dim(netw)[1], ncol = dim(netw)[2]) : 
> # object 'netw' not found
>Can you provide an example dataset of netw?
>Thanks.
>A.K.
>
>______________________________________________
>R-help@r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

Uwe Ligges

2013-Jan-13 17:06 UTC

[R] extracting character values

On 13.01.2013 09:53, Biau David wrote:
> [...]
> Someone would have a nice idea for that? Thanks,
>
Maybe some people will, but an example of your data would actually help
them to help you.

Your code is not reproducible without providing the netw object.

Best,
Uwe Ligges


arun

2013-Jan-13 17:12 UTC

[R] extracting character values

Hi,

Not sure this helps:
netw<-read.table(text="
lastname_initial, year
Aaron H, 1900
Beecher HW, 1947
Cannon JP, 1985
Stone WC, 1982
 van der hoops bf, 1948
NA, 1976
",sep=",",header=TRUE,stringsAsFactors=FALSE)


res1<-sub("^[[:space:]]*(.*?)[[:space:]]*$","\\1",gsub("\\w+$","",netw[,1]))
res1[!is.na(res1)]
#[1] "Aaron"         "Beecher"       "Cannon"        "Stone"
#[5] "van der hoops"
A.K.
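
The one-liner above removes the last word before trimming, so a cell with trailing spaces (such as ' van der hoops bf  ' in the question) would keep its initials. A hedged variation is to trim first and then apply the same idea to every column with lapply(). `strip_initials` is a hypothetical name, and the small data frame is made up from values in the thread purely for illustration.

```r
# Sketch: trim first, then remove the last word, column by column.
# `strip_initials` is a hypothetical helper, not from the thread.
strip_initials <- function(x) {
  x <- gsub("^[[:space:]]+|[[:space:]]+$", "", as.character(x))
  sub("[[:space:]]*\\w+$", "", x)   # drop the last word and the gap before it
}

netw <- data.frame(au1 = c("biau dj", "van den hoofs j "),
                   au3 = c("bhumbra rs", NA),
                   stringsAsFactors = FALSE)

# lapply() visits each column; NA cells stay NA.
res <- as.data.frame(lapply(netw, strip_initials), stringsAsFactors = FALSE)
res
```

As with the original expression, a cell holding a single word would come back empty, so this assumes initials are always present.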



