Dear all,

I have a dataframe of names (netw), with each cell including the last name and initials of an author; some cells have NA. I would like to extract only the last name from each cell; this new dataframe is called 'res'.

Here is what I do:

res <- data.frame(matrix(NA, nrow=dim(netw)[1], ncol=dim(netw)[2]))

for (i in 1:dim(netw)[2])
{
  wh <- regexpr('[a-z]{3,}', as.character(netw[,i]))
  res[i] <- substring(as.character(netw[,i]), wh, wh + attr(wh,'match.length')-1)
}

The problem is that I cannot manage to extract 'complex' names properly, such as ' van der hoops bf ': here I only get 'van', whereas the real last name is 'van der hoops' and 'bf' are the initials. Basically, the last name always has a minimum of 3 consecutive letters, but may consist of several groups of 3 or more letters separated by one or more spaces; the cell may start with a space too; initials never have more than 2 letters.

Does anyone have a nice idea for that? Thanks,

David
OK, here is a minimal working example:

au1 <- c('biau dj', 'jones kb', 'van den hoofs j', ' biau dj', 'biau dj',
         'campagna r', 'biau dj', 'weiss kr', 'verdegaal sh', 'riad s')
au2 <- c('weiss kr', 'ferguson pc', ' greidanus nv', ' porcher r', 'ferguson pc',
         'pessis e', 'leclerc p', 'biau dj', 'bovee jv', 'biau d')
au3 <- c('bhumbra rs', 'lam b', 'garbuz ds', NA, 'chung p',
         ' biau dj', 'marmor s', 'bhumbra r', 'pansuriya tc', NA)

netw <- data.frame(au1, au2, au3)
res <- data.frame(matrix(NA, nrow=dim(netw)[1], ncol=dim(netw)[2]))

for (i in 1:dim(netw)[2])
{
  wh <- regexpr('[a-z]{3,}', as.character(netw[,i]))
  res[i] <- substring(as.character(netw[,i]), wh, wh + attr(wh,'match.length')-1)
}

The problem is with the author "van den hoofs j", who is only retrieved as 'van'.

Thanks,

David Biau

> ________________________________
> From: arun <smartpink111@yahoo.com>
> To: Biau David <djmbiau@yahoo.fr>
> Sent: Sunday, 13 January 2013, 17:38
> Subject: Re: [R] extracting character values
>
> Hi,
>
> res <- data.frame(matrix(NA, nrow=dim(netw)[1], ncol=dim(netw)[2]))
> #Error in matrix(NA, nrow = dim(netw)[1], ncol = dim(netw)[2]) :
> #  object 'netw' not found
>
> Can you provide an example dataset of netw?
> Thanks.
> A.K.
On 13.01.2013 09:53, Biau David wrote:
> [...]
> Someone would have a nice idea for that? Thanks,

Maybe some people will, but an example of your data would actually help them to help you. Your code is not reproducible without the netw object.

Best,
Uwe Ligges
Hi,

Not sure if this helps:

netw<-read.table(text="
lastname_initial, year
Aaron H, 1900
Beecher HW, 1947
Cannon JP, 1985
Stone WC, 1982
 van der hoops bf, 1948
NA, 1976
",sep=",",header=TRUE,stringsAsFactors=FALSE)

# gsub() drops the last word (the initials); sub() then trims leading and trailing whitespace
res1<-sub("^[[:space:]]*(.*?)[[:space:]]*$","\\1",gsub("\\w+$","",netw[,1]))
res1[!is.na(res1)]
#[1] "Aaron"         "Beecher"       "Cannon"        "Stone"
#[5] "van der hoops"

A.K.
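
The same idea can be applied to every column of a data frame like the one in David's example. What follows is only a sketch, not something posted in the thread: `last_name` is a hypothetical helper name, and it assumes the rule David describes (lowercase names, initials of at most 2 letters at the end of the cell, optional spaces around the cell). The data are a small subset of his example, plus the ' van der hoops bf ' cell from his first message.

```r
# Subset of the example data from the thread; NA cells are kept as NA.
netw <- data.frame(
  au1 = c("biau dj", "van den hoofs j", " van der hoops bf "),
  au2 = c(" greidanus nv", "biau d", "lam b"),
  au3 = c("bhumbra rs", NA, " biau dj"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from the thread): trim surrounding whitespace,
# then remove a trailing block of 1-2 lowercase letters (the initials).
last_name <- function(x) {
  x <- gsub("^[[:space:]]+|[[:space:]]+$", "", as.character(x))
  sub("[[:space:]]+[a-z]{1,2}$", "", x)
}

# Apply the helper column by column; the result has the same shape as netw.
res <- as.data.frame(lapply(netw, last_name), stringsAsFactors = FALSE)
res
```

Anchoring the pattern at the end of the string means the number of words in the last name does not matter. A cell that has no initials and ends in a word of only one or two letters would be cut short, so the sketch relies on the initials always being present.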