Dear All,

I have a data frame of vectors of publication names such as 'pub':

pub1 <- c('Brown DK, Santos R, Rome DF, Don Juan X')
pub2 <- c('Benigni D')
pub3 <- c('Arstra SD, Van den Hoops DD, lamarque D')

pub <- rbind(pub1, pub2, pub3)

I would like to construct a data frame containing only the authors' last names, with each last name in its own column and one publication per row. Basically, I want to get rid of the initials (max 2, always before a comma) and the spaces surrounding each last name. I would like to avoid a loop.

PS: If I could also have even a short explanation of the code that extracts the values from the character string, that would be great!

David
Hi,

You could try this:

dat1 <- read.table(text = pub, sep = ",", fill = TRUE, stringsAsFactors = FALSE)
dat2 <- as.data.frame(do.call(cbind, lapply(dat1, function(x) gsub(" $", "", gsub("^ |\\w+$", "", x)))), stringsAsFactors = FALSE)
dat2
#       V1            V2       V3       V4
#1   Brown        Santos     Rome Don Juan
#2 Benigni
#3  Arstra Van den Hoops lamarque

A.K.

----- Original Message -----
From: Biau David <djmbiau at yahoo.fr>
To: r help list <r-help at r-project.org>
Sent: Wednesday, January 23, 2013 12:38 PM
Subject: [R] extracting characters from a string
1. Study a regular expression tutorial on the web to learn how to do this.
2. ?regex in R summarizes (tersely, but clearly) R's regular expressions.
3. ?grep tells you about R's regular expression manipulation functions.

-- Bert

--
Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
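As a small illustration of the functions documented under ?grep and ?regmatches, the sketch below applies three of them to one author string from the question. The pattern (whitespace followed by one or two upper-case letters at the end of the string) is an assumption based on the "max 2 initials" description in the question; it is not something given in this reply.

```r
x <- "Van den Hoops DD"

# Assumed pattern: whitespace, then 1-2 upper-case initials, anchored at the end
pat <- "[[:space:]]+[[:upper:]]{1,2}$"

has_initials <- grepl(pat, x)                                   # does the pattern match?
last_name    <- sub(pat, "", x)                                 # delete the matched part
initials     <- regmatches(x, regexpr("[[:upper:]]{1,2}$", x))  # extract the match itself

print(has_initials)
print(last_name)
print(initials)
```

The POSIX classes [[:space:]] and [[:upper:]] are used instead of ranges such as [A-Z] because ?regex notes that the interpretation of ranges depends on the locale.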
Hello,

Try the following.

fun <- function(x, sep = ", "){
    # split each string on the separator and flatten the pieces into one vector
    s <- unlist(strsplit(x, sep))
    # extract the run of alphabetic characters at the start of each piece
    regmatches(s, regexpr("[[:alpha:]]*", s))
}

fun(pub)

Hope this helps,

Rui Barradas
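One thing to watch with the function above: "[[:alpha:]]*" stops at the first non-letter, so a multi-word surname such as "Van den Hoops DD" comes back as "Van", and unlist() merges all publications into one flat vector. The sketch below is one possible variant, not Rui's code. It assumes that the initials are one or two upper-case letters at the end of each entry, and last_names is a hypothetical helper name.

```r
pub <- c("Brown DK, Santos R, Rome DF, Don Juan X",
         "Benigni D",
         "Arstra SD, Van den Hoops DD, lamarque D")

# Hypothetical helper (not from the thread): returns a list with one
# character vector of last names per publication.
last_names <- function(x, sep = ",") {
  lapply(strsplit(x, sep, fixed = TRUE), function(s) {
    s <- trimws(s)                                # drop leading/trailing spaces
    sub("[[:space:]]+[[:upper:]]{1,2}$", "", s)   # drop the trailing initials (assumed format)
  })
}

res <- last_names(pub)
print(res)
```

Because trimws() runs before the initials are removed, stray spaces at the start or end of an entry do not prevent the pattern from matching.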
Hi David,

It could be related to spaces in the data or something else. Suppose the data has some spaces at the end or the beginning:

pub1 <- c('Brown DK, Santos R, Rome DF, Don Juan X')
pub2 <- c('Benigni D')
pub3 <- c('Arstra SD, Van den Hoops DD, lamarque D ')
pubnew <- rbind(pub1, pub2, pub3)
# rebuild dat1 from the new data (this step is implied by the output below)
dat1 <- read.table(text = pubnew, sep = ",", fill = TRUE, stringsAsFactors = FALSE)
res <- as.data.frame(do.call(cbind, lapply(dat1, function(x) gsub("^ | $", "", gsub("[A-Za-z]+$", "", gsub(" $", "", x))))), stringsAsFactors = FALSE)
str(res)
#'data.frame':    3 obs. of  4 variables:
# $ V1: chr  "Brown" "Benigni" "Arstra"
# $ V2: chr  "Santos" "" "Van den Hoops"
# $ V3: chr  "Rome" "" "lamarque"
# $ V4: chr  "Don Juan" "" ""

# If I used the previous solution:
as.data.frame(do.call(cbind, lapply(dat1, function(x) gsub(" $", "", gsub("^ |\\w+$", "", x)))), stringsAsFactors = FALSE)
#       V1            V2         V3       V4
#1   Brown        Santos       Rome Don Juan
#2 Benigni
#3  Arstra Van den Hoops lamarque D            # initial present.

I tried this case with Rui's solution:

fun2(pubnew)
#[[1]]
#[1] " Brown"   "Santos"   "Rome"     "Don Juan"
#
#[[2]]
#[1] "Benigni"
#
#[[3]]
#[1] "Arstra"        "Van den Hoops" "lamarque D"    # initials present.

As Rui's solution works for you, the problem might be something else.

A.K.

________________________________
From: Biau David <djmbiau at yahoo.fr>
To: arun <smartpink111 at yahoo.com>
Sent: Thursday, January 24, 2013 12:40 AM
Subject: Re: [R] extracting characters from a string

Thanks a lot. It doesn't entirely work well yet, probably because of the format of the data I import. I have to look into it, and thanks to your explanation I should be able to find the problem in the data.

David

>________________________________
> From: arun <smartpink111 at yahoo.com>
> To: Biau David <djmbiau at yahoo.fr>
> Sent: Wednesday, January 23, 2013 19:06
> Subject: Re: [R] extracting characters from a string
>
>Hi David,
>
>I forgot about the explanation part.
>
>dat1 <- read.table(text = pub, sep = ",", fill = TRUE, stringsAsFactors = FALSE)
># Here I converted it to a data frame, delimited by ",". I used fill=TRUE because you have an unequal number of authors in each line.
>as.data.frame(do.call(cbind, lapply(dat1, function(x) gsub(" $", "", gsub("^ |\\w+$", "", x)))), stringsAsFactors = FALSE)
>
># Splitting the code into smaller pieces:
>lapply(dat1, function(x) gsub("^ |\\w+$", "", x))
># lapply() applies the function to each column of the data frame and returns a list with one element per column.
># The pattern in the first pair of double quotes matches a space at the start of the string, or the last run of word characters at the end of the string. gsub() removes both, because the replacement (the second pair of double quotes) is empty.
>$V1
>[1] "Brown "   "Benigni " "Arstra "
>
>$V2
>[1] "Santos "        ""               "Van den Hoops "
>
>$V3
>[1] "Rome "     ""          "lamarque "
>
>$V4
>[1] "Don Juan " ""          ""
>
>lapply(dat1, function(x) gsub(" $", "", gsub("^ |\\w+$", "", x)))
># I used a second gsub because there are some spaces left at the end, e.g. "Brown ".
>$V1
>[1] "Brown"   "Benigni" "Arstra"
>
>$V2
>[1] "Santos"        ""              "Van den Hoops"
>
>$V3
>[1] "Rome"     ""         "lamarque"
>
>$V4
>[1] "Don Juan" ""         ""
>
>do.call(cbind, lapply(dat1, function(x) gsub(" $", "", gsub("^ |\\w+$", "", x))))  # bind by columns
>     V1        V2              V3         V4
>[1,] "Brown"   "Santos"        "Rome"     "Don Juan"
>[2,] "Benigni" ""              ""         ""
>[3,] "Arstra"  "Van den Hoops" "lamarque" ""
>
>Hope it helps.
>A.K.
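The explanation above relies on read.table(fill = TRUE) to pad the shorter rows with empty strings before the columns are bound together. If the cleaned last names are instead held as a ragged list (for example after strsplit()), a similar data frame can be built by padding each element and binding by rows. This is a minimal sketch under that assumption; names_list, padded and res are illustrative names, not from the thread.

```r
# Ragged list of cleaned last names, one element per publication
# (values copied from the output shown in the thread)
names_list <- list(c("Brown", "Santos", "Rome", "Don Juan"),
                   "Benigni",
                   c("Arstra", "Van den Hoops", "lamarque"))

n <- max(lengths(names_list))   # number of authors in the longest publication

# pad every element with "" so that all have n entries
padded <- lapply(names_list, function(s) c(s, rep("", n - length(s))))

# stack the padded vectors: one publication per row, one author per column
res <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
print(res)
```

Using NA_character_ instead of "" as the padding value may be preferable if missing authors should be distinguishable from empty strings.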