Mark W. Miller
2009-Nov-04 21:09 UTC
[R] splitting scientific names into genus, species, and subspecies
I have a list of scientific names in a data set. I would like to split the names into genus, species and subspecies. Not all names include a subspecies. Could someone show me how to do this?

My example code is:

a <- matrix(c('genusA speciesA', 10,
              'genusB speciesAA', 20,
              'genusC speciesAAA subspeciesA', 15,
              'genusC speciesAAA subspeciesB', 25), nrow=4, byrow=TRUE)
aa <- data.frame(a)
colnames(aa) <- c('species', 'counts')
aa

# The code returns
                        species counts
1               genusA speciesA     10
2              genusB speciesAA     20
3 genusC speciesAAA subspeciesA     15
4 genusC speciesAAA subspeciesB     25

# I would like there to be 4 columns as below
genus   species     subspecies     counts
genusA  speciesA    no.subspecies  10
genusB  speciesAA   no.subspecies  20
genusC  speciesAAA  subspeciesA    15
genusC  speciesAAA  subspeciesB    25

I have tried using 'strsplit', but cannot get the desired result. Thank you for any help with this.

Mark Miller
Gainesville, Florida
--
View this message in context: http://old.nabble.com/splitting-scientific-names-into-genus%2C-species%2C-and-subspecies-tp26204666p26204666.html
Sent from the R help mailing list archive at Nabble.com.
(Ted Harding)
2009-Nov-04 21:37 UTC
[R] splitting scientific names into genus, species, and sub
On 04-Nov-09 21:09:42, Mark W. Miller wrote:
> I have a list of scientific names in a data set. I would like
> to split the names into genus, species and subspecies.
> Not all names include a subspecies. Could someone show me how
> to do this?
>
> My example code is:
> a <- matrix(c('genusA speciesA', 10,
>               'genusB speciesAA', 20,
>               'genusC speciesAAA subspeciesA', 15,
>               'genusC speciesAAA subspeciesB', 25), nrow=4, byrow=TRUE)
> aa <- data.frame(a)
> colnames(aa) <- c('species', 'counts')
> aa
>
> # The code returns
>                         species counts
> 1               genusA speciesA     10
> 2              genusB speciesAA     20
> 3 genusC speciesAAA subspeciesA     15
> 4 genusC speciesAAA subspeciesB     25
>
> # I would like there to be 4 columns as below
> genus   species     subspecies     counts
> genusA  speciesA    no.subspecies  10
> genusB  speciesAA   no.subspecies  20
> genusC  speciesAAA  subspeciesA    15
> genusC  speciesAAA  subspeciesB    25
>
> I have tried using 'strsplit', but cannot get the desired result.
> Thank you for any help with this.
>
> Mark Miller
> Gainesville, Florida

The following seems to work for your example. However, others can probably propose a less clumsy version (but at least this one breaks it down into its elements):

a <- matrix(c('genusA speciesA', 10,
              'genusB speciesAA', 20,
              'genusC speciesAAA subspeciesA', 15,
              'genusC speciesAAA subspeciesB', 25), nrow=4, byrow=TRUE)
a
#      [,1]                            [,2]
# [1,] "genusA speciesA"               "10"
# [2,] "genusB speciesAA"              "20"
# [3,] "genusC speciesAAA subspeciesA" "15"
# [4,] "genusC speciesAAA subspeciesB" "25"

A <- NULL
for (i in 1:nrow(a)) {
  # split the name on one or more spaces
  Names <- unlist(strsplit(a[i,1], "[ ]+"))
  # pad names that have no subspecies
  if (length(Names) == 2) Names <- c(Names, "no.subspecies")
  A <- rbind(A, c(Names, a[i,2]))
}
colnames(A) <- c("Genus","Species","Subspecies","Count")
A <- as.data.frame(A)
# go through as.character(): if Count is a factor, as.numeric() alone
# would return the level codes (1, 3, 2, 4) rather than the counts
A$Count <- as.numeric(as.character(A$Count))
A
#    Genus    Species    Subspecies Count
# 1 genusA   speciesA no.subspecies    10
# 2 genusB  speciesAA no.subspecies    20
# 3 genusC speciesAAA   subspeciesA    15
# 4 genusC speciesAAA   subspeciesB    25

Hoping this helps!
Ted.
-------------------------------------------------------------------- E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk> Fax-to-email: +44 (0)870 094 0861 Date: 04-Nov-09 Time: 21:37:03 ------------------------------ XFMail ------------------------------
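For anyone who prefers to avoid growing the result row by row with rbind, the same idea (split on spaces, pad names that lack a subspecies) can be sketched in vectorized form. This is only a sketch under the assumptions of the example: every name has at least a genus and a species, and at most three words. The helper name split_names and its fill argument are made up for illustration and are not from the thread.

```r
# Example data from the thread, built directly as a data frame
aa <- data.frame(
  species = c("genusA speciesA", "genusB speciesAA",
              "genusC speciesAAA subspeciesA", "genusC speciesAAA subspeciesB"),
  counts = c(10, 20, 15, 25),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split names into exactly three columns
split_names <- function(x, fill = "no.subspecies") {
  # as.character() guards against factor input; split on runs of spaces
  parts <- strsplit(trimws(as.character(x)), "[ ]+")
  # p[1:3] pads short names with NA, so every piece has length 3
  m <- t(vapply(parts, function(p) p[1:3], character(3)))
  colnames(m) <- c("genus", "species", "subspecies")
  out <- as.data.frame(m, stringsAsFactors = FALSE)
  # replace the NA padding with the requested label
  out$subspecies[is.na(out$subspecies)] <- fill
  out
}

result <- split_names(aa$species)
result$counts <- aa$counts
result
```

Keeping the padding as NA (by skipping the last replacement step) may be more convenient if the subspecies column is later tested with is.na().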
Chris Stubben
2009-Nov-04 22:19 UTC
[R] splitting scientific names into genus, species, and subspecies
Mark W. Miller wrote:
>
> I have a list of scientific names in a data set. I would like to split
> the names into genus, species and subspecies. Not all names include a
> subspecies. Could someone show me how to do this?
>

strsplit should work for your example...

## strsplit needs a character vector, so pass the species column
## rather than the whole data frame
parts <- strsplit(as.character(aa$species), " ")
data.frame(
  genus      = sapply(parts, "[", 1),
  species    = sapply(parts, "[", 2),
  subspecies = sapply(parts, "[", 3)  ## will be NA for missing subsp
)

However, scientific names are often pretty messy - I often have datasets like this...

x
[1] "Aquilegia caerulea James var. caerulea"
[2] "Aquilegia caerulea James var. ochroleuca Hook."
[3] "Aquilegia caerulea James var. pinetorum (Tidestrom) Payson ex Kearney & Peebles"
[4] "Aquilegia caerulea James"
[5] "Aquilegia chaplinei Standl."
[6] "Aquilegia chaplinei Standley ex Payson"
[7] "Aquilegia chrysantha Gray var. chrysantha"
[8] "Aquilegia chrysantha Gray"

So I first strip out author names, using strsplit to break each name into words and grep to find subspecies/variety abbreviations:

noauthor <- function(x) {
  ## split name into vector of separate words
  y <- strsplit(x, " ")
  sapply(y, function(x) {
    ## positions of rank abbreviations: var., ssp., var, f. (others could be added)
    n <- grep("^var\\.$|^ssp\\.$|^var$|^f\\.$", x)
    ## paste together the first and second elements
    ## plus each abbreviation and the element after it;
    ## use sort in case the name includes both var and ssp - sometimes happens
    paste(x[sort(c(1:2, n, n + 1))], collapse = " ")
  })
}

noauthor(x[1:8])
[1] "Aquilegia caerulea var. caerulea"
[2] "Aquilegia caerulea var. ochroleuca"
[3] "Aquilegia caerulea var. pinetorum"
[4] "Aquilegia caerulea"
[5] "Aquilegia chaplinei"
[6] "Aquilegia chaplinei"
[7] "Aquilegia chrysantha var. chrysantha"
[8] "Aquilegia chrysantha"

Chris
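Once the author names are stripped, the cleaned names can be split into columns in the same way as the simple example. The sketch below assumes each cleaned name is "genus species" optionally followed by one rank abbreviation and one epithet; names carrying both a subspecies and a variety would need more columns. The vector clean copies a few of the cleaned names shown above, and the column names rank and epithet are chosen here only for illustration.

```r
# A few cleaned names, as returned after removing the authors
clean <- c("Aquilegia caerulea var. caerulea",
           "Aquilegia caerulea var. ochroleuca",
           "Aquilegia caerulea",
           "Aquilegia chaplinei")

# Split each name into words once, then pick words by position
words <- strsplit(clean, " ")
taxa <- data.frame(
  genus   = sapply(words, "[", 1),
  species = sapply(words, "[", 2),
  rank    = sapply(words, "[", 3),  # "var.", "ssp.", ... or NA if absent
  epithet = sapply(words, "[", 4),  # NA when there is no infraspecific name
  stringsAsFactors = FALSE
)
taxa
```

Indexing past the end of a character vector returns NA, which is why names without a variety simply get NA in the last two columns.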