Dear List,

I have a data set stored in the following format:

> head(dat, n = 10)
      id  sppcode abundance
1  10307 10000000         1
2  10307 16220602         2
3  10307 20000000         5
4  10307 20110000         2
5  10307 24000000         1
6  10307 40210000        83
7  10307 40210102        45
8  10307 45140000         1
9  10307 45630000         1
10 10307 45630600        41
> str(dat)
'data.frame':   111 obs. of  3 variables:
 $ id       : Factor w/ 3 levels "10307","10719",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sppcode  : chr  "10000000" "16220602" "20000000" "20110000" ...
 $ abundance: num  1 2 5 2 1 83 45 1 1 41 ...

that represent counts of species, recorded with a particular coding
system. The abundance column is not needed for this particular
operation, but is present in the data files.

I am interested in counting entries (rows) in the sppcode component of
dat. The sppcode takes a particular format: Order Family Genus Species,
with 2 alphanumeric digits allocated for each level of the hierarchy. I
want to know how many species there are in each site (the id factor),
but I should only count a higher level entry if there are no lower
levels present.

For example, for the above data excerpt (just the headed rows), I would
count the following rows:

10000000
16220602
20110000
24000000
40320203
45140000
45630600 == 7 "species" present.

To be more specific, I don't count 45630000 (row 9) because there exists
a sppcode for this 'id' where either of the next two pairs of digits are
not all 0's.

In words, I want to count all rows where WWXXYYZZ are ZZ != 00, then,
rows where ZZ == 00 only if the WWXXYY combination has not been counted
yet.

An example data set has been placed in my University web space and can
be read into R with the following:

## read example csv data
dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/files/example_data.csv"),
                colClasses = c("factor","character","numeric"))
## show the data
head(dat, n = 10)

And the sppcode variable can be broken out into the 4 levels if required via:

## split out the four levels of categorisation:
dat2 <- data.frame(dat,
                   order = with(dat, substr(sppcode, 1, 2)),
                   family = with(dat, substr(sppcode, 3, 4)),
                   genus = with(dat, substr(sppcode, 5, 6)),
                   species = with(dat, substr(sppcode, 7, 8)))

The actual data set/problem contains several hundred different id's.

I can't see an efficient way of processing these data in the manner
described. Any help would be most gratefully received.

Many thanks,

Gavin
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
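
As a small illustration of the coding scheme described above (not part of the original post), the sketch below works out how far down the hierarchy each of the ten excerpt codes is identified. The helper name `id_level` is hypothetical, and it assumes every code is exactly eight characters long with "00" meaning "not identified at this level".

```r
## Codes from the ten-row excerpt (site 10307).
codes <- c("10000000", "16220602", "20000000", "20110000", "24000000",
           "40210000", "40210102", "45140000", "45630000", "45630600")

## Hypothetical helper: strip the trailing "00" pairs, then count the
## two-character pairs that remain.
## 1 = order, 2 = family, 3 = genus, 4 = species.
id_level <- function(sppcode) {
  nchar(sub("(00)+$", "", sppcode)) / 2
}

levels_found <- id_level(codes)

## Show each code next to the level it was identified to.
data.frame(sppcode = codes,
           level = c("order", "family", "genus", "species")[levels_found])
```

Only rows 2 and 7 are identified to species; the rest stop at order, family or genus, which is why a plain count of rows with ZZ != 00 is not enough.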
Apologies, Jim Holtman has pointed out a couple of problems/queries with
my original email that I would like to make clear.

Firstly, I introduced a typo when trying to be helpful. In my original
email, I had incorrectly typed out one of the species codes I would count:

10000000
16220602
20110000
24000000
40320203 ## This should have been 40210102
45140000
45630600 == 7 "species" present.

Secondly, the criteria I laid out might suggest that in the 10 rows of
example I quoted, I would count both:

45630000
45630600

This is not what I wanted and apologies that this was not clear. I only
want to count 45630600 because this is more "specific" in terms of what
creature this is than 45630000. I don't know that 45630000 is not
45630600, so I should not count both 45630000 and 45630600, as this
could be double counting.

These data are species counts and sometimes it is not possible to
identify an individual to species level. Sometimes we can't even get the
genus, or even the family, which is why we sometimes have a count for
the family (45630000) as well as for the genus (45630600) in the same
sample/site. How precise the identification is depends on how much of
the individual there is to identify it from.

So I want to count a higher level category only if I have not counted a
lower level category contained within this higher level.

I hope this is a little bit clearer? And no, I did not come up with this
coding system nor the idea to use "counts" of "species" in this way... ;-)

Apologies if my original email caused unnecessary confusion.

All the best,

G
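
The clarified rule can be sketched on the ten excerpt rows as follows. This is an illustration under stated assumptions rather than the poster's own code: the helper name `is_most_specific` is hypothetical, codes are assumed to be eight characters with "00" marking an unidentified level, and a code is treated as superseded when its identified part is a proper prefix of another code's identified part. `startsWith()` needs R 3.3.0 or later.

```r
## Codes from the ten-row excerpt (site 10307).
codes <- c("10000000", "16220602", "20000000", "20110000", "24000000",
           "40210000", "40210102", "45140000", "45630000", "45630600")

## Hypothetical helper: TRUE for codes that should be counted, i.e. those
## with no more specific identification inside them.
is_most_specific <- function(sppcode) {
  ## identified part of each code, e.g. "4563" for 45630000
  key <- sub("(00)+$", "", sppcode)
  ## a code is superseded if some longer key starts with its key
  superseded <- vapply(key, function(k) {
    any(nchar(key) > nchar(k) & startsWith(key, k))
  }, logical(1), USE.NAMES = FALSE)
  !superseded
}

keep <- is_most_specific(codes)
codes[keep]   # the codes that are counted
sum(keep)     # number of "species" present
```

On these rows the rule drops 20000000, 40210000 and 45630000 and keeps the seven codes in the corrected list above.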
To answer my own post, and for the archives (hopefully not that anyone
has to repeat what I had to do ;-), after much hair-pulling, frowning at
the screen and general dumb-headedness, the following slab of R code
achieves the results I wanted. It isn't elegant but does a job.

msr <- function(x) {
    res <- numeric(length = length(levels(x$id)))
    names(res) <- levels(x$id)
    for(site in levels(x$id)) {
        ## subset just data for this site
        DAT <- x[x$id == site, ]
        ## split out the spp and count the ones not 00
        spp <- with(DAT, substr(sppcode, 7, 8))
        spp.counted <- which(spp != "00")
        spp <- with(DAT[spp.counted, ], sppcode)
        SPP <- length(spp.counted)
        ## guard: DAT[-integer(0), ] would drop every row
        if(length(spp.counted) >= 1) {
            DAT <- DAT[-spp.counted, ]
        }
        ## drop genera for spp already counted
        want <- with(DAT, which(substr(sppcode, 1, 6) %in% substr(spp, 1, 6)))
        if(length(want) >= 1) {
            DAT <- DAT[-want, ]
        }
        ## now count genera remaining not 00
        gen <- with(DAT, substr(sppcode, 5, 6))
        gen.counted <- which(gen != "00")
        gen <- with(DAT[gen.counted, ], sppcode)
        GEN <- length(gen.counted)
        if(length(gen.counted) >= 1) {
            DAT <- DAT[-gen.counted, ]
        }
        ## drop families already in spp, or genera that we already caught
        want1 <- with(DAT, which(substr(sppcode, 1, 4) %in% substr(spp, 1, 4)))
        want2 <- with(DAT, which(substr(sppcode, 1, 4) %in% substr(gen, 1, 4)))
        if(length(want <- unique(c(want1, want2))) >= 1) {
            DAT <- DAT[-want, ]
        }
        ## count remaining families != 00
        fam <- with(DAT, substr(sppcode, 3, 4))
        fam.counted <- which(fam != "00")
        fam <- with(DAT[fam.counted, ], sppcode)
        FAM <- length(fam.counted)
        if(length(fam.counted) >= 1) {
            DAT <- DAT[-fam.counted, ]
        }
        ## drop orders for spp, genera or families already counted
        want1 <- with(DAT, which(substr(sppcode, 1, 2) %in% substr(spp, 1, 2)))
        want2 <- with(DAT, which(substr(sppcode, 1, 2) %in% substr(gen, 1, 2)))
        want3 <- with(DAT, which(substr(sppcode, 1, 2) %in% substr(fam, 1, 2)))
        if(length(want <- unique(c(want1, want2, want3))) >= 1) {
            DAT <- DAT[-want, ]
        }
        ## count the orders remaining
        ORD <- nrow(DAT)
        ## populate return vector
        res[site] <- SPP + GEN + FAM + ORD
    }
    return(res)
}

## read example csv data
dat <- read.csv(url("http://www.homepages.ucl.ac.uk/~ucfagls/files/example_data.csv"),
                colClasses = c("factor","character","numeric"))
## show the data
head(dat, n = 10)

## split out the four levels of categorisation:
dat2 <- data.frame(dat,
                   order = with(dat, substr(sppcode, 1, 2)),
                   family = with(dat, substr(sppcode, 3, 4)),
                   genus = with(dat, substr(sppcode, 5, 6)),
                   species = with(dat, substr(sppcode, 7, 8)))

msr(dat)

Yields:

> msr(dat)
10307 10719 10786
   15    40    35

Which are correct.

G
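
For comparison, a more compact per-site version of the same idea might look like the sketch below. It is an alternative technique (prefix matching plus `tapply()`), not the loop posted above, and it has not been checked against the full example file. The names `count_taxa` and `toy`, and the second site "siteB" with its four codes, are made up purely for illustration; the first ten rows are the excerpt from the original post.

```r
## Hypothetical helper: count the codes in one site that have no more
## specific identification inside them. Assumes 8-character codes with
## "00" marking an unidentified level. startsWith() needs R >= 3.3.0.
count_taxa <- function(sppcode) {
  key <- sub("(00)+$", "", sppcode)        # identified part of each code
  superseded <- vapply(key, function(k) {
    any(nchar(key) > nchar(k) & startsWith(key, k))
  }, logical(1), USE.NAMES = FALSE)
  sum(!superseded)
}

## Toy data: the ten excerpt rows plus an invented second site.
toy <- data.frame(
  id = factor(c(rep("10307", 10), rep("siteB", 4))),
  sppcode = c("10000000", "16220602", "20000000", "20110000", "24000000",
              "40210000", "40210102", "45140000", "45630000", "45630600",
              "10000000", "10200000", "10200300", "30000000"),
  stringsAsFactors = FALSE
)

## One count per site, no explicit loop over levels of id.
res <- tapply(toy$sppcode, toy$id, count_taxa)
res
```

In the invented site, 10000000 and 10200000 are both superseded by 10200300, so only two taxa are counted there. Because the work is done per site by `tapply()`, the same call would apply unchanged to several hundred ids, at the cost of a quadratic comparison within each site.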