Mark Wardle
2007-Jun-26 19:14 UTC
[R] Aggregation of data frame with calculations of proportions
Dear all, I have been stuck on this problem, am rather struggling and would appreciate some advice if anyone can help. I apologise if this is a bit long-winded, but I've tried to limit it to the bare essentials, but don't know how to make it more generic! I have some slightly odd real world data that I'm looking at representing number of positive diagnoses for different diseases, plus the number tested. It's all in a data frame: This should all be directly runnable with cut+paste: sp = read.csv('http://www.wardle.org/sca-prev.csv', header=T)[,c(-1,-2)] str(sp) #'data.frame': 46 obs. of 19 variables: # $ sca1 : int 5 1 NA NA 48 1 NA 17 21 4 ... # $ sca2 : int 7 3 NA NA 53 1 NA NA 7 7 ... # $ sca3 : int 3 1 NA NA 2 0 NA 29 0 0 ... # $ sca6 : int 1 0 NA NA 2 2 NA NA 0 0 ... # $ sca7 : int NA NA NA NA 2 0 NA NA 1 0 ... # $ sca8 : int NA NA NA NA 2 1 NA NA NA NA ... # $ sca10 : int NA NA NA NA 0 0 NA NA NA NA ... # $ sca12 : int NA NA NA NA 0 0 NA NA NA NA ... # $ sca17 : int NA NA NA NA 2 1 NA NA NA NA ... # $ frda : int NA NA NA NA 0 1 NA NA NA NA ... # $ drpla : int NA NA 1 1 1 1 NA NA 0 0 ... # $ fmr1 : int NA NA NA NA NA NA 6 NA NA NA ... # $ diagnosis : int 16 5 1 1 112 8 6 46 29 11 ... # $ unexplained : int 10 26 415 392 71 35 137 51 0 11 ... # $ total : int 26 31 416 393 183 43 143 97 29 22 ... # $ patient.type: Factor w/ 8 levels "","ADCA","ADCA I",..: 2 5 2 5 2 4 8 3 2 2 ... # $ age.group : Factor w/ 5 levels "","<40",">40",..: 5 5 5 5 1 1 4 5 1 1 ... # $ country : Factor w/ 19 levels "","Australia",..: 7 7 1 1 8 8 8 1 8 8 ... # $ region : Factor w/ 6 levels "Americas","Asia",..: 4 4 3 3 3 3 3 3 3 3 ... # It is straightforward to aggregate data: smart.sum <- function(x,...) { if (is.numeric(x)) return(sum(x, na.rm=T)) else return(paste(unique(x), collapse=", ")) # for natbib citations } sp.ag1 = aggregate(sp, by=list(region=sp$region), FUN=smart.sum) sp.ag2 = aggregate(sp, by=list(patient.type=sp$patient.type), FUN=smart.sum) print(sp.ag1) # and even to calculate crude percentages spp = sp.ag1 spp[,2:15] = sapply(spp[,2:15], FUN=function(x) { round(x*100/ spp$total, 1)}) # *BUT*, the problem is that this is underestimating the true proportions, because some of the data is NA. This means the diagnosis was not tested rather than not found (which would be zero). # What I want to do is calculate a true proportion, with the number found divided by the number tested. # I have managed to do this: # calculates true proportion by only including values in the denominator that have non-NAs in the numerator # x= vector of numbers # total = vector of numbers calc.prevalence <- function(x, total) { sum.x = sum(x, na.rm=T) # calculate numerator sum.y = sum(total[!is.na(x)]) # calculate denominator return(sum.x/sum.y) } correct.prevalence <- calc.prevalence(sp$drpla, sp$total) incorrect.prevalence <- sum(sp$drpla, na.rm=T) / sum(sp$total, na.rm=T) print(correct.prevalence) print(incorrect.prevalence) This is easy to apply to one column in one table, but I'm finding it very difficult to do when I manipulate the data using the aggregate() function above. Is there a straightforward way, while aggregating columns by a variable (number of) factors, in generating the sum of a column and dividing by the sum of another column, only including data from the second column when the first column is not NA. AND is it possible to do that to all of the relevant columns (3:15)? You won't believe how many iterations of by(), aggregate(), tapply(), apply() and (finally) "for loops" I have tried with no success. The best partial solution involved a combination of by() calling a function calling aggregate(), but I can't parse the data returned. I'm sure I am missing something! Can anyone help? Many (many many) thanks! Best wishes, Mark P.S. this is enough to drive me to drink! -- Dr. Mark Wardle Clinical research fellow and specialist registrar, Neurology Cardiff, UK
jim holtman
2007-Jun-26 22:42 UTC
[R] Aggregation of data frame with calculations of proportions
I think something like this will do it for you. set.seed(1) n <- 10 x <- data.frame(a=sample(1:100,n), b=sample(1:100,n),d=sample(1:100,n)) x$a[c(1,5)] <- NA x$b[c(7,6,4)] <- NA x$d[c(5,8)] <- NA x$total <- rowSums(x, na.rm=TRUE) x$type <- sample(1:3, n, replace=TRUE) by(x, list(x$type), function(z){ apply(z[,1:3], 2, calc.prevalence, total=z$total) }) On 6/26/07, Mark Wardle <mark@wardle.org> wrote:> > Dear all, > > I have been stuck on this problem, am rather struggling and would > appreciate some advice if anyone can help. I apologise if this is a > bit long-winded, but I've tried to limit it to the bare essentials, > but don't know how to make it more generic! > > I have some slightly odd real world data that I'm looking at > representing number of positive diagnoses for different diseases, plus > the number tested. It's all in a data frame: > > This should all be directly runnable with cut+paste: > > sp = read.csv('http://www.wardle.org/sca-prev.csv', header=T)[,c(-1,-2)] > > str(sp) > > #'data.frame': 46 obs. of 19 variables: > # $ sca1 : int 5 1 NA NA 48 1 NA 17 21 4 ... > # $ sca2 : int 7 3 NA NA 53 1 NA NA 7 7 ... > # $ sca3 : int 3 1 NA NA 2 0 NA 29 0 0 ... > # $ sca6 : int 1 0 NA NA 2 2 NA NA 0 0 ... > # $ sca7 : int NA NA NA NA 2 0 NA NA 1 0 ... > # $ sca8 : int NA NA NA NA 2 1 NA NA NA NA ... > # $ sca10 : int NA NA NA NA 0 0 NA NA NA NA ... > # $ sca12 : int NA NA NA NA 0 0 NA NA NA NA ... > # $ sca17 : int NA NA NA NA 2 1 NA NA NA NA ... > # $ frda : int NA NA NA NA 0 1 NA NA NA NA ... > # $ drpla : int NA NA 1 1 1 1 NA NA 0 0 ... > # $ fmr1 : int NA NA NA NA NA NA 6 NA NA NA ... > # $ diagnosis : int 16 5 1 1 112 8 6 46 29 11 ... > # $ unexplained : int 10 26 415 392 71 35 137 51 0 11 ... > # $ total : int 26 31 416 393 183 43 143 97 29 22 ... > # $ patient.type: Factor w/ 8 levels "","ADCA","ADCA I",..: 2 5 2 5 2 > 4 8 3 2 2 ... > # $ age.group : Factor w/ 5 levels "","<40",">40",..: 5 5 5 5 1 1 4 5 1 > 1 ... > # $ country : Factor w/ 19 levels "","Australia",..: 7 7 1 1 8 8 8 1 8 > 8 ... > # $ region : Factor w/ 6 levels "Americas","Asia",..: 4 4 3 3 3 3 > 3 3 3 3 ... > > # It is straightforward to aggregate data: > > smart.sum <- function(x,...) { > if (is.numeric(x)) return(sum(x, na.rm=T)) > else return(paste(unique(x), collapse=", ")) # for > natbib citations > } > sp.ag1 = aggregate(sp, by=list(region=sp$region), FUN=smart.sum) > sp.ag2 = aggregate(sp, by=list(patient.type=sp$patient.type), FUN> smart.sum) > print(sp.ag1) > > # and even to calculate crude percentages > spp = sp.ag1 > spp[,2:15] = sapply(spp[,2:15], FUN=function(x) { round(x*100/ spp$total, > 1)}) > > # *BUT*, the problem is that this is underestimating the true > proportions, because some of the data is NA. This means the diagnosis > was not tested rather than not found (which would be zero). > > # What I want to do is calculate a true proportion, with the number > found divided by the number tested. > # I have managed to do this: > > # calculates true proportion by only including values in the > denominator that have non-NAs in the numerator > # x= vector of numbers > # total = vector of numbers > calc.prevalence <- function(x, total) { > sum.x = sum(x, na.rm=T) # calculate numerator > sum.y = sum(total[!is.na(x)]) # calculate denominator > return(sum.x/sum.y) > } > > correct.prevalence <- calc.prevalence(sp$drpla, sp$total) > incorrect.prevalence <- sum(sp$drpla, na.rm=T) / sum(sp$total, na.rm=T) > print(correct.prevalence) > print(incorrect.prevalence) > > > This is easy to apply to one column in one table, but I'm finding it > very difficult to do when I manipulate the data using the aggregate() > function above. > > Is there a straightforward way, while aggregating columns by a > variable (number of) factors, in generating the sum of a column and > dividing by the sum of another column, only including data from the > second column when the first column is not NA. AND is it possible to > do that to all of the relevant columns (3:15)? > > You won't believe how many iterations of by(), aggregate(), tapply(), > apply() and (finally) "for loops" I have tried with no success. The > best partial solution involved a combination of by() calling a > function calling aggregate(), but I can't parse the data returned. I'm > sure I am missing something! Can anyone help? > > Many (many many) thanks! > > Best wishes, > > Mark > > P.S. this is enough to drive me to drink! > -- > Dr. Mark Wardle > Clinical research fellow and specialist registrar, Neurology > Cardiff, UK > > ______________________________________________ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem you are trying to solve? [[alternative HTML version deleted]]
Mark Wardle
2007-Jun-27 07:44 UTC
[R] Aggregation of data frame with calculations of proportions
Dear Jim, On 26/06/07, jim holtman <jholtman at gmail.com> wrote:> I think something like this will do it for you. > > > > set.seed(1) > n <- 10 > x <- data.frame(a=sample(1:100,n), > b=sample(1:100,n),d=sample(1:100,n)) > x$a[c(1,5)] <- NA > x$b[c(7,6,4)] <- NA > x$d[c(5,8)] <- NA > x$total <- rowSums(x, na.rm=TRUE) > x$type <- sample(1:3, n, replace=TRUE) > by(x, list(x$type), function(z){ > apply(z[,1:3], 2, calc.prevalence, total=z$total) > }) > > > [...SNIP...] > -- > Jim Holtman > Cincinnati, OH > +1 513 646 9390 > > What is the problem you are trying to solve?It works perfectly. The problem now is how to send over that beer I owe you! Many thanks, Best wishes, -- Dr. Mark Wardle Clinical research fellow and specialist registrar, Neurology Cardiff, UK
Apparently Analagous Threads
- Matching data to a new column
- read table not reading lines containing single quotes
- spearman correlation and p-value as a matrix
- Am I having problems with codecs? or am I not receiving an invite at all from my DID provider?
- Is there any design based two proportions z test?