thr3ads.net - R help - [R] Aggregation of data frame with calculations of proportions [Jun 2007]

If this information is useful, please help other people find it:
Share via:

Mark Wardle

2007-Jun-26 19:14 UTC

[R] Aggregation of data frame with calculations of proportions

Dear all,

I have been stuck on this problem, am rather struggling and would
appreciate some advice if anyone can help. I apologise if this is a
bit long-winded, but I've tried to limit it to the bare essentials,
but don't know how to make it more generic!

I have some slightly odd real world data that I'm looking at
representing number of positive diagnoses for different diseases, plus
the number tested. It's all in a data frame:

This should all be directly runnable with cut+paste:

sp = read.csv('http://www.wardle.org/sca-prev.csv', header=T)[,c(-1,-2)]

str(sp)

#'data.frame':   46 obs. of  19 variables:
# $ sca1        : int  5 1 NA NA 48 1 NA 17 21 4 ...
# $ sca2        : int  7 3 NA NA 53 1 NA NA 7 7 ...
# $ sca3        : int  3 1 NA NA 2 0 NA 29 0 0 ...
# $ sca6        : int  1 0 NA NA 2 2 NA NA 0 0 ...
# $ sca7        : int  NA NA NA NA 2 0 NA NA 1 0 ...
# $ sca8        : int  NA NA NA NA 2 1 NA NA NA NA ...
# $ sca10       : int  NA NA NA NA 0 0 NA NA NA NA ...
# $ sca12       : int  NA NA NA NA 0 0 NA NA NA NA ...
# $ sca17       : int  NA NA NA NA 2 1 NA NA NA NA ...
# $ frda        : int  NA NA NA NA 0 1 NA NA NA NA ...
# $ drpla       : int  NA NA 1 1 1 1 NA NA 0 0 ...
# $ fmr1        : int  NA NA NA NA NA NA 6 NA NA NA ...
# $ diagnosis   : int  16 5 1 1 112 8 6 46 29 11 ...
# $ unexplained : int  10 26 415 392 71 35 137 51 0 11 ...
# $ total       : int  26 31 416 393 183 43 143 97 29 22 ...
# $ patient.type: Factor w/ 8 levels "","ADCA","ADCA
I",..: 2 5 2 5 2
4 8 3 2 2 ...
# $ age.group   : Factor w/ 5 levels
"","<40",">40",..: 5 5 5 5 1 1 4 5 1 1 ...
# $ country     : Factor w/ 19 levels "","Australia",..: 7 7
1 1 8 8 8 1 8 8 ...
# $ region      : Factor w/ 6 levels "Americas","Asia",..: 4
4 3 3 3 3
3 3 3 3 ...

# It is straightforward to aggregate data:

smart.sum <- function(x,...) {	
	if (is.numeric(x)) return(sum(x, na.rm=T))
	else return(paste(unique(x), collapse=", "))		# for natbib citations
}
sp.ag1 = aggregate(sp, by=list(region=sp$region), FUN=smart.sum)
sp.ag2 = aggregate(sp, by=list(patient.type=sp$patient.type), FUN=smart.sum)
print(sp.ag1)

# and even to calculate crude percentages
spp = sp.ag1
spp[,2:15] = sapply(spp[,2:15], FUN=function(x) { round(x*100/ spp$total, 1)})

# *BUT*, the problem is that this is underestimating the true
proportions, because some of the data is NA. This means the diagnosis
was not tested rather than not found (which would be zero).

# What I want to do is calculate a true proportion, with the number
found divided by the number tested.
# I have managed to do this:

# calculates true proportion by only including values in the
denominator that have non-NAs in the numerator
# x= vector of numbers
# total = vector of numbers
calc.prevalence <- function(x, total) {
	sum.x = sum(x, na.rm=T)		# calculate numerator
	sum.y = sum(total[!is.na(x)])		# calculate denominator
	return(sum.x/sum.y)
}

correct.prevalence <- calc.prevalence(sp$drpla, sp$total)
incorrect.prevalence <- sum(sp$drpla, na.rm=T) / sum(sp$total, na.rm=T)
print(correct.prevalence)
print(incorrect.prevalence)


This is easy to apply to one column in one table, but I'm finding it
very difficult to do when I manipulate the data using the aggregate()
function above.

Is there a straightforward way, while aggregating columns by a
variable (number of) factors, in generating the sum of a column and
dividing by the sum of another column, only including data from the
second column when the first column is not NA. AND is it possible to
do that to all of the relevant columns (3:15)?

You won't believe how many iterations of by(), aggregate(), tapply(),
apply() and (finally) "for loops" I have tried with no success. The
best partial solution involved a combination of by() calling a
function calling aggregate(), but I can't parse the data returned. I'm
sure I am missing something! Can anyone help?

Many (many many) thanks!

Best wishes,

Mark

P.S. this is enough to drive me to drink!
-- 
Dr. Mark Wardle
Clinical research fellow and specialist registrar, Neurology
Cardiff, UK

jim holtman

2007-Jun-26 22:42 UTC

head link

[R] Aggregation of data frame with calculations of proportions

I think something like this will do it for you.


set.seed(1)
n <- 10
x <- data.frame(a=sample(1:100,n), b=sample(1:100,n),d=sample(1:100,n))
x$a[c(1,5)] <- NA
x$b[c(7,6,4)] <- NA
x$d[c(5,8)] <- NA
x$total <- rowSums(x, na.rm=TRUE)
x$type <- sample(1:3, n, replace=TRUE)
by(x, list(x$type), function(z){
    apply(z[,1:3], 2, calc.prevalence, total=z$total)
})





On 6/26/07, Mark Wardle <mark@wardle.org> wrote:>
> Dear all,
>
> I have been stuck on this problem, am rather struggling and would
> appreciate some advice if anyone can help. I apologise if this is a
> bit long-winded, but I've tried to limit it to the bare essentials,
> but don't know how to make it more generic!
>
> I have some slightly odd real world data that I'm looking at
> representing number of positive diagnoses for different diseases, plus
> the number tested. It's all in a data frame:
>
> This should all be directly runnable with cut+paste:
>
> sp = read.csv('http://www.wardle.org/sca-prev.csv',
header=T)[,c(-1,-2)]
>
> str(sp)
>
> #'data.frame':   46 obs. of  19 variables:
> # $ sca1        : int  5 1 NA NA 48 1 NA 17 21 4 ...
> # $ sca2        : int  7 3 NA NA 53 1 NA NA 7 7 ...
> # $ sca3        : int  3 1 NA NA 2 0 NA 29 0 0 ...
> # $ sca6        : int  1 0 NA NA 2 2 NA NA 0 0 ...
> # $ sca7        : int  NA NA NA NA 2 0 NA NA 1 0 ...
> # $ sca8        : int  NA NA NA NA 2 1 NA NA NA NA ...
> # $ sca10       : int  NA NA NA NA 0 0 NA NA NA NA ...
> # $ sca12       : int  NA NA NA NA 0 0 NA NA NA NA ...
> # $ sca17       : int  NA NA NA NA 2 1 NA NA NA NA ...
> # $ frda        : int  NA NA NA NA 0 1 NA NA NA NA ...
> # $ drpla       : int  NA NA 1 1 1 1 NA NA 0 0 ...
> # $ fmr1        : int  NA NA NA NA NA NA 6 NA NA NA ...
> # $ diagnosis   : int  16 5 1 1 112 8 6 46 29 11 ...
> # $ unexplained : int  10 26 415 392 71 35 137 51 0 11 ...
> # $ total       : int  26 31 416 393 183 43 143 97 29 22 ...
> # $ patient.type: Factor w/ 8 levels
"","ADCA","ADCA I",..: 2 5 2 5 2
> 4 8 3 2 2 ...
> # $ age.group   : Factor w/ 5 levels
"","<40",">40",..: 5 5 5 5 1 1 4 5 1
> 1 ...
> # $ country     : Factor w/ 19 levels
"","Australia",..: 7 7 1 1 8 8 8 1 8
> 8 ...
> # $ region      : Factor w/ 6 levels
"Americas","Asia",..: 4 4 3 3 3 3
> 3 3 3 3 ...
>
> # It is straightforward to aggregate data:
>
> smart.sum <- function(x,...) {
>        if (is.numeric(x)) return(sum(x, na.rm=T))
>        else return(paste(unique(x), collapse=", "))            #
for
> natbib citations
> }
> sp.ag1 = aggregate(sp, by=list(region=sp$region), FUN=smart.sum)
> sp.ag2 = aggregate(sp, by=list(patient.type=sp$patient.type), FUN>
smart.sum)
> print(sp.ag1)
>
> # and even to calculate crude percentages
> spp = sp.ag1
> spp[,2:15] = sapply(spp[,2:15], FUN=function(x) { round(x*100/ spp$total,
> 1)})
>
> # *BUT*, the problem is that this is underestimating the true
> proportions, because some of the data is NA. This means the diagnosis
> was not tested rather than not found (which would be zero).
>
> # What I want to do is calculate a true proportion, with the number
> found divided by the number tested.
> # I have managed to do this:
>
> # calculates true proportion by only including values in the
> denominator that have non-NAs in the numerator
> # x= vector of numbers
> # total = vector of numbers
> calc.prevalence <- function(x, total) {
>        sum.x = sum(x, na.rm=T)         # calculate numerator
>        sum.y = sum(total[!is.na(x)])           # calculate denominator
>        return(sum.x/sum.y)
> }
>
> correct.prevalence <- calc.prevalence(sp$drpla, sp$total)
> incorrect.prevalence <- sum(sp$drpla, na.rm=T) / sum(sp$total, na.rm=T)
> print(correct.prevalence)
> print(incorrect.prevalence)
>
>
> This is easy to apply to one column in one table, but I'm finding it
> very difficult to do when I manipulate the data using the aggregate()
> function above.
>
> Is there a straightforward way, while aggregating columns by a
> variable (number of) factors, in generating the sum of a column and
> dividing by the sum of another column, only including data from the
> second column when the first column is not NA. AND is it possible to
> do that to all of the relevant columns (3:15)?
>
> You won't believe how many iterations of by(), aggregate(), tapply(),
> apply() and (finally) "for loops" I have tried with no success.
The
> best partial solution involved a combination of by() calling a
> function calling aggregate(), but I can't parse the data returned.
I'm
> sure I am missing something! Can anyone help?
>
> Many (many many) thanks!
>
> Best wishes,
>
> Mark
>
> P.S. this is enough to drive me to drink!
> --
> Dr. Mark Wardle
> Clinical research fellow and specialist registrar, Neurology
> Cardiff, UK
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

	[[alternative HTML version deleted]]

Mark Wardle

2007-Jun-27 07:44 UTC

head link

[R] Aggregation of data frame with calculations of proportions

Dear Jim,

On 26/06/07, jim holtman <jholtman at gmail.com>
wrote:> I think something like this will do it for you.
>
>
>
> set.seed(1)
> n <- 10
> x <- data.frame(a=sample(1:100,n),
> b=sample(1:100,n),d=sample(1:100,n))
> x$a[c(1,5)] <- NA
> x$b[c(7,6,4)] <- NA
> x$d[c(5,8)] <- NA
> x$total <- rowSums(x, na.rm=TRUE)
> x$type <- sample(1:3, n, replace=TRUE)
> by(x, list(x$type), function(z){
>     apply(z[,1:3], 2, calc.prevalence, total=z$total)
> })
>
>
> [...SNIP...]
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem you are trying to solve?

It works perfectly. The problem now is how to send over that beer I owe you!

Many thanks,

Best wishes,

-- 
Dr. Mark Wardle
Clinical research fellow and specialist registrar, Neurology
Cardiff, UK

Reasonably Related Threads

Search for more apparently analagous threads

R help - Jun 2007 - Aggregation of data frame with calculations of proportions

[R] Aggregation of data frame with calculations of proportions

[R] Aggregation of data frame with calculations of proportions

[R] Aggregation of data frame with calculations of proportions

Reasonably Related Threads