Lyndon Estes
2011-Jul-11  20:57 UTC
[R] Named numeric vectors with the same value but different names return different results when used as thresholds for calculating true positives
Dear List,
I have encountered an odd problem that I cannot understand. It stems
from calculating true and false positives from two input vectors, x
and y, at different thresholds of x extracted with the quantile
function. In certain cases I get different true positive counts for
the same threshold value when that value was found under different
quantiles (e.g. the threshold value Z was returned by both
quantile(x, probs = 0.045) and quantile(x, probs = 0.05)). The
following illustrates the problem:
# Start of code with comments
# Load vectors x and y
con <-
url("http://sites.google.com/site/ldemisc/r_general/tpfpdat.Rdata")
load(con)
close(con)
# Data frame to collect TP and FP values based on different thresholds
# of vector x
bins <- 0.005
ctch <- data.frame(matrix(nrow = 100 / (bins * 100) + 1, ncol = 4))
colnames(ctch) <- c("threshold", "val", "tp",
"fp")
# Extract different TP and FP values in loop, where thresholds are
# based on 1/2 percent increments in quantiles of x
for(i in 1:(100 / (bins * 100) + 1)) {
    bin.ct <- quantile(x, bins * (i - 1), na.rm = TRUE)
    tp <- length(x) - length(x[x <= bin.ct])  # N true positives
    fp <- length(y) - length(y[y <= bin.ct])  # N false positives
    ctch[i, ] <- c(bins * (i - 1) * 100, bin.ct, tp, fp)
}
# The problem is here
ctch[1:20, ]
#   threshold    val   tp    fp
#1        0.0 330.57 3139 19485
#2        0.5 374.11 3118 17510
#3        1.0 395.38 3029 16883
#4        1.5 395.38 3029 16883
#5        2.0 395.38 3029 16883
#6        2.5 395.38 3029 16883
#7        3.0 395.38 3029 16883
#8        3.5 395.38 3029 16883
#9        4.0 430.29 2875 15346
#10       4.5 430.29 2875 15346
#11       5.0 430.29 2875 15346
#12       5.5 430.29 3029 15346
#13       6.0 430.29 3029 15346
#14       6.5 430.29 2875 15346
#15       7.0 430.29 2875 15346
#16       7.5 430.29 2875 15346
#17       8.0 430.29 2875 15346
#18       8.5 438.20 2872 14791
#19       9.0 441.66 2835 14656
#20       9.5 441.66 2835 14656
# Note that the values (val) are identical (430.29) for thresholds
# ranging between 4.0 and 8.0. However, the tp values for thresholds
# 5.5 and 6.0 are 3029, whereas they are 2875 for thresholds of
# 4.0-5.0 and 6.5-8.0. Given that the threshold value is the same
# throughout, it makes no sense that there is any variation in the
# tp values (also note that fp is the same throughout this range).
# The problem seems to be here. Re-running the loop with the following
# modification:
for(i in 1:(100 / (bins * 100) + 1)) {
    #bin.ct <- quantile(x, bins * (i - 1), na.rm = TRUE)
    # Substitute line above with the following line:
    bin.ct <- as.numeric(as.character(quantile(x, bins * (i - 1),
na.rm = TRUE)))  # Converts bin.ct to nameless vector
    tp <- length(x) - length(x[x <= bin.ct])  # N true positives
    fp <- length(y) - length(y[y <= bin.ct])  # N false positives
    ctch[i, ] <- c(bins * (i - 1) * 100, bin.ct, tp, fp)
}
# Produces more sensible results
ctch[1:20, ]
#   threshold    val   tp    fp
#1        0.0 330.57 3139 19485
#2        0.5 374.11 3118 17510
#3        1.0 395.38 3029 16883
#4        1.5 395.38 3029 16883
#5        2.0 395.38 3029 16883
#6        2.5 395.38 3029 16883
#7        3.0 395.38 3029 16883
#8        3.5 395.38 3029 16883
#9        4.0 430.29 2875 15346
#10       4.5 430.29 2875 15346
#11       5.0 430.29 2875 15346
#12       5.5 430.29 2875 15346  # tp values are consistent now
#13       6.0 430.29 2875 15346  # ""
#14       6.5 430.29 2875 15346
#15       7.0 430.29 2875 15346
#16       7.5 430.29 2875 15346
#17       8.0 430.29 2875 15346
#18       8.5 438.20 2872 14791
#19       9.0 441.66 2835 14656
#20       9.5 441.66 2835 14656
# I am not sure why this is the way it is. The variable bin.ct was a
# vector of class "numeric" in both versions above, but in the former
# case it was named:
# First version
quantile(x, bins * (i - 1), na.rm = TRUE)
# 100%
#771.51
class(quantile(x, bins * (i - 1), na.rm = TRUE))
# "numeric"
# Second version
as.numeric(as.character(quantile(x, bins * (i - 1), na.rm = TRUE)))
#771.51
class(as.numeric(as.character(quantile(x, bins * (i - 1), na.rm = TRUE))))
# "numeric"
# I am therefore not clear why the named vectors returned by the
# quantile function might produce different tp values for the same
# numeric value but with different vector names. It is not necessarily
# the fact that the vector is named, either. For instance, if I do this:
j <- 0.055  # bins value for 5.5% threshold
bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE)))
names(bin.ct) <- "5.5%"
length(x) - length(x[x <= bin.ct])
# The value 2875 is returned
# Whereas this (the original construction) produces 3029, the problematic value
bin.ct <- quantile(x, j, na.rm = TRUE)
length(x) - length(x[x <= bin.ct])
# Very curious. One last thing which may or may not be related:
# If I try to subset results from ctch using different values of
# ctch$threshold, this happens:
ctch[ctch$threshold == 3.0, ]
#  threshold    val   tp    fp    tn  fn     tpr      fpr      tnr     fnr
#7         3 395.38 3029 16883 16888 111 0.96465 0.499926 0.500074 0.03535
ctch[ctch$threshold == 3.5, ]
# [1] threshold val       tp        fp        tn        fn        tpr
# fpr       tnr       fnr
#<0 rows> (or 0-length row.names)
ctch[ctch$threshold == 4.0, ]
#  threshold    val   tp    fp    tn  fn      tpr      fpr      tnr      fnr
#9         4 430.29 2875 15346 18425 265 0.915605 0.454414 0.545586 0.084395
# Why is the indexing failing in the case of
# ctch[ctch$threshold == 3.5, ], when there is a row corresponding
# to this value in the data frame ctch?
# A post-hoc fix gets rid of this problem, the cause of which I also
# do not understand:
ctch$threshold <- seq(0, 100, by = 0.5)
ctch[ctch$threshold == 3.5, ]
#  threshold    val   tp    fp    tn  fn     tpr      fpr      tnr     fnr
#8       3.5 395.38 3029 16883 16888 111 0.96465 0.499926 0.500074 0.03535
# End code
I would very much appreciate any insight into the issues detailed
above. Am I doing something wrong with my code, or missing something
obvious? As a last bit of information, I should mention that I have
found the same results with both R 2.13 and 2.13.1 (installed today).
Thanks in advance for your help.
Best, Lyndon
p.s. Here is my sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C/en_US.UTF-8/C/C/C/C
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] rgdal_0.7-1   raster_1.8-39 sp_0.9-83
loaded via a namespace (and not attached):
[1] grid_2.13.1     lattice_0.19-30 tools_2.13.1
-- 
Lyndon Estes
Research Associate
Woodrow Wilson School
Princeton University
+1-609-258-2392 (o)
+1-609-258-6082 (f)
+1-202-431-0496 (m)
lestes at princeton.edu
Eik Vettorazzi
2011-Jul-12  09:09 UTC
[R] Named numeric vectors with the same value but different names return different results when used as thresholds for calculating true positives
Hi,

On 11.07.2011 22:57, Lyndon Estes wrote:
> ctch[ctch$threshold == 3.5, ]
> # [1] threshold val tp fp tn fn tpr
> fpr tnr fnr
> #<0 rows> (or 0-length row.names)

This is the very effective FAQ 7.31 trap.
http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f
Welcome to the first circle of Patrick Burns' R Inferno!

Also, unname() is a more intuitive way of removing names.

And I think your code is quite inefficient, because you calculate
quantiles many times, which involves repeated ordering of x, and you
may be using an inefficient bin size (either too small, so that the
same split is calculated many times, or too large, so that some
splits are missed).

I'm a bit puzzled about what x and y are in your code, so any further
advice is vague, but you might have a look at any package that
calculates ROC curves, such as ROCR or pROC (and many more).

Hth

-- 
Eik Vettorazzi
Department of Medical Biometry and Epidemiology
University Medical Center Hamburg-Eppendorf
Martinistr. 52
20246 Hamburg
T ++49/40/7410-58243
F ++49/40/7410-57790
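
The points raised in this reply can be sketched in R roughly as follows. This is a minimal illustration under stated assumptions, not a reproduction of the original data: x_demo is simulated, and near() is a hypothetical helper introduced only for this example. It shows why == on computed doubles can fail (FAQ 7.31), how a tolerance-based lookup avoids that, and how quantile() can return all thresholds in a single call with unname() dropping the names. It also shows that as.numeric(as.character()) rounds to 15 significant digits, which is one plausible reason (an assumption, not confirmed in the thread) why that workaround changed the tp values, rather than the names themselves.

```r
# Simulated stand-in for the poster's x (assumption: the real data
# from tpfpdat.Rdata are not used here).
set.seed(42)
x_demo <- round(runif(1000, min = 300, max = 800), 2)

# 1. FAQ 7.31: doubles that print the same need not be identical.
a <- 0.1 + 0.2
exact_equal <- (a == 0.3)                  # FALSE
close_equal <- isTRUE(all.equal(a, 0.3))   # TRUE

# 2. as.character() keeps 15 significant digits, so the round trip
#    through a string changes the value slightly. The names of a
#    vector play no part in a <= or == comparison.
round_trip_equal <- (as.numeric(as.character(a)) == 0.3)   # TRUE

# 3. Tolerant lookup instead of '=='. near() is a hypothetical helper.
near <- function(u, v, tol = 1e-8) abs(u - v) < tol
threshold <- 0.005 * (0:200) * 100       # built the same way as in the loop
row_35 <- which(near(threshold, 3.5))    # 8

# 4. All thresholds in one quantile() call, so x is not re-sorted on
#    every iteration; unname() drops the "0%", "0.5%", ... names.
probs <- seq(0, 1, by = 0.005)
cuts <- unname(quantile(x_demo, probs = probs, na.rm = TRUE))

# Number of values strictly above each cut.
tp_demo <- vapply(cuts, function(ct) sum(x_demo > ct, na.rm = TRUE),
                  integer(1))
```

With ties in the data, interpolated quantiles can still differ from the tied value in the last bit, so comparing against rounded cuts or using a tolerance may be worth considering. The ROC packages mentioned above handle the thresholding themselves.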