Lyndon Estes
2011-Jul-11 20:57 UTC
[R] Named numeric vectors with the same value but different names return different results when used as thresholds for calculating true positives
Dear List,

I have encountered an odd problem that I cannot understand. It stems from the calculation of true and false positives from two input vectors, x and y, using different thresholds of x extracted with the quantile function. In certain cases I get different numbers of true positives for the same threshold value when that value was found under different quantiles (e.g. the threshold value Z was found with both quantile(x, probs = 0.045) and quantile(x, probs = 0.05)). The following illustrates the problem:

# Start of code with comments

# Load vectors x and y
con <- url("http://sites.google.com/site/ldemisc/r_general/tpfpdat.Rdata")
load(con)
close(con)

# Data frame to collect TP and FP values based on different thresholds of vector x
bins <- 0.005
ctch <- data.frame(matrix(nrow = 100 / (bins * 100) + 1, ncol = 4))
colnames(ctch) <- c("threshold", "val", "tp", "fp")

# Extract different TP and FP values in loop, where thresholds are based on 1/2 percent increments
# in quantiles of x
for(i in 1:(100 / (bins * 100) + 1)) {
  bin.ct <- quantile(x, bins * (i - 1), na.rm = TRUE)
  tp <- length(x) - length(x[x <= bin.ct])  # N true positives
  fp <- length(y) - length(y[y <= bin.ct])  # N false positives
  ctch[i, ] <- c(bins * (i - 1) * 100, bin.ct, tp, fp)
}

# The problem is here
ctch[1:20, ]
#    threshold    val   tp    fp
#1         0.0 330.57 3139 19485
#2         0.5 374.11 3118 17510
#3         1.0 395.38 3029 16883
#4         1.5 395.38 3029 16883
#5         2.0 395.38 3029 16883
#6         2.5 395.38 3029 16883
#7         3.0 395.38 3029 16883
#8         3.5 395.38 3029 16883
#9         4.0 430.29 2875 15346
#10        4.5 430.29 2875 15346
#11        5.0 430.29 2875 15346
#12        5.5 430.29 3029 15346
#13        6.0 430.29 3029 15346
#14        6.5 430.29 2875 15346
#15        7.0 430.29 2875 15346
#16        7.5 430.29 2875 15346
#17        8.0 430.29 2875 15346
#18        8.5 438.20 2872 14791
#19        9.0 441.66 2835 14656
#20        9.5 441.66 2835 14656

# Note that the values (val) are identical (430.29) for thresholds ranging between 4.0 and 8.0.
# However, the tp values for thresholds 5.5 and 6.0 are 3029, whereas they are 2875 for thresholds of
# 4.0-5.0 and 6.5-8.0. Given that the threshold value is the same throughout, it makes no sense that
# there is any variation in the tp values (also note that fp is the same throughout this range).

# The problem seems to be here. Re-running the loop with the following modification:
for(i in 1:(100 / (bins * 100) + 1)) {
  #bin.ct <- quantile(x, bins * (i - 1), na.rm = TRUE)
  # Substitute line above with the following line, which converts bin.ct to a nameless vector:
  bin.ct <- as.numeric(as.character(quantile(x, bins * (i - 1), na.rm = TRUE)))
  tp <- length(x) - length(x[x <= bin.ct])  # N true positives
  fp <- length(y) - length(y[y <= bin.ct])  # N false positives
  ctch[i, ] <- c(bins * (i - 1) * 100, bin.ct, tp, fp)
}

# Produces more sensible results
ctch[1:20, ]
#    threshold    val   tp    fp
#1         0.0 330.57 3139 19485
#2         0.5 374.11 3118 17510
#3         1.0 395.38 3029 16883
#4         1.5 395.38 3029 16883
#5         2.0 395.38 3029 16883
#6         2.5 395.38 3029 16883
#7         3.0 395.38 3029 16883
#8         3.5 395.38 3029 16883
#9         4.0 430.29 2875 15346
#10        4.5 430.29 2875 15346
#11        5.0 430.29 2875 15346
#12        5.5 430.29 2875 15346  # tp values are consistent now
#13        6.0 430.29 2875 15346  # ""
#14        6.5 430.29 2875 15346
#15        7.0 430.29 2875 15346
#16        7.5 430.29 2875 15346
#17        8.0 430.29 2875 15346
#18        8.5 438.20 2872 14791
#19        9.0 441.66 2835 14656
#20        9.5 441.66 2835 14656

# I am not sure why this is the way it is.
# The variable bin.ct was a vector of class "numeric" in both versions above, but in the former
# case it was named:

# First version
quantile(x, bins * (i - 1), na.rm = TRUE)
#  100%
#771.51
class(quantile(x, bins * (i - 1), na.rm = TRUE))
# "numeric"

# Second version
as.numeric(as.character(quantile(x, bins * (i - 1), na.rm = TRUE)))
#771.51
class(as.numeric(as.character(quantile(x, bins * (i - 1), na.rm = TRUE))))
# "numeric"

# I am therefore not clear why the named vectors resulting from the quantile function might be
# calculating different tp values for the same numeric value but with different vector names.
# It is not necessarily the fact that the vector is named, either. For instance, if I do this:
j <- 0.055  # bins value for 5.5% threshold
bin.ct <- as.numeric(as.character(quantile(x, j, na.rm = TRUE)))
names(bin.ct) <- "5.5%"
length(x) - length(x[x <= bin.ct])
# The value 2875 is returned

# Whereas this (the original construction) produces 3029, the problematic value
bin.ct <- quantile(x, j, na.rm = TRUE)
length(x) - length(x[x <= bin.ct])

# Very curious. One last thing which may or may not be related:
# If I try to subset results from ctch using different values of ctch$threshold, this happens:
ctch[ctch$threshold == 3.0, ]
#  threshold    val   tp    fp    tn  fn      tpr      fpr      tnr      fnr
#7         3 395.38 3029 16883 16888 111  0.96465 0.499926 0.500074  0.03535

ctch[ctch$threshold == 3.5, ]
# [1] threshold val tp fp tn fn tpr fpr tnr fnr
#<0 rows> (or 0-length row.names)

ctch[ctch$threshold == 4.0, ]
#  threshold    val   tp    fp    tn  fn      tpr      fpr      tnr      fnr
#9         4 430.29 2875 15346 18425 265 0.915605 0.454414 0.545586 0.084395

# Why is the indexing failing in the case of ctch[ctch$threshold == 3.5, ], when there is a row
# corresponding to this value in the data frame ctch?
# A post-hoc fix gets rid of this problem, the cause of which I also do not understand:
ctch$threshold <- seq(0, 100, by = 0.5)
ctch[ctch$threshold == 3.5, ]
#  threshold    val   tp    fp    tn  fn     tpr      fpr      tnr     fnr
#8       3.5 395.38 3029 16883 16888 111 0.96465 0.499926 0.500074 0.03535

# End code

I would very much appreciate any insight into the issues detailed above. Am I doing something wrong with my code, or missing something obvious?

As a last bit of information, I should mention that I have found the same results with both R 2.13 and 2.13.1 (installed today).

Thanks in advance for your help.

Best, Lyndon

p.s. Here is my sessionInfo()

R version 2.13.1 (2011-07-08)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/en_US.UTF-8/C/C/C/C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] rgdal_0.7-1 raster_1.8-39 sp_0.9-83

loaded via a namespace (and not attached):
[1] grid_2.13.1 lattice_0.19-30 tools_2.13.1

--
Lyndon Estes
Research Associate
Woodrow Wilson School
Princeton University
+1-609-258-2392 (o)
+1-609-258-6082 (f)
+1-202-431-0496 (m)
lestes at princeton.edu
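One check that may help with the question above is whether two thresholds that print as the same number are in fact the same double. The sketch below uses toy values, not the poster's x, and the variable names (a, b, named_a) are made up for illustration. It only shows that the default print width can hide a difference and that a name on a vector does not take part in a comparison. It does not establish that this is what happened with quantile() in the post.

```r
# Toy values (hypothetical, not from the poster's data): two doubles that print alike
a <- 0.1 + 0.2
b <- 0.3
named_a <- c("5.5%" = a)   # same value as a, with a name attached as quantile() would do

print(named_a)             # default printing uses 7 significant digits
sprintf("%.17g", a)        # shows the digits beyond the default print width
sprintf("%.17g", b)

a == b                     # FALSE: the two doubles differ in their last bits
unname(named_a == a)       # TRUE: the name plays no part in the comparison
```

Applying sprintf("%.17g", ...) to the two versions of bin.ct would be one way to see whether the 430.29 values agree to the last digit.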
Eik Vettorazzi
2011-Jul-12 09:09 UTC
[R] Named numeric vectors with the same value but different names return different results when used as thresholds for calculating true positives
Hi,

On 11.07.2011 22:57, Lyndon Estes wrote:
> ctch[ctch$threshold == 3.5, ]
> # [1] threshold val tp fp tn fn tpr fpr tnr fnr
> #<0 rows> (or 0-length row.names)

This is the very effective FAQ 7.31 trap.
http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f
Welcome to the first circle of Patrick Burns' R Inferno!

Also, unname() is a more intuitive way of removing names.

And I think your code is quite inefficient, because you calculate quantiles many times, which involves repeated ordering of x, and you may be using an inefficient bin size (either too small, so that the same split is calculated many times, or too large, so that some splits are missed).

I'm a bit puzzled about what x and y are in your code, so any further advice is vague, but you might have a look at any package that calculates ROC curves, such as ROCR or pROC (and many more).

Hth

--
Eik Vettorazzi
Department of Medical Biometry and Epidemiology
University Medical Center Hamburg-Eppendorf
Martinistr. 52
20246 Hamburg
T ++49/40/7410-58243
F ++49/40/7410-57790
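The three points in the reply above (FAQ 7.31, unname(), and computing the quantiles once) can be sketched together as follows. This is a minimal illustration under assumptions, not the poster's or the replier's code: x and y are synthetic stand-ins generated with rnorm(), the tolerance 1e-8 is an arbitrary choice, and the names thr_loop, probs, and cuts are invented here.

```r
# Synthetic stand-ins for x and y (assumption: the real data are not reproduced here)
set.seed(1)
x <- round(rnorm(500, mean = 600, sd = 100), 2)
y <- round(rnorm(2000, mean = 450, sd = 120), 2)

# FAQ 7.31: a product of doubles need not equal the decimal it prints as
thr_loop <- 0.005 * 7 * 100          # how the original loop built the 3.5 threshold
thr_loop == 3.5                      # FALSE with IEEE-754 doubles
isTRUE(all.equal(thr_loop, 3.5))     # TRUE: comparison within a tolerance

# One quantile() call for all probabilities; unname() drops the "0%", "0.5%", ... labels
probs <- seq(0, 1, by = 0.005)
cuts <- unname(quantile(x, probs, na.rm = TRUE))

# Number of values strictly above each cut-off, matching the original
# length(x) - length(x[x <= bin.ct]) construction
tp <- vapply(cuts, function(ct) sum(x > ct, na.rm = TRUE), integer(1))
fp <- vapply(cuts, function(ct) sum(y > ct, na.rm = TRUE), integer(1))

ctch <- data.frame(threshold = probs * 100, val = cuts, tp = tp, fp = fp)

# Select a row by threshold with a tolerance instead of ==
ctch[abs(ctch$threshold - 3.5) < 1e-8, ]
```

Selecting by row index, or storing the thresholds as seq(0, 100, by = 0.5) as in the original post's workaround, would avoid the floating-point comparison altogether.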