Jeffrey Horner
2008-Feb-13 22:31 UTC
[Rd] RFC for package PopCon: a popularity contest for R and packages
Hello all,

I've developed a prototype package called PopCon (short for popularity contest), a package for tracking the popularity of R and its packages. I'd like this work to be similar in spirit to the Debian package popularity-contest: http://popcon.debian.org/.

Once PopCon is loaded, it captures two kinds of information from the user and stores them in a cache: the names of the libraries he/she loads, and the names of symbols requested from his/her code. Once the cache is full, the goal is to flush the data to a central server for storage, free for anyone to download and analyze. That's it. It's simple to use and works behind the scenes.

You can get the prototype here:

http://biostat.mc.vanderbilt.edu/twiki/pub/Main/JeffreyHorner/PopCon_0.1.tar.gz

Note that flushing of the cache is NOT TURNED ON and IT WON'T FORWARD ANY DATA ANYWHERE! The cached data only gets deleted.

I envision all the software and data generated and stored being licensed under the GPL and a Creative Commons license, or even placed in the public domain.

Thoughts? I'm looking for volunteers, because there are many issues to hash out. Here are a few of them:

1. Obviously storing IP addresses or any bit of personal information is out, but I'm interested in generating a permanent random key of some sort so that data from the same R install can be tracked. I'm wondering if just md5 hashing the combination of R version, platform, and IP address would be appropriate and reproducible per R install. The Debian package popularity-contest has the benefit of installing an '/etc' config file and generating the key once, while I'd like PopCon users to just call 'library(PopCon)' and do nothing else.

2. I'm willing to maintain the central server and work on the infrastructure, but help will definitely be needed. Also, if there's significant interest, maybe R core would be interested in this.

3. What exactly is PopCon tracking as far as symbol names go?
It currently uses an R_ObjectTable object attached to the search path to capture names, but is this the best way? See http://www.omegahat.org/RObjectTables/. It's also replacing base::getHook to trap library loads.

4. What else would be interesting to track? Some folks have suggested various bits of R.Version() output.

Here's what PopCon can currently do:

> library(PopCon)
> search()
 [1] ".GlobalEnv"        "package:PopCon"    ".pcUDB"
 [4] "package:stats"     "package:graphics"  "package:grDevices"
 [7] "package:utils"     "package:datasets"  "package:methods"
[10] "Autoloads"         "package:base"

# Notice the above search entry .pcUDB. That's the R Object Table

> typeof(PopCon::getCache())
[1] "character"
> PopCon::getCache()
[1] ".conflicts.OK" "search"        "::"

# Now the cache contains the name 'search', which I called above,
# and the double colon operator.

> library(cluster)
> any(PopCon::getCache()=='package:cluster')
[1] TRUE

# Package names are represented in the PopCon cache just like
# their name on the search path.

> PopCon::getCache()
 [1] ".conflicts.OK"            "search"
 [3] "::"                       "$.data.frame"
 [5] "$.default"                "$.data.frame"
 [7] "$.default"                "unique.integer"
 [9] "unique.numeric"           "$.data.frame"
[11] "$.default"                "unique.integer"
[13] "unique.numeric"           "unique.character"
[15] "unique.integer"           "unique.numeric"
[17] "close.gzfile"             "$.packageDescription2"
[19] "$.default"                "$.data.frame"
[21] "$.default"                "unique.integer"
[23] "unique.numeric"           "unique.character"
[25] "unique.integer"           "unique.numeric"
[27] "close.gzfile"             "$.packageDescription2"
[29] "$.default"                "unique.integer"
[31] "unique.numeric"           "close.gzfile"
[33] "names.simple.list"        "names.default"
[35] "[.default"                "as.character.simple.list"
[37] "as.vector.simple.list"    "as.vector.default"
[39] "unique.character"         "$.packageDescription2"
[41] "$.default"                ">=.R_system_version"
[43] "Ops.R_system_version"     ">=.package_version"
[45] "Ops.package_version"      ">=.numeric_version"
[47] ">=.package_version"       "Ops.package_version"
[49] ">=.numeric_version"       "unlist.R_system_version"
[51] "unlist.package_version"   "unlist.numeric_version"
[53] "unlist.default"           "unlist.package_version"
[55] "unlist.numeric_version"   "unlist.default"
[57] "as.list.R_system_version" "as.list.package_version"
[59] "unique.integer"           "unique.numeric"
[61] "as.list.R_system_version" "as.list.package_version"
[63] "unique.integer"           "unique.numeric"
[65] "as.list.package_version"  "unique.integer"
[67] "unique.numeric"           "as.list.package_version"
[69] "unique.integer"           "unique.numeric"
[71] ">=.default"               "$.packageDescription2"
[73] "$.default"                "<.R_system_version"
[75] "Ops.R_system_version"     "<.package_version"
[77] "Ops.package_version"      "<.numeric_version"
[79] "unique.character"         "unlist.R_system_version"
[81] "unlist.package_version"   "unlist.numeric_version"
[83] "unlist.default"           "unlist.numeric_version"
[85] "unlist.default"           "as.list.R_system_version"
...

# I've truncated the output here. But you get the idea.

Any and all comments welcome.
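
Regarding issue 1 above, a minimal sketch of the hashed key could look like the following. It is only an illustration under stated assumptions: the function name install_key and its ip argument are hypothetical (PopCon does not provide them), the IP address is passed in by the caller rather than detected, and the string is written to a temporary file because tools::md5sum() hashes files.

```r
# Hypothetical helper (not part of PopCon): derive a per-install key by
# md5-hashing the R version, the platform, and a caller-supplied IP address.
install_key <- function(ip,
                        version  = R.version$version.string,
                        platform = R.version$platform) {
  # tools::md5sum() works on files, so write the combined string to a
  # temporary file and hash that file.
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(charToRaw(paste(version, platform, ip, sep = "|")), tmp)
  unname(tools::md5sum(tmp))
}

# 192.0.2.1 is a documentation-only example address.
key <- install_key("192.0.2.1")
```

The key is reproducible only while all three inputs stay the same: upgrading R or getting a different IP address yields a different key for the same machine.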

Jeff
--
http://biostat.mc.vanderbilt.edu/JeffreyHorner