One of the recurring themes in the recent useR! conference was that many people find it difficult to find the functions they need for a particular task. Sandy Weisberg suggested a small idea he would like to see: a hints function that, given an object, lists likely operations. I've done my best to implement this function using the tools currently available in R, and my code is included at the bottom of this email (I hope that I haven't just duplicated something already present in R). I think Sandy's idea is genuinely useful, even in the limited form provided by my implementation, and I have already discovered a few useful functions that I was unaware of.

While developing and testing this function, I ran into a few problems which, I think, represent underlying problems with the current documentation system. These are typified by the results of running hints on an object produced by glm (having class c("glm", "lm")). I have outlined (very tersely) some possible solutions. Please note that while these solutions are largely technological, the problem is at heart sociological: writing documentation is no easier (and perhaps much harder) than writing a scientific publication, but the rewards are fewer.

Problems:

* Many functions share the same description (e.g. head, tail). Solution: each rdoc file should only describe one method. Problem: writing rdoc files is tedious, there is a lot of information duplicated between the code and the documentation (e.g. the usage statement), and some functions share a lot of similar information. Solution: make it easier to write documentation (e.g. documentation inline with code), and easier to include certain common descriptions in multiple methods (e.g. a new include command).

* It is difficult to tell which functions are commonly used/important. Solution: break down by keywords. Problem: keywords are not useful at the moment. Solution: make a better list of keywords available and encourage people to use it. Problem: people won't unless there is a strong incentive, plus good keywording requires considerable expertise (especially in building up the list). This is probably insoluble unless one person systematically keywords all of the base packages.

* Some functions aren't documented (e.g. simulate.lm, formula.glm). Typically, these are methods where the documentation is in the generic. Solution: these methods should all be aliased to the generic (by default?), and R CMD check should be amended to check for this situation. You could also argue that this is a deficiency of my function, easily fixed by automatically referring to the generic if the specific method isn't documented.

* It can't supply suggestions when there isn't an explicit method (i.e. .default is used), which makes it pretty useless for basic vectors. This may not really be a problem, as all possible operations are probably too numerous to list.

* Provides the full name of the function, when best practice is to use only the generic part when calling it. However, getting precise documentation may require the full name. I do the best I can (returning the generic if the specific method is an alias to a documentation file with the same method name), but this reflects a deeper problem: the name you should use when calling a function may be different from the name you use to get documentation.

* Can only display methods from currently loaded packages. This is a shortcoming of the methods function, but I suspect it is difficult to find S3 methods without loading a package.

Relatively trivial problems:

* Needs a wide display to be effective. Could be dealt with by breaking the description in a sensible manner (there may already be R code to do this; please let me know if you know of any).

* Doesn't currently include S4 methods. Solution: add some more code to wrap showMethods.

* Personally, I think sentence case is more aesthetically pleasing (and more flexible) than title case.

Hadley

hints <- function(x) {
  # utils:::.hsearch_db() is an unexported internal and may change between R versions
  db <- eval(utils:::.hsearch_db())
  if (is.null(db)) {
    # Run a dummy search to force the help database to be built
    help.search("abcd!", rebuild = TRUE, agrep = FALSE)
    db <- eval(utils:::.hsearch_db())
  }

  base <- db$Base
  alias <- db$Aliases
  key <- db$Keywords

  m <- all.methods(classes = class(x))
  m_id <- alias[match(m, alias[, 1]), 2]
  # Keywords for each method; collected here but not yet used in the output
  keywords <- lapply(m_id, function(id) key[key[, 2] %in% id, 1])

  f.names <- cbind(m, base[match(m_id, base[, 3]), 4])
  f.names <- unlist(lapply(seq_len(nrow(f.names)), function(i) {
    if (is.na(f.names[i, 2])) return(f.names[i, 1])
    a <- methodsplit(f.names[i, 1])
    b <- methodsplit(f.names[i, 2])

    # Use the documentation name when it shares the generic part of the method name
    if (a[1] == b[1]) f.names[i, 2] else f.names[i, 1]
  }))

  hints <- cbind(f.names, base[match(m_id, base[, 3]), 5])
  hints <- hints[order(tolower(hints[, 1])), , drop = FALSE]
  hints <- rbind(c("--------", "---------------"), hints)
  rownames(hints) <- rep("", nrow(hints))
  colnames(hints) <- c("Function", "Task")
  hints[is.na(hints)] <- "(Unknown)"

  class(hints) <- "hints"
  hints
}

print.hints <- function(x, ...) print(unclass(x), quote = FALSE)

# Names of all methods defined for any of the given classes
all.methods <- function(classes) {
  methods <- do.call(rbind, lapply(classes, function(x) {
    m <- methods(class = x)
    t(sapply(as.vector(m), methodsplit)) # m[attr(m, "info")$visible]
  }))
  rownames(methods[!duplicated(methods[, 1]), , drop = FALSE])
}

# Split "generic.class" at the last dot into its generic and class parts
methodsplit <- function(m) {
  parts <- strsplit(m, "\\.")[[1]]
  if (length(parts) == 1) {
    c(name = m, class = "")
  } else {
    c(name = paste(parts[-length(parts)], collapse = "."), class = parts[length(parts)])
  }
}

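One caveat with methodsplit above: it splits at the last dot, so a method for a class whose name contains a dot is split in the wrong place (summary.data.frame becomes generic "summary.data", class "frame"). A minimal sketch of an alternative follows, assuming a recent version of R in which methods() attaches an "info" data frame with a generic column (the attribute referred to in the commented-out code in all.methods). The helper name generics_for is hypothetical and is not part of the code above.

```r
# Sketch: read generic names from the "info" attribute of methods(),
# instead of guessing them by splitting method names at a dot.
# Assumes attr(methods(...), "info") has a `generic` column (recent R).
generics_for <- function(x) {
  per_class <- lapply(class(x), function(cl) {
    info <- attr(methods(class = cl), "info")
    as.character(info$generic)
  })
  sort(unique(unlist(per_class)))
}

# The glm example from the message: class c("glm", "lm")
fit <- glm(am ~ wt, data = mtcars, family = binomial)
g <- generics_for(fit)

# A class name containing a dot no longer confuses the result
df_generics <- generics_for(data.frame())
```

The result is a plain character vector of generic names, which is also the name one would normally use when calling the function.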
I've moved this from R-help to R-devel, where I think it is more appropriate, and interspersed comments below.

On 6/19/2006 8:51 AM, hadley wickham wrote:
> One of the recurring themes in the recent useR! conference was that
> many people find it difficult to find the functions they need for a
> particular task. Sandy Weisberg suggested a small idea he would like
> to see: a hints function that given an object, lists likely
> operations. I've done my best to implement this function using the
> tools currently available in R, and my code is included at the bottom
> of this email (I hope that I haven't just duplicated something already
> present in R). I think Sandy's idea is genuinely useful, even in the
> limited form provided by my implementation, and I have already
> discovered a few useful functions that I was unaware of.
>
> While developing and testing this function, I ran into a few problems
> which, I think, represent underlying problems with the current
> documentation system. These are typified by the results of running
> hints on an object produced by glm (having class c("glm", "lm")). I
> have outlined (very tersely) some possible solutions. Please note
> that while these solutions are largely technological, the problem is
> at heart sociological: writing documentation is no easier (and perhaps
> much harder) than writing a scientific publication, but the rewards
> are fewer.
>
> Problems:
>
> * Many functions share the same description (e.g. head, tail).
> Solution: each rdoc file should only describe one method. Problem:
> Writing rdoc files is tedious, there is a lot of information
> duplicated between the code and the documentation (e.g. the usage
> statement) and some functions share a lot of similar information.
> Solution: make it easier to write documentation (e.g. documentation
> inline with code), and easier to include certain common descriptions
> in multiple methods (e.g. new include command)

I think it's bad to document dissimilar functions in the same file, but similar related functions *should* be documented together. Not doing this just adds to the burden of documenting them, and the risk of modifying only part of the documentation so that it is inconsistent. The user also gets the benefit of seeing a common description all at once, rather than having to decide whether to follow "See also" links.

Your solutions would both be interesting on their own merits regardless of the above. We did decide to work on preprocessing directives for .Rd files at the R core meetings; some sort of include directive may be possible.

I don't think I would want complete documentation mixed with the original source, but it would certainly be interesting to have partial documentation there. (Complete documentation is too long, and would make it harder to read the source without a dedicated editor that could hide it. Though ESS users may see it as a reasonable requirement to have everyone use the same editor, I don't think it is.) However, this is a lot of work, depending on infrastructure that is not in place.

> * It is difficult to tell which functions are commonly
> used/important. Solution: break down by keywords. Problem: keywords
> are not useful at the moment. Solution: make better list of keywords
> available and encourage people to use it. Problem: people won't
> unless there is a strong incentive, plus good keywording requires
> considerable expertise (especially in building up list). This is
> probably insoluble unless one person systematically keywords all of
> the base packages.

I think it is worse than that. There are concepts in packages that just don't arise in base R, and hence there would be no keywords for them other than "misc", even if someone redesigned the current system. Keywording is hard, and it's not clear to me how to do much better than we currently do. We do already have user-defined keywords (via \concept), but these are not widely used.

>
> * Some functions aren't documented (e.g. simulate.lm, formula.glm) -
> typically, these are methods where the documentation is in the
> generic. Solution: these methods should all be aliased to the generic
> (by default?), and R CMD check should be amended to check for this
> situation. You could also argue that this is a deficiency with my
> function, and easily fixed by automatically referring to the generic
> if the specific isn't documented.

I'd say it's a deficiency of your function. You might want to look at the code in get("?") and .helpForCall() to see how those functions work out things like ?simulate(x) where x is an lm object. (But notice that .helpForCall is an undocumented internal function; don't depend on its implementation working forever.)

> * It can't supply suggestions when there isn't an explicit method
> (i.e. .default is used), this makes it pretty useless for basic
> vectors. This may not really be a problem, as all possible operations
> are probably too numerous to list.
>
> * Provides full name for function, when best practice is to use
> generic part only when calling function. However, getting precise
> documentation may require that full name.

No, not if the call syntax above is used.

> I do the best I can
> (returning the generic if specific is alias to a documentation file
> with the same method name), but this reflects a deeper problem that
> the name you should use when calling a function may be different to
> the name you use to get documentation.
>
> * Can only display methods from currently loaded packages. This is a
> shortcoming of the methods function, but I suspect it is difficult to
> find S3 methods without loading a package.
>
> Relatively trivial problems:
>
> * Needs wide display to be effective. Could be dealt with by
> breaking description in a sensible manner (there may already be R code
> to do this. Please let me know if you know of any)

I think strwrap() may do what you want.

>
> * Doesn't currently include S4 methods. Solution: add some more code
> to wrap showMethods
>
> * Personally, I think sentence case is more aesthetically pleasing
> (and more flexible) than title case.

It's quite hard to go from existing title case to sentence case, because we don't have any markup to indicate proper names. One would think it would be easier to go in the opposite direction, but in fact the same problem arises: "van Beethoven" for example, not "Van Beethoven".

>
>
> Hadley
>
>
> hints <- function(x) {

I don't like the name "hints". I think we already have too many ways into the help system: help, ?, help.search, apropos, etc. I like your function, but I'd rather see it attached to one of the existing help functions, probably help.search(). For example, help.search(x) could look for functions designed to work with the class of x, if it had one. (There's some ambiguity here: perhaps x contains a string, and I want help on that string.)

Anyway, thanks for your efforts on this so far; I hope we end up with something that can make it into the next release.

Duncan Murdoch

> [...]
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

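The strwrap() suggestion for narrow displays can be sketched as follows. This is only an illustration under stated assumptions: format_hint, its arguments, and the column widths are hypothetical and are not part of the posted hints code; the example title is made up for the demonstration.

```r
# Sketch: wrap a long "Task" description with strwrap() so that one
# "Function / Task" row fits a narrow display. The function name is
# printed on the first line only; continuation lines are indented.
format_hint <- function(fun, task, width = 40, name_width = 14) {
  lines <- strwrap(task, width = width)
  # Function name on the first line, blanks on continuation lines
  name_col <- c(fun, character(max(length(lines) - 1L, 0L)))
  paste0(format(name_col, width = name_width), lines)
}

out <- format_hint("summary.glm",
                   "Summarizing Generalized Linear Model Fits",
                   width = 25)
cat(out, sep = "\n")
```

Because strwrap() breaks only at whitespace, a single word longer than width would still overflow; that case is left unhandled in this sketch.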
hadley wickham wrote:
> One of the recurring themes in the recent useR! conference was that
> many people find it difficult to find the functions they need for a
> particular task. Sandy Weisberg suggested a small idea he would like
> to see: a hints function that given an object, lists likely
> operations.
> [...]

Just a feedback: that's a useful function, thank you.
But the problem is probably more general: frequently I do not really want to know what I can do in general with a data frame, for instance; rather, I would like to use `help.search' as I would use, say, Google (and with the same rate of success...). However, the actual `keywords' in the man pages seem insufficient, and `help.search' does not allow full-text search of the man pages. (I can imagine why (1000 hits...), but without such a thing Google, for instance, would probably not be half as useful as it is, right?) There is also no "sorting by relevance" in the `help.search' output, I think. How this sorting could be achieved is a different question, of course.
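As a rough illustration of what "sorting by relevance" over full text might look like, here is a minimal sketch that scores each page by how many of its words match the query terms. Everything in it is an assumption made for the example: the pages vector is a hand-written stand-in for help-page text (not real Rd content), rank_pages is a hypothetical helper, and plain term counting is only one crude choice of score.

```r
# Hypothetical mini-corpus standing in for the full text of help pages
pages <- c(
  merge   = "Merge two data frames by common columns or row names",
  subset  = "Return subsets of vectors, matrices or data frames which meet conditions",
  strwrap = "Wrap character strings to format paragraphs"
)

# Score each page by the number of its words that match a query term,
# drop pages with no match, and sort by decreasing score
rank_pages <- function(query, pages) {
  terms <- tolower(strsplit(query, "[[:space:]]+")[[1]])
  words <- strsplit(tolower(unname(pages)), "[^a-z]+")
  score <- vapply(words, function(w) sum(w %in% terms), numeric(1))
  names(score) <- names(pages)
  score <- score[score > 0]
  score[order(-score, names(score))]
}

res <- rank_pages("merge data frames", pages)
print(res)
```

A real implementation would have to deal with the "1000 hits" problem mentioned above, for instance by weighting rare terms more heavily than common ones; that is beyond this sketch.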