Dear all,

I am sorry, but I am not sure whether I should ask this question here or on R-devel. Is there an existing package that implements, or is in the process of implementing, feature hashing or a similar function?

For those who do not know "feature hashing", please let me give a brief explanation here.

Feature hashing is a technique for quickly converting a large number of strings to dummy variables (similar to `stats::contrasts`). For example, if I want to convert a character vector `x <- c("asdfa", "adsfausd", .....)` to dummy variables, I need to construct a mapping between the strings and the indices (`base::factor`). However, if `x` has many distinct elements and `x` itself is huge, the overhead of constructing the index is large. Moreover, the overhead is even larger in a distributed environment.

A good hash function can map each string to an index quickly, without the overhead of constructing the index. The probability of a "collision" might be small if we pick a good hash function. For details, please see http://en.wikipedia.org/wiki/Feature_hashing

Best,
Wush Wu
PhD Student
Graduate Institute of Electrical Engineering, National Taiwan University
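To make the idea in the question concrete, here is a minimal base-R sketch of hashed dummy variables. The names `hash_index` and `hashed_dummies` are hypothetical helpers, not from any package, and the polynomial hash is chosen only for brevity; it is not a "good" hash function in the sense meant above, and the number of buckets is an arbitrary assumption.

```r
# Hypothetical helper: map one string to a bucket in 1..n_buckets
# using a simple polynomial hash over its Unicode code points.
hash_index <- function(s, n_buckets) {
  h <- 0
  for (code in utf8ToInt(s)) {
    h <- (h * 31 + code) %% n_buckets
  }
  h + 1
}

# Hypothetical helper: build a dummy-variable matrix without first
# constructing the string-to-index mapping that factor() would need.
hashed_dummies <- function(x, n_buckets = 16L) {
  idx <- vapply(x, hash_index, numeric(1),
                n_buckets = n_buckets, USE.NAMES = FALSE)
  m <- matrix(0L, nrow = length(x), ncol = n_buckets)
  m[cbind(seq_along(x), idx)] <- 1L  # one column set per row
  m
}

x <- c("asdfa", "adsfausd", "asdfa")
m <- hashed_dummies(x, n_buckets = 8L)
```

Each row has exactly one column set to 1, and equal strings always land in the same column. Distinct strings may share a column (a collision); using more buckets makes that less likely at the cost of a wider matrix.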
On 25/10/2014, 5:25 AM, Wush Wu wrote:
> Dear all,
>
> Sorry that I am not sure that whether I should ask the question here or
> R-devel. Is there any existed packages which implements or is implementing
> feature hashing or similar function?
>
> For who does not know "feature hashing", please let me give a brief
> explanation here.
>
> Feature hashing is a technique to convert a large amount of string to dummy
> variables quickly( similar to `stats::contrasts` ). For example, if I want
> to convert a character vector `x <- c("asdfa", "adsfausd", .....)` to dummy
> variable, I need to construct a mapping between the string and the index
> (`base::factor`). However, if the `x` has lots of different elements and
> the size of `x` is huge, the overhead of constructing index is large.
> Moreover, the overhead is larger for the distributed environment.
>
> A good hashing function could be used to map the string to the index
> quickly without the overhead of constructing the index. The probability of
> "collision" might be small if we pick a good hashing function. For details,
> please see http://en.wikipedia.org/wiki/Feature_hashing

The "digest" package implements several different hash functions. You could use the hash values as names in an environment to index arbitrary objects associated with the values.

Duncan Murdoch
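A rough sketch of this suggestion, assuming the "digest" package is installed. The choice of MD5, the variable names, and the step that folds the hash into a bucket index are illustrative assumptions, not something prescribed in the reply.

```r
library(digest)  # assumption: the digest package is installed

x <- c("asdfa", "adsfausd", "asdfa")

# Hash each string itself (serialize = FALSE) rather than its R serialization.
keys <- vapply(x, digest, character(1),
               algo = "md5", serialize = FALSE, USE.NAMES = FALSE)

# Use the hash values as names in an environment to index associated objects.
store <- new.env(hash = TRUE)
for (i in seq_along(x)) {
  assign(keys[i], x[i], envir = store)
}
value <- get(keys[2], envir = store)

# Assumed extra step for feature hashing: fold the first 7 hex digits
# of each hash into a bucket index in 1..n_buckets.
n_buckets <- 8L
idx <- strtoi(substr(keys, 1, 7), base = 16L) %% n_buckets + 1L
```

Only 7 hex digits are used so that the parsed value fits in R's 32-bit integer range; equal strings get the same key and therefore the same bucket.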