Dear all,
Sorry, I am not sure whether I should ask this question here or on
R-devel. Is there an existing package that implements, or is in the process
of implementing, feature hashing or a similar function?
For those who do not know "feature hashing", please let me give a brief
explanation here.
Feature hashing is a technique for quickly converting a large number of
strings to dummy variables (similar to `stats::contrasts`). For example, if
I want to convert a character vector `x <- c("asdfa", "adsfausd", .....)`
to dummy variables, I need to construct a mapping between each string and
an index (`base::factor`). However, if `x` has many distinct elements and
`x` itself is huge, the overhead of constructing the index is large.
Moreover, the overhead is even larger in a distributed environment.
A good hash function can map each string to an index quickly, without the
overhead of constructing the index. The probability of a "collision" can be
kept small if we pick a good hash function. For details, please see
http://en.wikipedia.org/wiki/Feature_hashing
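
To make the idea concrete, here is a minimal sketch in base R. The function
name `hash_index`, the bucket count, and the simple polynomial hash are all
my own assumptions for illustration; a real implementation would more likely
use a well-tested hash such as MurmurHash.

```r
# Toy string hash (an assumption for illustration, not a recommended hash):
# a polynomial rolling hash over the code points of the string, reduced
# modulo the number of buckets at every step so values stay small.
hash_index <- function(s, n_buckets = 2^10) {
  h <- 0
  for (code in utf8ToInt(s)) {
    h <- (h * 31 + code) %% n_buckets
  }
  h + 1  # R indices are 1-based
}

x <- c("asdfa", "adsfausd", "asdfa")

# Each string is mapped to a column index directly; no lookup table of
# levels is built, unlike base::factor.
idx <- vapply(x, hash_index, numeric(1), USE.NAMES = FALSE)

# Dummy-variable matrix: one row per element of x, one column per bucket.
dummy <- matrix(0L, nrow = length(x), ncol = 2^10)
dummy[cbind(seq_along(x), idx)] <- 1L
```

Equal strings always land in the same column, and two different strings
share a column only when their hashes collide modulo the number of buckets.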
Best,
Wush Wu
PhD Student, Graduate Institute of Electrical Engineering, National Taiwan
University

On 25/10/2014, 5:25 AM, Wush Wu wrote:
> Dear all,
>
> Sorry that I am not sure that whether I should ask the question here or
> R-devel. Is there any existed packages which implements or is implementing
> feature hashing or similar function?
>
> For who does not know "feature hashing", please let me give a brief
> explanation here.
>
> Feature hashing is a technique to convert a large amount of string to dummy
> variables quickly( similar to `stats::contrasts` ). For example, if I want
> to convert a character vector `x <- c("asdfa", "adsfausd", .....)` to dummy
> variable, I need to construct a mapping between the string and the index
> (`base::factor`). However, if the `x` has lots of different elements and
> the size of `x` is huge, the overhead of constructing index is large.
> Moreover, the overhead is larger for the distributed environment.
>
> A good hashing function could be used to map the string to the index
> quickly without the overhead of constructing the index. The probability of
> "collision" might be small if we pick a good hashing function. For details,
> please see http://en.wikipedia.org/wiki/Feature_hashing

The "digest" package implements several different hash functions. You
could use the hash values as names in an environment to index arbitrary
objects associated with the values.

Duncan Murdoch
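
A rough sketch of that suggestion, assuming the digest package is installed.
The variable names (`keys`, `store`, `n_buckets`) and the choice of MD5 are
only for illustration, and the last step (turning part of the hex digest
into a column index) is one possible way to connect this to feature hashing,
not something the package does for you.

```r
library(digest)

x <- c("asdfa", "adsfausd", "asdfa")

# Hash the string contents themselves (serialize = FALSE) rather than the
# serialized R object, so the key depends only on the characters.
keys <- vapply(x, digest, character(1), algo = "md5", serialize = FALSE,
               USE.NAMES = FALSE)

# Use the hash values as names in an environment to index arbitrary objects.
store <- new.env(hash = TRUE)
for (i in seq_along(x)) {
  assign(keys[i], x[i], envir = store)
}

# For feature hashing, reduce each hash to a column index: take the first
# 7 hex digits (28 bits, safely inside R's integer range) modulo the
# number of buckets.
n_buckets <- 2^10
idx <- strtoi(substr(keys, 1, 7), base = 16L) %% n_buckets + 1
```

An object can then be retrieved with `get(keys[2], envir = store)`, and
`idx` can be used as the column positions of a dummy-variable matrix.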