On 25/10/2014, 5:25 AM, Wush Wu wrote:
> Dear all,
>
> Sorry, I am not sure whether I should ask this question here or on
> R-devel. Are there any existing packages that implement, or are in the
> process of implementing, feature hashing or a similar function?
>
> For those who do not know "feature hashing", please let me give a brief
> explanation here.
>
> Feature hashing is a technique for quickly converting a large number of
> strings to dummy variables (similar to `stats::contrasts`). For example, if
> I want to convert a character vector `x <- c("asdfa", "adsfausd", .....)`
> to dummy variables, I need to construct a mapping between each string and
> its index (`base::factor`). However, if `x` has lots of distinct elements
> and the size of `x` is huge, the overhead of constructing the index is
> large. Moreover, the overhead is even larger in a distributed environment.
>
> A good hashing function could be used to map each string to an index
> quickly, without the overhead of constructing the index. The probability
> of a "collision" might be small if we pick a good hashing function. For
> details, please see en.wikipedia.org/wiki/Feature_hashing

The "digest" package implements several different hash functions. You
could use the hash values as names in an environment to index arbitrary
objects associated with the values.
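
A minimal sketch of that idea, assuming the digest package is installed
from CRAN. The example strings beyond those in the question, the choice of
MD5, and the bucket count `n_buckets` are illustrative assumptions only:

```r
library(digest)  # assumed to be installed from CRAN

x <- c("asdfa", "adsfausd", "asdfa")

# Hash each string directly; serialize = FALSE hashes the characters
# themselves rather than R's serialized form of the object.
keys <- vapply(x, digest, character(1), algo = "md5", serialize = FALSE,
               USE.NAMES = FALSE)

# Use the hash values as names in an environment to index arbitrary objects.
store <- new.env(hash = TRUE)
for (i in seq_along(x)) {
  assign(keys[i], list(value = x[i]), envir = store)
}
get(keys[2], envir = store)$value

# For feature hashing, reduce each hash to a column index in 1..n_buckets.
# Only the first 7 hex digits (28 bits) are used so the value fits in an
# R integer. n_buckets is a hypothetical size chosen for illustration.
n_buckets <- 1024L
idx <- strtoi(substr(keys, 1, 7), base = 16L) %% n_buckets + 1L
```

Identical strings always get the same key and index, so no lookup table has
to be built or shared; distinct strings may still collide in `idx`, and a
larger `n_buckets` makes that less likely.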

Duncan Murdoch