Rory,
On Dec 27, 2012, at 3:14 AM, Rory Winston wrote:
> Hi guys
>
> I am currently working on a small bit of bridging code between a database
> system and R. The database system has the concept of varchars, a la factors in
> R, as distinct from plain character strings.
>
Varchars are character strings. A factor consists of an index and a level set, so if
your DB doesn't keep those separate, it is not a factor (and below you
suggest it doesn't). Even if the DB supports ordered and unordered sets, the
drivers typically return only the strings anyway, so you don't get at the
set (without querying the schema). To make the point: with a factor you can have
a column consisting of values A,A,B,B and a level set of A,B,C (i.e. C is not
used, so it is extra information that you cannot express in a character vector).
If you have neither the levels information nor their order, then it's just a
character vector.
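
To illustrate the A,A,B,B example, here is a minimal sketch in plain C (no R API). The factor_view struct and both function names are made up for illustration and are not part of R; missing values are not handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a factor: 1-based codes plus a level set. */
typedef struct {
    const int *codes;           /* codes[i] is in 1..n_levels */
    size_t n;                   /* number of elements */
    const char *const *levels;  /* the level set, in order */
    size_t n_levels;
} factor_view;

/* Look up the string that element i stands for. */
const char *factor_value(const factor_view *f, size_t i) {
    assert(i < f->n);
    assert(f->codes[i] >= 1 && (size_t)f->codes[i] <= f->n_levels);
    return f->levels[f->codes[i] - 1];
}

/* Count the levels that no element refers to. */
size_t factor_unused_levels(const factor_view *f) {
    size_t unused = 0;
    for (size_t l = 0; l < f->n_levels; l++) {
        int seen = 0;
        for (size_t i = 0; i < f->n && !seen; i++)
            seen = ((size_t)f->codes[i] == l + 1);
        if (!seen) unused++;
    }
    return unused;
}
```

With codes {1,1,2,2} and levels {"A","B","C"}, factor_unused_levels() reports one unused level. Expanding the same data through factor_value() yields only A,A,B,B, so the level C is lost once you have just the strings.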
> What I would like to do is when I receive a list of character strings from
> the remote database system that are of type varchar, turn these into a factor
> variable. This would ideally need to be done in C code, where the rest of the
> datatype translation is occurring.
>
It really depends on what you want to get out and what your input really is. If
your DB delivers results in rows, probably the most efficient way to
construct a factor from string input is to simply create the index as you go and
keep a hash of the levels. Then at the end you just put the two together into
one factor object. Note that if your DB doesn't pre-specify the levels, then
the order of the levels is undefined.
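
A rough sketch of the "index as you go plus a hash of the levels" idea, in plain C without the R API. All names here (factor_builder, fb_init, fb_add, fb_free, N_BUCKETS) are hypothetical, the chained hash table with a fixed bucket count is just one possible choice, and levels are numbered in order of first appearance, which is itself an assumption since the DB does not define an order.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define N_BUCKETS 64  /* fixed bucket count: enough for a sketch only */

typedef struct level_node {
    char *name;               /* owned copy of the level string */
    int code;                 /* 1-based, in order of first appearance */
    struct level_node *next;  /* chain for hash collisions */
} level_node;

typedef struct {
    level_node *buckets[N_BUCKETS];
    char **levels;   /* level strings, indexed by code - 1 */
    int n_levels, cap_levels;
    int *codes;      /* one code per value received so far */
    int n, cap;
} factor_builder;

static unsigned hash_str(const char *s) {
    unsigned h = 5381u;  /* djb2 string hash */
    while (*s) h = h * 33u + (unsigned char)*s++;
    return h % N_BUCKETS;
}

void fb_init(factor_builder *fb) { memset(fb, 0, sizeof *fb); }

/* Append one value; returns its code, or 0 on allocation failure. */
int fb_add(factor_builder *fb, const char *value) {
    assert(fb != NULL && value != NULL);
    unsigned b = hash_str(value);
    level_node *node = fb->buckets[b];
    while (node && strcmp(node->name, value) != 0) node = node->next;

    if (!node) {  /* first time we see this string: register a new level */
        if (fb->n_levels == fb->cap_levels) {
            int cap = fb->cap_levels ? fb->cap_levels * 2 : 8;
            char **p = realloc(fb->levels, cap * sizeof *p);
            if (!p) return 0;
            fb->levels = p;
            fb->cap_levels = cap;
        }
        node = malloc(sizeof *node);
        if (!node) return 0;
        size_t len = strlen(value) + 1;
        node->name = malloc(len);
        if (!node->name) { free(node); return 0; }
        memcpy(node->name, value, len);
        node->code = fb->n_levels + 1;
        node->next = fb->buckets[b];
        fb->buckets[b] = node;
        fb->levels[fb->n_levels++] = node->name;
    }

    if (fb->n == fb->cap) {  /* grow the index vector */
        int cap = fb->cap ? fb->cap * 2 : 8;
        int *p = realloc(fb->codes, cap * sizeof *p);
        if (!p) return 0;
        fb->codes = p;
        fb->cap = cap;
    }
    fb->codes[fb->n++] = node->code;
    return node->code;
}

void fb_free(factor_builder *fb) {
    for (int b = 0; b < N_BUCKETS; b++) {
        level_node *node = fb->buckets[b];
        while (node) {
            level_node *next = node->next;
            free(node->name);
            free(node);
            node = next;
        }
    }
    free(fb->levels);
    free(fb->codes);
    memset(fb, 0, sizeof *fb);
}
```

You would call fb_add() once per row as the results arrive. At the end, codes and levels are the two pieces to put together: in R a factor is an integer vector carrying a "levels" attribute and the class "factor", so the final step would copy them into those objects.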
If you are collecting the whole character vector first anyway, then I see no
real point in not using as.factor() - even from C code.
Note, however, that in such a case you should really give the user an option not
to do that - dealing with factors is very painful and they are bad for data
manipulation, so many users (including me) prefer to set the stringsAsFactors
default to FALSE because it's much more efficient and less error-prone to deal
with character vectors. Having to convert factors back to strings is very
inefficient (in particular with large data) and superfluous, since you already
had strings to start with.
> My first attempt was a bit naive (setting the factor class attribute on a
> vector of character strings, which obviously results in an error). Looking at
> the R factor() implementation, I can see the core logic for factor conversion
> is:
>
> y <- unique(x)
> ind <- sort.list(y)
> y <- as.character(y)
> levels <- unique(y[ind])
>
> So I am guessing this would need to be replicated in C? My question is - is
> it possible to create a fully-formed factor variable in C code (I've struggled to
> find many / any examples), or should this be done in R when the call returns? I
> would like to make it seamless to the end user, so an automatic conversion to
> factors would be preferable.
>
It would not be, for the reasons above, which is why it's typically done at the
R level as an optional post-processing step. That doesn't mean you can't do it
in C, but it is somewhat painful as you'll have to hash the levels -
it's more convenient to have R do that for you.
Cheers,
Simon
> Cheers
> -- Rory
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel