I would like to perform a multinomial logistic regression on a large
data set, but do not know how. I have thought of only a few possibilities
and am writing to seek advice on them, or on how to deepen or expand
my search.
On smaller data sets, I have successfully loaded the data and issued
commands such as:
length(levels(factor(data$response)))
[1] 6 # implies polychotomy
library(nnet)
result <- multinom(response ~ 1 + var1 + var2 + ..., data = data)
# (I am interested in at most ten
# parameters; usually fewer than six)
For a 60-MB comma-separated-values (CSV) text file (with a few
hundred thousand records), object.size(data) returns roughly 86 MB. Now
I am considering loading a 7-GB data file (with about 30 million
records). (In the near future, I may be interested in loading a 50-GB
data file, but right now I am still trying things out on smaller sets.)
What should I do?
1. I recall some discussion from August 2006 about the use of the biglm
package. (Subject: lean and mean lm/glm?) This seems potentially very
useful, but it's not clear to me how to fit a multinomial response. Can
I get bigglm to fit polychotomous data?
2. Earlier, I thought I ran across an example (perhaps in V&R's MASS4 or
Harrell's Regression Modeling Strategies) showing how to use glm and an
appropriate family specification to perform a multinomial logistic
regression, but now I cannot find the example. This is what had to be
done before the multinom() function became available, and it still
works, but I need a reference or example --- can anyone point me to it?
I suspect part of my problem is that I do not understand the
documentation on 'family': I'm not sure what the 'object' argument is.
It is defined as:
"object: the function family accesses the family objects which are
stored within objects created by modelling functions (e.g., glm)."
My impression is that glm() returns a glm object. I'm not sure what to
write there.
If the example doesn't exist, my brain may have [wishfully] inserted the
"multinomial" into my memory. It's clear that glm can be used for
[ordinary/binomial] logistic regression.
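On the 'object' question, a minimal sketch may help separate the two uses of 'family'. As far as I can tell, in glm() the family argument takes a family function such as binomial, while family(object) is an extractor whose 'object' is an already-fitted model. The data and the names x, y, fit and fam below are made up for illustration. Note that the families shipped with base R (binomial, gaussian, Gamma, inverse.gaussian, poisson and the quasi variants) do not include a multinomial one.

```r
# Simulated binary data; x and y are illustrative names only.
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(0.5 + x))

# In glm(), 'family' specifies the error distribution and link.
fit <- glm(y ~ x, family = binomial(link = "logit"))

# In family(), 'object' is the fitted model; the call returns the
# family object stored inside it.
fam <- family(fit)
print(fam$family)
print(fam$link)
```

So nothing needs to be written for 'object' when fitting; it only matters when querying a model that already exists.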
3. I have skimmed Chen & Ripley's papers on computing near the data, but
suspect that I will need to do quite a lot of work (read: careful
reading, hand holding, and development) to adapt their solution.
4. I have briefly browsed the documentation on setting larger memory
size flags, but suspect that that's not a scalable route. My desktop
WinXP PC has 2 GB of RAM; a Linux computer I prefer has 8 GB, and I
suspect both copies of R were compiled as 32-bit (but I don't know how
to verify this).
box$ uname -a
Linux box 2.4.21-32.0.1.ELsmp #1 SMP Tue May 17 17:52:23 EDT 2005 i686
i686 i386 GNU/Linux
box$ R --max-vsize='4G'
WARNING: --max-vsize=4G=4'M': too large and ignored
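One way to check the build from within R itself, as a small sketch: .Machine$sizeof.pointer gives the pointer size in bytes (4 on a 32-bit build, 8 on a 64-bit build), and R.version$arch gives the architecture R was compiled for. The variable names ptr_bytes, bits and arch are only for illustration.

```r
# Pointer size in bytes: 4 on a 32-bit build of R, 8 on a 64-bit build.
ptr_bytes <- .Machine$sizeof.pointer
bits <- 8 * ptr_bytes

# Architecture string that this copy of R was built for.
arch <- R.version$arch

cat("R build:", bits, "bit; arch:", arch, "\n")
```

A 32-bit build cannot address more than about 4 GB per process regardless of installed RAM, which would be consistent with the warning above.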
5. If all else fails, I can fit the model to a random sample of the
data, after checking that the sample's distribution resembles that of
the full set.
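The sampling fallback could be sketched as below, reading the file in chunks through a connection so that the full file is never held in memory. This is only a sketch under stated assumptions: sample_csv, frac and chunk_rows are hypothetical names, every record is assumed to occupy exactly one line (no quoted line breaks), and each row is kept independently with probability frac, so the sample size is approximate.

```r
# Keep each data row of a CSV file with probability 'frac', reading
# 'chunk_rows' lines at a time from an open connection.
sample_csv <- function(path, frac, chunk_rows = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)
  kept <- list()
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break
    kept[[length(kept) + 1L]] <- lines[runif(length(lines)) < frac]
  }
  # Parse only the retained lines, with the original header on top.
  read.csv(text = c(header, unlist(kept)), header = TRUE)
}

# Usage on a small temporary file standing in for the real data.
set.seed(42)
tmp <- tempfile(fileext = ".csv")
full <- data.frame(response = sample(letters[1:6], 1000, replace = TRUE),
                   var1 = rnorm(1000))
write.csv(full, tmp, row.names = FALSE)
smp <- sample_csv(tmp, frac = 0.1, chunk_rows = 250L)

# Compare response proportions in the sample and in the full data.
print(round(prop.table(table(smp$response)), 2))
print(round(prop.table(table(full$response)), 2))
unlink(tmp)
```

Comparing the response proportions is one simple representativeness check; the sampled data frame can then be passed to multinom() as on the smaller data sets.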
Richard
212-933-3305 / richard.c.yeh at bankofamerica.com