An interesting example of this is the forest cover data set that is
available from http://www.ics.uci.edu/~mlearn.
The proportions of the different cover types change systematically
as one moves through the file. It seems that distance through the
file is a proxy for the geographical co-ordinates. Fitting a tree-based
or similar model to the total data is not the way to go, unless one
is going to model the pattern of change through the file as part of the
modeling exercise. In any case, some preliminary exploration of the
data is called for, so that such matters come to attention. For my
money, the issue is not ease of performing regression with huge data
sets, but ease of data exploration.
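
For instance, a minimal sketch of such an exploration in R (assuming the
data file is covtype.data, with the cover type in the final column;
adjust the file name and column as needed):

  ## Read the file and extract the cover type (final column).
  covtype <- read.csv("covtype.data", header = FALSE)
  cover <- factor(covtype[[ncol(covtype)]])
  ## Split the rows, in file order, into 20 consecutive blocks and
  ## tabulate the proportion of each cover type within each block.
  block <- cut(seq_along(cover), breaks = 20, labels = FALSE)
  props <- prop.table(table(block, cover), margin = 1)
  ## One line per cover type, plotted against position in the file.
  matplot(props, type = "l", lty = 1,
          xlab = "Block (position in file)", ylab = "Proportion")

A rough plot of this kind is enough to show up the systematic drift in
the cover type proportions as one moves through the file.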
John Maindonald email: john.maindonald at anu.edu.au
phone: +61 2 (6125) 3473   fax: +61 2 (6125) 5549
Mathematical Sciences Institute, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.
On 26 Apr 2006, at 8:00 PM, r-help-request at stat.math.ethz.ch wrote:
> From: Berton Gunter <gunter.berton at gene.com>
> Date: 26 April 2006 6:47:12 AM
> To: 'Weiwei Shi' <helprhelp at gmail.com>, 'bogdan romocea'
> <br44114 at gmail.com>
> Cc: 'r-help' <R-help at stat.math.ethz.ch>
> Subject: Re: [R] regression modeling
>
>
> May I offer a perhaps contrary perspective on this.
>
> Statistical **theory** tells us that the precision of estimates
> improves as sample size increases. However, in practice, this is not
> always the case. The reason is that it can take time to collect that
> extra data, and things change over time. So the very definition of
> what one is measuring, the measurement technology by which it is
> measured (think about estimating tumor size or disease incidence or
> underemployment, for example), the presence or absence of known or
> unknown large systematic effects, and so forth may change in unknown
> ways. This defeats, or at least complicates, the fundamental
> assumption that one is sampling from a (fixed) population or stable
> (e.g. homogeneous, stationary) process, so it's no wonder that all
> statistical bets are off. Of course, sometimes the necessary
> information to account for these issues is present, and appropriate
> (but often complex) statistical analyses can be performed. But not
> always.
>
> Thus, I am suspicious, cynical even, about those who advocate
> collecting "all the data" and subjecting the whole vast heterogeneous
> mess to arcane and ever more computer intensive (and adjustable
> parameter ridden) "data mining" algorithms to "detect trends" or
> "discover knowledge." To me, it sounds like a prescription for
> "turning on all the equipment and waiting to see what happens" in the
> science lab instead of performing careful, well-designed experiments.
>
> I realize, of course, that there are many perfectly legitimate areas
> of scientific research, from geophysics to evolutionary biology to
> sociology, where one cannot (easily) perform planned experiments. But
> my point is that good science demands that in all circumstances, and
> especially when one accumulates and attempts to aggregate data taken
> over spans of time and space, one needs to beware of
> oversimplification, including statistical oversimplification. So
> interrogate the measurement, be skeptical of stability, expect
> inconsistency. While "all models are wrong but some are useful"
> (George Box), the second law tells us that entropy still rules.
>
> (Needless to say, public or private contrary views are welcome).
>
> -- Bert Gunter
> Genentech Non-Clinical Statistics
> South San Francisco, CA
>
> "The business of the statistician is to catalyze the scientific
> learning
> process." - George E. P. Box