thr3ads.net - R help - Off topic -- large data sets. Was RE: [R] 64 Bit R Background Question [Feb 2005]

Graham Jones

2005-Feb-15 17:06 UTC

Off topic -- large data sets. Was RE: [R] 64 Bit R Background Question

In message <200502151112.j1FB5fZ5002722 at hypatia.math.ethz.ch>, r-help-request at stat.math.ethz.ch writes
>Can someone give me an example (perhaps in a private response, since I'm off
>topic here) where one actually needs all cases in a large data set ("large"
>being > 1e6, say) to do a STATISTICAL analysis? By "statistical" I exclude,
>say searching for some particular characteristic like an adverse event in a
>medical or customer repair database, etc. Maybe a definition of
>"statistical" is: anything that cannot be routinely done in a single pass
>database query.

If the dimensionality of the data is large, you may need a large number
of cases too. An example from my own experience would be using quadratic
discriminant analysis (with regularization) for classifying symbols for
an OCR program. With 200 classes and 100 features, I'd really like many
millions of cases. I've been using about 20,000 per class or 4 million
in total, but if I had 40 million it would probably work better.
Compared to many applications in pattern recognition and data mining, I
think this is a fairly small example. 
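
As a rough illustration of the dimensionality point, here is a back-of-envelope count of the free parameters QDA has to estimate. This is a sketch, not taken from the original post: the function name `qda_parameter_count` is hypothetical, and the count ignores class priors and any reduction in effective parameters due to regularization.

```python
# Each QDA class needs a mean vector (p values) and a symmetric
# covariance matrix (p * (p + 1) / 2 distinct values).
def qda_parameter_count(n_classes, n_features):
    mean_params = n_features
    cov_params = n_features * (n_features + 1) // 2
    return n_classes * (mean_params + cov_params)

per_class = qda_parameter_count(1, 100)    # 5150 parameters per class
total = qda_parameter_count(200, 100)      # 1030000 parameters overall

# With 20,000 cases per class, that is fewer than 4 cases per parameter.
cases_per_parameter = 20000 / per_class
print(per_class, total, round(cases_per_parameter, 2))
```

On this count, 20,000 cases per class gives only about 3.9 cases per estimated parameter, which is consistent with wanting regularization and more data.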

-- 
Graham Jones, author of SharpEye Music Reader
visiv.co.uk
21e Balnakeil, Durness, Lairg, Sutherland, IV27 4PT, Scotland, UK

Prof Brian Ripley

2005-Feb-15 17:51 UTC

Off topic -- large data sets. Was RE: [R] 64 Bit R Background Question

On Tue, 15 Feb 2005, Graham Jones wrote:
> In message <200502151112.j1FB5fZ5002722 at hypatia.math.ethz.ch>, r-help-request at stat.math.ethz.ch writes

[Actually quoting Bert Gunter, BTW]

>> Can someone give me an example (perhaps in a private response, since I'm off
>> topic here) where one actually needs all cases in a large data set ("large"
>> being > 1e6, say) to do a STATISTICAL analysis? By "statistical" I exclude,
>> say searching for some particular characteristic like an adverse event in a
>> medical or customer repair database, etc. Maybe a definition of
>> "statistical" is: anything that cannot be routinely done in a single pass
>> database query.
>
> If the dimensionality of the data is large, you may need a large number
> of cases too. An example from my own experience would be using quadratic
> discriminant analysis (with regularization) for classifying symbols for
> an OCR program. With 200 classes and 100 features, I'd really like many
> millions of cases. I've been using about 20,000 per class or 4 million
> in total, but if I had 40 million it would probably work better.
> Compared to many applications in pattern recognition and data mining, I
> think this is a fairly small example.
But Bert's caveats apply: you have 200 problems of size 20,000 since in 
QDA each class's distribution is estimated separately, and a single pass 
will give you the sufficient statistics however large the dataset is.
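
The single-pass point can be sketched as follows. This is a minimal illustration, not code from the thread: the names `ClassStats` and `single_pass` are hypothetical, and the input is assumed to be a stream of `(label, feature_vector)` pairs. For each class it accumulates the count, the sum, and the sum of outer products, which are enough to recover the class mean and covariance that QDA needs, without holding the data in memory.

```python
class ClassStats:
    """Running sufficient statistics for one class (hypothetical helper)."""

    def __init__(self, p):
        self.n = 0
        self.sum = [0.0] * p
        self.outer = [[0.0] * p for _ in range(p)]

    def update(self, x):
        # Fold one case into the count, sum, and sum of outer products.
        self.n += 1
        for i, xi in enumerate(x):
            self.sum[i] += xi
            row = self.outer[i]
            for j, xj in enumerate(x):
                row[j] += xi * xj

    def mean(self):
        return [s / self.n for s in self.sum]

    def covariance(self):
        # Unbiased sample covariance from the raw moments (needs n >= 2).
        m = self.mean()
        p = len(m)
        return [[(self.outer[i][j] - self.n * m[i] * m[j]) / (self.n - 1)
                 for j in range(p)] for i in range(p)]


def single_pass(stream, p):
    """Read (label, x) pairs once; memory use depends on p, not on the data size."""
    stats = {}
    for label, x in stream:
        if label not in stats:
            stats[label] = ClassStats(p)
        stats[label].update(x)
    return stats


cases = [("a", [1, 2]), ("a", [3, 6]), ("b", [0, 0]), ("a", [5, 4]), ("b", [2, 2])]
stats = single_pass(iter(cases), p=2)
print(stats["a"].mean(), stats["a"].covariance())
```

Storage is one p-by-p matrix per class regardless of how many cases are read. The raw-moment formula used here can lose precision when the means are large relative to the spread; an incremental (Welford-style) update would be a more careful choice in practice.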

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  stats.ox.ac.uk/~ripley
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
