thr3ads.net - R help - [R] dixon test [Aug 2008]

giov

2008-Aug-12 08:50 UTC

[R] dixon test

Hi, I need some help using the R outliers package. I would like to perform a
Q-test (Dixon test) on my data set. I used the dixon.test function, but I
cannot work out which confidence level the test uses. I have n=101 (n = number
of data points). So, can I use dixon.test directly? What about the qdixon and
qtable functions?

thank you so much!

Fernando Marmolejo-Ramos

2008-Aug-12 22:15 UTC

[R] dixon test

hi giov

about the dixon test... I just ran a simple test with a sample of 40 and I
got:

Error in dixon.test(x) : Sample size must be in range 3-30

So it seems that most of the tests in the "outliers" package are designed for
small samples. See also the R News article published in May 2006 (vol 6/2)
called "Processing data for outliers" by Lukasz Komsta (the developer of the
package).

However, that package also has a function called "scores" which works for
big samples. With it you can see the p-values and z scores for your
observations and determine which values are considered outliers.

Try this simple syntax:

library(outliers)
library(gamlss.dist)

# this produces an exponential+Gaussian distribution (which usually has heaps
# of outliers!)
x <- rexGAUS(100, 2000, 3000, 5000)

# this confirms that Dixon only works for samples between 3 and 30!!!
# (wrapped in try() so the rest of the script still runs after the error)
try(dixon.test(x))

# just to see what the data set looks like and visually confirm the outliers
boxplot(x, notch=TRUE)

# sort the observations in ascending order
sort(x)

# returns the p-value of each observation (using z scores), in ascending order
sort(scores(x, type="z", prob=1))

# lists the observations flagged as outliers at the 0.95 probability level
sort(x[scores(x, type="z", prob=0.95)])

The author notes the following about the "prob" argument:

prob ---- If set, the corresponding p-values instead of scores are given. If
value is set to 1, p-value are returned. Otherwise, a logical vector is
formed, indicating which values are exceeding specified probability. In "z"
and "mad" types, there is also possibility to set this value to zero, and
then scores are confirmed to (n-1)/sqrt(n) value, according to Shiffler
(1998). The "iqr" type does not support probabilities, but "lim" value can
be specified.

The Shiffler reference is not quite as it appears in the help. It is
this one:

Shiffler, R.E. (1988). Maximum Z scores and outliers. Am. Stat. 42, 1,
79-80.
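
The (n-1)/sqrt(n) value mentioned in the help text is the largest z score a sample of size n can produce when the sample standard deviation is used. A minimal base R sketch of that bound follows; the data vector is made up for illustration and the check does not call the outliers package at all.

```r
# Hypothetical sample: nine identical values plus one extreme value.
# This configuration attains the maximum possible z score for n = 10.
x <- c(rep(10, 9), 100)
n <- length(x)

# Ordinary z scores using the sample standard deviation (n - 1 denominator)
z <- (x - mean(x)) / sd(x)

max_z <- max(abs(z))
bound <- (n - 1) / sqrt(n)   # Shiffler's upper limit on |z|

print(c(max_z = max_z, bound = bound))

# Any other sample of the same size stays at or below the bound
set.seed(42)
z_random <- as.numeric(scale(rnorm(n)))
print(max(abs(z_random)) <= bound)
```

One practical consequence: with n = 10 the bound is about 2.85, so a fixed cutoff such as |z| > 3 can never flag anything in a sample that small.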

I hope this helps,

Fernando


S Ellison

2008-Aug-13 11:35 UTC

[R] dixon test

>>> giov <biowoman at libero.it> 13/08/2008 10:59:32 >>>
> just a question...I don't know
> what is the distribution of my data (normal, T, etc...). So, how can I
> set the type parameter?

You must assume an underlying distribution or you can't do an outlier
test.

Outliers are just unusually extreme data points. They can only be
considered 'unusual' if there is some basis - a distribution assumption
- for deciding what is 'usual'.  The assumed underlying distribution
describes what is expected to be 'usual'. 

With no distribution assumption, there is no basis for considering any
data point unusual, so the idea of an outlier really has no meaning. 
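
A rough sketch of how much the verdict depends on the assumed distribution, using only base R. The standardized value of 4 and the choice of a t distribution with 2 degrees of freedom as the heavy-tailed alternative are assumptions made purely for illustration.

```r
# Hypothetical observation, already expressed in standardized units
obs <- 4

# Two-sided tail probability if the data are assumed normal
p_normal <- 2 * pnorm(-abs(obs))

# The same observation under a heavy-tailed assumption (t with 2 df)
p_t2 <- 2 * pt(-abs(obs), df = 2)

print(c(normal = p_normal, t_2df = p_t2))
```

Under the normal assumption the tail probability is well below 0.001, so the point looks very unusual; under the heavy-tailed assumption it is above 0.05 and would not be flagged at the usual 5% level.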

Steve E


S Ellison

2008-Aug-14 10:56 UTC

[R] dixon test

giov,

It sounds like you have approximately symmetric distributions. If that
is so, and particularly if the standard deviation is less than about 20%
of the mean, I'll stick my neck out and say I would assume underlying
normality for outlier testing purposes unless there's a reason to do
otherwise (eg if you're testing variances, normality would _not_ be a
good assumption!).

The reason I'd do that is that it should not make a big
difference to the outcome with near-symmetric distributions. If it does,
your 'outliers' are borderline anyway.
Similarly, although folk can get quite exercised over which test to use
and what significance level to choose, the test you use isn't very
important either, as long as the intention is just to screen data to
make sure the most influential/extreme points are not mistakes. 

Given that, you can use any of the tests in library(outliers). You can
also use boxplot.stats, and look at the $out list, like

y<-c(rnorm(15,10), 25.1) #25.1 should be an outlier
(bxs<-boxplot.stats(y))

#and locate the outliers in y:
which(y %in% bxs$out)

Another useful approach is to use robust estimates of mean and
dispersion, like hubers() in the MASS package, and then calculate simple
scores, with a z-like cutoff to identify outliers:

require(MASS)
hy<-hubers(y)
hscore<-(y-hy$mu)/hy$s
which(abs(hscore)>3)

Using the 'mad' or iqr options in outliers::scores will be broadly
similar in outcome.
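
For comparison, a median/MAD version of the same score can be written in base R alone. This is only a sketch: the data values are made up, and it is an assumption that this hand-rolled score matches what the 'mad' option of outliers::scores computes.

```r
# Hypothetical data: nine values near 10 and one suspicious value of 25.1
y <- c(9.2, 10.1, 9.8, 10.4, 9.9, 10.0, 10.6, 9.5, 10.3, 25.1)

# Robust z-like score: centre on the median, scale by the MAD.
# mad() already applies the 1.4826 constant, so it estimates the
# standard deviation for normally distributed data.
mscore <- (y - median(y)) / mad(y)

# Same z-like cutoff as in the hubers() example
flagged <- which(abs(mscore) > 3)
print(flagged)
print(y[flagged])
```

Because the median and MAD are barely affected by the extreme value, the suspicious point cannot mask itself by inflating the scale estimate, which is the main advantage over plain z scores.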

Most of the modelling tools in R also offer useful diagnostics for
'odd' points. I find examining the residuals from rlm in MASS
particularly useful if you're seeking outliers in a regression context.
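
A minimal sketch of that regression check, assuming a simple straight-line model. The simulated data, the shift of 8 added to one response and the cutoff of 3 are all invented for illustration.

```r
library(MASS)

set.seed(1)

# Hypothetical straight-line data with one response shifted upwards
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.3)
y[10] <- y[10] + 8

# Robust fit; the shifted point is down-weighted instead of dragging the line
fit <- rlm(y ~ x)

# Residuals scaled by the robust scale estimate stored in the fit
rscore <- residuals(fit) / fit$s

# Points whose scaled residual exceeds a z-like cutoff
flagged <- which(abs(rscore) > 3)
print(flagged)

# The IWLS weights tell the same story: heavily down-weighted points are suspect
print(round(fit$w[flagged], 3))
```

The cutoff is a screening device, not a formal test; flagged points are candidates for checking, in line with the advice that follows.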

A more important question is what you will do if you find any outliers.
Outliers are just unusual compared to some expectation, not
automatically 'wrong'. Screening data for anomalies is good practice;
checking them to make sure they aren't mistakes is to be encouraged;
correcting mistakes if you find them is a no-brainer. But throwing
outliers away is something to think about very carefully, and on a
case-by-case basis. Sometimes, outliers are a genuine feature of the
process under study, or even the 'interesting' parts of the data. It's
generally unsafe to throw them out without good reason.

Steve E


PS: Contrary to my earlier confident assertion of the non-existence of
nonparametric outlier tests, Barnett and Lewis DOES include some general
suggestions on 'nonparametric' outlier testing. But it also includes the
note that this "... smacks of throwing out the bathwater before the baby
has even been immersed". I guess they don't think much of the idea
either.

>>> giov <biowoman at libero.it> 13/08/2008 15:21:25 >>>
Thank you so much, I have not much experience on outliers =), I thought that
there were nonparametric distribution-free outliers test =(. What is the
most general distribution I can use? I did histogram of my data set and
sometimes normal distribution seems to occur, sometimes an uniform
distribution seems to occur. So, I cannot understand what distribution I can
use for my whole data set....



