thr3ads.net - R help - [R] Outlier Detection with k-Means [May 2014]

marioger

2014-May-07 08:34 UTC

[R] Outlier Detection with k-Means

Hi,

I am hoping you can help me with my problem. I am trying to detect outliers
using the kmeans algorithm. First I run the algorithm and select as possible
outliers those objects that are far from their cluster center. Instead of
the absolute distance I want to use the relative distance, i.e. the ratio of
an object's absolute distance to its cluster center to the average distance
of all objects in that cluster to the cluster center. The code for outlier
detection based on absolute distance is the following:
> # remove species from the data to cluster
> iris2 <- iris[,1:4]
> kmeans.result <- kmeans(iris2, centers=3)
> # cluster centers
> kmeans.result$centers
> # calculate distances between objects and cluster centers
> centers <- kmeans.result$centers[kmeans.result$cluster, ]
> distances <- sqrt(rowSums((iris2 - centers)^2))
> # pick top 5 largest distances
> outliers <- order(distances, decreasing=TRUE)[1:5]
> # who are outliers
> print(outliers)
But how can I use the relative instead of the absolute distance to find
outliers?
Thanks in advance.

Mario



--
View this message in context:
http://r.789695.n4.nabble.com/Outlier-Detection-with-k-Means-tp4690098.html
Sent from the R help mailing list archive at Nabble.com.

William Dunlap

2014-May-07 15:35 UTC

Try replacing your order() call with the following two lines:
    meanClusterRadius <- ave(distances, kmeans.result$cluster, FUN = mean)
    outliers <- order(distances/meanClusterRadius, decreasing = TRUE)[1:5]
ave(x, group, FUN = fun) applies FUN to each subset of x defined by the
group argument(s) and puts the result for each subset back into the
corresponding positions of x, returning the modified x.
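
To make the behaviour of ave() concrete, here is a minimal sketch: first a toy vector, then the same idea applied to the iris example from the original post. The set.seed() call and the name relativeDistances are assumptions added for illustration; they are not part of the original code, and the rows flagged will depend on the seed.

```r
# Toy illustration of ave(): each element is replaced by its group's mean.
x     <- c(1, 3, 10, 20, 30)
group <- c("a", "a", "b", "b", "b")
ave(x, group, FUN = mean)
# 2 2 20 20 20

# Applied to the k-means example from the original post.
set.seed(42)  # assumption: fixed seed so the clustering is reproducible
iris2 <- iris[, 1:4]
kmeans.result <- kmeans(iris2, centers = 3)
centers <- kmeans.result$centers[kmeans.result$cluster, ]
distances <- sqrt(rowSums((iris2 - centers)^2))

# Mean distance to the center within each cluster, repeated for every member.
meanClusterRadius <- ave(distances, kmeans.result$cluster, FUN = mean)

# Relative distance: 1 means "as far from the center as the cluster average".
relativeDistances <- distances / meanClusterRadius

# Five observations with the largest relative distance.
outliers <- order(relativeDistances, decreasing = TRUE)[1:5]
print(outliers)
```

By construction the relative distances average to 1 within every cluster, so values well above 1 mark points that are unusually far out for their own cluster.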
Bill Dunlap
TIBCO Software
wdunlap tibco.com



Boris Steipe

2014-May-07 15:51 UTC

Three comments:
(i)    If you calculate distances like this, you are weighting all columns
       equally by absolute numbers. Depending on your application, you might
       want to normalize the columns first (and before clustering).

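(ii)   Double-check your distance calculation. The Cartesian (Euclidean)
       distance is the square root of the summed squared differences,
       sqrt(rowSums((iris2 - centers)^2)), which is what your code computes.
       Note that this is not the same as sqrt(rowSums(iris2^2 - centers^2)).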

(iii)  To scale to relative distances you need to define a common,
       commensurable scale. This is often done by scaling to the standard
       deviation, which gives you a probability estimate for your outliers
       if you can model the distribution as being normal. You could also
       scale to the median.
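
Points (i) and (ii) can be sketched as follows. This is only one possible reading: scale() is used here as one way to normalize the columns, and the object names iris2.scaled, km.scaled, centers.scaled and distances.scaled, as well as the seed, are assumptions for illustration. For point (ii), the Euclidean distance is the square root of the summed squared differences, and the sketch cross-checks that form against dist() from the stats package.

```r
iris2 <- iris[, 1:4]

# (i) Standardize every column to mean 0 and standard deviation 1
#     before clustering, so that no column dominates by its units.
iris2.scaled <- scale(iris2)

set.seed(42)  # assumption: fixed seed so the clustering is reproducible
km.scaled <- kmeans(iris2.scaled, centers = 3)

# (ii) Euclidean distance of every observation to its own cluster center:
#      square root of the row-wise sum of squared differences.
centers.scaled   <- km.scaled$centers[km.scaled$cluster, ]
distances.scaled <- sqrt(rowSums((iris2.scaled - centers.scaled)^2))

# Cross-check the first observation against stats::dist().
reference <- as.numeric(dist(rbind(iris2.scaled[1, ], centers.scaled[1, ])))
stopifnot(isTRUE(all.equal(unname(distances.scaled[1]), reference)))
```

Whether normalizing is appropriate depends on the application; clustering the scaled data will in general produce different clusters than clustering the raw measurements.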

But in the end ... detecting "outliers" in the first place implies
that you have an underlying model of the distribution. If you do, you should
apply it. If you don't (and I don't see how such a model would be
possible, given that your outliers of one group could simply be members of
another) you could simply rank, and investigate (or discard) some n most
distant. That said, I still don't think this would be meaningful because of
the fact that k-means assigns *all* points to *some* cluster - but perhaps your
specific application supports a different reasoning.
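
As a rough sketch of point (iii) and of the "rank and investigate" suggestion: the distances can be standardized within each cluster, or divided by the within-cluster median, and the n largest values inspected. The names zDistances, medianRatio, n and candidates, and the seed, are assumptions for illustration. Distances are non-negative and usually skewed, so the standardized values should not be read as normal-theory probabilities without checking the distribution first.

```r
set.seed(42)  # assumption: fixed seed so the clustering is reproducible
iris2 <- iris[, 1:4]
kmeans.result <- kmeans(iris2, centers = 3)
centers <- kmeans.result$centers[kmeans.result$cluster, ]
distances <- sqrt(rowSums((iris2 - centers)^2))

# Scale to the standard deviation: (d - mean) / sd within each cluster.
# Note: sd() is NA for a cluster with a single member.
zDistances <- ave(distances, kmeans.result$cluster,
                  FUN = function(d) (d - mean(d)) / sd(d))

# Alternative: ratio of each distance to the within-cluster median.
medianRatio <- distances / ave(distances, kmeans.result$cluster, FUN = median)

# Rank and list the n most distant observations for manual inspection.
n <- 5
candidates <- order(zDistances, decreasing = TRUE)[seq_len(n)]
data.frame(row     = candidates,
           cluster = kmeans.result$cluster[candidates],
           z       = round(zDistances[candidates], 2),
           ratio   = round(medianRatio[candidates], 2))
```

The resulting table is only a list of candidates to look at; it does not by itself establish that any of these rows is an outlier.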

Cheers,
Boris




