thr3ads.net - R help - [R] slightly OT: (un)supervised clustering? [Oct 2008]

viktoras didziulis

2008-Oct-28 19:32 UTC

[R] slightly OT: (un)supervised clustering?

Hi,

my question is not exactly about R... What I am looking for are hints 
and directions on suitable methods (available in R or elsewhere)  to 
solve a grouping (or pattern recognition) problem of environmental 
features in an environmental gradient as described below.

Given environmental sampling data set  (Depth, Presence of sand, 
Presence of boulders, Presence of clay).
1 1 1 0
1 1 0 0
1 1 1 0
2 1 1 0
3 1 1 0
3 1 1 0
4 1 1 0
5 1 0 0
5 1 0 0
5 1 1 0
5 1 0 0
6 1 0 0
6 1 0 0
6 1 1 0
7 1 0 1
7 1 0 0
8 1 0 1
9 1 1 1
9 1 0 1
9 1 0 1

Once I have sampling data ordered by depth, using my own "expert" 
opinion I can distinguish 3 groups A, B, C: A (1 - 4 m depth range) - 
where both sand and boulders are present, B (5 - 6 m range) - where sand 
is dominant with just a few observations of boulders, C (7 - 9 m range) 
- substrate dominated by sand and clay.

Now the question - is there any formal method that can do the same e.g. 
separate the groups A, B and C by analyzing how does feature occurrence 
patterns change in samples along an environmental gradient (depth in 
this case)? Sample dataset here is simplified, in fact I have to deal 
with a dozen of features like salinity, exposure and related species 
lists. I "see" these groups as an expert, but it would be nice having a
helper algorithm to see the groups for me, so I could describe it in 
Methods section of my writings :-)

Similarity matrix and Cluster analysis or MDS do not perform as 
expected, because it groups stations from group A together with stations 
of other groups that have most similar substrate observations e.g. it 
ignores environmental gradient.
Discriminant analysis expects me to do the grouping and then it will 
"decide" the rest. Therefore not suitable.
A bunch of significance tests can help in deciding whether the 
differences are statistically significant. But again, I have to present 
my own groups, therefore - not suitable.
Other unsupervised learning algorithms (Neural Networks & Co) - well, 
how can I instruct them to do analysis along an environmental gradient 
of depth ?..

If anyone among the experts on this list has dealt with similar problems 
before I would highly appreciate if you could briefly describe your 
approaches or point to the right sources.

And in general I am interested in approaches of locating discontinuities 
in data patterns sampled along environmental gradients.

Best wishes!
Viktoras Didziulis
P.S. just subscribed to this list, sorry if I'm missing something

Dylan Beaudette

2008-Oct-28 20:21 UTC

[R] slightly OT: (un)supervised clustering?

On Tuesday 28 October 2008, viktoras didziulis wrote:
> Hi,
>
> my question is not exactly about R... What I am looking for are hints
> and directions on suitable methods (available in R or elsewhere)  to
> solve a grouping (or pattern recognition) problem of environmental
> features in an environmental gradient as described below.
>
> Given environmental sampling data set  (Depth, Presence of sand,
> Presence of boulders, Presence of clay).
> 1 1 1 0
> 1 1 0 0
> 1 1 1 0
> 2 1 1 0
> 3 1 1 0
> 3 1 1 0
> 4 1 1 0
> 5 1 0 0
> 5 1 0 0
> 5 1 1 0
> 5 1 0 0
> 6 1 0 0
> 6 1 0 0
> 6 1 1 0
> 7 1 0 1
> 7 1 0 0
> 8 1 0 1
> 9 1 1 1
> 9 1 0 1
> 9 1 0 1
Are these bore-hole logs? If so check the literature in geophysics / earth 
science / soil science.
> Once I have sampling data ordered by depth, using my own "expert"
> opinion I can distinguish 3 groups A, B, C: A (1 - 4 m depth range) -
> where both sand and boulders are present, B (5 - 6 m range) - where sand
> is dominant with just a few observations of boulders, C (7 - 9 m range)
> - substrate dominated by sand and clay.
hmm. I get something like that with a simple call to pam():

# need this
library(cluster)

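# had to make your data into something useable first...
# (one way to rebuild the posted rows; column names as in the printout below)
x <- data.frame(
  X1 = c(1,1,1,2,3,3,4,5,5,5,5,6,6,6,7,7,8,9,9,9),  # depth
  X2 = rep(1, 20),                                  # sand
  X3 = c(1,0,1,1,1,1,1,0,0,1,0,0,0,1,0,0,0,1,0,0),  # boulders
  X4 = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1,1)   # clay
)

# partition into 4 groups
x.pam <- pam(x, k=4)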

# add the clustering vector back to your original data:
x$cluster <- x.pam$clustering

# looks like this
   X1 X2 X3 X4 cluster
1   1  1  1  0       1
2   1  1  0  0       1
3   1  1  1  0       1
4   2  1  1  0       1
5   3  1  1  0       1
6   3  1  1  0       1
7   4  1  1  0       2
8   5  1  0  0       2
9   5  1  0  0       2
10  5  1  1  0       2
11  5  1  0  0       2
12  6  1  0  0       3
13  6  1  0  0       3
14  6  1  1  0       3
15  7  1  0  1       3
16  7  1  0  0       3
17  8  1  0  1       4
18  9  1  1  1       4
19  9  1  0  1       4
20  9  1  0  1       4

Not sure if that is meaningful-- if you are interested in the methods from the 
cluster package, be sure to get the book that it is based on.
> Now the question - is there any formal method that can do the same e.g.
> separate the groups A, B and C by analyzing how does feature occurrence
> patterns change in samples along an environmental gradient (depth in
> this case)? Sample dataset here is simplified, in fact I have to deal
> with a dozen of features like salinity, exposure and related species
> lists. I "see" these groups as an expert, but it would be nice having a
> helper algorithm to see the groups for me, so I could describe it in
> Methods section of my writings :-)
This is a classic problem of variation in some property along some axis of 
anisotropy-- I tend to see this in my field as variation in soil properties 
with depth -aka- horizons.
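
One way to make "variation along an axis" concrete is to allow only groups that are contiguous in depth and then search for the cut depths that make the substrate most homogeneous within each group. The following is a minimal base-R sketch of that idea, not a method proposed in this thread: the column names, the choice of three groups, and the within-group sum of squares criterion are all assumptions made for illustration.

```r
# Posted data: depth plus presence/absence of sand, boulders, clay
# (column names are illustrative, not from the original post)
x <- data.frame(
  depth    = c(1,1,1,2,3,3,4,5,5,5,5,6,6,6,7,7,8,9,9,9),
  sand     = rep(1, 20),
  boulders = c(1,0,1,1,1,1,1,0,0,1,0,0,0,1,0,0,0,1,0,0),
  clay     = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1,1)
)

# Within-group sum of squares of the substrate columns for a set of rows
wss <- function(rows) {
  m <- as.matrix(x[rows, c("sand", "boulders", "clay")])
  sum(sweep(m, 2, colMeans(m))^2)
}

# Candidate cuts: a cut at depth d puts all samples with depth <= d above it,
# so samples from the same depth are never split across groups
depths <- sort(unique(x$depth))
cuts <- head(depths, -1)

# Exhaustive search over all pairs of cuts (three depth-contiguous groups)
best <- list(total = Inf)
for (i in seq_along(cuts)) {
  for (j in seq_along(cuts)) {
    if (j <= i) next
    g <- cut(x$depth, breaks = c(-Inf, cuts[i], cuts[j], Inf),
             labels = c("A", "B", "C"))
    total <- sum(tapply(seq_len(nrow(x)), g, wss))
    if (total < best$total) {
      best <- list(total = total, cuts = c(cuts[i], cuts[j]), group = g)
    }
  }
}

best$cuts          # depths after which the two cuts fall
table(best$group)  # number of samples per group
```

On the 20 posted rows this search should place the cuts after 4 m and 6 m, which corresponds to the A/B/C split in the question, though the margin over the next-best split (4 m and 7 m) is small. The exhaustive loop is only practical for a few groups; with more variables they would need to be put on comparable scales first.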

> Similarity matrix and Cluster analysis or MDS do not perform as
> expected, because it groups stations from group A together with stations
> of other groups that have most similar substrate observations e.g. it
> ignores environmental gradient.
What happens if you were to include some indicator of the gradient in the 
unsupervised classification? See the example above where I included the 
depth.
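
A small sketch of that comparison, assuming the posted data and illustrative column names (the constant sand column is left out because it carries no information and cannot be standardized): run pam() once on the substrate columns alone and once with depth included, then cross-tabulate each clustering against depth to see whether the groups follow the gradient. The choice of k = 3 is an assumption taken from the three groups described in the question.

```r
library(cluster)

# Posted data without the constant sand column; names are illustrative
x <- data.frame(
  depth    = c(1,1,1,2,3,3,4,5,5,5,5,6,6,6,7,7,8,9,9,9),
  boulders = c(1,0,1,1,1,1,1,0,0,1,0,0,0,1,0,0,0,1,0,0),
  clay     = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1,1)
)

# Substrate only: nothing ties the clusters to the depth gradient
no.depth <- pam(x[, c("boulders", "clay")], k = 3)

# Depth included, with all variables standardized so depth does not dominate
with.depth <- pam(x, k = 3, stand = TRUE)

# Rows are depths, columns are clusters
tab.no.depth   <- table(depth = x$depth, cluster = no.depth$clustering)
tab.with.depth <- table(depth = x$depth, cluster = with.depth$clustering)
print(tab.no.depth)
print(tab.with.depth)
```

If a cluster's counts are spread over shallow and deep rows of the table, that clustering is ignoring the gradient. Including depth pulls groups toward depth ranges but does not force them to be contiguous.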

> Discriminant analysis expects me to do the grouping and then it will
> "decide" the rest. Therefore not suitable.
> A bunch of significance tests can help in deciding whether the
> differences are statistically significant. But again, I have to present
> my own groups, therefore - not suitable.
> Other unsupervised learning algorithms (Neural Networks & Co) - well,
> how can I instruct them to do analysis along an environmental gradient
> of depth ?..
If you have an idea of the number of groups you are looking for, then the 
pam() and clara() functions in the cluster package may do what you need. 
These are especially nice as they can deal with continuous, ordinal, and 
binary variables. If you do not know how many groups there may be, see the 
diana() and daisy() functions. With all of these, setting the 'stand=TRUE'
argument will be important if your variables are on different scales.

# an example using data from above:
x.hc <- as.hclust(diana(daisy(x[,1:4], stand=TRUE)))
x.hc$labels <- x$cluster
plot(x.hc)
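
When the number of groups is not known in advance, another common heuristic (not mentioned above, so treat it as an assumption) is to run pam() for several values of k and compare the average silhouette width that pam() reports. A minimal sketch on the posted data, with illustrative column names and the constant sand column left out:

```r
library(cluster)

# Posted data without the constant sand column; names are illustrative
x <- data.frame(
  depth    = c(1,1,1,2,3,3,4,5,5,5,5,6,6,6,7,7,8,9,9,9),
  boulders = c(1,0,1,1,1,1,1,0,0,1,0,0,0,1,0,0,0,1,0,0),
  clay     = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,1,1)
)

# Standardized dissimilarities; daisy() may warn that the 0/1 columns
# are being treated as interval scaled
d <- daisy(x, stand = TRUE)

# Average silhouette width for each candidate number of groups
ks <- 2:6
avg.width <- sapply(ks, function(k) pam(d, k = k)$silinfo$avg.width)

# Candidate k with the largest average silhouette width
best.k <- ks[which.max(avg.width)]
print(data.frame(k = ks, avg.width = avg.width))
print(best.k)
```

A larger average silhouette width suggests better separated groups, but on a data set this small the differences between neighbouring values of k can be slight, so the result is a hint rather than a decision.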
> If anyone among the experts on this list has dealt with similar problems
> before I would highly appreciate if you could briefly describe your
> approaches or point to the right sources.
>
> And in general I am interested in approaches of locating discontinuities
> in data patterns sampled along environmental gradients.
The soil science literature may have some relevant insight into this matter.

Good luck,

Dylan

> Best wishes!
> Viktoras Didziulis
> P.S. just subscribed to this list, sorry if I'm missing something
>
-- 
Dylan Beaudette
Soil Resource Laboratory
http://casoilresource.lawr.ucdavis.edu/
University of California at Davis
530.754.7341

viktoras didziulis

2008-Oct-29 07:27 UTC

[R] slightly OT: (un)supervised clustering?

Thank you Dylan for the hints - I found them very useful, a good 
starting point for me to learn about clustering in R.

Best wishes
Viktoras
