Hello,
Thank you for the clarification about the NAs. For your interest, thankfully my
end goal was not to plot a dendrogram with 23371 elements, but just to use the
output of the clustering to re-order the rows of a matrix before plotting it
with image(). Since clara() and pam() are partitioning-based approaches, I
suppose I could instead stay with hclust() after removing the offending rows, so
that I have the ordering position of each gene, not its cluster membership. I
have 12 GB RAM on my 64-bit system, so the time it takes to run should be my
only problem.
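A minimal sketch of that plan, assuming that "offending rows" means any row containing an NA (the conservative choice), and using a small simulated matrix in place of iMatrix. The names x, keep, x_ok and x_ord are made up for illustration:

```r
# Small simulated stand-in for iMatrix (hypothetical data, not the real matrix)
set.seed(1)
x <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)
x[sample(length(x), 50)] <- NA           # sprinkle in some NAs

# Drop the offending rows: here, any row containing at least one NA
keep <- complete.cases(x)
x_ok <- x[keep, , drop = FALSE]

# Hierarchical clustering of the remaining rows
d  <- dist(x_ok)                         # Euclidean distance by default
stopifnot(!anyNA(d))                     # no undefined dissimilarities left
hc <- hclust(d)                          # complete linkage by default

# hc$order is the dendrogram (left-to-right) ordering of the rows
x_ord <- x_ok[hc$order, , drop = FALSE]

# image() puts the rows of its argument along the x axis, so transpose
# to get one horizontal strip per (re-ordered) matrix row
image(t(x_ord), axes = FALSE)
```

For the real data, dist() on roughly 23000 rows still has to hold all pairwise distances in memory, so this only shows the mechanics of the re-ordering, not that it will be fast.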
- Dario.
---- Original message ----
>Date: Fri, 28 Jan 2011 12:34:26 +0100
>From: Martin Maechler <maechler at stat.math.ethz.ch>
>Subject: Re: [R] agnes clustering and NAs
>To: gavin.simpson at ucl.ac.uk
>Cc: D.Strbenac at garvan.org.au, r-help at r-project.org, Uwe Ligges <ligges at statistik.tu-dortmund.de>
>
>>>>>> Gavin Simpson <gavin.simpson at ucl.ac.uk>
>>>>>> on Fri, 28 Jan 2011 09:23:05 +0000 writes:
>
> > On Fri, 2011-01-28 at 10:00 +1100, Dario Strbenac wrote:
> >> Hello,
> >>
> >> Yes, that's right, it is a values matrix. Not a dissimilarity matrix.
> >>
> >> i.e.
> >>
> >> > str(iMatrix)
> >> num [1:23371, 1:56] -0.407 0.198 NA -0.133 NA ...
> >> - attr(*, "dimnames")=List of 2
> >> ..$ : NULL
> >> ..$ : chr [1:56] "-8100" "-7900" "-7700" "-7500" ...
>
>Ok, so in the end you want to draw a dendrogram for 23'371
>observational units, really ?
>
>I think I would not use a hierarchical clustering method for so
>many units, but rather clara() or maybe pam() or then model
>based or other methods, rather than fully hierarchical ones....
>...
>but yes, that's not the issue here, and see further down ...
>
>BTW: The object 'iMatrix' you provided for download has only 50
> columns, not 56...
> >>
> >> For the snippet of checking for NAs, I get all TRUEs, so I have at least one NA in each column.
>
> GS> Sorry, my bad. Try this:
>
> GS> apply(iMatrix, 1, function(x) all(is.na(x)))
>
> GS> will check that you have no fully `NA` rows.
>
> GS> Also look at str(iMatrix) for potential problems.
>
> GS> Finally, try:
>
> GS> out <- dist(iMatrix)
> GS> any(is.na(out))
>
> GS> should repeat what agnes is doing to compute the
> GS> dissimilarity matrix. If that returns TRUE, go and find
> GS> which samples are giving NA dissimilarity and why.
>
> GS> The issue is not NA in the input data, but that your
> GS> input data is leading to NA in the computed
> GS> dissimilarities. This might be due to NA's in your input
> GS> data, where a pair of samples has no common set of data
> GS> for example.
>
>Yes, that's right on spot, thank you Gavin.
>
>This is indeed true:
>It *does* allow for NA's (in the data matrix), but if the
>pattern of NA's is such that the dissimilarity between two
>observations becomes undefined, namely e.g. if they have no
>common non-missings, then ``that's too much''.
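A toy illustration of that failure mode, assuming dist() behaves like the dissimilarity computation inside agnes() (as suggested earlier in the thread). The matrix toy is invented for the example:

```r
# Rows 1 and 2 have no column in which both are observed,
# so their dissimilarity is undefined; row 3 is complete.
toy <- rbind(c(1, NA, 3, NA),
             c(NA, 2, NA, 4),
             c(1,  2,  3,  4))

d <- as.matrix(dist(toy))

# Which pairs of rows give an NA dissimilarity?
bad <- which(is.na(d), arr.ind = TRUE)
bad

# Rows involved in at least one undefined dissimilarity
bad_rows <- sort(unique(as.vector(bad)))
bad_rows
```

Here only the pair (1, 2) is undefined; each of them still has a valid distance to row 3, which is why removing rows that are entirely NA is not enough.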
>
>In general, I'd recommend to use
> dm <- daisy(....,...)
>trying methods, that are better with NAs, e.g. Gower's metric,
>until dm has {nearly} no NAs,
>and then figure out some imputation to replace all NA's in dm
>by "reasonable values",
>then do clustering with the resulting dissimilarity "matrix" dm.
>
>HOWEVER, in your case, dm would correspond to a
> 23371 x 23371 dissimilarity matrix;
>stored as a double precision matrix (on a 64-bit platform),
>that's an object of size 4.4 GBytes, not very convenient to work
>with.
>As a dissimilarity object it will only be about half of that size,
>but that's still ``a bit large''..
>As I said above, for such data, I would never do fully
>hierarchical clustering,
>but rather something else.
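The sizes quoted above can be checked with a little arithmetic, assuming 8 bytes per double and ignoring object overhead. The variable names are made up for the example:

```r
n <- 23371

# Full n x n matrix of doubles
full_bytes <- n^2 * 8

# Lower triangle only, as stored in a "dist" / "dissimilarity" object
dist_bytes <- n * (n - 1) / 2 * 8

round(full_bytes / 1e9, 2)   # 4.37 (GB)
round(dist_bytes / 1e9, 2)   # 2.18 (GB)
```

So the triangle alone fits in 12 GB of RAM, but any step that expands it to a full matrix, or copies it, quickly eats into that.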
>
>Martin Maechler, ETH Zurich
>
>
> GS> HTH
> GS> G
>
> >> The part of the agnes documentation I was referring to is :
> >>
> >> "In case of a matrix or data frame, each row corresponds to an observation, and each column corresponds to a variable. All variables must be numeric. Missing values (NAs) are allowed."
> >>
> >> So, I'm under the impression it handles NAs on its own ?
> >>
> >> - Dario.
> >>
> >> ---- Original message ----
> >> >Date: Thu, 27 Jan 2011 12:53:27 +0000
> >> >From: Gavin Simpson <gavin.simpson at ucl.ac.uk>
> >> >Subject: Re: [R] agnes clustering and NAs
> >> >To: Uwe Ligges <ligges at statistik.tu-dortmund.de>
> >> >Cc: D.Strbenac at garvan.org.au, r-help at r-project.org
> >> >
> >> >On Thu, 2011-01-27 at 10:45 +0100, Uwe Ligges wrote:
> >> >>
> >> >> On 27.01.2011 05:00, Dario Strbenac wrote:
> >> >> > Hello,
> >> >> >
> >> >> > In the documentation for agnes in the package 'cluster', it says that NAs are allowed, and sure enough it works for a small example like :
> >> >> >
> >> >> >> m<- matrix(c(
> >> >> > 1, 1, 1, 2,
> >> >> > 1, NA, 1, 1,
> >> >> > 1, 2, 2, 2), nrow = 3, byrow = TRUE)
> >> >> >> agnes(m)
> >> >> > Call: agnes(x = m)
> >> >> > Agglomerative coefficient: 0.1614168
> >> >> > Order of objects:
> >> >> > [1] 1 2 3
> >> >> > Height (summary):
> >> >> > Min. 1st Qu. Median Mean 3rd Qu. Max.
> >> >> > 1.155 1.247 1.339 1.339 1.431 1.524
> >> >> >
> >> >> > Available components:
> >> >> > [1] "order" "height" "ac" "merge" "diss" "call" "method" "data"
> >> >> >
> >> >> > But I have a large matrix (23371 rows, 50 columns) with some NAs in it and it runs for about a minute, then gives an error :
> >> >> >
> >> >> >> agnes(iMatrix)
> >> >> > Error in agnes(iMatrix) :
> >> >> > No clustering performed, NA-values in the dissimilarity matrix.
> >> >> >
> >> >> > I've also tried getting rid of rows with all NAs in them, and it still gave me the same error. Is this a bug in agnes() ? It doesn't seem to fulfil the claim made by its documentation.
> >> >>
> >> >>
> >> >> I haven't looked in the file, but you need to get rid of all NA, or in
> >> >> other words, all rows that contain *any* NA values.
> >> >
> >> >If one believes the documentation, then that only applies to the case
> >> >where `x` is a dissimilarity matrix. `NA`s are allowed if x is the raw
> >> >data matrix or data frame.
> >> >
> >> >The only way the OP could have gotten that error with the call shown is
> >> >if iMatrix were not a dissimilarity matrix inheriting from class "dist",
> >> >so `NA`s should be allowed.
> >> >
> >> >My guess would be that the OP didn't get rid of all the `NA`s.
> >> >
> >> >Dario: what does:
> >> >
> >> >sapply(iMatrix, function(x) any(is.na(x)))
> >> >
> >> >or if iMatrix is a matrix:
> >> >
> >> >apply(iMatrix, 2, function(x) any(is.na(x)))
> >> >
> >> >say?
> >> >
> >> >G
> >> >
> >> >> Uwe Ligges
> >> >>
> >> >>
> >> >>
> >> >> > The matrix I'm using can be obtained here :
> >> >> > http://129.94.136.7/file_dump/dario/iMatrix.obj
> >> >> >
> >> >> > --------------------------------------
> >> >> > Dario Strbenac
> >> >> > Research Assistant
> >> >> > Cancer Epigenetics
> >> >> > Garvan Institute of Medical Research
> >> >> > Darlinghurst NSW 2010
> >> >> > Australia
> >> >> >
>
> >> >--
> >> >%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
> >> > Dr. Gavin Simpson [t] +44 (0)20 7679 0522
> >> > ECRC, UCL Geography, [f] +44 (0)20 7679 0565
> >> > Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
> >> > Gower Street, London [w] http://www.ucl.ac.uk/~ucfagls/
> >> > UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
> >> >%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
--------------------------------------
Dario Strbenac
Research Assistant
Cancer Epigenetics
Garvan Institute of Medical Research
Darlinghurst NSW 2010
Australia