thr3ads.net - R help - [R] positive log likelihood and BIC values from mCLUST analysis [Jan 2010]


Barry Hall

2010-Jan-06 20:59 UTC

[R] positive log likelihood and BIC values from mCLUST analysis

My question is with respect to mCLUST and the values of BIC and log
likelihood. The relevant part of my R script is:


######################### BEGIN MDS ANALYSIS #########################
# load data
data <- read.table("Ecoli33_Barry.dis", header = TRUE, row.names = 1)

# perform MDS scaling
mds <- metaMDS(data, k = Dimensions, trymax = 20, autotransform = TRUE,
               noshare = 0.1, wascores = TRUE, expand = TRUE, trace = FALSE,
               plot = FALSE, old.wa = FALSE)

######################### BEGIN EM ANALYSIS #########################

# Use the points determined by MDS to perform EM clustering.
# Allow only the unconstrained models.  Sometimes, constrained models mess
# things up!
EMclusters <- mclustBIC(mds$points, G = Clusterrange,
                        modelNames = c("VII", "VVI", "VVV"), prior = NULL,
                        control = emControl(),
                        initialization = list(hcPairs = NULL, subset = NULL, noise = NULL),
                        Vinv = NULL, warn = FALSE, x = NULL)

The input data are in the form of an N X N matrix of pairwise genetic
distances between strains. Those distances can either be the total
number of differences over X characters, or can be normalized to the
fraction of characters that differ by dividing the number of differences
by X.


When the data are the total number of differences (over 5866 characters),
the optimal model is VVV, for which the BIC is -944.1225 and the log
likelihood is -452.8305. Two clusters are found.

When the data are normalized to the fraction of characters that differ,
the optimal model is VII, for which the BIC is 202.3095 and the log
likelihood is 127.3786. Four clusters are found.
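
The two reported pairs of values appear consistent with the convention mclust uses, BIC = 2 * loglik - npar * log(n), where larger values are better. The sketch below checks that arithmetic in base R. It assumes n = 33 observations (guessed from the file name Ecoli33_Barry.dis) and a two-dimensional MDS solution (Dimensions = 2); neither value is stated in the post. The helpers n_params and bic are hypothetical names introduced only for this illustration.

```r
# Assumptions (not stated in the post): n = 33 strains, d = 2 MDS dimensions.
n <- 33
d <- 2

# Hypothetical helper: number of free parameters in a G-component Gaussian
# mixture for the three covariance structures used in the script.
n_params <- function(model, G, d) {
  cov_par <- switch(model,
    VII = 1,                  # one variance per component (spherical)
    VVI = d,                  # d variances per component (diagonal)
    VVV = d * (d + 1) / 2)    # full covariance matrix per component
  (G - 1) + G * d + G * cov_par   # mixing proportions + means + covariances
}

# BIC in the "larger is better" form: 2 * loglik - npar * log(n)
bic <- function(loglik, model, G) 2 * loglik - n_params(model, G, d) * log(n)

bic_counts    <- bic(-452.8305, "VVV", G = 2)   # raw difference counts
bic_fractions <- bic( 127.3786, "VII", G = 4)   # normalized fractions
print(c(bic_counts, bic_fractions))
```

Under these assumptions the computed values agree with the posted BIC values to within rounding, so a positive BIC here follows directly from a positive log likelihood.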

There are several things that I do not understand:
(1) How can log likelihood be a positive number?
(2) Why should simply scaling the data change the BIC and log likelihood
    values?
(3) Perhaps most important, why should scaling the data change the
    optimum model and the number of clusters?

To explore the effects of scaling the data I further scaled it
by multiplying the normalized values by 10, by 1E4 and by 1E14.

The larger the values the more negative were the BIC and log likelihood
values, and the optimum model and number of clusters changed with each
change to the scale of the data (though in no obvious pattern).

From my perspective the normalized values would be preferable because
when there are missing data they could be normalized to the number of
characters for which there are data in both members of the pair.
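
The per-pair normalization described above could be sketched as follows. This is only an illustration in base R with a made-up character matrix; the object chars and the function pairwise_fraction are hypothetical names and do not come from the original script, which reads precomputed distances from a file.

```r
# Hypothetical character matrix: rows are strains, columns are characters,
# NA marks a missing character.
chars <- rbind(
  s1 = c("A", "C", "G", NA,  "T"),
  s2 = c("A", "T", "G", "G", NA),
  s3 = c("G", "C", NA,  "G", "T")
)

# Fraction of differing characters for each pair of strains, counting only
# the characters that are present in both members of the pair.
pairwise_fraction <- function(m) {
  n <- nrow(m)
  d <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  for (i in seq_len(n)) for (j in seq_len(n)) {
    both <- !is.na(m[i, ]) & !is.na(m[j, ])   # scored in both strains
    d[i, j] <- if (any(both)) sum(m[i, both] != m[j, both]) / sum(both) else NA
  }
  d
}

frac_dist <- pairwise_fraction(chars)
print(frac_dist)
```

Each entry is divided by its own count of shared characters, so pairs with more missing data are not made to look artificially similar.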


Any help with this would be greatly appreciated.

Barry Hall


Peter Dalgaard

2010-Jan-06 22:29 UTC


Barry Hall wrote:
> My question is with respect to mCLUST and the values of BIC and log
> likelihood. The relevant part of my R script is:
> 
> 
> ######################### BEGIN MDS ANALYSIS #########################
> # load data
> data <- read.table("Ecoli33_Barry.dis", header = TRUE, row.names = 1)
>
> # perform MDS scaling
> mds <- metaMDS(data, k = Dimensions, trymax = 20, autotransform = TRUE,
>                noshare = 0.1, wascores = TRUE, expand = TRUE, trace = FALSE,
>                plot = FALSE, old.wa = FALSE)
>
> ######################### BEGIN EM ANALYSIS #########################
>
> # Use the points determined by MDS to perform EM clustering.
> # Allow only the unconstrained models.  Sometimes, constrained models mess
> # things up!
> EMclusters <- mclustBIC(mds$points, G = Clusterrange,
>                         modelNames = c("VII", "VVI", "VVV"), prior = NULL,
>                         control = emControl(),
>                         initialization = list(hcPairs = NULL, subset = NULL, noise = NULL),
>                         Vinv = NULL, warn = FALSE, x = NULL)
> 
> The input data are in the form of an N X N matrix of pairwise genetic
> distances between strains. Those distances can either be the total
> number of differences over X characters, or can be normalized to the
> fraction of characters that differ by dividing the number of differences
> by X.
> 
> 
> When the data are the total number of differences (over 5866 characters),
> the optimal model is VVV, for which the BIC is -944.1225 and the log
> likelihood is -452.8305. Two clusters are found.
>
> When the data are normalized to the fraction of characters that differ,
> the optimal model is VII, for which the BIC is 202.3095 and the log
> likelihood is 127.3786. Four clusters are found.
> 
> There are several things that I do not understand:
> (1)  How can log likelihood be a positive number?
Because likelihoods are densities.
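
A small base-R illustration of that point, using made-up numbers rather than the poster's data: a density is not a probability, so it can exceed 1 when the data are tightly concentrated, and the log of such a density is then positive. The values in x and the chosen mean and standard deviation are assumptions for the example only.

```r
# Hypothetical data: five values tightly clustered around 0.01
x <- c(0.009, 0.010, 0.011, 0.0095, 0.0105)

# Height of the N(mean = 0.01, sd = 0.001) density at its mean; this is
# 1 / (sqrt(2 * pi) * 0.001), far above 1
peak <- dnorm(0.01, mean = 0.01, sd = 0.001)

# The log likelihood is the sum of the log densities at the data points
loglik <- sum(dnorm(x, mean = 0.01, sd = 0.001, log = TRUE))

print(peak)
print(loglik)
```

Both printed values are positive here because the standard deviation is small; with a standard deviation well above 1 the same calculation would give a negative log likelihood.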
> (2) Why should simply scaling the data change the BIC and log likelihood
> values?
Because likelihoods are densities. And/or because it is not finding the 
same optimum.
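
To make the scaling effect concrete, here is a minimal base-R sketch with simulated one-dimensional data (not the poster's data, and a single Gaussian rather than an mclust mixture). Multiplying the data by a constant c and refitting shifts the maximized log likelihood by -n * log(c) per dimension, which is enough to flip its sign. The function max_loglik, the simulated x, and the value of scale_factor are illustrative assumptions.

```r
set.seed(1)
x <- rnorm(33, mean = 0.02, sd = 0.005)   # hypothetical small-valued data

# Maximized Gaussian log likelihood, using the ML estimates of mean and sd
max_loglik <- function(v) {
  s <- sqrt(mean((v - mean(v))^2))
  sum(dnorm(v, mean = mean(v), sd = s, log = TRUE))
}

scale_factor <- 5866                      # e.g. fractions turned back into counts
ll_small <- max_loglik(x)                 # fitted on the original scale
ll_big   <- max_loglik(x * scale_factor)  # fitted after rescaling
shift    <- ll_big - ll_small

print(c(ll_small, ll_big))
print(c(shift, -length(x) * log(scale_factor)))   # the two should agree
```

In principle the same shift applies to every model fitted to the same rescaled points, so the shift alone would not be expected to reorder the candidate models.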
> (3) Perhaps most important, why should scaling the data change the
> optimum model and the number of clusters?
Hmm, well... I don't really know. I wouldn't expect it if you are 
scaling equally in all directions. Perhaps in theory, it shouldn't 
change, but clustering models are notoriously unstable and sensitive to 
starting values.  So maybe you are just seeing the effect of slightly 
changed convergence paths?
> To explore the effects of scaling the data I further scaled it
> by multiplying the normalized values by 10, by 1E4 and by 1E14.
>
> The larger the values the more negative were the BIC and log likelihood
> values, and the optimum model and number of clusters changed with each
> change to the scale of the data (though in no obvious pattern).
>
> From my perspective the normalized values would be preferable because
> when there are missing data they could be normalized to the number of
> characters for which there are data in both members of the pair.
> 
> 
> Any help with this would be greatly appreciated.
> 
> Barry Hall
> 

-- 
    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
