The line in question decides at random which of two correlated
columns to drop. If C1 and C2 are correlated you could drop either one;
the code draws a uniform random number for each correlated pair and
drops the first column of the pair if the draw exceeds 0.5, the second
otherwise. Choosing at random is a reasonable way to avoid favouring
either column.
This does mean that repeated runs of the code can give you different
results, but every result has the property you want: no two columns
of the resulting data.frame have a pairwise correlation above the
threshold. Starting from the same data.frame, several resulting
data.frames are possible that satisfy this property, so they are all
equally valid by that criterion.
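To make the coin flip concrete, here is a minimal sketch of that one line in isolation. The small bad.idx matrix is made up for illustration, and calling set.seed() is my suggestion for getting repeatable results, not something the original function does.

```r
# Hypothetical pairs of correlated columns, one pair per row (illustration only)
bad.idx <- matrix(c(2, 1,
                    4, 3), ncol = 2, byrow = TRUE)

set.seed(42)                        # assumption: fix the seed so the draws repeat
coin <- runif(nrow(bad.idx))        # one uniform(0, 1) draw per pair
# draw > 0.5: drop the column in position 1 of the pair, otherwise position 2
drop.idx <- ifelse(coin > 0.5, bad.idx[, 1], bad.idx[, 2])

set.seed(42)                        # same seed, so the same draws
coin2 <- runif(nrow(bad.idx))
drop.idx2 <- ifelse(coin2 > 0.5, bad.idx[, 1], bad.idx[, 2])

identical(drop.idx, drop.idx2)      # TRUE
```

Without the set.seed() calls the two results may differ from run to run, which is the behaviour you observed.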
Simple code for your more recent query is:

    cor.mat <- cor(df)
    ind <- which(abs(cor.mat) < cutoff, arr.ind = TRUE)

where you would replace df and cutoff with your own data.frame and
threshold value. Each row of ind gives the (row, column) indices of a
pair of columns with abs(cor) < cutoff. Note that every pair appears
twice, for example as both (1,2) and (2,1), because the correlation
matrix is symmetric.
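If the duplicated pairs are a nuisance, one option is to look only at the upper triangle of the correlation matrix. This is a sketch under assumptions: the data frame df, its column names, and the 0.8 cutoff are invented for the example.

```r
# Made-up example data: columns a and b are built to be strongly correlated
set.seed(1)
x  <- rnorm(50)
df <- data.frame(a = x, b = x + rnorm(50, sd = 0.1), c = rnorm(50))
cutoff <- 0.8                       # hypothetical threshold

cor.mat <- cor(df)
# upper.tri() is TRUE only above the diagonal, so each pair is kept once
keep <- upper.tri(cor.mat) & abs(cor.mat) < cutoff
ind  <- which(keep, arr.ind = TRUE)

# Report the pairs by column name together with their correlation
pairs <- data.frame(col1 = colnames(cor.mat)[ind[, "row"]],
                    col2 = colnames(cor.mat)[ind[, "col"]],
                    cor  = cor.mat[ind])
print(pairs)
```

Each row of pairs is then one pair of columns whose absolute correlation is below the cutoff, listed once.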
Abhijit
Nataraj wrote:
> Dear all,
> Sorry to post my query to the list once again; I did
> not get a reply to my previous mail to this
> list.
> To put it simply: please give me code to
> find the columns of a data frame whose correlation
> coefficient is below a predetermined threshold. (For
> the detailed query please see my previous message to this list,
> pasted below.)
>
> Thanks and regards,
> B.Nataraj
>
> Following is my previous message to this list, to which I did
> not get any reply.
>
> Dear all,
> For removing correlated columns in a data frame, df,
> I found code written in R on the page
> http://cheminfo.informatics.indiana.edu/~rguha/code/R/ of
> Mr. Rajarshi Guha.
> The code is
> #################
> r2test <- function(d, cutoff=0.8) {
> if (cutoff > 1 || cutoff <= 0) {
> stop("cutoff must satisfy 0 < cutoff <= 1")
> }
> if (!is.matrix(d) && !is.data.frame(d)) {
> stop("Must supply a data.frame or matrix")
> }
> r2cut = sqrt(cutoff);
> cormat <- cor(d);
> bad.idx <- which(abs(cormat)>r2cut,arr.ind=T);
> bad.idx <- matrix( bad.idx[bad.idx[,1] > bad.idx[,2]],
> ncol=2);
> drop.idx <- ifelse(runif(nrow(bad.idx)) > .5,
> bad.idx[,1], bad.idx [,2]);
> if (length(drop.idx) == 0) {
> 1:ncol(d)
> } else {
> (1:ncol(d))[-unique(drop.idx)]
> }
> }
> ############################################
> Now the problem is that the code returns different output (i.e.
> different column numbers) on different calls. I could not
> understand from the code why this happens; I can
> follow the logic of the code except for the line
> ********************************************
> drop.idx <- ifelse(runif(nrow(bad.idx)) > .5, bad.idx[,1],
> bad.idx [,2]);
> ****************************************
> What does it mean to compare runif(nrow(bad.idx)) with 0.5?
> So I am looking for help in understanding why the output differs
> between function calls, as well as
> the meaning of the line mentioned above.
>
> Thanks!
> B.Nataraj
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>