thr3ads.net - R help - [R] Correlated Columns in data frame [May 2008]

Nataraj

2008-May-17 05:40 UTC

[R] Correlated Columns in data frame

Dear all,
Sorry to post my query to the list again, but my previous
mail to this list did not get any attention.
To put it simply: could someone please give me code to
find the columns of a data frame whose correlation
coefficient is below a predetermined threshold? (For the
detailed query, please see my previous message to this list,
pasted below.)

Thanks and regards,
B.Nataraj

Following is my previous message to this list, to which I
did not get any reply.

Dear all,
To remove correlated columns from a data frame, df,
I found some R code on Mr. Rajarshi Guha's page,
http://cheminfo.informatics.indiana.edu/~rguha/code/R/
The code is
#################
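r2test <- function(d, cutoff = 0.8) {
  # cutoff is a threshold on r^2 and must lie in (0, 1]
  if (cutoff > 1 || cutoff <= 0) {
    stop("cutoff must satisfy 0 < cutoff <= 1")
  }
  if (!is.matrix(d) && !is.data.frame(d)) {
    stop("Must supply a data.frame or matrix")
  }
  r2cut <- sqrt(cutoff)   # equivalent threshold on |r|
  cormat <- cor(d)
  # (row, col) index pairs whose absolute correlation exceeds the threshold
  bad.idx <- which(abs(cormat) > r2cut, arr.ind = TRUE)
  # keep each pair once (below the diagonal); this also drops the diagonal
  bad.idx <- matrix(bad.idx[bad.idx[, 1] > bad.idx[, 2]], ncol = 2)
  # for each pair, pick at random which of the two columns to drop
  drop.idx <- ifelse(runif(nrow(bad.idx)) > .5, bad.idx[, 1], bad.idx[, 2])
  # return the indices of the columns that are kept
  if (length(drop.idx) == 0) {
    1:ncol(d)
  } else {
    (1:ncol(d))[-unique(drop.idx)]
  }
}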
############################################
Now the problem is that the code returns different output
(i.e. different column numbers) on different calls. I cannot
see from the code why this happens. I understand the logic
of the code except for the line
********************************************
drop.idx <- ifelse(runif(nrow(bad.idx)) > .5, bad.idx[,1], bad.idx[,2])
****************************************
What does comparing runif(nrow(bad.idx)) with 0.5 mean?
So I am looking for help in understanding why the output
differs between function calls, and what the line mentioned
above means.

Thanks!
B.Nataraj

Abhijit Dasgupta

2008-May-17 12:02 UTC

The line in question randomly decides which of the two correlated 
columns to drop. runif(nrow(bad.idx)) draws one uniform random number 
per correlated pair, and comparing it with 0.5 amounts to a coin flip. 
If C1 and C2 are correlated you could drop either one, and the code 
picks one at random, which is a principled way to do this. 
This does mean that repeated runs of this code will give you different 
results, but the final result is what you want in all cases: the columns 
of the resultant data.frame do not have pairwise correlation above the 
threshold. I guess the point here is that starting from the same 
data.frame there are several possible resultant data.frames which 
satisfy the property you desire, so they are equally valid from that 
criterion's perspective.
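
As a small illustration of the point above, here is a sketch of that one line in isolation. The matrix bad.idx below is made-up data shaped like the one inside r2test(), and pick_drops is a hypothetical helper name, not part of the original code. Calling set.seed() before the function is one way to get the same answer on every run, assuming that is what you want.

```r
# Hypothetical pairs of highly correlated columns (first index > second),
# shaped like bad.idx inside r2test().
bad.idx <- matrix(c(3, 1,
                    4, 2,
                    5, 2), ncol = 2, byrow = TRUE)

# pick_drops is an illustrative helper wrapping the line in question.
pick_drops <- function(pairs) {
  # One uniform draw per pair: > 0.5 drops the first member of the pair,
  # otherwise the second member is dropped.
  ifelse(runif(nrow(pairs)) > 0.5, pairs[, 1], pairs[, 2])
}

set.seed(42)                    # fixing the seed makes the coin flips repeatable
first <- pick_drops(bad.idx)
set.seed(42)
second <- pick_drops(bad.idx)

identical(first, second)        # same seed, same columns chosen
```

Without the set.seed() calls, each call draws fresh random numbers, which is why r2test() can return different column numbers from one call to the next.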

A simple piece of code for your more recent query is:
    cor.mat <- cor(df)
    ind <- which(abs(cor.mat) < cutoff, arr.ind = TRUE)
where you would replace df and cutoff with appropriate names 
and values. ind will give you the pairs of columns with 
abs(cor) < cutoff. Note that you will get, for example, both (1,2) and 
(2,1) due to the symmetry of the correlation matrix.
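
If the duplicated pairs are a nuisance, one possible variation is to look only above the diagonal with upper.tri(), so that each pair is listed once. This is only a sketch: the data frame dat, its columns, and the cutoff value are invented for illustration.

```r
# Illustrative data: 'dat' and 'cutoff' are made-up names and values.
set.seed(1)
dat <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
dat$d <- dat$a + rnorm(50, sd = 0.1)   # d is built to track a closely

cutoff <- 0.5
cor.mat <- cor(dat)

# Keep only entries above the diagonal, so each pair is reported once.
low <- abs(cor.mat) < cutoff & upper.tri(cor.mat)
ind <- which(low, arr.ind = TRUE)

# Translate the index pairs into column names for readability.
pairs <- data.frame(col1 = rownames(cor.mat)[ind[, "row"]],
                    col2 = colnames(cor.mat)[ind[, "col"]],
                    r    = cor.mat[ind])
pairs
```

Each row of pairs names two columns whose absolute correlation is below cutoff, together with the correlation itself.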

Abhijit

