Dear R-List,

I would like to recode categorical variables into binary data, so that all values above the median are coded 1 and all values below it are coded 0, separating each variable into two equally large groups (e.g. good performers = 0 vs. bad performers = 1).

I have not succeeded so far in finding a nice solution to do that in R. I thought there might be a better way than ordering each column and recoding the first 50% into 0 and the second 50% into 1. If I use ifelse I have a problem with cases that share the same rank and all fall on the median.

e.g.

    df<-as.data.frame(cbind(snr=c(1,2,3,4,5,6,7,8,9,10),
                            k1=c(1,1,4,2,3,2,2,5,2,2),
                            k2=c(1,2,3,2,1,2,1,3,3,2),
                            result=c(4,3,5,4,2,6,4,4,2,3)))

Now I want to recode k1 and k2 so that half of the values are recoded 0 and half are recoded 1, split around the median point. The median of k1 is 2, which would lead to unequal group sizes if 2 were used as the cutoff, so all values k1=2 should be recoded 1 or 0 randomly until both categories have the same length.

something like

    df.rec<-as.data.frame(cbind(snr=c(1,2,3,4,5,6,7,8,9,10),
                                k1=c(0,0,1,0,1,1,0,1,0,1),
                                k2=c(0,1,1,0,0,1,0,1,1,0),
                                result=c(4,3,5,4,2,6,4,4,2,3)))

Can anyone help?

Thank you in advance.

Best wishes.
Alain
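To make the problem with ifelse concrete, here is a minimal sketch using the data from the post (built with data.frame() for brevity; the name naive_k1 is only illustrative). Cutting k1 strictly at its median puts all the tied median values into one group:

```r
# Example data from the post above
df <- data.frame(snr = 1:10,
                 k1 = c(1, 1, 4, 2, 3, 2, 2, 5, 2, 2),
                 k2 = c(1, 2, 3, 2, 1, 2, 1, 3, 3, 2),
                 result = c(4, 3, 5, 4, 2, 6, 4, 4, 2, 3))

# Naive split: everything strictly above the median becomes 1,
# so every value equal to the median ends up in the 0 group
naive_k1 <- ifelse(df$k1 > median(df$k1), 1, 0)

median(df$k1)    # 2
table(naive_k1)  # 7 zeros and 3 ones, not 5 and 5
```

Using >= instead of > only moves the imbalance to the other side (2 zeros and 8 ones), which is why the ties need separate handling.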
Hello,

First of all, you don't need as.data.frame(cbind(...)). It's much better to simply do data.frame(...).

As for the conversion, the following function doesn't use randomness but gets the job done:

    df <- data.frame(snr=c(1,2,3,4,5,6,7,8,9,10),
                     k1=c(1,1,4,2,3,2,2,5,2,2),
                     k2=c(1,2,3,2,1,2,1,3,3,2),
                     result=c(4,3,5,4,2,6,4,4,2,3))

    fun <- function(x){
        n <- length(x)
        y <- rep(NA, n)
        # values strictly below/above the median are unambiguous
        y[x < median(x)] <- 0
        y[x > median(x)] <- 1
        # positions tied at the median
        w <- which(x == median(x))
        # give the first ties a 0 until the 0 group holds n/2 values
        y[w[seq_len(n/2 - length(which(x < median(x))))]] <- 0
        # the remaining ties are coded 1
        y[is.na(y)] <- 1
        y
    }

    fun(df$k1)
    fun(df$k2)

Hope this helps,

Rui Barradas

On 07-05-2013 17:20, D. Alain wrote:
> Dear R-List,
>
> I would like to recode categorical variables into binary data, so that all values above median are coded 1 and all values below 0, separating each var into two equally large groups (e.g. good performers = 0 vs. bad performers =1).
> [...]
> Best wishes.
> Alain
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
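The function in this reply breaks ties deterministically, while the original question asked for tied median values to be assigned at random. A minimal sketch of one way to do that, assuming no missing values: rank() with ties.method = "random" gives every value a distinct rank, and the upper half of the ranks is coded 1. The helper name median_split_random is hypothetical and not from the thread.

```r
# Hypothetical helper (not from the thread): median split with random tie-breaking
median_split_random <- function(x) {
  # rank() with ties.method = "random" assigns distinct ranks 1..n,
  # ordering tied values at random
  r <- rank(x, ties.method = "random")
  # the upper half of the ranks is coded 1, the lower half 0
  as.numeric(r > length(x) / 2)
}

set.seed(1)  # only to make the random tie-breaking reproducible
k1 <- c(1, 1, 4, 2, 3, 2, 2, 5, 2, 2)
k1_bin <- median_split_random(k1)
table(k1_bin)  # five 0s and five 1s for even-length input
```

Values strictly below the median always receive 0 and values strictly above it always receive 1; only the ties are shuffled. For odd-length input the two groups necessarily differ in size by one.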
On May 7, 2013, at 9:20 AM, D. Alain wrote:

> Dear R-List,
>
> I would like to recode categorical variables into binary data, so that all values above median are coded 1 and all values below 0, separating each var into two equally large groups (e.g. good performers = 0 vs. bad performers =1).
> [...]
> e.g.
> df<-as.data.frame(cbind(snr=c(1,2,3,4,5,6,7,8,9,10),k1=c(1,1,4,2,3,2,2,5,2,2),k2=c(1,2,3,2,1,2,1,3,3,2),result=c(4,3,5,4,2,6,4,4,2,3)))

First off, stop using cbind() when it is not needed. You will not see the reason when the columns are all numeric, but you will start experiencing pain and puzzlement when the arguments are of mixed classes. The data.frame function will do what you want. (Where do people pick up this practice anyway?)

    # rank() with ties.method = "first" gives every value a distinct position in sorted order;
    # positions in the upper half are coded 1
    df[,2] <- as.numeric( rank(df[,2], ties.method = "first") > length(df[,2])/2 )

> now I want to recode k1 and k2 so that I have half of the values recoded 0 and half recoded 1, split around the median point. The median of k1 is 2 which would lead to unequal groupsize if used 2 as cutoff, so all values k1=2 should be recoded 1 or 0 randomly until both categories have the same length.
> [...]

David Winsemius
Alameda, CA, USA
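The warning about cbind() with mixed classes can be illustrated with a small sketch. The columns id and group below are hypothetical and not from the thread; the point is only that cbind() builds a matrix first, and a matrix holds a single type.

```r
# Hypothetical mixed-class example (not from the thread)
id    <- c(1, 2, 3)
group <- c("a", "b", "c")

# cbind() builds a matrix first, and a matrix has a single type,
# so the numeric values are coerced to character in that matrix
via_cbind <- as.data.frame(cbind(id = id, group = group))
is.numeric(via_cbind$id)   # FALSE

# data.frame() keeps each column's own class
direct <- data.frame(id = id, group = group)
is.numeric(direct$id)      # TRUE
```

With all-numeric columns, as in the original post, both routes give numeric columns, which is why the problem stays hidden until a character or factor column is added.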