Ruofei Mo【莫若飞】
2016-Mar-02 21:06 UTC
[R] Ruofei Mo - How can I generate correlated data with non-normal distribution?
Hi, All, I have a question about how to generate correlated data with non-normal distribution? Basic, I have a variable a that follows a normal distribution, a ~ N(0,1), then I want to generate another variable b that follows a uniform distribution, b ~ U(0, 1). Most importantly, I want the correlation between a and b to be fixed at -.9, cor(a,b) = -.90 I tried the following code, ### Correlation matrix rmvnorm() function ### Cormat <- matrix(c(1, -.9, -.9, 1), ncol = 2) # Here, I want to create 2 variables that have correlation -.9 ### Theta-Transform-Guessing ### DataSet <- data.frame(rmvnorm(1000, mean=c(0, 0), sigma=Cormat)) Names(DataSet) <- c("a", "trans") ### Using trans to be transformed into guessing parameters ### DataSet$b <- pnorm(DataSet$trans, mean=mean(DataSet$trans), sd=sd(DataSet$trans)) # Here, I used the pnorm() function to transform one variable to a U(0, 1) However, the correlation is changed. Can anyone give me some suggestion that how can I generate the data? Thanks, Ruofei [[alternative HTML version deleted]]
Ben Bolker
2016-Mar-02 23:46 UTC
[R] Ruofei Mo - How can I generate correlated data with non-normal distribution?
Ruofei Mo????? <911mruofei <at> tongji.edu.cn> writes:> > Hi, All, > > I have a question about how to generate correlated data with non-normal > distribution? Basic, I have a variable a that follows a normal distribution, > a ~ N(0,1), then I want to generate another variable b that follows a > uniform distribution, b ~ U(0, 1). Most importantly, I want the correlation > between a and b to be fixed at -.9, cor(a,b) = -.90 > > I tried the following code, > > ### Correlation matrix rmvnorm() function ### >I don't know that there's a closed-form solution to this problem. Here's an attempt to do it by brute force. By eyeball, you need to set the nominal rho to about -0.92 to get a realized rho of -0.9. simfun <- function(rho,n=10000) { cormat <- matrix(c(1, rho, rho, 1), ncol = 2) dd <- setNames(data.frame(MASS::mvrnorm(1000, mu=c(0,0), Sigma=cormat)), c("a","trans")) dd <- transform(dd, b=pnorm(trans,mean(trans),sd(trans))) dd[,c("a","b")] } cvec <- seq(-0.999,-0.85,length=51) res <- expand.grid(rho=cvec,rep=1:10) set.seed(101) res$cor <- sapply(res$rho, function(r) cor(simfun(rho=r,n1e6))[1,2]) par(las=1,bty="l") plot(cor~rho,data=res) abline(a=0,b=1,col=2) abline(h=-0.9,col=4) abline(v=-0.92,col=4)
Dalthorp, Daniel
2016-Mar-03 16:31 UTC
[R] Ruofei Mo - How can I generate correlated data with non-normal distribution?
Ruofei, Ben's suggestion is simple and gets you close: require(MASS) nsim <- 1000000 rho <- -.9 Z <- mvrnorm(nsim, mu=c(0,0),Sigma = cbind(c(1,rho),c(rho, 1))) U <- pnorm(Z); a <- Z[,1] b <- qunif(U[,2]) cor(a,b) Pearson correlation characterizes the linear relationship between normal r.v.'s, but there's always a question about what Pearson correlation means for non-normal r.v.'s....the "close" approach that Ben gave gives a pair of r.v.'s with a non-linear relationship (because of the pnorm non-linear transformation), which can be seen if you plot the means of a in bins across the range of b. plot(0,0,type='n',xlim=range(a), ylim=c(-.2,1.1)) for (i in seq(0,.99,by=0.01)){ ind<-which(b>min(b)+i*diff(range(b)) & b<=min(b)+(i+.01)*diff(range(b))) points(mean(a[ind]),i-.005,pch=3) # means of a in 100 bins that span the range of b } # so Spearman rank correlation is often used for "correlation" between non-normal r.v.'s. Exact Spearman correlations can be attained using the same approach, but adjusting the target correlation. # to get Spearman correlation of rho = -0.90, use 2*sin(rho*pi/6) in place of rho in the mvrnorm simulation: nsim <- 1000000 rho <- -.9 Z <- mvrnorm(nsim, mu=c(0,0),Sigma cbind(c(1,2*sin(rho*pi/6)),c(2*sin(rho*pi/6), 1))) a<-Z[,1]; b<-qunif(pnorm(Z[,2])) cor(rank(a),rank(b)) # this gives r.v.'s with exact Spearman correlations as desired, while aiming at Pearson correlation tends to be off by some amount TargetCorrelation<-seq(-.99, -.1, by=.01) SimulatedCorrelationP<-numeric(length(TargetCorrelation)) SimulatedCorrelationS<-numeric(length(TargetCorrelation)) nsim<-100000 for (i in 1:length(TargetCorrelation)){ rho<-TargetCorrelation[i] Z <- mvrnorm(nsim, mu=c(0,0),Sigma = cbind(c(1,rho),c(rho, 1))) U <- pnorm(Z); a <- Z[,1] b <- qunif(U[,2]) SimulatedCorrelationP[i]<-cor(a,b) Z <- mvrnorm(nsim, mu=c(0,0),Sigma cbind(c(1,2*sin(rho*pi/6)),c(2*sin(rho*pi/6), 1))) U <- pnorm(Z); a <- Z[,1] b <- qunif(U[,2]) SimulatedCorrelationS[i]<-cor(rank(a),rank(b)) } plot(TargetCorrelation,SimulatedCorrelationP) points(TargetCorrelation,SimulatedCorrelationS,pch=20) lines(c(-1,0),c(-1,0),col=2) plot(TargetCorrelation,SimulatedCorrelationP) points(TargetCorrelation,SimulatedCorrelationS,pch=20) lines(c(-1,0),c(-1,0),col=2) -Dan On Wed, Mar 2, 2016 at 3:46 PM, Ben Bolker <bbolker at gmail.com> wrote:> Ruofei Mo????? <911mruofei <at> tongji.edu.cn> writes: > > > > > Hi, All, > > > > I have a question about how to generate correlated data with non-normal > > distribution? Basic, I have a variable a that follows a normal > distribution, > > a ~ N(0,1), then I want to generate another variable b that follows a > > uniform distribution, b ~ U(0, 1). Most importantly, I want the > correlation > > between a and b to be fixed at -.9, cor(a,b) = -.90 > > > > I tried the following code, > > > > ### Correlation matrix rmvnorm() function ### > > > > I don't know that there's a closed-form solution to this problem. > Here's an attempt to do it by brute force. By eyeball, you need to > set the nominal rho to about -0.92 to get a realized rho of -0.9. > > simfun <- function(rho,n=10000) { > cormat <- matrix(c(1, rho, rho, 1), ncol = 2) > dd <- setNames(data.frame(MASS::mvrnorm(1000, mu=c(0,0), > Sigma=cormat)), > c("a","trans")) > dd <- transform(dd, > b=pnorm(trans,mean(trans),sd(trans))) > dd[,c("a","b")] > } > > cvec <- seq(-0.999,-0.85,length=51) > res <- expand.grid(rho=cvec,rep=1:10) > set.seed(101) > res$cor <- sapply(res$rho, > function(r) cor(simfun(rho=r,n1e6))[1,2]) > > par(las=1,bty="l") > plot(cor~rho,data=res) > abline(a=0,b=1,col=2) > abline(h=-0.9,col=4) > abline(v=-0.92,col=4) > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Dan Dalthorp, PhD USGS Forest and Rangeland Ecosystem Science Center Forest Sciences Lab, Rm 189 3200 SW Jefferson Way Corvallis, OR 97331 ph: 541-750-0953 ddalthorp at usgs.gov [[alternative HTML version deleted]]