Hi, I am using the sim1000G R package to simulate data for case/control study. I can not figure out how to manipulate this code to be able to generate 10% or 50% causal SNPs in R. This is whole code provided as example on GitHub: library(sim1000G) vcf_file = "region-chr4-357-ANK2.vcf.gz" #nvariants = 442, ss=1000 vcf = readVCF( vcf_file, maxNumberOfVariants = 442 ,min_maf 0.0005,max_maf = 0.01) #lowest MAF dim( vcf$gt1 ) #rows represent number of variants, columns represent number of individuals ## Download and use full chromosome genetic map downloadGeneticMap(4) readGeneticMap(4) sample.size=3000 startSimulation(vcf, totalNumberOfIndividuals = sample.size) data_sim = function(seed.num){ SIM$reset() id = generateUnrelatedIndividuals(sample.size) gt = retrieveGenotypes(id) freq = apply(gt,2,sum)/(2*nrow(gt)) causal = sample(setdiff(1:ncol(gt),which(freq==0)),45) beta.sign = rep(1,45) c.value = 0.402 beta.abs = c.value*abs(log10(freq[causal])) beta.val = beta.sign*beta.abs x.bar = apply(gt[,causal],2,mean) x.bar = as.matrix(x.bar) beta.val = t(as.matrix(beta.val)) #disease prvalance = 1% #beta0 = -log(99)-beta.val %*% x.bar #disease prvalance = 1.5% beta0 = 0-beta.val %*% x.bar eta = beta.val %*% t(gt[,causal]) eta = as.vector(eta) + rep(beta0,nrow(gt)) prob = exp(eta)/(1+exp(eta)) genocase = rep(NA, sample.size) set.seed(seed.num) for(i in 1:sample.size){ genocase[i] = rbinom(1, 1, prob[i]) } case.idx = sample(which(genocase==1),1000) control.idx = sample(which(genocase==0),1000) return(rbind(gt[case.idx,],gt[control.idx,])) } How I can modify code in a way that it will simulate: 50 % of causal SNPs** ( exmp. 24 causal variants and 24 non causal SNPs) 10 % of causal SNP (exmpl. 5 causal and 43 non causal SNPs) Thanks a lot for any suggestion. [[alternative HTML version deleted]]

Hi Berina, I'm not an expert on genetics. I haven't looked at the package. And I've only glanced at your question. So, this is probably not the best response. But as no one else has responded, here's some comments: (1) Have you checked if there's a function in the package to do what you want? The remainder of these questions assume that you have, and the answer is no. (2) I'm having some difficulty following the sample size. The initial code sets an explicit value to 3000. But if I'm following it correctly, the object that's returned has 2000 rows, containing two 1000 row groups. But then your question implies you want 48? Given that the sample already contains two groups, are they relevant to the sample you're trying to produce? And are you wanting to take a small sample of 48 from a larger sample of 2000, or something else? (3) Are the observations (not sure if that's the correct term here) in each 1000 row group, statistically independent? If they are, then taking a smaller sample should be relatively simple. If they're not, then this is a much more complex question, that's probably off-topic. (4) What exactly is in the data? i.e. Could you call the str() or head() functions on the data, and show us the results? (5) Is there a boolean-style variable in the data, indicating whether each row is casual or non-casual? B. On Fri, Oct 30, 2020 at 10:37 PM Berina Zametica UNI <s0bezame at uni-bonn.de> wrote:> > Hi, > > I am using the sim1000G R package to simulate data for case/control study. > I can not figure out how to manipulate this code to be able to generate 10% > or 50% causal SNPs in R. > > This is whole code provided as example on GitHub: > > library(sim1000G) > vcf_file = "region-chr4-357-ANK2.vcf.gz" #nvariants = 442, ss=1000 > > vcf = readVCF( vcf_file, maxNumberOfVariants = 442 ,min_maf > 0.0005,max_maf = 0.01) #lowest MAF > dim( vcf$gt1 ) #rows represent number of variants, columns represent > number of individuals > > ## Download and use full chromosome genetic map > downloadGeneticMap(4) > readGeneticMap(4) > > sample.size=3000 > > startSimulation(vcf, totalNumberOfIndividuals = sample.size) > > data_sim = function(seed.num){ > > SIM$reset() > > id = generateUnrelatedIndividuals(sample.size) > > gt = retrieveGenotypes(id) > > > freq = apply(gt,2,sum)/(2*nrow(gt)) > causal = sample(setdiff(1:ncol(gt),which(freq==0)),45) > > beta.sign = rep(1,45) > c.value = 0.402 > beta.abs = c.value*abs(log10(freq[causal])) > beta.val = beta.sign*beta.abs > x.bar = apply(gt[,causal],2,mean) > x.bar = as.matrix(x.bar) > beta.val = t(as.matrix(beta.val)) > #disease prvalance = 1% > #beta0 = -log(99)-beta.val %*% x.bar > #disease prvalance = 1.5% > beta0 = 0-beta.val %*% x.bar > > eta = beta.val %*% t(gt[,causal]) > eta = as.vector(eta) + rep(beta0,nrow(gt)) > prob = exp(eta)/(1+exp(eta)) > > genocase = rep(NA, sample.size) > > set.seed(seed.num) > for(i in 1:sample.size){ > genocase[i] = rbinom(1, 1, prob[i]) > } > case.idx = sample(which(genocase==1),1000) > control.idx = sample(which(genocase==0),1000) > > return(rbind(gt[case.idx,],gt[control.idx,])) > } > > How I can modify code in a way that it will simulate: > 50 % of causal SNPs** ( exmp. 24 causal variants and 24 non causal SNPs) > 10 % of causal SNP (exmpl. 5 causal and 43 non causal SNPs) > > Thanks a lot for any suggestion. > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.