Hi all,

I want to select rows at random from a large data.frame so that the sample follows a particular distribution, defined by a given subset of that data.frame. How can I do this? More details, and what I've done so far, are given below.

I have gene expression data and gene sets of interest. To look at enrichment of differential expression I'm using a simple permutation approach: select a random set of genes (the same size as the set of differentially expressed genes), record the overlap with the gene set, and repeat 10,000 times.

The problem: expression level and significance of differential expression are correlated (more power). Hence I want to do a biased permutation, selecting random genes that together follow the same expression-level distribution as the DE genes.

This is what I've done so far:

geneExp is my data.frame with DE statistics: 6585 rows of genes, column one is gene ID.
geneSet is my gene set; column one is gene ID.
index is the index of the DE genes in geneExp.

    # baseMean is a measure of expression level
    dSign = density(geneExp[index, 'baseMean'])
    # density of the DE genes, interpolated at each gene's baseMean
    prob = lapply(geneExp[, "baseMean"], function(x) approx(dSign$x, dSign$y, x)$y)
    prob = unlist(prob)

So when I am doing my permutation I do:

    overlap = numeric(10000)  # vector(0, length=10000) is not a valid call
    for (i in 1:10000) {
      # separate name so the DE index above is not overwritten
      randIndex = sample(1:6585, 543, prob = prob)
      overlap[i] = sum(!is.na(match(geneSet[, 1], geneExp[randIndex, 1])))
    }

Thereafter I look at the distribution of random overlaps compared to the initially observed overlap.

But the distribution of values that this permutation gives is NOT equal to the distribution of the significant genes; it is a lot narrower. That is simply because my method assumes a uniform distribution of values to choose from.

Sorry if this was a complicated message. I would highly appreciate any help or comments!

Best,
Bryo

--
View this message in context: http://r.789695.n4.nabble.com/Selecting-a-subsample-so-that-it-follows-a-distribution-tp3331659p3331659.html
Sent from the R help mailing list archive at Nabble.com.
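For illustration, the narrowing described in the post can be reproduced on simulated data. When the sampling weights equal the target density but the pool of genes is itself not uniform, the sampled values follow roughly the product of the target and pool densities. One possible correction, sketched below, is to weight each gene by the ratio of the target density to the pool density. Everything here is an assumption made for the sketch and not part of the original post: the simulated `expr` values, the way `isDE` is generated, and the names `wNaive` and `wRatio`.

```r
set.seed(1)

# Simulated stand-ins (hypothetical data, not the poster's):
# log-scale expression for 6585 genes; DE genes skew towards high expression.
n <- 6585
expr <- rnorm(n, mean = 5, sd = 2)
isDE <- rbinom(n, 1, plogis(expr - 9)) == 1
deIdx <- which(isDE)
k <- length(deIdx)

# Estimate the target (DE genes) and pool (all genes) densities on one grid.
rng <- range(expr)
dTarget <- density(expr[deIdx], from = rng[1], to = rng[2])
dPool <- density(expr, from = rng[1], to = rng[2])

# Interpolate both densities at every gene; rule = 2 avoids NA at the edges.
fTarget <- approx(dTarget$x, dTarget$y, xout = expr, rule = 2)$y
fPool <- approx(dPool$x, dPool$y, xout = expr, rule = 2)$y

# Naive weights: target density only (the approach in the post).
wNaive <- fTarget
# Ratio weights: target density divided by pool density.
wRatio <- fTarget / fPool
wRatio[!is.finite(wRatio)] <- 0  # guard against division by zero

sNaive <- sample(n, k, prob = wNaive)
sRatio <- sample(n, k, prob = wRatio)

# Compare mean and spread of the two samples against the DE genes.
summ <- function(idx) c(mean = mean(expr[idx]), sd = sd(expr[idx]))
print(rbind(target = summ(deIdx), naive = summ(sNaive), ratio = summ(sRatio)))
```

Because `sample()` draws without replacement here, the ratio weights match the target only approximately, and the match gets worse when the sample is a large fraction of the pool. Drawing within expression-level bins, with as many genes per bin as the DE set has, is another option that may be worth comparing.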