Hi,

I would like to simulate data in which a percentage of the ids are duplicated, with each duplicated id having unique values of another factor variable across its rows. I'm having trouble figuring out an efficient way to do so.

For example, consider this mock output. [Note: Although the mock data doesn't display this, I am eventually interested in 73% of ids appearing once, 22% appearing twice (one duplicate) and 5% appearing three times (two duplicates). Also, I would like the 'al' variable to be randomly selected, perhaps using sample(), from a 3-level factor "pt", "th", "ob", AND for an id with duplicates to have unique values of the 'al' variable.]

Something like this:

    id   z    al
    1    .5   "pt"
    2    .4   "ob"
    3    .7   "pt"
    4    .3   "th"
    5    .5   "pt"
    5    .6   "ob"
    6    .3   "th"
    6    .2   "ob"
    7    .1   "pt"
    7    .3   "th"
    7    .1   "ob"

This is the general idea, although I will eventually create a much larger data set with z based on rnorm(), etc.

Any help toward a solution is much appreciated!

AC
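
One possible way to approach this, offered as a rough sketch rather than a definitive solution: first draw the number of rows per id (1, 2 or 3) with the stated probabilities, then repeat each id that many times, and finally draw 'al' without replacement within each id so its values cannot repeat. The names n_id, n_rows, levels_al and dat, the seed, and the choice of 1000 ids are assumptions made for illustration only; z is drawn from a standard normal as a placeholder.

```r
set.seed(1)                          # arbitrary seed, only for reproducibility
n_id      <- 1000                    # hypothetical number of distinct ids
levels_al <- c("pt", "th", "ob")     # the three 'al' levels from the question

# Rows per id: 1, 2 or 3, drawn with probabilities 0.73, 0.22, 0.05
n_rows <- sample(1:3, size = n_id, replace = TRUE, prob = c(0.73, 0.22, 0.05))

# Repeat each id according to its row count
id <- rep(seq_len(n_id), times = n_rows)

# Within each id, sample 'al' WITHOUT replacement so values are unique per id
al <- unlist(lapply(n_rows, function(k) sample(levels_al, size = k)))

dat <- data.frame(id = id,
                  z  = rnorm(length(id)),   # placeholder for the real z model
                  al = factor(al, levels = levels_al))

head(dat, 10)
prop.table(table(table(dat$id)))     # observed share of ids with 1, 2, 3 rows
```

Because the row counts are drawn at random, the observed shares will only be close to 73/22/5. If exact shares are needed, n_rows could instead be built as sample(rep(1:3, times = c(730, 220, 50))), which fixes the counts and only shuffles which ids get them.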