Stephen HK Wong
2014-Aug-01 18:58 UTC
[R] How to randomly extract a number of rows in a data frame
Dear all,

I have a data frame containing 4 columns and several tens of millions of rows, like the one below. I want to extract, at random, say 1 million rows. Can you tell me how to do that in R using only base packages? Many thanks!

Col_1   Col_2     Col_3     Col_4
chr1    3000215   3000250   -
chr1    3000909   3000944   +
chr1    3001025   3001060   +
chr1    3001547   3001582   +
chr1    3002254   3002289   +
chr1    3002324   3002359   -
chr1    3002833   3002868   -
chr1    3004565   3004600   -
chr1    3004945   3004980   +
chr1    3004974   3005009   -
chr1    3005115   3005150   +
chr1    3005124   3005159   +
chr1    3005240   3005275   -
chr1    3005558   3005593   -
chr1    3005890   3005925   +
chr1    3005929   3005964   +
chr1    3005913   3005948   -
chr1    3005913   3005948   -

Stephen HK Wong
Marc Schwartz
2014-Aug-01 19:08 UTC
On Aug 1, 2014, at 1:58 PM, Stephen HK Wong <honkit at stanford.edu> wrote:

> I have a data frame containing 4 columns and several tens of millions of rows. I want to extract, at random, say 1 million rows. Can you tell me how to do that in R using only base packages?

If your data frame is called 'DF':

    DF.Rand <- DF[sample(nrow(DF), 1000000), ]

See ?sample. By default it samples without replacement, with every element equally likely to be chosen.

In the above, nrow(DF) returns the number of rows in DF and defines the sample space 1:nrow(DF), from which 1,000,000 distinct random integers are drawn and used as row indices.

Using the built-in 'iris' dataset, select 20 random rows from the 150 total:

    > iris[sample(nrow(iris), 20), ]
        Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
    122          5.6         2.8          4.9         2.0  virginica
    79           6.0         2.9          4.5         1.5 versicolor
    109          6.7         2.5          5.8         1.8  virginica
    106          7.6         3.0          6.6         2.1  virginica
    49           5.3         3.7          1.5         0.2     setosa
    125          6.7         3.3          5.7         2.1  virginica
    1            5.1         3.5          1.4         0.2     setosa
    68           5.8         2.7          4.1         1.0 versicolor
    84           6.0         2.7          5.1         1.6 versicolor
    110          7.2         3.6          6.1         2.5  virginica
    113          6.8         3.0          5.5         2.1  virginica
    64           6.1         2.9          4.7         1.4 versicolor
    102          5.8         2.7          5.1         1.9  virginica
    71           5.9         3.2          4.8         1.8 versicolor
    69           6.2         2.2          4.5         1.5 versicolor
    65           5.6         2.9          3.6         1.3 versicolor
    74           6.1         2.8          4.7         1.2 versicolor
    99           5.1         2.5          3.0         1.1 versicolor
    135          6.1         2.6          5.6         1.4  virginica
    41           5.0         3.5          1.3         0.3     setosa

Regards,

Marc Schwartz
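Two practical details follow from the sample() approach above: the draw changes on every run unless the random seed is fixed, and the selected rows come back in random order rather than file order. The sketch below illustrates both on a small made-up data frame. The data, the seed value, and the names idx and DF.Sorted are assumptions for illustration only, not part of the original thread.

```r
# Hypothetical stand-in for the poster's data: 100 rows, same 4 columns.
set.seed(42)  # fix the RNG state so the same rows are drawn on every run
n <- 100
start <- 3000000 + sort(sample(10000, n))
DF <- data.frame(
  Col_1 = rep("chr1", n),
  Col_2 = start,
  Col_3 = start + 35,
  Col_4 = sample(c("+", "-"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Draw 10 distinct row numbers from 1:nrow(DF) (without replacement).
idx <- sample(nrow(DF), 10)

DF.Rand   <- DF[idx, ]        # selected rows, in the random order drawn
DF.Sorted <- DF[sort(idx), ]  # the same rows, in their original order
```

Sorting the indices before subsetting is only worthwhile when the original ordering (here, genomic position) matters downstream; the set of rows selected is identical either way.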
William Dunlap
2014-Aug-01 19:12 UTC
Do you know how to extract some rows of a data.frame? The short answer is with subscripts, either integer,

    first10 <- 1:10
    dFirst10 <- d[first10, ]  # I assume your data.frame is called 'd'

or logical,

    plus4 <- d[, "Col_4"] == "+"
    dPlus4 <- d[plus4, ]

If you are not familiar with that sort of thing, read the "An Introduction to R" document that comes with R.

So you can solve your problem if you can generate a vector containing 1 million integers in the range 1:nrow(d) (roughly 1:10^7 or more for your data). Use the sample function for that. You must decide whether you want to allow duplicate rows or not (i.e., sampling with or without replacement). Type ?sample to see the details.

Bill Dunlap
TIBCO Software
wdunlap tibco.com
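The choice between sampling with and without replacement mentioned above can be sketched as follows, together with one way of combining a logical subscript with a random draw. The toy data frame and the names iNoRep, iRep, plusRows, and dPlusSample are hypothetical and chosen only for this illustration.

```r
set.seed(1)  # arbitrary seed, only so the example is repeatable

# Hypothetical toy data frame standing in for 'd'.
d <- data.frame(
  Col_1 = "chr1",
  Col_2 = 1:20,
  Col_4 = rep(c("+", "-"), 10),
  stringsAsFactors = FALSE
)

# Without replacement (the default): every selected row is distinct.
iNoRep <- sample(nrow(d), 8)
dNoRep <- d[iNoRep, ]

# With replacement: the same row number may be drawn more than once.
iRep <- sample(nrow(d), 8, replace = TRUE)
dRep <- d[iRep, ]

# Random subset restricted to "+" rows: take the positions of the matching
# rows, then sample positions within that vector. Sampling
# length(plusRows) rather than plusRows itself avoids the surprise
# documented in ?sample when the vector has length one.
plusRows <- which(d[, "Col_4"] == "+")
dPlusSample <- d[plusRows[sample(length(plusRows), 5)], ]
```

Without replacement, asking for more rows than the data frame has is an error, whereas with replace = TRUE any sample size is allowed.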