Hello R experts,
I need to take a random subsample from a data set and count the frequency
of each category in it. I then want the median of those frequencies over,
say, 1000 replicates. Should I do this by subsampling in a loop, or with a
bootstrap?
My data look like this:
> str(Data)
'data.frame': 155752 obs. of 2 variables:
 $ ReadName: Factor w/ 155752 levels "HWI-ST884185C1PEWACXX:3:1101:10047:62439#0/2",..: 49 325 800 624 786 77 203 825 249 369 ...
 $ Taxa    : Factor w/ 25 levels "Acidimicrobium",..: 1 1 1 1 1 1 1 1 1 1 ...
If I then take a sample of 10 rows, like this:
> Data[sample(nrow(Data), 10), ]
ReadName Taxa
122657 HWI-ST884185C1PEWACXX:4:2105:16386:68246#0/2 Frankia
91721 HWI-ST884185C1PEWACXX:3:2314:16967:14996#0/1 Rhodococcus
62980 HWI-ST884185C1PEWACXX:4:2101:13052:29946#0/1 Mycobacterium
::::
::::
and count the frequency of each taxon with ddply() from plyr (Sample is
the subsample drawn above):
> counts <- ddply(Sample, .(Sample$Taxa), nrow)
I get:
> counts
Sample$Taxa V1
1 Actinomyces 1
2 Frankia 3
3 Gordonia 1
4 Modestobacter 1
5 Mycobacterium 2
6 Rhodococcus 1
7 Tsukamurella 1
Now I need to do this 1000 times and get the median of the counts (the V1
column). Can you please suggest the quickest way?
I want to do this with really big data: the subsample size will be 1
million rows, with 1000 replicates, drawn from a data set of 10 million
rows.
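To make the question concrete, here is a minimal sketch of the procedure I have in mind. It runs on simulated toy data, not my real reads, and the names n_rep, n_samp, count_mat and median_counts are made up for illustration. It uses a plain replicate() loop, sampling without replacement, and base tabulate() in place of ddply(). I do not know whether this is the quickest approach, or whether it scales to 10 million rows.

```r
# Toy stand-in for the real data (simulated; same two columns as my Data)
set.seed(1)
taxa_levels <- c("Actinomyces", "Frankia", "Gordonia",
                 "Mycobacterium", "Rhodococcus")
Data <- data.frame(
  ReadName = factor(paste0("read", seq_len(10000))),
  Taxa     = factor(sample(taxa_levels, 10000, replace = TRUE),
                    levels = taxa_levels)
)

n_rep  <- 1000   # number of replicates
n_samp <- 1000   # subsample size (would be 1 million in the real case)

# Integer codes of the Taxa factor, computed once outside the loop
taxa_codes <- as.integer(Data$Taxa)
n_taxa     <- nlevels(Data$Taxa)

# Each replicate draws n_samp rows without replacement and counts each taxon.
# tabulate() with nbins fixed returns one count per level, including zeros,
# so every replicate yields a vector of the same length.
count_mat <- replicate(n_rep, {
  idx <- sample.int(nrow(Data), n_samp)
  tabulate(taxa_codes[idx], nbins = n_taxa)
})
rownames(count_mat) <- levels(Data$Taxa)

# Median count per taxon across the replicates (rows = taxa, cols = replicates)
median_counts <- apply(count_mat, 1, median)
print(median_counts)
```

Here count_mat has one row per taxon and one column per replicate, so the median is taken per taxon rather than over all of V1 at once.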
Thanks a lot for your help.
Mitra