thr3ads.net - R help - [R] random sampling but with caveats! [Sep 2011]

Rebecca Ross

2011-Sep-08 15:47 UTC

[R] random sampling but with caveats!

Hi,
I wonder if someone can help me. I have built a GAM to predict the
presence of cold-water corals and am now trying to evaluate the model by
splitting my dataset into training and test datasets.

In an ideal world I would use the sample() function to randomly select rows of
data for me, so for example with 936 rows of data in my HH dataset I might say

ss <- sample(nrow(HH), size = nrow(HH) - 312, replace = FALSE)
training <- HH[ss, ]
test <- HH[-ss, ]

in order to create a random training sub-sample of roughly two-thirds of my data
(624 rows) and a test set of the remaining third (312 rows). (I would use a
for() loop to automate the process of building the datasets and running the
prediction, e.g. 1000 times.)

The problem is that I have 2 caveats for the subsampling:

a) I need to have control over the prevalence (the proportion of observed
presences within the dataset) in my build and test datasets.
I realise I could do this by sorting my column of presences and absences,
taking a subsample of the required size from the rows containing presences,
then one from the rows containing absences, and combining them,

e.g.

presence_records <- sample(1:117, size = 75, replace = FALSE)
absence_records <- sample(118:936, size = 549, replace = FALSE)
ss <- c(presence_records, absence_records)

but...

b) My samples lie within video transects and, because of the risk of
autocorrelation within each transect, ideally they would be randomly selected
by transect cluster (a point within a transect cannot be allocated to the
training dataset when another point from that same transect is already
allocated to the test dataset).
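
One way to honour caveat (b) on its own is to sample transect IDs rather than
rows, so that every point in a transect lands on the same side of the split.
A minimal sketch in base R, assuming the data frame has a column identifying
the transect (called `transect` here; that name, the `presence` column name and
the toy data frame `HH_toy` are hypothetical stand-ins, not taken from the
original dataset):

```r
# Toy stand-in for HH: 40 transects of varying length with a 0/1 presence column.
# The column names `transect` and `presence` are assumptions for illustration.
set.seed(1)
HH_toy <- data.frame(
  transect = rep(seq_len(40), times = sample(10:30, 40, replace = TRUE))
)
HH_toy$presence <- rbinom(nrow(HH_toy), size = 1, prob = 0.125)

# Sample whole transects, not rows: about two-thirds of the transect IDs
# go to the training set.
transect_ids <- unique(HH_toy$transect)
train_ids <- sample(transect_ids, size = round(2 / 3 * length(transect_ids)))

in_train <- HH_toy$transect %in% train_ids
training <- HH_toy[in_train, ]
test     <- HH_toy[!in_train, ]

# Number of transects that appear on both sides of the split (should be zero).
length(intersect(training$transect, test$transect))
```

Because transects differ in length and in how many presences they hold, the
row counts and the prevalence of the two sets will vary from draw to draw;
this sketch controls neither.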

Is there a way I can fulfil both of these caveats and come out with my (slightly
less) random subsamples?

Many thanks for your time!
All the best,
Bex



Jean-Christophe BOUËTTÉ

2011-Sep-09 00:43 UTC

[R] random sampling but with caveats!

Hi there,
It seems you got no answer. Maybe providing a reproducible example
would help, as well as expressing your problem in more general terms.
I am not an expert in sampling, but I would suggest (as does the help
for sample) that you take a look at the sampling package, available on
CRAN, and the strata function in this package that allows for
stratified sampling.
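
For illustration only, both constraints can also be approximated in base R
without the sampling package: draw whole transects, stratified by whether a
transect contains any presence, and redraw until the training-set prevalence
is close to a target. This is a rough sketch under assumptions, not the
strata() approach itself; the column names `transect` and `presence`, the
helper `split_by_transect`, the tolerance and the toy data are all
hypothetical.

```r
# Hypothetical helper: split rows into training/test by whole transects,
# stratifying transects by whether they contain at least one presence,
# and redrawing until training prevalence is within `tol` of `target`.
split_by_transect <- function(dat, train_frac = 2 / 3,
                              target = mean(dat$presence),
                              tol = 0.02, max_tries = 1000) {
  # TRUE for transects holding at least one presence record
  has_pres <- tapply(dat$presence, dat$transect, function(x) any(x == 1))
  ids_pres <- names(has_pres)[has_pres]
  ids_abs  <- names(has_pres)[!has_pres]

  for (i in seq_len(max_tries)) {
    # Sample by position so a stratum of length one is handled correctly
    train_ids <- c(
      ids_pres[sample.int(length(ids_pres), round(train_frac * length(ids_pres)))],
      ids_abs[sample.int(length(ids_abs), round(train_frac * length(ids_abs)))]
    )
    in_train <- as.character(dat$transect) %in% train_ids
    # Accept the draw only if training prevalence is close enough to the target
    if (isTRUE(abs(mean(dat$presence[in_train]) - target) <= tol)) {
      return(list(training = dat[in_train, ], test = dat[!in_train, ], tries = i))
    }
  }
  stop("no split met the prevalence tolerance; try a larger tol")
}

# Toy data standing in for HH (independent presences, unlike real transects)
set.seed(42)
HH_toy <- data.frame(
  transect = rep(seq_len(40), times = sample(10:30, 40, replace = TRUE))
)
HH_toy$presence <- rbinom(nrow(HH_toy), size = 1, prob = 0.125)

parts <- split_by_transect(HH_toy)
mean(parts$training$presence)  # training prevalence, within tol of the target
mean(parts$test$presence)      # test prevalence is not constrained directly
```

Since whole transects move together, prevalence can only be matched
approximately, and rejecting draws makes the subsamples slightly less random.
The call could be wrapped in a for() loop to repeat the split, e.g. 1000 times.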

HTH,
Jean-Christophe

