thr3ads.net - R help - [R] sampling dataframe based upon number of record occurrences [Mar 2015]


Curtis Burkhalter

2015-Mar-03 21:22 UTC

[R] sampling dataframe based upon number of record occurrences

Hello everyone,

I'm having trouble with a task that is probably very simple, but I can't
figure out how to get my code to work. What I want to do is use the sample
function to pick records within a data frame, but only if a column attribute
value is repeated more than 3 times. In the data below I have created a
unique attribute value that corresponds to every site-by-year combination
(i.e. IDxYear, stored in the IDbyYear column). For example, the site called
"A-Airport" was sampled 6 times in 2006, while "A-Bark Corral East" was
sampled twice in 2008. So I want to randomly select 3 records for
"A-Airport" in 2006 from the existing 6 records, but for "A-Bark Corral
East" in 2008 I just want to leave the records as they currently are.

I've used the following code to try to accomplish this, but I can't get it
to work, so I'm clearly doing something wrong. If you could check the code
and provide any suggestions, that would be great. Note that there are 5589
unique IDxYear combinations, which is why that number appears in the code.
If any further clarification is needed, let me know.

boom=data.frame()
for (i in 1:5589){

boom[i,]=ifelse(length(fitting_set$IDbyYear[i]>3),fitting_set[sample(nrow(fitting_set),3),],fitting_set)

}
boom


IDbyYear   SiteID               Year   (6 other column attributes)
42.24      A-Airport            2006
42.24      A-Airport            2006
42.24      A-Airport            2006
42.24      A-Airport            2006
42.24      A-Airport            2006
42.24      A-Airport            2006
45.32      A-Bark Corral East   2008
45.32      A-Bark Corral East   2008
45.36      A-Bark Corral East   2009
45.40      A-Bark Corral East   2010
45.40      A-Bark Corral East   2010

 Thanks


-- 
Curtis Burkhalter

https://sites.google.com/site/curtisburkhalter/


JS Huang

2015-Mar-04 01:13 UTC

[R] sampling dataframe based upon number of record occurrences

Here is an implementation as a function named getSample. I modified the data
slightly so that it can be read as a table.
> fitting.set
   IDbyYear             SiteID Year
1     42.24          A-Airport 2006
2     42.24          A-Airport 2006
3     42.24          A-Airport 2006
4     42.24          A-Airport 2006
5     42.24          A-Airport 2006
6     42.24          A-Airport 2006
7     45.32 A-Bark.Corral.East 2008
8     45.32 A-Bark.Corral.East 2008
9     45.36 A-Bark.Corral.East 2009
10    45.40 A-Bark.Corral.East 2010
11    45.40 A-Bark.Corral.East 2010

getSample <- function(x)
{
  sites <- unique(x$SiteID)
  years <- unique(x$Year)
  result <- data.frame()
  x$ID <- seq(1,nrow(x))
  for (i in 1:length(sites))
  {
    for (j in 1:length(years))
    {
      if (nrow(x[as.character(x$SiteID)==as.character(sites[i]) &
                 x$Year==years[j],]) > 3)
      {
        sampledID <- sample(x[as.character(x$SiteID)==as.character(sites[i]) &
                              x$Year==years[j],]$ID,3,replace=FALSE)
        for (k in 1:length(sampledID))
        {
          result <- rbind(result,x[x$ID==sampledID[k],-4])
        }
      }
    }
  }
  names(result) <- c("IDbyYear","SiteID","Year")
  rownames(result) <- NULL
  return(result)
}

> getSample(fitting.set)
  IDbyYear    SiteID Year
1    42.24 A-Airport 2006
2    42.24 A-Airport 2006
3    42.24 A-Airport 2006



--
View this message in context:
http://r.789695.n4.nabble.com/sampling-dataframe-based-upon-number-of-record-occurrences-tp4704144p4704154.html
Sent from the R help mailing list archive at Nabble.com.

JS Huang

2015-Mar-04 01:25 UTC

[R] sampling dataframe based upon number of record occurrences

Since you indicated there are six more columns in the data frame, getSample
is modified below to handle them.
getSample <- function(x)
{
  sites <- unique(x$SiteID)
  years <- unique(x$Year)
  result <- data.frame()
  x$ID <- seq(1,nrow(x))
  for (i in 1:length(sites))
  {
    for (j in 1:length(years))
    {
      if (nrow(x[as.character(x$SiteID)==as.character(sites[i]) &
                 x$Year==years[j],]) > 3)
      {
        sampledID <- sample(x[as.character(x$SiteID)==as.character(sites[i]) &
                              x$Year==years[j],]$ID,3,replace=FALSE)
        for (k in 1:length(sampledID))
        {
          result <- rbind(result,x[x$ID==sampledID[k],-ncol(x)])
        }
      }
    }
  }
  names(result) <- names(x)[-ncol(x)]
  rownames(result) <- NULL
  return(result)
}

> getSample(fitting.set)
  IDbyYear    SiteID Year
1    42.24 A-Airport 2006
2    42.24 A-Airport 2006
3    42.24 A-Airport 2006
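
As posted, getSample returns rows only for site/year pairs with more than three records, so the 2008 to 2010 rows are absent from the output above. If the smaller groups should be kept unchanged, as the original question asks, one possible adjustment is sketched below. The name getSampleKeep and the parameter n are assumptions for illustration and are not part of the posted code; the data frame is rebuilt with three columns only.

```r
# Sketch: cap each site/year group at n rows, keeping smaller groups whole
getSampleKeep <- function(x, n = 3) {
  # Row numbers grouped by each observed SiteID/Year pair
  groups <- split(seq_len(nrow(x)), list(x$SiteID, x$Year), drop = TRUE)
  keep <- unlist(lapply(groups, function(idx) {
    # Sample positions within the group so single-row groups are safe
    if (length(idx) > n) idx[sample.int(length(idx), n)] else idx
  }))
  out <- x[sort(keep), , drop = FALSE]  # sort() keeps the original row order
  rownames(out) <- NULL
  out
}

# Three-column stand-in for the data shown in the thread
fitting.set <- data.frame(
  IDbyYear = c(rep(42.24, 6), 45.32, 45.32, 45.36, 45.40, 45.40),
  SiteID   = c(rep("A-Airport", 6), rep("A-Bark.Corral.East", 5)),
  Year     = c(rep(2006, 6), 2008, 2008, 2009, 2010, 2010)
)

res <- getSampleKeep(fitting.set)
nrow(res)
```

With this data the result has 8 rows: 3 sampled from the six A-Airport 2006 records plus the 5 A-Bark.Corral.East records, which pass through untouched.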




--
View this message in context:
http://r.789695.n4.nabble.com/sampling-dataframe-based-upon-number-of-record-occurrences-tp4704144p4704155.html
Sent from the R help mailing list archive at Nabble.com.

David L Carlson

2015-Mar-04 15:23 UTC

[R] sampling dataframe based upon number of record occurrences

I'm not sure I understand, but I think you have a large data frame with
records and you want to construct a sample of that data frame that includes no
more than 3 records for each IDbyYear combination? You say there are 5589 unique
combinations and your code uses a data frame called fitting_set. Assuming this
is the data frame you are describing, your code will select all of the lines
since fitting_set$IDbyYear[i] is always a vector of length 1.
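
A quick check may make the point about length 1 concrete. The sketch below uses a small made-up stand-in for fitting_set (the name demo and its values are assumptions for illustration), and shows table() as one way to count how often each ID occurs.

```r
# Hypothetical stand-in for fitting_set (values are illustrative only)
demo <- data.frame(
  IDbyYear = c(42.24, 42.24, 42.24, 42.24, 45.32, 45.32),
  SiteID   = c(rep("A-Airport", 4), rep("A-Bark Corral East", 2)),
  Year     = c(rep(2006L, 4), rep(2008L, 2))
)

# The comparison yields a single logical value, so its length is always 1,
# no matter how often the ID occurs in the column
len <- length(demo$IDbyYear[1] > 3)
len

# Counting occurrences of each ID is what the condition needs instead
counts <- table(demo$IDbyYear)
counts

# IDs that occur more than 3 times
over <- names(counts)[counts > 3]
over
```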

We need a reproducible example. The best way for you to give us that would be to
copy the result of dput(head(fitting_set, 10)). It would look something like
this plus the 6 other columns you mention except that I've added dta <-
in front of structure() to create a data frame:

dta <- structure(list(IDbyYear = c(42.24, 42.24, 42.24, 42.24, 42.24,
42.24, 45.32, 45.32, 45.36, 45.4, 45.4), SiteID = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("A-Airport",
"A-Bark Corral East"), class = "factor"), Year = c(2006L, 2006L,
2006L, 2006L, 2006L, 2006L, 2008L, 2008L, 2009L, 2010L, 2010L
)), .Names = c("IDbyYear", "SiteID", "Year"), class = "data.frame",
row.names = c(NA, -11L))

Now create a list of data frames, one for each IDbyYear:

dta.list <- split(dta, dta$IDbyYear)
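
To see what split() hands back, it can help to look at the names and sizes of the pieces. This is a small sketch on made-up data; dta_demo, pieces and group_sizes are names assumed here for illustration.

```r
# Small hypothetical data frame mirroring the example's layout
dta_demo <- data.frame(
  IDbyYear = c(42.24, 42.24, 42.24, 42.24, 45.32, 45.32, 45.36),
  Year     = c(2006L, 2006L, 2006L, 2006L, 2008L, 2008L, 2009L)
)

# split() returns a named list with one data frame per distinct IDbyYear
pieces <- split(dta_demo, dta_demo$IDbyYear)
names(pieces)

# Number of rows in each piece
group_sizes <- sapply(pieces, nrow)
group_sizes
```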

Now a function that will select 3 rows or all of them if there are fewer:

smp <- function(dframe) {
	ind <- seq_len(nrow(dframe))
	dframe[sample(ind, ifelse(length(ind)>2, 3, length(ind))),]
}
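
The same cap can also be written with min(), which some may find easier to read. The following is only a sketch: smp_min and its n argument are names assumed here, not part of the posted answer, and set.seed() is used solely to make the illustration repeatable.

```r
# Variant of the sampling helper: keep at most n rows of a data frame
smp_min <- function(dframe, n = 3) {
  ind <- seq_len(nrow(dframe))
  # min() caps the sample size at the number of available rows
  dframe[sample(ind, min(n, length(ind))), , drop = FALSE]
}

set.seed(1)  # only to make the illustration repeatable
big   <- data.frame(id = 1:6, value = letters[1:6])
small <- data.frame(id = 1:2, value = letters[1:2])

picked_big   <- smp_min(big)    # 3 of the 6 rows, in random order
picked_small <- smp_min(small)  # both rows, possibly reordered
```

Sampling row indices built with seq_len() rather than the values themselves keeps the helper safe for one-row groups, where sample() on a single number would otherwise draw from 1 up to that number.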

Now take the samples and combine them into a single data frame:

sample <- do.call(rbind, lapply(dta.list, smp))
sample
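
Combining the pieces with rbind groups the output by IDbyYear and derives the row names from the group names. If the original row order matters, one alternative is to sample row numbers per group and subset once. This is a sketch under the assumption that only the three columns shown are present; rows_by_id, keep and result are illustrative names.

```r
# Rebuild the thread's example data (three columns only)
dta <- data.frame(
  IDbyYear = c(rep(42.24, 6), 45.32, 45.32, 45.36, 45.40, 45.40),
  SiteID   = c(rep("A-Airport", 6), rep("A-Bark Corral East", 5)),
  Year     = c(rep(2006L, 6), 2008L, 2008L, 2009L, 2010L, 2010L)
)

set.seed(42)  # only for a repeatable illustration

# Split row numbers (not rows) by group, then keep at most 3 per group
rows_by_id <- split(seq_len(nrow(dta)), dta$IDbyYear)
keep <- unlist(lapply(rows_by_id, function(idx) {
  idx[sample.int(length(idx), min(3, length(idx)))]
}))

# Sorting the kept row numbers preserves the original row order
result <- dta[sort(keep), ]
rownames(result) <- NULL

# At most 3 rows per IDbyYear; smaller groups are untouched
table(result$IDbyYear)
```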

-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352



Curtis Burkhalter

2015-Mar-04 18:56 UTC

[R] sampling dataframe based upon number of record occurrences

That worked great, thanks so much David!


-- 
Curtis Burkhalter

https://sites.google.com/site/curtisburkhalter/
