thr3ads.net - R help - [R] Making tapply code more efficient [Feb 2009]

If this information is useful, please help other people find it:
Share via:

Doran, Harold

2009-Feb-27 14:46 UTC

[R] Making tapply code more efficient

Previously, I posed the question pasted down below to the list and
received some very helpful responses. While the code suggestions
provided in response indeed work, they seem to only work with *very*
small data sets and so I wanted to follow up and see if anyone had ideas
for better efficiency. I was quite embarrased on this as our SAS
programmers cranked out programs that did this in the blink of an eye
(with a few variables), but R was spinning for days on my Ubuntu machine
and ultimately I saw a message that R was "killed".

The data I am working with has 800967 total rows and 31 total columns.
The ID variable I use as the index variable in tapply() has 326397
unique cases.
> length(unique(qq$student_unique_id))[1] 326397

To give a sense of what my data look like and the actual problem,
consider the following:

qq <- data.frame(student_unique_id = factor(c(1,1,2,2,2)),
teacher_unique_id = factor(c(10,10,20,20,25)))

This is a student achievement database where students occupy multiple
rows in the data and the variable teacher_unique_id denotes the class
the student was in. What I am doing is looking to see if the teacher is
the same for each instance of the unique student ID. So, if I implement
the following:

same <- function(x) length( unique(x) ) == 1
results <- data.frame(
        freq = tapply(qq$student_unique_id, qq$student_unique_id,
length),
        tch = tapply(qq$teacher_unique_id, qq$student_unique_id, same)
)

I get the following results. I can see that student 1 appears in the
data twice and the teacher is always the same. However, student 2
appears three times and the teacher is not always the same.
> results  freq   tch
1    2  TRUE
2    3 FALSE

Now, implementing this same procedure to a large data set with the
characteristics described above seems to be problematic in this
implementation. 

Does anyone have reactions on how this could be more efficient such that
it can run with large data as I described?

Harold
> sessionInfo()R version 2.8.1 (2008-12-22)
x86_64-pc-linux-gnu 

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.U
TF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAMEC;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATI
ON=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base




##### Original question posted on 1/13/09
Suppose I have a dataframe as follows:

dat <- data.frame(id = c(1,1,2,2,2), var1 = c(10,10,20,20,25), var2
c('foo', 'foo', 'foo', 'foobar', 'foo'))

Now, if I were to subset by id, such as:
> subset(dat, id==1)  id var1 var2
1  1   10  foo
2  1   10  foo

I can see that the elements in var1 are exactly the same and the
elements in var2 are exactly the same. However,
> subset(dat, id==2)  id var1   var2
3  2   20    foo
4  2   20 foobar
5  2   25    foo

Shows the elements are not the same for either variable in this
instance. So, what I am looking to create is a data frame that would be
like this

id      freq    var1    var2
1       2       TRUE    TRUE   
2       3       FALSE   FALSE

Where freq is the number of times the ID is repeated in the dataframe. A
TRUE appears in the cell if all elements in the column are the same for
the ID and FALSE otherwise. It is insignificant which values differ for
my problem.

The way I am thinking about tackling this is to loop through the ID
variable and compare the values in the various columns of the dataframe.
The problem I am encountering is that I don't think all.equal or
identical are the right functions in this case.

So, say I was wanting to compare the elements of var1 for id ==1. I
would have

x <- c(10,10)

Of course, the following works
> all.equal(x[1], x[2])[1] TRUE

As would a similar call to identical. However, what if I only have a
vector of values (or if the column consists of names) that I want to
assess for equality when I am trying to automate a process over
thousands of cases? As in the example above, the vector may contain only
two values or it may contain many more. The number of values in the
vector differ by id.

Any thoughts?

Harold

ONKELINX, Thierry

2009-Feb-27 15:23 UTC

head link

[R] Making tapply code more efficient

Hi Harold,

What about this? You one have to make the crosstabulation once.
> qq <- data.frame(student = factor(c(1,1,2,2,2)), teacher
factor(c(10,10,20,20,25)))
> tab <- table(qq$student, qq$teacher)
> data.frame(Student = rownames(tab), Freq = rowSums(tab), tch rowSums(tab
> 0) == 1)  Student Freq   tch
1       1    2  TRUE
2       2    3 FALSE

HTH,

Thierry


------------------------------------------------------------------------
----
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature
and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics,
methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium 
tel. + 32 54/436 185
Thierry.Onkelinx at inbo.be 
www.inbo.be 

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to
say what the experiment died of.
~ Sir Ronald Aylmer Fisher

The plural of anecdote is not data.
~ Roger Brinner

The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of
data.
~ John Tukey

-----Oorspronkelijk bericht-----
Van: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
Namens Doran, Harold
Verzonden: vrijdag 27 februari 2009 15:47
Aan: r-help at r-project.org
Onderwerp: [R] Making tapply code more efficient

Previously, I posed the question pasted down below to the list and
received some very helpful responses. While the code suggestions
provided in response indeed work, they seem to only work with *very*
small data sets and so I wanted to follow up and see if anyone had ideas
for better efficiency. I was quite embarrased on this as our SAS
programmers cranked out programs that did this in the blink of an eye
(with a few variables), but R was spinning for days on my Ubuntu machine
and ultimately I saw a message that R was "killed".

The data I am working with has 800967 total rows and 31 total columns.
The ID variable I use as the index variable in tapply() has 326397
unique cases.
> length(unique(qq$student_unique_id))[1] 326397

To give a sense of what my data look like and the actual problem,
consider the following:

qq <- data.frame(student_unique_id = factor(c(1,1,2,2,2)),
teacher_unique_id = factor(c(10,10,20,20,25)))

This is a student achievement database where students occupy multiple
rows in the data and the variable teacher_unique_id denotes the class
the student was in. What I am doing is looking to see if the teacher is
the same for each instance of the unique student ID. So, if I implement
the following:

same <- function(x) length( unique(x) ) == 1
results <- data.frame(
        freq = tapply(qq$student_unique_id, qq$student_unique_id,
length),
        tch = tapply(qq$teacher_unique_id, qq$student_unique_id, same)
)

I get the following results. I can see that student 1 appears in the
data twice and the teacher is always the same. However, student 2
appears three times and the teacher is not always the same.
> results  freq   tch
1    2  TRUE
2    3 FALSE

Now, implementing this same procedure to a large data set with the
characteristics described above seems to be problematic in this
implementation. 

Does anyone have reactions on how this could be more efficient such that
it can run with large data as I described?

Harold
> sessionInfo()R version 2.8.1 (2008-12-22)
x86_64-pc-linux-gnu 

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.U
TF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAMEC;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATI
ON=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base




##### Original question posted on 1/13/09
Suppose I have a dataframe as follows:

dat <- data.frame(id = c(1,1,2,2,2), var1 = c(10,10,20,20,25), var2
c('foo', 'foo', 'foo', 'foobar', 'foo'))

Now, if I were to subset by id, such as:
> subset(dat, id==1)  id var1 var2
1  1   10  foo
2  1   10  foo

I can see that the elements in var1 are exactly the same and the
elements in var2 are exactly the same. However,
> subset(dat, id==2)  id var1   var2
3  2   20    foo
4  2   20 foobar
5  2   25    foo

Shows the elements are not the same for either variable in this
instance. So, what I am looking to create is a data frame that would be
like this

id      freq    var1    var2
1       2       TRUE    TRUE   
2       3       FALSE   FALSE

Where freq is the number of times the ID is repeated in the dataframe. A
TRUE appears in the cell if all elements in the column are the same for
the ID and FALSE otherwise. It is insignificant which values differ for
my problem.

The way I am thinking about tackling this is to loop through the ID
variable and compare the values in the various columns of the dataframe.
The problem I am encountering is that I don't think all.equal or
identical are the right functions in this case.

So, say I was wanting to compare the elements of var1 for id ==1. I
would have

x <- c(10,10)

Of course, the following works
> all.equal(x[1], x[2])[1] TRUE

As would a similar call to identical. However, what if I only have a
vector of values (or if the column consists of names) that I want to
assess for equality when I am trying to automate a process over
thousands of cases? As in the example above, the vector may contain only
two values or it may contain many more. The number of values in the
vector differ by id.

Any thoughts?

Harold

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Dit bericht en eventuele bijlagen geven enkel de visie van de schrijver weer 
en binden het INBO onder geen enkel beding, zolang dit bericht niet bevestigd is
door een geldig ondertekend document. The views expressed in  this message 
and any annex are purely those of the writer and may not be regarded as stating 
an official position of INBO, as long as the message is not confirmed by a duly 
signed document.

jim holtman

2009-Feb-27 18:59 UTC

head link

[R] Making tapply code more efficient

On something the size of your data it took about 30 seconds to
determine the number of unique teachers per student.
> x <- cbind(sample(326397, 800967, TRUE), sample(20, 800967, TRUE))
> # split the data so you have the number of teachers per student
> system.time(t.s <- split(x[,2], x[,1]))   user  system elapsed
   0.92    0.01    0.94> t.s[1:7]  # sample data$`1`
[1] 16

$`2`
[1] 3

$`3`
[1] 1

$`4`
[1] 17

$`6`
[1]  9  9 19

$`7`
[1] 20

$`9`
[1]  3 16 16 10  8 17
> # count number of unique teachers per student
> system.time(t.a <- sapply(t.s, function(x) length(unique(x))))   user  system elapsed
  20.17    0.10   20.26>
>
>
> t.a[1:10] 1  2  3  4  6  7  9 10 11 12
 1  1  1  1  2  1  5  1  1  1


On Fri, Feb 27, 2009 at 9:46 AM, Doran, Harold <HDoran at air.org>
wrote:> Previously, I posed the question pasted down below to the list and
> received some very helpful responses. While the code suggestions
> provided in response indeed work, they seem to only work with *very*
> small data sets and so I wanted to follow up and see if anyone had ideas
> for better efficiency. I was quite embarrased on this as our SAS
> programmers cranked out programs that did this in the blink of an eye
> (with a few variables), but R was spinning for days on my Ubuntu machine
> and ultimately I saw a message that R was "killed".
>
> The data I am working with has 800967 total rows and 31 total columns.
> The ID variable I use as the index variable in tapply() has 326397
> unique cases.
>
>> length(unique(qq$student_unique_id))
> [1] 326397
>
> To give a sense of what my data look like and the actual problem,
> consider the following:
>
> qq <- data.frame(student_unique_id = factor(c(1,1,2,2,2)),
> teacher_unique_id = factor(c(10,10,20,20,25)))
>
> This is a student achievement database where students occupy multiple
> rows in the data and the variable teacher_unique_id denotes the class
> the student was in. What I am doing is looking to see if the teacher is
> the same for each instance of the unique student ID. So, if I implement
> the following:
>
> same <- function(x) length( unique(x) ) == 1
> results <- data.frame(
> ? ? ? ?freq = tapply(qq$student_unique_id, qq$student_unique_id,
> length),
> ? ? ? ?tch = tapply(qq$teacher_unique_id, qq$student_unique_id, same)
> )
>
> I get the following results. I can see that student 1 appears in the
> data twice and the teacher is always the same. However, student 2
> appears three times and the teacher is not always the same.
>
>> results
> ?freq ? tch
> 1 ? ?2 ?TRUE
> 2 ? ?3 FALSE
>
> Now, implementing this same procedure to a large data set with the
> characteristics described above seems to be problematic in this
> implementation.
>
> Does anyone have reactions on how this could be more efficient such that
> it can run with large data as I described?
>
> Harold
>
>> sessionInfo()
> R version 2.8.1 (2008-12-22)
> x86_64-pc-linux-gnu
>
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.U
> TF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME>
C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATI
> ON=C
>
> attached base packages:
> [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base
>
>
>
>
> ##### Original question posted on 1/13/09
> Suppose I have a dataframe as follows:
>
> dat <- data.frame(id = c(1,1,2,2,2), var1 = c(10,10,20,20,25), var2 >
c('foo', 'foo', 'foo', 'foobar', 'foo'))
>
> Now, if I were to subset by id, such as:
>
>> subset(dat, id==1)
> ?id var1 var2
> 1 ?1 ? 10 ?foo
> 2 ?1 ? 10 ?foo
>
> I can see that the elements in var1 are exactly the same and the
> elements in var2 are exactly the same. However,
>
>> subset(dat, id==2)
> ?id var1 ? var2
> 3 ?2 ? 20 ? ?foo
> 4 ?2 ? 20 foobar
> 5 ?2 ? 25 ? ?foo
>
> Shows the elements are not the same for either variable in this
> instance. So, what I am looking to create is a data frame that would be
> like this
>
> id ? ? ?freq ? ?var1 ? ?var2
> 1 ? ? ? 2 ? ? ? TRUE ? ?TRUE
> 2 ? ? ? 3 ? ? ? FALSE ? FALSE
>
> Where freq is the number of times the ID is repeated in the dataframe. A
> TRUE appears in the cell if all elements in the column are the same for
> the ID and FALSE otherwise. It is insignificant which values differ for
> my problem.
>
> The way I am thinking about tackling this is to loop through the ID
> variable and compare the values in the various columns of the dataframe.
> The problem I am encountering is that I don't think all.equal or
> identical are the right functions in this case.
>
> So, say I was wanting to compare the elements of var1 for id ==1. I
> would have
>
> x <- c(10,10)
>
> Of course, the following works
>
>> all.equal(x[1], x[2])
> [1] TRUE
>
> As would a similar call to identical. However, what if I only have a
> vector of values (or if the column consists of names) that I want to
> assess for equality when I am trying to automate a process over
> thousands of cases? As in the example above, the vector may contain only
> two values or it may contain many more. The number of values in the
> vector differ by id.
>
> Any thoughts?
>
> Harold
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

William Dunlap

2009-Mar-09 18:33 UTC

head link

[R] Making tapply code more efficient

The sparse xtabs solution is good.  It avoids the overhead
of a long list of small vectors that tapply can require.

Here is another approach, perhaps more like what
a SAS programmer might do.  It can run quickly
and use little memory but also can be tricky to set up.
Sort the data into groups (by student_unique_id here)
and within the groups sort by teacher_unique_id.  order()
does this nicely. Once the sorting is done you only have 
to do comparisons of observations with their neighbors to
identify the first and last observations in a group (now
a run, because of the sorting).  In this case you just
need to ask if the first teacher id in a given run is
the same as the last teach id in that run.  

f2 <- function(qq, needs.sorting = TRUE) {
  # we sort by student_id, breaking ties with teacher_id, unless
  # user asserts this has already been done.
  student_unique_id <- qq$student_unique_id
  teacher_unique_id <- qq$teacher_unique_id
  if (needs.sorting) {
     ord <- order(student_unique_id, teacher_unique_id)
     student_unique_id <- student_unique_id[ord]
     teacher_unique_id <- teacher_unique_id[ord]
  }
  freq <- table(student_unique_id) # table is fast way to count by
groups
  freq <- freq[freq>0]
  # Because of sorting, if teacher_id in first record for a given
student
  # is the same as the teacher_id for the last record for that student,
  # then that student had only 1 teacher.
  # These functions fail if x contains NA's, but sorting has removed
them
  isFirstInRun <- function(x)c(TRUE, x[-1]!=x[-length(x)])
  isLastInRun <- function(x)c(x[-1]!=x[-length(x)], TRUE)
  tch <- teacher_unique_id[isFirstInRun(student_unique_id)] =        
teacher_unique_id[isLastInRun(student_unique_id)]
  results <- data.frame(
      row.names=dimnames(freq)[[1]],
      Freq=as.vector(freq),
      tch=tch)
  results
}

In this case this uses 50% more time than the xtabs(sparse=TRUE)
approach
but it may be worthwhile to get familiar with this sort of algorithm.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com 
--------------------------------------------------
[R] Making tapply code more efficient

Doran, Harold HDoran at air.org 
Mon Mar 9 15:43:47 CET 2009
Previous message: [R] Making tapply code more efficient
Next message: [R] rcorr.cens Goodman-Kruskal gamma
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
I think I might be able to answer my own question here. It turns out in
the step 

tab <- table(dats3$student_unique_id, dats3$teacher_unique_id)

The dimensions of this table in my data are significantly larger than
the simulated data, thus consuming more memory. So, I know that the lme4
package has a method for sparse crosstabs, so I tried this:
> library(lme4)
> tab <- xtabs(~ dats3$student_unique_id + dats3$teacher_unique_id,sparse = TRUE)
> result <- data.frame(Student = rownames(tab), Freq = rowSums(tab), tch= rowSums(tab > 0) == 1)

And the world works beautifully.


> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Doran, Harold
> Sent: Monday, March 09, 2009 10:25 AM
> To: ONKELINX, Thierry; jholtman at gmail.com
> Cc: r-help at r-project.org
> Subject: Re: [R] Making tapply code more efficient
> 
> Thierry and Jim:
> 
> Thank you both for your reply. I remain a bit baffled over something.
> Here is the sample data generated by jim and code by Thierry, 
> which works exactly as expected.
> 
> x <- cbind(sample(326397, 800967, TRUE), sample(20, 800967, 
> TRUE)) x <- data.frame(x) names(x)[1:2] <- 
> c('student_unique_id', 'teacher_unique_id') tab <- 
> table(x$student_unique_id, x$teacher_unique_id) result <- 
> data.frame(Student = rownames(tab), Freq = rowSums(tab), tch 
> = rowSums(tab > 0) == 1)
> 
> Now, here is what happens when I run this on my data (called dats3)
> 
> > tab <- table(dats3$student_unique_id, dats3$teacher_unique_id)
> Error: cannot allocate vector of size 942.8 Mb
> 
> So, let's take a look at a couple of things:
> 
> > object.size(dats3) < object.size(x)
> [1] TRUE
> 
> > str(x)
> 'data.frame':   800967 obs. of  2 variables:
>  $ student_unique_id: int  121914 89142 127790 61350 54684 
> 28018 313428
> 27595 316285 173571 ...
>  $ teacher_unique_id: int  17 1 19 20 3 18 15 1 14 15 ...
> > str(dats3)
> 'data.frame':   56204 obs. of  2 variables:
>  $ student_unique_id: int  20504 26172 20504 3609 4313 5058 5363 5669
> 6429 6560 ...
>  $ teacher_unique_id: int  35078 41029 35078 41437 41476 41456 41486
> 35415 41508 35413 ...
> 
> The sample data are smaller in size than my actual data and 
> the structure is exactly the same. Do you see any other 
> reason why the memory issue would arise here?
> 
> Harold
> 
> 
> 
>  
> 
> > -----Original Message-----
> > From: ONKELINX, Thierry [mailto:Thierry.ONKELINX at inbo.be]
> > Sent: Friday, February 27, 2009 10:24 AM
> > To: Doran, Harold; r-help at r-project.org
> > Subject: RE: [R] Making tapply code more efficient
> > 
> > Hi Harold,
> > 
> > What about this? You one have to make the crosstabulation once.
> > 
> > > qq <- data.frame(student = factor(c(1,1,2,2,2)), teacher >
> factor(c(10,10,20,20,25)))
> > > tab <- table(qq$student, qq$teacher) data.frame(Student = 
> > > rownames(tab), Freq = rowSums(tab), tch > > rowSums(tab
> 0) == 1)
> >   Student Freq   tch
> > 1       1    2  TRUE
> > 2       2    3 FALSE
> > 
> > HTH,
> > 
> > Thierry
> > 
> > 
> > --------------------------------------------------------------
> > ----------
> > ----
> > ir. Thierry Onkelinx
> > Instituut voor natuur- en bosonderzoek / Research Institute 
> for Nature 
> > and Forest Cel biometrie, methodologie en kwaliteitszorg / Section 
> > biometrics, methodology and quality assurance Gaverstraat 4 9500 
> > Geraardsbergen Belgium tel. + 32
> > 54/436 185 Thierry.Onkelinx at inbo.be www.inbo.be
> > 
> > To call in the statistician after the experiment is done may be no 
> > more than asking him to perform a post-mortem
> > examination: he may be able to say what the experiment died of.
> > ~ Sir Ronald Aylmer Fisher
> > 
> > The plural of anecdote is not data.
> > ~ Roger Brinner
> > 
> > The combination of some data and an aching desire for an 
> answer does 
> > not ensure that a reasonable answer can be extracted from a 
> given body 
> > of data.
> > ~ John Tukey
> > 
> > -----Oorspronkelijk bericht-----
> > Van: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org]
> > Namens Doran, Harold
> > Verzonden: vrijdag 27 februari 2009 15:47
> > Aan: r-help at r-project.org
> > Onderwerp: [R] Making tapply code more efficient
> > 
> > Previously, I posed the question pasted down below to the list and 
> > received some very helpful responses. While the code suggestions 
> > provided in response indeed work, they seem to only work 
> with *very* 
> > small data sets and so I wanted to follow up and see if anyone had 
> > ideas for better efficiency.
> > I was quite embarrased on this as our SAS programmers cranked out 
> > programs that did this in the blink of an eye (with a few 
> variables), 
> > but R was spinning for days on my Ubuntu machine and 
> ultimately I saw 
> > a message that R was "killed".
> > 
> > The data I am working with has 800967 total rows and 31 
> total columns.
> > The ID variable I use as the index variable in tapply() has
> > 326397 unique cases.
> > 
> > > length(unique(qq$student_unique_id))
> > [1] 326397
> > 
> > To give a sense of what my data look like and the actual problem, 
> > consider the following:
> > 
> > qq <- data.frame(student_unique_id = factor(c(1,1,2,2,2)), 
> > teacher_unique_id = factor(c(10,10,20,20,25)))
> > 
> > This is a student achievement database where students 
> occupy multiple 
> > rows in the data and the variable teacher_unique_id denotes 
> the class 
> > the student was in. What I am doing is looking to see if 
> the teacher 
> > is the same for each instance of the unique student ID. So, if I 
> > implement the following:
> > 
> > same <- function(x) length( unique(x) ) == 1 results <-
data.frame(
> >         freq = tapply(qq$student_unique_id, qq$student_unique_id, 
> > length),
> >         tch = tapply(qq$teacher_unique_id, 
> qq$student_unique_id, same)
> > )
> > 
> > I get the following results. I can see that student 1 
> appears in the 
> > data twice and the teacher is always the same.
> > However, student 2 appears three times and the teacher is 
> not always 
> > the same.
> > 
> > > results
> >   freq   tch
> > 1    2  TRUE
> > 2    3 FALSE
> > 
> > Now, implementing this same procedure to a large data set with the 
> > characteristics described above seems to be problematic in this 
> > implementation.
> > 
> > Does anyone have reactions on how this could be more efficient such 
> > that it can run with large data as I described?
> > 
> > Harold
> > 
> > > sessionInfo()
> > R version 2.8.1 (2008-12-22)
> > x86_64-pc-linux-gnu
> > 
> > locale:
> > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLA
> > TE=en_US.U
> > TF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-
> > 8;LC_NAME> >
C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_ID
> > ENTIFICATI
> > ON=C
> > 
> > attached base packages:
> > [1] stats     graphics  grDevices utils     datasets  methods   base
> > 
> > 
> > 
> > 
> > ##### Original question posted on 1/13/09 Suppose I have a 
> dataframe 
> > as follows:
> > 
> > dat <- data.frame(id = c(1,1,2,2,2), var1 = 
> c(10,10,20,20,25), var2 = 
> > c('foo', 'foo', 'foo', 'foobar',
'foo'))
> > 
> > Now, if I were to subset by id, such as:
> > 
> > > subset(dat, id==1)
> >   id var1 var2
> > 1  1   10  foo
> > 2  1   10  foo
> > 
> > I can see that the elements in var1 are exactly the same and the 
> > elements in var2 are exactly the same. However,
> > 
> > > subset(dat, id==2)
> >   id var1   var2
> > 3  2   20    foo
> > 4  2   20 foobar
> > 5  2   25    foo
> > 
> > Shows the elements are not the same for either variable in this 
> > instance. So, what I am looking to create is a data frame 
> that would 
> > be like this
> > 
> > id      freq    var1    var2
> > 1       2       TRUE    TRUE   
> > 2       3       FALSE   FALSE
> > 
> > Where freq is the number of times the ID is repeated in the 
> dataframe. 
> > A TRUE appears in the cell if all elements in the column 
> are the same 
> > for the ID and FALSE otherwise. It is insignificant which values 
> > differ for my problem.
> > 
> > The way I am thinking about tackling this is to loop through the ID 
> > variable and compare the values in the various columns of the 
> > dataframe.
> > The problem I am encountering is that I don't think all.equal or 
> > identical are the right functions in this case.
> > 
> > So, say I was wanting to compare the elements of var1 for id ==1. I 
> > would have
> > 
> > x <- c(10,10)
> > 
> > Of course, the following works
> > 
> > > all.equal(x[1], x[2])
> > [1] TRUE
> > 
> > As would a similar call to identical. However, what if I 
> only have a 
> > vector of values (or if the column consists of names) that 
> I want to 
> > assess for equality when I am trying to automate a process over 
> > thousands of cases? As in the example above, the vector may contain 
> > only two values or it may contain many more. The number of 
> values in 
> > the vector differ by id.
> > 
> > Any thoughts?
> > 
> > Harold
> > 
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> > 
> > Dit bericht en eventuele bijlagen geven enkel de visie van de 
> > schrijver weer en binden het INBO onder geen enkel beding, 
> zolang dit 
> > bericht niet bevestigd is door een geldig ondertekend document. The 
> > views expressed in  this message and any annex are purely 
> those of the 
> > writer and may not be regarded as stating an official position of 
> > INBO, as long as the message is not confirmed by a duly signed 
> > document.
> > 
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Apparently Analagous Threads

Search for more seemingly similar threads

R help - Feb 2009 - Making tapply code more efficient

[R] Making tapply code more efficient

[R] Making tapply code more efficient

[R] Making tapply code more efficient

[R] Making tapply code more efficient

Apparently Analagous Threads