thr3ads.net - R help - [R] data manipulation and summaries with few million rows [Aug 2011]

If this information is useful, please help other people find it:
Share via:

Juliet Hannah

2011-Aug-24 16:48 UTC

[R] data manipulation and summaries with few million rows

I have a data set with about 6 million rows and 50 columns. It is a
mixture of dates, factors, and numerics.

What I am trying to accomplish can be seen with the following
simplified data, which is given as dput output below.
> head(myData)      mydate gender mygroup id
1 2012-03-25      F       A  1
2 2005-05-23      F       B  2
3 2005-09-08      F       B  2
4 2005-12-07      F       B  2
5 2006-02-26      F       C  2
6 2006-05-13      F       C  2

For each id, I want to count the number of changes of the variable
'mygroup' that occur. For example, id=1 has 0 changes because it is
observed only once.  id=2 has 2 changes (B to C, and C to D).  I also
need to calculate the total observation time for each id using the
variable mydate.  In the end, I am trying to have a new data set in
which each row has an id, days observed, number of changes, and
gender.

I made some simple summaries using data.table and plyr, but I'm stuck
on this reformatting.

Thanks for your help.

myData <- structure(list(mydate = c("2012-03-25",
"2005-05-23", "2005-09-08",
"2005-12-07", "2006-02-26", "2006-05-13",
"2006-09-01", "2006-12-12",
"2006-02-19", "2006-05-03", "2006-04-23",
"2007-12-08", "2011-03-19",
"2007-12-20", "2008-06-15", "2008-12-16",
"2009-06-07", "2009-10-09",
"2010-01-28", "2007-06-05"), gender = c("F",
"F", "F", "F", "F",
"F", "F", "F", "F", "F",
"F", "F", "F", "M", "M",
"M", "M", "M",
"M", "M"), mygroup = c("A", "B",
"B", "B", "C", "C", "C",
"D",
"D", "D", "D", "D", "D",
"A", "A", "A", "B", "B",
"B", "A"),
    id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
    3L, 3L, 3L, 3L, 3L, 3L, 4L)), .Names = c("mydate",
"gender",
"mygroup", "id"), class = "data.frame", row.names
= c(NA, -20L
))
> sessionInfo()R version 2.13.1 (2011-07-08)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

Dennis Murphy

2011-Aug-24 17:28 UTC

head link

[R] data manipulation and summaries with few million rows

Hi Juliet:

Here's a Q & D solution:

# (1) plyr> f <- function(d) length(unique(d$mygroup)) - 1
> ddply(myData, .(id), f)  id V1
1  1  0
2  2  2
3  3  1
4  4  0

# (2) data.table

myDT <- data.table(myData, key = 'id')
myDT[, list(nswitch = length(unique(mygroup)) - 1), by = 'id']

If one can switch back and forth between levels more than once, then
the above is clearly not appropriate. A more robust method would be to
employ rle() [run length encoding]:

g <- function(d) length(rle(d$mygroup)$lengths) - 1
ddply(myData, .(id), g)    # gives the same answer as above
myDT[, list(nswitch = length(rle(mygroup)$lengths) - 1), by = 'id']   #
ditto


HTH,
Dennis

On Wed, Aug 24, 2011 at 9:48 AM, Juliet Hannah <juliet.hannah at
gmail.com> wrote:> I have a data set with about 6 million rows and 50 columns. It is a
> mixture of dates, factors, and numerics.
>
> What I am trying to accomplish can be seen with the following
> simplified data, which is given as dput output below.
>
>> head(myData)
> ? ? ?mydate gender mygroup id
> 1 2012-03-25 ? ? ?F ? ? ? A ?1
> 2 2005-05-23 ? ? ?F ? ? ? B ?2
> 3 2005-09-08 ? ? ?F ? ? ? B ?2
> 4 2005-12-07 ? ? ?F ? ? ? B ?2
> 5 2006-02-26 ? ? ?F ? ? ? C ?2
> 6 2006-05-13 ? ? ?F ? ? ? C ?2
>
> For each id, I want to count the number of changes of the variable
> 'mygroup' that occur. For example, id=1 has 0 changes because it is
> observed only once. ?id=2 has 2 changes (B to C, and C to D). ?I also
> need to calculate the total observation time for each id using the
> variable mydate. ?In the end, I am trying to have a new data set in
> which each row has an id, days observed, number of changes, and
> gender.
>
> I made some simple summaries using data.table and plyr, but I'm stuck
> on this reformatting.
>
> Thanks for your help.
>
> myData <- structure(list(mydate = c("2012-03-25",
"2005-05-23", "2005-09-08",
> "2005-12-07", "2006-02-26", "2006-05-13",
"2006-09-01", "2006-12-12",
> "2006-02-19", "2006-05-03", "2006-04-23",
"2007-12-08", "2011-03-19",
> "2007-12-20", "2008-06-15", "2008-12-16",
"2009-06-07", "2009-10-09",
> "2010-01-28", "2007-06-05"), gender = c("F",
"F", "F", "F", "F",
> "F", "F", "F", "F", "F",
"F", "F", "F", "M", "M",
"M", "M", "M",
> "M", "M"), mygroup = c("A", "B",
"B", "B", "C", "C", "C",
"D",
> "D", "D", "D", "D", "D",
"A", "A", "A", "B", "B",
"B", "A"),
> ? ?id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
> ? ?3L, 3L, 3L, 3L, 3L, 3L, 4L)), .Names = c("mydate",
"gender",
> "mygroup", "id"), class = "data.frame",
row.names = c(NA, -20L
> ))
>
>> sessionInfo()
> R version 2.13.1 (2011-07-08)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> ?[1] LC_CTYPE=en_US.UTF-8 ? ? ? LC_NUMERIC=C
> ?[3] LC_TIME=en_US.UTF-8 ? ? ? ?LC_COLLATE=en_US.UTF-8
> ?[5] LC_MONETARY=C ? ? ? ? ? ? ?LC_MESSAGES=en_US.UTF-8
> ?[7] LC_PAPER=en_US.UTF-8 ? ? ? LC_NAME=C
> ?[9] LC_ADDRESS=C ? ? ? ? ? ? ? LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

jim holtman

2011-Aug-27 22:52 UTC

head link

[R] data manipulation and summaries with few million rows

Factors are you friend here:
> myData       mydate gender mygroup id mygrp.f
1  2012-03-25      F       A  1       1
2  2005-05-23      F       B  2       2
3  2005-09-08      F       B  2       2
4  2005-12-07      F       B  2       2
5  2006-02-26      F       C  2       3
6  2006-05-13      F       C  2       3
7  2006-09-01      F       C  2       3
8  2006-12-12      F       D  2       4
9  2006-02-19      F       D  2       4
10 2006-05-03      F       D  2       4
11 2006-04-23      F       D  2       4
12 2007-12-08      F       D  2       4
13 2011-03-19      F       D  2       4
14 2007-12-20      M       A  3       1
15 2008-06-15      M       A  3       1
16 2008-12-16      M       A  3       1
17 2009-06-07      M       B  3       2
18 2009-10-09      M       B  3       2
19 2010-01-28      M       B  3       2
20 2007-06-05      M       A  4       1> # change 'mygroup' to a factor so you can use 'diff' to
count the changes
> myData$mygrp.f <- as.integer(factor(myData$mygroup))
> # count the changes for each 'id'
> changes <- tapply(myData$mygrp.f, myData$id, function(x){+     sum(diff(x) != 0)
+ })>
>
> changes1 2 3 4
0 2 1 0>

On Wed, Aug 24, 2011 at 12:48 PM, Juliet Hannah <juliet.hannah at
gmail.com> wrote:> I have a data set with about 6 million rows and 50 columns. It is a
> mixture of dates, factors, and numerics.
>
> What I am trying to accomplish can be seen with the following
> simplified data, which is given as dput output below.
>
>> head(myData)
> ? ? ?mydate gender mygroup id
> 1 2012-03-25 ? ? ?F ? ? ? A ?1
> 2 2005-05-23 ? ? ?F ? ? ? B ?2
> 3 2005-09-08 ? ? ?F ? ? ? B ?2
> 4 2005-12-07 ? ? ?F ? ? ? B ?2
> 5 2006-02-26 ? ? ?F ? ? ? C ?2
> 6 2006-05-13 ? ? ?F ? ? ? C ?2
>
> For each id, I want to count the number of changes of the variable
> 'mygroup' that occur. For example, id=1 has 0 changes because it is
> observed only once. ?id=2 has 2 changes (B to C, and C to D). ?I also
> need to calculate the total observation time for each id using the
> variable mydate. ?In the end, I am trying to have a new data set in
> which each row has an id, days observed, number of changes, and
> gender.
>
> I made some simple summaries using data.table and plyr, but I'm stuck
> on this reformatting.
>
> Thanks for your help.
>
> myData <- structure(list(mydate = c("2012-03-25",
"2005-05-23", "2005-09-08",
> "2005-12-07", "2006-02-26", "2006-05-13",
"2006-09-01", "2006-12-12",
> "2006-02-19", "2006-05-03", "2006-04-23",
"2007-12-08", "2011-03-19",
> "2007-12-20", "2008-06-15", "2008-12-16",
"2009-06-07", "2009-10-09",
> "2010-01-28", "2007-06-05"), gender = c("F",
"F", "F", "F", "F",
> "F", "F", "F", "F", "F",
"F", "F", "F", "M", "M",
"M", "M", "M",
> "M", "M"), mygroup = c("A", "B",
"B", "B", "C", "C", "C",
"D",
> "D", "D", "D", "D", "D",
"A", "A", "A", "B", "B",
"B", "A"),
> ? ?id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
> ? ?3L, 3L, 3L, 3L, 3L, 3L, 4L)), .Names = c("mydate",
"gender",
> "mygroup", "id"), class = "data.frame",
row.names = c(NA, -20L
> ))
>
>> sessionInfo()
> R version 2.13.1 (2011-07-08)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> ?[1] LC_CTYPE=en_US.UTF-8 ? ? ? LC_NUMERIC=C
> ?[3] LC_TIME=en_US.UTF-8 ? ? ? ?LC_COLLATE=en_US.UTF-8
> ?[5] LC_MONETARY=C ? ? ? ? ? ? ?LC_MESSAGES=en_US.UTF-8
> ?[7] LC_PAPER=en_US.UTF-8 ? ? ? LC_NAME=C
> ?[9] LC_ADDRESS=C ? ? ? ? ? ? ? LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?

Reasonably Related Threads

Search for more possibly parallel threads

R help - Aug 2011 - data manipulation and summaries with few million rows

[R] data manipulation and summaries with few million rows

[R] data manipulation and summaries with few million rows

[R] data manipulation and summaries with few million rows

Reasonably Related Threads