Hi all,

I'm working with a sizable dataset that I'd like to summarize, but I
can't find a tool or function that will do quite what I'd like.
Basically, I'd like to summarize the data by fully crossing three
variables and getting a count of the number of observations for every
level of that 3-way interaction. For example, if factors A, B, and C
each have 3 levels (all of which were observed someplace in the
dataset), I'd like to know how many times A1, B1, and C1 co-occurred in
the dataset.

Functions like aggregate and summaryBy do a decent job when I sum a
vector of ones of the same length as the original dataset, but I'm
getting stuck on the fact that neither will return 0-count combinations
of the three variables in question. I understand that this is a
desirable outcome (if A1, B1, C2 didn't occur, it shouldn't be counted
and isn't), but I need to know both when these combinations of factors
did and did not occur. I'm stuck on this one, and would really
appreciate any help. Thanks in advance!

Dave Warren

PS A functional solution would be best; the original dataset contains
about 2.3 million observations, so any looping is going to be very slow.

--
Post-doctoral Fellow
Neurology Department
University of Iowa Hospitals and Clinics
davideugenewarren@gmail.com
You don't offer a reproducible example, but what do you need that
table() doesn't provide?

testdata <- data.frame(A=factor(sample(1:3, 20, replace=TRUE)),
                       B=factor(sample(1:3, 20, replace=TRUE)),
                       C=factor(sample(1:3, 20, replace=TRUE)))
table(testdata)

Sarah

--
Sarah Goslee
http://www.functionaldiversity.org
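To make the table() suggestion concrete for the three-factor case in the original question, here is a small sketch. The data frame `dat` and its level names (A1, B1, C1, ...) are made up for illustration and are not from the poster's dataset; the only assumption is that the three columns are factors whose levels are all declared.

```r
# Hypothetical data (not from the thread): three factors, three levels each.
dat <- data.frame(
  A = factor(c("A1", "A1", "A2", "A3", "A3"), levels = c("A1", "A2", "A3")),
  B = factor(c("B1", "B1", "B2", "B3", "B1"), levels = c("B1", "B2", "B3")),
  C = factor(c("C1", "C1", "C3", "C2", "C2"), levels = c("C1", "C2", "C3"))
)

# table() crosses all declared levels, so the result is a 3 x 3 x 3 array
# that has a cell for every combination, observed or not.
tab <- table(dat)
dim(tab)

# Individual cells can be looked up by level name.
tab["A1", "B1", "C1"]   # a combination that occurs in dat
tab["A1", "B1", "C2"]   # a combination that never occurs in dat

# Number of combinations that were never observed.
sum(tab == 0)
```

Because the counts live in an array indexed by level names, checking whether a particular combination occurred is a plain lookup rather than a search through aggregated rows.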
Have you tried using table()? E.g.,

  > df <- data.frame(x=c("A","A","B","C"), y=c("ii","ii","i","ii"), Age=2^(1:4))
  > tab <- do.call("table", df[c("x","y")])
  > tab
     y
  x   i ii
    A 0  2
    B 1  0
    C 0  1
  > as.data.frame(tab)
    x  y Freq
  1 A  i    0
  2 B  i    1
  3 C  i    0
  4 A ii    2
  5 B ii    0
  6 C ii    1
  > str(.Last.value)
  'data.frame':   6 obs. of  3 variables:
   $ x   : Factor w/ 3 levels "A","B","C": 1 2 3 1 2 3
   $ y   : Factor w/ 2 levels "i","ii": 1 1 1 2 2 2
   $ Freq: int  0 1 0 2 0 1

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of David Warren
> Sent: Thursday, July 28, 2011 1:25 PM
> To: r-help at r-project.org
> Subject: [R] Data aggregation question
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
On Jul 28, 2011, at 4:24 PM, David Warren wrote:

> Functions like aggregate and summaryBy do a decent job when I sum a
> vector of ones of the same length as the original dataset, but I'm
> getting stuck on the fact that neither will return 0-count
> combinations of the three variables in question.

I think that may depend on what functions and arguments you use.

> I understand that this is a desirable outcome (if A1, B1, C2 didn't
> occur, it shouldn't be counted and isn't), but I need to know both
> when these combinations of factors did and did not occur. I'm stuck
> on this one, and would really appreciate any help. Thanks in advance!

?xtabs

> PS A functional solution would be best; the original dataset
> contains about 2.3 million observations, so any looping is going to
> be very slow.

In general tabulations like these are very efficient.

--
David Winsemius, MD
West Hartford, CT
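As a rough illustration of the ?xtabs pointer, the sketch below builds a made-up data frame in which one combination (A1, B1, C2) is deliberately absent, then tabulates it and flattens the result to one row per combination. The names `dat` and `counts` and the level labels are assumptions for the example, not anything from the original dataset, and the row count is far smaller than the 2.3 million mentioned in the question.

```r
# Hypothetical data: every combination of three 3-level factors ...
dat <- expand.grid(A = paste0("A", 1:3),
                   B = paste0("B", 1:3),
                   C = paste0("C", 1:3))

# ... except A1/B1/C2, which is removed so that it has a true count of zero.
# Subsetting rows keeps the factor levels, so A1, B1 and C2 are still known.
dat <- dat[!(dat$A == "A1" & dat$B == "B1" & dat$C == "C2"), ]

# Repeat the remaining 26 rows 1000 times to get 26,000 observations.
dat <- dat[rep(seq_len(nrow(dat)), times = 1000), ]

# Cross the three factors; no explicit loop and no vector of ones is needed.
tab <- xtabs(~ A + B + C, data = dat)

# Long format: one row per combination, with the count in column Freq.
counts <- as.data.frame(tab)
nrow(counts)

# The combinations that never occurred.
subset(counts, Freq == 0)
```

The long-format result has the same shape as the output of an aggregate-style summary, except that unobserved combinations appear with a Freq of 0 instead of being dropped.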