Clay Heaton
2010-Mar-22 13:27 UTC
[R] Accessing data in groups created with split() and other beginner questions
Hi, very new to R here...

I have a data frame called 'set' with 100k+ rows in it that looks like this:

  subject           timestamp yvalue traceabs subjtrace
1       1 1992-07-12 06:05:00     12        1       1-1
2       1 1992-07-12 06:10:00     15        1       1-1
3       1 1992-07-12 06:15:00     17        1       1-1
4       1 1992-07-12 06:20:00     20        1       1-1
5       1 1992-07-12 06:25:00     24        1       1-1
....

There are 89 subjects, each of which has a different number of traces. It's time series data. There are, in total, around 180 traces. The "subjtrace" variable is just a concatenation of the subject number, a hyphen, and the relative trace number. For instance, the first trace for subject 46 is "46-1" but the traceabs value for the same trace is 71.

I need to perform simple statistics on each subject and on each trace. I also need to graph each trace.

It seems like the easy approach to identifying the variables would be to use the split() function to create groups:

> temp <- split(set, set$subject)

When I then try, for example:

> summary(temp[1])

all I get as a result is:

  Length Class      Mode
1 5      data.frame list

So I went with:

> lapply(temp[1], summary)

That works, but I'm unable to do something like:

> lapply(temp[1]$yvalue, mean)

because the result returned is:

list()

Ultimately, I'm trying to run the exact same code on each group, as defined by the subject number, and on each trace. I would like to display something like the following:

Subject # and Summary Statistics
-- Graph of a trace belonging to the subject
-- Summary statistics for the trace
-- Graph of the next trace belonging to the subject
-- Summary statistics for the trace
-- etc...

My intention is to dump this all into a .pdf file with Sweave and LaTeX.

Questions:

- Is split() the best function to use to create the proper groups? Or should I look to create a separate variable for each group using subset, like:
  temp.46 <- subset(set, subject==46, select=c(subject, timestamp, yvalue, subjtrace))

- How do I call functions on data within the groups created by split()? Like...
  lapply(temp[1]$yvalue, sd)

- In an effort to learn the proper way to approach this, what would be the best practice for iterating through the data and pushing it to .pdf?

Thanks!
Benilton Carvalho
2010-Mar-22 13:33 UTC
[R] Accessing data in groups created with split() and other beginner questions
To access elements of a list (the object returned by split()), you need to use "[[". Single brackets, as in temp[1], return a one-element list rather than the data frame itself. Therefore,

summary(temp[[1]])

is what you meant to use, or even

summ = lapply(temp, summary)

which will give you the summaries for every subject.

About using PDFs, I'd recommend taking a look at Sweave ( http://www.statistik.lmu.de/~leisch/Sweave/ ).

b
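Editor's sketch of the points above, not part of the original reply. It uses a small made-up data frame in place of the poster's 'set' (the values, the two-subject layout, and names such as subj_means, trace_means and out_file are assumptions for illustration). It shows the difference between "[" and "[[" on the result of split(), one way to run the same function on every subject or trace, and one possible way to write one plot per trace to a PDF with the base pdf() device. Inside a Sweave document the plotting code would sit in a figure chunk instead.

```r
# Made-up stand-in for the poster's 'set'; values are illustrative only.
set <- data.frame(
  subject   = c(1, 1, 1, 1, 2, 2, 2, 2),
  timestamp = as.POSIXct("1992-07-12 06:05:00", tz = "UTC") + 300 * (0:7),
  yvalue    = c(12, 15, 17, 20, 24, 30, 28, 26),
  subjtrace = c("1-1", "1-1", "1-2", "1-2", "2-1", "2-1", "2-1", "2-1"),
  stringsAsFactors = FALSE
)

# One data frame per subject, stored in a named list.
temp <- split(set, set$subject)

class(temp[1])    # single brackets: still a list (of length one)
class(temp[[1]])  # double brackets: the data frame for the first subject

# temp[1] has no element called "yvalue", so this is NULL,
# which is why lapply(temp[1]$yvalue, mean) came back as list().
is.null(temp[1]$yvalue)

# Statistics for one subject, then the same function for every subject.
mean(temp[[1]]$yvalue)
subj_means <- sapply(temp, function(d) mean(d$yvalue))
subj_sds   <- sapply(temp, function(d) sd(d$yvalue))

# Per-trace statistics without splitting first.
trace_means <- tapply(set$yvalue, set$subjtrace, mean)

# One page per trace, grouped by subject, written to a temporary PDF.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
for (s in names(temp)) {
  # drop = TRUE skips traces that do not occur for this subject
  traces <- split(temp[[s]], temp[[s]]$subjtrace, drop = TRUE)
  for (tr in names(traces)) {
    d <- traces[[tr]]
    plot(d$timestamp, d$yvalue, type = "l",
         main = paste("Subject", s, "trace", tr),
         xlab = "timestamp", ylab = "yvalue")
  }
}
dev.off()
```

Splitting by subject and then by subjtrace inside the loop mirrors the "subject, then each of its traces" layout described in the question. split(set, set$subjtrace) alone would also work if only per-trace output is needed.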