Bert Gunter
2022-Oct-21 23:20 UTC
[R] getting data from a "vertical" table into a "2-dimensional" grid
"As my end result, I want a matrix or data frame, with one row for each year, and one column for each category." If I understand you correctly, no reshaping gymnastics are needed -- just use ?tapply: set.seed(1) do <- data.frame(year = rep(1990:1999, length = 50), category = sample(1:5, size = 50, replace = TRUE), sales = sample(0:99999, size = 50 , replace = TRUE) ) with(do, tapply(sales, list(year, category),sum)) ## which gives the matrix: 1 2 3 4 5 1990 13283 NA 55083 87522 64877 1991 NA 80963 NA 30100 28277 1992 9391 202916 NA 55090 NA 1993 29696 167344 NA NA 17625 1994 98015 99521 NA 70536 52252 1995 157003 NA 26875 NA 11366 1996 32986 88683 6562 79475 95282 1997 13601 NA 134757 12398 NA 1998 30537 51117 31333 20204 NA 1999 39240 87845 62479 NA 98804 If this is not what you wanted, you may need to explain further or await a response from someone more insightful than I. Cheers, Bert On Fri, Oct 21, 2022 at 3:34 PM Kelly Thompson <kt1572757 at gmail.com> wrote:> > As my end result, I want a matrix or data frame, with one row for each > year, and one column for each category. > > On Fri, Oct 21, 2022 at 6:23 PM Kelly Thompson <kt1572757 at gmail.com> wrote: > > > > # I think this might be a better example. > > > > # I have data presented in a "vertical" dataframe as shown below in > > data_original. > > # I want this data in a matrix or "grid", as shown below. > > # What I show below seems like one way this can be done. > > > > # My question: Are there easier or better ways to do this, especially > > in Base R, and also in R packages? > > > > #create data > > set.seed(1) > > data_original <- data.frame(year = rep(1990:1999, length = 50), > > category = sample(1:5, size = 50, replace = TRUE), sales > > sample(0:99999, size = 50 , replace = TRUE) ) > > dim(data_original) > > > > #remove rows where data_original$year == 1990 & data_original$category > > == 5, to ensure there is at least one NA in the "grid" > > data_original <- data_original[ (data_original$year == 1990 & > > data_original$category == 5) == FALSE, ] > > dim(data_original) > > > > #aggregate data > > data_aggregate_sum_by_year_and_category <- aggregate(x > > data_original$sales, by = list(year = data_original$year, category > > data_original$category), FUN = sum) > > colnames(data_aggregate_sum_by_year_and_category) <- c('year', > > 'category', 'sum_of_sales') > > dim(data_aggregate_sum_by_year_and_category) > > > > data_expanded <- expand.grid(year > > unique(data_aggregate_sum_by_year_and_category$year), category > > unique(data_aggregate_sum_by_year_and_category$category)) > > dim(data_expanded) > > data_expanded <- merge(data_expanded, > > data_aggregate_sum_by_year_and_category, all = TRUE) > > dim(data_expanded) > > > > mat <- matrix(data = data_expanded$sum_of_sales, nrow > > length(unique(data_expanded$year)), ncol > > length(unique(data_expanded$category)) , byrow = TRUE, dimnames > > list( unique(data_expanded$year), unique(data_expanded$category) ) ) > > > > > > data_original > > data_expanded > > mat > > > > On Fri, Oct 21, 2022 at 5:03 PM Kelly Thompson <kt1572757 at gmail.com> wrote: > > > > > > ### > > > #I have data presented in a "vertical" data frame as shown below in > > > data_original. > > > #I want this data in a matrix or "grid", as shown below. > > > #What I show below seems like one way this can be done. > > > > > > #My question: Are there easier or better ways to do this, especially > > > in Base R, and also in R packages? > > > > > > #reproducible example > > > > > > data_original <- data.frame(year = c('1990', '1999', '1990', '1989'), > > > size = c('s', 'l', 'xl', 'xs'), n = c(99, 33, 3, 4) ) > > > > > > data_expanded <- expand.grid(unique(data_original$year), > > > unique(data_original$size), stringsAsFactors = FALSE ) > > > colnames(data_expanded) <- c('year', 'size') > > > data_expanded <- merge(data_expanded, data_original, all = TRUE) > > > > > > mat <- matrix(data = data_expanded $n, nrow > > > length(unique(data_expanded $year)), ncol > > > length(unique(data_expanded $size)) , byrow = TRUE, dimnames = list( > > > unique(data_expanded$year), unique(data_expanded$size) ) ) > > > > > > data_original > > > data_expanded > > > mat > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Kelly Thompson
2022-Oct-22 00:13 UTC
[R] getting data from a "vertical" table into a "2-dimensional" grid
Bert, Thanks! I'm pretty sure what you provided gets me to what I was looking for, and is much simpler. I really appreciate your help. A follow-up question: I adjusted the code to not use "hard-coded" column names. mat2 <- with(data_original, tapply( get(names(data_original)[3]), list( get(names(data_original)[1]), get(names(data_original)[2])), sum )) Is there any better way to write that? Thanks again! ----- For clarity and to improve upon what I previously wrote, and so I can practice writing questions like this and asking for help, here's a recap of my question and "reproducible code", and the "better way" you provided: I have data presented in a 3-column data frame as shown below in "data_original". I want to aggregate the data in column 3, with the "by" argument using the first and second columns of "data_original". I want the results of the aggregation in a matrix, as shown below in "mat1". As my end "result", I want a matrix with one row for each unique value of column1 of data_original and one column for each unique value of column2 of data_original. What I show below seems like one way this can be done. My question: Are there easier or better ways to do this, especially in Base R, and also in R packages? #create data set.seed(1) data_original <- data.frame(year = rep(1990:1999, length = 50), category = sample(1:5, size = 50, replace = TRUE), sales sample(0:99999, size = 50 , replace = TRUE) ) dim(data_original) #remove rows where data_original[,1] == 1990 & data_original[,2] == 5, to ensure there is at least one NA in the desired matrix (this is an "edge" case I want the code to "deal with" correctly.) data_original <- data_original[ (data_original[,1] == 1990 & data_original[,2] == 5) == FALSE, ] dim(data_original) #aggregate data data_aggregate_col3_by_col1_and_col2 <- aggregate(x data_original[3], by = list(data_original[,1], data_original[,2]), FUN = sum) colnames(data_aggregate_col3_by_col1_and_col2) <- colnames(data_original) dim(data_aggregate_col3_by_col1_and_col2) data_expanded <- expand.grid(unique(data_aggregate_col3_by_col1_and_col2[,1]), unique(data_aggregate_col3_by_col1_and_col2[,2])) colnames(data_expanded) <- colnames(data_aggregate_col3_by_col1_and_col2)[1:2] dim(data_expanded) data_expanded <- merge(data_expanded, data_aggregate_col3_by_col1_and_col2, all = TRUE) dim(data_expanded) mat1 <- matrix(data = data_expanded[,3], nrow length(unique(data_expanded[,1])), ncol length(unique(data_expanded[,2])) , byrow = TRUE, dimnames = list( unique(data_expanded[,1]), unique(data_expanded[,2]) ) ) #this is an easier way, using with and tapply mat2 <- with(data_original, tapply( get(names(data_original)[3]), list( get(names(data_original)[1]), get(names(data_original)[2])), sum )) #check that mat1 and mat 2 are "nearly equal" all.equal(mat1, mat2) Gunter <bgunter.4567 at gmail.com> wrote:> > "As my end result, I want a matrix or data frame, with one row for each > year, and one column for each category." > > If I understand you correctly, no reshaping gymnastics are needed -- > just use ?tapply: > > set.seed(1) > do <- data.frame(year = rep(1990:1999, length = 50), > category = sample(1:5, size = 50, replace = TRUE), > sales = sample(0:99999, size = 50 , replace = TRUE) ) > > > with(do, tapply(sales, list(year, category),sum)) > ## which gives the matrix: > > 1 2 3 4 5 > 1990 13283 NA 55083 87522 64877 > 1991 NA 80963 NA 30100 28277 > 1992 9391 202916 NA 55090 NA > 1993 29696 167344 NA NA 17625 > 1994 98015 99521 NA 70536 52252 > 1995 157003 NA 26875 NA 11366 > 1996 32986 88683 6562 79475 95282 > 1997 13601 NA 134757 12398 NA > 1998 30537 51117 31333 20204 NA > 1999 39240 87845 62479 NA 98804 > > If this is not what you wanted, you may need to explain further or > await a response from someone more insightful than I. > > Cheers, > Bert > > > On Fri, Oct 21, 2022 at 3:34 PM Kelly Thompson <kt1572757 at gmail.com> wrote: > > > > As my end result, I want a matrix or data frame, with one row for each > > year, and one column for each category. > > > > On Fri, Oct 21, 2022 at 6:23 PM Kelly Thompson <kt1572757 at gmail.com> wrote: > > > > > > # I think this might be a better example. > > > > > > # I have data presented in a "vertical" dataframe as shown below in > > > data_original. > > > # I want this data in a matrix or "grid", as shown below. > > > # What I show below seems like one way this can be done. > > > > > > # My question: Are there easier or better ways to do this, especially > > > in Base R, and also in R packages? > > > > > > #create data > > > set.seed(1) > > > data_original <- data.frame(year = rep(1990:1999, length = 50), > > > category = sample(1:5, size = 50, replace = TRUE), sales > > > sample(0:99999, size = 50 , replace = TRUE) ) > > > dim(data_original) > > > > > > #remove rows where data_original$year == 1990 & data_original$category > > > == 5, to ensure there is at least one NA in the "grid" > > > data_original <- data_original[ (data_original$year == 1990 & > > > data_original$category == 5) == FALSE, ] > > > dim(data_original) > > > > > > #aggregate data > > > data_aggregate_sum_by_year_and_category <- aggregate(x > > > data_original$sales, by = list(year = data_original$year, category > > > data_original$category), FUN = sum) > > > colnames(data_aggregate_sum_by_year_and_category) <- c('year', > > > 'category', 'sum_of_sales') > > > dim(data_aggregate_sum_by_year_and_category) > > > > > > data_expanded <- expand.grid(year > > > unique(data_aggregate_sum_by_year_and_category$year), category > > > unique(data_aggregate_sum_by_year_and_category$category)) > > > dim(data_expanded) > > > data_expanded <- merge(data_expanded, > > > data_aggregate_sum_by_year_and_category, all = TRUE) > > > dim(data_expanded) > > > > > > mat <- matrix(data = data_expanded$sum_of_sales, nrow > > > length(unique(data_expanded$year)), ncol > > > length(unique(data_expanded$category)) , byrow = TRUE, dimnames > > > list( unique(data_expanded$year), unique(data_expanded$category) ) ) > > > > > > > > > data_original > > > data_expanded > > > mat > > > > > > On Fri, Oct 21, 2022 at 5:03 PM Kelly Thompson <kt1572757 at gmail.com> wrote: > > > > > > > > ### > > > > #I have data presented in a "vertical" data frame as shown below in > > > > data_original. > > > > #I want this data in a matrix or "grid", as shown below. > > > > #What I show below seems like one way this can be done. > > > > > > > > #My question: Are there easier or better ways to do this, especially > > > > in Base R, and also in R packages? > > > > > > > > #reproducible example > > > > > > > > data_original <- data.frame(year = c('1990', '1999', '1990', '1989'), > > > > size = c('s', 'l', 'xl', 'xs'), n = c(99, 33, 3, 4) ) > > > > > > > > data_expanded <- expand.grid(unique(data_original$year), > > > > unique(data_original$size), stringsAsFactors = FALSE ) > > > > colnames(data_expanded) <- c('year', 'size') > > > > data_expanded <- merge(data_expanded, data_original, all = TRUE) > > > > > > > > mat <- matrix(data = data_expanded $n, nrow > > > > length(unique(data_expanded $year)), ncol > > > > length(unique(data_expanded $size)) , byrow = TRUE, dimnames = list( > > > > unique(data_expanded$year), unique(data_expanded$size) ) ) > > > > > > > > data_original > > > > data_expanded > > > > mat > > > > ______________________________________________ > > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code.