Hi All, I need to spread a data.frame on 2 variables, e. g. "channel" and "unit". If I do it in two steps spreads keeps all cases that does not look like the one before although it contains the same values for a specific case. Here is what I have right now: -- cut -- test1$dummy <- 1 test2 <- spread(data = test1, key = 'channel', value = "dummy") test2 cat("First spread is OK!") test2$dummy <- 1 test3 <- spread(data = test2, key = 'unit', value = 'dummy') test1 # test2 test3 warning(paste0("Second spread is not OK cause spread does not merge cases\n", "with CustID 700 and 800 into one case,\n", "cause they have values on different variables,\n", "although the corresponding values of the cases with", "custID 700 and 800 are missing.")) cat("What I would like to have is:\n") target4 <- structure(list(custID = c(100, 200, 300, 500, 600, 700, 800, 900), `10` = c(1, NA, NA, NA, NA, NA, NA, NA), `20` = c(1, NA, NA, NA, NA, NA, NA, NA), `30` = c(NA, NA, NA, NA, NA, NA, 1, 1), `40` = c(NA, NA, NA, NA, 1, NA, 1, 1), `50` = c(NA, NA, 1, NA, NA, NA, 1, 1), `60` = c(NA, NA, NA, NA, NA, 1, NA, NA), `70` = c(NA, NA, NA, NA, NA, 1, NA, NA), `99` = c(NA, 1, NA, 1, NA, NA, NA, NA), `1000` = c(1, NA, NA, NA, NA, NA, 1, 1), `2000` = c(NA, NA, NA, NA, 1, 1, 1, NA), `3000` = c(NA, NA, 1, NA, NA, 1, NA, NA), `4000` = c(NA, NA, 1, NA, NA, NA, NA, NA), `6000` = c(NA, NA, NA, NA, 1, NA, NA, NA), `9999` = c(NA, 1, NA, 1, NA, NA, NA, NA)), .Names = c("custID", "10", "20", "30", "40", "50", "60", "70", "99", "1000", "2000", "3000", "4000", "6000", "9999"), row.names = c(NA, 8L), class = "data.frame") target4 cat("What would be a proper way to create target4 from test1?") -- cut -- What would be the proper way to create target4 from test1? Kind regards Georg
Hi Georg, it's hard to tell without a reproducible example. Should spread really merge elements? Does spread know anything about CustID? Maybe you need to make a useful key of the CustIDs first and spread on that? Maybe I'm all off, because I'm really just guessing. Best, Ulrik On Thu, 28 Jul 2016 at 12:36 <G.Maubach at weinwolf.de> wrote:> Hi All, > > I need to spread a data.frame on 2 variables, e. g. "channel" and "unit". > > If I do it in two steps spreads keeps all cases that does not look like > the one before although it contains the same values for a specific case. > > Here is what I have right now: > > -- cut -- > > test1$dummy <- 1 > test2 <- spread(data = test1, key = 'channel', value = "dummy") > test2 > cat("First spread is OK!") > > test2$dummy <- 1 > test3 <- spread(data = test2, key = 'unit', value = 'dummy') > > test1 > # test2 > test3 > warning(paste0("Second spread is not OK cause spread does not merge > cases\n", > "with CustID 700 and 800 into one case,\n", > "cause they have values on different variables,\n", > "although the corresponding values of the cases with", > "custID 700 and 800 are missing.")) > > cat("What I would like to have is:\n") > target4 <- structure(list(custID = c(100, 200, 300, 500, 600, 700, 800, > 900), > `10` = c(1, NA, NA, NA, NA, NA, NA, NA), > `20` = c(1, NA, NA, NA, NA, NA, NA, NA), > `30` = c(NA, NA, NA, NA, NA, NA, 1, 1), > `40` = c(NA, NA, NA, NA, 1, NA, 1, 1), > `50` = c(NA, NA, 1, NA, NA, NA, 1, 1), > `60` = c(NA, NA, NA, NA, NA, 1, NA, NA), > `70` = c(NA, NA, NA, NA, NA, 1, NA, NA), > `99` = c(NA, 1, NA, 1, NA, NA, NA, NA), > `1000` = c(1, NA, NA, NA, NA, NA, 1, 1), > `2000` = c(NA, NA, NA, NA, 1, 1, 1, NA), > `3000` = c(NA, NA, 1, NA, NA, 1, NA, NA), > `4000` = c(NA, NA, 1, NA, NA, NA, NA, NA), > `6000` = c(NA, NA, NA, NA, 1, NA, NA, NA), > `9999` = c(NA, 1, NA, 1, NA, NA, NA, NA)), > .Names = c("custID", > "10", "20", "30", "40", "50", "60", "70", "99", > "1000", "2000", "3000", "4000", "6000", "9999"), > row.names = c(NA, 8L), class = "data.frame") > > target4 > > cat("What would be a proper way to create target4 from test1?") > > -- cut -- > > What would be the proper way to create target4 from test1? > > Kind regards > > Georg > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
Hi Georg, it is difficult to figure out what happens between your expectation and the outcome if we cannot see a minimal dataset. Based on your description I did this library(tidyr) library(dplyr) test_df <- data_frame(channel = LETTERS[1:5], unit = letters[1:5], custID c(1:5), dummy = 1) test_df %>% spread(channel, dummy) %>% mutate(dummy = 1) %>% spread(unit, dummy) which seems to be working fine as I get wide data. If a combination is missing in the long form it will also be missing in the wide form. Maybe you are looking for something like this: channel_wide <- test_df %>% select(channel, custID) %>% spread(channel, custID) unit_wide <- test_df %>% select(unit, custID) %>% spread(unit, custID) bind_cols(channel_wide, unit_wide) Apologies for the HTML - it's gmail Best wishes, Ulrik On Thu, 28 Jul 2016 at 13:54 <G.Maubach at weinwolf.de> wrote:> Hi Ulrik, > > I have included a reproducable example. I ran the code and it did exactly > what I wanted to show you. > > You are right: the solution shall merge cases in the end cause the values > on the variables are either missing or the same. > > Example 1: Values are the same > If you look at 6 and 7 and variable 70 the value is 1 in both cases. This > is in this context the same information and cases 6 and 7 with custID can > be merged to 1 for variable 70. > > Example 2: Values are missing and not missing > If you look at cases 8 and 9 the value for case 8 at variable 40, 50 and > 2000 is missing whereas the variables 40, 50 and 2000 have all 1 for case > 9. Case 8 and 9 could be merged together cause the missing values are > overwritten what is correct in this case. > > The solution I am looking for is to transform the data from long into wide > form and keep all but missing value information. > > Did I explain my problem in a comprehensible way? Are there any further > questions? > > Kind regards > > Georg > > > > > > Von: Ulrik Stervbo <ulrik.stervbo at gmail.com> > An: G.Maubach at weinwolf.de, r-help at r-project.org, > Datum: 28.07.2016 12:59 > Betreff: Re: [R] Spread data.frame on 2 variables > > > > Hi Georg, > > it's hard to tell without a reproducible example. > > Should spread really merge elements? Does spread know anything about > CustID? Maybe you need to make a useful key of the CustIDs first and > spread on that? > > Maybe I'm all off, because I'm really just guessing. > > Best, > Ulrik > > On Thu, 28 Jul 2016 at 12:36 <G.Maubach at weinwolf.de> wrote: > Hi All, > > I need to spread a data.frame on 2 variables, e. g. "channel" and "unit". > > If I do it in two steps spreads keeps all cases that does not look like > the one before although it contains the same values for a specific case. > > Here is what I have right now: > > -- cut -- > > test1$dummy <- 1 > test2 <- spread(data = test1, key = 'channel', value = "dummy") > test2 > cat("First spread is OK!") > > test2$dummy <- 1 > test3 <- spread(data = test2, key = 'unit', value = 'dummy') > > test1 > # test2 > test3 > warning(paste0("Second spread is not OK cause spread does not merge > cases\n", > "with CustID 700 and 800 into one case,\n", > "cause they have values on different variables,\n", > "although the corresponding values of the cases with", > "custID 700 and 800 are missing.")) > > cat("What I would like to have is:\n") > target4 <- structure(list(custID = c(100, 200, 300, 500, 600, 700, 800, > 900), > `10` = c(1, NA, NA, NA, NA, NA, NA, NA), > `20` = c(1, NA, NA, NA, NA, NA, NA, NA), > `30` = c(NA, NA, NA, NA, NA, NA, 1, 1), > `40` = c(NA, NA, NA, NA, 1, NA, 1, 1), > `50` = c(NA, NA, 1, NA, NA, NA, 1, 1), > `60` = c(NA, NA, NA, NA, NA, 1, NA, NA), > `70` = c(NA, NA, NA, NA, NA, 1, NA, NA), > `99` = c(NA, 1, NA, 1, NA, NA, NA, NA), > `1000` = c(1, NA, NA, NA, NA, NA, 1, 1), > `2000` = c(NA, NA, NA, NA, 1, 1, 1, NA), > `3000` = c(NA, NA, 1, NA, NA, 1, NA, NA), > `4000` = c(NA, NA, 1, NA, NA, NA, NA, NA), > `6000` = c(NA, NA, NA, NA, 1, NA, NA, NA), > `9999` = c(NA, 1, NA, 1, NA, NA, NA, NA)), > .Names = c("custID", > "10", "20", "30", "40", "50", "60", "70", "99", > "1000", "2000", "3000", "4000", "6000", "9999"), > row.names = c(NA, 8L), class = "data.frame") > > target4 > > cat("What would be a proper way to create target4 from test1?") > > -- cut -- > > What would be the proper way to create target4 from test1? > > Kind regards > > Georg > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > >[[alternative HTML version deleted]]
G.Maubach at weinwolf.de
2016-Aug-02 13:13 UTC
[R] Antwort: Re: Re: Spread data.frame on 2 variables (SOLVED)
Hi Ulrik, many thanks for your help. The problem was that R regards a dataset with a combination like caseID custID channel unit 1 1000 10 10 2 1000 20 10 3 1000 20 30 as two diffrenet sets of cases: 1 set = case 1, 2 set = case 2 and 3 due to the different values of unit in case 3 value 30, althought all cases should be restructured based just on custID. To get a dataset like caseID custID channel -10 channel-20 unit-10 unit-30 1 1000 1 1 1 1 instead of caseID custID channel -10 channel-20 unit-10 unit-30 1 1000 1 1 1 NA 2 1000 NA 1 NA 1 I used the approach you suggested: 1. I created a subset of my data with the first variable to be restructured: d_temp1 <- dataset[ , c("custID", "channel")) 2. I deleted all the cases the were dupliates d_temp1 <- duplicated(d_temp1, c("custID", "channel") 3. I introduced a dummy variable delivering the values for the new variables created by dplyr:spread() d_temp1$dummy <- 1 4. Then I restructured the subset d_temp1 <- dplyr::spread(d_temp1, key_variable = "channel", value = d_temp1$dummy) 5. I repeaed steps 1 to 4 with the other variable "unit" (instead of "channel") creating a new dataset named d_temp2. 6. I deleted the variables used for restructuring in steps 1 to 5 "channel" and "unit" from the original dataset "dataset". dataset$channel <- NULL dataset$unit <- NULL 7. I checked if I still had duplicates duplicates <- duplicated(dataset, key_variable = c("Debitor")) sum(duplicates) # was 0 it this time 8. I merged the datasets back together dataset_2 <- merge(x = dataset, y = d_temp1, by.x = "Debitor", by.y = "Debitor", all.x = TRUE, all.y = TRUE) # leaving out all.y would be fine dataset_2 <- merge(x = dataset2, y = d_temp2, by.x = "Debitor", by.y = "Debitor", all.x = TRUE, all.y = TRUE) # leaving out all.y would be fine There might be a combination of commands and functions doing the same thing in one step but I find that this is clear, comprehensible and reproducable even at a later date or by other readers willing to use base R for their work. Many thanks again for your help. Kind regards Georg Von: Ulrik Stervbo <ulrik.stervbo at gmail.com> An: G.Maubach at weinwolf.de, R-help <r-help at r-project.org>, Datum: 28.07.2016 14:20 Betreff: Re: Re: [R] Spread data.frame on 2 variables Hi Georg, it is difficult to figure out what happens between your expectation and the outcome if we cannot see a minimal dataset. Based on your description I did this library(tidyr) library(dplyr) test_df <- data_frame(channel = LETTERS[1:5], unit = letters[1:5], custID = c(1:5), dummy = 1) test_df %>% spread(channel, dummy) %>% mutate(dummy = 1) %>% spread(unit, dummy) which seems to be working fine as I get wide data. If a combination is missing in the long form it will also be missing in the wide form. Maybe you are looking for something like this: channel_wide <- test_df %>% select(channel, custID) %>% spread(channel, custID) unit_wide <- test_df %>% select(unit, custID) %>% spread(unit, custID) bind_cols(channel_wide, unit_wide) Apologies for the HTML - it's gmail Best wishes, Ulrik On Thu, 28 Jul 2016 at 13:54 <G.Maubach at weinwolf.de> wrote: Hi Ulrik, I have included a reproducable example. I ran the code and it did exactly what I wanted to show you. You are right: the solution shall merge cases in the end cause the values on the variables are either missing or the same. Example 1: Values are the same If you look at 6 and 7 and variable 70 the value is 1 in both cases. This is in this context the same information and cases 6 and 7 with custID can be merged to 1 for variable 70. Example 2: Values are missing and not missing If you look at cases 8 and 9 the value for case 8 at variable 40, 50 and 2000 is missing whereas the variables 40, 50 and 2000 have all 1 for case 9. Case 8 and 9 could be merged together cause the missing values are overwritten what is correct in this case. The solution I am looking for is to transform the data from long into wide form and keep all but missing value information. Did I explain my problem in a comprehensible way? Are there any further questions? Kind regards Georg Von: Ulrik Stervbo <ulrik.stervbo at gmail.com> An: G.Maubach at weinwolf.de, r-help at r-project.org, Datum: 28.07.2016 12:59 Betreff: Re: [R] Spread data.frame on 2 variables Hi Georg, it's hard to tell without a reproducible example. Should spread really merge elements? Does spread know anything about CustID? Maybe you need to make a useful key of the CustIDs first and spread on that? Maybe I'm all off, because I'm really just guessing. Best, Ulrik On Thu, 28 Jul 2016 at 12:36 <G.Maubach at weinwolf.de> wrote: Hi All, I need to spread a data.frame on 2 variables, e. g. "channel" and "unit". If I do it in two steps spreads keeps all cases that does not look like the one before although it contains the same values for a specific case. Here is what I have right now: -- cut -- test1$dummy <- 1 test2 <- spread(data = test1, key = 'channel', value = "dummy") test2 cat("First spread is OK!") test2$dummy <- 1 test3 <- spread(data = test2, key = 'unit', value = 'dummy') test1 # test2 test3 warning(paste0("Second spread is not OK cause spread does not merge cases\n", "with CustID 700 and 800 into one case,\n", "cause they have values on different variables,\n", "although the corresponding values of the cases with", "custID 700 and 800 are missing.")) cat("What I would like to have is:\n") target4 <- structure(list(custID = c(100, 200, 300, 500, 600, 700, 800, 900), `10` = c(1, NA, NA, NA, NA, NA, NA, NA), `20` = c(1, NA, NA, NA, NA, NA, NA, NA), `30` = c(NA, NA, NA, NA, NA, NA, 1, 1), `40` = c(NA, NA, NA, NA, 1, NA, 1, 1), `50` = c(NA, NA, 1, NA, NA, NA, 1, 1), `60` = c(NA, NA, NA, NA, NA, 1, NA, NA), `70` = c(NA, NA, NA, NA, NA, 1, NA, NA), `99` = c(NA, 1, NA, 1, NA, NA, NA, NA), `1000` = c(1, NA, NA, NA, NA, NA, 1, 1), `2000` = c(NA, NA, NA, NA, 1, 1, 1, NA), `3000` = c(NA, NA, 1, NA, NA, 1, NA, NA), `4000` = c(NA, NA, 1, NA, NA, NA, NA, NA), `6000` = c(NA, NA, NA, NA, 1, NA, NA, NA), `9999` = c(NA, 1, NA, 1, NA, NA, NA, NA)), .Names = c("custID", "10", "20", "30", "40", "50", "60", "70", "99", "1000", "2000", "3000", "4000", "6000", "9999"), row.names = c(NA, 8L), class = "data.frame") target4 cat("What would be a proper way to create target4 from test1?") -- cut -- What would be the proper way to create target4 from test1? Kind regards Georg ______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.