phil at philipsmith.ca
2020-Mar-29 01:18 UTC
[R] Duplicate names in the pivot column
I have a problem involving inefficient coding. My code works, but in my actual application it takes a very long time to execute. I have included a reprex here that uses the same code, but with a much smaller-scale application.

The data frame I am working with (df in my reprex) is in long form and I want to change it to wide form. My problem is that the pivot column, column 2 in my reprex, has some duplicate strings, so the pivot doesn't work well (df1 in my reprex). I want to find all the duplicates and tag them so they are no longer duplicates. My code succeeds (df3 in my reprex). But in the real application there can be over 100 "cases" and the for loops grind on far too long.

I encounter this problem frequently in the datasets I use, so I am looking for a general solution that is as efficient as possible. Any help will be much appreciated.

Philip

``` r
library(tidyverse)
df <- data.frame(time = c(1,1,1,1,1,1,2,2,2,2,2,2),
                 y = c("A","B","C","B","D","C","A","B","C","B","D","C"),
                 z = sample(1:100, 12, replace = TRUE),
                 stringsAsFactors = FALSE)
df1 <- pivot_wider(df, id_cols = 1, names_from = y, values_from = z)
#> Warning: Values in `z` are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list(z = list)` to suppress this warning.
#> * Use `values_fn = list(z = length)` to identify where the duplicates arise
#> * Use `values_fn = list(z = summary_fun)` to summarise duplicates

fixcol <- function(dfm, cases, per, s, tag) {
  # dfm is the data frame
  # s is the target column number, containing character names
  # tag is a string to be added to a duplicate name
  # cases is the number of rows for a single time period
  # per is the number of time periods
  # all time periods must have the same number of rows
  for (k in 1:per) {
    for (i in (1 + (k - 1) * cases):(k * cases - 1)) {
      for (j in (i + 1):(k * cases)) {
        if (dfm[j, s] == dfm[i, s]) {           # found a duplicate
          dfm[j, s] <- paste0(dfm[i, s], tag)   # fix the duplicate
        }
      }
    }
  }
  return(dfm)
}
df2 <- fixcol(df, 6, 2, 2, "_dup")
df3 <- pivot_wider(df2, id_cols = 1, names_from = y, values_from = z)
```

<sup>Created on 2020-03-28 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0)</sup>
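For a larger data set, the same tagging can be done without explicit loops by numbering repeats within each (time, y) group. This is a minimal base-R sketch, not a tested drop-in replacement; it assumes the df defined above, and the names df_tagged, dup_id and df_wide are only illustrative. Unlike fixcol(), the appended counter also keeps names unique when a value repeats three or more times within a period.

``` r
# Vectorized duplicate tagging (sketch, assuming df as defined above)
df_tagged <- df                      # work on a copy, leave df untouched
# Number each occurrence within its (time, y) group: 1, 2, 3, ...
dup_id <- ave(seq_along(df_tagged$y), df_tagged$time, df_tagged$y,
              FUN = seq_along)
# Tag second and later occurrences so every name is unique within a period
df_tagged$y <- ifelse(dup_id == 1, df_tagged$y,
                      paste0(df_tagged$y, "_dup", dup_id - 1))
df_wide <- pivot_wider(df_tagged, id_cols = time,
                       names_from = y, values_from = z)
```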
Hi Phil,

Sorry it's not in the environment you are using, but perhaps this will help:

``` r
taby <- table(df$y)
ynames <- names(taby)
for (yval in 1:length(taby)) {
  if (taby[yval] > 1) {
    cat(paste(ynames[yval], 1:taby[yval], sep = ""), "\n")
    df$y[which(df$y == ynames[yval])] <- paste(ynames[yval], 1:taby[yval], sep = "")
  }
}
```

Jim
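Note that taby counts occurrences of each name over the whole column, so the numbering runs across time periods (A1, A2, B1, ..., B4) rather than restarting within each period; in this example each original row therefore ends up with its own column in the wide result. After the loop the names should all be unique, so the pivot should run without the list-column warning. A sketch, using the df modified in place above:

``` r
df1 <- pivot_wider(df, id_cols = time, names_from = y, values_from = z)
```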
Does this help?

``` r
df4 <- ( df %>%
           group_by( time, y ) %>%
           mutate( lvl = seq.int( n() ) ) %>%
           ungroup() %>%
           mutate( y = ifelse( 1 == lvl, y, paste( y, "dup" ) ) ) )
```

--
Sent from my phone. Please excuse my brevity.
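A possible follow-up (a sketch; df5 is just an illustrative name): in this example each name repeats at most twice per period, so every (time, y) pair in df4 is unique and the original pivot should go through cleanly once the helper column lvl is dropped.

``` r
df5 <- df4 %>%
  select(-lvl) %>%
  pivot_wider(id_cols = time, names_from = y, values_from = z)
```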
phil at philipsmith.ca
2020-Mar-29 15:45 UTC
[R] Duplicate names in the pivot column
Thank you very much, Jim and Jeff. Both of your solutions work splendidly.

Philip