thr3ads.net - search: "make_clean

2024 Dec 12

1

Cores hang when calling mcapply

Hi Thomas, Glad to hear the suggestion helped, and that switching to a `data.table` approach reduced the processing time and memory overhead; 15 minutes for one of the smaller datasets is certainly better! Sounds like the adjustments you devised, especially keeping the multicore approach for `make_clean_names()` and ensuring that `ID_Key` values remain intact, were the missing components you needed to fit it into your workflow. I believe the warning message regarding `dcast()` occurs because `keeptabs` is a `tbl_df` from the tidyverse rather than a base `data.frame` or `data.table`. The `data.table`...

Cores hang when calling mcapply
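
A minimal sketch of one common way to avoid this kind of class mismatch, assuming the warning does come from handing a non-`data.table` object to `data.table`'s `dcast()`. The contents of `keeptabs` are not shown in the thread, so the toy data and the names `dt` and `wide` are hypothetical, and a plain `data.frame` stands in for the tibble:

```r
library(data.table)

# Hypothetical stand-in for `keeptabs`; the real object is not shown in the thread.
keeptabs <- data.frame(
  ID_Key  = c("a", "a", "b"),
  column1 = c("x", "y", "x"),
  stringsAsFactors = FALSE
)

# Convert to a data.table before reshaping so dcast() uses the data.table method.
# as.data.table() makes a copy; setDT() would convert in place instead.
dt <- unique(as.data.table(keeptabs))

# Presence/absence indicators: count rows per (ID_Key, column1) pair, 0 where absent.
wide <- dcast(dt, ID_Key ~ column1,
              value.var = "column1",
              fun.aggregate = length,
              fill = 0L)
```

Whether this removes the warning in the original workflow depends on how `keeptabs` is built there, which the excerpt does not show.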

2024 Dec 12

1

Cores hang when calling mcapply

...1, out2), use.names = TRUE, fill = TRUE) final_result <- as_tibble(final_dt[, lapply(.SD, sum, na.rm = TRUE), by = ID_Key, .SDcols = setdiff(names(final_dt), "ID_Key")]) Worth noting however: * I unfortunately had to keep the multicore parameters for the janitor package to use make_clean_names() because it just took too long to run it on the full dataframe, but deploying data.table CONSIDERABLY reduces the time and memory overhead to the point where it only takes about 15 minutes to run one of my smaller dataframes. * I keep getting the following warning message: * The dcast gen...

Cores hang when calling mcapply
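
The combine-then-sum step in the excerpt above can be illustrated with a small sketch. The opening of the original call is truncated, so this assumes it stacks two wide tables with something like `rbindlist()`; the tables `part_a` and `part_b` and their column names are hypothetical:

```r
library(data.table)

# Hypothetical wide indicator tables with different column sets.
part_a <- data.table(ID_Key = c("k1", "k2"), col1_x = c(1, 0))
part_b <- data.table(ID_Key = c("k1", "k3"), col2_y = c(1, 1))

# Stack by column name; columns missing from one table are filled with NA.
stacked <- rbindlist(list(part_a, part_b), use.names = TRUE, fill = TRUE)

# Collapse to one row per ID_Key, summing every indicator column.
# na.rm = TRUE turns the NA padding from the fill step into zeros.
totals <- stacked[, lapply(.SD, sum, na.rm = TRUE),
                  by = ID_Key,
                  .SDcols = setdiff(names(stacked), "ID_Key")]
```

Here `totals` has one row per key, and the result stays a `data.table` unless it is converted back with `as_tibble()` as in the excerpt.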

2024 Dec 11

1

Cores hang when calling mcapply

...D_Key" = string(), > "column1" = string(), > "column2" = string() > ) > ) |> > collect() > ) > > # Step B: Clean names once > # Assume `crewjanitormakeclean` essentially standardizes column names > dt[, column1 := janitor::make_clean_names(column1, allow_dupes = > TRUE)] > dt[, column2 := janitor::make_clean_names(column2, allow_dupes = > TRUE)] > > # Step C: Create presence/absence indicators using data.table > # Use dcast to pivot wide. Set n=1 for presence, 0 for absence. > # For large unique values, c...

Cores hang when calling mcapply
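
Since Step B calls `make_clean_names()` on every row, one possible way to reduce the work is to clean each distinct value once and map the result back. This is a sketch of that idea, not the approach from the thread; the toy data and the names `lookup`, `clean`, and `column1_clean` are hypothetical, and it assumes the `janitor` package is installed:

```r
library(data.table)

# Hypothetical data with repeated raw values.
dt <- data.table(
  ID_Key  = c("k1", "k1", "k2"),
  column1 = c("Alpha Beta", "Gamma-Delta", "Alpha Beta")
)

# Clean each distinct value once instead of once per row.
lookup <- data.table(column1 = unique(dt$column1))
lookup[, clean := janitor::make_clean_names(column1, allow_dupes = TRUE)]

# Update join: copy the cleaned value onto every matching row of dt.
dt[lookup, on = "column1", column1_clean := i.clean]
```

With `allow_dupes = TRUE` no positional suffixes are added, so cleaning the unique values should give the same names as cleaning the full column.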

2024 Dec 11

1

Cores hang when calling mcapply

...ID_Key" = string(), > "column1" = string(), > "column2" = string() > ) > ) |> > collect() > ) > > # Step B: Clean names once > # Assume `crewjanitormakeclean` essentially standardizes column names > dt[, column1 := janitor::make_clean_names(column1, allow_dupes = > TRUE)] > dt[, column2 := janitor::make_clean_names(column2, allow_dupes = > TRUE)] > > # Step C: Create presence/absence indicators using data.table > # Use dcast to pivot wide. Set n=1 for presence, 0 for absence. > # For large unique values, cons...

Cores hang when calling mcapply

2024 Dec 11

1

Cores hang when calling mcapply

..." = string(), > > "column2" = string() > > ) > > ) |> > > > collect() > > ) > > > > # Step B: Clean names once > > # Assume `crewjanitormakeclean` essentially standardizes column names > > dt[, column1 := janitor::make_clean_names(column1, allow_dupes = > > > TRUE)] > > dt[, column2 := janitor::make_clean_names(column2, allow_dupes = > > > TRUE)] > > > > # Step C: Create presence/absence indicators using data.table > > # Use dcast to pivot wide. Set n=1 for presence, 0 for ab...

Cores hang when calling mcapply

2024 Dec 11

1

Cores hang when calling mcapply

..." = string() > > > ) > > > ) |> > > > > > collect() > > > ) > > > > > > # Step B: Clean names once > > > # Assume `crewjanitormakeclean` essentially standardizes column names > > > dt[, column1 := janitor::make_clean_names(column1, allow_dupes = > > > > > TRUE)] > > > dt[, column2 := janitor::make_clean_names(column2, allow_dupes = > > > > > TRUE)] > > > > > > # Step C: Create presence/absence indicators using data.table > > > # Use dcast to pi...

Cores hang when calling mcapply

2024 Dec 11

2

Cores hang when calling mcapply

...lumns based on the strings contained within them, as an example, one set has ~29k unique values and the other with ~15k unique values (no overlap across the two). Using a combination of custom functions: crewjanitormakeclean <- function(df,columns) { df <- df |> mutate(across(columns, ~make_clean_names(., allow_dupes = TRUE))) return(df) } mass_pivot_wider <- function(df,column,prefix) { df <- df |> distinct() |> mutate(n = 1) |> pivot_wider(names_from = glue("{column}"), values_from = n, names_prefix = prefix, values_fill = list(n = 0)) return(df) } sum_group_...

Cores hang when calling mcapply

2024 Dec 11

1

Cores hang when calling mcapply

...ID_Key" = string(), > "column1" = string(), > "column2" = string() > ) > ) |> > collect() > ) > > # Step B: Clean names once > # Assume `crewjanitormakeclean` essentially standardizes column names > dt[, column1 := janitor::make_clean_names(column1, allow_dupes = > TRUE)] > dt[, column2 := janitor::make_clean_names(column2, allow_dupes = > TRUE)] > > # Step C: Create presence/absence indicators using data.table > # Use dcast to pivot wide. Set n=1 for presence, 0 for absence. > # For large unique values, cons...
