Displaying 8 results from an estimated 8 matches for "id_key".
2024 Dec 12
1
Cores hang when calling mcapply
...justments to get exactly what I wanted, but it works, and takes about 15 minutes on our server configuration:
temp <-
  open_dataset(
    sources = input_files,
    format = 'csv',
    unify_schemas = TRUE,
    col_types = schema(
      "ID_Key" = string(),
      "column1" = string(),
      "column2" = string()
    )
  ) |> as_tibble()

keeptabs <- split(temp, temp$ID_Key)
if (isTRUE(multicore)) {
  keeptabs <- mclapply(1:length(keeptabs), function(i) crewjanitormake...
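To make the truncated call above concrete, here is a minimal sketch, assuming `crewjanitormakeclean` is the poster's own wrapper around `janitor::make_clean_names()` (its body below is purely illustrative):

library(parallel)
library(janitor)

# Illustrative stand-in for the truncated `crewjanitormake...` helper:
# standardize the column names of one ID_Key chunk.
crewjanitormakeclean <- function(df) {
  names(df) <- make_clean_names(names(df))
  df
}

if (isTRUE(multicore)) {
  # Fork one worker per chunk, leaving a core free for the parent process.
  keeptabs <- mclapply(seq_along(keeptabs),
                       function(i) crewjanitormakeclean(keeptabs[[i]]),
                       mc.cores = detectCores() - 1)
}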
2024 Dec 12
1
Cores hang when calling mcapply
...estion helped, and that switching to a `data.table` approach reduced the processing time and memory overhead; 15 minutes for one of the smaller datasets is certainly better! Sounds like the adjustments you devised, especially keeping the multicore approach for `make_clean_names()` and ensuring that `ID_Key` values remain intact, were the missing components you needed to fit it into your workflow.
I believe the warning message regarding `dcast()` occurs because `keeptabs` is a `tbl_df` from the tidyverse rather than a base `data.frame` or `data.table`. The `data.table` implementation of `dcast()` exp...
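A quick sketch of that fix, assuming the warning arises from handing a tibble to `data.table::dcast()` (object and column names here are illustrative):

library(data.table)

# dcast() is generic: a tbl_df is not a data.table, so data.table
# redirects it to the deprecated reshape2 method and emits the warning.
# Converting first keeps everything on the data.table method.
keeptab_dt <- as.data.table(keeptabs[[1]])
wide <- dcast(keeptab_dt, ID_Key ~ column1,
              fun.aggregate = length, value.var = "column2")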
2024 Dec 11
1
Cores hang when calling mcapply
...ow-binding.
> library(data.table)
> library(arrow)
>
> # Step A: Load data efficiently as data.table
> dt <- as.data.table(
>   open_dataset(
>     sources = input_files,
>     format = 'csv',
>     unify_schemas = TRUE,
>     col_types = schema(
>       "ID_Key" = string(),
>       "column1" = string(),
>       "column2" = string()
>     )
>   ) |>
>     collect()
> )
>
> # Step B: Clean names once
> # Assume `crewjanitormakeclean` essentially standardizes column names
> dt[, column1 := janitor::make_...
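Since the quoted Step B is cut off at `janitor::make_...`, here is one hedged reading of it, assuming the intent is to standardize the text values once over the whole table rather than per-chunk inside `mclapply()`:

library(data.table)

# Note that janitor::make_clean_names() de-duplicates its input by
# default (repeated values come back as x, x_2, x_3, ...), which is
# usually wrong for data values, so a plain vectorized cleaner is shown:
clean_values <- function(x) gsub("[^a-z0-9]+", "_", tolower(x))
dt[, column1 := clean_values(column1)]
dt[, column2 := clean_values(column2)]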
2024 Dec 11
2
Cores hang when calling mcapply
...er <- function(df, column, prefix) {
  df <- df |>
    distinct() |>
    mutate(n = 1) |>
    pivot_wider(names_from = glue("{column}"), values_from = n,
                names_prefix = prefix, values_fill = list(n = 0))
  return(df)
}
sum_group_function <- function(df) {
  df <- df |>
    group_by(ID_Key) |>
    summarise(across(c(starts_with("column1_name_"), starts_with("column2_name_")),
                     ~ sum(.x, na.rm = TRUE))) |>
    ungroup()
  return(df)
}
and splitting up the data into a list of 110k individual dataframes based on ID_Key
temp <-
  open_dataset(
    sources = input_fi...
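For reference, a hedged sketch of how those two helpers collapse into a single `data.table::dcast()` call, which is the substitution the replies above suggest (`dt`, `column1`, and the `column1_name_` prefix are illustrative):

library(data.table)

# Count occurrences of each column1 value per ID_Key in one grouped
# pass: roughly distinct() + pivot_wider() + sum_group_function(),
# without splitting into 110k data frames first.
onehot <- dcast(dt, ID_Key ~ column1,
                fun.aggregate = length, value.var = "column1")
value_cols <- setdiff(names(onehot), "ID_Key")
setnames(onehot, value_cols, paste0("column1_name_", value_cols))

If only presence/absence per `ID_Key` is needed rather than counts, swap `length` for a 0/1 indicator such as function(x) as.integer(length(x) > 0).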