Displaying 8 results from an estimated 8 matches for "id_key".
2024 Dec 12
1
Cores hang when calling mcapply
...justments to get exactly what I wanted, but it works, and takes about 15 minutes on our server configuration:
temp <-
  open_dataset(
    sources = input_files,
    format = 'csv',
    unify_schemas = TRUE,
    col_types = schema(
      "ID_Key" = string(),
      "column1" = string(),
      "column2" = string()
    )
  ) |> as_tibble()

keeptabs <- split(temp, temp$ID_Key)
if (isTRUE(multicore)) {
  keeptabs <- mclapply(1:length(keeptabs), function(i) crewjanitormake...
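To make the truncated call above concrete, here is a minimal sketch, assuming `crewjanitormakeclean` is the poster's own wrapper around `janitor::make_clean_names()` (its body below is purely illustrative):

library(parallel)
library(janitor)

# Illustrative stand-in for the truncated `crewjanitormake...` helper:
# standardize the column names of one ID_Key chunk.
crewjanitormakeclean <- function(df) {
  names(df) <- make_clean_names(names(df))
  df
}

if (isTRUE(multicore)) {
  # Fork one worker per chunk, leaving a core free for the parent process.
  keeptabs <- mclapply(seq_along(keeptabs),
                       function(i) crewjanitormakeclean(keeptabs[[i]]),
                       mc.cores = detectCores() - 1)
}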
2024 Dec 12
1
Cores hang when calling mcapply
...estion helped, and that switching to a `data.table` approach reduced the processing time and memory overhead; 15 minutes for one of the smaller datasets is certainly better! Sounds like the adjustments you devised, especially keeping the multicore approach for `make_clean_names()` and ensuring that `ID_Key` values remain intact, were the missing components you needed to fit it into your workflow.
I believe the warning message regarding `dcast()` occurs because `keeptabs` is a `tbl_df` from the tidyverse rather than a base `data.frame` or `data.table`. The `data.table` implementation of `dcast()` exp...
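A quick sketch of that fix, assuming the warning arises from handing a tibble to `data.table::dcast()` (object and column names here are illustrative):

library(data.table)

# dcast() is generic: a tbl_df is not a data.table, so data.table
# redirects it to the deprecated reshape2 method and emits the warning.
# Converting first keeps everything on the data.table method.
keeptab_dt <- as.data.table(keeptabs[[1]])
wide <- dcast(keeptab_dt, ID_Key ~ column1,
              fun.aggregate = length, value.var = "column2")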
2024 Dec 11
1
Cores hang when calling mcapply
...ow-binding.
> library(data.table)
> library(arrow)
>
> # Step A: Load data efficiently as data.table
> dt <- as.data.table(
>   open_dataset(
>     sources = input_files,
>     format = 'csv',
>     unify_schemas = TRUE,
>     col_types = schema(
>       "ID_Key" = string(),
>       "column1" = string(),
>       "column2" = string()
>     )
>   ) |>
>     collect()
> )
>
> # Step B: Clean names once
> # Assume `crewjanitormakeclean` essentially standardizes column names
> dt[, column1 := janitor::make_...
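Since the quoted Step B is cut off at `janitor::make_...`, here is one hedged reading of it, assuming the intent is to standardize the text values once over the whole table rather than per-chunk inside `mclapply()`:

library(data.table)

# Note that janitor::make_clean_names() de-duplicates its input by
# default (repeated values come back as x, x_2, x_3, ...), which is
# usually wrong for data values, so a plain vectorized cleaner is shown:
clean_values <- function(x) gsub("[^a-z0-9]+", "_", tolower(x))
dt[, column1 := clean_values(column1)]
dt[, column2 := clean_values(column2)]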
2024 Dec 11
2
Cores hang when calling mcapply
...er <- function(df, column, prefix) {
  df <- df |>
    distinct() |>
    mutate(n = 1) |>
    pivot_wider(names_from = glue("{column}"), values_from = n,
                names_prefix = prefix, values_fill = list(n = 0))
  return(df)
}
sum_group_function <- function(df) {
  df <- df |>
    group_by(ID_Key) |>
    summarise(across(c(starts_with("column1_name_"), starts_with("column2_name_")),
                     ~ sum(.x, na.rm = TRUE))) |>
    ungroup()
  return(df)
}
and splitting up the data into a list of 110k individual dataframes based on ID_Key
temp <-
  open_dataset(
    sources = input_fi...
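For reference, a hedged sketch of how those two helpers collapse into a single `data.table::dcast()` call, which is the substitution the replies above suggest (`dt`, `column1`, and the `column1_name_` prefix are illustrative):

library(data.table)

# Count occurrences of each column1 value per ID_Key in one grouped
# pass: roughly distinct() + pivot_wider() + sum_group_function(),
# without splitting into 110k data frames first.
onehot <- dcast(dt, ID_Key ~ column1,
                fun.aggregate = length, value.var = "column1")
value_cols <- setdiff(names(onehot), "ID_Key")
setnames(onehot, value_cols, paste0("column1_name_", value_cols))

If only presence/absence per `ID_Key` is needed rather than counts, swap `length` for a 0/1 indicator such as function(x) as.integer(length(x) > 0).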