Displaying 8 results from an estimated 8 matches for "open_dataset".
2024 Dec 11
1
Cores hang when calling mcapply
...ient and faster for large-scale operations in R.
An alternative would be data.table's `dcast()`, which can handle large data more efficiently; data.table's in-memory operations often reduce overhead compared to tidyverse pivoting functions.
Also, consider using data.table's `fread()` or `arrow::open_dataset()` directly with `as.data.table()` to keep everything in a data.table format. For example, you can do a large `dcast()` operation to create presence/absence columns by group. If your categories are extremely large, consider an approach that processes categories in segments as I mentioned earlier -...
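As a rough illustration of the `dcast()` suggestion in the snippet above, here is a minimal sketch of building presence/absence columns by group. The toy data and the object names `dt` and `wide` are hypothetical; only `ID_Key` and `column1` echo names used elsewhere in the thread, and the real data would come from `fread()` or `arrow::open_dataset()` rather than being typed in.

```r
library(data.table)

# Hypothetical toy data: one row per (ID_Key, category) observation.
dt <- data.table(
  ID_Key  = c("a", "a", "b", "c", "c"),
  column1 = c("x", "y", "x", "z", "z")
)

# Cast long to wide: one row per ID_Key, one column per distinct category.
# fun.aggregate collapses repeated (ID_Key, column1) pairs to 1 (present);
# fill = 0L marks combinations that never occur (absent).
wide <- dcast(
  dt,
  ID_Key ~ column1,
  value.var = "column1",
  fun.aggregate = function(x) as.integer(length(x) > 0L),
  fill = 0L
)
```

Here `wide` has one row per `ID_Key` and one 0/1 column for each distinct value of `column1`. Whether this is faster than a tidyverse pivot on the real data is something to measure rather than assume.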
2024 Dec 11
1
Cores hang when calling mcapply
...cient and faster for large-scale operations in R.
An alternative would be data.table's `dcast()`, which can handle large data more efficiently; data.table's in-memory operations often reduce overhead compared to tidyverse pivoting functions.
Also, consider using data.table's `fread()` or `arrow::open_dataset()` directly with `as.data.table()` to keep everything in a data.table format. For example, you can do a large `dcast()` operation to create presence/absence columns by group. If your categories are extremely large, consider an approach that processes categories in segments as I mentioned earlier -...
2024 Dec 11
1
Cores hang when calling mcapply
...cale operations in R.
>
>
> An alternative would be data.table's `dcast()`, which can handle large data more efficiently; data.table's in-memory operations often reduce overhead compared to tidyverse pivoting functions.
>
> Also, consider using data.table's `fread()` or `arrow::open_dataset()` directly with `as.data.table()` to keep everything in a data.table format. For example, you can do a large `dcast()` operation to create presence/absence columns by group. If your categories are extremely large, consider an approach that processes categories in segments as I mentioned earlier -...
2024 Dec 11
1
Cores hang when calling mcapply
...> >
> >
> > An alternative would be data.table's `dcast()`, which can handle large data more efficiently; data.table's in-memory operations often reduce overhead compared to tidyverse pivoting functions.
> >
> > Also, consider using data.table's `fread()` or `arrow::open_dataset()` directly with `as.data.table()` to keep everything in a data.table format. For example, you can do a large `dcast()` operation to create presence/absence columns by group. If your categories are extremely large, consider an approach that processes categories in segments as I mentioned earlier -...
2024 Dec 12
1
Cores hang when calling mcapply
Hi Gregg.
Just wanted to follow up on the solution you proposed.
I had to make some adjustments to get exactly what I wanted, but it works, and takes about 15 minutes on our server configuration:
temp <-
  open_dataset(
    sources = input_files,
    format = 'csv',
    unify_schema = TRUE,
    col_types = schema(
      "ID_Key" = string(),
      "column1" = string(),
      "column2" = string()
    )
  ) |> as_t...
2024 Dec 11
2
Cores hang when calling mcapply
...df <- df |> group_by(ID_Key) |> summarise(across(c(starts_with("column1_name_"), starts_with("column2_name_")), ~ sum(.x, na.rm = TRUE))) |> ungroup()
return(df)
}
and splitting up the data into a list of 110k individual data frames based on ID_Key
temp <-
  open_dataset(
    sources = input_files,
    format = 'csv',
    unify_schema = TRUE,
    col_types = schema(
      "ID_Key" = string(),
      "column1" = string(),
      "column2" = string()
    )
  ) |> as_tibble()
keeptabs <- split(temp, temp$ID_Key)
I used a...
2024 Dec 11
1
Cores hang when calling mcapply
...cient and faster for large-scale operations in R.
An alternative would be data.table's `dcast()`, which can handle large data more efficiently; data.table's in-memory operations often reduce overhead compared to tidyverse pivoting functions.
Also, consider using data.table's `fread()` or `arrow::open_dataset()` directly with `as.data.table()` to keep everything in a data.table format. For example, you can do a large `dcast()` operation to create presence/absence columns by group. If your categories are extremely large, consider an approach that processes categories in segments as I mentioned earlier -...
2024 Dec 12
1
Cores hang when calling mcapply
...<tderamus at mgb.org> wrote:
> Hi Gregg.
>
> Just wanted to follow up on the solution you proposed.
>
> I had to make some adjustments to get exactly what I wanted, but it works, and takes about 15 minutes on our server configuration:
>
>     temp <-
>       open_dataset(
>         sources = input_files,
>         format = 'csv',
>         unify_schema = TRUE,
>         col_types = schema(
>           "ID_Key" = string(),
>           "column1" = string(),
>           "column2" = string...