search for: open_dataset

Displaying 8 results from an estimated 8 matches for "open_dataset".

2024 Dec 11
1
Cores hang when calling mcapply
...ient and faster for large-scale operations in R. An alternate way would be data.table's `dcast()`, which can handle large data more efficiently, and data.table's in-memory operations often reduce overhead compared to tidyverse pivoting functions. Also - consider using data.table's `fread()` or `arrow::open_dataset()` directly with `as.data.table()` to keep everything in a data.table format. For example, you can do a large `dcast()` operation to create presence/absence columns by group. If your categories are extremely large, consider an approach that processes categories in segments as I mentioned earlier -...
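The `dcast()` suggestion above can be sketched on a toy long-format table (the `ID_Key`/`category` names and values below are invented for illustration, not taken from the thread):

```r
library(data.table)

# Toy long-format data: one row per (ID_Key, category) occurrence.
dt <- data.table(
  ID_Key   = c("A", "A", "B", "C"),
  category = c("x", "y", "x", "z")
)

# One dcast() call pivots to wide, counting occurrences per category;
# fun.aggregate = length fills absent (ID_Key, category) pairs with 0.
wide <- dcast(dt, ID_Key ~ category,
              fun.aggregate = length, value.var = "category")

# Convert the counts to 0/1 presence/absence flags.
pa <- wide[, lapply(.SD, function(v) as.integer(v > 0)), by = ID_Key]
```

Because `dcast()` processes the whole table in one pass inside data.table, it sidesteps the per-group splitting discussed elsewhere in the thread.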
2024 Dec 11
1
Cores hang when calling mcapply
...cient and faster for large-scale operations in R. An alternate way would be data.table's `dcast()`, which can handle large data more efficiently, and data.table's in-memory operations often reduce overhead compared to tidyverse pivoting functions. Also - consider using data.table's `fread()` or `arrow::open_dataset()` directly with `as.data.table()` to keep everything in a data.table format. For example, you can do a large `dcast()` operation to create presence/absence columns by group. If your categories are extremely large, consider an approach that processes categories in segments as I mentioned earlier -...
2024 Dec 11
1
Cores hang when calling mcapply
...cale operations in R. > > > An alternate way would be data.table's `dcast()`, which can handle large data more efficiently, and data.table's in-memory operations often reduce overhead compared to tidyverse pivoting functions. > > Also - consider using data.table's `fread()` or `arrow::open_dataset()` directly with `as.data.table()` to keep everything in a data.table format. For example, you can do a large `dcast()` operation to create presence/absence columns by group. If your categories are extremely large, consider an approach that processes categories in segments as I mentioned earlier - ...
2024 Dec 11
1
Cores hang when calling mcapply
...; > > > > > An alternate way would be data.table's `dcast()`, which can handle large data more efficiently, and data.table's in-memory operations often reduce overhead compared to tidyverse pivoting functions. > > > > Also - consider using data.table's `fread()` or `arrow::open_dataset()` directly with `as.data.table()` to keep everything in a data.table format. For example, you can do a large `dcast()` operation to create presence/absence columns by group. If your categories are extremely large, consider an approach that processes categories in segments as I mentioned earlier - ...
2024 Dec 12
1
Cores hang when calling mcapply
Hi Gregg. Just wanted to follow up on the solution you proposed. I had to make some adjustments to get exactly what I wanted, but it works, and takes about 15 minutes on our server configuration: temp <- open_dataset( sources = input_files, format = 'csv', unify_schema = TRUE, col_types = schema( "ID_Key" = string(), "column1" = string(), "column2" = string() ) ) |> as_t...
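A self-contained sketch of the pattern in this post, using two throwaway CSV shards written to a temp directory (the file names and data are invented for illustration; note that in the released `arrow` package the argument is spelled `unify_schemas`, plural):

```r
library(arrow)
library(dplyr)

# Two small CSV shards standing in for the poster's input_files.
dir <- file.path(tempdir(), "open_dataset_demo")
dir.create(dir, showWarnings = FALSE)
writeLines(c("ID_Key,column1,column2", "A,1,x"), file.path(dir, "part-01.csv"))
writeLines(c("ID_Key,column1,column2", "B,2,y"), file.path(dir, "part-02.csv"))
input_files <- list.files(dir, full.names = TRUE)

# Lazily scan the CSVs under one pinned schema, then materialise once.
temp <- open_dataset(
  sources       = input_files,
  format        = "csv",
  unify_schemas = TRUE,
  col_types     = schema(
    ID_Key  = string(),
    column1 = string(),
    column2 = string()
  )
) |> as_tibble()
```

Pinning every column to `string()` up front avoids type-guessing mismatches across shards; types can be converted after the data are collected.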
2024 Dec 11
2
Cores hang when calling mcapply
...df <- df |> group_by(ID_Key) |> summarise(across(c(starts_with("column1_name_"), starts_with("column2_name_")), ~ sum(.x, na.rm = TRUE))) |> ungroup() return(df) } and splitting up the data into a list of 110k individual dataframes based on Key_ID temp <- open_dataset( sources = input_files, format = 'csv', unify_schema = TRUE, col_types = schema( "ID_Key" = string(), "column1" = string(), "column2" = string() ) ) |> as_tibble() keeptabs <- split(temp, temp$ID_Key) I used a...
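The per-group aggregation quoted in this post can be reproduced on a toy tibble (the column names echo the post's `column1_name_`/`column2_name_` prefixes, but the data are invented). Note that a single grouped `summarise()` already collapses every `ID_Key` in one pass, with no need to `split()` into 110k dataframes first:

```r
library(dplyr)

# Invented wide-format indicator data, one row per record.
df <- tibble(
  ID_Key         = c("A", "A", "B"),
  column1_name_x = c(1, 0, 1),
  column2_name_y = c(0, 1, 1)
)

# Same shape as the helper in the post: sum the indicator columns per key.
out <- df |>
  group_by(ID_Key) |>
  summarise(across(c(starts_with("column1_name_"),
                     starts_with("column2_name_")),
                   ~ sum(.x, na.rm = TRUE))) |>
  ungroup()
```

Letting dplyr handle the grouping keeps the work in vectorised C code, rather than dispatching one R call per group under `mclapply()`.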
2024 Dec 11
1
Cores hang when calling mcapply
...cient and faster for large-scale operations in R. An alternate way would be data.table's `dcast()`, which can handle large data more efficiently, and data.table's in-memory operations often reduce overhead compared to tidyverse pivoting functions. Also - consider using data.table's `fread()` or `arrow::open_dataset()` directly with `as.data.table()` to keep everything in a data.table format. For example, you can do a large `dcast()` operation to create presence/absence columns by group. If your categories are extremely large, consider an approach that processes categories in segments as I mentioned earlier -...
2024 Dec 12
1
Cores hang when calling mcapply
...<tderamus at mgb.org> wrote: > Hi Gregg. > > Just wanted to follow up on the solution you proposed. > > I had to make some adjustments to get exactly what I wanted, but it works, and takes about 15 minutes on our server configuration: > > temp <- > open_dataset( > sources = input_files, > format = 'csv', > unify_schema = TRUE, > col_types = schema( > "ID_Key" = string(), > "column1" = string(), > "column2" = string...