About to try this implementation.
As a follow-up, this is the exact error:
Lost warning messages
Error: no more error handlers available (recursive errors?); invoking
'abort' restart
Execution halted
Error: cons memory exhausted (limit reached?)
Error: cons memory exhausted (limit reached?)
Error: cons memory exhausted (limit reached?)
Error: cons memory exhausted (limit reached?)
________________________________
From: Gregg Powell <g.a.powell at protonmail.com>
Sent: Tuesday, December 10, 2024 7:52 PM
To: Deramus, Thomas Patrick <tderamus at mgb.org>
Cc: r-help at r-project.org <r-help at r-project.org>
Subject: Re: [R] Cores hang when calling mcapply
Hello Thomas,
Consider that the primary bottleneck may be tied to memory usage and the
complexity of pivoting extremely large datasets into wide formats with tens of
thousands of unique values per column. Extremely large expansions of columns
inherently stress both memory and CPU, and splitting into 110k separate data
frames before pivoting and combining them again is likely causing resource
overhead and system instability.
Perhaps evaluate whether the presence/absence transformation can be done in a more
memory-efficient manner without pivoting all at once. Since you are dealing with
extremely large data, a more incremental or streaming approach may be necessary.
Instead of splitting into thousands of individual data frames and trying to
pivot each in parallel, consider a method that processes segments of data to
incrementally build a large sparse matrix or a compressed representation, then
combines the results at the end.
It's probably better to move away from `pivot_wider()` on a massive scale
and attempt a data.table-based approach, which is often more memory-efficient
and faster for large-scale operations in R.
data.table's `dcast()` can handle large data more efficiently, and data.table's
in-memory operations often reduce overhead compared to tidyverse pivoting
functions.
Also - consider using data.table's `fread()` or `arrow::open_dataset()` directly
with `as.data.table()` to keep everything in a data.table format. For example,
you can do a large `dcast()` operation to create presence/absence columns by
group. If your categories are extremely numerous, consider an approach that
processes categories in segments as I mentioned earlier - and writes
intermediate results to disk, then combines/merges the results at the end.
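As a rough, untested sketch of that segmented idea (it reuses the `dt`, `ID_Key`, and
`column1` names from the code further down; the chunk size and file names are just
placeholders to adapt):

library(data.table)

cats <- unique(dt[!is.na(column1), column1])
chunks <- split(cats, ceiling(seq_along(cats) / 5000))  # ~5k categories per pass

for (i in seq_along(chunks)) {
  # Wide presence/absence table for this segment of categories only
  wide_i <- dcast(dt[column1 %in% chunks[[i]]],
                  ID_Key ~ column1,
                  fun.aggregate = length, value.var = "column1")
  fwrite(wide_i, sprintf("chunk_col1_%03d.csv", i))  # intermediate result on disk
  rm(wide_i); gc()  # release memory before the next segment
}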
Limit parallelization when dealing with massive reshapes. Instead of trying to
parallelize the entire pivot across thousands of subsets, run a single
parallelized chunking approach that processes manageable subsets and writes out
intermediate results (for example, using `fwrite()` for each subset). After
processing, load and combine these intermediate results. This manual segmenting
approach can circumvent the "zombie" processes you mentioned, which I
think arise from overly complex parallel nesting and excessive memory
utilization.
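A matching sketch for the read-back-and-combine step (again untested; it assumes the
chunk_col1_*.csv files written in the sketch above, each keyed by ID_Key with a
disjoint set of category columns):

library(data.table)

files <- list.files(pattern = "^chunk_col1_.*\\.csv$")
pieces <- lapply(files, fread)

# Full outer joins on ID_Key; combinations absent from a chunk come back
# as NA, which we then replace with 0
combined <- Reduce(function(x, y) merge(x, y, by = "ID_Key", all = TRUE), pieces)
for (j in setdiff(names(combined), "ID_Key"))
  set(combined, which(is.na(combined[[j]])), j, 0L)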
If the presence/absence indicators are ultimately sparse (many zeros and few
ones), consider storing the result in a sparse matrix format (for example, the
`Matrix` package in R). Instead of creating thousands of columns as dense
integers, using a sparse matrix representation should dramatically reduce
memory. After processing the data into a sparse format, you can then save it in
a suitable file format and only convert to a dense format if absolutely
necessary.
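If you want to try the sparse route, a minimal sketch with the Matrix package might
look like this (again assuming the `dt`, `ID_Key`, and `column1` names used in the
code below):

library(data.table)
library(Matrix)

sub  <- dt[!is.na(column1)]
ids  <- factor(sub$ID_Key)
cats <- factor(sub$column1)

# One row per ID_Key, one column per category; duplicate (id, category)
# pairs are summed, so cap the values at 1 for pure presence/absence
m <- sparseMatrix(i = as.integer(ids),
                  j = as.integer(cats),
                  x = 1L,
                  dimnames = list(levels(ids), levels(cats)))
m@x <- pmin(m@x, 1)

This keeps only the non-zero cells in memory, which is usually a tiny fraction of
the dense wide table.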
Below is a reworked code segment using data.table for a more scalable approach.
Note that this is a conceptual template. In practice, adapt the chunk sizes and
filtering operations to your workflow. The idea is to avoid creating 110k
separate data frames and to handle the pivot in a data.table manner that's more
robust and less memory intensive. Here, presence/absence encoding is done by
grouping and casting directly rather than repeatedly splitting and row-binding.
> library(data.table)
> library(arrow)
>
> # Step A: Load data efficiently as data.table
> dt <- as.data.table(
>   open_dataset(
>     sources = input_files,
>     format = 'csv',
>     unify_schemas = TRUE,
>     col_types = schema(
>       "ID_Key" = string(),
>       "column1" = string(),
>       "column2" = string()
>     )
>   ) |>
>     collect()
> )
>
> # Step B: Clean names once
> # Assume `crewjanitormakeclean` essentially standardizes column names
> dt[, column1 := janitor::make_clean_names(column1, allow_dupes = TRUE)]
> dt[, column2 := janitor::make_clean_names(column2, allow_dupes = TRUE)]
>
> # Step C: Create presence/absence indicators using data.table
> # Use dcast to pivot wide. fun.aggregate = length counts occurrences,
> # so 0 means absent and >= 1 means present.
> # For large numbers of unique values, consider chunking if needed.
> out1 <- dcast(dt[!is.na(column1)], ID_Key ~ column1,
>               fun.aggregate = length, value.var = "column1")
> out2 <- dcast(dt[!is.na(column2)], ID_Key ~ column2,
>               fun.aggregate = length, value.var = "column2")
>
> # Step D: Merge the two wide tables by ID_Key
> # Fill missing columns with 0 using data.table on-the-fly operations
> all_cols <- unique(c(names(out1), names(out2)))
> out1_missing <- setdiff(all_cols, names(out1))
> out2_missing <- setdiff(all_cols, names(out2))
>
> # Add missing columns with 0
> for (col in out1_missing) out1[, (col) := 0]
> for (col in out2_missing) out2[, (col) := 0]
>
> # Ensure column order alignment if needed
> setcolorder(out1, all_cols)
> setcolorder(out2, all_cols)
>
> # Combine by ID_Key (since they share same columns now)
> final_dt <- rbindlist(list(out1, out2), use.names = TRUE, fill = TRUE)
>
> # Step E: If needed, summarize across ID_Key to sum presence indicators
> final_result <- final_dt[, lapply(.SD, sum, na.rm = TRUE),
>                          by = ID_Key,
>                          .SDcols = setdiff(names(final_dt), "ID_Key")]
>
> # note that final_result should now contain the summed presence/absence
> # indicators (any value >= 1 means present).
Hope this helps!
gregg
somewhereinArizona