Hi Thomas,
Glad to hear the suggestion helped, and that switching to a `data.table`
approach reduced the processing time and memory overhead; 15 minutes for one of
the smaller datasets is certainly better! Sounds like the adjustments you
devised, especially keeping the multicore approach for `make_clean_names()` and
ensuring that `ID_Key` values remain intact, were the missing components you
needed to fit it into your workflow.
I believe the warning message regarding `dcast()` occurs because `keeptabs` is a
`tbl_df` from the tidyverse rather than a `data.table`. The `data.table` method
of `dcast()` is only dispatched for an actual `data.table`; when the generic is
handed anything else, such as a `tbl_df`, it tries to redirect to
`reshape2::dcast()`, and since that redirection is deprecated (and reshape2 is
no longer actively developed), the call will start failing in a future version
of `data.table`.
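If you want to confirm that diagnosis, a quick class check should show the
tibble classes rather than `data.table`, something like:

> class(keeptabs)
[1] "tbl_df"     "tbl"        "data.frame"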
To avoid this, consider converting `keeptabs` into a `data.table` directly
before calling `dcast()`. For example:
> setDT(keeptabs)
> out1 <- dcast(keeptabs, ID_Key ~ column1, fun.aggregate = length,
value.var = "column1")
> out2 <- dcast(keeptabs, ID_Key ~ column2, fun.aggregate = length,
value.var = "column2")
If `keeptabs` is a `data.table` at the time of calling `dcast()`, the
`data.table` method of `dcast()` is used, which should eliminate the warning
message and keep the code compatible with future updates of the `data.table`
package.
Also, if you must retain `keeptabs` as a tibble for other parts of the workflow,
you could convert it right before the `dcast()` call, do the wide conversion,
and then proceed. That keeps the workflow stable and ensures you won't run into
issues down the road when the redirect to `reshape2::dcast()` becomes
unsupported.
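If you go that route, something along these lines (untested here, but only
standard `data.table` calls) keeps the conversion local to the pivot step:

> out1 <- dcast(as.data.table(keeptabs), ID_Key ~ column1, fun.aggregate = length,
value.var = "column1")
> out2 <- dcast(as.data.table(keeptabs), ID_Key ~ column2, fun.aggregate = length,
value.var = "column2")

Unlike `setDT()`, `as.data.table()` makes a copy, so `keeptabs` itself stays a
tibble for the rest of the pipeline; the price is one extra copy of the data,
which may or may not matter at your sizes.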
Regarding the note on multicore usage for `make_clean_names()`, your alternative
approach - splitting the data and applying `make_clean_names()` in parallel
before merging - was a good idea, and relying on `data.table` for the main
pivot operations clearly made the process more memory- and time-efficient. The
remaining parallelization overhead is probably unavoidable given the size of
the datasets you're working with and the complexity of the initial cleaning.
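For what it's worth, that split/clean/recombine pattern can stay fairly compact;
a rough sketch (untested on my end, and assuming your `crewjanitormakeclean()`
wrapper and `numcores` setting from the code below) would be:

> library(parallel)
> library(data.table)
> chunks <- split(temp, temp$ID_Key)
> cleaned <- mclapply(chunks, function(d) crewjanitormakeclean(d, c("column1", "column2")),
mc.cores = numcores)
> keeptabs <- rbindlist(cleaned, use.names = TRUE, fill = TRUE)

Using `rbindlist()` for the recombine step keeps the result in `data.table`
form, so the later `dcast()` calls dispatch to the `data.table` method without
any extra conversion; `bind_rows()` works just as well if you would rather stay
in the tidyverse until the pivot.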
Anyway, happy the suggestion moved you closer to the results you were looking
for. I don't respond on r-help often; I'm mostly just a lurker.
All the best,
gregg
Sierra Vista, Arizona
On Thursday, December 12th, 2024 at 8:54 AM, Deramus, Thomas Patrick
<tderamus at mgb.org> wrote:
> Hi Gregg.
>
> Just wanted to follow up on the solution you proposed.
>
> I had to make some adjustments to get exactly what I wanted, but it works,
and takes about 15 minutes on our server configuration:
>
>     temp <-
>       open_dataset(
>         sources = input_files,
>         format = 'csv',
>         unify_schema = TRUE,
>         col_types = schema(
>           "ID_Key" = string(),
>           "column1" = string(),
>           "column2" = string()
>         )
>       ) |> as_tibble()
>
>     keeptabs <- split(temp, temp$ID_Key)
>
>     if(isTRUE(multicore)){
>       keeptabs <- mclapply(1:length(keeptabs), function(i)
>         crewjanitormakeclean(keeptabs[[i]], c("column1", "column2")),
>         mc.cores = numcores)
>     }else{
>       keeptabs <- lapply(1:length(keeptabs), function(i)
>         crewjanitormakeclean(keeptabs[[i]], c("column1", "column2")))
>     }
>
>     keeptabs <- bind_rows(keeptabs)
>
>     out1 <- dcast(keeptabs, ID_Key ~ column1, fun.aggregate = length,
>                   value.var = "column1")
>     out2 <- dcast(keeptabs, ID_Key ~ column2, fun.aggregate = length,
>                   value.var = "column2")
>     out1 <- setDT(out1 |> rename_with(~
>                   paste0("column1_name_", .x, recycle0 = TRUE), -ID_Key))
>     out2 <- setDT(out2 |> rename_with(~
>                   paste0("column2_name_", .x, recycle0 = TRUE), -ID_Key))
>     all_cols <- unique(c(names(out1), names(out2)))
>     out1_missing <- setdiff(all_cols, names(out1))
>     out2_missing <- setdiff(all_cols, names(out2))
>     for (col in out1_missing) out1[, (col) := 0]
>     for (col in out2_missing) out2[, (col) := 0]
>     setcolorder(out1, all_cols)
>     setcolorder(out2, all_cols)
>     final_dt <- rbindlist(list(out1, out2), use.names = TRUE, fill = TRUE)
>     final_result <- as_tibble(final_dt[, lapply(.SD, sum, na.rm = TRUE),
>                               by = ID_Key, .SDcols = setdiff(names(final_dt), "ID_Key")])
>
>
> Worth noting however:
>
>
> - I unfortunately had to keep the `multicore` parameters for the `janitor`
package's `make_clean_names()` because it just took too long to run on the full
dataframe, but deploying data.table CONSIDERABLY reduces the time and memory
overhead to the point where it only takes about 15 minutes to run one of my
smaller dataframes.
>
> - I keep getting the following warning message:
>
>
> - The dcast generic in data.table has been passed a tbl_df and will
attempt to redirect to the relevant reshape2 method; please note that reshape2
is superseded and is no longer actively developed, and this redirection is now
deprecated. Please do this redirection yourself like reshape2::dcast(keeptabs).
In the next version, this warning will become an error.
>
> - So a new call may be needed to keep this approach from failing if we
update our version of `data.table` in the future
>
>
> - The initial approach was replacing all the `KeyID` variables with what
was basically row numbers, and that would have made merging back to the main key
document an issue, so I changed the rbind function to keep this from happening.
>
>
>
> Thank you for all your help on this!
> -Thomas DeRamus
>
>
>
>
> From: Gregg Powell <g.a.powell at protonmail.com>
> Sent: Wednesday, December 11, 2024 2:11 PM
> To: Deramus, Thomas Patrick <tderamus at mgb.org>
> Cc: r-help at r-project.org <r-help at r-project.org>
> Subject: Re: [R] Cores hang when calling mcapply
>
> How is the server configured to handle memory distribution for individual
users? I see it has over 700GB of total system memory, but how much can be
assigned to each individual user?
>
> Again - just curious, and wondering how much memory was assigned to your
instance when you were running R.
>
> regards,
> Gregg
>
>
> On Wednesday, December 11th, 2024 at 9:49 AM, Deramus, Thomas Patrick
<tderamus at mgb.org> wrote:
>
> > It's Redhat Enterprise Linux 9
> >
> > Specifically:
> > OS Information:
> > NAME="Red Hat Enterprise Linux"
> > VERSION="9.3 (Plow)"
> > ID="rhel"
> > ID_LIKE="fedora"
> > VERSION_ID="9.3"
> > PLATFORM_ID="platform:el9"
> > PRETTY_NAME="Red Hat Enterprise Linux 9.3 (Plow)"
> > ANSI_COLOR="0;31"
> > LOGO="fedora-logo-icon"
> > CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
> > HOME_URL="https://www.redhat.com/"
> >
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
> > BUG_REPORT_URL="https://bugzilla.redhat.com/"
> > REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
> > REDHAT_BUGZILLA_PRODUCT_VERSION=9.3
> > REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
> > REDHAT_SUPPORT_PRODUCT_VERSION="9.3"
> > Operating System: Red Hat Enterprise Linux 9.3 (Plow)
> >      CPE OS Name: cpe:/o:redhat:enterprise_linux:9::baseos
> >           Kernel: Linux 5.14.0-362.13.1.el9_3.x86_64
> >     Architecture: x86-64
> >  Hardware Vendor: Dell Inc.
> >   Hardware Model: PowerEdge R840
> > Firmware Version: 2.15.1
> >
> >
> > Regarding RAM restrictions, here are the specs:
> >                total        used        free      shared  buff/cache   available
> > Mem:           753Gi        70Gi       600Gi       2.9Gi        89Gi       683Gi
> > Swap:          4.0Gi       2.5Gi       1.5Gi
> >
> > It's a multi-user server so naturally things fluctuate.
> >
> > Regarding possible CPU restrictions, here are the specs of our server:
> >     Thread(s) per core:  2
> >     Core(s) per socket:  20
> >     Socket(s):           4
> >     Stepping:            4
> >     CPU(s) scaling MHz:  50%
> >     CPU max MHz:         3700.0000
> >     CPU min MHz:         1000.0000
> >
> >
> >
> >
> >
> > From: Gregg Powell <g.a.powell at protonmail.com>
> > Sent: Wednesday, December 11, 2024 11:41 AM
> > To: Deramus, Thomas Patrick <tderamus at mgb.org>
> > Cc: r-help at r-project.org <r-help at r-project.org>
> > Subject: Re: [R] Cores hang when calling mcapply
> >
> > Thomas,
> > I'm curious - what OS are you running this on, and how much memory
does the computer have?
> >
> > Let me know if that code worked out as I hoped.
> >
> > regards,
> > gregg
> >
> > On Wednesday, December 11th, 2024 at 6:51 AM, Deramus, Thomas Patrick
<tderamus at mgb.org> wrote:
> >
> > > About to try this implementation.
> > >
> > > As a follow-up, this is the exact error:
> > >
> > > Lost warning messages
> > > Error: no more error handlers available (recursive errors?);
invoking 'abort' restart
> > > Execution halted
> > > Error: cons memory exhausted (limit reached?)
> > > Error: cons memory exhausted (limit reached?)
> > > Error: cons memory exhausted (limit reached?)
> > > Error: cons memory exhausted (limit reached?)
> > >
> > >
> > >
> > > From: Gregg Powell <g.a.powell at protonmail.com>
> > > Sent: Tuesday, December 10, 2024 7:52 PM
> > > To: Deramus, Thomas Patrick <tderamus at mgb.org>
> > > Cc: r-help at r-project.org <r-help at r-project.org>
> > > Subject: Re: [R] Cores hang when calling mcapply
> > >
> > > Hello Thomas,
> > >
> > > Consider that the primary bottleneck may be tied to memory usage
and the complexity of pivoting extremely large datasets into wide formats with
tens of thousands of unique values per column. Extremely large expansions of
columns inherently stress both memory and CPU, and splitting into 110k separate
data frames before pivoting and combining them again is likely causing resource
overhead and system instability.
> > >
> > > Perhaps, evaluate if the presence/absence transformation can be
done in a more memory-efficient manner without pivoting all at once. Since you
are dealing with extremely large data, a more incremental or streaming approach
may be necessary. Instead of splitting into thousands of individual data frames
and trying to pivot each in parallel, consider instead a method that processes
segments of data to incrementally build a large sparse matrix or a compressed
representation, then combine results at the end.
> > >
> > > It's probably better to move away from `pivot_wider()` on a
massive scale and attempt a data.table-based approach, which is often more
memory-efficient and faster for large-scale operations in R.
> > >
> > >
> > > An alternative is data.table's `dcast()`, which can handle large
data more efficiently, and data.table's in-memory operations often reduce
overhead compared to tidyverse pivoting functions.
> > >
> > > Also - consider using data.table's `fread()` or
`arrow::open_dataset()` directly with `as.data.table()` to keep everything in a
data.table format. For example, you can do a large `dcast()` operation to create
presence/absence columns by group. If your categories are extremely large,
consider an approach that processes categories in segments as I mentioned
earlier - and writes intermediate results to disk, then combines/merges results
at the end.
> > >
> > > Limit parallelization when dealing with massive reshapes. Instead
of trying to parallelize the entire pivot across thousands of subsets, run a
single parallelized chunking approach that processes manageable subsets and
writes out intermediate results (for example... using `fwrite()` for each
subset). After processing, load and combine these intermediate results. This
manual segmenting approach can circumvent the "zombie" processes you
mentioned - that I think arise from overly complex parallel nesting and
excessive memory utilization.
> > >
> > > If the presence/absence indicators are ultimately sparse (many
zeros and few ones), consider storing the result in a sparse matrix format (for
example, the `Matrix` package in R). Instead of creating thousands of columns as
dense integers, using a sparse matrix representation should dramatically reduce
memory. After processing the data into a sparse format, you can then save it in
a suitable file format and only convert to a dense format if absolutely
necessary.
> > >
> > > Below is a reworked code segment using data.table for a more
scalable approach. Note that this is a conceptual template. In practice, adapt
the chunk sizes and filtering operations to your workflow. The idea is to avoid
creating 110k separate data frames and to handle the pivot in a data.table
manner that's more robust and less memory intensive. Here, presence/absence
encoding is done by grouping and casting directly rather than repeatedly
splitting and row-binding.
> > >
> > > > library(data.table)
> > > > library(arrow)
> > > >
> > > > # Step A: Load data efficiently as data.table
> > > > dt <- as.data.table(
> > > >   open_dataset(
> > > >     sources = input_files,
> > > >     format = 'csv',
> > > >     unify_schema = TRUE,
> > > >     col_types = schema(
> > > >       "ID_Key" = string(),
> > > >       "column1" = string(),
> > > >       "column2" = string()
> > > >     )
> > > >   ) |>
> > > >     collect()
> > > > )
> > > >
> > > > # Step B: Clean names once
> > > > # Assume `crewjanitormakeclean` essentially standardizes column names
> > > > dt[, column1 := janitor::make_clean_names(column1, allow_dupes = TRUE)]
> > > > dt[, column2 := janitor::make_clean_names(column2, allow_dupes = TRUE)]
> > > >
> > > > # Step C: Create presence/absence indicators using data.table
> > > > # Use dcast to pivot wide. Set n=1 for presence, 0 for absence.
> > > > # For large unique values, consider chunking if needed.
> > > > out1 <- dcast(dt[!is.na(column1)], ID_Key ~ column1,
> > > >               fun.aggregate = length, value.var = "column1")
> > > > out2 <- dcast(dt[!is.na(column2)], ID_Key ~ column2,
> > > >               fun.aggregate = length, value.var = "column2")
> > > >
> > > > # Step D: Merge the two wide tables by ID_Key
> > > > # Fill missing columns with 0 using data.table on-the-fly operations
> > > > all_cols <- unique(c(names(out1), names(out2)))
> > > > out1_missing <- setdiff(all_cols, names(out1))
> > > > out2_missing <- setdiff(all_cols, names(out2))
> > > >
> > > > # Add missing columns with 0
> > > > for (col in out1_missing) out1[, (col) := 0]
> > > > for (col in out2_missing) out2[, (col) := 0]
> > > >
> > > > # Ensure column order alignment if needed
> > > > setcolorder(out1, all_cols)
> > > > setcolorder(out2, all_cols)
> > > >
> > > > # Combine by ID_Key (since they share same columns now)
> > > > final_dt <- rbindlist(list(out1, out2), use.names = TRUE, fill = TRUE)
> > > >
> > > > # Step E: If needed, summarize across ID_Key to sum presence indicators
> > > > final_result <- final_dt[, lapply(.SD, sum, na.rm = TRUE), by = ID_Key,
> > > >                          .SDcols = setdiff(names(final_dt), "ID_Key")]
> > > >
> > > > # note that final_result should now contain summed presence/absence (0/1) indicators.
> > >
> > >
> > >
> > > Hope this helps!
> > > gregg
> > > somewhere in Arizona
> > >