How is the server configured to handle memory distribution for individual users?
I see it has over 700GB of total system memory, but how much can be assigned to
each individual user?
Again - just curious, and wondering how much memory was assigned to your
instance when you were running R.
regards,
Gregg
On Wednesday, December 11th, 2024 at 9:49 AM, Deramus, Thomas Patrick
<tderamus at mgb.org> wrote:
> It's Redhat Enterprise Linux 9
>
> Specifically:
> OS Information:
> NAME="Red Hat Enterprise Linux"
> VERSION="9.3 (Plow)"
> ID="rhel"
> ID_LIKE="fedora"
> VERSION_ID="9.3"
> PLATFORM_ID="platform:el9"
> PRETTY_NAME="Red Hat Enterprise Linux 9.3 (Plow)"
> ANSI_COLOR="0;31"
> LOGO="fedora-logo-icon"
> CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
> HOME_URL="https://www.redhat.com/"
>
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
> BUG_REPORT_URL="https://bugzilla.redhat.com/"
> REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
> REDHAT_BUGZILLA_PRODUCT_VERSION=9.3
> REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
> REDHAT_SUPPORT_PRODUCT_VERSION="9.3"
> Operating System: Red Hat Enterprise Linux 9.3 (Plow)
>      CPE OS Name: cpe:/o:redhat:enterprise_linux:9::baseos
>           Kernel: Linux 5.14.0-362.13.1.el9_3.x86_64
>     Architecture: x86-64
>  Hardware Vendor: Dell Inc.
>   Hardware Model: PowerEdge R840
> Firmware Version: 2.15.1
>
>
> Regarding RAM restrictions, here are the specs:
>                total        used        free      shared  buff/cache   available
> Mem:           753Gi        70Gi       600Gi       2.9Gi        89Gi       683Gi
> Swap:          4.0Gi       2.5Gi       1.5Gi
>
> It's a multi-user server so naturally things fluctuate.
>
> Regarding possible CPU restrictions, here are the specs of our server:
>     Thread(s) per core:  2
>     Core(s) per socket:  20
>     Socket(s):           4
>     Stepping:            4
>     CPU(s) scaling MHz:  50%
>     CPU max MHz:         3700.0000
>     CPU min MHz:         1000.0000
>
>
>
>
>
> From: Gregg Powell <g.a.powell at protonmail.com>
> Sent: Wednesday, December 11, 2024 11:41 AM
> To: Deramus, Thomas Patrick <tderamus at mgb.org>
> Cc: r-help at r-project.org <r-help at r-project.org>
> Subject: Re: [R] Cores hang when calling mcapply
>
> Thomas,
> I'm curious - what OS are you running this on, and how much memory does
the computer have?
>
> Let me know if that code worked out as I hoped.
>
> regards,
> gregg
>
>
> On Wednesday, December 11th, 2024 at 6:51 AM, Deramus, Thomas Patrick
<tderamus at mgb.org> wrote:
>
> > About to try this implementation.
> >
> > As a follow-up, this is the exact error:
> >
> > Lost warning messages
> > Error: no more error handlers available (recursive errors?); invoking
'abort' restart
> > Execution halted
> > Error: cons memory exhausted (limit reached?)
> > Error: cons memory exhausted (limit reached?)
> > Error: cons memory exhausted (limit reached?)
> > Error: cons memory exhausted (limit reached?)
> >
> >
> >
> > From: Gregg Powell <g.a.powell at protonmail.com>
> > Sent: Tuesday, December 10, 2024 7:52 PM
> > To: Deramus, Thomas Patrick <tderamus at mgb.org>
> > Cc: r-help at r-project.org <r-help at r-project.org>
> > Subject: Re: [R] Cores hang when calling mcapply
> >
> > Hello Thomas,
> >
> > Consider that the primary bottleneck may be tied to memory usage and
the complexity of pivoting extremely large datasets into wide formats with tens
of thousands of unique values per column. Extremely large expansions of columns
inherently stress both memory and CPU, and splitting into 110k separate data
frames before pivoting and combining them again is likely causing resource
overhead and system instability.
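> >
> > A rough back-of-the-envelope calculation makes that concrete (the row and
> > column counts below are assumed, illustrative figures, not taken from your
> > data):
> >
> > > n_rows <- 5e6     # assumed number of rows
> > > n_cols <- 5e4     # assumed tens of thousands of unique values -> columns
> > > n_rows * n_cols * 8 / 2^30   # bytes for a dense double matrix: ~1863 GiB,
> > >                              # i.e. well beyond the 753 GiB of RAM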
> >
> > Perhaps evaluate whether the presence/absence transformation can be done
in a more memory-efficient manner without pivoting all at once. Since you are
dealing with extremely large data, a more incremental or streaming approach
may be necessary. Instead of splitting into thousands of individual data
frames and trying to pivot each in parallel, consider instead a method that
processes segments of data to incrementally build a large sparse matrix or a
compressed representation, then combine results at the end.
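> >
> > As a minimal sketch of that incremental idea (assuming `input_files` is the
> > same vector of CSV paths used in the template below, and that each single
> > file fits comfortably in memory):
> >
> > > library(data.table)
> > >
> > > # Read one file at a time and keep only the compact long-format pairs
> > > # (ID_Key, value); duplicates are dropped per file, so the accumulated
> > > # object stays small relative to the raw input.
> > > pair_list <- lapply(input_files, function(f) {
> > >   chunk <- fread(f, select = c("ID_Key", "column1", "column2"),
> > >                  colClasses = "character")
> > >   long  <- melt(chunk, id.vars = "ID_Key",
> > >                 measure.vars = c("column1", "column2"),
> > >                 value.name = "value", na.rm = TRUE)
> > >   unique(long[, .(ID_Key, value)])
> > > })
> > >
> > > # The combined pairs are the compressed representation that can be cast
> > > # wide (or turned into a sparse matrix) in one final step.
> > > pair_dt <- unique(rbindlist(pair_list))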
> >
> > It's probably better to move away from `pivot_wider()` on a massive scale
and attempt a data.table-based approach, which is often more memory-efficient
and faster for large-scale operations in R.
> >
> >
> > Alternatively, data.table's `dcast()` can handle large data more
efficiently, and data.table's in-memory operations often reduce overhead
compared to tidyverse pivoting functions.
> >
> > Also - consider using data.table's `fread()` or `arrow::open_dataset()`
directly with `as.data.table()` to keep everything in a data.table format. For
example, you can do a large `dcast()` operation to create presence/absence
columns by group. If your categories are extremely large, consider an approach
that processes categories in segments as I mentioned earlier - writing
intermediate results to disk, then combining/merging results at the end.
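> >
> > A hedged sketch of that segment-wise idea (it assumes a data.table `dt`
> > with ID_Key and column1 as in the template further down; the segment size
> > of 5,000 categories is only illustrative):
> >
> > > library(data.table)
> > >
> > > cats     <- sort(unique(dt[!is.na(column1), column1]))
> > > segments <- split(cats, ceiling(seq_along(cats) / 5000))
> > >
> > > # Cast one segment of categories at a time and write each wide piece to
> > > # disk, so no single dcast() has to hold every column at once.
> > > for (i in seq_along(segments)) {
> > >   piece <- dcast(dt[column1 %chin% segments[[i]]],
> > >                  ID_Key ~ column1,
> > >                  fun.aggregate = length, value.var = "column1")
> > >   fwrite(piece, sprintf("wide_part_%03d.csv", i))
> > > }
> > > # The pieces can later be merged back together on ID_Key.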
> >
> > Limit parallelization when dealing with massive reshapes. Instead of
trying to parallelize the entire pivot across thousands of subsets, run a
single parallelized chunking approach that processes manageable subsets and
writes out intermediate results (for example, using `fwrite()` for each
subset). After processing, load and combine these intermediate results. This
manual segmenting approach can circumvent the "zombie" processes you
mentioned, which I think arise from overly complex parallel nesting and
excessive memory utilization.
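> >
> > A sketch of that restrained parallel pattern (the chunk count, the four
> > cores, and the `dt` layout from the template below are all assumptions to
> > adapt):
> >
> > > library(data.table)
> > > library(parallel)
> > >
> > > ids    <- unique(dt$ID_Key)
> > > chunks <- split(ids, cut(seq_along(ids), 20, labels = FALSE))
> > >
> > > # A small, fixed worker pool: each worker casts only its own ID_Key chunk
> > > # and writes the result to disk instead of returning a huge object.
> > > files <- mclapply(seq_along(chunks), function(i) {
> > >   out <- dcast(dt[ID_Key %chin% chunks[[i]] & !is.na(column1)],
> > >                ID_Key ~ column1,
> > >                fun.aggregate = length, value.var = "column1")
> > >   f <- sprintf("chunk_%02d.csv", i)
> > >   fwrite(out, f)
> > >   f
> > > }, mc.cores = 4)
> > >
> > > # Read the intermediate pieces back and combine; fill = TRUE pads columns
> > > # that are absent from a given chunk, and the NAs become 0 counts.
> > > combined <- rbindlist(lapply(unlist(files), fread), fill = TRUE)
> > > setnafill(combined, fill = 0, cols = setdiff(names(combined), "ID_Key"))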
> >
> > If the presence/absence indicators are ultimately sparse (many zeros and
few ones), consider storing the result in a sparse matrix format (for example,
the `Matrix` package in R). Instead of creating thousands of columns as dense
integers, using a sparse matrix representation should dramatically reduce
memory. After processing the data into a sparse format, you can then save it
in a suitable file format and only convert to a dense format if absolutely
necessary.
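> >
> > A minimal sparse sketch along those lines (assuming a long two-column
> > data.table `pair_dt` of unique ID_Key/value combinations, such as the one
> > built in the earlier sketch):
> >
> > > library(Matrix)
> > >
> > > ids  <- sort(unique(pair_dt$ID_Key))
> > > vals <- sort(unique(pair_dt$value))
> > >
> > > # One row per ID_Key, one column per category; only the 1s are stored.
> > > m <- sparseMatrix(
> > >   i = match(pair_dt$ID_Key, ids),
> > >   j = match(pair_dt$value, vals),
> > >   x = 1L,
> > >   dims = c(length(ids), length(vals)),
> > >   dimnames = list(ids, vals)
> > > )
> > >
> > > format(object.size(m), units = "auto")   # typically a small fraction of
> > >                                          # the equivalent dense matrix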
> >
> > Below is a reworked code segment using data.table for a more scalable
approach. Note that this is a conceptual template. In practice, adapt the
chunk sizes and filtering operations to your workflow. The idea is to avoid
creating 110k separate data frames and to handle the pivot in a data.table
manner that's more robust and less memory intensive. Here, presence/absence
encoding is done by grouping and casting directly rather than repeatedly
splitting and row-binding.
> >
> > > library(data.table)
> > > library(arrow)
> > >
> > > # Step A: Load data efficiently as a data.table
> > > dt <- as.data.table(
> > >   open_dataset(
> > >     sources = input_files,
> > >     format = 'csv',
> > >     unify_schemas = TRUE,
> > >     col_types = schema(
> > >       "ID_Key" = string(),
> > >       "column1" = string(),
> > >       "column2" = string()
> > >     )
> > >   ) |>
> > >     collect()
> > > )
> > >
> > > # Step B: Clean names once
> > > # Assume `crewjanitormakeclean` essentially standardizes column names
> > > dt[, column1 := janitor::make_clean_names(column1, allow_dupes = TRUE)]
> > > dt[, column2 := janitor::make_clean_names(column2, allow_dupes = TRUE)]
> > >
> > > # Step C: Create presence/absence indicators using data.table
> > > # Use dcast to pivot wide. Set n=1 for presence, 0 for absence.
> > > # For large unique values, consider chunking if needed.
> > > out1 <- dcast(dt[!is.na(column1)], ID_Key ~ column1,
> > >               fun.aggregate = length, value.var = "column1")
> > > out2 <- dcast(dt[!is.na(column2)], ID_Key ~ column2,
> > >               fun.aggregate = length, value.var = "column2")
> > >
> > > # Step D: Merge the two wide tables by ID_Key
> > > # Fill missing columns with 0 using data.table on-the-fly operations
> > > all_cols <- unique(c(names(out1), names(out2)))
> > > out1_missing <- setdiff(all_cols, names(out1))
> > > out2_missing <- setdiff(all_cols, names(out2))
> > >
> > > # Add missing columns with 0
> > > for (col in out1_missing) out1[, (col) := 0]
> > > for (col in out2_missing) out2[, (col) := 0]
> > >
> > > # Ensure column order alignment if needed
> > > setcolorder(out1, all_cols)
> > > setcolorder(out2, all_cols)
> > >
> > > # Combine by ID_Key (since they share the same columns now)
> > > final_dt <- rbindlist(list(out1, out2), use.names = TRUE, fill = TRUE)
> > >
> > > # Step E: If needed, summarize across ID_Key to sum presence indicators
> > > final_result <- final_dt[, lapply(.SD, sum, na.rm = TRUE), by = ID_Key,
> > >                          .SDcols = setdiff(names(final_dt), "ID_Key")]
> > >
> > > # final_result should now contain summed presence/absence (0/1) indicators.
> >
> >
> >
> >
> > Hope this helps!
> > gregg
> > somewhereinArizona
> >
> > The information in this e-mail is intended only for the person to whom
it is addressed. If you believe this e-mail was sent to you in error and the
e-mail contains patient information, please contact the Mass General Brigham
Compliance HelpLine at https://www.massgeneralbrigham.org/complianceline .
> >
> > Please note that this e-mail is not secure (encrypted). If you do not
wish to continue communication over unencrypted e-mail, please notify the
sender of this message immediately. Continuing to send or respond to e-mail
after receiving this message means you understand and accept this risk and
wish to continue to communicate over unencrypted e-mail.