Rob Lines
2008-Aug-04  18:00 UTC
[CentOS] pam max locked memory issue after updating to 5.2 and rebooting
We were previously running CentOS 5.1 x86_64 and recently updated to
5.2 using yum.  Under 5.1 we had problems running jobs under torque,
and the solution had been to add the following lines to the files
noted:
"*          soft    memlock         unlimited" in
/etc/security/limits.conf
"session    required     pam_limits.so" in /etc/pam.d/{rsh,sshd}
This changed the max locked memory setting in ulimit as follows:
Before the change
rsh nodeX ulimit -a
gave us
max locked memory       (kbytes, -l) 32
After the change
rsh nodeX ulimit -a
max locked memory       (kbytes, -l) 16505400
The nodes have 16 GB of memory.
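Since the limits.conf entry above only sets the soft value, it may help to look at the soft and hard limits separately on a node. A minimal sketch, assuming a bash login on the node; show_memlock is a hypothetical helper name, not part of torque or pam:

```shell
#!/bin/bash
# show_memlock is a hypothetical helper (an assumption for illustration).
# It prints the soft and hard max-locked-memory limits of the current shell.
show_memlock() {
    # ulimit -S -l reports the soft limit, ulimit -H -l the hard limit;
    # each is either a number of kbytes or the word "unlimited".
    echo "soft memlock: $(ulimit -S -l)"
    echo "hard memlock: $(ulimit -H -l)"
}

show_memlock
```

Over rsh the same check would be something like: rsh nodeX 'ulimit -S -l; ulimit -H -l'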
Now, after the 5.2 update, those files are all unchanged.  We have not
yet rebooted most of the nodes because of long-running processes, but a
few nodes have been restarted, and now that jobs are starting to be
placed on them we are back to a max locked memory of 32 kB rather than
16 GB.
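To confirm on a rebooted node that both entries really are still in place, a check along these lines could be run. This is only a sketch: has_line and check_node_files are hypothetical helper names, and the patterns assume the entries are written exactly as quoted above.

```shell
#!/bin/bash
# has_line is a hypothetical helper: succeed if FILE contains a line that
# starts (after optional whitespace) with the extended regex PATTERN.
# Commented-out lines do not match because they start with '#'.
has_line() {
    pattern="$1"
    file="$2"
    grep -Eqs "^[[:space:]]*${pattern}" "$file"
}

# Patterns for the two entries quoted in this post.
limits_re='\*[[:space:]]+soft[[:space:]]+memlock[[:space:]]+unlimited'
pam_re='session[[:space:]]+required[[:space:]]+pam_limits\.so'

# Report any of the three files that is missing its entry.
check_node_files() {
    has_line "$limits_re" /etc/security/limits.conf \
        || echo "memlock entry missing in /etc/security/limits.conf"
    for f in /etc/pam.d/rsh /etc/pam.d/sshd; do
        has_line "$pam_re" "$f" || echo "pam_limits.so missing in $f"
    done
}
```

Calling check_node_files on a node prints nothing when all three files carry the expected lines.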
The error we are receiving on those jobs is:
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
    This will severely limit memory registrations.
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(306).......: Initialization failed
MPID_Init(113)..............: channel initialization failed
MPIDI_CH3_Init(167).........:
MPIDI_CH3I_RDMA_init(138)...:
rdma_setup_startup_ring(333): cannot create cq
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(306).......: Initialization failed
MPID_Init(113)..............: channel initialization failed
MPIDI_CH3_Init(167).........:
MPIDI_CH3I_RDMA_init(138)...:
rdma_setup_startup_ring(333): cannot create cq
rank 45 in job 1  nodeX_35175   caused collective abort of all ranks
  exit status of rank 45: return code 1
rank 44 in job 1  nodeX_35175   caused collective abort of all ranks
  exit status of rank 44: return code 1
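The 32768 bytes in the libibverbs warning is the same limit as the 32 kbytes shown by ulimit, since ulimit -l counts in 1024-byte units. A small sketch of the conversion; kb_to_bytes is a hypothetical helper name:

```shell
#!/bin/bash
# kb_to_bytes is a hypothetical helper: ulimit -l reports kbytes
# (1024-byte units), while libibverbs reports RLIMIT_MEMLOCK in bytes.
kb_to_bytes() {
    echo $(( $1 * 1024 ))
}

kb_to_bytes 32          # limit on the rebooted nodes, prints 32768
kb_to_bytes 16505400    # limit seen under 5.1, prints 16901529600
```

So the failing jobs are running with exactly the 32 kbytes limit, and not with the roughly 16 GB limit seen under 5.1.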
The full output of:
rsh nodeX ulimit -a
connect to address x.x.x.x port 544: Connection refused
Trying krb4 rsh...
connect to address x.x.x.x port 544: Connection refused
trying normal rsh (/usr/bin/rsh)
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 135168
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 135168
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
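To compare nodes without reading the whole listing each time, the one relevant line can be pulled out of the ulimit -a output. A minimal sketch, assuming the bash output format shown above; memlock_of is a hypothetical helper name:

```shell
#!/bin/bash
# memlock_of is a hypothetical helper: read `ulimit -a` output on stdin and
# print the value from the "max locked memory" line (its last field).
memlock_of() {
    awk '/^max locked memory/ { print $NF }'
}

# Local check; for a remote node the input would come from
# `rsh nodeX ulimit -a` instead.
ulimit -a | memlock_of
```

Fed the listing above, this prints 32; a node that picked up the limits.conf entry would print the larger value or "unlimited".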
Any ideas, suggestions or items I could roll back would be
appreciated.  I looked through the list of packages that were updated
and the only one that I could see that was related was pam.  ssh and
rsh were not updated.
Thank you,
Rob
Rob Lines
2008-Aug-08  12:10 UTC
[CentOS] Re: pam max locked memory issue after updating to 5.2 and rebooting
It has been a few days, so I am sending this again in case someone has
seen this problem or has a suggestion of where to look, or knows why
these settings might not be taking effect under 5.2 when they did
under 5.1.

On Mon, Aug 4, 2008 at 2:00 PM, Rob Lines <rlinesseagate at gmail.com> wrote:
> We were previously running 5.1 x86_64 and recently updated to 5.2
> using yum.  Under 5.1 we were having problems when running jobs using
> torque and the solution had been to add the following items to the
> files noted
>
> "*          soft    memlock         unlimited" in /etc/security/limits.conf
> "session    required     pam_limits.so" in /etc/pam.d/{rsh,sshd}
>
> This changed the max locked memory setting in ulimit as follows:
>
> Before the change
> rsh nodeX ulimit -a
> still gives us
> max locked memory       (kbytes, -l) 32
>
> After the change
> rsh nodeX ulimit -a
> max locked memory       (kbytes, -l) 16505400
>
> The nodes have 16gb of memory.
>
> Now after the 5.2 updates those files are all the same and on most of
> the nodes we haven't yet rebooted them due to long-running processes
> but a few nodes have been restarted and now that jobs are starting to
> be put on them we are back to max locked memory of 32k rather than
> 16gb.
>
> The error we are receiving on those jobs is :
>
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>     This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
>     This will severely limit memory registrations.
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(306).......: Initialization failed
> MPID_Init(113)..............: channel initialization failed
> MPIDI_CH3_Init(167).........:
> MPIDI_CH3I_RDMA_init(138)...:
> rdma_setup_startup_ring(333): cannot create cq
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(306).......: Initialization failed
> MPID_Init(113)..............: channel initialization failed
> MPIDI_CH3_Init(167).........:
> MPIDI_CH3I_RDMA_init(138)...:
> rdma_setup_startup_ring(333): cannot create cq
> rank 45 in job 1  nodeX_35175   caused collective abort of all ranks
>   exit status of rank 45: return code 1
> rank 44 in job 1  nodeX_35175   caused collective abort of all ranks
>   exit status of rank 44: return code 1
>
>
> The full output of :
>
> rsh nodeX ulimit -a
>
> connect to address x.x.x.x port 544: Connection refused
> Trying krb4 rsh...
> connect to address x.x.x.x port 544: Connection refused
> trying normal rsh (/usr/bin/rsh)
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 135168
> max locked memory       (kbytes, -l) 32
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 10240
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 135168
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
>
> Any ideas, suggestions or items I could roll back would be
> appreciated.  I looked through the list of packages that were updated
> and the only one that I could see that was related was pam.  ssh and
> rsh were not updated.
>
> Thank you,
> Rob
>