Rob Lines
2008-Aug-04 18:00 UTC
[CentOS] pam max locked memory issue after updating to 5.2 and rebooting
We were previously running 5.1 x86_64 and recently updated to 5.2
using yum. Under 5.1 we were having problems when running jobs using
torque, and the solution had been to add the following items to the
files noted:

"* soft memlock unlimited" in /etc/security/limits.conf
"session required pam_limits.so" in /etc/pam.d/{rsh,sshd}

This changed the max locked memory setting in ulimit as follows:

Before the change
rsh nodeX ulimit -a
still gives us
max locked memory       (kbytes, -l) 32

After the change
rsh nodeX ulimit -a
max locked memory       (kbytes, -l) 16505400

The nodes have 16 GB of memory.

After the 5.2 update those files are unchanged. We have not yet
rebooted most of the nodes because of long-running processes, but a
few nodes have been restarted, and now that jobs are starting to be
placed on them we are back to a max locked memory of 32 KB rather than
16 GB.

The error we are receiving on those jobs is:

libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(306).......: Initialization failed
MPID_Init(113)..............: channel initialization failed
MPIDI_CH3_Init(167).........:
MPIDI_CH3I_RDMA_init(138)...:
rdma_setup_startup_ring(333): cannot create cq
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(306).......: Initialization failed
MPID_Init(113)..............: channel initialization failed
MPIDI_CH3_Init(167).........:
MPIDI_CH3I_RDMA_init(138)...:
rdma_setup_startup_ring(333): cannot create cq
rank 45 in job 1 nodeX_35175 caused collective abort of all ranks
exit status of rank 45: return code 1
rank 44 in job 1 nodeX_35175 caused collective abort of all ranks
exit status of rank 44: return code 1

The full output of:

rsh nodeX ulimit -a

connect to address x.x.x.x port 544: Connection refused
Trying krb4 rsh...
connect to address x.x.x.x port 544: Connection refused
trying normal rsh (/usr/bin/rsh)
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 135168
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 135168
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Any ideas, suggestions, or items I could roll back would be
appreciated. I looked through the list of packages that were updated,
and the only one I could see that was related was pam. ssh and rsh
were not updated.

Thank you,
Rob
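For anyone wanting to confirm what limit a process on a node actually receives, the following is a minimal sketch, assuming Python is available on the node. The function names `memlock_limits` and `format_limit` are illustrative only and not part of any tool mentioned above. It reads RLIMIT_MEMLOCK for the current process and prints both the soft and the hard value in kbytes, the unit `ulimit -l` uses, so 32768 bytes appears as 32.

```python
import resource


def format_limit(value):
    """Render a byte limit in kbytes, the unit `ulimit -l` reports."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return str(value // 1024)


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK values of this process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print("max locked memory, soft (kbytes):", format_limit(soft))
    print("max locked memory, hard (kbytes):", format_limit(hard))
```

Running it through the same path the jobs use (for example via rsh, or from inside a torque job) may help show whether the limit differs between login sessions and job processes. Printing the hard value as well is useful because an unprivileged process cannot raise its soft limit above the hard one.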
Rob Lines
2008-Aug-08 12:10 UTC
[CentOS] Re: pam max locked memory issue after updating to 5.2 and rebooting
It has been a few days, so I am sending this again in case someone has
seen this problem or has a suggestion of where to look and why these
settings might not be taking effect with 5.2 when they did with 5.1.

On Mon, Aug 4, 2008 at 2:00 PM, Rob Lines <rlinesseagate at gmail.com> wrote:
> We were previously running 5.1 x86_64 and recently updated to 5.2
> using yum. Under 5.1 we were having problems when running jobs using
> torque, and the solution had been to add the following items to the
> files noted:
>
> "* soft memlock unlimited" in /etc/security/limits.conf
> "session required pam_limits.so" in /etc/pam.d/{rsh,sshd}
>
> This changed the max locked memory setting in ulimit as follows:
>
> Before the change
> rsh nodeX ulimit -a
> still gives us
> max locked memory       (kbytes, -l) 32
>
> After the change
> rsh nodeX ulimit -a
> max locked memory       (kbytes, -l) 16505400
>
> The nodes have 16 GB of memory.
>
> After the 5.2 update those files are unchanged. We have not yet
> rebooted most of the nodes because of long-running processes, but a
> few nodes have been restarted, and now that jobs are starting to be
> placed on them we are back to a max locked memory of 32 KB rather than
> 16 GB.
>
> The error we are receiving on those jobs is:
>
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
> This will severely limit memory registrations.
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(306).......: Initialization failed
> MPID_Init(113)..............: channel initialization failed
> MPIDI_CH3_Init(167).........:
> MPIDI_CH3I_RDMA_init(138)...:
> rdma_setup_startup_ring(333): cannot create cq
> Fatal error in MPI_Init:
> Other MPI error, error stack:
> MPIR_Init_thread(306).......: Initialization failed
> MPID_Init(113)..............: channel initialization failed
> MPIDI_CH3_Init(167).........:
> MPIDI_CH3I_RDMA_init(138)...:
> rdma_setup_startup_ring(333): cannot create cq
> rank 45 in job 1 nodeX_35175 caused collective abort of all ranks
> exit status of rank 45: return code 1
> rank 44 in job 1 nodeX_35175 caused collective abort of all ranks
> exit status of rank 44: return code 1
>
> The full output of:
>
> rsh nodeX ulimit -a
>
> connect to address x.x.x.x port 544: Connection refused
> Trying krb4 rsh...
> connect to address x.x.x.x port 544: Connection refused
> trying normal rsh (/usr/bin/rsh)
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 135168
> max locked memory       (kbytes, -l) 32
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 10240
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 135168
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> Any ideas, suggestions, or items I could roll back would be
> appreciated. I looked through the list of packages that were updated,
> and the only one I could see that was related was pam. ssh and rsh
> were not updated.
>
> Thank you,
> Rob
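With some nodes rebooted and others not, it may help to compare the "max locked memory" line from each node's `ulimit -a` output mechanically. The sketch below is only an illustration under stated assumptions: it expects text in the format shown in the output above, and the function name `max_locked_kbytes` is hypothetical. How the text is collected from each node (rsh, ssh, or a torque job) is left out.

```python
import re

# Matches the "max locked memory (kbytes, -l) <value>" line of `ulimit -a`,
# tolerating the variable column padding that the shell prints.
_MEMLOCK_LINE = re.compile(r"\s*max locked memory\s+\(kbytes, -l\)\s+(\S+)")


def max_locked_kbytes(ulimit_output):
    """Extract the max locked memory value, in kbytes, from `ulimit -a` text.

    Returns float("inf") when the limit is reported as "unlimited".
    Raises ValueError if the line is missing from the output.
    """
    for line in ulimit_output.splitlines():
        match = _MEMLOCK_LINE.match(line)
        if match:
            value = match.group(1)
            return float("inf") if value == "unlimited" else int(value)
    raise ValueError("no 'max locked memory' line found")


if __name__ == "__main__":
    sample = (
        "pending signals                 (-i) 135168\n"
        "max locked memory       (kbytes, -l) 32\n"
        "max memory size         (kbytes, -m) unlimited\n"
    )
    print(max_locked_kbytes(sample))
```

A node whose value comes back as 32 rather than the 16505400 seen before the update would be one where the limits.conf setting is not being applied.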