Hello,

from time to time we see these messages on our MDS 1.6.5.1 during copying data onto lustre.

Is this just informational or an indicator of a broken setup? Network load problems?

We checked the group rights and they look ok to us. The lustre MDS system including clients runs with YP setup. It seems to us that the message comes from "lustre-1.6.5.1/lustre/utils/l_getgroups.c". But we cannot nail down the real reason.

Thanks and Regards
Heiko

<snip>
Aug 18 14:39:53 mds1 l_getgroups: LONG OP getgrent loop: 25 elapsed, 3 expected
Aug 18 14:39:53 mds1 l_getgroups: LONG OP get_groups_local: 25 elapsed, 10 expected
Aug 18 14:42:04 mds1 l_getgroups: LONG OP getgrent loop: 25 elapsed, 3 expected
Aug 18 14:42:04 mds1 l_getgroups: LONG OP get_groups_local: 25 elapsed, 10 expected
Aug 18 14:50:34 mds1 l_getgroups: LONG OP getgrent loop: 25 elapsed, 3 expected
Aug 18 14:50:34 mds1 l_getgroups: LONG OP get_groups_local: 25 elapsed, 10 expected
Aug 18 14:53:16 mds1 l_getgroups: LONG OP getgrent loop: 25 elapsed, 3 expected
Aug 18 14:53:16 mds1 l_getgroups: LONG OP get_groups_local: 25 elapsed, 10 expected
<snap>

Compilation from source with patched 2.6.24.19 vanilla-kernel.

MDS/MGT:
./configure --disable-liblustre --disable-client --with-linux=/usr/src/linux-2.6.22.19

Client:
./configure --disable-server --disable-liblustre --with-linux=/usr/src/linux-2.6.22.19
On Aug 18, 2008 15:46 +0200, Heiko Schroeter wrote:
> from time to time we see these messages on our MDS 1.6.5.1 during
> copying data onto lustre.
>
> Is this just informational or an indicator of a broken setup ? Network load
> problems ?
>
> We checked the group rights and they look ok to us. The lustre MDS system
> including clients runs with YP setup. It seems to us that the message comes
> from "lustre-1.6.5.1/lustre/utils/l_getgroups.c". But we cannot nail down
> the real reason.
>
> Aug 18 14:39:53 mds1 l_getgroups: LONG OP getgrent loop: 25 elapsed, 3 expected
> Aug 18 14:39:53 mds1 l_getgroups: LONG OP get_groups_local: 25 elapsed, 10 expected

The reason is that it seems YP is taking longer than Lustre has expected it to. You should be able to remove these messages by increasing the timeout in /proc/fs/lustre/mds/{mds}/group_acquire_expire. You might also reduce the load on the YP server by increasing the group cache lifetime by increasing /proc/fs/lustre/mds/{mds}/group_expire_interval (default 600 seconds).

The message itself isn't harmful, just a warning at this point that your name services are taking longer than expected.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
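For readers who want to try the two tunables mentioned above, here is a minimal shell sketch. The tunable names and the /proc/fs/lustre/mds location are taken from the reply; the helper names `show_tunable` and `set_tunable`, the `LUSTRE_MDS_PROC` variable and the example values are only illustrative assumptions, and the right values depend on how slow your name service really is.

```shell
#!/bin/sh
# Sketch only. LUSTRE_MDS_PROC, show_tunable and set_tunable are hypothetical
# names for this example; the tunable file names come from the reply above.
LUSTRE_MDS_PROC="${LUSTRE_MDS_PROC:-/proc/fs/lustre/mds}"

# Print the current value of one tunable for every MDS directory found.
show_tunable() {
    name="$1"
    for f in "$LUSTRE_MDS_PROC"/*/"$name"; do
        [ -e "$f" ] || continue      # no match: the glob stays literal, skip it
        printf '%s = %s\n' "$f" "$(cat "$f")"
    done
}

# Write a new value to one tunable for every MDS directory found.
# On a real MDS this needs root.
set_tunable() {
    name="$1"
    value="$2"
    for f in "$LUSTRE_MDS_PROC"/*/"$name"; do
        [ -e "$f" ] || continue
        echo "$value" > "$f"
    done
}
```

A possible session would be `show_tunable group_acquire_expire` to see the current timeout, then `set_tunable group_acquire_expire 30` and `set_tunable group_expire_interval 1200`. The numbers 30 and 1200 are made up for illustration; check the current values first and raise them from there.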
On Tuesday, 19 August 2008 01:59:09, Andreas Dilger wrote:
> On Aug 18, 2008 15:46 +0200, Heiko Schroeter wrote:
> > from time to time we see these messages on our MDS 1.6.5.1 during
> > copying data onto lustre.
> >
> > Is this just informational or an indicator of a broken setup ? Network
> > load problems ?
> >
> > We checked the group rights and they look ok to us. The lustre MDS system
> > including clients runs with YP setup. It seems to us that the message
> > comes from "lustre-1.6.5.1/lustre/utils/l_getgroups.c". But we cannot
> > nail down the real reason.
> >
> > Aug 18 14:39:53 mds1 l_getgroups: LONG OP getgrent loop: 25 elapsed, 3 expected
> > Aug 18 14:39:53 mds1 l_getgroups: LONG OP get_groups_local: 25 elapsed, 10 expected
>
> The reason is that it seems YP is taking longer than Lustre has expected it
> to. You should be able to remove these messages by increasing the timeout
> in /proc/fs/lustre/mds/{mds}/group_acquire_expire. You might also reduce
> the load on the YP server by increasing the group cache lifetime by
> increasing /proc/fs/lustre/mds/{mds}/group_expire_interval (default 600
> seconds).
>
> The message itself isn't harmful, just a warning at this point that your
> name services are taking longer than expected.

Yep, that seems to be the case. We set up a new YP server which solely serves the Lustre system, and the messages disappeared.

Not only that. We had a problem where the client hung when we started a 'du' or 'ls -la' on the Lustre file system while copying data onto it:
https://bugzilla.lustre.org/show_bug.cgi?id=16384
That problem disappeared as well with the new YP server.

Thanks for your help!
Heiko

> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
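A rough way to check from the MDS whether group enumeration through the name service is slow, as it turned out to be in this thread, is to time a full walk of the group database. The sketch below is only an illustration: `elapsed_seconds` is a hypothetical helper, and it is an assumption that its whole-second result is comparable to the "elapsed" figures in the l_getgroups log lines.

```shell
#!/bin/sh
# Sketch only. elapsed_seconds is a hypothetical helper for this example:
# it runs the given command, discards its output, and prints the elapsed
# wall-clock time in whole seconds.
elapsed_seconds() {
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}
```

For example, `elapsed_seconds getent group` walks every group entry through the configured name services (YP in this setup). If that regularly takes many seconds while data is being copied, the name service is the likely bottleneck rather than Lustre itself.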