Hey everyone,

I had another instance of the client kernel panic which I first encountered a few months ago. This time I managed to get a shot of the console. Attached is the dmesg output from ssn1 (OSS) and dbn1 (MDS), and the JPG is from the console of wsn1 (client).

The only thing out of the ordinary that was happening at the time was that I was in the middle of updating several vhosts' WordPress blogs (lots of tiny files?) - just a for-loop copying contents from one dir (off the Lustre filesystem) to a dir on the filesystem. It hung around the 200-300th iteration. Though I can't be sure that had anything to do with it, it's the only thing that I was doing on the server. Normal operation on the filesystem is Apache and nothing else.

Thanks for any help with this. Hopefully I got enough info this time, though let me know if you need anything else.

Cheers,
-Nick

--
Nick Jennings
Director of Technology
Creative Motion Design
www.creativemotiondesign.com

Attachments:
img00240.jpg (image/jpeg, 794746 bytes): http://lists.lustre.org/pipermail/lustre-discuss/attachments/20091221/b9c158e3/attachment-0001.jpg
dbn1_dmesg.txt.gz (application/gzip, 9097 bytes): http://lists.lustre.org/pipermail/lustre-discuss/attachments/20091221/b9c158e3/attachment-0002.bin
ssn1_dmesg.txt.gz (application/gzip, 11322 bytes): http://lists.lustre.org/pipermail/lustre-discuss/attachments/20091221/b9c158e3/attachment-0003.bin
Brian J. Murrell
2009-Dec-21 18:36 UTC
[Lustre-discuss] lustre 1.6.7.2 client kernel panic
On Mon, 2009-12-21 at 19:15 +0100, Nick Jennings wrote:
> I had another instance of the client kernel panic which I first
> encountered a few months ago. This time I managed to get a shot of the
> console.

Perhaps one of the more expert engineers will recognize the problem from that partial stack trace in the JPG, but given that the most important part of a stack trace (the top, if one can only get part of it) is cut off, I'm not hopeful.

Photographs of 25-line console screens are rarely suitable substitutes for real console logging, unfortunately. Seriously, if you really want to pursue this issue, you are going to have to set up some form of console logging. I think netconsole is usually fairly successful at capturing kernel oops dumps. Maybe that's an option. ISTR mentioning netconsole the last time though. Maybe that was another thread.

b.
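For anyone following the netconsole suggestion, here is a minimal sketch of how the module parameter is put together. The parameter layout follows the kernel's netconsole documentation; the helper name `netconsole_param` and every address, port, interface, and MAC in the usage note are illustrative assumptions, not values from this thread.

```shell
#!/bin/sh
# Sketch only: builds the netconsole= module parameter string.
# Layout per the kernel netconsole docs:
#   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
# netconsole_param is a hypothetical helper name.
netconsole_param() {
    src_port=$1 src_ip=$2 dev=$3
    tgt_port=$4 tgt_ip=$5 tgt_mac=$6
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' \
        "$src_port" "$src_ip" "$dev" "$tgt_port" "$tgt_ip" "$tgt_mac"
}
```

On the panicking client this would be used roughly as `modprobe netconsole "$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 00:11:22:33:44:55)"`, with a UDP listener on the receiving host (for example `nc -u -l 6666`, though netcat option syntax differs between variants) writing to a file. The receiver should be a machine that does not itself depend on the Lustre client.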
On 2009-12-21, at 11:15, Nick Jennings wrote:
> I had another instance of the client kernel panic which I first
> encountered a few months ago. This time I managed to get a shot of the
> console. Attached is the dmesg output from ssn1(OSS) dbn1(MDS) and the
> JPG is from the console of wsn1(client).

I see bug 19841, which has at least part of this stack (ldlm_cli_pool_shrink), and that is marked a duplicate of 17614. The latter bug is marked landed for 1.8.0 and later releases.

> The only thing out of the ordinary that was happening at the time was I
> was in the middle of updating several vhosts' wordpress blog (lots of
> tiny files?) - just a for-loop copying contents from one dir (off the
> lustre filesystem) to a dir on the filesystem. It hung around the
> 200-300th time, though I can't be sure that had anything to do with it,
> it's the only thing that I was doing on the server. Normal operation on
> the filesystem is Apache & nothing else.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Monday 21 December 2009, Andreas Dilger wrote:
> On 2009-12-21, at 11:15, Nick Jennings wrote:
> > I had another instance of the client kernel panic which I first
> > encountered a few months ago. This time I managed to get a shot of the
> > console. Attached is the dmesg output from ssn1(OSS) dbn1(MDS) and the
> > JPG is from the console of wsn1(client).
>
> I see bug 19841, which has at least part of this stack
> (ldlm_cli_pool_shrink) and that is marked a duplicate of 17614. The
> latter bug is marked landed for 1.8.0 and later releases.

Nick, if you do not want to upgrade or patch your Lustre version, the workaround for this is to disable lockless truncates.

    # on all clients
    for i in /proc/fs/lustre/llite/*; do
        echo 0 > "${i}/lockless_truncate"
    done

Cheers,
Bernd

--
Bernd Schubert
DataDirect Networks
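The workaround above can also be wrapped in a small function that skips entries without a writable `lockless_truncate` file and prints the value it reads back. This is only a sketch: the function name `disable_lockless_truncate` and the optional base-directory argument are assumptions added for illustration (the argument makes a dry run against a scratch directory possible); the default path is the one from Bernd's loop.

```shell
#!/bin/sh
# Sketch only: same effect as the loop above, with a check and a read-back.
# disable_lockless_truncate and its optional argument are hypothetical;
# the default path is the one given in the thread.
disable_lockless_truncate() {
    base=${1:-/proc/fs/lustre/llite}
    for dir in "$base"/*; do
        f=$dir/lockless_truncate
        # Skip mounts (or an unmatched glob) that lack a writable tunable.
        [ -w "$f" ] || continue
        echo 0 > "$f"
        # Read the value back so the result is visible per mount.
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}
```

Run as root on each client with no argument: `disable_lockless_truncate`. As with other /proc tunables, the setting presumably does not survive a remount or reboot, so it would need to be reapplied, for instance from an init script.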
Thanks for this tip, Bernd. I'll be unable to upgrade for a while, so this is a very useful workaround. Does it have any drawbacks I should be aware of?

On 12/22/2009 12:35 AM, Bernd Schubert wrote:
> On Monday 21 December 2009, Andreas Dilger wrote:
>> On 2009-12-21, at 11:15, Nick Jennings wrote:
>>> I had another instance of the client kernel panic which I first
>>> encountered a few months ago. This time I managed to get a shot of the
>>> console. Attached is the dmesg output from ssn1(OSS) dbn1(MDS) and the
>>> JPG is from the console of wsn1(client).
>>
>> I see bug 19841, which has at least part of this stack
>> (ldlm_cli_pool_shrink) and that is marked a duplicate of 17614. The
>> latter bug is marked landed for 1.8.0 and later releases.
>
> Nick, if you do not want to upgrade or patch your Lustre version, the
> workaround for this is to disable lockless truncates.
>
> # on all clients
> for i in /proc/fs/lustre/llite/*; do
>     echo 0 > ${i}/lockless_truncate;
> done
>
> Cheers,
> Bernd

--
Nick Jennings
Director of Technology
Creative Motion Design
www.creativemotiondesign.com
Hello Nick,

At least, I'm not aware of any drawbacks.

Cheers,
Bernd

On Tuesday 22 December 2009, Nick Jennings wrote:
> Thanks for this tip Bernd. I'll be unable to upgrade for a while, so
> this is a very useful workaround. Does it have any drawbacks I should be
> aware of?
>
> On 12/22/2009 12:35 AM, Bernd Schubert wrote:
> > On Monday 21 December 2009, Andreas Dilger wrote:
> >> On 2009-12-21, at 11:15, Nick Jennings wrote:
> >>> I had another instance of the client kernel panic which I first
> >>> encountered a few months ago. This time I managed to get a shot of the
> >>> console. Attached is the dmesg output from ssn1(OSS) dbn1(MDS) and the
> >>> JPG is from the console of wsn1(client).
> >>
> >> I see bug 19841, which has at least part of this stack
> >> (ldlm_cli_pool_shrink) and that is marked a duplicate of 17614. The
> >> latter bug is marked landed for 1.8.0 and later releases.
> >
> > Nick, if you do not want to upgrade or patch your Lustre version, the
> > workaround for this is to disable lockless truncates.
> >
> > # on all clients
> > for i in /proc/fs/lustre/llite/*; do
> >     echo 0 > ${i}/lockless_truncate;
> > done
> >
> > Cheers,
> > Bernd

--
Bernd Schubert
DataDirect Networks
On 12/21/2009 07:36 PM, Brian J. Murrell wrote:
> Photographs of 25 line console screens are not very often suitable
> substitutes for real console logging, unfortunately. Seriously, if you
> really want to pursue this issue, you are going to have to set up some
> form of console logging. I think netconsole is usually fairly
> successful at capturing kernel oops dumps. Maybe that's an option.
> ISTR mentioning netconsole the last time though. Maybe that was another
> thread.

You're right, I just hadn't gotten around to getting netconsole set up like I planned. *blush* :)

--
Nick Jennings
Director of Technology
Creative Motion Design
www.creativemotiondesign.com
On Tuesday 22 December 2009, Nick Jennings wrote:
> On 12/21/2009 07:36 PM, Brian J. Murrell wrote:
> > Photographs of 25 line console screens are not very often suitable
> > substitutes for real console logging, unfortunately. Seriously, if you
> > really want to pursue this issue, you are going to have to set up some
> > form of console logging. I think netconsole is usually fairly
> > successful at capturing kernel oops dumps. Maybe that's an option.
> > ISTR mentioning netconsole the last time though. Maybe that was another
> > thread.
>
> You're right, I just hadn't gotten around to getting netconsole set up
> like I planned. *blush* :)

Most servers nowadays have IPMI, and IPMI SOL (serial over LAN) is much better.

Cheers,
Bernd
On Tue, 2009-12-22 at 18:09 +0100, Bernd Schubert wrote:
> On Tuesday 22 December 2009, Nick Jennings wrote:
> > On 12/21/2009 07:36 PM, Brian J. Murrell wrote:
> > > Photographs of 25 line console screens are not very often suitable
> > > substitutes for real console logging, unfortunately. Seriously, if you
> > > really want to pursue this issue, you are going to have to set up some
> > > form of console logging. I think netconsole is usually fairly
> > > successful at capturing kernel oops dumps. Maybe that's an option.
> > > ISTR mentioning netconsole the last time though. Maybe that was another
> > > thread.
> >
> > You're right, I just hadn't gotten around to getting netconsole set up
> > like I planned. *blush* :)
>
> Most servers nowadays have IPMI and an IPMI SOL is much better.

Heh, I'd like to know what servers you are running. Our experience with IPMI SOL on a variety of systems has been anything but reliable. It has a notorious habit of dropping out under any sort of load, such as during an oops, where you need it the most.

It's still better than nothing, but it's a crapshoot.

--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
On Tuesday 22 December 2009, David Dillow wrote:
> On Tue, 2009-12-22 at 18:09 +0100, Bernd Schubert wrote:
> > On Tuesday 22 December 2009, Nick Jennings wrote:
> > > On 12/21/2009 07:36 PM, Brian J. Murrell wrote:
> > > > Photographs of 25 line console screens are not very often suitable
> > > > substitutes for real console logging, unfortunately. Seriously, if
> > > > you really want to pursue this issue, you are going to have to set up
> > > > some form of console logging. I think netconsole is usually fairly
> > > > successful at capturing kernel oops dumps. Maybe that's an option.
> > > > ISTR mentioning netconsole the last time though. Maybe that was
> > > > another thread.
> > >
> > > You're right, I just hadn't gotten around to getting netconsole set up
> > > like I planned. *blush* :)
> >
> > Most servers nowadays have IPMI and an IPMI SOL is much better.
>
> Heh, I'd like to know what servers you are running. Our experience with
> IPMI SOL on a variety of systems has been anything but reliable. It has
> a notorious habit of dropping out under any sort of load, such as during
> an oops where you need it the most.
>
> It's still better than nothing, but it's a crapshoot.

Yes, I know about IPMI issues, of course. In my experience, SuperMicro IPMI with an additional NIC port works perfectly. I don't know about their most recent mainboards and BMCs, though.

The first week after I started at DDN, I learned that the Dell DRAC5 has a bug and does not send a break (sysrq). According to Dell, this is fixed in their recent firmware released 7 days ago (I opened the 'priority' call on March 6th), but I have not been able to check yet.

HP iLO also works rather well, although not with SOL but with their built-in "vsp". The problem with vsp is that the cursor keys do not work, so navigating through the grub menu is a pain unless you know the emacs shortcuts inside and out (I'm a vi user...).

Cheers,
Bernd

--
Bernd Schubert
DataDirect Networks
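For the IPMI SOL route discussed above, a capture session might look like the following sketch. It assumes `ipmitool` is installed and that the BMC speaks the `lanplus` interface; the function name `sol_capture`, the `IPMITOOL` override variable, and the host, user, and log path in the usage note are all hypothetical. Given the reliability concerns raised in the thread, this is a complement to netconsole rather than a guaranteed capture.

```shell
#!/bin/sh
# Sketch only. Assumes ipmitool is installed and the BMC supports lanplus.
# IPMITOOL is a hypothetical override hook, useful for dry runs.
IPMITOOL=${IPMITOOL:-ipmitool}

# Attach to a BMC's serial-over-LAN console and append all output to a log.
# -E makes ipmitool read the password from the IPMI_PASSWORD environment
# variable instead of taking it on the command line.
sol_capture() {
    bmc_host=$1 bmc_user=$2 logfile=$3
    "$IPMITOOL" -I lanplus -H "$bmc_host" -U "$bmc_user" -E sol activate \
        | tee -a "$logfile"
}
```

Usage would be along the lines of `IPMI_PASSWORD=... sol_capture wsn1-bmc.example.org admin /var/log/wsn1-console.log`, left running in a screen or tmux session on a separate machine so the log survives the client's panic.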