Hi;

We have a Lustre filesystem that had been quite stable since June 2008 on a 200-node cluster, until three weeks ago. Since then the OSS kernel panics have escalated and now occur about every 2 hours.

The MDT/MGS is on an x86_64 server with 8G memory and 2 dual-core AMD procs.
The OSS is on an x86_64 server with 8G memory and 2 dual-core AMD procs.
One OST, RAID 6, ~9TB (I know it is larger than currently tested), at 58%.
Lustre 1.6.4.2

I decreased the threads to 256 and then to 128, thinking the storage was oversubscribed, but the kernel panics continue. The storage has no errors in its logs, and I have done an fsck with no filesystem issues detected. We do have an average of ~35 Gaussian programs running, which is heavy I/O, but collectl does not show any system stress before the panic. The console shows a few messages about brw_writes and OST timeouts.

I am attaching the syslog messages from just before one of the kernel panics and the one Lustre dump that has data.

If anyone has any thoughts, I would appreciate it.

Denise

[Attachments scrubbed by the list archive:
 messages (161948 bytes) - http://lists.lustre.org/pipermail/lustre-discuss/attachments/20081203/d1dbe851/attachment-0002.obj
 lustre-log.1228334804.4054 (904990 bytes) - http://lists.lustre.org/pipermail/lustre-discuss/attachments/20081203/d1dbe851/attachment-0003.obj]
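P.S. For reference, the way I lowered the OST thread count was via a module option on the OSS followed by a remount of the OST. The snippet below is approximate -- the parameter name is the one given in the 1.6 operations manual, and 128 is simply the last value I tried:

    # /etc/modprobe.conf on the OSS (Lustre 1.6.x)
    # Cap the number of OSS service threads; the new value takes effect
    # the next time the OST is mounted.
    options ost oss_num_threads=128

The right number presumably depends on how many concurrent I/Os the RAID 6 back end can actually absorb, which is what I am trying to find out.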
On Wed, 2008-12-03 at 19:30 -0700, Hummel, Denise wrote:
> Hi;

I've only had a chance to take the quickest of peeks at this...

> We have a lustre filesystem that has been pretty stable since June 2008 on a 200 node
> cluster until three weeks ago. The OSS kernel panic has escalated since then to now about
> every 2 hours.

Those are not "panics". A kernel panic is a very particular thing, and what you are seeing is not that. What you are seeing is watchdog timers firing. I notice that they mostly (all?) seem to be in ldiskfs code paths, and at the end of the messages there are a bunch of these:

Dec 3 13:07:36 oss1 kernel: Lustre: 3990:0:(lustre_fsfilt.h:240:fsfilt_brw_start_log()) lustre-OST0000: slow journal start 112s
Dec 3 13:07:36 oss1 kernel: Lustre: 3942:0:(filter_io_26.c:711:filter_commitrw_write()) lustre-OST0000: slow brw_start 36s
Dec 3 13:07:36 oss1 kernel: Lustre: 3947:0:(lustre_fsfilt.h:205:fsfilt_start_log()) lustre-OST0000: slow journal start 128s
Dec 3 13:07:36 oss1 kernel: Lustre: 3947:0:(watchdog.c:312:lcw_update_time()) Expired watchdog for pid 3947 disabled after 128.2092s
Dec 3 13:07:36 oss1 kernel: Lustre: 3988:0:(lustre_fsfilt.h:296:fsfilt_commit_wait()) lustre-OST0000: slow journal start 150s
Dec 3 13:07:36 oss1 kernel: Lustre: 3988:0:(filter_io_26.c:776:filter_commitrw_write()) lustre-OST0000: slow commitrw commit 150s
Dec 3 13:07:36 oss1 kernel: Lustre: 4035:0:(filter_io_26.c:763:filter_commitrw_write()) lustre-OST0000: slow direct_io 31s
Dec 3 13:07:36 oss1 kernel: Lustre: 4053:0:(filter_io_26.c:698:filter_commitrw_write()) lustre-OST0000: slow i_mutex 150s
Dec 3 13:07:36 oss1 kernel: Lustre: 4000:0:(filter_io_26.c:763:filter_commitrw_write()) lustre-OST0000: slow direct_io 132s
Dec 3 13:07:36 oss1 kernel: Lustre: 4000:0:(filter_io_26.c:763:filter_commitrw_write()) Skipped 10 previous similar messages
Dec 3 13:07:36 oss1 kernel: Lustre: 4054:0:(filter_io_26.c:776:filter_commitrw_write()) lustre-OST0000: slow commitrw commit 151s

which means your storage is too slow for the load that the OSS is putting on it.

> I decreased the threads to 256 then 128 thinking the storage was oversubscribed

Your instinct was right: this is certainly a symptom of an oversubscribed back end, although it can indicate other problems as well.

> The storage has no errors in the logs.

Hrm. That was going to be my next question. This symptom can also describe a back-end storage system that has "slowed down", or a load that has gone up over time. Perhaps the storage has always been oversubscribed but was just never taxed hard enough for the symptom to show.

Did you ever do any iokit benchmarking of your storage before you put Lustre on it? I hope so, because that gives you a baseline: you can do another obdfilter-survey run now and compare the two to see how they measure up.

Even if you didn't do a baseline obdfilter-survey run before you started, doing one now will help you tune the number of OST threads you can use before you enter the realm of diminishing returns. The alternative, of course, is to keep binary-searching for your "sweet spot". If you choose the latter, once you have found the number of OST threads you can run with before hitting too many "slow" messages and watchdog timeouts, you can do some benchmarking to see whether your performance is what you would expect given your storage interconnect and hardware. If not, you will need to start trying to figure out why.

b.
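P.S. In case it saves you some digging, a disk-case obdfilter-survey run against your single OST looks roughly like the one below. The thread and object ranges are only an example, and the variable names should be double-checked against the README in the iokit version you actually have:

    # Run on the OSS against the obdfilter instance directly; this exercises
    # the OST back end and journal with no client or network involvement,
    # which is exactly the path the "slow journal start" messages point at.
    # Best done while the filesystem is otherwise quiet.
    size=1024 nobjlo=1 nobjhi=32 thrlo=16 thrhi=256 \
        targets="lustre-OST0000" ./obdfilter-survey

Watching where the write bandwidth flattens out (or where the "slow" messages reappear) as the thread count climbs gives you a defensible OST thread setting rather than a guess.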
On Dec 03, 2008 19:30 -0700, Hummel, Denise wrote:
> We have a lustre filesystem that has been pretty stable since June 2008 on
> a 200 node cluster until three weeks ago. The OSS kernel panic has
> escalated since then to now about every 2 hours.
> The MDT/MGS is on a x86_64 server with 8G memory and 2 dual core AMD procs
> The OSS is on a x86_64 server with 8G memory and 2 dual core AMD procs
> One OST raid 6 ~9TB (I know it is larger than currently tested) - at 58%

Running with OSTs > 8TB exposes you to filesystem corruption.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Thursday 04 December 2008, Andreas Dilger wrote:
> On Dec 03, 2008 19:30 -0700, Hummel, Denise wrote:
> > We have a lustre filesystem that has been pretty stable since June 2008
> > on a 200 node cluster until three weeks ago. The OSS kernel panic has
> > escalated since then to now about every 2 hours.
> > The MDT/MGS is on a x86_64 server with 8G memory and 2 dual core AMD
> > procs The OSS is on a x86_64 server with 8G memory and 2 dual core AMD
> > procs One OST raid 6 ~9TB (I know it is larger than currently tested) -
> > at 58%
>
> Running with OSTs > 8TB exposes you to filesystem corruption.
>
> Cheers, Andreas

Wouldn't it be an idea then to warn/refuse during mkfs.lustre?

/Peter
On Dec 08, 2008 20:40 +0100, Peter Kjellstrom wrote:
> On Thursday 04 December 2008, Andreas Dilger wrote:
> > Running with OSTs > 8TB exposes you to filesystem corruption.
>
> Wouldn't it be an idea then to warn/refuse during mkfs.lustre?

Yes, this should be added. In the past e2fsprogs itself would refuse to create > 8TB filesystems, but that limit was removed from the upstream e2fsprogs for ext4, and we didn't add a corresponding restriction to mkfs.lustre when it went away.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
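P.S. Until mkfs.lustre grows such a check, a manual guard in front of the format step can catch the case. This is only a sketch: the device path and MGS NID are placeholders, and it treats the 8TB limit as 8 TiB:

    #!/bin/sh
    # Refuse to format an OST device larger than the tested 8TB limit.
    DEV=/dev/sdb1                                   # placeholder OST device
    LIMIT=$((8 * 1024 * 1024 * 1024 * 1024))        # 8 TiB in bytes
    SIZE=$(blockdev --getsize64 "$DEV")             # device size in bytes
    if [ "$SIZE" -gt "$LIMIT" ]; then
        echo "$DEV is $SIZE bytes, over the tested 8TB limit; not formatting" >&2
        exit 1
    fi
    mkfs.lustre --fsname=lustre --ost --mgsnode=192.168.0.10@tcp0 "$DEV"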