Robin Humble
2007-Feb-10 18:07 UTC
[Lustre-discuss] LBUG ASSERTION(thread->t_id < OSS_THREADS_MAX) in 1.5.97
now onto the next problem :) setup is a single ~1TB software raid0 OST on ''xe'', an MDT/MGS on ''x19'' with SATA disk, and 16 other client nodes mounting the Lustre filesystem and doing bonnie++ over o2ib. all standard Lustre rpms, x86_64 CentOS4.4. IIRC the variable number of threads is a new feature in 1.5.97? looks like there''s a small bug in it... it ends up with 512 ll_ost_io_* threads, and the last is hung in a D state with no i/o on the filesystem. ie. % ps auxw ... root 7488 0.0 0.0 0 0 ? S 19:20 0:00 [ll_ost_io_511] root 7489 0.0 0.0 0 0 ? D 19:20 0:00 [ll_ost_io_512] ... after the i/o: num threads name ----------- ---- 5 ll_log_comt_* 8 ldlm_bl_* 8 ldlm_cb_* 15 ldlm_cn_* 304 ll_ost_* 512 ll_ost_io_* the /tmp/lustre-log.* files are kinda huge. they''re at http://www.cita.utoronto.ca/~rjh/lustre/ there''s no LBUG if I restrict the number of threads with eg: options ost oss_num_threads=300 let me know if you''d like more info or to try patches etc. cheers, robin Feb 10 19:20:41 xe kernel: LustreError: 7489:0:(ost_handler.c:1555:ost_thread_init()) ASSERTION(thread->t_id < OSS_THREADS_MAX) failed Feb 10 19:20:41 xe kernel: LustreError: 7489:0:(tracefile.c:433:libcfs_assertion_failed()) LBUG Feb 10 19:20:41 xe kernel: Lustre: 7489:0:(linux-debug.c:166:libcfs_debug_dumpstack()) showing stack for process 7489 Feb 10 19:20:41 xe kernel: ll_ost_io_512 R running task 0 7489 1 7488 (L-TLB) Feb 10 19:20:41 xe kernel: 000001013edf7ee8 000001011b21e070 000001018e064280 000001013edf7db8 Feb 10 19:20:41 xe kernel: 0000010037d01600 0000000000000200 0000000000000000 0000000000000000 Feb 10 19:20:41 xe kernel: 0000000000000459 00000000ffffffff Feb 10 19:20:41 xe kernel: Call Trace:<ffffffffa03ca547>{:ptlrpc:ptlrpc_main+0} <ffffffff80110e1b>{child_rip+0} Feb 10 19:20:41 xe kernel: Feb 10 19:20:41 xe kernel: LustreError: dumping log to /tmp/lustre-log.1171095641.7489 Feb 10 19:20:42 xe kernel: Lustre: 7489:0:(linux-debug.c:98:libcfs_run_upcall()) Invoked LNET upcall /usr/lib/lustre/lnet_upcall LBUG,/scratch/lbuild-boulder/lbuild-v1_5_97_3-2.6-rhel4-x86_64/lbuild/BUILD/lustre-1.5.97/lnet/libcfs/tracefile.c,libcfs_assertion_failed,433 Feb 10 19:22:21 xe kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 6797: it was inactive for 100000ms Feb 10 19:22:21 xe kernel: Lustre: 0:0:(linux-debug.c:166:libcfs_debug_dumpstack()) showing stack for process 6797 Feb 10 19:22:21 xe kernel: ll_ost_io_231 S 0000000000000202 0 6797 1 6798 6796 (L-TLB) Feb 10 19:22:21 xe kernel: 000001013edf7d78 0000000000000046 000001018e064280 0000010100000074 Feb 10 19:22:21 xe kernel: 000001013edf7ee8 0000000080110dbe 0000010001071aa0 000000011b21e070 Feb 10 19:22:21 xe kernel: 000001013d4ab800 000000000000124e Feb 10 19:22:21 xe kernel: Call Trace:<ffffffffa03ca3c5>{:ptlrpc:ptlrpc_start_thread+1522} Feb 10 19:22:21 xe kernel: <ffffffff801331a5>{default_wake_function+0} <ffffffffa03c858d>{:ptlrpc:ptlrpc_server_free_request+34} Feb 10 19:22:21 xe kernel: <ffffffff80115d7b>{do_gettimeofday+77} <ffffffffa03cacec>{:ptlrpc:ptlrpc_main+1957} Feb 10 19:22:21 xe kernel: <ffffffffa03c9a47>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa03c9a47>{:ptlrpc:ptlrpc_retry_rqbds+0} Feb 10 19:22:21 xe kernel: <ffffffffa03c9a47>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffff80110e23>{child_rip+8} Feb 10 19:22:21 xe kernel: <ffffffffa03ca547>{:ptlrpc:ptlrpc_main+0} <ffffffff80110e1b>{child_rip+0} Feb 10 19:22:21 xe kernel: Feb 10 19:22:21 xe kernel: LustreError: dumping log to /tmp/lustre-log.1171095741.6797 Feb 10 19:29:26 x19 kernel: Lustre: 4761:0:(lustre_fsfilt.h:284:fsfilt_setattr()) testfs-MDT0000: slow setattr 30s Feb 10 19:29:26 x19 kernel: Lustre: 4761:0:(lustre_fsfilt.h:284:fsfilt_setattr()) Skipped 1 previous similar message Feb 10 19:34:29 x19 kernel: Lustre: 4756:0:(lustre_fsfilt.h:284:fsfilt_setattr()) testfs-MDT0000: slow setattr 31s Feb 10 19:34:29 x19 kernel: Lustre: 4777:0:(lustre_fsfilt.h:284:fsfilt_setattr()) testfs-MDT0000: slow setattr 31s Feb 10 19:34:29 x19 kernel: Lustre: 4783:0:(lustre_fsfilt.h:284:fsfilt_setattr()) testfs-MDT0000: slow setattr 31s Feb 10 19:34:30 x19 kernel: Lustre: 4783:0:(lustre_fsfilt.h:284:fsfilt_setattr()) Skipped 5 previous similar messages Feb 10 19:34:30 x19 kernel: Lustre: 4756:0:(lustre_fsfilt.h:284:fsfilt_setattr()) Skipped 2 previous similar messages Feb 10 19:40:27 x19 kernel: Lustre: 4756:0:(lustre_fsfilt.h:284:fsfilt_setattr()) testfs-MDT0000: slow setattr 38s Feb 10 19:40:27 x19 kernel: Lustre: 4786:0:(lustre_fsfilt.h:284:fsfilt_setattr()) testfs-MDT0000: slow setattr 38s Feb 10 19:40:27 x19 kernel: Lustre: 4756:0:(lustre_fsfilt.h:284:fsfilt_setattr()) Skipped 4 previous similar messages
Nathaniel Rutman
2007-Feb-12 10:01 UTC
[Lustre-discuss] LBUG ASSERTION(thread->t_id < OSS_THREADS_MAX) in 1.5.97
Robin Humble wrote:> now onto the next problem :) > > setup is a single ~1TB software raid0 OST on ''xe'', an MDT/MGS on ''x19'' > with SATA disk, and 16 other client nodes mounting the Lustre filesystem > and doing bonnie++ over o2ib. all standard Lustre rpms, x86_64 CentOS4.4. > > IIRC the variable number of threads is a new feature in 1.5.97? > looks like there''s a small bug in it... > >Yes. Thanks, I''ll take a look.> it ends up with 512 ll_ost_io_* threads, and the last is hung in a D state > with no i/o on the filesystem. ie. > % ps auxw > ... > root 7488 0.0 0.0 0 0 ? S 19:20 0:00 [ll_ost_io_511] > root 7489 0.0 0.0 0 0 ? D 19:20 0:00 [ll_ost_io_512] > ... > > after the i/o: > num threads name > ----------- ---- > 5 ll_log_comt_* > 8 ldlm_bl_* > 8 ldlm_cb_* > 15 ldlm_cn_* > 304 ll_ost_* > 512 ll_ost_io_* > > the /tmp/lustre-log.* files are kinda huge. they''re at > http://www.cita.utoronto.ca/~rjh/lustre/ > > there''s no LBUG if I restrict the number of threads with eg: > options ost oss_num_threads=300 >This effectively disables the dynamic thread creation, so it may not be just the number of threads that matters.> let me know if you''d like more info or to try patches etc. > > cheers, > robin > > Feb 10 19:20:41 xe kernel: LustreError: 7489:0:(ost_handler.c:1555:ost_thread_init()) ASSERTION(thread->t_id < OSS_THREADS_MAX) failed > Feb 10 19:20:41 xe kernel: LustreError: 7489:0:(tracefile.c:433:libcfs_assertion_failed()) LBUG > Feb 10 19:20:41 xe kernel: Lustre: 7489:0:(linux-debug.c:166:libcfs_debug_dumpstack()) showing stack for process 7489 > Feb 10 19:20:41 xe kernel: ll_ost_io_512 R running task 0 7489 1 7488 (L-TLB) > Feb 10 19:20:41 xe kernel: 000001013edf7ee8 000001011b21e070 000001018e064280 000001013edf7db8 > Feb 10 19:20:41 xe kernel: 0000010037d01600 0000000000000200 0000000000000000 0000000000000000 > Feb 10 19:20:41 xe kernel: 0000000000000459 00000000ffffffff > Feb 10 19:20:41 xe kernel: Call Trace:<ffffffffa03ca547>{:ptlrpc:ptlrpc_main+0} <ffffffff80110e1b>{child_rip+0} > Feb 10 19:20:41 xe kernel: > Feb 10 19:20:41 xe kernel: LustreError: dumping log to /tmp/lustre-log.1171095641.7489 > Feb 10 19:20:42 xe kernel: Lustre: 7489:0:(linux-debug.c:98:libcfs_run_upcall()) Invoked LNET upcall /usr/lib/lustre/lnet_upcall LBUG,/scratch/lbuild-boulder/lbuild-v1_5_97_3-2.6-rhel4-x86_64/lbuild/BUILD/lustre-1.5.97/lnet/libcfs/tracefile.c,libcfs_assertion_failed,433 > Feb 10 19:22:21 xe kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 6797: it was inactive for 100000ms > Feb 10 19:22:21 xe kernel: Lustre: 0:0:(linux-debug.c:166:libcfs_debug_dumpstack()) showing stack for process 6797 > Feb 10 19:22:21 xe kernel: ll_ost_io_231 S 0000000000000202 0 6797 1 6798 6796 (L-TLB) > Feb 10 19:22:21 xe kernel: 000001013edf7d78 0000000000000046 000001018e064280 0000010100000074 > Feb 10 19:22:21 xe kernel: 000001013edf7ee8 0000000080110dbe 0000010001071aa0 000000011b21e070 > Feb 10 19:22:21 xe kernel: 000001013d4ab800 000000000000124e > Feb 10 19:22:21 xe kernel: Call Trace:<ffffffffa03ca3c5>{:ptlrpc:ptlrpc_start_thread+1522} > Feb 10 19:22:21 xe kernel: <ffffffff801331a5>{default_wake_function+0} <ffffffffa03c858d>{:ptlrpc:ptlrpc_server_free_request+34} > Feb 10 19:22:21 xe kernel: <ffffffff80115d7b>{do_gettimeofday+77} <ffffffffa03cacec>{:ptlrpc:ptlrpc_main+1957} > Feb 10 19:22:21 xe kernel: <ffffffffa03c9a47>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa03c9a47>{:ptlrpc:ptlrpc_retry_rqbds+0} > Feb 10 19:22:21 xe kernel: <ffffffffa03c9a47>{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffff80110e23>{child_rip+8} > Feb 10 19:22:21 xe kernel: <ffffffffa03ca547>{:ptlrpc:ptlrpc_main+0} <ffffffff80110e1b>{child_rip+0} > Feb 10 19:22:21 xe kernel: > Feb 10 19:22:21 xe kernel: LustreError: dumping log to /tmp/lustre-log.1171095741.6797 > Feb 10 19:29:26 x19 kernel: Lustre: 4761:0:(lustre_fsfilt.h:284:fsfilt_setattr()) testfs-MDT0000: slow setattr 30s > Feb 10 19:29:26 x19 kernel: Lustre: 4761:0:(lustre_fsfilt.h:284:fsfilt_setattr()) Skipped 1 previous similar message > Feb 10 19:34:29 x19 kernel: Lustre: 4756:0:(lustre_fsfilt.h:284:fsfilt_setattr()) testfs-MDT0000: slow setattr 31s > Feb 10 19:34:29 x19 kernel: Lustre: 4777:0:(lustre_fsfilt.h:284:fsfilt_setattr()) testfs-MDT0000: slow setattr 31s > Feb 10 19:34:29 x19 kernel: Lustre: 4783:0:(lustre_fsfilt.h:284:fsfilt_setattr()) testfs-MDT0000: slow setattr 31s > Feb 10 19:34:30 x19 kernel: Lustre: 4783:0:(lustre_fsfilt.h:284:fsfilt_setattr()) Skipped 5 previous similar messages > Feb 10 19:34:30 x19 kernel: Lustre: 4756:0:(lustre_fsfilt.h:284:fsfilt_setattr()) Skipped 2 previous similar messages > Feb 10 19:40:27 x19 kernel: Lustre: 4756:0:(lustre_fsfilt.h:284:fsfilt_setattr()) testfs-MDT0000: slow setattr 38s > Feb 10 19:40:27 x19 kernel: Lustre: 4786:0:(lustre_fsfilt.h:284:fsfilt_setattr()) testfs-MDT0000: slow setattr 38s > Feb 10 19:40:27 x19 kernel: Lustre: 4756:0:(lustre_fsfilt.h:284:fsfilt_setattr()) Skipped 4 previous similar messages > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss@clusterfs.com > https://mail.clusterfs.com/mailman/listinfo/lustre-discuss > >
Nathaniel Rutman
2007-Feb-13 13:55 UTC
[Lustre-discuss] LBUG ASSERTION(thread->t_id < OSS_THREADS_MAX) in 1.5.97
Robin Humble wrote:> now onto the next problem :) > > setup is a single ~1TB software raid0 OST on ''xe'', an MDT/MGS on ''x19'' > with SATA disk, and 16 other client nodes mounting the Lustre filesystem > and doing bonnie++ over o2ib. all standard Lustre rpms, x86_64 CentOS4.4. > > IIRC the variable number of threads is a new feature in 1.5.97? > looks like there''s a small bug in it... > > it ends up with 512 ll_ost_io_* threads, and the last is hung in a D state > with no i/o on the filesystem. ie. > % ps auxw > ... > root 7488 0.0 0.0 0 0 ? S 19:20 0:00 [ll_ost_io_511] > root 7489 0.0 0.0 0 0 ? D 19:20 0:00 [ll_ost_io_512] > ... > > after the i/o: > num threads name > ----------- ---- > 5 ll_log_comt_* > 8 ldlm_bl_* > 8 ldlm_cb_* > 15 ldlm_cn_* > 304 ll_ost_* > 512 ll_ost_io_* > > the /tmp/lustre-log.* files are kinda huge. they''re at > http://www.cita.utoronto.ca/~rjh/lustre/ > > there''s no LBUG if I restrict the number of threads with eg: > options ost oss_num_threads=300 > > let me know if you''d like more info or to try patches etc. > > cheers, > robin > > Feb 10 19:20:41 xe kernel: LustreError: 7489:0:(ost_handler.c:1555:ost_thread_init()) ASSERTION(thread->t_id < OSS_THREADS_MAX) failed >Looks like it''s just a slightly bad assert. Index: ost_handler.c ==================================================================RCS file: /cvsroot/cfs/lustre-core/ost/ost_handler.c,v retrieving revision 1.181 diff -u -p -r1.181 ost_handler.c --- ost_handler.c 10 Feb 2007 06:32:49 -0000 1.181 +++ ost_handler.c 13 Feb 2007 20:54:19 -0000 @@ -1586,7 +1586,7 @@ static int ost_thread_init(struct ptlrpc LASSERT(thread != NULL); LASSERT(thread->t_data == NULL); - LASSERT(thread->t_id < OSS_THREADS_MAX); + LASSERT(thread->t_id <= OSS_THREADS_MAX); OBD_ALLOC_PTR(tls); if (tls != NULL) {
Robin Humble
2007-Feb-13 17:10 UTC
[Lustre-discuss] LBUG ASSERTION(thread->t_id < OSS_THREADS_MAX) in 1.5.97
On Tue, Feb 13, 2007 at 12:55:18PM -0800, Nathaniel Rutman wrote:>Robin Humble wrote: >>now onto the next problem :) >>Feb 10 19:20:41 xe kernel: LustreError: >>7489:0:(ost_handler.c:1555:ost_thread_init()) ASSERTION(thread->t_id < >>OSS_THREADS_MAX) failed >> >Looks like it''s just a slightly bad assert. > >Index: ost_handler.c >==================================================================>RCS file: /cvsroot/cfs/lustre-core/ost/ost_handler.c,v >retrieving revision 1.181 >diff -u -p -r1.181 ost_handler.c >--- ost_handler.c 10 Feb 2007 06:32:49 -0000 1.181 >+++ ost_handler.c 13 Feb 2007 20:54:19 -0000 >@@ -1586,7 +1586,7 @@ static int ost_thread_init(struct ptlrpc > > LASSERT(thread != NULL); > LASSERT(thread->t_data == NULL); >- LASSERT(thread->t_id < OSS_THREADS_MAX); >+ LASSERT(thread->t_id <= OSS_THREADS_MAX); > > OBD_ALLOC_PTR(tls); > if (tls != NULL) {does that explain why the 512''th thread is in a D state afterwards? cheers, robin
Nathaniel Rutman
2007-Feb-13 17:16 UTC
[Lustre-discuss] LBUG ASSERTION(thread->t_id < OSS_THREADS_MAX) in 1.5.97
Robin Humble wrote:> On Tue, Feb 13, 2007 at 12:55:18PM -0800, Nathaniel Rutman wrote: > >> Robin Humble wrote: >> >>> now onto the next problem :) >>> Feb 10 19:20:41 xe kernel: LustreError: >>> 7489:0:(ost_handler.c:1555:ost_thread_init()) ASSERTION(thread->t_id < >>> OSS_THREADS_MAX) failed >>> >>> >> Looks like it''s just a slightly bad assert. >> >> Index: ost_handler.c >> ==================================================================>> RCS file: /cvsroot/cfs/lustre-core/ost/ost_handler.c,v >> retrieving revision 1.181 >> diff -u -p -r1.181 ost_handler.c >> --- ost_handler.c 10 Feb 2007 06:32:49 -0000 1.181 >> +++ ost_handler.c 13 Feb 2007 20:54:19 -0000 >> @@ -1586,7 +1586,7 @@ static int ost_thread_init(struct ptlrpc >> >> LASSERT(thread != NULL); >> LASSERT(thread->t_data == NULL); >> - LASSERT(thread->t_id < OSS_THREADS_MAX); >> + LASSERT(thread->t_id <= OSS_THREADS_MAX); >> >> OBD_ALLOC_PTR(tls); >> if (tls != NULL) { >> > > does that explain why the 512''th thread is in a D state afterwards? > > cheers, > robin > >Yup. the assert occurs when t_id=OSS_THREADS_MAX=512. t_id are numbered starting from 1, not 0.