Christopher J. Walker
2010-Apr-05 19:41 UTC
[Lustre-discuss] ras_stride_increase_window() ASSERTION failed
I see the following error in the logs on some of my lustre clients:

Mar 29 20:58:43 cn507 kernel: LustreError: 18750:0:(rw.c:1948:ras_stride_increase_window()) ASSERTION(ras->ras_window_start + ras->ras_window_len >= ras->ras_stride_offset) failed: window_start 1792, window_len 0 stride_offset 2017

Several processes seem to be blocking on this machine in state DN.

Is this a known issue? I've looked in bugzilla and not found anything obvious (but this is the first time I've looked in your bugzilla). I've found http://www.nersc.gov/hypermail/nersc-io/att-0612/summary.pdf and had a quick flick through, but it refers to mpi-io, which we are not doing, and a 1.6 kernel, whereas we are running 1.8.

I'm running 1.8.2 servers (downloaded from Sun/Oracle), and 1.8.2 clients compiled from source on a Scientific Linux 2.6.18-164.15.1.el5 kernel.

/var/log/messages says:

> Mar 29 20:58:43 cn507 kernel: LustreError: 18750:0:(rw.c:1948:ras_stride_increase_window()) ASSERTION(ras->ras_window_start + ras->ras_window_len >= ras->ras_stride_offset) failed: window_start 1792, window_len 0 stride_offset 2017
> Mar 29 20:58:43 cn507 kernel: LustreError: 18750:0:(rw.c:1948:ras_stride_increase_window()) LBUG
> Mar 29 20:58:43 cn507 kernel: Pid: 18750, comm: athena.py
> Mar 29 20:58:43 cn507 kernel:
> Mar 29 20:58:43 cn507 kernel: Call Trace:
> Mar 29 20:58:43 cn507 kernel: [<ffffffff8844d6a1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
> Mar 29 20:58:43 cn507 kernel: [<ffffffff8844dbda>] lbug_with_loc+0x7a/0xd0 [libcfs]
> Mar 29 20:58:43 cn507 kernel: [<ffffffff8878d63f>] ll_readpage+0x129f/0x1e40 [lustre]
> Mar 29 20:58:43 cn507 kernel: [<ffffffff8000c707>] add_to_page_cache+0xaa/0xc1
> Mar 29 20:58:43 cn507 kernel: [<ffffffff8000c2f5>] do_generic_mapping_read+0x208/0x354
> Mar 29 20:58:43 cn507 kernel: [<ffffffff8000d0e0>] file_read_actor+0x0/0x159
> Mar 29 20:58:43 cn507 kernel: [<ffffffff8000c58d>] __generic_file_aio_read+0x14c/0x198
> Mar 29 20:58:43 cn507 kernel: [<ffffffff800c5d8f>] generic_file_readv+0x8f/0xa8
> Mar 29 20:58:43 cn507 kernel: [<ffffffff800a0307>] autoremove_wake_function+0x0/0x2e
> Mar 29 20:58:43 cn507 kernel: [<ffffffff8879a427>] our_vma+0x117/0x1d0 [lustre]
> Mar 29 20:58:43 cn507 kernel: [<ffffffff8000b984>] touch_atime+0x67/0xaa
> Mar 29 20:58:43 cn507 kernel: [<ffffffff8875f65b>] ll_file_readv+0x1e4b/0x2130 [lustre]
> Mar 29 20:58:43 cn507 kernel: [<ffffffff8875f95a>] ll_file_read+0x1a/0x20 [lustre]
> Mar 29 20:58:43 cn507 kernel: [<ffffffff8000b695>] vfs_read+0xcb/0x171
> Mar 29 20:58:43 cn507 kernel: [<ffffffff80011b60>] sys_read+0x45/0x6e
> Mar 29 20:58:43 cn507 kernel: [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76
> Mar 29 20:58:43 cn507 kernel:
> Mar 29 20:58:43 cn507 kernel: LustreError: dumping log to /tmp/lustre-log.1269892723.18750

Thanks,

Chris

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lustre-error.txt
Url: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20100405/28086260/attachment.txt
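For readers following the numbers in the assertion message: the check is window_start + window_len >= stride_offset, and here 1792 + 0 = 1792, which is less than 2017, so the condition is false and the client hits LBUG. A minimal Python sketch of that arithmetic is below. The regular expression and the function name check_window_invariant are illustrative assumptions, not part of Lustre; only the log text comes from this message.

```python
import re

# Matches the three values printed by the failed assertion in the log line.
# The pattern is an assumption based on the single message quoted above.
ASSERT_RE = re.compile(
    r"window_start (\d+), window_len (\d+) stride_offset (\d+)"
)


def check_window_invariant(line):
    """Return (holds, window_end, stride_offset) for a log line, or None.

    'holds' is True when window_start + window_len >= stride_offset,
    i.e. when the asserted condition would have been satisfied.
    """
    m = ASSERT_RE.search(line)
    if m is None:
        return None
    start, length, stride_offset = map(int, m.groups())
    window_end = start + length
    return window_end >= stride_offset, window_end, stride_offset


line = (
    "ASSERTION(ras->ras_window_start + ras->ras_window_len >= "
    "ras->ras_stride_offset) failed: window_start 1792, window_len 0 "
    "stride_offset 2017"
)
print(check_window_invariant(line))  # (False, 1792, 2017)
```

This only restates what the log already says; it does not explain how the read-ahead state reached these values.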
di.wang
2010-Apr-05 19:50 UTC
[Lustre-discuss] ras_stride_increase_window() ASSERTION failed
Hello,

you need the patch in bug 17197, attachment

https://bugzilla.lustre.org/attachment.cgi?id=28672

and probably also the patch in

https://bugzilla.lustre.org/show_bug.cgi?id=22385

Thanks
WangDi

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Tom.Wang
2010-Apr-05 19:54 UTC
[Lustre-discuss] ras_stride_increase_window() ASSERTION failed
Hello,

you need the patch in bug 17197, attachment

https://bugzilla.lustre.org/attachment.cgi?id=28672

and probably also the patch in

https://bugzilla.lustre.org/show_bug.cgi?id=22385

Thanks
WangDi
Christopher J.Walker
2010-Apr-07 18:01 UTC
[Lustre-discuss] ras_stride_increase_window() ASSERTION failed
Tom.Wang wrote:
> Hello,
>
> you need the patch in bug 17197, attachment
>
> https://bugzilla.lustre.org/attachment.cgi?id=28672
>
> and probably also the patch in
>
> https://bugzilla.lustre.org/show_bug.cgi?id=22385

Thanks for the very quick reply.

I've recompiled the patchless client with both these patches and have installed it on our machines. I've been running a test for the last 6 hours, and initial signs are very good - no repeat of the error message on any of the machines.

Chris
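A check for "no repeat of the error message on any of the machines" can be scripted over collected syslog lines. The sketch below is one possible way to do it in Python, assuming the syslog prefix looks like the one quoted in this thread ("Mar 29 20:58:43 cn507 kernel: ..."). The function name count_lbug_hosts and the way lines are gathered are hypothetical; the marker string is taken from the LBUG line in the original report.

```python
import re
from collections import Counter

# Syslog prefix as it appears in the thread: month, day, time, host, "kernel:".
HOST_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]{8}\s+(\S+)\s+kernel:")


def count_lbug_hosts(lines, marker="ras_stride_increase_window()) LBUG"):
    """Count, per host, the log lines that contain the LBUG marker."""
    hits = Counter()
    for line in lines:
        if marker in line:
            m = HOST_RE.match(line)
            if m:
                hits[m.group(1)] += 1
    return hits


# Example with lines from the original report; an empty result after
# patching would mean the marker did not reappear in the scanned lines.
sample = [
    "Mar 29 20:58:43 cn507 kernel: LustreError: "
    "18750:0:(rw.c:1948:ras_stride_increase_window()) LBUG",
    "Mar 29 20:58:43 cn507 kernel: Pid: 18750, comm: athena.py",
]
print(count_lbug_hosts(sample))
```

An empty count only shows the message is absent from the lines scanned, so the test period and workload still matter.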
Christopher J.Walker
2010-Jun-18 20:12 UTC
[Lustre-discuss] ras_stride_increase_window() ASSERTION failed
Christopher J.Walker wrote:
> I've recompiled the patchless client with both these patches and have
> installed it on our machines. I've been running a test for the last 6
> hours, and initial signs are very good - no repeat of the error message
> on any of the machines.

I subsequently upgraded to 1.8.3 on the clients. Whilst I didn't see problems, one of the users is complaining about poor performance (it's possible this has other causes, but the timing is suspicious).

Both patches are labelled "johann: landed1.8.3+"

I'm confused about whether this means the bugs are fixed in 1.8.3 or not.

Bug 22385 is mentioned in the changelog as being fixed (and attempting to apply the patch causes a reject).

Bug 17197 isn't mentioned in the changelog, and applying the patch mentioned:
https://bugzilla.lustre.org/attachment.cgi?id=28672
results in 2 hunks reversed and 3 applied.

Should I downgrade to my 1.8.2 version? Apply the remaining 3 hunks for bug 17197 or something else?

Thanks again,

Chris
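The changelog check described above (is a given bug number mentioned or not) can be done with a plain text search. The Python sketch below is a minimal illustration: the function name bugs_mentioned is hypothetical, and the sample changelog text is made up for the example rather than copied from the 1.8.3 changelog. As this message itself shows for bug 17197, a number missing from the changelog does not by itself prove the fix is missing from the release.

```python
import re


def bugs_mentioned(changelog_text, bug_ids):
    """Map each bug id to True if it appears as a standalone number."""
    return {
        # Lookarounds stop 22385 from matching inside a longer number.
        bug: re.search(rf"(?<!\d){bug}(?!\d)", changelog_text) is not None
        for bug in bug_ids
    }


# Made-up changelog fragment, for illustration only.
sample_changelog = "Bugzilla : 22385\nDescription: (hypothetical entry text)\n"
print(bugs_mentioned(sample_changelog, [17197, 22385]))
# {17197: False, 22385: True}
```

In practice the text would be read from the changelog file shipped with the source tree being built.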
di wang
2010-Jun-21 05:40 UTC
[Lustre-discuss] ras_stride_increase_window() ASSERTION failed
Christopher J.Walker wrote:
> Both patches are labelled "johann: landed1.8.3+"
>
> I'm confused about whether this means the bugs are fixed in 1.8.3 or not.
>
> Should I downgrade to my 1.8.2 version? Apply the remaining 3 hunks for
> bug 17197 or something else?

Yes, these fixes have landed in 1.8.3, so you do not need to downgrade to 1.8.2.

Thanks
WangDi