thr3ads.net - Xen users - [Xen-users] [XCP] ext3 crashes and slowdowns [Jan 2011]

Christian Fischer

2011-Jan-04 14:37 UTC

[Xen-users] [XCP] ext3 crashes and slowdowns

Hi Folks.

I've two Intel boxes (Intel server S5520UR, 2x E5520, 32GB RAM, SATA HW-RAID,
BBU) running as an XCP-0.5 pool, both running an OpenFiler-2.3 domU, clustered,
active/passive. Data storage is provided as SCSISR (without LVM layer, like an
HBASR) to OpenFiler. Shared storage is provided as iSCSI target by OpenFiler
via clusterIP (storage frontend network), replication is done by drbd (storage
backend network), HA is done by heartbeat (heartbeat network). All networks are
built on top of redundant HP gigabit switches, 2 pairs of Intel gigabit NICs,
each bonded and plugged into the same switch, both bonds multipathed
(active/passive multipathing, patched OpenVSwitch-1.1.2p1) via the two
switches, which are linked together with 2 ports each.

The XCP pool works, iSCSI works, replication works, HA works.

If filer 1 (running on server 1) is active, I can install and run domUs on
server 2 without problems, but I cannot install or run domUs on server 1.

If I switch to filer 2 (on server 2) as the active one, the running but
stalled domUs on server 1 come back to life, and the running domUs on filer 2
lose theirs.

# dd if=/dev/zero of=/tmp/test bs=512M count=1 oflag=direct
shows a rate of 0.8 - 1.2 MB/sec.
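
For comparing write rates between the two servers without depending on dd, a rough Python sketch follows. It is not equivalent to the dd call above: it uses fsync() instead of O_DIRECT (which needs aligned buffers), and the function name, block size and total size are illustrative assumptions.

```python
import os
import time


def sync_write_rate(path, total_bytes=8 * 1024 * 1024, block_size=1024 * 1024):
    """Write total_bytes of zeros to path, fsync, and return the rate in MB/s.

    Hypothetical helper: approximates a synchronous write test with fsync()
    rather than O_DIRECT, so numbers are not directly comparable to dd.
    """
    block = b"\0" * block_size
    written = 0
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        while written < total_bytes:
            # os.write may write fewer bytes than requested; count what it did.
            written += os.write(fd, block[: total_bytes - written])
        os.fsync(fd)  # force the data out to the (iSCSI-backed) disk
    finally:
        os.close(fd)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return written / elapsed / 1e6


if __name__ == "__main__":
    rate = sync_write_rate("/tmp/test")
    print("%.1f MB/s" % rate)
```

Running it once inside a domU on each server, with each filer active in turn, would give four numbers that can be set against the 0.8 - 1.2 MB/sec figure.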

The kernel shows traces like

INFO: task syslogd:1081 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syslogd       D ffff880001003460     0  1081      1          1084  1073 (NOTLB)
 ffff8800367edd88  0000000000000286  ffff8800367edd98  ffffffff80262dd3 
 0000000000000009  ffff88003fb007a0  ffffffff804f4b80  0000000000000d5b 
 ffff88003fb00988  0000000000006d06 
Call Trace:
 [<ffffffff80262dd3>] thread_return+0x6c/0x113
 [<ffffffff88036d5a>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff8029c60a>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
 [<ffffffff8023138e>] __writeback_single_inode+0x1e9/0x328
 [<ffffffff802d2ff1>] do_readv_writev+0x26e/0x291
 [<ffffffff802e555b>] sync_inode+0x24/0x33
 [<ffffffff8804c36d>] :ext3:ext3_sync_file+0xc9/0xdc
 [<ffffffff80252276>] do_fsync+0x52/0xa4
 [<ffffffff802d37f5>] __do_fsync+0x23/0x36
 [<ffffffff802602f9>] tracesys+0xab/0xb6
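
To see which tasks are getting stuck and how often, the hung-task warnings can be pulled out of a saved kernel log. The following is a minimal sketch, assuming the message format shown above; the function name is made up for illustration.

```python
import re

# Matches lines such as:
#   INFO: task syslogd:1081 blocked for more than 120 seconds.
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return (command, pid, seconds) for every hung-task warning found."""
    return [
        (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK.finditer(log_text)
    ]


sample = """INFO: task syslogd:1081 blocked for more than 120 seconds.
INFO: task rc:24537 blocked for more than 120 seconds."""

for comm, pid, secs in find_hung_tasks(sample):
    print(comm, pid, secs)
```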


iscsiadm shows no errors.

# iscsiadm -m session -r 1 -s
Stats for session [sid: 1, target: iqn.2006-01.com.openfiler:tsn.26336ef50fe0:storage1_osimages, portal: 172.16.0.2,3260]
iSCSI SNMP:
        txdata_octets: 486181549212
        rxdata_octets: 2622687792
        noptx_pdus: 0
        scsicmd_pdus: 15184105
        tmfcmd_pdus: 0
        login_pdus: 0
        text_pdus: 0
        dataout_pdus: 195910
        logout_pdus: 0
        snack_pdus: 0
        noprx_pdus: 0
        scsirsp_pdus: 15184088
        tmfrsp_pdus: 0
        textrsp_pdus: 0
        datain_pdus: 87898
        logoutrsp_pdus: 0
        r2t_pdus: 151200
        async_pdus: 0
        rjt_pdus: 0
        digest_err: 0
        timeout_err: 0
iSCSI Extended:
        tx_sendpage_failures: 0
        rx_discontiguous_hdr: 0
        eh_abort_cnt: 0
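
The counters above can also be checked mechanically. The sketch below parses the "name: value" lines and reports non-zero error counters, plus the difference between SCSI commands sent and responses received. Which counters count as "errors" is an assumption made here, and a small difference may simply be commands still in flight when the stats were read.

```python
def parse_iscsi_stats(text):
    """Collect 'name: integer' lines from `iscsiadm -m session -s` output."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(": ")
        if sep and value.strip().isdigit():
            stats[name] = int(value)
    return stats


# Assumed set of counters that should stay at zero on a healthy session.
ERROR_COUNTERS = ("digest_err", "timeout_err", "eh_abort_cnt", "rjt_pdus")


def summarize(stats):
    """Return (non-zero error counters, commands without a response yet)."""
    errors = {k: stats[k] for k in ERROR_COUNTERS if stats.get(k)}
    unanswered = stats.get("scsicmd_pdus", 0) - stats.get("scsirsp_pdus", 0)
    return errors, unanswered


sample = """iSCSI SNMP:
        scsicmd_pdus: 15184105
        scsirsp_pdus: 15184088
        rjt_pdus: 0
        digest_err: 0
        timeout_err: 0
iSCSI Extended:
        eh_abort_cnt: 0
"""

errors, unanswered = summarize(parse_iscsi_stats(sample))
print(errors, unanswered)
```

With the values posted above, the error counters are all zero, but 17 SCSI commands had not received a response at the time of the snapshot.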

If I reboot the domU after it has come back to life, in most cases the ext3
journal is corrupt, and the kernel panics after one more reboot.

If I try to install a PV domain (CentOS-5.5), the installer asks if I wish to
initialize the disk xvda, but when the disk partitioning and layout questions
appear, the disk is missing from the list. There's nothing more than a
question mark.
Sometimes the disk is in the list; if so, I can install the OS and all seems
fine, but after the second reboot the ext3 journal is missing and the kernel
panics after the third reboot; the rootfs is gone.


Are there any ideas? I've run out.

Thanks
Christian

Some kernel logging from domU, nothing inside dom0 log.

EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743295
Aborting journal on device dm-0.
ext3_abort called.
EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743296
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743297
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743298
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743299
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743300
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743301
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743302
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743303
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743304
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743305
EXT3-fs error (device dm-0) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device dm-0) in ext3_truncate: Journal has aborted
EXT3-fs error (device dm-0) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device dm-0) in ext3_orphan_del: Journal has aborted
EXT3-fs error (device dm-0) in ext3_reserve_inode_write: Journal has aborted
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
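
The block numbers in these errors are consecutive (743295 through 743305), which can be confirmed on a longer log with a small script. This is only a sketch, assuming the message wording shown above; the function name is illustrative.

```python
import re

# \s+ tolerates the line wrap between "for" and "block" seen in mailed logs.
BLOCK = re.compile(r"bit already cleared for\s+block (\d+)")


def cleared_block_ranges(log_text):
    """Collapse the reported block numbers into inclusive (first, last) runs."""
    blocks = sorted({int(n) for n in BLOCK.findall(log_text)})
    ranges = []
    for b in blocks:
        if ranges and b == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], b)  # extend the current run
        else:
            ranges.append((b, b))  # start a new run
    return ranges


sample = (
    "EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for \n"
    "block 743295\n"
    "EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743296\n"
    "EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743297\n"
    "EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743300\n"
)

print(cleared_block_ranges(sample))
```

A single contiguous run would suggest that one extent was affected rather than scattered blocks, though that alone does not identify the cause.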
__journal_remove_journal_head: freeing b_committed_data



_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users

Pasi Kärkkäinen

2011-Jan-05 10:03 UTC

Re: [Xen-users] [XCP] ext3 crashes and slowdowns

Hello,

Did you try XCP 1.0 beta? 

-- Pasi

Christian Fischer

2011-Jan-05 10:37 UTC

Re: [Xen-users] [XCP] ext3 crashes and slowdowns

On Wednesday 05 January 2011 11:03:37 Pasi Kärkkäinen wrote:
> Hello,
> 
> Did you try XCP 1.0 beta?
Hi Pasi,

No, not yet, but I'll try it. Is it more beta than 0.5, or less? Can it be
used as a production system? Is it upgradable when 1.0 final comes out?

There are two possible ways to solve this: trying 1.0 beta, or using dedicated
storage server hardware. The storage works perfectly if I run the guest systems
on a third machine.

What I don't understand is what goes wrong when the active filer and the
guests run on the same hardware. I think the setup should work.
I've also seen these fs crashes on top of glusterfs, which I tried before,
with the difference that both servers were affected. That was an active/active
filer setup.

Christian

Pasi Kärkkäinen

2011-Jan-05 13:22 UTC

Re: [Xen-users] [XCP] ext3 crashes and slowdowns

On Wed, Jan 05, 2011 at 11:37:03AM +0100, Christian Fischer wrote:
> No, not yet. But I'll try it. Is it more beta than 0.5, or less? Can it be
> used as production system?

I *think* it should be better than 0.5 :) Also I *think* there's XCP 1.0 beta2
coming up soon(ish).

> Is it upgradable if 1.0 final comes out?

Not sure.

-- Pasi


Christian Fischer

2011-Jan-19 12:02 UTC

Re: [Xen-users] [XCP] ext3 crashes and slowdowns

On Wednesday 05 January 2011 11:03:37 Pasi Kärkkäinen wrote:
> Hello,
> 
> Did you try XCP 1.0 beta?
Yes, it works with XCP 1.0 beta, most of the time.

I had one final crash while swapping the active filer, which left two
corrupted filers and crashed file systems on all running domUs. Both dom0s
were largely unresponsive to ssh and local console requests, and froze after
a shutdown was invoked. No idea whether there's any relationship.


I've found two kernel messages in kern.log of the second dom0:
BUG: soft lockup - CPU#3 stuck for 61s! [swapper:0]
followed by one
INFO: task rc:24537 blocked for more than 120 seconds.
There were a lot of I/O errors at the same time on the first dom0.


Christian
