Hi Folks.
I have two Intel boxes (Intel Server S5520UR, 2x E5520, 32 GB RAM, SATA hardware
RAID with BBU) running as an XCP-0.5 pool, each running an OpenFiler-2.3 domU;
the two filers are clustered active/passive. Data storage is provided to
OpenFiler as an SCSISR (without an LVM layer, like an HBASR). Shared storage is
exported by OpenFiler as an iSCSI target via a clusterIP (storage frontend
network), replication is done by DRBD (storage backend network), and HA is done
by Heartbeat (heartbeat network). All networks are built on redundant HP
gigabit switches with 2 pairs of Intel gigabit NICs: each pair is bonded and
plugged into the same switch, and both bonds are multipathed (active/passive
multipathing, patched OpenVSwitch-1.1.2p1) across the two switches, which are
linked together with 2 ports each.
The XCP pool works, iSCSI works, replication works, HA works.
If filer 1 (running on server 1) is active, I can install and run domUs on
server 2 without problems, but I cannot install or run domUs on server 1.
If I switch to filer 2 (on server 2) as the active one, the running but
stalled domUs on server 1 come back to life, and the domUs running on server 2
(alongside filer 2) stall instead.
# dd if=/dev/zero of=/tmp/test bs=512M count=1 oflag=direct
shows a rate of 0.8 - 1.2 MB/sec.
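For scale: with bs=512M count=1, dd writes a single 512 MiB block (536,870,912 bytes) with O_DIRECT, so at 0.8 to 1.2 MB/s the command runs for roughly 450 to 670 seconds. The sketch below is a hypothetical helper (not part of dd or XCP) that parses dd's final summary line and does that arithmetic; it assumes the GNU coreutils summary format, whose exact wording differs between versions as noted in the comments.

```python
import re

# Summary line GNU dd prints when it finishes. Depending on the coreutils version it reads
# "536870912 bytes (537 MB) copied, 512.3 seconds, 1.0 MB/s" or
# "536870912 bytes (537 MB, 512 MiB) copied, 512.3 s, 1.0 MB/s"; the pattern accepts both.
DD_SUMMARY = re.compile(r"^(\d+) bytes .*copied, ([\d.]+) s")


def dd_throughput_mb_s(summary_line: str) -> float:
    """Throughput in decimal MB/s (10**6 bytes per second), the unit dd itself reports."""
    m = DD_SUMMARY.search(summary_line.strip())
    if m is None:
        raise ValueError(f"not a dd summary line: {summary_line!r}")
    nbytes, seconds = int(m.group(1)), float(m.group(2))
    return nbytes / seconds / 1e6


def seconds_for(nbytes: int, mb_per_s: float) -> float:
    """Time a write of nbytes needs at a sustained rate of mb_per_s decimal MB/s."""
    return nbytes / (mb_per_s * 1e6)


# bs=512M count=1 writes exactly one 512 MiB block.
TEST_BYTES = 512 * 1024 ** 2

if __name__ == "__main__":
    for rate in (0.8, 1.2):
        print(f"{rate} MB/s -> {seconds_for(TEST_BYTES, rate):.0f} s")
```

Run directly, it prints 671 s for 0.8 MB/s and 447 s for 1.2 MB/s, which is why a single test write of this size takes several minutes on the affected host.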
The kernel shows traces like
INFO: task syslogd:1081 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syslogd D ffff880001003460 0 1081 1 1084 1073 (NOTLB)
ffff8800367edd88 0000000000000286 ffff8800367edd98 ffffffff80262dd3
0000000000000009 ffff88003fb007a0 ffffffff804f4b80 0000000000000d5b
ffff88003fb00988 0000000000006d06
Call Trace:
[<ffffffff80262dd3>] thread_return+0x6c/0x113
[<ffffffff88036d5a>] :jbd:log_wait_commit+0xa3/0xf5
[<ffffffff8029c60a>] autoremove_wake_function+0x0/0x2e
[<ffffffff8803178a>] :jbd:journal_stop+0x1cf/0x1ff
[<ffffffff8023138e>] __writeback_single_inode+0x1e9/0x328
[<ffffffff802d2ff1>] do_readv_writev+0x26e/0x291
[<ffffffff802e555b>] sync_inode+0x24/0x33
[<ffffffff8804c36d>] :ext3:ext3_sync_file+0xc9/0xdc
[<ffffffff80252276>] do_fsync+0x52/0xa4
[<ffffffff802d37f5>] __do_fsync+0x23/0x36
[<ffffffff802602f9>] tracesys+0xab/0xb6
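In this trace syslogd is in state D (uninterruptible sleep) inside ext3_sync_file, waiting in jbd's log_wait_commit: an fsync() that is waiting for a journal commit to reach the disk. The kernel's hung-task detector prints such a report once a task has been blocked longer than hung_task_timeout_secs (120 s here). When comparing dom0 and domU logs across filer failovers it can help to pull these reports out mechanically. A minimal sketch, assuming only the message format shown above; the function name hung_tasks is hypothetical.

```python
import re
from typing import List, Tuple

# Report format of the kernel's hung-task detector, e.g.
# "INFO: task syslogd:1081 blocked for more than 120 seconds."
HUNG_TASK = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")


def hung_tasks(log_text: str) -> List[Tuple[str, int, int]]:
    """Return (command, pid, timeout_seconds) for each hung-task report in log_text."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in HUNG_TASK.finditer(log_text)
    ]
```

Feeding it the contents of a kern.log returns one tuple per report, so the blocked tasks on each host can be listed side by side.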
iscsiadm shows no errors.
# iscsiadm -m session -r 1 -s
Stats for session [sid: 1, target: iqn.2006-01.com.openfiler:tsn.26336ef50fe0:storage1_osimages, portal: 172.16.0.2,3260]
iSCSI SNMP:
txdata_octets: 486181549212
rxdata_octets: 2622687792
noptx_pdus: 0
scsicmd_pdus: 15184105
tmfcmd_pdus: 0
login_pdus: 0
text_pdus: 0
dataout_pdus: 195910
logout_pdus: 0
snack_pdus: 0
noprx_pdus: 0
scsirsp_pdus: 15184088
tmfrsp_pdus: 0
textrsp_pdus: 0
datain_pdus: 87898
logoutrsp_pdus: 0
r2t_pdus: 151200
async_pdus: 0
rjt_pdus: 0
digest_err: 0
timeout_err: 0
iSCSI Extended:
tx_sendpage_failures: 0
rx_discontiguous_hdr: 0
eh_abort_cnt: 0
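Read counter by counter, these stats do show a clean transport: digest_err, timeout_err, rjt_pdus, tmfcmd_pdus and eh_abort_cnt are all 0. The one derived figure is scsicmd_pdus minus scsirsp_pdus, 15184105 - 15184088 = 17, i.e. 17 SCSI commands had been sent but not yet answered when the snapshot was taken; a single snapshot cannot tell in-flight I/O from stuck commands. A minimal sketch for extracting the counters so two snapshots (or the two dom0s) can be compared; the function names and the choice of "trouble" counters are assumptions, not part of open-iscsi.

```python
from typing import Dict

# Counters that would normally move if the iSCSI transport itself had trouble:
# digest failures, timeouts, rejected PDUs, task-management (abort/reset) requests,
# and aborts issued by the SCSI error handler.
TROUBLE_COUNTERS = ("digest_err", "timeout_err", "rjt_pdus", "tmfcmd_pdus", "eh_abort_cnt")


def parse_iscsi_stats(text: str) -> Dict[str, int]:
    """Collect every 'name: integer' counter from `iscsiadm -m session -r SID -s` output."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        # Section headers ("iSCSI SNMP:") and the "Stats for session" banner are skipped
        # because the text after their first colon is not a plain integer.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def summarize(stats: Dict[str, int]) -> Dict[str, int]:
    return {
        # SCSI command PDUs sent minus SCSI response PDUs received at snapshot time.
        "unanswered_cmds": stats["scsicmd_pdus"] - stats["scsirsp_pdus"],
        "trouble_total": sum(stats.get(k, 0) for k in TROUBLE_COUNTERS),
    }
```

Taking two snapshots a few seconds apart and diffing the parsed dictionaries shows whether scsirsp_pdus keeps advancing while a domU is stalled.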
If I reboot a domU after it has come back to life, in most cases the ext3
journal is corrupt, and the kernel panics after one more reboot.
If I try to install a PV domain (CentOS 5.5), the installer asks whether I wish
to initialize the disk xvda, but when the partitioning and disk layout
questions appear, the disk is missing from the list; there is nothing more than
a question mark.
Sometimes the disk is in the list. If so, I can install the OS and all seems
fine, but after the second reboot the ext3 journal is missing, and after the
third reboot the kernel panics; the root filesystem is gone.
Any ideas? I'm out of them.
Thanks
Christian
Some kernel log output from the domU (there is nothing in the dom0 log):
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743295
Aborting journal on device dm-0.
ext3_abort called.
EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743296
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743297
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743298
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743299
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743300
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743301
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743302
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743303
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743304
EXT3-fs error (device dm-0): ext3_free_blocks_sb: bit already cleared for block 743305
EXT3-fs error (device dm-0) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device dm-0) in ext3_truncate: Journal has aborted
EXT3-fs error (device dm-0) in ext3_reserve_inode_write: Journal has aborted
EXT3-fs error (device dm-0) in ext3_orphan_del: Journal has aborted
EXT3-fs error (device dm-0) in ext3_reserve_inode_write: Journal has aborted
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
__journal_remove_journal_head: freeing b_committed_data
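The "bit already cleared" errors mean ext3 tried to free blocks that its block bitmap already marked as free. The first one, for block 743295, aborted the journal and the filesystem was remounted read-only; the follow-up errors cover the 11 consecutive blocks 743295 to 743305. To summarize the affected block ranges across several such logs, a small log-reading sketch is below; it assumes only the message text shown here, and the helper names are hypothetical.

```python
import re
from typing import List, Tuple

# ext3's "freeing an already-free block" error. \s+ also tolerates the message being
# wrapped onto two lines, as happens when logs are pasted into mail.
CLEARED_BIT = re.compile(r"ext3_free_blocks_sb: bit already cleared for\s+block (\d+)")


def doubly_freed_blocks(log_text: str) -> List[int]:
    """Block numbers from every 'bit already cleared' error, in log order."""
    return [int(n) for n in CLEARED_BIT.findall(log_text)]


def contiguous_runs(blocks: List[int]) -> List[Tuple[int, int]]:
    """Collapse block numbers into sorted (first, last) runs of consecutive blocks."""
    runs: List[Tuple[int, int]] = []
    for b in sorted(set(blocks)):
        if runs and b == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], b)  # extend the current run
        else:
            runs.append((b, b))  # start a new run
    return runs
```

Applied to the log above, this yields a single run, (743295, 743305), which makes it easy to see whether repeated crashes hit the same region of the disk.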
_______________________________________________
Xen-users mailing list
Xen-users@lists.xensource.com
http://lists.xensource.com/xen-users
On Tue, Jan 04, 2011 at 03:37:36PM +0100, Christian Fischer wrote:
> Hi Folks.
>
> I've two Intel boxes (Intel server S5520UR, 2x E5520, 32GB ram, SATA HW-Raid,
> BBU) running as XCP-0.5 pool, both running a OpenFiler-2.3 domU, clustered,
> active/passive. [...]

Hello,

Did you try XCP 1.0 beta?

-- Pasi
On Wednesday 05 January 2011 11:03:37 Pasi Kärkkäinen wrote:
> On Tue, Jan 04, 2011 at 03:37:36PM +0100, Christian Fischer wrote:
> > Hi Folks.
> > [...]
>
> Hello,
>
> Did you try XCP 1.0 beta?

Hi Pasi,

No, not yet, but I'll try it. Is it more or less beta than 0.5? Can it be used
as a production system? Will it be upgradable when 1.0 final comes out?

There are two possible ways to solve this: trying 1.0 beta, or using dedicated
storage server hardware. The storage works perfectly if I run the guest
systems on a third machine.

What I don't understand is what goes wrong when the active filer and the guest
run on the same hardware. I think the setup should work. I've also seen these
filesystem crashes on top of GlusterFS, which I tried before, with the
difference that both servers were affected. That was an active/active filer
setup.

Christian
On Wed, Jan 05, 2011 at 11:37:03AM +0100, Christian Fischer wrote:
> No, not yet. But I'll try it. Is it more beta than 0.5, or less? Can it be
> used as production system?

I *think* it should be better than 0.5 :)
Also I *think* there's XCP 1.0 beta2 coming up soon(ish).

> Is it upgradable if 1.0 final comes out?

Not sure.

-- Pasi
On Wednesday 05 January 2011 11:03:37 Pasi Kärkkäinen wrote:
> Did you try XCP 1.0 beta?

Yes, the setup works with XCP 1.0 beta, most of the time. I had one final crash
while switching the active filer, which left both filers corrupted and the file
systems of all running domUs crashed. Both dom0s were largely unresponsive to
ssh and local console requests, and froze after I invoked shutdown. No idea
whether these are related. I found two kernel messages in kern.log on the
second dom0:

BUG: soft lockup - CPU#3 stuck for 61s! [swapper:0]

followed by one

INFO: task rc:24537 blocked for more than 120 seconds.

There were a lot of I/O errors at the same time on the first dom0.

Christian