I look after a remote server that has two iSCSI pools. The volumes for
each pool are sparse volumes, and a while back the target's storage
became full, causing weird and wonderful corruption issues until they
managed to free some space.

Since then, one pool has been reasonably OK, but the other has terrible
performance receiving snapshots. Despite both iSCSI devices using the
same IP connection, iostat shows one with reasonable service times while
the other shows really high (up to 9 seconds) service times and 100%
busy. This kills performance for snapshots with many random file
removals and additions.

I'm currently zero filling the bad pool to recover space on the target
storage to see if that improves matters.

Has anyone else seen similar behaviour with previously degraded iSCSI
pools?

--
Ian.
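For context, the zero fill Ian describes usually amounts to writing a large
file of zeros into the pool from the initiator side and then deleting it, so
the thin-provisioned backing store on the target receives zero-filled blocks
rather than stale data. A minimal sketch, assuming the bad pool is mounted at
/badpool (the path and block size are illustrative, not taken from the thread):

   # Fill the pool's free space with zeros, then remove the file.
   # Note: if compression is enabled on the pool, the zeros never reach
   # the target, so this only makes sense with compression off.
   dd if=/dev/zero of=/badpool/zerofill bs=1M
   rm /badpool/zerofill
   sync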
On 11/22/12 10:15, Ian Collins wrote:
> I'm currently zero filling the bad pool to recover space on the target
> storage to see if that improves matters.

As a data point, both pools are being zero filled with dd. A 30 second
iostat sample shows one device getting more than double the write
throughput of the other:

   r/s   w/s  Mr/s  Mw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
   0.2  64.0   0.0  50.1   0.0   5.6     0.7    87.9   4  64  c0t600144F096C94AC700004ECD96F20001d0
   5.6  44.9   0.0  18.2   0.0   5.8     0.3   115.7   2  76  c0t600144F096C94AC700004FF354B00002d0

--
Ian.
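The exact iostat invocation isn't shown, but extended device statistics in
megabytes per second with per-device service times, as above, would typically
come from something like the following (the flags are assumed, not confirmed
by the thread):

   # Extended device statistics (-x), descriptive device names (-n),
   # throughput in MB/s (-M), sampled every 30 seconds; the first sample
   # reports the average since boot and is usually discarded.
   iostat -xnM 30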
Edward Ned Harvey (opensolarisisdeadlongliveopensolaris)
2012-Nov-22 13:06 UTC
[zfs-discuss] Woeful performance from an iSCSI pool
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Ian Collins
>
> I look after a remote server that has two iSCSI pools. The volumes for
> each pool are sparse volumes, and a while back the target's storage
> became full, causing weird and wonderful corruption issues until they
> managed to free some space.
>
> Since then, one pool has been reasonably OK, but the other has terrible
> performance receiving snapshots. Despite both iSCSI devices using the
> same IP connection, iostat shows one with reasonable service times while
> the other shows really high (up to 9 seconds) service times and 100%
> busy. This kills performance for snapshots with many random file
> removals and additions.
>
> I'm currently zero filling the bad pool to recover space on the target
> storage to see if that improves matters.
>
> Has anyone else seen similar behaviour with previously degraded iSCSI
> pools?

This sounds exactly like the behaviour I was seeing with my attempt at
two machines zpool mirroring each other via iSCSI. In my case, I had two
machines that are both targets and initiators. I made the initiator
service dependent on the target service, the zpool mount dependent on
the initiator service, and the VirtualBox guest start dependent on the
zpool mount.

Everything seemed fine for a while, including some reboots. But then one
reboot, one of my systems stayed down too long, and when it finally came
back up, both machines started choking. So far I haven't found any root
cause, and so far the only solution I've found was to reinstall the OS.
I tried everything I know in terms of removing, forgetting, and
recreating the targets, initiators, and pool, but somehow none of that
was sufficient. I recently (yesterday) got budgetary approval to dig
into this more, so hopefully I'll have some insight before too long, but
don't hold your breath. I could fail, and even if I don't, it's likely
to be weeks or months.

What I want to know from you is: which machines are your Solaris
machines? Just the targets? Just the initiators? All of them?

You say you're having problems just with snapshots. Are you sure you're
not having trouble with all sorts of IO, and not just snapshots? What
about import/export?

In my case, I found I was able to zfs send, zfs receive, and zpool
status, all fine. But when I launched a guest VM, there would be a
massive delay - you said up to 9 seconds; I was sometimes seeing over
30 seconds - sometimes crashing the host system. The guest OS was acting
like it was getting IO errors without actually displaying any message
indicating an IO error. I would attempt, and sometimes fail, to power
off the guest VM (kill -KILL VirtualBox). After the failure began, zpool
status still works (and reports no errors), but if I try to do things
like export/import, they hang indefinitely and I need to power cycle the
host. While in the failure mode, I can run zpool iostat, and I sometimes
see 0 transactions with nonzero bandwidth, which defies my
understanding.

Did you ever see the iSCSI targets "offline" or "degraded" in any way?
Did you do anything like "online" or "clear"?

My systems are OpenIndiana - the latest, I forget if that's 151a5 or a6.
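The service dependency chain Edward describes would normally be wired up
through SMF. As a rough sketch only, assuming the stock OpenIndiana iSCSI
service FMRIs and a made-up property group name (one way such a dependency
could be added, not Edward's actual configuration):

   # Make the iSCSI initiator service wait for the local iSCSI target service.
   svccfg -s svc:/network/iscsi/initiator:default addpg local_target dependency
   svccfg -s svc:/network/iscsi/initiator:default setprop local_target/grouping = astring: require_all
   svccfg -s svc:/network/iscsi/initiator:default setprop local_target/restart_on = astring: restart
   svccfg -s svc:/network/iscsi/initiator:default setprop local_target/type = astring: service
   svccfg -s svc:/network/iscsi/initiator:default setprop local_target/entities = fmri: svc:/network/iscsi/target:default
   svcadm refresh svc:/network/iscsi/initiator:default

The zpool mount and VirtualBox guest dependencies would be layered on in the
same way, each service declaring a dependency on the one before it.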
Ian Collins wrote:
> I'm currently zero filling the bad pool to recover space on the target
> storage to see if that improves matters.

It did. Maybe the volume's free space had become very fragmented.

There are a couple of lessons here:

1) When using a thin provisioned volume for an iSCSI target, don't let
the volume's pool become full!

2) If the pool using the iSCSI target has a lot of churn, consider zero
filling the pool to flush out the free blocks.

--
Ian.
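To make lesson 1 concrete: a sparse zvol reserves no space up front, so
nothing stops the hosting pool from filling underneath the iSCSI clients. A
hedged sketch on the target side, with made-up pool and volume names and the
rest of the COMSTAR setup (itadm target creation, stmfadm views) omitted:

   # Create a 500 GB thin-provisioned zvol (-s skips the refreservation).
   zfs create -s -V 500g tank/iscsivol01

   # Register it as a COMSTAR logical unit.
   sbdadm create-lu /dev/zvol/rdsk/tank/iscsivol01

   # Watch actual consumption against the pool's free space.
   zfs list -o name,volsize,used,refreservation tank/iscsivol01
   zpool list tank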