I had posted at the Sun forums, but it was recommended to me to try here as well. For reference, please see http://forums.sun.com/thread.jspa?threadID=5351916&tstart=0.

In the process of a large SAN migration project we are moving many large volumes from the old SAN to the new. We are making use of the 'replace' function to replace the old volumes with similar or larger new volumes. This process is moving very slowly, sometimes as slowly as one percent of the data every 10 minutes. Is there any way to streamline this method? The system is Solaris 10 08/07. How much depends on the activity of the box? How much on its architecture? The primary system in question at this point is a T2000 with 8GB of RAM and a 4-core CPU. This server has 6 4Gb fibre channel connections to our SAN environment. At times this server is quite busy because it is our backup server, but performance seems no better when the daily backup operations have ceased.

Our pools are only stripes. Would we expect better performance from a mirror or raidz pool? It is worrisome that, if the environment were compromised by a failed disk, it could take this long to replace it and restore the usual redundancy (if it were a mirror or raidz pool).

I have previously applied the kernel change described here: http://blogs.digitar.com/jjww/?itemid=52

I just moved a 1TB volume, which took approx. 27h.
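For reference, the replacement itself is just the standard zpool replace, and we watch its progress with zpool status; roughly the following, where the pool and device names are only placeholders, not our real ones:

    # swap an old-SAN LUN for a new-SAN LUN in the pool (names are placeholders)
    zpool replace tank c2t0d0 c3t0d0

    # check how far the replace/resilver has progressed
    zpool status -v tank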
Have you considered moving to 10/08? ZFS resilver performance is much improved in this release, and I suspect that code might help you. You can easily test upgrading with Live Upgrade. I did the transition using LU and was very happy with the results. For example, I added a disk to a mirror and resilvering the new disk took about 6 min for almost 300GB, IIRC. (A rough sketch of the LU steps follows the quoted message below.)

Blake

On Mon, Dec 1, 2008 at 11:04 PM, Alan Rubin <alan.rubin at nt.gov.au> wrote:
> I had posted at the Sun forums, but it was recommended to me to try here as well.
> [...]
> I just moved a 1TB volume which took approx. 27h.
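As mentioned above, a rough sketch of the Live Upgrade sequence. The boot environment name and the path to the 10/08 media are placeholders, and the exact invocation depends on your disk layout (on a UFS root you would typically also pass -m to tell lucreate where to place the new BE):

    # create an alternate boot environment from the running system
    lucreate -n s10u6

    # upgrade the new BE from the Solaris 10 10/08 install image (path is a placeholder)
    luupgrade -u -n s10u6 -s /mnt/sol-10-u6

    # activate the new BE and reboot into it
    luactivate s10u6
    init 6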
We will be considering it in the new year, but that will not happen in time to affect our current SAN migration.
Would any of this have to do with the system being a T2000? Would ZFS resilvering be affected by single-threadedness, the slowish US-T1 clock speed, or the lack of strong FPU performance?

On 12/1/08, Alan Rubin <alan.rubin at nt.gov.au> wrote:
> We will be considering it in the new year, but that will not happen in time
> to affect our current SAN migration.

--
Matt Walburn
http://mattwalburn.com
It's something we've considered here as well.
I think we found the choke point. The silver lining is that it isn't the T2000 or ZFS. We think it is the new SAN, an Hitachi AMS1000, which has 7200RPM SATA disks with the cache turned off. This system has a very small cache, and when we did turn it on for one of the replacement LUNs we saw a 10x improvement in zpool iostat - until the cache filled up about a minute later. Oh well.
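For what it's worth, the observation above was just from watching per-device throughput while the replace ran, along the lines of (pool name is a placeholder):

    # per-vdev bandwidth and IOPS, sampled every 5 seconds
    zpool iostat -v tank 5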
alan.rubin at nt.gov.au said:
> I think we found the choke point. The silver lining is that it isn't the
> T2000 or ZFS. We think it is the new SAN, an Hitachi AMS1000, which has
> 7200RPM SATA disks with the cache turned off. [...]

We have experience with a T2000 connected to the HDS 9520V, predecessor to the AMS arrays, with SATA drives, and it's likely that your AMS1000 SATA has similar characteristics. I didn't see if you're using Sun's drivers to talk to the SAN/array, but we are using Solaris-10 (and Sun drivers + MPxIO), and since the Hitachi storage isn't automatically recognized (sd/ssd, scsi_vhci), it took a fair amount of tinkering to get parameters adjusted to work well with the HDS storage.

The combination that has given us the best results with ZFS is:

(a) Tell the array to ignore SYNCHRONIZE_CACHE requests from the host.
(b) Balance drives within each AMS disk shelf across both array controllers.
(c) Set the host's max queue depth to 4 for the SATA LUNs (sd/ssd driver).
(d) Set the host's disable_disksort flag (sd/ssd driver) for the HDS LUNs.

(A sketch of how (c) and (d) can be expressed in the driver configuration is in the P.S. below.)

Here's the reference we used for setting the parameters in Solaris-10:

http://wikis.sun.com/display/StorageDev/Parameter+Configuration

Note that the AMS uses read-after-write verification on SATA drives, so you only get half the write IOPS that the drives are otherwise capable of. We've found that small RAID volumes (e.g. a two-drive mirror) are unbelievably slow, so you'd want to go toward having more drives per RAID group, if possible.

Honestly, if I recall correctly what I saw in your "iostat" listings earlier, your situation is not nearly as bad as with our older array. You don't seem to be driving those HDS LUNs to the extreme busy states that we have seen on our 9520V. It was not unusual for us to see LUNs at 100% busy, 100% wait, with 35 ops total in the "actv" and "wait" columns, and I don't recall seeing any 100%-busy devices in your logs. But getting the FC queue-depth (max-throttle) setting to match what the array's back-end I/O can handle greatly reduced the long "zpool status" and other I/O-related hangs that we were experiencing. And disabling the host-side FC queue-sorting greatly improved the overall latency of the system when busy. Maybe it'll help yours too.

Regards,

Marion
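P.S. A sketch of how (c) and (d) can be expressed as per-LUN overrides in /kernel/drv/ssd.conf (sd.conf on x86). The "HITACHI DF600F" inquiry string and the name:value tunable syntax are assumptions on my part - they depend on the array model and the sd/ssd patch level - so verify them against the wiki page above before using:

    # /kernel/drv/ssd.conf - per vendor/product overrides (sketch only;
    # the inquiry string and tunable syntax below are assumptions)
    ssd-config-list = "HITACHI DF600F", "throttle-max:4, disksort:false";

An alternative is the global /etc/system tunable (set ssd:ssd_max_throttle = 4), but that throttles every ssd device on the host, not just the HDS LUNs.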
Thanks for the tips. I'm not sure they will be relevant, though: we don't talk directly to the AMS1000. We are using a USP-VM to virtualize all of our storage, we are using the Sun drivers with MPxIO (already turned on), and we didn't have to add anything to the drv configuration files or do any other tinkering to see the new LUNs.
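In case it helps with the comparison, the path state of the new LUNs can be confirmed from the MPxIO side with mpathadm; the device name below is a placeholder:

    # list all MPxIO-managed logical units
    mpathadm list lu

    # show controller/path details for one of the new LUNs (name is a placeholder)
    mpathadm show lu /dev/rdsk/c4t0d0s2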
alan.rubin at nt.gov.au said:
> Thanks for the tips. I'm not sure if they will be relevant, though. We
> don't talk directly with the AMS1000. We are using a USP-VM to virtualize
> all of our storage [...]

Yes, the fact that the USP-VM was recognized automatically by the Solaris drivers is a good sign. I suggest that you check which queue-depth and disksort values you ended up with from the automatic settings:

    echo "*ssd_state::walk softstate |::print -t struct sd_lun un_throttle" \
        | mdb -k

The "ssd_state" would be "sd_state" on an x86 machine (Solaris-10). The "un_throttle" above shows the current max_throttle (queue depth); replace it with "un_min_throttle" to see the minimum, and with "un_f_disksort_disabled" to see the current queue-sort setting.

The HDS docs for the 9500 series suggested 32 as the max_throttle to use, and the default setting (Solaris-10) was 256 (hopefully with the USP-VM you get something more reasonable). And while 32 did work for us, i.e. no operations were ever lost as far as I could tell, the array back-end -- the drives themselves and the internal SATA shelf connections -- has an actual queue depth of four for each array controller. The AMS1000 has the same limitation for SATA shelves, according to our HDS engineer.

In short, Solaris, especially with ZFS, functions much better if it does not try to send more FC operations to the array than the actual physical devices can handle. We were actually seeing NFS client operations hang for minutes at a time when the SAN-hosted NFS server was making its ZFS devices busy -- and this was true even if the clients were using different devices than the busy ones. We do not see these hangs after making the described changes, and I believe this is because the OS is no longer waiting around for a response from devices that aren't going to respond in a reasonable amount of time.

Yes, having the USP between the host and the AMS1000 will affect things; there's probably some huge cache in there somewhere. But unless you've got a cache hundreds of GB in size, at some point a resilver operation is going to end up running at the speed of the actual back-end device.

Regards,

Marion
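P.S. Spelled out, the disksort check described above would be the following (same caveat about ssd_state vs. sd_state; the member name is the one mentioned above, so verify it against your kernel build):

    echo "*ssd_state::walk softstate |::print -t struct sd_lun un_f_disksort_disabled" \
        | mdb -k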