I see Sun has recently released part number XRA-ST1CH-32G2SSD, a 32GB
SATA SSD for the x4540 server.

We have five x4500's we purchased last year that we are deploying to
provide file and web services to our users. One issue that we have had is
horrible performance for the "single threaded process creating lots of
small files over NFS" scenario. The bottleneck in that case is fairly
clear, and to verify it we temporarily disabled the ZIL on one of the
servers. Extraction time for a large tarball into an NFSv4-mounted
filesystem dropped from 20 minutes to 2 minutes.

Obviously, it is strongly recommended not to run with the ZIL disabled,
and we don't particularly want to do so in production. However, for some
of our users, performance is simply unacceptable for various use cases
(including not only tar extracts, but other common software development
processes such as svn checkouts).

As such, we have been investigating the possibility of improving
performance via a slog, preferably on some type of NVRAM or SSD. We
haven't really found anything appropriate, and now we see Sun has
officially released something very possibly like what we have been
looking for.

My sales rep tells me the drive is only qualified for use in an x4540.
However, as a standard SATA interface SSD there is theoretically no
reason why it would not work in an x4500; they even share the exact same
drive sleds. I was told Sun just didn't want to spend the time/effort to
qualify it for the older hardware (kind of sucks that servers we bought
less than a year ago are being abandoned). We are considering using them
anyway; in the worst case, if Sun support complains that they are
installed and refuses to continue any diagnostic efforts, presumably we
can simply swap them out for standard hard drives. slog devices can be
replaced like any other zfs vdev, correct? Or alternatively, what is the
state of removing a slog device and reverting back to a pool-embedded
log?

So, has anyone played with this new SSD in an x4500 and can comment on
whether or not they seemed to work okay? I can't imagine that no one
inside of Sun, regardless of official support level, has tried it :).
Feel free to post anonymously or reply off list if you don't want
anything on the record ;).

From reviewing the Sun hybrid storage documentation, it describes two
different flash devices: the "Logzilla", optimized for blindingly fast
writes and intended as a ZIL slog, and the "Cachezilla", optimized for
fast reads and intended for use as L2ARC. Is this one of those, or some
other device? If the latter, what are its technical read/write
performance characteristics?

We currently have all 48 drives allocated, 23 mirror pairs and two hot
spares. Is there any timeline on the availability of removing an active
vdev from a pool, which would allow us to swap out a couple of devices
without destroying and having to rebuild our pool?

What is the current state of behavior in the face of slog failure?
Theoretically, if a dedicated slog device failed, the pool could simply
revert to logging embedded in the pool. However, the last I heard, slog
device failure rendered a pool completely unusable and inaccessible. If
that is still the case and not expected to be resolved anytime soon, we
would presumably need two of the devices to mirror?

Thanks for any info you might be able to provide.

-- 
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | henson at csupomona.edu
California State Polytechnic University | Pomona CA 91768
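A note on the ZIL-disable test described above: on Solaris 10 / Nevada
builds of that vintage, the usual mechanism was the zil_disable tunable
documented in the ZFS Evil Tuning Guide. A minimal sketch, assuming that
tunable is present on the build in question; it disables the ZIL for
every pool on the host and is only appropriate for benchmarking, never
for production use:

    # /etc/system -- takes effect at the next boot; remove after testing
    set zfs:zil_disable = 1

Remove the line and reboot once the measurement is done, so that
synchronous write semantics (which NFS COMMIT depends on) are restored.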
Paul B. Henson wrote:
> I see Sun has recently released part number XRA-ST1CH-32G2SSD, a 32GB
> SATA SSD for the x4540 server.

I didn't find that exact part number, but I notice that manufacturing part
    371-4196 32GB Solid State Drive, SATA Interface
is showing up in a number of systems. IIRC, this would be an Intel X25-E.
(shock rated at 1,000 Gs @ 0.5ms, so it should still work if I fall off
my horse ;-)

> We have five x4500's we purchased last year that we are deploying to
> provide file and web services to our users. One issue that we have had
> is horrible performance for the "single threaded process creating lots
> of small files over NFS" scenario. The bottleneck in that case is
> fairly clear, and to verify it we temporarily disabled the ZIL on one
> of the servers. Extraction time for a large tarball into an
> NFSv4-mounted filesystem dropped from 20 minutes to 2 minutes.
>
> Obviously, it is strongly recommended not to run with the ZIL disabled,
> and we don't particularly want to do so in production. However, for
> some of our users, performance is simply unacceptable for various use
> cases (including not only tar extracts, but other common software
> development processes such as svn checkouts).

Yep. Same sort of workload.

> As such, we have been investigating the possibility of improving
> performance via a slog, preferably on some type of NVRAM or SSD. We
> haven't really found anything appropriate, and now we see Sun has
> officially released something very possibly like what we have been
> looking for.
>
> My sales rep tells me the drive is only qualified for use in an x4540.
> However, as a standard SATA interface SSD there is theoretically no
> reason why it would not work in an x4500; they even share the exact
> same drive sleds. I was told Sun just didn't want to spend the
> time/effort to qualify it for the older hardware (kind of sucks that
> servers we bought less than a year ago are being abandoned). We are
> considering using them anyway; in the worst case, if Sun support
> complains that they are installed and refuses to continue any
> diagnostic efforts, presumably we can simply swap them out for standard
> hard drives. slog devices can be replaced like any other zfs vdev,
> correct? Or alternatively, what is the state of removing a slog device
> and reverting back to a pool-embedded log?

Generally, Sun doesn't qualify new devices with EOLed systems.

Today, you can remove a cache device, but not a log device. You can
replace a log device.

Before you start down this path, you should take a look at the workload
using zilstat, which will show you the kind of work the ZIL is doing. If
you don't see any ZIL activity, no need to worry about a separate log.
http://www.richardelling.com/Home/scripts-and-programs-1/zilstat

If you decide you need a log device... read on.

Usually, the log device does not need to be very big. A good strategy
would be to create a small partition or slice, say 1 GByte, on an idle
disk. Add this as a log device to the pool. If this device is an HDD,
then you might not see much of a performance boost. But now that you
have a log device set up, you can experiment with replacing the log
device with another. You won't be able to remove the log device, but you
can relocate or grow it on the fly.

> So, has anyone played with this new SSD in an x4500 and can comment on
> whether or not they seemed to work okay? I can't imagine that no one
> inside of Sun, regardless of official support level, has tried it :).
> Feel free to post anonymously or reply off list if you don't want
> anything on the record ;).
>
> From reviewing the Sun hybrid storage documentation, it describes two
> different flash devices: the "Logzilla", optimized for blindingly fast
> writes and intended as a ZIL slog, and the "Cachezilla", optimized for
> fast reads and intended for use as L2ARC. Is this one of those, or some
> other device? If the latter, what are its technical read/write
> performance characteristics?

Intel claims > 3,300 4kByte random write iops. A really fast HDD may
reach 300 4kByte random write iops, but there are no really fast SATA
HDDs.
http://www.intel.com/design/flash/nand/extreme/index.htm

> We currently have all 48 drives allocated, 23 mirror pairs and two hot
> spares. Is there any timeline on the availability of removing an active
> vdev from a pool, which would allow us to swap out a couple of devices
> without destroying and having to rebuild our pool?

My rule of thumb is to have a hot spare. Having lots of hot spares only
makes a big difference for sites where you cannot service the systems
within a few days, such as remote locations. But you can remove a hot
spare, so that could be a source of your experimental 1 GByte log.

> What is the current state of behavior in the face of slog failure?

It depends on both the failure and event tree...

> Theoretically, if a dedicated slog device failed, the pool could simply
> revert to logging embedded in the pool.

Yes, and this is what would happen in the case where the log device
completely failed while the pool was operational -- the ZIL will revert
to using the main pool.

> However, the last I heard, slog device failure rendered a pool
> completely unusable and inaccessible. If that is still the case and not
> expected to be resolved anytime soon, we would presumably need two of
> the devices to mirror?

This is the case where the log device fails completely while the pool is
not operational. Upon import, the pool will look for an operational log
device and will not find it. This means that any committed transactions
that would have been in the log device are not recoverable *and* the
pool won't know the extent of this missing information.

We could build a model of such a system for an availability or data
retention analysis, but we would be hard pressed to agree upon a
probability for the events (system down and log device fails) that would
be interestingly large. In large part this is because the failure rate
of SSDs is so much better than the failure rate for HDDs. In other
words, the HDD failure modes would dominate the analysis by a
significant margin and the SSD-failing-while-system-down case would be
way down in the noise.

OTOH, if you are paranoid and feel very strongly about CYA, then by all
means, mirror the log :-).

> Thanks for any info you might be able to provide.

[editorial comment: it would be to Sun's benefit if Sun people would
respond to Sun product questions. Harrrummppff.]
 -- richard
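Richard's suggested experiment maps onto a handful of commands. A rough
sketch, assuming a pool named "tank" and hypothetical device names; the
zilstat arguments follow the script's iostat-style interval/count usage:

    # first, confirm the workload actually generates synchronous (ZIL) traffic
    ./zilstat 1 10

    # add a small slice on an idle disk as a separate intent log
    zpool add tank log c4t7d0s0

    # later, migrate the log onto the SSD in place, no pool rebuild required
    zpool replace tank c4t7d0s0 c5t0d0

    # the separate log appears under its own "logs" entry
    zpool status tank

As noted above, the log can be replaced or grown this way, but (at this
point) not removed.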
On Wed, 13 May 2009, Richard Elling wrote:

> I didn't find that exact part number, but I notice that manufacturing
> part 371-4196 32GB Solid State Drive, SATA Interface is showing up in a
> number of systems. IIRC, this would be an Intel X25-E.

Hmm, the part number I provided was off an official quote from our
authorized reseller; googling it comes up with one sun.com link:

http://www.sun.com/executives/iforce/mysun/docs/Support2a_ReleaseContentInfo.html

and a bunch of Japanese sites. List price was $1500; if it is actually an
OEM'd Intel X25-E, that's quite a markup, as street price on that has
dropped below $500. If it's not, it sure would be nice to see some specs.

> Generally, Sun doesn't qualify new devices with EOLed systems.

Understood, it just sucks to have bought a system on its deathbed without
prior knowledge thereof.

> Today, you can remove a cache device, but not a log device. You can
> replace a log device.

I guess if we ended up going this way, replacing the log device with a
standard hard drive in case of support issues would be the only way to
go. Does log device replacement also require that the replacement device
be of equal or greater size? If I wanted to swap between a 32GB SSD and a
1TB SATA drive, I guess I would need to make a partition/slice on the TB
drive of exactly the size of the SSD?

> Before you start down this path, you should take a look at the workload
> using zilstat, which will show you the kind of work the ZIL is doing.
> If you don't see any ZIL activity, no need to worry about a separate
> log.
> http://www.richardelling.com/Home/scripts-and-programs-1/zilstat

Would a dramatic increase in performance when disabling the ZIL also be
sufficient evidence? Even with me as the only person using our test
x4500, disabling the ZIL provides markedly better performance, as
originally described, for certain use cases.

> Usually, the log device does not need to be very big. A good strategy
> would be to create a small partition or slice, say 1 GByte, on an idle
> disk.

If the log device was too small, you could potentially end up
bottlenecked waiting for transactions to be committed to free up log
device blocks?

> Intel claims > 3,300 4kByte random write iops.

Is that before or after the device gets full and starts needing to erase
whole pages to write new blocks 8-/?

> My rule of thumb is to have a hot spare. Having lots of hot spares only
> makes a big difference for sites where you cannot service the systems
> within a few days, such as remote locations.

Eh, they're just downstairs, and we have 7x24 gold on them. Plus I have
5, each with 2 hot spares. I wouldn't have an issue trading a hot spare
for a log device, other than potential issues with the log device
failing if not mirrored.

> Yes, and this is what would happen in the case where the log device
> completely failed while the pool was operational -- the ZIL will revert
> to using the main pool.

But would it then go belly up if the system ever rebooted? You said
currently you cannot remove a log device; if the pool reverts to an
embedded log upon slog failure, and continues to work after a reboot,
you've effectively removed the slog, other than I guess it might keep
complaining and showing a dead slog device.

> This is the case where the log device fails completely while the pool
> is not operational. Upon import, the pool will look for an operational
> log device and will not find it. This means that any committed
> transactions that would have been in the log device are not recoverable
> *and* the pool won't know the extent of this missing information.

So is there simply no recovery available for such a pool? Presumably the
majority of the data in the pool would probably be fine.

> OTOH, if you are paranoid and feel very strongly about CYA, then by all
> means, mirror the log :-).

That all depends on the outcome in that rare-as-it-might-be case where
the log device fails and the pool is inaccessible. If it's just a matter
of some manual intervention to reset the pool to a happy state and the
potential loss of any uncommitted transactions (which, according to the
evil zfs tuning guide, don't result in a corrupted zfs filesystem, only
in potentially unhappy nfs clients), I could live with that. If all of
the data in the pool is trashed and must be restored from backup, that
would be problematic.

> [editorial comment: it would be to Sun's benefit if Sun people would
> respond to Sun product questions. Harrrummppff.]

Maybe they're too busy running in circles trying to figure out what life
under Oracle dominion is going to be like :(.

-- 
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | henson at csupomona.edu
California State Polytechnic University | Pomona CA 91768
On Wed, May 13 at 17:27, Paul B. Henson wrote:
> On Wed, 13 May 2009, Richard Elling wrote:
>>
>> Intel claims > 3,300 4kByte random write iops.
>
> Is that before or after the device gets full and starts needing to
> erase whole pages to write new blocks 8-/?

The quoted numbers are minimums, not "up to" like on the X25-M devices. I
believe that they're measuring sustained 4k full-pack random writes, long
after the device has filled and needs to be doing garbage collection,
wear leveling, etc.

--eric

-- 
Eric D. Mudama
edmudama at mail.bounceswoosh.org
Paul B. Henson wrote:
> On Wed, 13 May 2009, Richard Elling wrote:
>
>> I didn't find that exact part number, but I notice that manufacturing
>> part 371-4196 32GB Solid State Drive, SATA Interface is showing up in
>> a number of systems. IIRC, this would be an Intel X25-E.
>
> Hmm, the part number I provided was off an official quote from our
> authorized reseller; googling it comes up with one sun.com link:
>
> http://www.sun.com/executives/iforce/mysun/docs/Support2a_ReleaseContentInfo.html
>
> and a bunch of Japanese sites. List price was $1500; if it is actually
> an OEM'd Intel X25-E, that's quite a markup, as street price on that
> has dropped below $500. If it's not, it sure would be nice to see some
> specs.
>
>> Generally, Sun doesn't qualify new devices with EOLed systems.
>
> Understood, it just sucks to have bought a system on its deathbed
> without prior knowledge thereof.

Since it costs real $$ to do such things, given the current state of the
economy, I don't think you'll find anyone in the computer business not
trying to sell new product.

>> Today, you can remove a cache device, but not a log device. You can
>> replace a log device.
>
> I guess if we ended up going this way, replacing the log device with a
> standard hard drive in case of support issues would be the only way to
> go. Does log device replacement also require that the replacement
> device be of equal or greater size?

Yes, standard mirror rules apply. This is why I try to make it known
that you don't generally need much size for the log device. They are
solving a latency problem, not a space or bandwidth problem.

> If I wanted to swap between a 32GB SSD and a 1TB SATA drive, I guess I
> would need to make a partition/slice on the TB drive of exactly the
> size of the SSD?

Yes, but note that an SMI label hangs onto the outdated notion of
cylinders and you can't make a slice except on cylinder boundaries.

>> Before you start down this path, you should take a look at the
>> workload using zilstat, which will show you the kind of work the ZIL
>> is doing. If you don't see any ZIL activity, no need to worry about a
>> separate log.
>> http://www.richardelling.com/Home/scripts-and-programs-1/zilstat
>
> Would a dramatic increase in performance when disabling the ZIL also be
> sufficient evidence? Even with me as the only person using our test
> x4500, disabling the ZIL provides markedly better performance, as
> originally described, for certain use cases.

Yes. If the latency through the data path to write to the log was zero,
then it would perform the same as disabling the ZIL.

>> Usually, the log device does not need to be very big. A good strategy
>> would be to create a small partition or slice, say 1 GByte, on an idle
>> disk.
>
> If the log device was too small, you could potentially end up
> bottlenecked waiting for transactions to be committed to free up log
> device blocks?

zilstat can give you an idea of how much data is being written to the
log, so you can make that decision. Of course you can always grow the
log, or add another. But I think you will find that if a txg commits in
30 seconds or less (less as it becomes more busy), then the amount of
data sent to the log will be substantially less than 1 GByte per txg
commit. Once the txg commits, then the log space is freed.

>> Intel claims > 3,300 4kByte random write iops.
>
> Is that before or after the device gets full and starts needing to
> erase whole pages to write new blocks 8-/?

Buy two; if you add two log devices, then the data is striped across
them (add != attach).

>> My rule of thumb is to have a hot spare. Having lots of hot spares
>> only makes a big difference for sites where you cannot service the
>> systems within a few days, such as remote locations.
>
> Eh, they're just downstairs, and we have 7x24 gold on them. Plus I have
> 5, each with 2 hot spares. I wouldn't have an issue trading a hot spare
> for a log device, other than potential issues with the log device
> failing if not mirrored.
>
>> Yes, and this is what would happen in the case where the log device
>> completely failed while the pool was operational -- the ZIL will
>> revert to using the main pool.
>
> But would it then go belly up if the system ever rebooted? You said
> currently you cannot remove a log device; if the pool reverts to an
> embedded log upon slog failure, and continues to work after a reboot,
> you've effectively removed the slog, other than I guess it might keep
> complaining and showing a dead slog device.

In that case, the pool knows the log device is failed.

>> This is the case where the log device fails completely while the pool
>> is not operational. Upon import, the pool will look for an operational
>> log device and will not find it. This means that any committed
>> transactions that would have been in the log device are not
>> recoverable *and* the pool won't know the extent of this missing
>> information.
>
> So is there simply no recovery available for such a pool? Presumably
> the majority of the data in the pool would probably be fine.

Just as in the disabled ZIL case, the on-disk format is still correct.
It is client applications that may be inconsistent. There may be a way
to recover the pool; Sun Service will have a more definitive stance.

>> OTOH, if you are paranoid and feel very strongly about CYA, then by
>> all means, mirror the log :-).
>
> That all depends on the outcome in that rare-as-it-might-be case where
> the log device fails and the pool is inaccessible. If it's just a
> matter of some manual intervention to reset the pool to a happy state
> and the potential loss of any uncommitted transactions (which,
> according to the evil zfs tuning guide, don't result in a corrupted zfs
> filesystem, only in potentially unhappy nfs clients), I could live with
> that. If all of the data in the pool is trashed and must be restored
> from backup, that would be problematic.

You are still much more likely to lose disks in the main pool.
Pedantically, ZFS does not limit the number of mirrors, so you could do
a 47-way mirror for the log device and use 1 disk for the pool :-)
 -- richard
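The add-versus-attach distinction Richard mentions is easy to get
backwards on the command line; a quick sketch using the same hypothetical
pool and device names as before:

    # two top-level log devices: ZIL writes are load-balanced (striped) across them
    zpool add tank log c5t0d0 c5t1d0

    # a mirrored log: either SSD can fail without losing the slog
    zpool add tank log mirror c5t0d0 c5t1d0

    # or convert an existing single log device into a mirror after the fact
    zpool attach tank c5t0d0 c5t1d0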
On Wed, 13 May 2009, Richard Elling wrote:

> > If I wanted to swap between a 32GB SSD and a 1TB SATA drive, I guess
> > I would need to make a partition/slice on the TB drive of exactly the
> > size of the SSD?
>
> Yes, but note that an SMI label hangs onto the outdated notion of
> cylinders and you can't make a slice except on cylinder boundaries.

Hmm... So I probably wouldn't be able to use the entire SSD, but instead
create a partition on both the SSD and the SATA drive of the same size?
They wouldn't necessarily have the same cylinder size, right? So I'd have
to find the least common multiple of the cylinder sizes and create
partitions appropriately.

In general I know it is recommended to give ZFS the entire disk; in the
specific case of the ZIL, will there be any performance degradation if it
is on a slice of the SSD rather than the entire disk?

> In that case, the pool knows the log device is failed.

So, if I understand correctly, if the log device fails while the pool is
active, the log device is marked faulty, logging returns to in-pool, and
everything works perfectly fine and happy until the log device is
replaced? It would seem the only difference between a pool without a log
device and one with a failed log device is that the latter knows it used
to have a log device? If so, it would seem trivial to support removing a
log device from a pool; unless I'm misunderstanding, why has that not
been implemented?

> Just as in the disabled ZIL case, the on-disk format is still correct.
> It is client applications that may be inconsistent. There may be a way
> to recover the pool; Sun Service will have a more definitive stance.

Eh, Sun Service doesn't necessarily like definitiveness :).

However, I do have multiple service contracts, and probably will open a
ticket requesting further details on upcoming log improvements and
recovery modes. It would be nicer to hear it straight from the source
(hint hint hint ;) ), but barring that hopefully I can get it escalated
to someone who can fill in the gaps.

Thanks much...

-- 
Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst | henson at csupomona.edu
California State Polytechnic University | Pomona CA 91768
Paul B. Henson wrote:
> On Wed, 13 May 2009, Richard Elling wrote:
>
>>> If I wanted to swap between a 32GB SSD and a 1TB SATA drive, I guess
>>> I would need to make a partition/slice on the TB drive of exactly the
>>> size of the SSD?
>>
>> Yes, but note that an SMI label hangs onto the outdated notion of
>> cylinders and you can't make a slice except on cylinder boundaries.
>
> Hmm... So I probably wouldn't be able to use the entire SSD, but
> instead create a partition on both the SSD and the SATA drive of the
> same size? They wouldn't necessarily have the same cylinder size,
> right? So I'd have to find the least common multiple of the cylinder
> sizes and create partitions appropriately.

You can always change the cylinder sizes to suit, or use EFI labels.

> In general I know it is recommended to give ZFS the entire disk; in the
> specific case of the ZIL, will there be any performance degradation if
> it is on a slice of the SSD rather than the entire disk?

The "use full disk" recommendation should make zero difference for an
SSD. It only applies to HDDs with volatile write buffers (caches).

>> In that case, the pool knows the log device is failed.
>
> So, if I understand correctly, if the log device fails while the pool
> is active, the log device is marked faulty, logging returns to in-pool,
> and everything works perfectly fine and happy until the log device is
> replaced? It would seem the only difference between a pool without a
> log device and one with a failed log device is that the latter knows it
> used to have a log device? If so, it would seem trivial to support
> removing a log device from a pool; unless I'm misunderstanding, why has
> that not been implemented?

This is CR 6574286
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6574286
 -- richard

>> Just as in the disabled ZIL case, the on-disk format is still correct.
>> It is client applications that may be inconsistent. There may be a way
>> to recover the pool; Sun Service will have a more definitive stance.
>
> Eh, Sun Service doesn't necessarily like definitiveness :).
>
> However, I do have multiple service contracts, and probably will open a
> ticket requesting further details on upcoming log improvements and
> recovery modes. It would be nicer to hear it straight from the source
> (hint hint hint ;) ), but barring that hopefully I can get it escalated
> to someone who can fill in the gaps.
>
> Thanks much...
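For reference, this is roughly where a separate (or faulted) log device
shows up in zpool status; the output below is illustrative only,
abridged, with hypothetical device names:

    # zpool status tank
      pool: tank
     state: ONLINE
    config:

            NAME        STATE     READ WRITE CKSUM
            tank        ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                c1t0d0  ONLINE       0     0     0
                c1t1d0  ONLINE       0     0     0
            logs
              c5t0d0    ONLINE       0     0     0

    errors: No known data errors

If the slog dies while the pool is running, that last device line is
where the FAULTED state (and the fallback to the in-pool ZIL) would be
visible.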