Greetings,
We have a small Oracle project on ZFS (Solaris 10), using a SAN-connected
array which is in need of replacement. I'm weighing whether to recommend
a Sun 2540 array or a Sun J4200 JBOD as the replacement. The old array
and both new candidates use 7200 RPM SATA drives.
I've been watching the workload on the current storage with Richard
Elling's handy zilstat tool, and could use some more eyes and brains than
just mine for making sense of the results.
There are three pools. The first is on a mirrored pair of internal 2.5" SAS
10k RPM drives and holds some database logs; the second is a RAID-5 LUN on
the old SAN array (6 drives), holding database tables and indices; the
third is a mirrored pair of SAN drives, holding log replicas, archives,
and RMAN backup files.
I've included inline below an edited "zpool status" listing to show
the ZFS pools, a listing of "zilstat -l 30 10" showing ZIL traffic
for each of the three pools, and a listing of "iostat -xn 10" for
the relevant devices, all during the same time period.
Note that the period when these stats were taken was a bit atypical: an
RMAN backup was running, which was the source of the read (over)load
on the "san_sp2" pool devices.
So, here are my conclusions, and I'd like a sanity check since I don't
have a lot of experience with interpreting ZIL activity just yet.
(1) ZIL activity is not very heavy. Transaction logs on the internal
drives, which have no NVRAM cache, appear to generate low enough
levels of traffic that we could get by without an SSD ZIL if a
JBOD solution is chosen. We can keep using the internal drive pool
after the old SAN array is replaced.
(2) During RMAN backups, ZIL activity gets much heavier on the affected
SAN pool. We see a low enough average rate (maybe 200 KBytes/sec),
but occasional peaks of as much as 1 to 2 MBytes/sec (see the awk
sketch just after this list for how I summarized those rates). The
100%-busy figures here are for "regular" read traffic, not ZIL.
(3) To be safe, we should probably go with the 2540 array, which does
have a small NVRAM cache, even though it is a fair bit more expensive
than the J4200 JBOD solution. Adding a Logzilla SSD to the J4200
is way more expensive than the 2540 with its NVRAM cache, and an 18GB
Logzilla is probably overkill for this workload.
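For reference, here's roughly how I've been summarizing the zilstat output
to get the average/peak figures in (2). It's a quick-and-dirty sketch: it
assumes the column order shown in the listings below (column 5 is
B-Bytes/s, column 6 is B-Max-Rate), and the /tmp file name is just an
example.

# Capture samples for a while, then ^C and summarize:
zilstat -p san_sp2 -l 999 10 > /tmp/zil.san_sp2
awk '$1 ~ /^[0-9]+$/ { n++; sum += $5; if ($6+0 > max) max = $6 }
     END { if (n) printf("avg %.0f bytes/s, peak burst %.0f bytes/s (%d samples)\n",
                         sum/n, max, n) }' /tmp/zil.san_sp2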
I guess one question I'd add is: the "ops" numbers seem pretty small.
Is it possible to give enough spindles to a pool to handle that many
IOPS without needing an NVRAM cache? I know latency comes into play
at some point, but are we at that point?
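For context, the back-of-envelope I'm working from is below. The seek and
rotational figures are assumed typical 7200 RPM SATA numbers, not anything
measured on these particular drives.

# Average rotational delay at 7200 RPM (half a revolution), in ms:
echo 'scale=1; 60000 / 7200 / 2' | bc        # a bit over 4 ms
# Rough ceiling on single-threaded synchronous writes per spindle,
# assuming ~8.5 ms average seek on top of that:
echo 'scale=1; 1000 / (8.5 + 4.2)' | bc      # about 79 writes/sec

Against the handful of ZIL ops/sec in the listings below, that looks like
plenty of throughput headroom; the real question is whether each
synchronous commit waiting 10+ ms on bare rotating media (versus well
under a millisecond against an NVRAM cache) would be noticeable to the
database.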
Thanks and regards,
Marion
======================================
pool: int_mp1
config:
NAME STATE READ WRITE CKSUM
int_mp1 ONLINE 0 0 0
mirror ONLINE 0 0 0
c0t0d0s5 ONLINE 0 0 0
c0t1d0s5 ONLINE 0 0 0
pool: san_sp1
config:
NAME STATE READ WRITE CKSUM
san_sp1 ONLINE 0 0 0
c3t4849544143484920443630303133323230303430d0 ONLINE 0 0 0
pool: san_sp2
config:
NAME STATE READ WRITE CKSUM
san_sp2 ONLINE 0 0 0
c3t4849544143484920443630303133323230303033d0 ONLINE 0 0 0
======================================
# zilstat -p san_sp1 -l 30 10
N-Bytes N-Bytes/s N-Max-Rate B-Bytes B-Bytes/s B-Max-Rate ops <=4kB 4-32kB >=32kB
108992 10899 108992 143360 14336 143360 5 1 2 2
0 0 0 0 0 0 0 0 0 0
33536 3353 16768 40960 4096 20480 2 0 2 0
134144 13414 50304 163840 16384 61440 8 0 8 0
16768 1676 16768 20480 2048 20480 1 0 1 0
0 0 0 0 0 0 0 0 0 0
134144 13414 134144 221184 22118 221184 2 0 0 2
134848 13484 117376 233472 23347 143360 9 0 8 1
^C
# zilstat -p san_sp2 -l 30 10
N-Bytes N-Bytes/s N-Max-Rate B-Bytes B-Bytes/s B-Max-Rate ops <=4kB 4-32kB >=32kB
1126264 112626 318592 1658880 165888 466944 56 0 50 6
67072 6707 25152 114688 11468 53248 6 0 6 0
61120 6112 16768 86016 8601 20480 7 3 4 0
193216 19321 83840 258048 25804 114688 14 0 14 0
1563584 156358 1043776 1916928 191692 1282048 96 3 93 0
50304 5030 16768 61440 6144 20480 3 0 3 0
67072 6707 16768 81920 8192 20480 4 0 4 0
78912 7891 25152 110592 11059 40960 7 2 5 0
^C
# zilstat -p int_mp1 -l 30 10
N-Bytes N-Bytes/s N-Max-Rate B-Bytes B-Bytes/s B-Max-Rate ops <=4kB 4-32kB >=32kB
49728 4972 16576 61440 6144 20480 3 0 3 0
53888 5388 19520 110592 11059 49152 7 2 5 0
69760 6976 16576 126976 12697 40960 7 1 6 0
49728 4972 16576 61440 6144 20480 3 0 3 0
49728 4972 16576 61440 6144 20480 3 0 3 0
49728 4972 16576 61440 6144 20480 3 0 3 0
49728 4972 16576 61440 6144 20480 3 0 3 0
70464 7046 16576 131072 13107 45056 8 2 6 0
^C
#
======================================
# iostat -xn 10 | egrep 'r/s| c'
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
1.0 26.6 90.3 141.1 0.0 0.4 0.0 16.0 0 14 c0t0d0
1.0 26.6 90.4 141.1 0.0 0.4 0.0 15.9 0 14 c0t1d0
2.5 21.8 302.1 123.2 0.4 0.1 18.0 3.5 2 3 c3t4849544143484920443630303133323230303033d0
2.5 21.1 149.8 100.5 0.2 0.1 9.0 3.0 1 3 c3t4849544143484920443630303133323230303430d0
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 25.5 0.0 126.4 0.0 0.3 0.0 13.2 0 14 c0t0d0
0.0 25.5 0.0 126.4 0.0 0.3 0.0 11.9 0 13 c0t1d0
0.0 34.0 0.0 644.0 0.0 0.0 0.4 0.7 0 1 c3t4849544143484920443630303133323230303033d0
381.0 2.0 14505.6 9.2 31.0 4.0 80.9 10.4 100 100 c3t4849544143484920443630303133323230303430d0
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.2 35.5 13.6 141.4 0.0 0.6 0.0 15.9 0 18 c0t0d0
0.1 35.5 0.8 141.4 0.0 0.5 0.0 14.8 0 17 c0t1d0
0.0 29.5 0.0 610.6 0.0 0.0 0.4 0.8 0 1 c3t4849544143484920443630303133323230303033d0
391.2 4.6 16344.9 19.3 31.0 4.0 78.3 10.1 100 100 c3t4849544143484920443630303133323230303430d0
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 21.7 0.0 85.4 0.0 0.3 0.0 13.0 0 11 c0t0d0
0.0 21.7 0.0 85.4 0.0 0.3 0.0 14.7 0 12 c0t1d0
0.0 32.2 0.0 830.9 0.0 0.0 0.8 0.8 0 1 c3t4849544143484920443630303133323230303033d0
389.0 3.7 15683.8 16.3 31.0 4.0 78.9 10.2 100 100 c3t4849544143484920443630303133323230303430d0
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 37.0 0.0 241.7 0.0 0.5 0.0 13.2 0 17 c0t0d0
0.0 37.0 0.0 241.7 0.0 0.5 0.0 13.5 0 18 c0t1d0
0.0 33.0 0.0 933.3 0.0 0.1 1.0 1.9 0 2 c3t4849544143484920443630303133323230303033d0
362.5 1.4 14618.1 12.4 31.0 4.0 85.2 11.0 100 100 c3t4849544143484920443630303133323230303430d0
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.1 30.4 0.8 124.8 0.0 0.4 0.0 14.3 0 15 c0t0d0
0.1 30.4 0.8 124.8 0.0 0.4 0.0 14.2 0 15 c0t1d0
0.0 37.4 0.0 1153.3 0.1 0.1 2.2 1.6 1 2 c3t4849544143484920443630303133323230303033d0
373.7 2.9 15121.6 12.1 31.0 4.0 82.3 10.6 100 100 c3t4849544143484920443630303133323230303430d0
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 23.9 0.0 93.0 0.0 0.3 0.0 12.9 0 12 c0t0d0
0.0 23.9 0.0 93.0 0.0 0.3 0.0 13.3 0 12 c0t1d0
0.0 34.8 0.0 1240.2 0.1 0.1 2.0 1.5 1 2 c3t4849544143484920443630303133323230303033d0
371.5 1.9 15763.9 9.4 31.0 4.0 83.0 10.7 100 100 c3t4849544143484920443630303133323230303430d0
=======================================
On Mon, 27 Apr 2009, Marion Hakanson wrote:
> I guess one question I'd add is: the "ops" numbers seem pretty small.
> Is it possible to give enough spindles to a pool to handle that many
> IOPS without needing an NVRAM cache? I know latency comes into play
> at some point, but are we at that point?

Your IOPS don't seem high. You are currently using RAID-5, which is a
poor choice for a database. If you use ZFS mirrors you are going to
unleash a lot more IOPS from the available spindles.

I have a 2540 here, but a very fast version with 12 300GB 15K RPM SAS
drives arranged as six mirrors (the 2540 is configured like a JBOD).
While I don't run a database, I have run an IOPS benchmark with random
writers (8K blocks) and see a peak of 3708 ops/sec. With a SATA model
you are not likely to see half of that.

I am not familiar with zilstat. Presumably the '93' is actually
930 ops/second?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

I have now downloaded zilstat.ksh and this is the sort of loading it
reports with my StorageTek 2540 while running the initial writer part
of the benchmark:
% ./zilstat.ksh -p Sun_2540 -l 30 10
N-Bytes N-Bytes/s N-Max-Rate B-Bytes B-Bytes/s B-Max-Rate ops <=4kB 4-32kB >=32kB
413012608 41301260 54822976 531230720 53123072 71626752 13817 0 446 13371
412970688 41297068 54537920 535236608 53523660 72753152 13771 0 597 13174
532394040 53239404 55049760 689438720 68943872 71467008 17862 12 586 17264
416904872 41690487 54958456 543293440 54329344 71794688 13890 8 809 13073
420013248 42001324 55426624 546328576 54632857 75091968 14115 0 684 13431
and later during a 'rewriter' stage:
339795136 33979513 38122048 442200064 44220006 50827264 11427 0 677 10750
268262848 26826284 39153280 357871616 35787161 53542912 8671 0 178 8493
332464672 33246467 38926912 442085376 44208537 50462720 10873 9 152 10712
332375296 33237529 38390336 443805696 44380569 52092928 10849 0 238 10611
273482248 27348224 37300416 363569152 36356915 51687424 8842 2 100 8740
320696384 32069638 35615232 420933632 42093363 47271936 10479 0 190 10289
Clearly these are stratospheric as compared with the numbers you are
encountering on your live system.
After scaling for time, I see that the ops/second is similar to what
iozone reports.
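If you want to generate a similar load yourself, an iozone invocation
along these lines should do it; adjust the file size and the target path
for your own setup (the path below is just an example):

# -O: report results in ops/sec    -o: open the file O_SYNC (synchronous writes)
# -i 0: write/rewrite test         -i 2: random read/write test
# -r 8k: 8 KB records              -s 4g: test file size (pick to taste)
iozone -O -o -i 0 -i 2 -r 8k -s 4g -f /Sun_2540/iozone.tmp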
Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
bfriesen at simple.dallas.tx.us said:
> Your IOPS don't seem high. You are currently using RAID-5, which is a
> poor choice for a database. If you use ZFS mirrors you are going to
> unleash a lot more IOPS from the available spindles.

RAID-5 may be poor for some database loads, but it's perfectly adequate
for this one (small data warehouse, sequential writes, and so far mostly
sequential reads as well). So far the RAID-5 LUN has not been a problem,
and it doesn't look like the low IOPS are because of the hardware; rather,
the database/application just isn't demanding more. Please correct me if
I've come to the wrong conclusion here....

> I am not familiar with zilstat. Presumably the '93' is actually
> 930 ops/second?

I think you answered your own question in your second post. But for
others, the "93" is the total ops over the reporting interval. In this
case, the interval was 10 seconds, so 9.3 ops/sec.

> I have a 2540 here, but a very fast version with 12 300GB 15K RPM SAS
> drives arranged as six mirrors (the 2540 is configured like a JBOD).
> While I don't run a database, I have run an IOPS benchmark with random
> writers (8K blocks) and see a peak of 3708 ops/sec. With a SATA model
> you are not likely to see half of that.

Thanks for the 2540 numbers you posted. There's a SAS 2530 here with the
same 300GB 15kRPM drives, and as you said, it's fast. But it looks so far
like the SATA model, even with less than half the IOPS, will be more than
enough for our workload.

I'm pretty convinced that the SATA 2540 will be sufficient. What I'm not
sure of is whether the cheaper J4200 without an SSD would be sufficient.
I.e., are we generating enough synchronous traffic that lack of NVRAM
cache will cause problems?

One thing zilstat doesn't make obvious (to me) is the latency effect of
a separate log/ZIL device. I guess I could force our old array's cache
into write-through mode and see what happens to the numbers. Judging by
our experience with NFS servers using this same array, I'm reluctant to
try.

Thanks and regards,
Marion
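P.S. On the latency question: before touching the array cache, I may try
watching how long synchronous commits actually take, straight from DTrace.
This is only a sketch, and it assumes the fbt entry/return probes for
zil_commit are available on this Solaris 10 kernel:

# Histogram of zil_commit() latency in nanoseconds; run under normal
# load for a while, then ^C to print the distribution.
dtrace -n '
fbt::zil_commit:entry  { self->ts = timestamp; }
fbt::zil_commit:return /self->ts/ {
        @["zil_commit latency (ns)"] = quantize(timestamp - self->ts);
        self->ts = 0;
}'

If most of the distribution sits up around a disk rotation (10 ms or more),
an NVRAM cache or separate log device should make a visible difference; if
it's already well under a millisecond on the SAN pools, the array cache is
absorbing it.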