Greetings,
We have a small Oracle project on ZFS (Solaris 10), using a SAN-connected
array which is in need of replacement.  I'm weighing whether to recommend
a Sun 2540 array or a Sun J4200 JBOD as the replacement.  The old array
and both of the new candidates have 7200RPM SATA drives.
I've been watching the workload on the current storage using Richard
Elling's handy zilstat tool, and could use some more eyes/brains than
just mine to make sense of the results.
There are three pools.  The first is on a mirrored pair of internal 2.5"
SAS 10kRPM drives and holds some database logs; the second is a RAID-5 LUN
(6 drives) on the old SAN array and holds database tables and indices; the
third is a mirrored pair of SAN drives holding log replicas, archives,
and RMAN backup files.
I've included inline below an edited "zpool status" listing showing the
ZFS pools, a listing of "zilstat -l 30 10" showing ZIL traffic for each
of the three pools, and a listing of "iostat -xn 10" for the relevant
devices, all taken during the same time period.
Note that the time these stats were taken was a bit atypical, in that an
RMAN backup was taking place, which was the source of the read (over)load
on the "san_sp1" pool's device (the backup reads the tables/indices there
and writes its output to "san_sp2").
So, here are my conclusions; I'd like a sanity check, since I don't
have a lot of experience with interpreting ZIL activity just yet.
(1) ZIL activity is not very heavy.  Transaction logs on the internal
    drives, which have no NVRAM cache, appear to generate low enough
    levels of traffic that we could get by without an SSD ZIL if a
    JBOD solution is chosen.  We can keep using the internal drive pool
    after the old SAN array is replaced.
(2) During RMAN backups, ZIL activity gets much heavier on the affected
    SAN pool.  The average rate stays low enough (maybe 200 KBytes/sec),
    with occasional peaks of 1 to 2 MBytes/sec.  The 100%-busy figures
    in the iostat output are for "regular" read traffic, not ZIL.
(3) To be safe, we should probably go with the 2540 array, which does
    have a small NVRAM cache, even though it is a fair bit more expensive
    than the J4200 JBOD solution.  Adding a Logzilla SSD to the J4200
    would cost considerably more than the 2540 with its NVRAM cache, and
    an 18GB Logzilla is probably overkill for this workload.
One question I'd add:  The "ops" numbers seem pretty small.  Is it
possible to give a pool enough spindles to handle that many IOPS without
needing an NVRAM cache?  I know latency comes into play at some point,
but are we at that point?  (A rough back-of-envelope check follows.)
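As a quick sanity check on the raw numbers -- this is only back-of-envelope:
the 75-100 random IOPS per 7200RPM SATA spindle figure is the usual rule of
thumb rather than a measurement, and the 96-op sample is simply the worst
10-second zilstat interval for "san_sp2" below:

# echo "96 / 10" | bc -l    # worst observed interval: roughly 9.6 sync ops/sec
# echo "6 * 75" | bc        # ballpark random-write IOPS for a six-spindle pool

If that arithmetic holds, raw IOPS aren't the limit here; it's the
per-operation latency of synchronous writes without NVRAM that I'm less
sure about.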
Thanks and regards,
Marion
======================================
  pool: int_mp1
config:
        NAME          STATE     READ WRITE CKSUM
        int_mp1       ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t0d0s5  ONLINE       0     0     0
            c0t1d0s5  ONLINE       0     0     0
  pool: san_sp1
config:
        NAME                                             STATE     READ WRITE CKSUM
        san_sp1                                          ONLINE       0     0     0
          c3t4849544143484920443630303133323230303430d0  ONLINE       0     0     0
  pool: san_sp2
config:
        NAME                                             STATE     READ WRITE CKSUM
        san_sp2                                          ONLINE       0     0     0
          c3t4849544143484920443630303133323230303033d0  ONLINE       0     0     0
======================================
# zilstat -p san_sp1 -l 30 10
   N-Bytes  N-Bytes/s N-Max-Rate    B-Bytes  B-Bytes/s B-Max-Rate    ops  <=4kB 4-32kB >=32kB
    108992      10899     108992     143360      14336     143360      5      1      2      2
         0          0          0          0          0          0      0      0      0      0
     33536       3353      16768      40960       4096      20480      2      0      2      0
    134144      13414      50304     163840      16384      61440      8      0      8      0
     16768       1676      16768      20480       2048      20480      1      0      1      0
         0          0          0          0          0          0      0      0      0      0
    134144      13414     134144     221184      22118     221184      2      0      0      2
    134848      13484     117376     233472      23347     143360      9      0      8      1
^C
# zilstat -p san_sp2 -l 30 10
   N-Bytes  N-Bytes/s N-Max-Rate    B-Bytes  B-Bytes/s B-Max-Rate    ops  <=4kB 4-32kB >=32kB
   1126264     112626     318592    1658880     165888     466944     56      0     50      6
     67072       6707      25152     114688      11468      53248      6      0      6      0
     61120       6112      16768      86016       8601      20480      7      3      4      0
    193216      19321      83840     258048      25804     114688     14      0     14      0
   1563584     156358    1043776    1916928     191692    1282048     96      3     93      0
     50304       5030      16768      61440       6144      20480      3      0      3      0
     67072       6707      16768      81920       8192      20480      4      0      4      0
     78912       7891      25152     110592      11059      40960      7      2      5      0
^C
# zilstat -p int_mp1 -l 30 10
   N-Bytes  N-Bytes/s N-Max-Rate    B-Bytes  B-Bytes/s B-Max-Rate    ops  <=4kB 4-32kB >=32kB
     49728       4972      16576      61440       6144      20480      3      0      3      0
     53888       5388      19520     110592      11059      49152      7      2      5      0
     69760       6976      16576     126976      12697      40960      7      1      6      0
     49728       4972      16576      61440       6144      20480      3      0      3      0
     49728       4972      16576      61440       6144      20480      3      0      3      0
     49728       4972      16576      61440       6144      20480      3      0      3      0
     49728       4972      16576      61440       6144      20480      3      0      3      0
     70464       7046      16576     131072      13107      45056      8      2      6      0
^C
#
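For anyone who wants to reproduce the arithmetic I'm doing on the above:
this is roughly how I'm turning the per-interval totals into averages.
The output file name is just wherever I happened to save a capture, and
the "10" assumes the 10-second sampling interval used above.

# zilstat -p san_sp2 -l 30 10 > /tmp/zil_sp2.out   # capture, as above
# awk '$7 ~ /^[0-9]+$/ { ops += $7; bytes += $4; n++ }   # col 7 = ops, col 4 = B-Bytes
       END { if (n > 0) printf("avg %.1f ops/sec, %.1f KB/sec\n",
                               ops/(n*10), bytes/(n*10*1024)) }' /tmp/zil_sp2.out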
======================================
# iostat -xn 10 | egrep 'r/s| c'
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    1.0   26.6   90.3  141.1  0.0  0.4    0.0   16.0   0  14 c0t0d0
    1.0   26.6   90.4  141.1  0.0  0.4    0.0   15.9   0  14 c0t1d0
    2.5   21.8  302.1  123.2  0.4  0.1   18.0    3.5   2   3 c3t4849544143484920443630303133323230303033d0
    2.5   21.1  149.8  100.5  0.2  0.1    9.0    3.0   1   3 c3t4849544143484920443630303133323230303430d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   25.5    0.0  126.4  0.0  0.3    0.0   13.2   0  14 c0t0d0
    0.0   25.5    0.0  126.4  0.0  0.3    0.0   11.9   0  13 c0t1d0
    0.0   34.0    0.0  644.0  0.0  0.0    0.4    0.7   0   1 c3t4849544143484920443630303133323230303033d0
  381.0    2.0 14505.6    9.2 31.0  4.0   80.9   10.4 100 100 c3t4849544143484920443630303133323230303430d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2   35.5   13.6  141.4  0.0  0.6    0.0   15.9   0  18 c0t0d0
    0.1   35.5    0.8  141.4  0.0  0.5    0.0   14.8   0  17 c0t1d0
    0.0   29.5    0.0  610.6  0.0  0.0    0.4    0.8   0   1 c3t4849544143484920443630303133323230303033d0
  391.2    4.6 16344.9   19.3 31.0  4.0   78.3   10.1 100 100 c3t4849544143484920443630303133323230303430d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   21.7    0.0   85.4  0.0  0.3    0.0   13.0   0  11 c0t0d0
    0.0   21.7    0.0   85.4  0.0  0.3    0.0   14.7   0  12 c0t1d0
    0.0   32.2    0.0  830.9  0.0  0.0    0.8    0.8   0   1 c3t4849544143484920443630303133323230303033d0
  389.0    3.7 15683.8   16.3 31.0  4.0   78.9   10.2 100 100 c3t4849544143484920443630303133323230303430d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   37.0    0.0  241.7  0.0  0.5    0.0   13.2   0  17 c0t0d0
    0.0   37.0    0.0  241.7  0.0  0.5    0.0   13.5   0  18 c0t1d0
    0.0   33.0    0.0  933.3  0.0  0.1    1.0    1.9   0   2 c3t4849544143484920443630303133323230303033d0
  362.5    1.4 14618.1   12.4 31.0  4.0   85.2   11.0 100 100 c3t4849544143484920443630303133323230303430d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.1   30.4    0.8  124.8  0.0  0.4    0.0   14.3   0  15 c0t0d0
    0.1   30.4    0.8  124.8  0.0  0.4    0.0   14.2   0  15 c0t1d0
    0.0   37.4    0.0 1153.3  0.1  0.1    2.2    1.6   1   2 c3t4849544143484920443630303133323230303033d0
  373.7    2.9 15121.6   12.1 31.0  4.0   82.3   10.6 100 100 c3t4849544143484920443630303133323230303430d0
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0   23.9    0.0   93.0  0.0  0.3    0.0   12.9   0  12 c0t0d0
    0.0   23.9    0.0   93.0  0.0  0.3    0.0   13.3   0  12 c0t1d0
    0.0   34.8    0.0 1240.2  0.1  0.1    2.0    1.5   1   2 c3t4849544143484920443630303133323230303033d0
  371.5    1.9 15763.9    9.4 31.0  4.0   83.0   10.7 100 100 c3t4849544143484920443630303133323230303430d0
=======================================
On Mon, 27 Apr 2009, Marion Hakanson wrote:

> One question I'd add:  The "ops" numbers seem pretty small.  Is it
> possible to give a pool enough spindles to handle that many IOPS without
> needing an NVRAM cache?  I know latency comes into play at some point,
> but are we at that point?

Your IOPS don't seem high.  You are currently using RAID-5, which is a
poor choice for a database.  If you use ZFS mirrors you are going to
unleash a lot more IOPS from the available spindles.

I have a 2540 here, but a very fast version with 12 300GB 15K RPM SAS
drives arranged as six mirrors (the 2540 is configured like a JBOD).
While I don't run a database, I have run an IOPS benchmark with random
writers (8K blocks) and see a peak of 3708 ops/sec.  With a SATA model
you are not likely to see half of that.

I am not familiar with zilstat.  Presumably the '93' is actually 930
ops/second?

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
======================================
I have now downloaded zilstat.ksh and this is the sort of loading it 
reports with my StorageTek 2540 while running the initial writer part 
of the benchmark:
% ./zilstat.ksh -p Sun_2540 -l 30 10
    N-Bytes  N-Bytes/s N-Max-Rate    B-Bytes  B-Bytes/s B-Max-Rate    ops  <=4kB 4-32kB >=32kB
  413012608   41301260   54822976  531230720   53123072   71626752  13817      0    446  13371
  412970688   41297068   54537920  535236608   53523660   72753152  13771      0    597  13174
  532394040   53239404   55049760  689438720   68943872   71467008  17862     12    586  17264
  416904872   41690487   54958456  543293440   54329344   71794688  13890      8    809  13073
  420013248   42001324   55426624  546328576   54632857   75091968  14115      0    684  13431
and later during a 'rewriter' stage:
  339795136   33979513   38122048  442200064   44220006   50827264  11427      0    677  10750
  268262848   26826284   39153280  357871616   35787161   53542912   8671      0    178   8493
  332464672   33246467   38926912  442085376   44208537   50462720  10873      9    152  10712
  332375296   33237529   38390336  443805696   44380569   52092928  10849      0    238  10611
  273482248   27348224   37300416  363569152   36356915   51687424   8842      2    100   8740
  320696384   32069638   35615232  420933632   42093363   47271936  10479      0    190  10289
Clearly these are stratospheric as compared with the numbers you are 
encountering on your live system.
After scaling for time, I see that the ops/second is similar to what 
iozone reports.
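To make the time scaling explicit (assuming the same 10-second interval as
in the listings above), the first 'writer' sample works out to:

% echo "13817 / 10" | bc -l   # roughly 1380 ops/sec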
Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
======================================
bfriesen at simple.dallas.tx.us said:
> Your IOPS don't seem high.  You are currently using RAID-5, which is a
> poor choice for a database.  If you use ZFS mirrors you are going to
> unleash a lot more IOPS from the available spindles.

RAID-5 may be poor for some database loads, but it's perfectly adequate
for this one (small data warehouse, sequential writes, and so far mostly
sequential reads as well).  So far the RAID-5 LUN has not been a problem,
and it doesn't look like the low IOPS are because of the hardware; rather,
the database/application just isn't demanding more.  Please correct me if
I've come to the wrong conclusion here....

> I am not familiar with zilstat.  Presumably the '93' is actually 930
> ops/second?

I think you answered your own question in your second post.  But for
others: the "93" is the total ops over the reporting interval.  In this
case, the interval was 10 seconds, so 9.3 ops/sec.

> I have a 2540 here, but a very fast version with 12 300GB 15K RPM SAS
> drives arranged as six mirrors (the 2540 is configured like a JBOD).
> While I don't run a database, I have run an IOPS benchmark with random
> writers (8K blocks) and see a peak of 3708 ops/sec.  With a SATA model
> you are not likely to see half of that.

Thanks for the 2540 numbers you posted.  There's a SAS 2530 here with the
same 300GB 15kRPM drives, and as you said, it's fast.  But so far it looks
like the SATA model, even with less than half the IOPS, will be more than
enough for our workload.

I'm pretty convinced that the SATA 2540 will be sufficient.  What I'm not
sure of is whether the cheaper J4200 without an SSD would be sufficient.
I.e., are we generating enough synchronous traffic that the lack of an
NVRAM cache will cause problems?

One thing zilstat doesn't make obvious (to me) is the latency effect of a
separate log/ZIL device.  I guess I could force our old array's cache into
write-through mode and see what happens to the numbers, but judging by our
experience with NFS servers using this same array, I'm reluctant to try.

Thanks and regards,
Marion