Ragnar Sundblad
2010-Apr-04 02:05 UTC
[zfs-discuss] Problems with zfs and a "STK RAID INT" SAS HBA
Hello,

Maybe this question should be put on another list, but since there are a lot of people here using all kinds of HBAs, it could fit here anyway.

I have an X4150 running snv_134. It was shipped with a "STK RAID INT" adaptec/intel/storagetek/sun SAS HBA.

When running the card in copyback write cache mode, I got horrible performance (with zfs), much worse than with copyback disabled (which I believe should mean it does write-through), when tested with filebench. This could actually be expected, depending on how good or bad the card is, but I am still not sure what to expect.

It logs some errors, as shown with "fmdump -e(V)". Most often it is a PCI bridge error (I think), about five to ten times an hour, and occasionally a problem with accessing a mode page on the disks for enabling/disabling the write cache, one error for each disk, about every three hours. I don't believe the two have to be related. I am not sure if the PCI-PCI bridge is on the RAID board itself or in the host.

I haven't seen this problem on other more or less identical machines running sol10.

Is this a known software problem, or do I have faulty hardware?

Thanks!

/ragge

--------------

% fmdump -e
...
Apr 04 01:21:53.2244 ereport.io.pci.fabric
Apr 04 01:30:00.6999 ereport.io.pci.fabric
Apr 04 01:30:23.4647 ereport.io.scsi.cmd.disk.dev.uderr
Apr 04 01:30:23.4651 ereport.io.scsi.cmd.disk.dev.uderr
...

% fmdump -eV
Apr 04 2010 01:21:53.224492765 ereport.io.pci.fabric
nvlist version: 0
        class = ereport.io.pci.fabric
        ena = 0xd6a00a43be800c01
        detector = (embedded nvlist)
                nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci8086,25f8@4
        (end detector)
        bdf = 0x20
        device_id = 0x25f8
        vendor_id = 0x8086
        rev_id = 0xb1
        dev_type = 0x40
        pcie_off = 0x6c
        pcix_off = 0x0
        aer_off = 0x100
        ecc_ver = 0x0
        pci_status = 0x10
        pci_command = 0x147
        pci_bdg_sec_status = 0x0
        pci_bdg_ctrl = 0x3
        pcie_status = 0x0
        pcie_command = 0x2027
        pcie_dev_cap = 0xfc1
        pcie_adv_ctl = 0x0
        pcie_ue_status = 0x0
        pcie_ue_mask = 0x100000
        pcie_ue_sev = 0x62031
        pcie_ue_hdr0 = 0x0
        pcie_ue_hdr1 = 0x0
        pcie_ue_hdr2 = 0x0
        pcie_ue_hdr3 = 0x0
        pcie_ce_status = 0x0
        pcie_ce_mask = 0x0
        pcie_rp_status = 0x0
        pcie_rp_control = 0x7
        pcie_adv_rp_status = 0x0
        pcie_adv_rp_command = 0x7
        pcie_adv_rp_ce_src_id = 0x0
        pcie_adv_rp_ue_src_id = 0x0
        remainder = 0x0
        severity = 0x1
        __ttl = 0x1
        __tod = 0x4bb7cd91 0xd617cdd
...
Apr 04 2010 01:30:23.464768275 ereport.io.scsi.cmd.disk.dev.uderr
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.dev.uderr
        ena = 0xde0cd54f84201c01
        detector = (embedded nvlist)
                nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci8086,25f8@4/pci108e,286@0/disk@5,0
                devid = id1,sd@TSun_____STK_RAID_INT____EA4B6F24
        (end detector)
        driver-assessment = fail
        op-code = 0x1a
        cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
        pkt-reason = 0x0
        pkt-state = 0x1f
        pkt-stats = 0x0
        stat-code = 0x0
        un-decode-info = sd_get_write_cache_enabled: Mode Sense caching page code mismatch 0
        un-decode-value
        __ttl = 0x1
        __tod = 0x4bb7cf8f 0x1bb3cd13
...
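In case it matters: the way I have been looking at what the sd driver thinks about the write cache on these volumes is roughly the following (the disk name is just an example for one of the volumes behind the HBA), and as far as I understand it, it is this same Mode Sense caching page that the uderr reports above complain about:

# format -e
(select the disk, e.g. c0t5d0)
format> cache
cache> write_cache
write_cache> display
write_cache> quit
cache> quit
format> quit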
Edward Ned Harvey
2010-Apr-05 02:35 UTC
[zfs-discuss] Problems with zfs and a "STK RAID INT" SAS HBA
> When running the card in copyback write cache mode, I got horrible
> performance (with zfs), much worse than with copyback disabled
> (which I believe should mean it does write-through), when tested
> with filebench.

When I benchmark my disks, I also find that the system is slower with WriteBack enabled. I would not call it "much worse," I'd estimate about 10% worse.

This, naturally, is counterintuitive. I do have an explanation, however, which is partly conjecture: with WriteBack enabled, when the OS tells the HBA to write something, it seems to complete instantly. So the OS will issue another, and another, and another. The HBA has no knowledge of the underlying pool data structure, so it cannot consolidate the smaller writes into larger sequential ones. It will brainlessly (or less-brainfully) do as it was told, and write the blocks to precisely the addresses it was instructed to write, even if those are many small writes scattered throughout the platters.

ZFS is smarter than that. It's able to consolidate a zillion tiny writes, as well as some larger writes, all into a larger sequential transaction. ZFS has flexibility in choosing precisely how large a transaction it will create before sending it to disk. One of the variables used to decide how large the transaction should be is: is the disk busy writing right now? If the disks are still busy, I might as well wait a little longer and continue building up my next sequential block of data to write. If it appears to have completed the previous transaction already, there is no need to wait any longer. Don't let the disks sit idle. Just send another small write to the disk.

Long story short, I think ZFS simply does a better job of write buffering than the HBA could possibly do. So you benefit by disabling the WriteBack, in order to allow ZFS to handle that instead.
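For what it's worth, a crude way to see this batching in action is to run something like the following in another terminal while the benchmark is going (the pool name is just an example); with WriteBack disabled you should see the writes hit the disks in large periodic bursts rather than as a constant trickle of small I/Os:

# zpool iostat -v tank 1
# iostat -xnz 1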
Ragnar Sundblad
2010-Apr-05 09:33 UTC
[zfs-discuss] Problems with zfs and a "STK RAID INT" SAS HBA
On 5 apr 2010, at 04.35, Edward Ned Harvey wrote:

>> When running the card in copyback write cache mode, I got horrible
>> performance (with zfs), much worse than with copyback disabled
>> (which I believe should mean it does write-through), when tested
>> with filebench.
>
> When I benchmark my disks, I also find that the system is slower with
> WriteBack enabled. I would not call it "much worse," I'd estimate
> about 10% worse.

Yes, I oversimplified - I have been benchmarking with filebench, just running the tests shipped with the OS, trimmed a little according to <http://www.solarisinternals.com/wiki/index.php/FileBench>. For most tests I typically get somewhat worse performance with writeback enabled (or "copyback", as they call it on this card); maybe about 10% on average could be about right for these tests too.

The interesting part is that with these tests and writeback disabled, on a 4-way stripe of Sun stock 2.5" 146 GB 10000 RPM drives, the run takes 2 hours and 18 minutes (138 minutes) to complete, but with writeback enabled it takes 16 hours 57 minutes (1017 minutes), or over 7.3 times as long! I can't (yet) explain the large difference in run time combined with the small difference in the test results themselves. Maybe a hardware - or driver - problem plays its part in this. I have made a few simple tests with these cards before and was not really impressed; even with all the bells and whistles turned off they merely seemed to be an IOPS and maybe bandwidth bottleneck, but the above just seems wrong.

> This, naturally, is counterintuitive. I do have an explanation,
> however, which is partly conjecture: with WriteBack enabled, when the
> OS tells the HBA to write something, it seems to complete instantly.
> So the OS will issue another, and another, and another. The HBA has
> no knowledge of the underlying pool data structure, so it cannot
> consolidate the smaller writes into larger sequential ones. It will
> brainlessly (or less-brainfully) do as it was told, and write the
> blocks to precisely the addresses it was instructed to write, even if
> those are many small writes scattered throughout the platters.
>
> ZFS is smarter than that. It's able to consolidate a zillion tiny
> writes, as well as some larger writes, all into a larger sequential
> transaction. ZFS has flexibility in choosing precisely how large a
> transaction it will create before sending it to disk. One of the
> variables used to decide how large the transaction should be is: is
> the disk busy writing right now? If the disks are still busy, I might
> as well wait a little longer and continue building up my next
> sequential block of data to write. If it appears to have completed
> the previous transaction already, there is no need to wait any
> longer. Don't let the disks sit idle. Just send another small write
> to the disk.
>
> Long story short, I think ZFS simply does a better job of write
> buffering than the HBA could possibly do. So you benefit by disabling
> the WriteBack, in order to allow ZFS to handle that instead.

You could think that ZIL transactions would get a speedup from the writeback cache, meaning more sync operations per second - and in some cases that seems to be true - and that the card should be designed to handle intermittent load as the txg completions burst (typically every 30 seconds), but something strange obviously happens, at least on this setup.
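To get a better picture of what the sync load actually looks like during these runs, I am planning to count ZIL commits per second with a DTrace one-liner along these lines (assuming the fbt provider exposes zil_commit on this build - I haven't verified that yet):

# dtrace -n 'fbt::zil_commit:entry { @c = count(); } tick-1sec { printa(@c); trunc(@c); }'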
(Actually I'd prefer to be able to conclude that there is no use for writeback caching HBAs - I'd like these machines to be as stable as they possibly can be, and therefore as plain and simple as possible, and I'd like us to be able to just quickly move the disks if one machine should break. With some data stuck in some silly writeback cache inside an HBA that may or may not cooperate depending on its state of mind, mood and the moon phase, that can't be done, and I'd need a much more complicated (= error- and mistake-prone) setup. But my tests so far just seem wrong and probably can't be used to conclude anything.

I'd rather use slogs, and I have a few Intel X25-Es to test with, but then I just recently read on this list that X25-Es aren't supported for slog anymore! Maybe because they always have their writeback cache turned on by default and ignore cache flush commands (and that is "not a bug" - is the design from outer space?), I don't know yet.

(Don't know why I am stubbornly fooling around with this Intel junk - right now they manage to annoy me with a crappy (or broken) PCI-PCI bridge, a crappy HBA and crappy SSD drives...))

/ragge
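PS. If the X25-Es should turn out to be usable after all, the idea is simply something along these lines (pool and device names are of course just placeholders), and as I understand it the log device can be removed again on snv_134 if it turns out to be a bad idea:

# zpool add tank log c2t0d0
# zpool status tank
# zpool remove tank c2t0d0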