Ragnar Sundblad
2010-Apr-04 02:05 UTC
[zfs-discuss] Problems with zfs and a "STK RAID INT" SAS HBA
Hello,

Maybe this question should be put on another list, but since there are a lot of people here using all kinds of HBAs, it could fit here anyway.

I have an X4150 running snv_134. It was shipped with a "STK RAID INT" adaptec/intel/storagetek/sun SAS HBA.

When running the card in copyback write cache mode, I got horrible performance (with zfs), much worse than with copyback disabled (which I believe should mean it does write-through), when tested with filebench. This could actually be expected, depending on how good or bad the card is, but I am still not sure what to expect.

It logs some errors, as shown with "fmdump -e(V)". Most often it is a PCI bridge error (I think), about five to ten times an hour, and occasionally a problem with accessing a mode page on the disks for enabling/disabling the write cache, one error for each disk, about every three hours. I don't believe the two have to be related. I am not sure if the PCI-PCI bridge is on the RAID board itself or in the host.

I haven't seen this problem on other more or less identical machines running sol10.

Is this a known software problem, or do I have faulty hardware?

Thanks!

/ragge

--------------

% fmdump -e
...
Apr 04 01:21:53.2244 ereport.io.pci.fabric
Apr 04 01:30:00.6999 ereport.io.pci.fabric
Apr 04 01:30:23.4647 ereport.io.scsi.cmd.disk.dev.uderr
Apr 04 01:30:23.4651 ereport.io.scsi.cmd.disk.dev.uderr
...

% fmdump -eV
Apr 04 2010 01:21:53.224492765 ereport.io.pci.fabric
nvlist version: 0
        class = ereport.io.pci.fabric
        ena = 0xd6a00a43be800c01
        detector = (embedded nvlist)
                nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci8086,25f8@4
        (end detector)
        bdf = 0x20
        device_id = 0x25f8
        vendor_id = 0x8086
        rev_id = 0xb1
        dev_type = 0x40
        pcie_off = 0x6c
        pcix_off = 0x0
        aer_off = 0x100
        ecc_ver = 0x0
        pci_status = 0x10
        pci_command = 0x147
        pci_bdg_sec_status = 0x0
        pci_bdg_ctrl = 0x3
        pcie_status = 0x0
        pcie_command = 0x2027
        pcie_dev_cap = 0xfc1
        pcie_adv_ctl = 0x0
        pcie_ue_status = 0x0
        pcie_ue_mask = 0x100000
        pcie_ue_sev = 0x62031
        pcie_ue_hdr0 = 0x0
        pcie_ue_hdr1 = 0x0
        pcie_ue_hdr2 = 0x0
        pcie_ue_hdr3 = 0x0
        pcie_ce_status = 0x0
        pcie_ce_mask = 0x0
        pcie_rp_status = 0x0
        pcie_rp_control = 0x7
        pcie_adv_rp_status = 0x0
        pcie_adv_rp_command = 0x7
        pcie_adv_rp_ce_src_id = 0x0
        pcie_adv_rp_ue_src_id = 0x0
        remainder = 0x0
        severity = 0x1
        __ttl = 0x1
        __tod = 0x4bb7cd91 0xd617cdd
...
Apr 04 2010 01:30:23.464768275 ereport.io.scsi.cmd.disk.dev.uderr
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.dev.uderr
        ena = 0xde0cd54f84201c01
        detector = (embedded nvlist)
                nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci8086,25f8@4/pci108e,286@0/disk@5,0
                devid = id1,sd@TSun_____STK_RAID_INT____EA4B6F24
        (end detector)
        driver-assessment = fail
        op-code = 0x1a
        cdb = 0x1a 0x0 0x8 0x0 0x18 0x0
        pkt-reason = 0x0
        pkt-state = 0x1f
        pkt-stats = 0x0
        stat-code = 0x0
        un-decode-info = sd_get_write_cache_enabled: Mode Sense caching page code mismatch 0
        un-decode-value
        __ttl = 0x1
        __tod = 0x4bb7cf8f 0x1bb3cd13
...
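In case it matters: the way I have been looking at what the sd driver thinks about the write cache on these volumes is roughly the following (the disk name is just an example for one of the volumes behind the HBA), and as far as I understand it, it is this same Mode Sense caching page that the uderr reports above complain about:

# format -e
(select the disk, e.g. c0t5d0)
format> cache
cache> write_cache
write_cache> display
write_cache> quit
cache> quit
format> quit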
Edward Ned Harvey
2010-Apr-05 02:35 UTC
[zfs-discuss] Problems with zfs and a "STK RAID INT" SAS HBA
> When running the card in copyback write cache mode, I got horrible
> performance (with zfs), much worse than with copyback disabled
> (which I believe should mean it does write-through), when tested
> with filebench.

When I benchmark my disks, I also find that the system is slower with WriteBack enabled. I would not call it "much worse," I'd estimate about 10% worse.

This, naturally, is counterintuitive. I do have an explanation, however, which is partly conjecture: with WriteBack enabled, when the OS tells the HBA to write something, it seems to complete instantly. So the OS will issue another, and another, and another. The HBA has no knowledge of the underlying pool data structure, so it cannot consolidate the smaller writes into larger sequential ones. It will brainlessly (or less-brainfully) do as it was told, and write the blocks to precisely the addresses it was instructed to write, even if those are many small writes scattered throughout the platters.

ZFS is smarter than that. It's able to consolidate a zillion tiny writes, as well as some larger writes, all into a larger sequential transaction. ZFS has flexibility in choosing precisely how large a transaction it will create before sending it to disk. One of the variables used to decide how large the transaction should be is: is the disk busy writing right now? If the disks are still busy, I might as well wait a little longer and continue building up my next sequential block of data to write. If it appears to have completed the previous transaction already, there is no need to wait any longer. Don't let the disks sit idle. Just send another small write to the disk.

Long story short, I think ZFS simply does a better job of write buffering than the HBA could possibly do. So you benefit by disabling the WriteBack, in order to allow ZFS to handle that instead.
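For what it's worth, a crude way to see this batching in action is to run something like the following in another terminal while the benchmark is going (the pool name is just an example); with WriteBack disabled you should see the writes hit the disks in large periodic bursts rather than as a constant trickle of small I/Os:

# zpool iostat -v tank 1
# iostat -xnz 1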
Ragnar Sundblad
2010-Apr-05 09:33 UTC
[zfs-discuss] Problems with zfs and a "STK RAID INT" SAS HBA
On 5 apr 2010, at 04.35, Edward Ned Harvey wrote:

>> When running the card in copyback write cache mode, I got horrible
>> performance (with zfs), much worse than with copyback disabled
>> (which I believe should mean it does write-through), when tested
>> with filebench.
>
> When I benchmark my disks, I also find that the system is slower with
> WriteBack enabled. I would not call it "much worse," I'd estimate
> about 10% worse.

Yes, I oversimplified - I have been benchmarking with filebench, just running the tests shipped with the OS, trimmed a little according to <http://www.solarisinternals.com/wiki/index.php/FileBench>. For most tests I typically get somewhat worse performance with writeback enabled (or "copyback", as they call it on this card); maybe about 10% on average could be about right for these tests too.

The interesting part is that with these tests and writeback disabled, on a 4-way stripe of Sun stock 2.5" 146 GB 10000 RPM drives, the run takes 2 hours and 18 minutes (138 minutes) to complete, but with writeback enabled it takes 16 hours 57 minutes (1017 minutes), or over 7.3 times as long! I can't (yet) explain the large difference in run time combined with the small difference in the test results themselves. Maybe a hardware - or driver - problem plays its part in this. I have made a few simple tests with these cards before and was not really impressed; even with all the bells and whistles turned off they merely seemed to be an IOPS and maybe bandwidth bottleneck, but the above just seems wrong.

> This, naturally, is counterintuitive. I do have an explanation,
> however, which is partly conjecture: with WriteBack enabled, when the
> OS tells the HBA to write something, it seems to complete instantly.
> So the OS will issue another, and another, and another. The HBA has
> no knowledge of the underlying pool data structure, so it cannot
> consolidate the smaller writes into larger sequential ones. It will
> brainlessly (or less-brainfully) do as it was told, and write the
> blocks to precisely the addresses it was instructed to write, even if
> those are many small writes scattered throughout the platters.
>
> ZFS is smarter than that. It's able to consolidate a zillion tiny
> writes, as well as some larger writes, all into a larger sequential
> transaction. ZFS has flexibility in choosing precisely how large a
> transaction it will create before sending it to disk. One of the
> variables used to decide how large the transaction should be is: is
> the disk busy writing right now? If the disks are still busy, I might
> as well wait a little longer and continue building up my next
> sequential block of data to write. If it appears to have completed
> the previous transaction already, there is no need to wait any
> longer. Don't let the disks sit idle. Just send another small write
> to the disk.
>
> Long story short, I think ZFS simply does a better job of write
> buffering than the HBA could possibly do. So you benefit by disabling
> the WriteBack, in order to allow ZFS to handle that instead.

You could think that ZIL transactions would get a speedup from the writeback cache, meaning more sync operations per second - and in some cases that seems to be true - and that the card should be designed to handle intermittent load as the txg completions burst (typically every 30 seconds), but something strange obviously happens, at least on this setup.
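To get a better picture of what the sync load actually looks like during these runs, I am planning to count ZIL commits per second with a DTrace one-liner along these lines (assuming the fbt provider exposes zil_commit on this build - I haven't verified that yet):

# dtrace -n 'fbt::zil_commit:entry { @c = count(); } tick-1sec { printa(@c); trunc(@c); }'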
(Actually I'd prefer to be able to conclude that there is no use for writeback caching HBAs - I'd like these machines to be as stable as they possibly can be, and therefore as plain and simple as possible, and I'd like us to be able to just quickly move the disks if one machine should break. With some data stuck in some silly writeback cache inside an HBA that may or may not cooperate depending on its state of mind, mood and the moon phase, that can't be done, and I'd need a much more complicated (= error- and mistake-prone) setup. But my tests so far just seem wrong and probably can't be used to conclude anything.

I'd rather use slogs, and I have a few Intel X25-Es to test with, but then I just recently read on this list that X25-Es aren't supported for slog anymore! Maybe because they always have their writeback cache turned on by default and ignore cache flush commands (and that is "not a bug" - is the design from outer space?), I don't know yet.

(Don't know why I am stubbornly fooling around with this Intel junk - right now they manage to annoy me with a crappy (or broken) PCI-PCI bridge, a crappy HBA and crappy SSD drives...))

/ragge
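PS. If the X25-Es should turn out to be usable after all, the idea is simply something along these lines (pool and device names are of course just placeholders), and as I understand it the log device can be removed again on snv_134 if it turns out to be a bad idea:

# zpool add tank log c2t0d0
# zpool status tank
# zpool remove tank c2t0d0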