Hi everyone,

We're building a storage system that should have about 2TB of storage
and good sequential write speed. The server side is a Sun X4200 running
Solaris 10u4 (plus yesterday's recommended patch cluster); the array we
bought is a Transtec Provigo 510 12-disk array. The disks are SATA, and
it's connected to the Sun through U320 SCSI.

Now, the raidbox was sold to us as doing JBOD and various other RAID
levels, but JBOD turns out to mean 'create a single-disk stripe for every
drive'. Which works, after a fashion: with a 12-drive zpool using raidz
and 1 hot spare, I get 132MB/s write performance; with raidz2 it's still
112MB/s. If I instead configure the array as a RAID-50 through the
hardware RAID controller, I can only manage 72MB/s. So at first glance,
this seems a good case for ZFS.

Unfortunately, if I then pull a disk from the ZFS array, it will keep
trying to write to this disk, and will never activate the hot spare. A
zpool status will then show the pool as 'degraded', with one drive marked
as unavailable - and the hot spare still marked as available. Write
performance also drops to about 32MB/s.

If I then try to activate the hot spare by hand (zpool replace <broken
disk> <hot spare>), the resilvering starts, but never makes it past 10% -
it seems to restart all the time. As this box is not in production yet,
and I'm the only user on it, I'm 100% sure that there is nothing happening
on the ZFS filesystem during the resilvering - no reads, no writes, and
certainly no snapshots.

In /var/adm/messages, I see this message repeated several times each minute:

Nov 12 17:30:52 ddd scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7450@2/pci1000,5110@1/sd@4,0 (sd47):
Nov 12 17:30:52 ddd     offline or reservation conflict

Why isn't this enough for ZFS to switch over to the hot spare?
I've tried disabling the write cache on the array box (setting it to
write-through), but that didn't make any difference to the behaviour
either.

I'd appreciate any insights or hints on how to proceed with this - should
I even be trying to use ZFS in this situation?

Regards, Paul Boven.
--
Paul Boven <boven at jive.nl> +31 (0)521-596547
Unix/Linux/Networking specialist
Joint Institute for VLBI in Europe - www.jive.nl
VLBI - It's a fringe science
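For illustration, a minimal sketch of the layout and the manual spare
activation described above - the device names are placeholders, not
necessarily the ones on this system:

# An 11-disk raidz with one hot spare (illustrative device names).
zpool create data raidz c4t0d0 c4t0d1 c4t0d2 c4t0d3 c4t0d4 c4t0d5 \
    c4t0d6 c4t0d7 c4t0d8 c4t0d9 c4t0d10 spare c4t0d11

# After a disk is pulled: check the pool state, and if ZFS has not
# activated the spare on its own, swap it in by hand.
zpool status -v data
zpool replace data c4t0d4 c4t0d11    # zpool replace <pool> <broken disk> <hot spare>

# Resilver progress shows up on the "scrub:" line of zpool status.
zpool status data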
Paul Boven wrote:
> Hi everyone,
>
> We're building a storage system that should have about 2TB of storage
> and good sequential write speed. The server side is a Sun X4200 running
> Solaris 10u4 (plus yesterday's recommended patch cluster); the array we
> bought is a Transtec Provigo 510 12-disk array. The disks are SATA, and
> it's connected to the Sun through U320 SCSI.

A lot of improvements in this area are in the latest SXCE builds. Can you
try this test on b77? I'm not sure what the schedule is for backporting
these changes to S10.
 -- richard
On Tue, Nov 13, 2007 at 12:25:24PM +0100, Paul Boven wrote:
> Hi everyone,
>
> We're building a storage system that should have about 2TB of storage
> and good sequential write speed. The server side is a Sun X4200 running
> Solaris 10u4 (plus yesterday's recommended patch cluster); the array we
> bought is a Transtec Provigo 510 12-disk array. The disks are SATA, and
> it's connected to the Sun through U320 SCSI.

We are doing basically the same thing with similar Western Scientific
(wsm.com) raids, based on Infortrend controllers. ZFS notices when we pull
a disk and goes on and does the right thing.

I wonder if you've got a SCSI card/driver problem. We tried using an
Adaptec card with Solaris with poor results; we switched to LSI, and it
"just works".

danno
--
Dan Pritts, System Administrator
Internet2
office: +1-734-352-4953 | mobile: +1-734-834-7224
Hi Dan,

Dan Pritts wrote:
> On Tue, Nov 13, 2007 at 12:25:24PM +0100, Paul Boven wrote:
>> We're building a storage system that should have about 2TB of storage
>> and good sequential write speed. The server side is a Sun X4200 running
>> Solaris 10u4 (plus yesterday's recommended patch cluster); the array we
>> bought is a Transtec Provigo 510 12-disk array. The disks are SATA, and
>> it's connected to the Sun through U320 SCSI.
>
> We are doing basically the same thing with similar Western Scientific
> (wsm.com) raids, based on Infortrend controllers. ZFS notices when we
> pull a disk and goes on and does the right thing.
>
> I wonder if you've got a SCSI card/driver problem. We tried using an
> Adaptec card with Solaris with poor results; we switched to LSI, and it
> "just works".

Thanks for your reply. The SCSI card in the X4200 is a Sun Single Channel
U320 card that came with the system, but the PCB artwork does sport a nice
'LSI LOGIC' imprint.

So, just to make sure we're talking about the same thing here - your
drives are SATA, you're exporting each drive through the Western
Scientific raidbox as a separate volume, and ZFS actually brings in a hot
spare when you pull a drive?

Over here, I've still not been able to accomplish that - even after
installing Nevada b76 on the machine, removing a disk will not cause a hot
spare to become active, nor does resilvering start. Our Transtec raidbox
seems to be based on a chipset by Promise, by the way.

Regards, Paul Boven.
--
Paul Boven <boven at jive.nl> +31 (0)521-596547
Unix/Linux/Networking specialist
Joint Institute for VLBI in Europe - www.jive.nl
VLBI - It's a fringe science
On Fri, Nov 16, 2007 at 11:31:00AM +0100, Paul Boven wrote:
> Thanks for your reply. The SCSI card in the X4200 is a Sun Single
> Channel U320 card that came with the system, but the PCB artwork does
> sport a nice 'LSI LOGIC' imprint.

That is probably the same card I'm using; it's actually a "Sun" card but,
as you say, is OEM'd by LSI.

> So, just to make sure we're talking about the same thing here - your
> drives are SATA,

yes

> you're exporting each drive through the Western Scientific raidbox as a
> separate volume,

yes

> and ZFS actually brings in a hot spare when you pull a drive?

yes

OS is Sol10U4, system is an X4200, original hardware rev.

> Over here, I've still not been able to accomplish that - even after
> installing Nevada b76 on the machine, removing a disk will not cause a
> hot spare to become active, nor does resilvering start. Our Transtec
> raidbox seems to be based on a chipset by Promise, by the way.

I have heard some bad things about the Promise RAID boxes, but I haven't
had any direct experience.

I do own one Promise box that accepts 4 PATA drives and exports them to a
host as SCSI disks. Shockingly, it uses a master/slave IDE configuration
rather than 4 separate IDE controllers. It wasn't super expensive, but it
wasn't dirt cheap either, and it seems it would have cost another $5 to
manufacture the "right way."

I've had fine luck with Promise $25 ATA PCI cards :)

The Infortrend units, on the other hand, I have had generally quite good
luck with. When I worked at UUNet in the late '90s we had hundreds of
their SCSI RAIDs deployed.

I do have an Infortrend FC-attached RAID with SATA disks, which basically
works fine. It has an external JBOD, also with SATA disks, connecting to
the main RAID with FC. Unfortunately, the RAID unit boots faster than the
JBOD. So, if you turn them on at the same time, it thinks the JBOD is gone
and doesn't notice it's there until you reboot the controller.

That caused a little pucker for my colleagues when it happened while I was
on vacation. The support guy at the reseller we were working with (NOT
Western Scientific) told them the RAID was hosed and they should rebuild
from scratch, hope you had a backup.

danno
--
Dan Pritts, System Administrator
Internet2
office: +1-734-352-4953 | mobile: +1-734-834-7224
A little extra info: ZFS brings in a ZFS spare device the next time the
pool is accessed, not a raidbox hot spare. Resilvering starts
automatically and increases disk access times by about 30%. The first
hour of estimated time left (for 5-6 TB pools) is wildly inaccurate, but
it starts to settle down after that.

Tom Mooney

Dan Pritts wrote:
> On Fri, Nov 16, 2007 at 11:31:00AM +0100, Paul Boven wrote:
>> and ZFS actually brings in a hot spare when you pull a drive?
>
> yes
>
> OS is Sol10U4, system is an X4200, original hardware rev.
[...]
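A simple way to watch the resilver progress and estimated time from the
shell, as a minimal sketch (pool name 'data' assumed, as used elsewhere in
this thread):

# Poll the resilver once a minute; zpool status reports progress and an
# estimated completion time on its "scrub:" line.
while true; do
    zpool status data | grep -i scrub
    sleep 60
done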
Hi Tom, everyone,

Tom Mooney wrote:
> A little extra info: ZFS brings in a ZFS spare device the next time the
> pool is accessed, not a raidbox hot spare. Resilvering starts
> automatically and increases disk access times by about 30%. The first
> hour of estimated time left (for 5-6 TB pools) is wildly inaccurate, but
> it starts to settle down after that.

Thanks for your reply. I'm talking about ZFS hot spares, not the hot spare
functionality of the raid box:

# zpool create -f data raidz c4t0d0 c4t0d1 c4t0d2 c4t0d3 c4t0d4 c4t0d5 c4t0d6 c4t0d7 c4t0d8 c4t0d9 spare c4t0d10 c4t0d11

I did my initial tests by pulling a disk during a 100GB sequential write,
so that should have kicked in a hot spare right away. But no hot spare was
activated (as shown by 'zpool status'), and write performance fell to less
than 25%. I have also tried to start resilvering manually, but that
doesn't seem to work either.

I've heard from several people that currently, ZFS has problems with
reporting the 'estimated time left' - but my issue is that not only the
'time left', but also the progress indicator itself varies wildly, and
keeps resetting itself to 0%, not giving any indication that the
resilvering will ever finish. And with nv-b76, 'zpool status' simply hangs
when there is a drive missing, so I can't even really keep track of the
resilvering, if any.

So, at least for me, hot spare functionality in ZFS seems completely
broken. Any suggestions on how to further investigate / fix this would be
very much welcomed. I'm trying to determine whether this is a ZFS bug or
one with the Transtec raidbox, and whether to file a bug with either
Transtec (Promise) or ZFS.

Regards, Paul Boven.
--
Paul Boven <boven at jive.nl> +31 (0)521-596547
Unix/Linux/Networking specialist
Joint Institute for VLBI in Europe - www.jive.nl
VLBI - It's a fringe science
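One avenue worth checking, offered as a suggestion rather than anything
from the thread: as far as I know, hot-spare activation relies on the
Solaris fault-management framework actually diagnosing the pulled disk as
faulted, so it helps to see whether any fault or error reports were
generated at all. A rough sketch using standard FMA tooling:

# Any resources currently diagnosed as faulty by FMA.
fmadm faulty

# The fault log, and the underlying error reports in detail; look for
# ZFS or SCSI ereports referring to the pulled disk around the time of
# the test.
fmdump
fmdump -eV | less

# Per-device soft/hard/transport error counters as seen by the sd driver.
iostat -En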
On Mon, Nov 19, 2007 at 11:10:32AM +0100, Paul Boven wrote:
> Any suggestions on how to further investigate / fix this would be very
> much welcomed. I'm trying to determine whether this is a ZFS bug or one
> with the Transtec raidbox, and whether to file a bug with either
> Transtec (Promise) or ZFS.

The way I'd try to do this would be to use the same box under Solaris
software RAID, or better yet Linux or Windows software RAID (to make sure
it's not a Solaris device driver problem). Does pulling the disk then get
noticed? If so, it's a ZFS bug.

danno
--
Dan Pritts, System Administrator
Internet2
office: +1-734-352-4953 | mobile: +1-734-834-7224
> but my issue is that not only the 'time left', but also the progress
> indicator itself varies wildly, and keeps resetting itself to 0%, not
> giving any indication that

Are you sure you are not being hit by this bug:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6343667

i.e. a scrub or resilver gets reset to 0% on snapshot creation or deletion.

Cheers.
Hi MP,

MP wrote:
>> but my issue is that not only the 'time left', but also the progress
>> indicator itself varies wildly, and keeps resetting itself to 0%, not
>> giving any indication that
>
> Are you sure you are not being hit by this bug:
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6343667
>
> i.e. a scrub or resilver gets reset to 0% on snapshot creation or deletion.

I'm very sure of that: I've never taken a snapshot on these, and I am the
only user on the machine (it's not in production yet).

Regards, Paul Boven.
--
Paul Boven <boven at jive.nl> +31 (0)521-596547
Unix/Linux/Networking specialist
Joint Institute for VLBI in Europe - www.jive.nl
VLBI - It's a fringe science
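To make that check reproducible, a minimal sketch (pool name 'data' as
used earlier in the thread):

# List any snapshots that exist under the pool.
zfs list -t snapshot -r data

# The pool command history should also record any snapshot creation or
# destruction that ever ran against this pool.
zpool history data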
Hi Dan,

Dan Pritts wrote:
> On Mon, Nov 19, 2007 at 11:10:32AM +0100, Paul Boven wrote:
>> Any suggestions on how to further investigate / fix this would be very
>> much welcomed. I'm trying to determine whether this is a ZFS bug or one
>> with the Transtec raidbox, and whether to file a bug with either
>> Transtec (Promise) or ZFS.
>
> The way I'd try to do this would be to use the same box under Solaris
> software RAID, or better yet Linux or Windows software RAID (to make
> sure it's not a Solaris device driver problem). Does pulling the disk
> then get noticed? If so, it's a ZFS bug.

Excellent suggestion, and today I had some time to give it a try. I
created a 4-disk SVM volume (a mirror of two 2-disk stripes, with 2 more
disks as hot spares):

d10 -m /dev/md/rdsk/d11 /dev/md/rdsk/d12 1
d11 1 2 /dev/rdsk/c4t0d0s0 /dev/rdsk/c4t1d0s0 -i 1024b -h hsp001
d12 1 2 /dev/rdsk/c4t2d0s0 /dev/rdsk/c4t3d0s0 -i 1024b -h hsp001
hsp001 c4t4d0s0 c4t5d0s0

I started a write and then pulled a disk. And without any further probing,
SVM put a hot spare in place and started resyncing:

d10     m  463GB d11 d12 (resync-0%)
d11     s  463GB c4t0d0s0 c4t1d0s0
d12     s  463GB c4t2d0s0 (resyncing-c4t4d0s0) c4t3d0s0
hsp001  h  -     c4t4d0s0 (in-use) c4t5d0s0

This is all on b76. The issue seems to be with ZFS indeed. I'm currently
downloading b77, and once that is installed I'll have to see whether the
fault diagnostics and hot spare handling have indeed improved, as several
people here have pointed out.

Regards, Paul Boven.
--
Paul Boven <boven at jive.nl> +31 (0)521-596547
Unix/Linux/Networking specialist
Joint Institute for VLBI in Europe - www.jive.nl
VLBI - It's a fringe science
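For anyone wanting to repeat the comparison, a rough sketch of SVM
commands that would build a configuration like the one above - this is
not from the thread, and the slices used for the state database replicas
are placeholders:

# State database replicas are required before any metadevices exist
# (placeholder slices on the internal disks).
metadb -a -f -c 2 c1t0d0s7 c1t1d0s7

# Two 2-disk stripes with a 1024-block interlace.
metainit d11 1 2 c4t0d0s0 c4t1d0s0 -i 1024b
metainit d12 1 2 c4t2d0s0 c4t3d0s0 -i 1024b

# A hot-spare pool with two spare slices, then associate it with both
# submirrors.
metainit hsp001 c4t4d0s0 c4t5d0s0
metaparam -h hsp001 d11
metaparam -h hsp001 d12

# A one-way mirror on d11, then attach d12 as the second submirror.
metainit d10 -m d11
metattach d10 d12

# Watch the mirror state and any hot-spare resync.
metastat d10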
Hi everyone,

I've had some time to upgrade the machine in question to nv-b77 and run
the same tests. And I'm happy to report that now, hot spares work a lot
better. The only question remaining for us: how long will it take for
these changes to be integrated into a supported Solaris release? See below
for some logs.

# zpool history data
History for 'data':
2007-11-22.14:48:18 zpool create -f data raidz2 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 c4t6d0 c4t8d0 c4t9d0 c4t10d0 spare c4t11d0 c4t12d0

From /var/adm/messages:

Nov 22 15:15:52 ddd scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7450@2/pci1000,5110@1/sd@2,0 (sd16):
    Error for Command: write(10)    Error Level: Fatal
    Requested Block: 103870006      Error Block: 103870006
    Vendor: transtec                Serial Number:
    Sense Key: Not_Ready
    ASC: 0x4 (LUN not ready intervention required), ASCQ: 0x3, FRU: 0x0
(and about 27 more of these, until 15:16:02)

Nov 22 15:16:12 ddd scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7450@2/pci1000,5110@1/sd@2,0 (sd16):
    offline or reservation conflict
(95 of these, until 15:43:49, almost half an hour later)

And then the console showed: "The device has been offlined and marked as
faulted. An attempt will be made to activate a hot spare if available."

And my current zpool status shows:

# zpool status
  pool: data
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in
        a degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the
        device repaired.
 scrub: resilver completed with 0 errors on Thu Nov 22 16:09:49 2007
config:

        NAME          STATE     READ WRITE CKSUM
        data          DEGRADED     0     0     0
          raidz2      DEGRADED     0     0     0
            c4t0d0    ONLINE       0     0     0
            c4t1d0    ONLINE       0     0     0
            spare     DEGRADED     0     0     0
              c4t2d0  FAULTED      0 23.7K     0  too many errors
              c4t11d0 ONLINE       0     0     0
            c4t3d0    ONLINE       0     0     0
            c4t4d0    ONLINE       0     0     0
            c4t5d0    ONLINE       0     0     0
            c4t6d0    ONLINE       0     0     0
            c4t8d0    ONLINE       0     0     0
            c4t9d0    ONLINE       0     0     0
            c4t10d0   ONLINE       0     0     0
        spares
          c4t11d0     INUSE     currently in use
          c4t12d0     AVAIL

One remark: I find the overview above a bit confusing ('spare' apparently
is 'DEGRADED' and consists of c4t2d0 and c4t11d0), but the hot spare was
properly activated this time and my pool is otherwise in good health.

Thanks everyone for the replies and suggestions.

Regards, Paul Boven.
--
Paul Boven <boven at jive.nl> +31 (0)521-596547
Unix/Linux/Networking specialist
Joint Institute for VLBI in Europe - www.jive.nl
VLBI - It's a fringe science
On Mon, Nov 26, 2007 at 01:57:59PM +0100, Paul Boven wrote:
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         data          DEGRADED     0     0     0
>           raidz2      DEGRADED     0     0     0
>             c4t0d0    ONLINE       0     0     0
>             c4t1d0    ONLINE       0     0     0
>             spare     DEGRADED     0     0     0
>               c4t2d0  FAULTED      0 23.7K     0  too many errors
>               c4t11d0 ONLINE       0     0     0
>             c4t3d0    ONLINE       0     0     0
>             c4t4d0    ONLINE       0     0     0
>             c4t5d0    ONLINE       0     0     0
>             c4t6d0    ONLINE       0     0     0
>             c4t8d0    ONLINE       0     0     0
>             c4t9d0    ONLINE       0     0     0
>             c4t10d0   ONLINE       0     0     0
>         spares
>           c4t11d0     INUSE     currently in use
>           c4t12d0     AVAIL
>
> One remark: I find the overview above a bit confusing ('spare' apparently
> is 'DEGRADED' and consists of c4t2d0 and c4t11d0), but the hot spare was
> properly activated this time and my pool is otherwise in good health.

Yeah, this bit is a little confusing, but just remove c4t2d0 and things
will start looking good again.

vh

Mads Toftum
--
http://soulfood.dk
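For the record, a hedged sketch of what "remove c4t2d0" could look like in
practice; the choice between the two options below depends on whether the
failed drive will be physically replaced, and the exact follow-up steps
are an assumption rather than something stated in the thread:

# Option 1: keep the spare permanently in place of the failed disk;
# detaching the faulted device promotes the in-use hot spare.
zpool detach data c4t2d0

# Option 2: after physically swapping in a new drive at the same
# location, resilver onto it; the hot spare returns to the spare list
# once the replace completes.
zpool replace data c4t2d0

# Either way, clear the error counters afterwards and check the pool.
zpool clear data
zpool status data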