Christian Rost
2007-May-05  09:41 UTC
[zfs-discuss] Optimal strategy (add or replace disks) to build a cheap and raidz?
Hello, I have an 8-port SATA controller and I don't want to spend the money on 8 x 750 GB SATA disks right now. I'm thinking about the best way to build a growing raidz pool without losing any data. As far as I know, there are two ways to achieve this:
- Adding 750 GB disks from time to time. But this would lead to multiple groups, each with its own redundancy/parity disk, so I would not reach the maximum capacity of 7 x 750 GB at the end.
- Buying "cheap" 8 x 250 GB SATA disks at first and replacing them from time to time with 750 GB or bigger disks. Disadvantage: at the end I've bought 8 x 250 GB + 8 x 750 GB hard disks.
My question now: is the second way reasonable, or am I missing something? Anything else to consider?
This message posted from opensolaris.org
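The capacity difference between the two options can be sketched with a little arithmetic. This is a rough illustration only: it assumes single-parity raidz, assumes a group is limited by its smallest disk, ignores filesystem overhead, and uses made-up function names. The two-groups-of-four split for the first option is an assumption, since the message does not say how the disks would be grouped.

```python
def raidz1_usable_gb(disk_sizes_gb):
    """Usable space of one single-parity raidz group, in GB.

    Assumption: each member contributes only as much as the smallest
    disk, and one disk's worth of space goes to parity.
    """
    if len(disk_sizes_gb) < 2:
        raise ValueError("a raidz1 group needs at least 2 disks")
    return (len(disk_sizes_gb) - 1) * min(disk_sizes_gb)


def pool_usable_gb(groups):
    """Usable space of a pool built from several raidz1 groups."""
    return sum(raidz1_usable_gb(group) for group in groups)


# Option 1 (assumed split): 750 GB disks added as two groups of four.
added_in_groups = pool_usable_gb([[750] * 4, [750] * 4])

# Option 2: one group of eight, first with 250 GB and later 750 GB disks.
start_small = pool_usable_gb([[250] * 8])
fully_replaced = pool_usable_gb([[750] * 8])

print(added_in_groups, start_small, fully_replaced)  # 4500 1750 5250
```

Under these assumptions the second option ends at the full 7 x 750 GB, while growing by groups gives up one extra disk to parity.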
MC
2007-May-05  12:01 UTC
[zfs-discuss] Re: Optimal strategy (add or replace disks) to build a cheap and raidz?
You could estimate how long it will take for ZFS to get the feature you need, and then buy enough space so that you don't run out before then. Alternatively, Linux mdadm DOES support growing a RAID5 array with devices, so you could use that instead.
Harold Ancell
2007-May-05  13:49 UTC
[zfs-discuss] Optimal strategy (add or replace disks) to build a cheap and raidz?
At 04:41 AM 5/5/2007, Christian Rost wrote:
>My question now:
>Is the second way reasonable, or am I missing something? Anything else to consider?
Pardon me for jumping into a group I just joined, but I sense you are asking sort of a "philosophy of buying" question, and I have a different one that you may find useful, plus a question I'd like to confirm from my reading and searching of this list so far:
For what and when to buy, I observe two things: at some point you HAVE to buy something; and with disks exceeding Moore's Law (aren't they at about a doubling every 12 months instead of 18?), you're going to feel some pain afterwards *whenever* you purchase as prices continue to plummet. From that, many say buy what you have to buy when you have to, although that isn't so useful if growing a RAID-Z is difficult....
And now the useful (I hope) observation: try plotting the price performance of parts like this. When you do so, you'll generally find a "knee" where the price shoots up dramatically for the last increment(s) of performance. When I buy e.g. processors, I pick one that is just before the beginning of this knee, and for me (your mileage will vary :-), I suffer the least "buyers remorse" afterwards.
The last time I checked and plotted this out, the knee is between 500 and 750GB for 7200.10 Seagate drives, and we can be pretty sure this won't change until sometime after 1TB disks are widely adopted---and 500GB makes the math simple ^_^.
BTW, is it really true there are no PCI or PCIe multiple SATA connection host adaptors (at ANY reasonable price, doesn't have to be "budget") that are really solid under OpenSolaris right now? This would indeed seem to be a very big problem in the context of what ZFS and especially RAID-Z/Z2 have to offer.... I know I've selected OpenSolaris primarily based on "pick the software you want to run, and then buy the platform that best supports it", that software being ZFS (plus I just plain like Solaris, and don't particularly like Linux, even if I still curse the BSD -> AT&T change of 4.x to 5.x :-).
For now, I'm going to buy a board with 4 SATA ports and use two 10K.7 SCSI drives for my system disks, and "play" with RAID-Z with 4 500GB drives on the former, and mirroring and ZFS in general with the latter, and hope that by the time I NEED to build additional larger SATA arrays or mirrors the above adaptor issue has been resolved....
- Harold
Brian Hechinger
2007-May-05  14:50 UTC
[zfs-discuss] Optimal strategy (add or replace disks) to build a cheap and raidz?
On Sat, May 05, 2007 at 02:41:28AM -0700, Christian Rost wrote:
>
> - Buying "cheap" 8x250 GB SATA disks at first and replacing them from time to time by 750 GB
> or bigger disks. Disadvantage: At the end i've bought 8x250 GB + 8x750 GB Harddisks.
Look at it this way. The amount you spend on 750G disks now will be equal to the amount you would spend on 250G disks now and 750G disks later, after the prices drop. For the same cost as buying 750G disks now, you would end up with a set of both 250G *AND* 750G disks.
Maybe those 250G disks can be used elsewhere. Maybe buying more controllers and more disk enclosures and *ADDING* the 750G disks to the pool would be a reasonable cost in the future. Then you would essentially have 1TB disks' worth of space.
The beauty of ZFS (IMHO) is that you only need to keep disks the same size within a vdev, not the pool. So a raidz vdev of 250G disks and a raidz vdev of 750G disks will happily work in a single pool.
To showcase ZFS at work I set up a zpool with three vdevs: two 146G disks in a mirror, 5 36G disks in a raidz and 5 76G disks in a raidz. Everyone was completely impressed. ;)
-brian
ps: Does mixing raidz and mirrors in a single pool have any performance degradation associated with it? Is ZFS smart enough to know the read/write characteristics of a mirror vs. a raidz and try to take advantage of that? Just curious.
--
"Perl can be fast and elegant as much as J2EE can be fast and elegant. In the hands of a skilled artisan, it can and does happen; it's just that most of the shit out there is built by people who'd be better suited to making sure that my burger is cooked thoroughly." -- Jonathan Patschke
Richard Elling
2007-May-05  15:17 UTC
[zfs-discuss] Optimal strategy (add or replace disks) to build a cheap and raidz?
Harold Ancell wrote:
> At 04:41 AM 5/5/2007, Christian Rost wrote:
>
>> My question now:
>> Is the second way reasonable, or am I missing something? Anything else to consider?
Mirroring is the simplest way to expand in size and performance.
> Pardon me for jumping into a group I just joined, but I sense you are
> asking sort of a "philosophy of buying" question, and I have a different
> one that you may find useful, plus a question I'd like to confirm from
> my reading and searching of this list so far:
>
> For what and when to buy, I observe two things: at some point you
> HAVE to buy something; with disks exceeding Moore's Law (aren't they
> at about a doubling every 12 months instead of 18?), you're going to
> feel some pain afterwards *whenever* you purchase as prices continue to
> plummet. From that, many say buy what you have to buy when you have to,
> although that isn't so useful if growing a RAID-Z is difficult....
Disk prices remain constant. Disk densities change. With 500 GByte disks in the $120 range, they are on the way out, so are likely to be optimally priced. But they may not be available next year. If you mirror, then it is a no-brainer, just add two.
> And now the useful (I hope) observation: try plotting the price
> performance of parts like this. When you do so, you'll generally
> find a "knee" where it shoots up dramatically for the last increment(s)
> of performance. When I buy e.g. processors, I pick one that is just
> before the beginning of this knee, and for me (your mileage will
> vary :-), I suffer the least "buyers remorse" afterwards.
"buyers remorse" for buying computer gear? We might have to revoke your geek license :-)
> The last time I checked and plotted this out, the knee is between 500
> and 750GB for 7200.10 Seagate drives, and we can be pretty sure this
> won't change until sometime after 1TB disks are widely adopted---and
> 500GB makes the math simple ^_^.
>
> BTW, is it really true there are no PCI or PCIe multiple SATA connection
> host adaptors (at ANY reasonable price, doesn't have to be "budget") that
> are really solid under OpenSolaris right now? This would indeed seem
> to be a very big problem in the context of what ZFS and especially
> RAID-Z/Z2 have to offer.... I know I've selected OpenSolaris primarily
> based on the "pick the software you want to run, and then buy the
> platform that best supports it", that software being ZFS (plus I just
> plain like Solaris, and don't particularly like Linux, even if I still
> curse the BSD -> AT&T change of 4.x to 5.x :-).
Look for gear based on LSI 106x or Marvell (see marvell88sx). These are used in Sun products, such as thumper.
-- richard
Orvar Korvar
2007-May-05  18:13 UTC
[zfs-discuss] Re: Optimal strategy (add or replace disks) to build a cheap and raidz?
What brand is your 8-port SATA controller? I want a SATA controller too, but I've heard that Solaris is picky about the model: not all controllers work. Does yours?
Christian Rost
2007-May-06  14:28 UTC
[zfs-discuss] Re: Optimal strategy (add or replace disks) to build a cheap and raidz?
I have an EX-3403 Marvell 88SX6081 controller. Unfortunately it is revision 07, which does not seem to be supported yet. I don't see any disks ... :-( This is already discussed here: http://www.opensolaris.org/jive/thread.jspa?threadID=13533
Harold Ancell
2007-May-06  22:19 UTC
[zfs-discuss] Optimal strategy (add or replace disks) to build a cheap and raidz?
Date: Sat, 05 May 2007 08:17:41 -0700
  From: Richard Elling <Richard.Elling at Sun.COM>
  Harold Ancell wrote:
  At 04:41 AM 5/5/2007, Christian Rost wrote:
    For what and when to buy, I observe two things: at some point you
    HAVE to buy something; with disks exceeding Moore's Law (aren't
    they at about a doubling every 12 months instead of 18?), you're
    going to feel some pain afterwards *whenever* you purchase as
    prices continue to plummet.  From that, many say buy what you have
    to buy when you have to, although that isn't so useful if growing
    a RAID-Z is difficult....
  Disk prices remain constant.
After checking my spreadsheets for the last eight years, you're right;
I hadn't looked at it that way.  And it's an important point.
  Disk densities change.  With 500 GByte disks in the $120 range, they
  are on the way out, so are likely to be optimally priced.  But they
  may not be available next year.  If you mirror, then it is a
  no-brainer, just add two.
A good point---but how realistic is the concern that 500GB drives will
go "poof"?  I guess the question is "who buys them"?  Lower-capacity
disks stay in production because people only "need" X disk space today,
where X is *currently* below 500GB.
I checked Dell.com, and their "we want you to buy this higher end home
machine" offer has a 320GB stock drive, but highlights a 500GB just
below it with a bolded "Dell recommended for photos, music, and
games!", for an extra 120 US$, about 10% of the machine's price.
I'll bet a lot of people take them up on that.
    [ Query on solid host PCI/PCIe multiple SATA adaptors. ]
  Look for gear based on LSI 106x or Marvell (see marvell88sx).  These
  are used in Sun products, such as thumper.
The LSI 106[48] are PCI-X chips and the Marvell driver is for PCI-X as
well, according to http://www.sun.com/bigadmin/hcl/driverlist.html and
some searching of this list's archives.  Also, as of last year the
latter (or most likely its driver) seemed to have some bugs---and is
closed source, so we can't fix them :-(.
The LSI 106[48]e are a possibility: the 8 SAS/SATA port LSISAS1068E
would seem to be the interesting chip.  It uses the existing LSI Logic
Fusion-MPT software infrastructure.
LSI makes RAID 0, 1, 1E and 10E only cards; they have a product matrix
of their PCI-X and PCIe el cheapo cards at:
http://www.lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/index.html
It has four PCI Express parts listed at the bottom: below find the
model name, Kit/Single Part number, and date for the "product brief"
glossy brochure, prices for a retail kit from scsi4me.com (no
endorsement there, they are just the first distributor I found that
carries these, be warned they list prices for 5 unit bulk packs, you
have to select the retail kit on the "Condition" drop box):
  LSISAS3041E-R, LSI00112, 4 internal, 12/06, 70 US$.
  LSISAS3442E-R, LSI00110, 4 external and 4 internal, 10/06, 109 US$.
  LSISAS3801E, LSI00138, 8 external, 02/07 (Google returns only 13
    English hits on this model name, very new indeed).
  LSISAS3081E-R, LSI00151, 8 internal, none, does not seem to be shipping yet.
The LSI SAS3442E-R would seem to be their first released card:
http://www.lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/lsisas3442er/index.html
http://www.lsi.com/files/docs/marketing_docs/storage_stand_prod/sas/LSISAS3442E-R_PB.pdf
  LSI SAS3442E-R
  PCI Express, 3 Gb/s, SAS, 8-port Host Bus Adapter
  The LSI SAS3442E-R four-port internal/four-port external SAS PCI
  Express storage adapter provides 300 MB/s bandwidth (600 MB/s, full
  duplex) on each port for combined throughput of up to 2.4 GB/s.  The
  storage adapter supports multi-volume OS independent Integrated RAID
  0, 1, 1E and 10E without the need for special drivers. The
  SAS3442E-R features PCI Express connectivity, removing the host bus
  bottleneck from the parallel PCI busses.
    * LSI LSISAS3442E-R KIT - PCI Express, 3 Gb/s, 8-port, SAS
      - Individually packaged box with LSI LSISAS3442E-R HBA, CD
      containing drivers, (1) internal cable for connecting to (4) SAS
      hard disk drives, (1) internal cable for connecting to (4) SATA
      HDD or SATA-style backplane connectors, low-profile bracket, and
      quick installation guide
      - Part Number: LSI00110
One fly in the ointment for me at the moment: these have one PCIe lane
per supported port, x4 or x8.  This is not insane: with SAS topologies,
the larger boards can be connected to up to 1023 devices (!!!, and see
below).  But it would allow only one card in the motherboard I'm
thinking of buying for my ZFS fileserver (the Tyan Tomcat K8E S2865).
If the single SATA port to multiple SATA drive enclosure problem can
be solved, this would not prove to be much if any of a limitation,
depending on how much traffic these first generation cards can handle,
and how many disks per enclosure.  The Seagate 7200.10 SATA Product
Manual says the "Sustained data transfer rate OD" for 500GB drives is
72MB/s (750GB is 78MB/s).  One PCIe lane will handle 250MB/s, about
3.5 times that, so the raw transfer numbers are there....
(I'm interested in these limits because I *desire* to have a RAID-Z
system feed a Quantum LTO-3 tape drive at its full native 68MB/s (it
will step down if that isn't possible).)
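The raw-rate comparison above can be checked with a short calculation. This is only a sketch using the figures quoted in this message (250MB/s per PCIe lane, 72 and 78MB/s per disk, 68MB/s for LTO-3); the 8-disk count and the function names are illustrative assumptions, and real throughput will depend on the controller, driver and RAID-Z overhead.

```python
import math

# Figures quoted in the message above (MB/s); nothing here is measured.
PCIE_LANE_MB_S = 250
DISK_MB_S = {"500GB": 72, "750GB": 78}
LTO3_NATIVE_MB_S = 68


def disks_per_lane(disk_mb_s, lane_mb_s=PCIE_LANE_MB_S):
    """How many disks streaming at full rate fit into one lane."""
    return lane_mb_s / disk_mb_s


def lanes_to_saturate(n_disks, disk_mb_s, lane_mb_s=PCIE_LANE_MB_S):
    """Lanes needed if all n_disks stream at their sustained rate."""
    return math.ceil(n_disks * disk_mb_s / lane_mb_s)


for model, rate in DISK_MB_S.items():
    # Hypothetical 8-disk array, matching an 8-port card.
    print(model, round(disks_per_lane(rate), 2), lanes_to_saturate(8, rate))

# A single disk's sustained outer-diameter rate versus the tape drive.
print(DISK_MB_S["500GB"] >= LTO3_NATIVE_MB_S)
```

On these numbers, eight disks streaming flat out would need three lanes, so an x4 or x8 card is not the bottleneck on paper.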
Has anyone tried one of these LSI PCIe cards?  I checked this mailing
list and storage-discuss for each one, and only found these:
  Date: Tue, 17 Oct 2006 13:24:55 -0700
  From: Frank Cusack <fcusack at fcusack.com>
  Subject: Re: Re: ZFS Inexpensive SATA Whitebox
  Message-ID: <C78BB26E8A484B08EA4B232C at sucksless.local>
  [...] For example, I've had a devil of a time trying to get the LSI
  3442-E working on Solaris SPARC (works on x86 with LSI driver)....
  Date: Fri, 19 Jan 2007 23:23:49 -0800
  From: Frank Cusack <fcusack at fcusack.com>
  Subject: Re: External drive enclosures + Sun Server for mass storage
  Message-ID: <A8D436619E15260BD2117C77 at sucksless.local>
  On January 19, 2007 5:59:13 PM -0800 "David J. Orman" 
  <ormandj at corenode.com> wrote:
  > card that supports SAS would be *ideal*,
  Except that SAS support on Solaris is not very good.
  One major problem is they treat it like scsi when instead they should
  treat it like FC (or native SATA).
  > So, anybody have any good suggestions for these two things:
  >
  ># 1 - SAS/SATA PCI-E card that would work with the Sun X2200M2.
  I had the lsilogic 3442-E working on x86 but not reliably.  That is
  the only SAS controller Sun supports AFAIK.
  [...]
Which are ambiguous about the target being SATA or SAS, although
probably the former.
What I've found so far on this list is that if you want multi-port
non-PCI-X SATA, get a Silicon Image chip based card, e.g. the Si3114
(SATA 150 though :-( ), and flash its BIOS to IDE if it comes with the
RAID BIOS.  At 22 US$ from NewEgg, that "works for me!" until there is
a better solution.
                                        - Harold
Richard Elling
2007-May-06  23:52 UTC
[zfs-discuss] Optimal strategy (add or replace disks) to build a cheap and raidz?
Harold Ancell wrote:
> I checked Dell.com, and their "we want you to buy this higher end home
> machine" offer has a 320GB stock drive, but highlights a 500GB just
> below it with a bolded "Dell recommended for photos, music, and
> games!", for an extra 120 US$, about 10% of the machine's price.
>
> I'll bet a lot of people take them up on that.
Interestingly, I was online recently comparing Dell, Frys.com, and Apple Store prices (for a research project). For a sampling of products exactly the same, Dell generally had the worst prices, Frys the best, and Apple more often matched Frys than Dell. Specifically, for a 500 GByte disk, Dell was asking $189 versus $129 at Frys. I couldn't directly compare Apple's price because they don't sell raw disks, they sell "modules" which cost 2x the price of an external drive listed right next to the modules -- go figure.
As the old saying goes, it pays to shop around.
-- richard
Erblichs
2007-May-08  16:33 UTC
[zfs-discuss] Optimal strategy (add or replace disks) to build a cheap and raidz?
Group,
MOST people want a system to work without doing ANYTHING when they turn on the system. So yes, the thought of people buying another drive and installing it in a brand new system would be insane for this group of buyers.
Mitchell Erblich
------------------
Richard Elling wrote:
>
> Harold Ancell wrote:
> > I checked Dell.com, and their "we want you to buy this higher end home
> > machine" offer has a 320GB stock drive, but highlights a 500GB just
> > below it with a bolded "Dell recommended for photos, music, and
> > games!", for an extra 120 US$, about 10% of the machine's price.
> >
> > I'll bet a lot of people take them up on that.
>
> Interestingly, I was online recently comparing Dell, Frys.com, and Apple
> Store prices (for a research project). For a sampling of products exactly
> the same, Dell generally had the worst prices, Frys the best, and Apple
> more often matched Frys than Dell. Specifically, for a 500 GByte disk,
> Dell was asking $189 versus $129 at Frys. I couldn't directly compare
> Apple's price because they don't sell raw disks, they sell "modules" which
> cost 2x the price of an external drive listed right next to the modules --
> go figure.
>
> As the old saying goes, it pays to shop around.
> -- richard
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Pål Baltzersen
2007-May-11  14:01 UTC
[zfs-discuss] Re: Optimal strategy (add or replace disks) to build a cheap and raidz?
I use a Supermicro AOC-SAT2-MV8. It is 8-port SATA2, JBOD only, literally plug&play (sol10u3), and costs just ~100 Euro. It is PCI-X, but mine is plugged into a plain PCI slot/mobo and works fine. (I don't know how much better it would perform in a PCI-X slot/mobo.) I bought mine here: http://www.mullet.se/sortiment/product.htm?product_id=133690&category_id=5907&search_page
Googling for AOC-SAT2-MV8 should give you lots of webshops.
Pål
Pål Baltzersen
2007-May-11  16:41 UTC
[zfs-discuss] Re: Optimal strategy (add or replace disks) to build a cheap and raidz?
Just my problem too ;) And ZFS disappointed me big time here! I know ZFS is new and not every desired feature is implemented yet. I hope and believe more features are coming "soon", so I think I'll stay with ZFS and wait.
My idea was to start out with just as many state-of-the-art-size disks as I really needed and could afford, and add disks as the price dropped and the zpool grew near full. So I bought 4 Seagate 500GB disks. Now they are full, and meanwhile the price has dropped to ~1/3 and will continue to drop to ~1/5, I expect (I've just seen 1TB disks in stock at the same price the 500GB started at when released ~2 years ago). I thought I could buy one at a time and expand the raidz. I have realized that is (currently) *not* an option! You can't (currently) add (attach) disks to a raidz vdev, period!
What you can do, i.e. the only thing you can do (currently), is add a new raidz to an existing pool, somewhat like concatenating two or more raidz vdevs into a logical volume. So ZFS does not help you here; you'll have to buy an economically optimal bunch of disks each time you run out of space and group them into a new raidz vdev each time. The new raidz vdev may be part of the existing pool (volume) (most likely), or a new one. So with 8-port controller(s) you'd buy 4+4+4+4 or 4+4+8 or 8+8+8 or any number > 3 that fits your need and wallet at the time. For each set you lose one disk to redundancy. Buying 5+1+1+1+1... is not an option (yet). Alternatively, of course, you could buy 2+2+2+2... and add mirrored pairs to the pool, but then you lose 50% to redundancy, which is not a budget approach. Buying 4+4+4+4 gives you at best 75% usable space for your money ((N-1)/N for each set, i.e. 3/4); that is when your pool is 100% full. But if your usage grows slowly from nothing, then adding mirror pairs could actually be more economical, and if it accelerates you could later add groups of raidz1 or raidz2.
Note! You can't even regret what you have added to a pool. Being able to evacuate a vdev and replace it by a bigger one would have helped. But this isn't possible either (currently).
I'd really like to see adding disks to a raidz vdev implemented. I'm no expert on the details, but having read a bit about ZFS, I think it shouldn't be that hard to implement (just extremely I/O intensive while in progress, like a scrub where every stripe needed correction). It should be possible to read stripe by stripe and recalculate/reshape to a narrower but longer stripe spanning the added disk(s), and since more space is added and the recalculated stripes would be narrower (at least not wider), everything should fit as a sequential process. One would need some persistent way to keep track of the progress that would survive and resume after power loss, panic etc. A bitmap could do. Labeling the stripes with a version could make it possible to have a mix of old short and new longer stripes coexisting for a while: say, write new stripes (i.e. files) with the new size, and recalculate and reshape everything as (an optional) part of the next scrub. A constraint would probably be that each time you would have to add at least as much space as your biggest file (file|metadata=stripe as far as I have understood); at least that is true if the reshape process saves the biggest/temporarily non-fitting stripes to the end of the process, to make sure there is always one good copy of every stripe on disk at any time, which is much of the point of ZFS. An implementation of something like this is *very* welcome!
I would then also like to be able to convert my initial raidz1 to raidz2, so I could, ideally, start with a 2-disk raidz1 and end up with a giant raidz2, split it into groups of a reasonable number of disks, and start a new raidz1 growing from 2 disks at every 10 disks or so, probably stepping up at the same time to the then state-of-the-art disk size for each new vdev (and, just before I run out of slots, start replacing the by then ridiculously small (and slow) disks (and controller) in the first raidz, and thus grow forever without necessarily needing a bigger chassis or more rack units).
Backing up the whole thing, destroying and recreating the pool, and restoring everything every couple of months isn't really an option for me. Actually, I have no clue how to back up such a thing on a private budget. Tape drives that could cope are way too expensive, and tapes aren't that cheap compared to mid-range SATA disks. The best thing I can come up with is rsync to a clone system (built around your old PC/server but with similar disk capacity; less or no redundancy could do). Since this is budget HW, there is no significantly cheaper way to build a downscaled clone except reusing an old CPU and RAM and so on.
And by the way, yes, I think this applies to professional use too. It could give substantial savings on any scale. Buying things you don't really need till next year has usually been a 50% waste of money in this business for at least the last 25 years. Next week or month you get more for less! And in business use I wouldn't have one system; I have thousands, so it applies even more there, I think. The argument I've seen that a business can always afford buying, say, 6 disks at a time does not hold. It isn't just 6 disks; it's 6 disks for each system running out of space, and the sum of all the wasted space in each separate system.
ZFS is really good, make it better! Thanks :)
Pål
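The "usable space for your money" comparison between purchase plans can be written out as exact fractions. This is a minimal sketch, assuming every disk in a plan has the same size and price, that each group loses a fixed number of disks to redundancy, and that a two-way mirror can be treated as a 2-disk group losing one disk; the function and plan names are made up for illustration.

```python
from fractions import Fraction


def usable_fraction(group_sizes, redundancy_disks=1):
    """Fraction of purchased disks that ends up as usable space.

    Assumption: all disks are the same size, and each group gives up
    `redundancy_disks` disks' worth of space to parity or mirroring.
    """
    if any(n <= redundancy_disks for n in group_sizes):
        raise ValueError("each group needs more disks than it loses")
    usable = sum(n - redundancy_disks for n in group_sizes)
    return Fraction(usable, sum(group_sizes))


plans = {
    "raidz1 4+4+4+4": [4, 4, 4, 4],
    "raidz1 4+4+8": [4, 4, 8],
    "raidz1 8+8": [8, 8],
    "mirror pairs 2+2+2+2": [2, 2, 2, 2],
}

for name, groups in plans.items():
    print(f"{name}: {usable_fraction(groups)}")
```

The 3/4 and 1/2 figures match the ones given in the message; the other plans are just the same formula applied to wider groups.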
Robert Milkowski
2007-May-15  09:15 UTC
[zfs-discuss] Re: Optimal strategy (add or replace disks) to build a cheap and raidz?
Hello Pal,
Friday, May 11, 2007, 6:41:41 PM, you wrote:
PB> Note! You can't even regret what you have added to a pool. Being
PB> able to evacuate a vdev and replace it by a bigger one would have
PB> helped. But this isn't possible either (currently).
Actually you can. See 'zpool replace'.
So you can replace all 500GB disks with 1TB disks, and once you are done
you will get more space.
-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
Christian Rost
2007-May-15  10:17 UTC
[zfs-discuss] Re: Re: Optimal strategy (add or replace disks) to build a cheap and raidz?
Yes, I have tested this virtually with VMware. Replacing disks with bigger ones works great. But the new space becomes usable only after replacing *all* disks. I had hoped that the new space would be usable after replacing 3 or 4 disks.
I think the best strategy for me now is buying 2 x 750 GB disks and using a mirrored setup. When the storage is full I'll buy 2 new disks and concatenate the mirrored vdevs. At the end, when 8 disks are full (4 x 750 GB usable), I will switch the layout to raidz and have 7 x 750 GB available. I'll have to copy the complete data to another disk storage, but that is OK for me. Perhaps ZFS will even permit changing the storage layout from mirror to raidz in the future ...
This should be the best-working low-cost strategy for me now.
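The observation that the extra space appears only after the last disk is swapped follows if a raidz group is limited by its smallest member. A small sketch of both paths, under that assumption and ignoring filesystem overhead (function names are illustrative, and the 250 GB starting size is borrowed from the first message in the thread):

```python
def raidz1_usable_gb(disk_sizes_gb):
    """Assumed model: (disks - 1) times the smallest disk in the group."""
    return (len(disk_sizes_gb) - 1) * min(disk_sizes_gb)


def replacement_progress(n_disks, old_gb, new_gb):
    """Usable GB before any swap and after each one-by-one replacement."""
    sizes = [old_gb] * n_disks
    history = [raidz1_usable_gb(sizes)]
    for i in range(n_disks):
        sizes[i] = new_gb
        history.append(raidz1_usable_gb(sizes))
    return history


# Replace-in-place path: 8 x 250 GB swapped one by one for 750 GB disks.
print(replacement_progress(8, 250, 750))
# [1750, 1750, 1750, 1750, 1750, 1750, 1750, 1750, 5250]

# Mirror path: usable GB after each added pair of 750 GB disks.
mirror_steps = [pairs * 750 for pairs in range(1, 5)]
print(mirror_steps)  # [750, 1500, 2250, 3000]

# Final switch of the same 8 disks to a single raidz1 group.
print(raidz1_usable_gb([750] * 8))  # 5250
```

The mirror path grows in usable steps with every pair bought, at the cost of one full copy of the data when switching to raidz at the end.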