This isn't about zfs as such, but about how to build a system for zfs.

Zfs likes JBOD, right? So how do I best build a system with lots of raw disk?

Let's assume that we're talking Sun kit (as I'm generally familiar with most of the bits). And that we're talking about a fibre interconnect - so that it's basically a SAN, and I can just add more disk to the network any time I like.

This gives me the 3510 and 3511 at the bottom end. I've been reading up on these (we already use direct attach 3510 boxes with hardware raid, and 3320/3310 scsi boxes). My understanding here is that the 3510 is only supported for direct attach to a host (so no SAN switches) and the 3511 isn't supported for host attach at all - you're supposed to hang it off a controller unit.

Go further up the scale and it isn't clear to me that JBOD exists.

So, any suggestions for a good way to connect lots of JBOD disk to a machine?

-- 
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
I think it would be safer to say, "ZFS likes lots of LUNs". That way it can better place data within a pool, deal with failures, etc. JBOD makes that easier as usually you throw lots of drives in a JBOD and go. There is nothing inherent in a hardware raid array that disables features or interferes with ZFS. You may want to reconfigure or re-architect in order to use those features more efficiently but I don't think anyone is going to say, "Throw away all your hardware raid controllers and convert to JBOD".

3510 and 3511 with raid controllers are supported on SAN or direct attach. The JBOD variant of the 3510 is only supported in direct connect mode as you state.

As for lots of FC connected JBOD: Got any A5K lying around? ;)

Peter Tribble wrote:
> So, any suggestions for a good way to connect lots of JBOD disk
> to a machine?
Torrey McMahon wrote:
> I think it would be safer to say, "ZFS likes lots of LUNs". That way it
> can better place data within a pool, deal with failures, etc.

The existence of this question (and it's quite appropriate, IMO) suggests the need for some good "Best Practices" documentation of how to handle ZFS in "big" SANs. I thought I'd seen a writeup from Bill Moore but it's lost in a mountain of email. Since ZFS changes the dynamic so much, but so many customers have so much invested in their setups with these large SANs, are there plans for such a guide? IME, many customers spend more on their storage than on their systems and software.

 - Pete
I had a question much the same as yours before, but mainly because I wanted to avoid the rather pricey (sorry Sun) storage when it was unnecessary.

One of these:
http://www.supermicro.com/products/accessories/addon/AoC-SAT2-MV8.cfm
Plus one of these:
http://www.cooldrives.com/sataenclosures.html

That should do you! :) At least, that's my plan. I've been assured that the SATA controller is supported, you might want to double check just to be sure, though.

If you need more than 8 drives, I'm not really sure. :) Maybe two of the marvell controllers? :P

This way you can buy SATA drives for the going ~30c/gig, instead of rather, uhm, painful pricing.

3511 for 1250G at 5 drives = ~18000$:
http://store.sun.com/CMTemplate/CEServlet?process=SunStore&cmdViewProduct_CP&catid=114140

That's $14.7/gig.

Solution I outlined for 2000G at 8 drives = ~1300$:
http://www.newegg.com/Product/Product.asp?Item=N82E16822144701
http://www.supermicro.com/products/accessories/addon/AoC-SAT2-MV8.cfm
http://www.cooldrives.com/sata-eight-bay-enclosure-1.html

That's $0.65/gig.

I don't think the integrated raid controller in the sun storage array is worth that kind of pricing. That's just me though. :) If somebody can point me to a nice 8+ drive rack mount enclosure for disks with SATA interface, I'd be super appreciative! :P

Cheers,
David

> So, any suggestions for a good way to connect lots of JBOD disk
> to a machine?
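[For what it's worth, a minimal sketch of what a pool across eight such SATA drives might look like. The device names are hypothetical and depend entirely on how the controller enumerates the disks:

    # one single-parity raidz group across all eight drives
    zpool create tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 \
                            c2t4d0 c2t5d0 c2t6d0 c2t7d0

    # or four mirrored pairs, trading capacity for resilience
    zpool create tank mirror c2t0d0 c2t1d0 mirror c2t2d0 c2t3d0 \
                      mirror c2t4d0 c2t5d0 mirror c2t6d0 c2t7d0

Either way ZFS sees raw spindles, which is the point of the DIY JBOD approach above.]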
Sorry, I misread your request, I didn't see you needed it to be SAN. My apologies, I suppose the FC connections would be worth the cost to you then.

Cheers,
David
I was hoping to provide (at least) some of this. I'm planning to do some Sun on Sun(tm) work with this to come up with best practices for ZFS in real terms.

I believe in ZFS. I think it is the future. However, it is not a simple beast, and offers plenty of points for sub-optimal configurations.

I'm waiting for the ZFS command set to stabilize before I start. Does anybody know the timeline for production quality ZFS? (e.g. when it will be practical to trust oracle on top of ZFS?)

On Apr 3, 2006, at 2:22 PM, Peter Rival wrote:
> The existence of this question (and it's quite appropriate, IMO)
> suggests the need for some good "Best Practices" documentation of
> how to handle ZFS in "big" SANs.

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382        greg.shaw at sun.com (work)
Louisville, CO 80028-4382        shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
On Mon, Apr 03, 2006 at 02:31:50PM -0600, Gregory Shaw wrote:
> I believe in ZFS. I think it is the future. However, it is not a
> simple beast, and offers plenty of points for sub-optimal
> configurations.
>
> I'm waiting for the ZFS command set to stabilize before I start.

I'm not sure what you mean by this. It's been stable since its integration. Or do you mean "not adding new features"? If it's the latter, we'll always be adding new features so you'll be waiting for a while.

> Does anybody know the timeline for production quality ZFS? (e.g.
> when it will be practical to trust oracle on top of ZFS?)

It is production quality now.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
On Mon, 2006-04-03 at 21:31, David J. Orman wrote:
> Sorry, I misread your request, I didn't see you needed it to be SAN. My
> apologies, I suppose the FC connections would be worth the cost to you
> then.

It doesn't have to be SAN. Personally, I'm not that enamoured with SAN systems - too complex and unreliable for my tastes - but they do have the advantage of allowing you to scale better than direct attach, and make adding to and managing the storage easier. A rough estimate indicates that significantly more than a dozen fast drives would be required just for one system - make that 20 SATA, just to get enough spindles.

Anybody from Sun care to comment on Thumper? Something like that could make a viable alternative.

-- 
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Peter Tribble wrote:
> It doesn't have to be SAN. Personally, I'm not that enamoured with
> SAN systems - too complex and unreliable for my tastes - but they
> do have the advantage of allowing you to scale better than direct
> attach, and make adding to and managing the storage easier.

Hi Peter,
In my experience of supporting customer SANs, and now writing the drivers to provide that connectivity, the complex SAN is the one which the customer has not planned beforehand. And I include the "planning for expansion" part there too. As to unreliable - what experiences have you had which make SANs misbehave?

best regards,
James C. McPherson
--
Solaris Datapath Engineering
Data Management Group
Sun Microsystems
On Mon, Apr 03, 2006 at 03:58:02PM -0400, Torrey McMahon wrote:
> 3510 and 3511 with raid controllers are supported on SAN or direct
> attach. The JBOD variant of the 3510 is only supported in direct connect
> mode as you state.

what about a 3510 JBOD connected to a fabric? I have minimal expansion slots on the front end systems and I don't want to burn them to add more storage.

along these lines, my ideas for a storage project I'm working on were either:

- 3510 JBOD connected to fabric, 2x 5 disk raidz and two hot spares.
- 3510 RAID connected to fabric, 2x 5 disk HW raid5 sets, two hot spares, raid5 sets in a zfs pool.

our IO pattern will be small, whole file reads and writes. I imagine the benefit of the HW cache would be worth the extra cost. I've started running some filebench scenarios, but I don't have the hardware in my hands yet so I can't do a direct comparison.

any thoughts on the above?

grant.
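[A rough sketch of what those two layouts might look like at the pool level. Device names are hypothetical, and the spare vdev assumes a ZFS build with hot-spare support:

    # option 1: JBOD, two 5-disk raidz groups plus two spares
    zpool create tank \
        raidz c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 \
        raidz c4t5d0 c4t6d0 c4t7d0 c4t8d0 c4t9d0 \
        spare c4t10d0 c4t11d0

    # option 2: two hardware RAID-5 LUNs simply striped into a pool;
    # redundancy lives in the array, so ZFS can detect but not repair
    # damaged data blocks
    zpool create tank c5t40d0 c5t40d1

The second layout keeps the array's cache and parity handling, but gives ZFS nothing redundant to heal from.]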
Peter Rival wrote:
> Since ZFS changes the dynamic so much, but so many customers
> have so much invested in their setups with these large SANs,
> are there plans for such a guide?
> IME, many customers spend more on their storage than on their
> systems and software.

That's what ZFS is supposed to address. This will take some effort.

Henk (whose boss is at an EMC seminar this very moment)
On Mon, 2006-04-03 at 20:58, Torrey McMahon wrote:
> There is nothing inherent in a hardware raid array that disables features
> or interferes with ZFS. You may want to reconfigure or re-architect in
> order to use those features more efficiently

Having HW raid plus zfs costs you in two ways. The first is the obvious cost of the raid controllers (and they aren't at all cheap). The second is that you need to put in two lots of redundancy - once inside the HW raid, and then again so that zfs has redundant data.

> but I don't think anyone is
> going to say, "Throw away all your hardware raid controllers and convert
> to JBOD".

Oh I don't know! Besides, what about people starting from scratch? Looking at something like a 3510/3511, a JBOD solution is typically half the price of the equivalent HW raid solution.

> As for lots of FC connected JBOD: Got any A5K lying around? ;)

Ugh! (As it happens, yes, and the sooner they are put out of reach the better. Just replaced some with 3510s, and just have one left that I want to replace soon - hopefully with something that would involve zfs.)

-- 
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
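[To make the "two lots of redundancy" point concrete, a hypothetical example with made-up LUN names: to let ZFS self-heal on top of hardware RAID you would mirror two LUNs that are each already RAID-5 protected, paying for parity in the array and again in the pool:

    # ZFS-level mirror of two hardware RAID-5 LUNs of N disks each;
    # usable space is roughly (N-1)/(2N) of the raw disk capacity
    zpool create tank mirror c5t40d0 c5t41d0

That double cost is exactly the trade-off being debated here.]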
Grant says,
> what about a 3510 JBOD connected to a fabric? I have minimal expansion
> slots on the front end systems and I don't want to burn them to add
> more storage.

Just say "no." Actually, say "no way in hell" or "no (*%^@# way!"

The experience of the A5000 was so painful that I really hoped nobody would ever do it again. The fundamental problem is that the fault isolation is virtually nonexistent. Disks will never be sophisticated enough to offer both redundant paths and good fault isolation. In order to get there, you need disks which understand fabrics and switches with lots of ports and which are smart enough to understand broken disks. It is much, much easier to use a more modern technology which is designed for redundancy and fault isolation from the get-go: SAS. To use SAS, you are likely to be looking at a fancy controller which gets you back to a RAID array model. The circle is complete.

As others have pointed out, SATA is similar to SAS, but less robust and less costly. From a fault isolation perspective, SAS and SATA are staggeringly similar as commonly implemented (a good thing). I see few people implementing the dual-port SAS capabilities, so far.
 -- richard
Hello Peter,

Tuesday, April 4, 2006, 2:54:14 PM, you wrote:

>> As for lots of FC connected JBOD: Got any A5K lying around? ;)

PT> Ugh! (As it happens, yes, and the sooner they are put out of
PT> reach the better. Just replaced some with 3510s, and just have
PT> one left that I want to replace soon - hopefully with something
PT> that would involve zfs.)

Well, you can send me some A5k - I would gladly take them, especially with disks :)

-- 
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
On Tue, 2006-04-04 at 16:24, Richard Elling wrote:
> Grant says,
> > what about a 3510 JBOD connected to a fabric? I have minimal expansion
> > slots on the front end systems and I don't want to burn them to add
> > more storage.
>
> Just say "no." Actually, say "no way in hell" or "no (*%^@# way!"
>
> The experience of the A5000 was so painful

Indeed. This worried me as well. I was hoping that newer arrays would be better behaved, but I get the impression that you're saying they aren't.

Are you saying that we should forget JBOD and just stick regular HW arrays on the SAN?

What would you recommend as a scaleable solution, given a requirement for 20+ spindles, and an assumption that zfs is used?

Given the cost advantages of SATA, what about simply adding more drives and making 3-way mirrors to get the resilience back? (But even there, how to connect everything together - back to a SAN.)

-- 
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
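[If the three-way-mirror idea were pursued, the pool layout itself is simple enough. A sketch with hypothetical SATA device names, each top-level vdev being a triple mirror:

    # six drives as two three-way mirrors: any two drives in a vdev can
    # fail without data loss, at the cost of 2/3 of the raw capacity
    zpool create tank \
        mirror c2t0d0 c2t1d0 c2t2d0 \
        mirror c2t3d0 c2t4d0 c2t5d0

The open question in the thread is less the pool layout than how to physically attach that many spindles.]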
On Tue, 2006-04-04 at 11:24, Richard Elling wrote:
> To use SAS, you are likely to be looking at a fancy controller
> which gets you back to a RAID array model. The circle is
> complete.

so, I'm confused by this. Seems to me (as a networking guy) that if all you want is a JBOD with fault isolation, you can leave out caching, RAID, cache batteries, periodic cache battery maintenance, an administrative CLI/GUI, management ethernet & serial ports, etc. and I think you'd end up with a much simpler array controller..

By analogy to networking, consider the difference in complexity (and cost) between an unmanaged ethernet switch and a caching proxy...

                                        - Bill
Given ZFS's capabilities, I'd actually like to see if there were JBODs out there with NVRAM for write caching. Sort of like the old Sun StorEdge Array (you remember those, right? 10+ years ago?) Since HW raid isn't really needed anymore, I'd prefer to see what an NVRAM write cache would do for speeding things up.

Oh, and we (here in J2SE) just got one of the 6920 SAN things, and we'll be hooking our existing 3510/3511s into it. Seems a nice compromise on manageability, flexibility, storage, and cost. For our needs, I expect that we will solely be adding 3511s into it, since most of the requirements are long-term semi-archive storage. If you've got the coin, it's not bad at all. We got the base model with 4TB (essentially 2 6120 arrays, filled), and then have 3510s hooked into the FC switch head for management. Sun internal pricing is cheap (under $100k), but external is about 3x that.

Does anyone know if Sun is planning on producing something that looks like a 6020 but uses SATA instead of FC drives?

-Erik

On Tue, 2006-04-04 at 16:55 +0100, Peter Tribble wrote:
> What would you recommend as a scaleable solution, given
> a requirement for 20+ spindles, and an assumption that zfs
> is used?

-- 
Erik Trimble
Java System Support
Mailstop: usca14-102
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Peter Tribble wrote:
> Having HW raid plus zfs costs you in two ways. The first is the obvious
> cost of the raid controllers (and they aren't at all cheap). The second
> is that you need to put in two lots of redundancy - once inside the HW
> raid, and then again so that zfs has redundant data.

You're assuming that you have a bunch of Solaris boxes that are going to have dedicated storage arrays. Most datacenters are multiplatform and don't like dedicated storage arrays on each and every box. HW raid arrays let you consolidate storage across many hosts and OS.

I think it's fair to say that customers don't architect around specific OS or filesystems. They architect solutions to meet their overall data requirements. ZFS helps quite a bit but it's not the solution to every problem....though I'm sure Jeff and company are working on it. :)

Also, when people talk about HW raid storage arrays you usually get other benefits: Cache for coalescing writes, remote replication, ability to split mirrors, higher redundancy, etc. JBODs don't play in that space.
Bill Sommerfeld wrote:
> By analogy to networking, consider the difference in complexity (and
> cost) between an unmanaged ethernet switch and a caching proxy...

To continue your analogy, you would have to buy every host its own unmanaged ethernet switch and convert all the hosts to speak IP when they don't.
On Tue, 2006-04-04 at 12:07 -0400, Bill Sommerfeld wrote:
> so, I'm confused by this. Seems to me (as a networking guy) that if all
> you want is a JBOD with fault isolation, you can leave out caching,
> RAID, cache batteries, periodic cache battery maintenance, an
> administrative CLI/GUI, management ethernet & serial ports, etc. and I
> think you'd end up with a much simpler array controller..

Fundamentally, disks aren't very sophisticated, and the price point for disks doesn't allow much sophistication. One of the lessons learned with the A5000 is that even though each disk has 2 ports for connection to 2 different FC-AL loops, by definition that is a common cause fault opportunity. A faulty disk can bring down both loops. Since the disks aren't very sophisticated, and firmware is difficult to manage once released to the field, all hell can break loose. The way to get around this is to go point-to-point, ala SATA and SAS. You will see this pattern repeated often in the high-availability space. For a recent example, see the Sun Netra CT900 announced this week. This design has excellent fault isolation and containment.
http://www.sun.com/products-n-solutions/hw/networking/ct900/

> By analogy to networking, consider the difference in complexity (and
> cost) between an unmanaged ethernet switch and a caching proxy...

The problem in your analogy is that you are assuming IP. There is no equivalent level to IP in the SAN world, as implemented. SANs are more like IPX/SPX with some features removed (ok, maybe I am a little biased... :-)
 -- richard
Regarding firmware, most intelligent disk controllers (arrays) manage disk microcode. I like that idea, as it allows the device closest to the disk (and not the host) to manage the disks.

On Apr 7, 2006, at 4:42 PM, Richard Elling wrote:
> Since the disks aren't very sophisticated, and firmware is
> difficult to manage once released to the field, all hell can break
> loose.

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive ULVL4-382        greg.shaw at sun.com (work)
Louisville, CO 80028-4382        shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
On Fri, 2006-04-07 at 22:31, Torrey McMahon wrote:
> You're assuming that you have a bunch of Solaris boxes that are going to
> have dedicated storage arrays.

Indeed. That was essentially what the original question was. Let me rephrase it:

"Assuming I have some Sun servers and will be using zfs, what's the best way to buy the storage?"

It's not entirely a theoretical question - I have proposals to write involving just this scenario, hence the original question.

-- 
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
On Mon, 2006-04-03 at 23:32, James C. McPherson wrote:
> Hi Peter,
> In my experience of supporting customer SANs, and now writing the
> drivers to provide that connectivity, the complex SAN is the one
> which the customer has not planned beforehand. And I include the
> "planning for expansion" part there too. As to unreliable - what
> experiences have you had which make SANs misbehave?

Buying a Sun 6900 was - with hindsight - a mistake. (At the time I'm not sure there was much else - it would have been more expensive to direct attach T3s to each box [need more T3s] and the current 3xxx systems didn't exist.) But the switches would freeze, the VEs would randomly misbehave, GBICs would fail, cables go bad, T3 controllers would play up - and diagnosis was extremely difficult. Remember this was a canned solution - so we had very little access. (I think that if the place hadn't been closed down I would have probably ripped out the T3s and gone for direct attach.) Were we on the bleeding edge of adoption? Have all the switch lockup problems (I know other colleagues at the time had problems getting switches stable) been fixed?

The general point, though, is that a SAN has many more points of failure. You've got switches, the software on them, and the configuration of the SAN. (Networks and network switches are more mature, and I guess that VLANs are in some ways analogous to zones, which - given some experience - doesn't encourage me.)

My hope is that currently the components are more reliable than they were when I last tried this. Would that be a reasonable expectation?

-- 
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
On Fri, 2006-04-07 at 22:36, Torrey McMahon wrote:
> Bill Sommerfeld wrote:
> > By analogy to networking, consider the difference in complexity (and
> > cost) between an unmanaged ethernet switch and a caching proxy...
>
> To continue your analogy, you would have to buy every host its own
> unmanaged ethernet switch and convert all the hosts to speak IP when
> they don't.

Or maybe the difference between a huge SMP system and a rack of cheap blade servers? In the HPC space, commodity clusters (JBOD) have largely superseded large single machines (HW raid boxes). The key is the glue, usually MPI (zfs). The issues are similar - large clusters have to deal with fault isolation and manageability as well.

-- 
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Boy, it sounds like you've had some bad experience with switches. I don't know the timeframe for the below, but it sounds like ~1999. At that time, there were some bad lasers being produced that would burn out. Especially bad were the converters from copper FC to fiber FC (required by the T3). They failed regularly.

I've been running (brocade) FC switches of varying size (16-120 ports), and have had very few problems. I've currently got over 60 switches in production (1g-4g), and we've had 3 switches actually fail.

We designed everything with pairs of switches. One path goes to one switch, while the other goes to another switch. That has worked well when multi-pathing is available, and allows for an entire switch to fail, yet the system operation will continue.

In other words, our SANs have been reliable for years in production. In recent history, it has helped that newer switches support upgrading microcode without downtime.

On Apr 10, 2006, at 6:56 AM, Peter Tribble wrote:
> But the switches would freeze, the
> VEs would randomly misbehave, GBICs would fail, cables go bad,
> T3 controllers would play up - and diagnosis was extremely
> difficult.

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382        greg.shaw at sun.com (work)
Louisville, CO 80028-4382        shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I've Won." - Linus Torvalds
> The general point, though, is that a SAN has many more points
> of failure. You've got switches, the software on them, and the
> configuration of the SAN. (Networks and network switches are
> more mature, and I guess that VLANs are in some ways analogous
> to zones, which - given some experience - doesn't encourage me.)

"And if you're on a SAN, you're using a network designed by disk firmware writers. God help you." - Jeff Bonwick

http://blogs.sun.com/roller/page/bonwick?entry=zfs_end_to_end_data

Casper
On Mon, 10 Apr 2006, Peter Tribble wrote:
> The general point, though, is that a SAN has many more points
> of failure. You've got switches, the software on them, and the
> configuration of the SAN.
>
> My hope is that currently the components are more reliable
> than they were when I last tried this. Would that be a
> reasonable expectation?

A couple of comments from someone with no first-hand experience of the particular equipment you refer to.

One of the biggest gripes I have, and one that I constantly recommend that people (clients) consider, is that while some level of redundancy is good from a systems availability standpoint, *identicality* is not. Because if you use the same (FC) switches, with the same firmware, you experience the same bugs and the same failure modes. Whereas, if you use Switch A from vendor A and switch B from vendor B and you also use different GBICs[1] etc - you have redundancy without identical bugs and failure modes.

The same applies to networking equipment. Going all Cisco is an easy sell in the boardroom - but when a Cisco exploit hits the street, you'll wish you had not designed in identicality.

Regarding GBICs: I insisted on sparing at the time we deployed a SAN solution for a client and also insisted on Finisar GBICs. We never did use any of the spares (in 5 years).

One important design detail of FC is that the low-level (Layer 1) transport spec maintains a constant duty-cycle, regardless of the data being transported over the FC links. The intent is to keep the temperature of optical components[2] constant and prevent component temperature cycling, which often has a negative impact on electronic systems reliability. The FC wire protocol was contributed by IBM.

One more general point. Redundancy does not increase system complexity and manageability by x2 (times two); IMHO, it's more like x4. In most areas of computing, the human operators are the weak link - and they have a poor track record of being able to master highly complex technology. If your tech "owners" are unable to grok x2 complexity - don't burden them, and the down-stream user community, with x4 complexity by building fully redundant systems.

It's interesting to note that human failure modes, and how to minimize them, are well known in certain fields, but that knowledge has not been widely applied to high-reliability/high-availability computing systems. For example, it's a well-known fact that a 2 person flight crew, flying a modern aircraft, will exhibit dramatically better accident statistics than any single person crew flying equally equipped (modern) aircraft.

[1] There were some GBICs that were widely known to be "bad". Bad, as in subject to excessively high failure rates. I think that they were made by IBM ... but I'm not sure. They were sold/re-sold under various names.

[2] really all components that are transporting FC data.

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
Peter Tribble wrote:
> Were we on the bleeding edge of adoption? Have all the
> switch lockup problems (I know other colleagues at the time had
> problems getting switches stable) been fixed?

All of them? Hard to say. 6900 was quite the beast but I'd say SANs are much more reliable than they were six years ago when the 6900 was being sold. (Or was it seven? I can't remember that far back...)

> The general point, though, is that a SAN has many more points
> of failure. You've got switches, the software on them, and the
> configuration of the SAN.

I think it would be better to say there are as many points of failure but, in the past, SAN components were more likely to fail. I'd say the components are much less likely to fail now than they did 5+ years ago.

> My hope is that currently the components are more reliable
> than they were when I last tried this. Would that be a
> reasonable expectation?

I'd argue that, yes, they are much more reliable but I've no hard data to back that up.
Peter Tribble wrote:
> Indeed. That was essentially what the original question was.
> Let me rephrase it:
>
> "Assuming I have some Sun servers and will be using zfs,
> what's the best way to buy the storage?"

What are the i/o requirements for the apps running on the Sun boxes? Is this a purely Sun/Solaris environment? What's the growth potential in the next three years? How much redundancy do you want on the transport level? How many nines?

I'm not trying to be a pain in the arse - At least not in this forum :) - but a lot more data would be required before a good config could even be thought about.
> Well, you can send me some A5k - I would gladly take
> them, especially with disks :)

We've (we == the Lysator computer club at Linköping University) got four A5000s running just fine (equipped with 9GB and 18GB disks) on an Ultra2. But it was *HELL* getting the system to work in a stable way. Finally had to ditch the Sun FC Sbus controllers and use third party ones from JNI, and we also had to forget about using the whole pack of 36GB Seagate Cheetahs we had got for dirt cheap... Or else we'd get disks spinning up/down randomly, and FC errors would fill the log files endlessly. The Cheetahs work just fine in standalone machines (like a Sun Blade 1000) but they just won't work in the A5000 :-(

That system has been running perfectly since then, serving the HOME directories (a whopping 100GB of mirrored storage :-).

Now, we just recently got two A3500FC systems (fully redundant), with 120 18GB disks, and thus we figured it'd be a nice upgrade from (more space at least :-) the old A5000 solution, so I've started looking into how to configure and connect them...

Installed Solaris Nevada on an Ultra 30 and hacked Raid Manager (I know, I know - it's not supported after Solaris 9 - who cares? :-) to work so I could configure the controllers, and the old configured LUNs showed up just fine (13 RAID5 groups on each A3500FC system). Now since I was planning on using ZFS with these systems I figured I'd try to reconfigure one of them to be a more JBOD-like system - erased all the old LUN/RAID5 groups and wrote a script to create an individual LUN for each drive - would be 60 LUNs per A3500. Started the script and it created LUNs 0-15 just fine, then created LUN 16 and then it started failing... Apparently Solaris refused to create the device node for LUN 16 and then Raid Manager got seriously confused and just gave up. Doh! So I figured I'd remove the last LUN (configured on the controller just fine, it was just that Solaris wouldn't see it) and a truss on "raidutil" gave that it was silently giving up since it couldn't see the /dev/osa/dev/rdsk/c1t4d15s0 device file - so I created a dummy one (a link to c1t4d14s0) and gave the command "raidutil -c c1t4d0 -D 16"... And then the A3500 controller crashed (it doesn't show up on the FC bus anymore at least, and I'm currently 30km away and can't check on it).

*SIGH*

(Yeah yeah, I know... Just venting some frustration... The A3500 is known to be crappy hardware but we got them for free :-)
On Fri, 14 Apr 2006, Peter Eriksson wrote:

> Started the script and it created LUNs 0-15 just fine, then created
> LUN 16 and then it started failing... Apparently Solaris refused to
> create the device node for LUN 16 and then Raid Manager got seriously
> confused and just gave up.

Ensure that the LUNs you need are defined in sd.conf.

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
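[For anyone hitting the same wall: the point about sd.conf is that the sd driver only creates nodes for the LUNs listed in /kernel/drv/sd.conf, and the stock file typically covers only low LUN numbers. A hedged sketch of the kind of entries that would need adding - the target and lun numbers here are examples only and depend on the actual configuration:

    name="sd" class="scsi" target=4 lun=16;
    name="sd" class="scsi" target=4 lun=17;
    name="sd" class="scsi" target=4 lun=18;

followed by a reconfiguration boot so the new device nodes get created.]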
> If somebody can
> point me to a nice 8+ drive rack mount enclosure for
> disks with SATA
> interface, I'd be super appreciative! :P

promise vtrak j300s

adaptec supposedly has one also, but you can't get docs online, and i had one on order for 2 months before cancelling it.
> The same applies to networking equipment. Going all Cisco is an easy sell
> in the boardroom - but when a Cisco exploit hits the street, you'll wish
> you had not designed in identicality.

That's not generally true. If you use two vendors, you're not 50% isolated from one of their bugs, you're 200% exposed.

-frank
> You're assuming that you have a bunch of Solaris boxes that are going to
> have dedicated storage arrays. Most datacenters are multiplatform and
> don't like dedicated storage arrays on each and every box. HW raid
> arrays let you consolidate storage across many hosts and OS.

So does SAS JBOD.

> Also, when people talk about HW raid storage arrays you usually get
> other benefits: Cache for coalescing writes, remote replication, ability
> to split mirrors, higher redundancy, etc. JBODs don't play in that space.

I think zfs will put JBODs into that space.

-frank
Frank Cusack wrote:
>> You're assuming that you have a bunch of Solaris boxes that are going to
>> have dedicated storage arrays. Most datacenters are multiplatform and
>> don't like dedicated storage arrays on each and every box. HW raid
>> arrays let you consolidate storage across many hosts and OS.
>
> So does SAS JBOD.

How can you share storage from a "bunch of disks" to multiple hosts? Outside of connecting it to all of the hosts at the same time and making sure one system doesn't grab more than the disk you've allocated to it via mental process?
Frank Cusack wrote:
> Use a SAS switch (analogue of SAN switch). eg
> <http://pmc-sierra.com/products/details/pm8398/>
>
> Granted, you can only divvy up storage by entire disk, as opposed to
> arbitrarily-sized LUNs.

First, that looks to be a chip solution for an array product and not a switch to let you take a jbod and allocate storage in front of it.

Second, allocating an entire disk to a host - when they'll be at 1TB in size in a short time - doesn't help when a host might simply need 100GB. Allocate two for a mirror and you're over-allocating by 20X the required space.
On Tue, 2006-04-18 at 18:25 -0700, Frank Cusack wrote:
> That's not generally true. If you use two vendors, you're not
> 50% isolated from one of their bugs, you're 200% exposed.

I'm not sure I follow this, but if you're saying what I think you're saying, you are both 50% isolated and 200% exposed, and cannot be anything else. But I don't see what percentages have to do with this.

When we do a RAS analysis, we wouldn't look at it that way. Each fault has a probability and an effect. If you have two identical things, then they have the same probability of being affected by each fault. If you have two different things, then they have different (perhaps completely different) probabilities of being affected by any given fault. So yes, you are more exposed because there are more fault opportunities. But it is also less likely that any single fault will bring both down. For safety-critical systems, I'll go for the diversity.
 -- richard
[splitting hairs here...]

On Tue, 2006-04-18 at 21:37 -0700, Frank Cusack wrote:
> You have to do an analysis specific to the deployment at hand. You can''t just outright say, diversity is good.

Diversity is good. It also costs real money, which is your point. Fast, reliable, inexpensive: pick one.
 --richard
On Tue, 18 Apr 2006, Richard Elling wrote:
> [splitting hairs here...]
>
> On Tue, 2006-04-18 at 21:37 -0700, Frank Cusack wrote:
> > You have to do an analysis specific to the deployment at hand. You can''t just outright say, diversity is good.
>
> Diversity is good. It also costs real money, which is your point.
> Fast, reliable, inexpensive: pick one.
                                    ^^^
Correction: pick two.

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
Frank Cusack wrote:
> On April 18, 2006 11:11:09 PM -0400 Torrey McMahon <Torrey.McMahon at Sun.COM> wrote:
>> Frank Cusack wrote:
>>> On April 18, 2006 10:24:01 PM -0400 Torrey McMahon <Torrey.McMahon at Sun.COM> wrote:
>>>> Frank Cusack wrote:
>>>>>> You''re assuming that you have a bunch of Solaris boxes that are going to have dedicated storage arrays. Most datacenters are multiplatform and don''t like dedicated storage arrays on each and every box. HW raid arrays let you consolidate storage across many hosts and OS.
>>>>>
>>>>> So does SAS JBOD.
>>>>
>>>> How can you share storage from a "bunch of disks" to multiple hosts? Outside of connecting it to all of the hosts at the same time and making sure one system doesn''t grab more than the disk you''ve allocated to it via mental process?
>>>
>>> Use a SAS switch (analogue of SAN switch). eg <http://pmc-sierra.com/products/details/pm8398/>
>>>
>>> Granted, you can only divvy up storage by entire disk, as opposed to arbitrarily-sized LUNs.
>>
>> First, that looks to be a chip solution for an array product and not a switch to let you take a jbod and allocate storage in front of it.
>
> Yes. But LSI just demo''d an actual switch, and probably in a year''s time there will be multiple products on the market.

...which adds how much to the cost of the JBOD? (Hardware, support, training, etc.) How much for a hw raid array that lets you do lun masking, carve out space, do lun expansion, etc.?

>> Second, allocating an entire disk to a host - When they''ll be at 1TB in size in a short time - doesn''t help when a host might simply need 100GB. Allocate two for mirror and you''re over allocating by 20X the required space.
>
> It seems to me that hosts which require only 100GB don''t really participate in today''s SANs. (But just a guess, really.) I''d expect most disk allocations to be multi-disk, not sub-disk. But why buy 1TB disks to give out 100GB chunks? I wouldn''t pay the SAN/FC $$$ premium to do it. It''s going to be cheaper to buy smaller SAS disk than to part out parts of more expensive (per GB) FC-attached disk.

In some cases we see customers taking a single HW raid array and carving up hundreds of LUNs of small size for multiple systems. Not as often as larger datasets but I got a call about one such setup the other week.

My point here is that in the near future you will only be able to buy 1TB, or some other large size, drives. Drive density has been growing at an exponential rate the past 10 years. Anyone recall those 2GB drives that we used to fill SSAs with? SATA drives are up to 500GB today...in case you didn''t notice. Do you think SAS drives are going to stay at 37GB for long? Not a chance. They''ll ramp just as fast, if not faster, than their FC cousins. (147 is out now. I''m pretty sure ~250 is in the works.) At some point you''ll need to split the disk drives in a logical manner to avoid gross over allocation, yet maintaining performance requirements for the majority of your hosts. Or just place all of the storage behind a NAS box. Take your pick. ;)
On Wed, Apr 19, 2006 at 01:31:43AM -0400, Torrey McMahon wrote:
> In some cases we see customers taking a single HW raid array and carving up hundreds of LUNs of small size for multiple systems. Not as often as larger datasets but I got a call about one such setup the other week.

This will change, surely. Partly because this way lies madness, partly because ZFS rocks.

> My point here is that in the near future you will only be able to buy 1TB, or some other large size, drives. Drive density has been growing at an exponential rate the past 10 years. Anyone recall those 2GB drives that we used to fill SSAs with? SATA drives are up to 500GB today...in

I remember 10MB hard drives, FWIW. I also remember that no matter how large the drives get there''s always stuff to fill them with. 100GB HW RAID seems pitiful now...

> case you didn''t notice. Do you think SAS drives are going to stay at 37GB for long? Not a chance. They''ll ramp just as fast, if not faster, than their FC cousins. (147 is out now. I''m pretty sure ~250 is in the works.) At some point you''ll need to split the disk drives in a logical manner to avoid gross over allocation, yet maintaining performance requirements for the majority of your hosts. Or just place all of the storage behind a NAS box. Take your pick. ;)

I expect the latter, as you seem to also, because ZFS rocks :)

I expect some database applications will just use huge logical devices without volume management, even if it''d be better to use volume management.

But small allocations have got to go.

I can imagine, say, Solaris as iSCSI servers serving ZFS files as raw devices where small allocations are needed but NAS is, for whatever reason, not applicable. Anything but manage multitudes of LUNs.

Nico
 --
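A rough sketch of how small allocations could be carved out of a single pool on such a server, using ZFS emulated volumes (zvols). The pool layout, dataset names and sizes below are invented, and actually exporting the volume over iSCSI would still require a separate iSCSI target on the host; this only shows the ZFS side:

  # one pool built from whole disks (hypothetical device names)
  zpool create tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0

  # a container dataset, then a 100GB "LUN" for one client host
  zfs create tank/vol
  zfs create -V 100g tank/vol/hostA

  # the block device shows up under /dev/zvol, ready to be exported
  ls -l /dev/zvol/rdsk/tank/vol/hostA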
> I don''t think the integrated raid controller in the sun storage array is worth that kind of pricing. That''s just me though. :) If somebody can point me to a nice 8+ drive rack mount enclosure for disks with SATA interface, I''d be super appreciative! :P

This winter I built a nice rsync server running Solaris 10 (not ZFS yet though, but that is definitely coming) from the following parts:

  Motherboard: SuperMicro X6DHE-XG2
  CPU: 2x Intel Xeon 2.8GHz
  RAM: 2GB
  Disks: 14x 400GB SATA 7200rpm
  Rack mount case: SuperMicro SC933T-R760
  Disk controller: Adaptec S21610SA (16port SATA RAID)

That rack mount case supports 15 1" SATA disks and has a triple-redundant power supply, lots of fans and stuff. Link: http://www.supermicro.com/products/chassis/3U/933/SC933T-R760.cfm

I don''t use the RAID capability of that Adaptec card though so if I was going to build a similar server today I would probably go for your suggested SATA controller instead (probably cheaper :-).

This message posted from opensolaris.org
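For what a setup like that might look like once ZFS lands on it, here is a minimal sketch; the controller and target numbers are made up, and splitting the 14 data drives into two raidz groups is just one reasonable choice:

  # two 7-disk raidz vdevs in a single pool
  zpool create tank \
      raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 \
      raidz c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 c1t12d0 c1t13d0

  zpool status tank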
On Apr 19, 2006, at 12:17 AM, Nicolas Williams wrote:
> On Wed, Apr 19, 2006 at 01:31:43AM -0400, Torrey McMahon wrote:
>> In some cases we see customers taking a single HW raid array and carving up hundreds of LUNs of small size for multiple systems. Not as often as larger datasets but I got a call about one such setup the other week.
>
> This will change, surely. Partly because this way lies madness, partly because ZFS rocks.

Why would this change? When you buy disk resources, you want to use them as effectively as possible. It''s dangerous to have a large number of systems on a large HW array because of SPOF and performance concerns, but it''s entirely reasonable. It''s far better than local storage.

>> My point here is that in the near future you will only be able to buy 1TB, or some other large size, drives. Drive density has been growing at an exponential rate the past 10 years. Anyone recall those 2GB drives that we used to fill SSAs with? SATA drives are up to 500GB today...in
>
> I remember 10MB hard drives, FWIW. I also remember that no matter how large the drives get there''s always stuff to fill them with. 100GB HW RAID seems pitiful now...

True. Arrays today are measured in the 10''s of TBs.

>> case you didn''t notice. Do you think SAS drives are going to stay at 37GB for long? Not a chance. They''ll ramp just as fast, if not faster, than their FC cousins. (147 is out now. I''m pretty sure ~250 is in the works.) At some point you''ll need to split the disk drives in a logical manner to avoid gross over allocation, yet maintaining performance requirements for the majority of your hosts. Or just place all of the storage behind a NAS box. Take your pick. ;)
>
> I expect the latter, as you seem to also, because ZFS rocks :)
>
> I expect some database applications will just use huge logical devices without volume management, even if it''d be better to use volume management.
>
> But small allocations have got to go.
>
> I can imagine, say, Solaris as iSCSI servers serving ZFS files as raw devices where small allocations are needed but NAS is, for whatever reason, not applicable. Anything but manage multitudes of LUNs.
>
> Nico

One thing bothers me about NAS and iSCSI. What''s the max performance? On modern arrays and tape drives, the arrays can drive at several thousand i/o''s per second, and 200+mb/sec. How can a nas box come even remotely close to that? Single threaded gig-e using NFS maxes out at around 45MB/second. That''s around 20% of what a local array can do. I don''t get the focus on NAS when local disk performance is far better.

On an unrelated note, the problem I see with big drives is how to back up those drives in a reasonable amount of time. Drive spindles aren''t getting faster, which means that as drives get bigger, the amount of time to back them up is linear.

-----
Gregory Shaw, IT Architect
Phone: (303) 673-8273        Fax: (303) 673-8273
ITCTO Group, Sun Microsystems Inc.
1 StorageTek Drive MS 4382          greg.shaw at sun.com (work)
Louisville, CO 80028-4382           shaw at fmsoft.com (home)
"When Microsoft writes an application for Linux, I''ve Won." - Linus Torvalds
> [splitting hairs here...]
>
> On Tue, 2006-04-18 at 21:37 -0700, Frank Cusack wrote:
> > You have to do an analysis specific to the deployment at hand. You can''t just outright say, diversity is good.
>
> Diversity is good. It also costs real money, which is your point.

Actually, my original point was that vendor diversity doesn''t necessarily insulate you from the problems of one of those vendors and may in fact increase your problems. I''ve certainly never been in a situation where having a second network vendor saved me from the problems of the first.

There are, absolutely, reasons to use multiple vendors. Security exposure is probably not one of them.

-frank

This message posted from opensolaris.org
> There are, absolutely, reasons to use multiple vendors. Security exposure is probably not one of them.

The DOD supposedly uses firewall complexes consisting of three layers, each using a different vendor.

(It works when they''re in series, not when they''re in parallel)

Casper
Gregory Shaw wrote:
>
> On Apr 19, 2006, at 12:17 AM, Nicolas Williams wrote:
>
>> On Wed, Apr 19, 2006 at 01:31:43AM -0400, Torrey McMahon wrote:
>>> In some cases we see customers taking a single HW raid array and carving up hundreds of LUNs of small size for multiple systems. Not as often as larger datasets but I got a call about one such setup the other week.
>>
>> This will change, surely. Partly because this way lies madness, partly because ZFS rocks.
>
> Why would this change? When you buy disk resources, you want to use them as effectively as possible. It''s dangerous to have a large number of systems on a large HW array because of SPOF and performance concerns, but it''s entirely reasonable.
>
> It''s far better than local storage.

Right. Datasets are growing, but in a lot of cases there are apps that still only need 100GB to get their work done. (Think fast temp space for a grid.) It could be argued that certain ZFS features, like snapshots, could cause the overall space requirements to increase, but even then you will find systems that don''t require TBs of storage. They might need high-performing, reliable storage...but not lots and lots of TBs.

Of course, the issue isn''t with one system, or even five, but hundreds. Datacenter math is always more interesting. :)
On Wed, 2006-04-19 at 01:17 -0500, Nicolas Williams wrote:
> > case you didn''t notice. Do you think SAS drives are going to stay at 37GB for long? Not a chance. They''ll ramp just as fast, if not faster, than their FC cousins. (147 is out now. I''m pretty sure ~250 is in the works.)

This week Seagate announced 300 GByte, 15k rpm SAS (enterprise-style) drives.

> But small allocations have got to go.

Small blocks, too. Today 512 byte blocks are common. Going forward, this won''t work and the consensus seems to be falling to 4kByte blocks. We''ve made great strides in the past few years getting to large memory pages in the kernel; I expect the same exercise in the disk drives.
 -- richard
> On an unrelated note, the problem I see with big drives is how to back up those drives in a reasonable amount of time. Drive spindles aren''t getting faster, which means that as drives get bigger, the amount of time to back them up is linear.

So you back them up to more disk. And you don''t do it all at once. This is where incremental snapshots and zfs send/receive can replicate a snapshot remotely.

Henk
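A minimal sketch of that style of replication; the pool, dataset and host names here are invented:

  # initial full copy of a snapshot to another machine
  zfs snapshot tank/data@sun
  zfs send tank/data@sun | ssh backuphost zfs receive backup/data

  # later, send only what changed between two snapshots
  zfs snapshot tank/data@mon
  zfs send -i tank/data@sun tank/data@mon | ssh backuphost zfs receive backup/data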
Gregory Shaw wrote:
>
> One thing bothers me about NAS and iSCSI. What''s the max performance? On modern arrays and tape drives, the arrays can drive at several thousand i/o''s per second, and 200+mb/sec. How can a nas box come even remotely close to that? Single threaded gig-e using NFS maxes out at around 45MB/second. That''s around 20% of what a local array can do.

What stack are you using to test this? NFSv4 on Solaris 10 gig/E should be much much much faster than 45 MB/s.

> On an unrelated note, the problem I see with big drives is how to back up those drives in a reasonable amount of time. Drive spindles aren''t getting faster, which means that as drives get bigger, the amount of time to back them up is linear.

Traditional backup methods are running out of steam but software like SAM/FS helps ... but we''re probably veering off topic.
On Wed, Apr 19, 2006 at 08:29:15AM -0600, Gregory Shaw wrote:
> On Apr 19, 2006, at 12:17 AM, Nicolas Williams wrote:
>
>> On Wed, Apr 19, 2006 at 01:31:43AM -0400, Torrey McMahon wrote:
>>> In some cases we see customers taking a single HW raid array and carving up hundreds of LUNs of small size for multiple systems. Not as often as larger datasets but I got a call about one such setup the other week.
>>
>> This will change, surely. Partly because this way lies madness, partly because ZFS rocks.
>
> Why would this change? When you buy disk resources, you want to use them as effectively as possible. It''s dangerous to have a large number of systems on a large HW array because of SPOF and performance concerns, but it''s entirely reasonable.

Not dangerous as much as difficult to manage.

> It''s far better than local storage.

Yes, but then NAS would probably be fine for any apps with small storage needs. Bring NAS into the picture and you bring ZFS into the picture, with volume management, quotas, snapshots and all that.

> I don''t get the focus on NAS when local disk performance is far better.

Managing local storage is a pain. NAS is much easier. I''ll let someone who has numbers respond to the NAS performance comment.

> On an unrelated note, the problem I see with big drives is how to back up those drives in a reasonable amount of time. Drive spindles aren''t getting faster, which means that as drives get bigger, the amount of time to back them up is linear.

Redundancy (RAID-Z, mirroring, replication) + snapshots and/or backup to disk + infrequent backup to tape is one answer.

Backup/restore has long been a problem, and local storage makes the problem worse, not better.

Nico
 --
On Wed, Torrey McMahon wrote:
> Gregory Shaw wrote:
>>
>> One thing bothers me about NAS and iSCSI. What''s the max performance? On modern arrays and tape drives, the arrays can drive at several thousand i/o''s per second, and 200+mb/sec. How can a nas box come even remotely close to that? Single threaded gig-e using NFS maxes out at around 45MB/second. That''s around 20% of what a local array can do.

I can regularly get 100-110MB/second on gig-e. It has to be fast enough processors on the client and server but it can be done out of the box.

Spencer
On Wed, 19 Apr 2006, Casper.Dik at Sun.COM wrote:
> The DOD supposedly uses firewall complexes consisting of three layers, each using a different vendor.
>
> (It works when they''re in series, not when they''re in parallel)

Avionics systems also use triple redundancy, with different implementations of the same design spec for each instance* to avoid the common failure mode problem described by Al earlier.

* Or at least they did when I last worked on military stuff, a few years ago.

--
Rich Teer, SCNA, SCSA, OpenSolaris CAB member
President, Rite Online Inc.
Voice: +1 (250) 979-1638
URL: http://www.rite-group.com/rich
On Apr 19, 2006, at 3:16 PM, Nicolas Williams wrote:> On Wed, Apr 19, 2006 at 08:29:15AM -0600, Gregory Shaw wrote: >> >> On Apr 19, 2006, at 12:17 AM, Nicolas Williams wrote: >> >>> On Wed, Apr 19, 2006 at 01:31:43AM -0400, Torrey McMahon wrote: >>>> In some cases we see customers taking a single HW raid array and >>>> carving >>>> up hundreds of LUNs of small size for multiple systems. Not as >>>> often as >>>> larger datasets but I got a call about one such setup the other >>>> week. >>> >>> This will change, surely. Partly because this way lies madness, >>> partly >>> because ZFS rocks. >> >> Why would this change? When you buy disk resources, you want to use >> them as effectively as possible. It''s dangerous to have a large >> number of systems on a large HW array because of SPOF and performance >> concerns, but it''s entirely reasonable. > > Not dangerous as much as difficult to manage. >Agreed. By dangerous, I meant lots of systems going to a single array. It all goes away together...>> It''s far better than local storage. > > Yes, but then NAS would probably be fine for any apps with small > storage > needs. Bring NAS into the picture and you bring ZFS into the picture, > with volume management, quotas, snapshots and all that. >Perhaps. I think the local storage on most hosts today are sufficient for small storage needs.>> I don''t get the focus on NAS when local disk performance is far >> better. > > Managing local storage is a pain. NAS is much easier. I''ll let > someone > who has numbers respond to the NAS performance comment. >Managing NAS instead of SAN seems to be the same to me, if not worse.>> On an unrelated note, the problem I see with big drives is how to >> back up those drives in a reasonable amount of time. Drives spindles >> aren''t getting faster, which means that as drives get bigger, the >> amount of time to back them up is linear. > > Redundancy (RAID-Z, mirroring, replication) + snapshots and/or > backup to > disk + infrequent backup to tape is one answer. >Even with snapshots, you''ve got to back it up to tape. That involves the same disks, so it helps in an application sense (no downtime due to snapshot), but it doesn''t impact the need to back everything up.> Backup/restore has long been a problem, and local storage makes the > problem worse, not better. >When you say NAS, do you mean appliances, or servers? It changes the picture significantly between the two.> Nico > ------- Gregory Shaw, IT Architect Phone: (303) 673-8273 Fax: (303) 673-8273 ITCTO Group, Sun Microsystems Inc. 1 StorageTek Drive MS 4382 greg.shaw at sun.com (work) Louisville, CO 80028-4382 shaw at fmsoft.com (home) "When Microsoft writes an application for Linux, I''ve Won." - Linus Torvalds
Is that NAS or iSCSI? On Apr 19, 2006, at 3:36 PM, Spencer Shepler wrote:> On Wed, Torrey McMahon wrote: >> Gregory Shaw wrote: >>> >>> One thing bothers me about NAS and iSCSI. What''s the max >>> performance? On modern arrays and tape drives, the arrays can drive >>> at several thousand i/o''s per second, and 200+mb/sec. How can a nas >>> box come even remotely close to that? Single threaded gig-e using >>> NFS maxes out at around 45MB/second. That''s around 20% of what a >>> local array can do. > > I can regularly get 100-110MB/second on gig-e. It has to be fast > enough processors on the client and server but it can be done > out of the box. > > Spencer----- Gregory Shaw, IT Architect Phone: (303) 673-8273 Fax: (303) 673-8273 ITCTO Group, Sun Microsystems Inc. 1 StorageTek Drive MS 4382 greg.shaw at sun.com (work) Louisville, CO 80028-4382 shaw at fmsoft.com (home) "When Microsoft writes an application for Linux, I''ve Won." - Linus Torvalds
NAS - NFSv3 or NFSv4... On Wed, Gregory Shaw wrote:> Is that NAS or iSCSI? > > On Apr 19, 2006, at 3:36 PM, Spencer Shepler wrote: > > >On Wed, Torrey McMahon wrote: > >>Gregory Shaw wrote: > >>> > >>>One thing bothers me about NAS and iSCSI. What''s the max > >>>performance? On modern arrays and tape drives, the arrays can drive > >>>at several thousand i/o''s per second, and 200+mb/sec. How can a nas > >>>box come even remotely close to that? Single threaded gig-e using > >>>NFS maxes out at around 45MB/second. That''s around 20% of what a > >>>local array can do. > > > >I can regularly get 100-110MB/second on gig-e. It has to be fast > >enough processors on the client and server but it can be done > >out of the box. > > > >Spencer > > ----- > Gregory Shaw, IT Architect > Phone: (303) 673-8273 Fax: (303) 673-8273 > ITCTO Group, Sun Microsystems Inc. > 1 StorageTek Drive MS 4382 greg.shaw at sun.com (work) > Louisville, CO 80028-4382 shaw at fmsoft.com (home) > "When Microsoft writes an application for Linux, I''ve Won." - Linus > Torvalds > >
In my testing, I''ve found that a single task can write no faster than 45mb/sec on nfsv3. I don''t know if v4 is faster -- I don''t have the infrastructure for that at this time. This was on a bluearc titan NAS fileserver, which is capable of far beyond gig-e throughput. When you start running multiple threads, it''s possible get better throughput, but I tend to think in single processes, like a batch job. If it can''t write faster than 45mb/sec, it doesn''t matter what else may be occurring on the same system -- it''s limited by the single process throughput. What were you using for testing? On Apr 20, 2006, at 9:08 AM, Spencer Shepler wrote:> > NAS - NFSv3 or NFSv4... > > On Wed, Gregory Shaw wrote: >> Is that NAS or iSCSI? >> >> On Apr 19, 2006, at 3:36 PM, Spencer Shepler wrote: >> >>> On Wed, Torrey McMahon wrote: >>>> Gregory Shaw wrote: >>>>> >>>>> One thing bothers me about NAS and iSCSI. What''s the max >>>>> performance? On modern arrays and tape drives, the arrays can >>>>> drive >>>>> at several thousand i/o''s per second, and 200+mb/sec. How can >>>>> a nas >>>>> box come even remotely close to that? Single threaded gig-e >>>>> using >>>>> NFS maxes out at around 45MB/second. That''s around 20% of what a >>>>> local array can do. >>> >>> I can regularly get 100-110MB/second on gig-e. It has to be fast >>> enough processors on the client and server but it can be done >>> out of the box. >>> >>> Spencer >> >> ----- >> Gregory Shaw, IT Architect >> Phone: (303) 673-8273 Fax: (303) 673-8273 >> ITCTO Group, Sun Microsystems Inc. >> 1 StorageTek Drive MS 4382 greg.shaw at sun.com (work) >> Louisville, CO 80028-4382 shaw at fmsoft.com (home) >> "When Microsoft writes an application for Linux, I''ve Won." - Linus >> Torvalds >> >> > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss----- Gregory Shaw, IT Architect Phone: (303) 673-8273 Fax: (303) 673-8273 ITCTO Group, Sun Microsystems Inc. 1 StorageTek Drive ULVL4-382 greg.shaw at sun.com (work) Louisville, CO 80028-4382 shaw at fmsoft.com (home) "When Microsoft writes an application for Linux, I''ve Won." - Linus Torvalds
Roch Bourbonnais - Performance Engineering
2006-Apr-20 16:40 UTC
[zfs-discuss] Re: Sun JBOD setup
Gregory Shaw writes: > In my testing, I''ve found that a single task can write no faster than > 45mb/sec on nfsv3. I don''t know if v4 is faster -- I don''t have the > infrastructure for that at this time. This was on a bluearc titan > NAS fileserver, which is capable of far beyond gig-e throughput. > > When you start running multiple threads, it''s possible get better > throughput, but I tend to think in single processes, like a batch > job. If it can''t write faster than 45mb/sec, it doesn''t matter what > else may be occurring on the same system -- it''s limited by the > single process throughput. Strange, was the problem investigated ? Something certainly went wrong. -r
On Thu, Gregory Shaw wrote:
> In my testing, I''ve found that a single task can write no faster than 45mb/sec on nfsv3. I don''t know if v4 is faster -- I don''t have the infrastructure for that at this time. This was on a bluearc titan NAS fileserver, which is capable of far beyond gig-e throughput.
>
> When you start running multiple threads, it''s possible get better throughput, but I tend to think in single processes, like a batch job. If it can''t write faster than 45mb/sec, it doesn''t matter what else may be occurring on the same system -- it''s limited by the single process throughput.
>
> What were you using for testing?

Ah, I have to describe my treachery. :-)

So, the clients and servers were 2-way opteron boxes and using tmpfs on the server for the filesystem. I used dd on the client to generate the i/o. The tmpfs usage was just to make it convenient to remove the issues of filesystem tuning. But I have seen a written report describing a Solaris server with local/FC attached nvram cached storage that demonstrated the same type of throughput. Anyway.

As you have noted, the issue is queueing. If the client can effectively queue i/o to the server at an appropriate queue depth, then most servers at this date can generate a very good overall effective throughput. The main problem with NFS implementations is that they are not as effective as they could be at queueing requests. To overcome the single threaded nature of an application, most NFS clients will use async or helper threads in the kernel to drive up the queue depth of i/o at the server.

So, my numbers were an attempt to demonstrate that NFS is capable of reasonable throughput.

Spencer
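For reference, the kind of single-stream test being described is easy to reproduce; a sketch, with an arbitrary server path, mount point and transfer size:

  # on the client: write 1GB to an NFS mount and time it
  mount -F nfs server:/export/test /mnt
  time dd if=/dev/zero of=/mnt/ddtest bs=1024k count=1024

  # read it back (beware of client-side caching on a second pass)
  time dd if=/mnt/ddtest of=/dev/null bs=1024k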
I wasn''t able to find a better throughput solution. I''ll have to recreate the test, as if I encountered it, I have to think that others encounter it regularly. I did some serious research on this about a year and a half ago. You can get better with some tuning (water marks, jumbo frames, etc.), but it generally drops to 45MB/sec on a single threaded process. Of course, the hardware in question has a huge impact. Faster servers and more intelligent gig-e cards (that don''t drown the CPU in interrupts) make a big difference. On Apr 20, 2006, at 10:40 AM, Roch Bourbonnais - Performance Engineering wrote:> > Gregory Shaw writes: >> In my testing, I''ve found that a single task can write no faster than >> 45mb/sec on nfsv3. I don''t know if v4 is faster -- I don''t have the >> infrastructure for that at this time. This was on a bluearc titan >> NAS fileserver, which is capable of far beyond gig-e throughput. >> >> When you start running multiple threads, it''s possible get better >> throughput, but I tend to think in single processes, like a batch >> job. If it can''t write faster than 45mb/sec, it doesn''t matter what >> else may be occurring on the same system -- it''s limited by the >> single process throughput. > > > Strange, was the problem investigated ? > Something certainly went wrong. > > -r >----- Gregory Shaw, IT Architect Phone: (303) 673-8273 Fax: (303) 673-8273 ITCTO Group, Sun Microsystems Inc. 1 StorageTek Drive ULVL4-382 greg.shaw at sun.com (work) Louisville, CO 80028-4382 shaw at fmsoft.com (home) "When Microsoft writes an application for Linux, I''ve Won." - Linus Torvalds
On Thu, 2006-04-20 at 10:51 -0600, Gregory Shaw wrote:
> I did some serious research on this about a year and a half ago. You can get better with some tuning (water marks, jumbo frames, etc.), but it generally drops to 45MB/sec on a single threaded process.

45 MBytes/s is suspiciously close to the typical media speed for a single disk. Disks are slower than GbE.
 -- richard
On Thu, 2006-04-20 at 17:40, Spencer Shepler wrote:
> Ah, I have to describe my treachery. :-)
>
> So, the clients and servers were 2-way opteron boxes and using tmpfs on the server for the filesystem.
>
> I used dd on the client to generate the i/o. The tmpfs usage was just to make it convenient to remove the issues of filesystem tuning. But I have seen a written report describing a Solaris server with local/FC attached nvram cached storage that demonstrated the same type of throughput. Anyway.

Hm. 45 still seems low. I could get that for single-threaded reads (from disk, at that) off an E250 5 years ago. Writes then were limited to just over 30, because that''s all you can push into a mirrored A3500.

A couple of years ago I was getting 70MB/s single-threaded writes onto a V240 with an attached SE3310. That was CPU-bound - it would have been interesting to try with Solaris 10, as that''s clearly better. (Although I''m not sure the SE3310 could soak the data up much faster.)

With Solaris 10 what we noticed was that CPU utilization for a given level of network traffic (NFS, specifically) dropped enormously. Essentially, we were down to 1GHz to saturate a gigE network. And in every case I remember, the transfer was then limited by the disk system (or the filesystem).

--
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Nicolas Williams wrote:
> On Wed, Apr 19, 2006 at 08:29:15AM -0600, Gregory Shaw wrote:
>
>> I don''t get the focus on NAS when local disk performance is far better.
>
> Managing local storage is a pain. NAS is much easier. I''ll let someone who has numbers respond to the NAS performance comment.

Managing storage when you''re a NAS client is much easier. Guess what you end up seeing behind a lot of the NAS heads these days? A SAN. :-)
On Thu, Apr 20, 2006 at 05:23:16PM -0400, Torrey McMahon wrote:
> Nicolas Williams wrote:
> > Managing local storage is a pain. NAS is much easier. I''ll let someone who has numbers respond to the NAS performance comment.
>
> Managing storage when you''re a NAS client is much easier. Guess what you end up seeing behind a lot of the NAS heads these days? A SAN. :-)

But the converse, that managing storage when you''re the NAS is harder, is not true. With a NAS server you can have one big volume with whole disks -- no LUNs to maintain -- and use filesystem quotas to manage storage allocations. I.e., storage allocation and storage devices are decoupled.
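In ZFS terms that decoupling is just per-dataset quotas and reservations on top of one pool; a small sketch with invented names:

  # all the disks sit in one pool; allocation is handled per dataset
  zfs create tank/home
  zfs create tank/home/alice
  zfs set quota=100g tank/home/alice          # cap what this user can consume
  zfs set reservation=20g tank/home/alice     # guarantee a minimum

  zfs get quota,reservation tank/home/alice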
Gregory Shaw wrote:
>
> On Apr 19, 2006, at 3:16 PM, Nicolas Williams wrote:
>
>> Redundancy (RAID-Z, mirroring, replication) + snapshots and/or backup to disk + infrequent backup to tape is one answer.
>
> Even with snapshots, you''ve got to back it up to tape. That involves the same disks, so it helps in an application sense (no downtime due to snapshot), but it doesn''t impact the need to back everything up.

Do you? Maybe you can just take the snapshots? Maybe you just keep them on disk someplace? Maybe you really don''t have to archive as much as you need to?

One of my favorite past times is getting the CFO and legal types to answer the "What are the data retention requirements?" instead of the CIO types. Ask most IT departments and they''ll say, "Level 0 every month, clone tapes off site, incremental every day." The CFO/Legal folks always have better answers that meet the real business requirements. (Well...not always but I think you get my meaning.)

>> Backup/restore has long been a problem, and local storage makes the problem worse, not better.
>
> When you say NAS, do you mean appliances, or servers? It changes the picture significantly between the two.

What''s the difference?
Funny, that''s what ZFS does as well. On Apr 20, 2006, at 3:25 PM, Nicolas Williams wrote:> On Thu, Apr 20, 2006 at 05:23:16PM -0400, Torrey McMahon wrote: >> Nicolas Williams wrote: >>> Managing local storage is a pain. NAS is much easier. I''ll let >>> someone >>> who has numbers respond to the NAS performance comment. >> >> Managing storage when you''re a NAS client is much easier. Guess >> what you >> end up seeing behind a lot of the NAS heads these days? A SAN. :-) > > But the converse, that managing storage when you''re the NAS is > harder is > not true. With a NAS server you can have one big volume with whole > disks -- no LUNs to maintain -- and use filesystem quotas to manage > storage allocations. I.e., storage allocation and storage devices are > decoupled.----- Gregory Shaw, IT Architect Phone: (303) 673-8273 Fax: (303) 673-8273 ITCTO Group, Sun Microsystems Inc. 1 StorageTek Drive ULVL4-382 greg.shaw at sun.com (work) Louisville, CO 80028-4382 shaw at fmsoft.com (home) "When Microsoft writes an application for Linux, I''ve Won." - Linus Torvalds
On Apr 20, 2006, at 3:28 PM, Torrey McMahon wrote:> Gregory Shaw wrote: >> >> On Apr 19, 2006, at 3:16 PM, Nicolas Williams wrote: >> >> >>> >>> Redundancy (RAID-Z, mirroring, replication) + snapshots and/or >>> backup to >>> disk + infrequent backup to tape is one answer. >>> >> >> Even with snapshots, you''ve got to back it up to tape. That >> involves the same disks, so it helps in an application sense (no >> downtime due to snapshot), but it doesn''t impact the need to back >> everything up. > > Do you? Maybe you can just take the snapshots? Maybe you just keep > them on disk someplace? Maybe you really don''t have to archive as > much as you need to?> One of my favorite past times is getting the CFO and legal types to > answer the "What are the data retention requirements?" instead of > the CIO types. Ask most IT departments and they''ll say, "Level 0 > every month, clone tapes off site, incremental every day." The CFO/ > Legal folks always have better answers that meet the real business > requirements. (Well...not always but I think you get my meaning.) >With Sarbanes-Oxley, most companies are going far the other direction -- keep everything for 7-27 years. In my experience, engineering wants their data offsite for 7 years, while core business (such as ERP systems) aim higher, such as 27 years. Keeping things on disk doesn''t address disaster recovery either.>> >>> Backup/restore has long been a problem, and local storage makes the >>> problem worse, not better. >>> >> >> When you say NAS, do you mean appliances, or servers? It changes >> the picture significantly between the two. > > > Whats the difference? >----- Gregory Shaw, IT Architect Phone: (303) 673-8273 Fax: (303) 673-8273 ITCTO Group, Sun Microsystems Inc. 1 StorageTek Drive ULVL4-382 greg.shaw at sun.com (work) Louisville, CO 80028-4382 shaw at fmsoft.com (home) "When Microsoft writes an application for Linux, I''ve Won." - Linus Torvalds
On Thu, Apr 20, 2006 at 04:15:13PM -0600, Gregory Shaw wrote:> Funny, that''s what ZFS does as well.That''s the point. I may be speaking of NASes generally, but in mind I have Solaris, ZFS, NFSv4, CIFS.> On Apr 20, 2006, at 3:25 PM, Nicolas Williams wrote: > >But the converse, that managing storage when you''re the NAS is > >harder is > >not true. With a NAS server you can have one big volume with whole > >disks -- no LUNs to maintain -- and use filesystem quotas to manage > >storage allocations. I.e., storage allocation and storage devices are > >decoupled.
> In my experience, engineering wants their data offsite for 7 years, while core business (such as ERP systems) aim higher, such as 27 years.

27? Better use printers and ink then.

There''s no backup media I know of that will live that long.

(Spinning rust is the only one which will survive, as long as the data is migrated to new technology every 3-5 years.)

Don''t expect any backup media to be readable after 5-10 years (if you can find the drives, the media will have perished).

Casper
> With Sarbanes-Oxley, most companies are going far the other direction > -- keep everything for 7-27 years. > > In my experience, engineering wants their data offsite for 7 years, > while core business (such as ERP systems) aim higher, such as 27 years. > > Keeping things on disk doesn''t address disaster recovery either. > ----- > Gregory Shaw, IT Architect > Phone: (303) 673-8273 Fax: (303) 673-8273 > ITCTO Group, Sun Microsystems Inc. > 1 StorageTek Drive ULVL4-382 greg.shaw at sun.com (work) > Louisville, CO 80028-4382 shaw at fmsoft.com (home) > "When Microsoft writes an application for Linux, I''ve Won." - Linus > Torvalds >The problem here is that both Engineering and Business generally say "keep it all". That''s idiotic. Even with S-OX. Take us here in JavaSoft for example. A good chunk of the email should be kept according to SOX requirements. As should core business data, such as what goes into customer and ordering DBs. BUT, the vast majority of daily work has no need to be kept, even with SOX coming in now. Certainly, all the Dev and QA work has much lower retention requirements. In a company that has enough $$$ to contemplate a SAN, multi-site disk redundancy is well within the cost horizon, and can cover the needs of Engineering for both backup and redundancy quite well. ZFS (and similar) snapshot capability and WAN mirroring via a SAN work perfectly for backup and D-R. Periodic archive of important long-term data is still required, but that is a VERY small amount compared to the transient data volume. Back to us here in JavaSoft. We generate about 10TB/year in build and testing binaries. For example, there are weekly code snapshots of our Mustang (JDK6) work, which are then built on multiple architectures and run through QA. Now, we need to keep the code snapshot around to do regression analysis, and it might be good to keep the built binaries, but neither have long-term requirements. When 6.0 finally ships later this year, we can effectively dump virtually all the build/test binaries and related test data, as it can be regenerated at will. In reality, we probably need to keep about ~50% of our data less than 1 year, ~40% for 1-3 years, and less than 5% for longer than 3. I can''t see this a atypical for an engineering department. People continually confuse archival, backup, and redundancy (disaster recovery) as the same thing. Mgmt as a whole (speaking about the business world in general) really needs to have this drilled into their skulls - one system does NOT fulfill all three purposes. While it is possible to have a single system for all 3, it is severely sub-optimal in time, cost, and effort. Erik Trimble Java System Support Mailstop: usca14-102 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Casper.Dik at sun.com wrote:
>> In my experience, engineering wants their data offsite for 7 years, while core business (such as ERP systems) aim higher, such as 27 years.
>
> 27? Better use printers and ink then.
>
> There''s no backup media I know of that will live that long.
>
> (Spinning rust is the only one which will survive, as long as the data is migrated to new technology every 3-5 years.)
>
> Don''t expect any backup media to be readable after 5-10 years (if you can find the drives, the media will have perished).

Well, I have plenty of 20+ year old CDs that aren''t showing any signs of degradation and are all still readable on new commodity hardware today, but I''m not going to debate about the longevity of a single piece of media.

Dana
Archival quality Tape will reliably last at least 10 years if stored properly. Finding a tape drive to read it, however, is a severe problem. :-) Long-term data storage is a problem. The best solution I''ve seen is Magneto-Optical stuff (the media is resistant to all common problems), but capacities suck, and finding old readers is problematic. Outside that, if you are truly worried about archival, then mastering a DVD is the best option. Mastered (i.e. pressed) DVD/CDs will last 50 years or more with proper storage, and we''ll probably have a better chance finding an operational reader for the format in 2050 than any other media. -Erik On Fri, 2006-04-21 at 00:45 +0200, Casper.Dik at Sun.COM wrote:> >In my experience, engineering wants their data offsite for 7 years, > >while core business (such as ERP systems) aim higher, such as 27 years. > > 27? Better use printers and ink then. > > There''s no backup media I know off that will live that long. > > (Spinning rust is the only one which will survive as long as the > data is migrated to new technology every 3-5 years) > > Don''t expect any backup media to be readable after 5-10 years > (if you can find the drives, the media will have perished) > > Casper > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss-- Erik Trimble Java System Support Mailstop: usca14-102 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
Wow, what an email. See my comments below: On Apr 20, 2006, at 4:57 PM, Erik Trimble wrote:> >> With Sarbanes-Oxley, most companies are going far the other direction >> -- keep everything for 7-27 years. >> >> In my experience, engineering wants their data offsite for 7 years, >> while core business (such as ERP systems) aim higher, such as 27 >> years. >> >> Keeping things on disk doesn''t address disaster recovery either. >> ----- >> Gregory Shaw, IT Architect >> Phone: (303) 673-8273 Fax: (303) 673-8273 >> ITCTO Group, Sun Microsystems Inc. >> 1 StorageTek Drive ULVL4-382 greg.shaw at sun.com (work) >> Louisville, CO 80028-4382 shaw at fmsoft.com (home) >> "When Microsoft writes an application for Linux, I''ve Won." - Linus >> Torvalds >> > > > The problem here is that both Engineering and Business generally say > "keep it all". That''s idiotic. Even with S-OX. >I totally agree. Being in IT, I''ve got to design the solutions that have to deal with it. Data never gets smaller.> Take us here in JavaSoft for example. A good chunk of the email should > be kept according to SOX requirements. As should core business data, > such as what goes into customer and ordering DBs. > > BUT, the vast majority of daily work has no need to be kept, even with > SOX coming in now. Certainly, all the Dev and QA work has much lower > retention requirements. > > > In a company that has enough $$$ to contemplate a SAN, multi-site disk > redundancy is well within the cost horizon, and can cover the needs of > Engineering for both backup and redundancy quite well. ZFS (and > similar) > snapshot capability and WAN mirroring via a SAN work perfectly for > backup and D-R. Periodic archive of important long-term data is still > required, but that is a VERY small amount compared to the transient > data > volume. > > > Back to us here in JavaSoft. We generate about 10TB/year in build and > testing binaries. For example, there are weekly code snapshots of our > Mustang (JDK6) work, which are then built on multiple architectures > and > run through QA. Now, we need to keep the code snapshot around to do > regression analysis, and it might be good to keep the built binaries, > but neither have long-term requirements. When 6.0 finally ships later > this year, we can effectively dump virtually all the build/test > binaries > and related test data, as it can be regenerated at will. > > In reality, we probably need to keep about ~50% of our data less > than 1 > year, ~40% for 1-3 years, and less than 5% for longer than 3. I can''t > see this a atypical for an engineering department. > > > People continually confuse archival, backup, and redundancy (disaster > recovery) as the same thing. Mgmt as a whole (speaking about the > business world in general) really needs to have this drilled into > their > skulls - one system does NOT fulfill all three purposes. While it is > possible to have a single system for all 3, it is severely sub-optimal > in time, cost, and effort. > > > > Erik Trimble > Java System Support > Mailstop: usca14-102 > Phone: x17195 > Santa Clara, CA > Timezone: US/Pacific (GMT-0800) >I see two solutions here: 1. What you''re talking about at a basic level is Information Lifecycle Management (ILM). If we had defined policies around data retention, automation for the data policies can be implemented. However, without policies, you''re stuck with an all-or-nothing view by the business which translates into ''all''. Everything has to be backed up, everything has to be offsite in multiple copies forever. 
If we had policies, we could use ILM to migrate the data (via SAMFS or another solution) from tier to tier of storage. That would significantly reduce the ongoing cost. 2. In your particular code case, you might want to look at the Intellistore storage solution. It''s part of the STK purchase. It allows you to define storage policies for data retention. In other words, you configure a NFS share with a defined data lifetime. It will guarantee that the data will not be touched and will live only as long as you need it. I find in the unix admin space that everybody wants to have the data go away. However, we''re not very good at following through when the data has expired and should be deleted. ----- Gregory Shaw, IT Architect Phone: (303) 673-8273 Fax: (303) 673-8273 ITCTO Group, Sun Microsystems Inc. 1 StorageTek Drive ULVL4-382 greg.shaw at sun.com (work) Louisville, CO 80028-4382 shaw at fmsoft.com (home) "When Microsoft writes an application for Linux, I''ve Won." - Linus Torvalds
On Apr 20, 2006, at 4:08 PM, Dana H. Myers wrote:
> Casper.Dik at sun.com wrote:
>>> In my experience, engineering wants their data offsite for 7 years, while core business (such as ERP systems) aim higher, such as 27 years.
>>
>> 27? Better use printers and ink then.
>>
>> There''s no backup media I know of that will live that long.
>>
>> (Spinning rust is the only one which will survive, as long as the data is migrated to new technology every 3-5 years.)
>>
>> Don''t expect any backup media to be readable after 5-10 years (if you can find the drives, the media will have perished).
>
> Well, I have plenty of 20+ year old CDs that aren''t showing any signs of degradation and are all still readable on new commodity hardware today, but I''m not going to debate about the longevity of a single piece of media.
>
> Dana

Yes, but I''m sure those were CDs where the data was physically stamped into the metal.

Re/Writable CDs used laser-heated dyes and won''t last anywhere near as long.

ckl
Chad Lewis wrote:> > On Apr 20, 2006, at 4:08 PM, Dana H. Myers wrote: > >> Casper.Dik at sun.com wrote: >>>> In my experience, engineering wants their data offsite for 7 years, >>>> while core business (such as ERP systems) aim higher, such as 27 years. >>> >>> 27? Better use printers and ink then. >>> >>> There''s no backup media I know off that will live that long.>> Well, I have plenty of 20+ year old CDs that aren''t showing any >> signs of degradation and are all still readable on new commodity hardware >> today, but I''m not going to debate about the longevity of a single piece >> of media. >> >> Dana >> > > Yes, but I''m sure those were CDs where the data was physically stamped > into the metal. > > Re/Writable CDs used laser-heated dyes and won''t last anywhere near as > long.Of course. If you''re looking for long-term retention, you wouldn''t use -R or -R/W. Dana
> Well, I have plenty of 20+ year old CDs that aren''t showing any signs of degradation and are all still readable on new commodity hardware today, but I''m not going to debate about the longevity of a single piece of media.

20 year old CD-Rs? Or 20 year old CDs? (I''m not sure that writable CDs even existed, 20 years ago)

CDs will generally live that long; CD-Rs do not.

Casper
Casper.Dik at Sun.COM wrote:>> Well, I have plenty of 20+ year old CDs that aren''t showing any >> signs of degradation and are all still readable on new commodity hardware >> today, but I''m not going to debate about the longevity of a single piece >> of media. > > 20 year old CD-Rs? Or 20 year old CDs? (I''m not sure that writable CDs > even existed, 20 years ago)You didn''t specify the difference; CDs are as much media as CD-Rs are. I''m aware that CDs take longer and cost more to write than CD-Rs; but they''re media all the same. However, none of this matters; maintaining a data archive is much more than making a copy of the data and storing it a long time. Dana
> You didn''t specify the difference; CDs are as much media as CD-Rs are. I''m aware that CDs take longer and cost more to write than CD-Rs; but they''re media all the same.

Well, they''re not "backup media" by any standard.

But the optical disks, even pressed, are useless. They don''t store enough data; not currently anyway. And the gap is only worsening.

Casper
Dana H. Myers wrote:
> Casper.Dik at Sun.COM wrote:
>>> Well, I have plenty of 20+ year old CDs that aren''t showing any signs of degradation and are all still readable on new commodity hardware today, but I''m not going to debate about the longevity of a single piece of media.
>>
>> 20 year old CD-Rs? Or 20 year old CDs? (I''m not sure that writable CDs even existed, 20 years ago)
>
> You didn''t specify the difference; CDs are as much media as CD-Rs are. I''m aware that CDs take longer and cost more to write than CD-Rs; but they''re media all the same.

The difference matters a lot. There is a huge difference in the quality and the failure characteristics between pressed (ie in a proper manufacturing plant) media and those produced in consumer burners. This is basically the difference between CD and CD-R - same applies in the DVD world too.

--
Darren J Moffat
Casper.Dik at sun.com wrote:
>> You didn''t specify the difference; CDs are as much media as CD-Rs are. I''m aware that CDs take longer and cost more to write than CD-Rs; but they''re media all the same.
>
> Well, they''re not "backup media" by any standard.

They may be archive media, though. Backups and archives aren''t necessarily the same thing. If the requirement is 27-year retention, it may in fact become cost-effective to master DVDs and handle them less often, particularly if done in bulk.

> But the optical disks, even pressed, are useless. They don''t store enough data; not currently anyway. And the gap is only worsening.

This is indeed a problem.

Dana
Dana H. Myers
2006-Apr-21 17:48 UTC
Volume of DVDs vs. disk drives (was Re: [zfs-discuss] Re: Sun JBOD setup)
Casper.Dik at sun.com wrote:
>> You didn''t specify the difference; CDs are as much media as CD-Rs are. I''m aware that CDs take longer and cost more to write than CD-Rs; but they''re media all the same.
>
> Well, they''re not "backup media" by any standard.
>
> But the optical disks, even pressed, are useless. They don''t store enough data; not currently anyway. And the gap is only worsening.

Actually, this got me to thinking. If one desires to maintain a long-term archive for SOX, I''m guessing that the archive will be rarely accessed. So it''s perfectly reasonable to master double-sided DVDs and stack them on spindles, and put them in a vault.

Each DVD would require a volume of around 9cc and would store approximately 9.4GB. A common 400GB SATA drive today has a volume of around 394 cc; thus DVDs could contain 394/9 * 9.4GB = 411GB in approximately the same volume. This is just estimating, of course.

Fast forward to Blu-Ray (for example); it seems that a single double-sided Blu-Ray disk could contain 50GB, or something in excess of 2TB in the same volume as a disk drive. I''m assuming that Blu-Ray disks would be pressed metal, could be double-sided, and have the same volume as current DVDs.

Since we''re talking about archiving data for a long period of time, the latency in mastering DVDs is unimportant, and the cost savings over the lifetime of the archive probably more than makes up for the greater cost of mastering the DVDs, particularly if this turns into a common business.

Note that I calculate the volume of a DVD based on the square of the diameter, since circles don''t pack as tightly as squares.

Dana
Nicolas Williams
2006-Apr-21 18:30 UTC
Volume of DVDs vs. disk drives (was Re: [zfs-discuss] Re: Sun JBOD setup)
On Fri, Apr 21, 2006 at 10:48:49AM -0700, Dana H. Myers wrote:> Since we''re talking about archiving data for a long period of time, the > latency in mastering DVDs is unimportant, and the cost savings over the > lifetime of the archive probably more than makes up for the greater cost > of mastering the DVDs, particularly if this turns into a common business.Well, the latency matters because the archive system needs to be able to cache bandwidth * latency. A quick search leads me to think that latency would be somewhere around 10 seconds. For bandwidth = 1TB/day that works out to a reasonably small number (<1/2 TB). OK, yes, the latency in mastering DVDs is unimportant :) Nico --
Nicolas Williams
2006-Apr-21 18:35 UTC
Volume of DVDs vs. disk drives (was Re: [zfs-discuss] Re: Sun JBOD setup)
On Fri, Apr 21, 2006 at 01:30:42PM -0500, Nicolas Williams wrote:> On Fri, Apr 21, 2006 at 10:48:49AM -0700, Dana H. Myers wrote: > > Since we''re talking about archiving data for a long period of time, the > > latency in mastering DVDs is unimportant, and the cost savings over the > > lifetime of the archive probably more than makes up for the greater cost > > of mastering the DVDs, particularly if this turns into a common business. > > Well, the latency matters because the archive system needs to be able to > cache bandwidth * latency. A quick search leads me to think that > latency would be somewhere around 10 seconds. For bandwidth = 1TB/day > that works out to a reasonably small number (<1/2 TB). OK, yes, the > latency in mastering DVDs is unimportant :)Or not, 10 seconds is pressing time. I don''t know how long it takes to make a master. I''m guessing that mastering is not realistic.
Erik Trimble
2006-Apr-21 19:48 UTC
Volume of DVDs vs. disk drives (was Re: [zfs-discuss] Re: Sun JBOD setup)
I''ve done this before for CDs. Mastering a CD (i.e. getting it pressed) usually takes about 2 weeks, provided you''ve already got the account/relationship set up with the vendor. Normally, it''s 1 day to burn THREE CD-Rs of all the data (vendors want multiple verification before they commit to plastic) and ship it out to them. 2-3 days for the mail. 5 business days for them to insert your work into their work queue and make a run of 100. 2-3 days to ship it back to you. Voila!

Also, when you store them, you DON''T store them on a spindle. It''s bad for the CD, and it makes finding the right one more difficult. Generally, I''ve found that they get stored in a standard jewel case, complete with full label (and barcode) in an (acid-free) paper insert (just like liner notes).

In reality, it''s fine for archival. VERY few businesses truly need vast quantities of archival data. Even a company the size of Sun, I''d be surprised if we needed more than 1TB of long-term archives per year. All of JavaSoft probably produces no more than 25GB per year of data that should be archived (source code snapshots of our public releases - nice compression ratios, too), and I''d be surprised if any other Engineering division was significantly different. HR and Sales generate the most data, and even there, it''s all DB files, and that stuff compresses _very_ nicely. :-)

So, for a Fortune 100 company, we''d use somewhere around 200 DVDs per year (assuming some wasted space). That''s about 75 linear feet of shelf space, 6" high, 6" deep, or less than 20 cubic feet. That''s trivial.

And certified archival-quality pressed CDs have GUARANTEED lifespans of 50+ years. I haven''t looked at the corresponding DVDs, but I can''t imagine them being different.

Erik Trimble
Java System Support
Mailstop: usca14-102
Phone: x17195
Santa Clara, CA
Erik Trimble <Erik.Trimble at Sun.COM> wrote:
> Outside that, if you are truly worried about archival, then mastering a DVD is the best option. Mastered (i.e. pressed) DVD/CDs will last 50 years or more with proper storage, and we''ll probably have a better chance finding an operational reader for the format in 2050 than any other media.

This is also true for "burned" DVDs if you use the right media.

Jörg

--
EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
      js at cs.tu-berlin.de (uni)
      schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/
URL:  http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
Chad Lewis <Chad.Lewis at Sun.COM> wrote:
> > Well, I have plenty of 20+ year old CDs that aren't showing any
> > signs of degradation and are all still readable on new commodity
> > hardware today, but I'm not going to debate about the longevity of a
> > single piece of media.
> >
> > Dana
>
> Yes, but I'm sure those were CDs where the data was physically
> stamped into the metal.
>
> Re/Writable CDs used laser-heated dyes and won't last anywhere near
> as long.

I still have a readable Kodak CD from 1992 (when the first CD-Rs came out).

Jörg
On Wed, Apr 26, 2006 at 06:29:57PM +0200, Joerg Schilling wrote:
> Erik Trimble <Erik.Trimble at Sun.COM> wrote:
> > Outside that, if you are truly worried about archival, then mastering a
> > DVD is the best option. Mastered (i.e. pressed) DVD/CDs will last 50
> > years or more with proper storage, and we'll probably have a better
> > chance finding an operational reader for the format in 2050 than any
> > other media.
>
> This is also true for "burned" DVDs in case you use the right media.

Links? (I believe you, I just want to know what media to buy.)
On Wed, 2006-04-26 at 11:51 -0500, Nicolas Williams wrote:
> On Wed, Apr 26, 2006 at 06:29:57PM +0200, Joerg Schilling wrote:
> > Erik Trimble <Erik.Trimble at Sun.COM> wrote:
> > > Outside that, if you are truly worried about archival, then mastering a
> > > DVD is the best option. Mastered (i.e. pressed) DVD/CDs will last 50
> > > years or more with proper storage, and we'll probably have a better
> > > chance finding an operational reader for the format in 2050 than any
> > > other media.
> >
> > This is also true for "burned" DVDs in case you use the right media.
>
> Links? (I believe you, I just want to know what media to buy.)

http://www.delkin.com/delkin_products_archival_gold_dvd.html

Kodak makes some too: http://www.dvd-recordable.org/Article2616.phtml

Google for "DVD gold media" with the additional terms "archive" or
"archival". There's a good selection there.

--
Erik Trimble
Java System Support
Mailstop: usca14-102
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Joerg Schilling
2006-Apr-29 12:07 UTC
Volume of DVDs vs. disk drives (was Re: [zfs-discuss] Re: Sun JBOD setup)
"Dana H. Myers" <Dana.Myers at Sun.COM> wrote:> Fast forward to Blu-Ray (for example); it seems that a single double-sided > Blu-Ray disk could contain 50GB, or something in excess of 2TB in the same > volume as a disk drive. I''m assuming that Blu-Ray disks would be pressed > metal, could be double-sided, and have the same volume as current DVDs.HD-DVD and Blu ray are not using organic dye anymore, they are based on phase change technology afaik. For reallity: Note that support for cdrw did stop recently and that cdrecord is the software that is actively maintained. But HD-DVD and Blu ray drives are expensive (~300 Euro for the laser component only) and for this reason, drive manufacturers do (currently) not give away sample drives for free. So you would need to wait some time until I am able to support the drives with cdrecord. J?rg -- EMail:joerg at schily.isdn.cs.tu-berlin.de (home) J?rg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
Nicolas Williams <Nicolas.Williams at Sun.COM> wrote:
> On Wed, Apr 26, 2006 at 06:29:57PM +0200, Joerg Schilling wrote:
> > Erik Trimble <Erik.Trimble at Sun.COM> wrote:
> > > Outside that, if you are truly worried about archival, then mastering a
> > > DVD is the best option. Mastered (i.e. pressed) DVD/CDs will last 50
> > > years or more with proper storage, and we'll probably have a better
> > > chance finding an operational reader for the format in 2050 than any
> > > other media.
> >
> > This is also true for "burned" DVDs in case you use the right media.
>
> Links? (I believe you, I just want to know what media to buy.)

I am sorry, this was Pioneer information I read in 1998, so it is most
likely referring to Pioneer or TDK media.

Jörg
Erik Trimble <Erik.Trimble at Sun.COM> wrote:
> On Wed, 2006-04-26 at 11:51 -0500, Nicolas Williams wrote:
> > On Wed, Apr 26, 2006 at 06:29:57PM +0200, Joerg Schilling wrote:
> > > Erik Trimble <Erik.Trimble at Sun.COM> wrote:
> > > > Outside that, if you are truly worried about archival, then mastering a
> > > > DVD is the best option. Mastered (i.e. pressed) DVD/CDs will last 50
> > > > years or more with proper storage, and we'll probably have a better
> > > > chance finding an operational reader for the format in 2050 than any
> > > > other media.
> > >
> > > This is also true for "burned" DVDs in case you use the right media.
> >
> > Links? (I believe you, I just want to know what media to buy.)
>
> http://www.delkin.com/delkin_products_archival_gold_dvd.html
>
> Kodak makes some too: http://www.dvd-recordable.org/Article2616.phtml

From what I've heard, the "Kodak" media is from MAME Italy - mid-level
quality.

Jörg