Interesting blog:

http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/

Regards,

--
Al Hopper  Logical Approach Inc, Plano, TX  al at logical-approach.com
           Voice: 972.379.2133  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Yeah, I wrote them about it. I said they should sell them, and even better, pair them with their offsite backup service, kind of like a massive appliance-and-service option.

They're not selling them, but they did encourage me to just make a copy of the design. It looks like the only questionable piece in it is the port multipliers - SiI3726 if I recall - which I think is only just becoming supported in the most recent snvs? That's been something I've been wanting forever anyway.

You could also just design your own case that is optimized for a bunch of disks, plus a mobo, as long as it has ECC support and enough PCI/PCI-X/PCIe slots for the number of cards you need to add. You might be able to build one without port multipliers and just use a bunch of 8-, 12-, or 16-port SATA controllers.

I want to design a case that has two layers - an internal layer with all the drives and guts, and an external layer that pushes air around it to exhaust it quietly and adds additional noise dampening...

Sent from my iPhone

On Sep 2, 2009, at 11:01 AM, Al Hopper <al at logical-approach.com> wrote:

> Interesting blog:
>
> http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/
As some Sun folks pointed out:

1) No redundancy at the power or networking side
2) Getting 2TB drives in a x4540 would make the numbers closer
3) Performance isn't going to be that great with their design, but... they might not need it

On 9/2/2009 2:13 PM, Michael Shadle wrote:
> Yeah I wrote them about it. I said they should sell them and even
> better pair it with their offsite backup service kind of like a
> massive appliance and service option.
>
> They're not selling them but did encourage me to just make a copy of
> it. It looks like the only questionable piece in it is the port
> multipliers. SiI3726 if I recall. Which I think just barely is
> becoming supported in the most recent snvs?
> As some Sun folks pointed out
>
> 1) No redundancy at the power or networking side
> 2) Getting 2TB drives in a x4540 would make the numbers closer
> 3) Performance isn't going to be that great with their design but...they
> might not need it.

4) Silicon Image chipsets. Their SATA controller chips, used on a variety of mainboards, are already well known for their unreliability and data corruption. I'd not want a whole bunch of SiI chips handling 67TB.

-mg
Mario Goebbels wrote:
>> As some Sun folks pointed out
>>
>> 1) No redundancy at the power or networking side
>> 2) Getting 2TB drives in a x4540 would make the numbers closer
>> 3) Performance isn't going to be that great with their design but...they
>> might not need it.
>
> 4) Silicon Image chipsets. Their SATA controller chips used on a
> variety of mainboards are already well known for their unreliability
> and data corruption. I'd not want a whole bunch of SiI chips handle 67TB.

5) Where's the ECC RAM?
6) Management interface? Lustre + ZFS... I'm already bouncing around ideas with others about an open "Fishworks". Maybe this is the boost we needed to justify sponsoring some of the development...

Anyone interested?

./C

------
CTO PathScale // Open source developer
Follow me - http://www.twitter.com/CTOPathScale
blog: http://www.codestrom.com
Torrey McMahon wrote:
> 3) Performance isn't going to be that great with their design but...they
> might not need it.

Would you be able to qualify this assertion? Thinking through it a bit, even if the disks are better than average and can achieve 1000Mb/s each, each uplink from the multiplier to the controller will still have 1000Gb/s to spare in the slowest SATA mode out there. With (5) disks per multiplier * (2) multipliers * 1000GB/s each, that's 10000Gb/s at the PCI-e interface, which approximately coincides with a meager 4x PCI-e slot.
IMHO it depends on the usage model. Mine is for home storage. A couple of HD streams at most. 40MB/sec over a gigabit network switch is pretty good with me.

On Wed, Sep 2, 2009 at 11:54 AM, Jacob Ritorto <Jacob.Ritorto at gmail.com> wrote:
> Torrey McMahon wrote:
>> 3) Performance isn't going to be that great with their design but...they
>> might not need it.
>
> Would you be able to qualify this assertion? Thinking through it a bit,
> even if the disks are better than average and can achieve 1000Mb/s each,
> each uplink from the multiplier to the controller will still have 1000Gb/s
> to spare in the slowest SATA mode out there. With (5) disks per multiplier
> * (2) multipliers * 1000GB/s each, that's 10000Gb/s at the PCI-e interface,
> which approximately coincides with a meager 4x PCI-e slot.
Jacob,

Jacob Ritorto schrieb:
> Torrey McMahon wrote:
>> 3) Performance isn't going to be that great with their design
>> but...they might not need it.
>
> Would you be able to qualify this assertion? Thinking through it a bit,
> even if the disks are better than average and can achieve 1000Mb/s each,
> each uplink from the multiplier to the controller will still have
> 1000Gb/s to spare in the slowest SATA mode out there. With (5) disks
> per multiplier * (2) multipliers * 1000GB/s each, that's 10000Gb/s at
> the PCI-e interface, which approximately coincides with a meager 4x
> PCI-e slot.

They use an $85 PC motherboard. That does not have "meager 4x PCI-e slots"; it has one 16x and three *1x* PCIe slots, plus 3 PCI slots (remember, long time ago: 32-bit wide, 33 MHz, probably a shared bus).

Also, it seems that all external traffic uses the single GbE motherboard port.

  -- Roland

--
**********************************************************
Roland Rambau                 Platform Technology Team
Principal Field Technologist  Global Systems Engineering
Phone: +49-89-46008-2520      Mobile: +49-172-84 58 129
Fax:   +49-89-46008-2222      mailto:Roland.Rambau at sun.com
**********************************************************
   Sitz der Gesellschaft: Sun Microsystems GmbH,
   Sonnenallee 1, D-85551 Kirchheim-Heimstetten
   Amtsgericht München: HRB 161028;  Geschäftsführer:
   Thomas Schröder, Wolfgang Engels, Wolf Frenkel
   Vorsitzender des Aufsichtsrates:   Martin Häring
******* UNIX ********* /bin/sh ******** FORTRAN **********
On Wed, Sep 02, 2009 at 02:54:42PM -0400, Jacob Ritorto wrote:
> Torrey McMahon wrote:
>> 3) Performance isn't going to be that great with their design
>> but...they might not need it.
>
> Would you be able to qualify this assertion? Thinking through it a bit,
> even if the disks are better than average and can achieve 1000Mb/s each,
> each uplink from the multiplier to the controller will still have
> 1000Gb/s to spare in the slowest SATA mode out there. With (5) disks
> per multiplier * (2) multipliers * 1000GB/s each, that's 10000Gb/s at
> the PCI-e interface, which approximately coincides with a meager 4x
> PCI-e slot.

Let's look at the math. First, I don't know how 5 * 2 * 1000GB/s equals 10000Gb/s, or how a 4x PCIe-gen2 slot, which can't really push a 10Gb/s Ethernet NIC, can do 1000x that.

Moving on, modern high-capacity SATA drives are in the 100-120MB/s range. Let's call it 125MB/s for easier math. A 5-port port multiplier (PM) has 5 links to the drives and 1 uplink. SATA-II speed is 3Gb/s, which after all the framing overhead can get you 300MB/s on a good day. So 3 drives can more than saturate a PM. 45 disks (9 backplanes at 5 disks + PM each) in the box won't get you more than about 21 drives' worth of performance, tops. So you leave at least half the available drive bandwidth on the table, in the best of circumstances.

That also assumes that the SiI controllers can push 100% of the bandwidth coming into them, which would be 300MB/s * 2 ports = 600MB/s, which is getting close to a 4x PCIe-gen2 slot. Frankly, I'd be surprised. And the card that uses 3 of the 4 ports has to do more like 900MB/s, which is greater than 4x PCIe-gen2 can pull off in the real world.

And I'd re-iterate what myself and others have observed about SiI and silent data corruption over the years.

Most of your data, most of the time, it would seem.

--Bill
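To make the arithmetic above easy to check, here is a minimal back-of-the-envelope sketch in Python. The constants are the figures assumed in this thread (125MB/s per drive, roughly 300MB/s usable per SATA-II port-multiplier uplink, 9 backplanes of 5 drives each), not measured numbers for the Backblaze pod.

# Port-multiplier bottleneck sketch. All constants are this thread's
# assumptions, not measurements.

DRIVE_MBPS = 125        # assumed per high-capacity SATA drive
PM_UPLINK_MBPS = 300    # assumed usable SATA-II bandwidth per PM uplink
DRIVES_PER_PM = 5
BACKPLANES = 9          # 9 backplanes x 5 drives = 45 drives

raw = BACKPLANES * DRIVES_PER_PM * DRIVE_MBPS                           # 5625 MB/s raw
capped = BACKPLANES * min(DRIVES_PER_PM * DRIVE_MBPS, PM_UPLINK_MBPS)   # 2700 MB/s

print(f"raw drive bandwidth   : {raw} MB/s")
print(f"PM-limited ceiling    : {capped} MB/s")
print(f"equivalent drive count: {capped / DRIVE_MBPS:.1f} of 45")   # ~21.6 drives
print(f"bandwidth left unused : {1 - capped / raw:.0%}")            # ~52%

Running it reproduces the "about 21 drives" and "at least half the bandwidth on the table" figures quoted above.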
On Wed, Sep 2, 2009 at 12:12 PM, Roland Rambau <Roland.Rambau at sun.com> wrote:
> They use an $85 PC motherboard. That does not have "meager 4x PCI-e slots";
> it has one 16x and three *1x* PCIe slots, plus 3 PCI slots (remember, long
> time ago: 32-bit wide, 33 MHz, probably a shared bus).
>
> Also, it seems that all external traffic uses the single GbE motherboard
> port.

Probably for their usage patterns, these boxes make sense. But I concur that the reliability and performance would be very suspect to any organization which values its data in any fashion.

Personally, I have some old dual P3 systems still running fine at home, on what were cheap motherboards. But would I advocate such a system to protect business data? Not a chance.

I'm sure at the price they offer storage, this was the only way they could be profitable, and it's a pretty creative solution. For my personal data backups, I'm sure their service would meet all my needs, but that's about as far as I would trust these systems: MP3s and backups of photos for which I already maintain a couple of copies.

--
Brent Jones
brent at servuhome.net
On Sep 2, 2009, at 11:54 AM, Jacob Ritorto wrote:
> Torrey McMahon wrote:
>> 3) Performance isn't going to be that great with their design
>> but...they might not need it.
>
> Would you be able to qualify this assertion? Thinking through it a
> bit, even if the disks are better than average and can achieve
> 1000Mb/s each, each uplink from the multiplier to the controller
> will still have 1000Gb/s to spare in the slowest SATA mode out
> there. With (5) disks per multiplier * (2) multipliers * 1000GB/s
> each, that's 10000Gb/s at the PCI-e interface, which approximately
> coincides with a meager 4x PCI-e slot.

That doesn't matter. It does HTTP PUT/GET, so it is completely limited by the network interface. The advantage to their model is that they are not required to implement a POSIX file system. PUT/GET is very easy to implement and tends to involve large transfers. In other words, they aren't running an OLTP database, no user-level quotas, no directories with millions of files, etc. The simple life can be good :-)

I'd be more interested in seeing their field failure rate data :-)

FWIW, bringing such a product to a global market would raise the list price to be on par with the commercially available products. Testing, qualifying, service, documentation, warranty, marketing, distribution, taxes, sales, and all sorts of other costs add up quickly.

-- richard
On Sep 2, 2009, at 14:48, C. Bergström wrote:
> 5) Where's the ECC RAM?
> 6) Management interface? Lustre + ZFS... I'm already bouncing
> around ideas with others about an open "Fishworks". Maybe this is
> the boost we needed to justify sponsoring some of the development...
> Anyone interested?

Redundancy is handled on the software side (a la Google). From Backblaze's Tim Nufire:

> ... on redundant power, it's easy to swap out the 2 PSUs in the
> current design with a 3+1 redundant unit. This adds a couple hundred
> dollars to the cost and since we built redundancy into our software
> layer we don't need it. Our goal was dumb hardware, smart software.

http://storagemojo.com/2009/09/01/cloud-storage-for-100-a-terabyte/#comment-204892

The design goal was cheap space. The same comment also states that only one of the six fans actually needs to be running to handle cooling.

I think a lot of people seem to be critiquing the "Blazebox Pod" on criteria that it wasn't meant to meet. It solved their problem (oodles of storage) at about a magnitude less cost than the closest alternatives. If you want redundancy and integrity, you do it higher in the stack.
On Sep 2, 2009, at 15:14, Bill Moore wrote:
> And I'd re-iterate what myself and others have observed about SiI and
> silent data corruption over the years.
>
> Most of your data, most of the time, it would seem.

Unless you have two or three or nine of these things and you spread data around. For the $1M that they claim a petabyte from Sun costs, they're able to make nine of their pods.

Just because they don't have redundancy and checksumming on the box doesn't mean it doesn't exist higher up in their stack. :)
> Unless you have two or three or nine of these things and you spread data
> around. For the $1M that they claim a petabyte from Sun costs, they're
> able to make nine of their pods.

It is the claim of the cost from Sun that I am sceptical about. I admit that it will be more expensive, and I know that as someone from academia I end up with discounts, but by my reckoning it is about half that price for bulk purchases from Sun. One day (as far as I know, not yet) Sun will release 1.5 or even 2TB drives and close the gap further. Also, I suspect that I could get something like the Satabeast (regardless of what people think about it) for significantly less than that per petabyte.

> Just because they don't have redundancy and checksumming on the box
> doesn't mean it doesn't exist higher up in their stack. :)

As far as I can work out, the higher up the stack you put the redundancy and checksumming, the more storage you end up needing, which has a cost associated with it.

Overall, the product is what it is. There is nothing wrong with it in the right situation, although they have trimmed some corners that I wouldn't have trimmed in their place. However, comparing it to a NetApp or an EMC is to grossly misrepresent the market. This is the equivalent of seeing how many USB drives you can plug in as a storage solution. I've seen this done.

Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support
Julian King wrote:
> Overall, the product is what it is. There is nothing wrong with it in the
> right situation although they have trimmed some corners that I wouldn't
> have trimmed in their place. However, comparing it to a NetApp or an EMC
> is to grossly misrepresent the market.

I don't think that is what they were doing. I think they were trying to point out that they had $X budget and wanted to buy Y PB of storage, and building their own was cheaper than buying it. No surprise there! However, they don't show their R&D costs. I'm sure the designers don't work for nothing, although to their credit they do share the hardware design and have made it open source. They also mention that www.protocase.com will make the cases for you, so if you want to build your own then you have no R&D costs.

I would love to know why they did not use ZFS.

> This is the equivalent of seeing how many USB drives you can plug in as a
> storage solution. I've seen this done.

--
Trevor Pretty | +64 9 639 0652 | +64 21 666 161
Eagle Technology Group Ltd.
Gate D, Alexandra Park, Greenlane West, Epsom
Private Bag 93211, Parnell, Auckland
www.eagle.co.nz

This email is confidential and may be legally privileged. If received in error please destroy and immediately notify us.
Probably due to the lack of port multiplier support. Or perhaps they run software for monitoring that only works on Linux.

Sent from my iPhone

On Sep 2, 2009, at 4:33 PM, Trevor Pretty <trevor_pretty at eagle.co.nz> wrote:

> I don't think that is what they were doing. I think they were trying to
> point out they had $X budget and wanted to buy Y PB of storage and
> building their own was cheaper than buying it. No surprise there!
> However they don't show their R&D costs. I'm sure the designers don't
> work for nothing, although to their credit they do share the H/W design
> and have made it open source. They also mention www.protocase.com will
> make them for you so if you want to build your own then you have no
> R&D costs.
>
> I would love to know why they did not use ZFS.
On Sep 2, 2009, at 19:45, Michael Shadle wrote:
> Probably due to the lack of port multiplier support. Or perhaps they
> run software for monitoring that only works on Linux.

Said support was committed only two to three weeks ago:

> PSARC/2009/394 SATA Framework Port Multiplier Support
> 6422924 sata framework has to support port multipliers
> 6691950 ahci driver needs to support SIL3726/4726 SATA port multiplier

http://mail.opensolaris.org/pipermail/onnv-notify/2009-August/010084.html

If the rest of their stack is also Linux, then it would be natural for their storage nodes to run it as well.
Bill Moore <Bill.Moore <at> sun.com> writes:
> Moving on, modern high-capacity SATA drives are in the 100-120MB/s
> range. Let's call it 125MB/s for easier math. A 5-port port multiplier
> (PM) has 5 links to the drives and 1 uplink. SATA-II speed is 3Gb/s,
> which after all the framing overhead can get you 300MB/s on a good day.
> So 3 drives can more than saturate a PM. 45 disks (9 backplanes at 5
> disks + PM each) in the box won't get you more than about 21 drives'
> worth of performance, tops. So you leave at least half the available
> drive bandwidth on the table, in the best of circumstances. That also
> assumes that the SiI controllers can push 100% of the bandwidth coming
> into them, which would be 300MB/s * 2 ports = 600MB/s, which is getting
> close to a 4x PCIe-gen2 slot.

Wrong. The theoretical bandwidth of an x4 PCI-E v2.0 slot is 2GB/s per direction (5Gbit/s before 8b-10b encoding per lane, times 0.8, times 4), amply sufficient to deal with 600MB/s. However, they don't have this kind of slot; they have x2 PCI-E v1.0 slots (500MB/s per direction). Moreover, the SiI3132 defaults to a MAX_PAYLOAD_SIZE of 128 bytes, so my guess is that each 2-port SATA card is only able to provide 60% of the theoretical throughput [1], or about 300MB/s. Then they have 3 such cards: a total throughput of 900MB/s.

Finally, the 4th SATA card (with 4 ports) is in a 32-bit 33MHz PCI slot (not PCI-E). In practice such a bus can only provide a usable throughput of about 100MB/s (out of 133MB/s theoretical).

All the bottlenecks are obviously the PCI-E links and the PCI bus. So in conclusion, my SBNSWAG (scientific but not so wild-ass guess) is that the max I/O throughput when reading from all the disks on one of their storage pods is about 1000MB/s. This is poor compared to a Thumper, for example, but the most important factor for them was GB/$, not GB/sec. And they did a terrific job at that!

> And I'd re-iterate what myself and others have observed about SiI and
> silent data corruption over the years.

Irrelevant, because it seems they have built fault-tolerance higher in the stack, à la Google. Commodity hardware + reliable software = a great combo.

[1] http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/

-mrb
Marc Bevand <m.bevand <at> gmail.com> writes:
> So in conclusion, my SBNSWAG (scientific but not so wild-ass guess)
> is that the max I/O throughput when reading from all the disks on
> one of their storage pods is about 1000MB/s.

Correction: the SiI3132s are on x1 (not x2) links, so my guess as to the aggregate throughput when reading from all the disks is: 3*150 + 100 = 550MB/s. (150MB/s is 60% of the max theoretical 250MB/s bandwidth of an x1 link.)

And if they tuned MAX_PAYLOAD_SIZE to allow the 3 PCI-E SATA cards to exploit closer to the max theoretical bandwidth of an x1 PCI-E link, it would be: 3*250 + 100 = 850MB/s.

-mrb
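For reference, here is the corrected estimate as a small Python sketch. It assumes Marc's figures above (three 2-port SiI3132 cards on x1 PCI-E v1.0 links at 250MB/s theoretical each, roughly 60% efficiency at the default 128-byte MAX_PAYLOAD_SIZE, and about 100MB/s usable on the legacy PCI slot); the efficiency factor and the totals are guesses from this thread, not benchmarks of the actual pod.

# Aggregate pod read-throughput estimate. All constants are this thread's
# guesses about the Backblaze pod, not measured values.

PCIE_X1_MBPS = 250          # theoretical x1 PCI-E v1.0 bandwidth per direction
DEFAULT_EFFICIENCY = 0.60   # assumed with the default 128-byte MAX_PAYLOAD_SIZE
TUNED_EFFICIENCY = 1.00     # optimistic upper bound after tuning the payload size
PCIE_CARDS = 3              # three 2-port SiI3132 cards, one per x1 link
PCI_BUS_MBPS = 100          # practical throughput of the 32-bit/33MHz PCI slot

def pod_read_mbps(efficiency: float) -> float:
    """Sum of the three x1 PCI-E links plus the legacy PCI card."""
    return PCIE_CARDS * PCIE_X1_MBPS * efficiency + PCI_BUS_MBPS

print(f"default MAX_PAYLOAD_SIZE: {pod_read_mbps(DEFAULT_EFFICIENCY):.0f} MB/s")  # ~550
print(f"tuned MAX_PAYLOAD_SIZE  : {pod_read_mbps(TUNED_EFFICIENCY):.0f} MB/s")    # ~850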
On Fri, Sep 4, 2009 at 5:36 AM, Marc Bevand <m.bevand at gmail.com> wrote:
> Correction: the SiI3132s are on x1 (not x2) links, so my guess as to
> the aggregate throughput when reading from all the disks is:
> 3*150 + 100 = 550MB/s.
> (150MB/s is 60% of the max theoretical 250MB/s bandwidth of an x1 link.)
>
> And if they tuned MAX_PAYLOAD_SIZE to allow the 3 PCI-E SATA cards
> to exploit closer to the max theoretical bandwidth of an x1 PCI-E
> link, it would be: 3*250 + 100 = 850MB/s.

What's the point of arguing about what the back-end can do anyway? This is bulk data storage. Their MAX input is ~100MB/sec. The back-end can more than satisfy that. Who cares at that point whether it can push 500MB/s or 5000MB/s? It's not a database processing transactions. It only needs to be able to push as fast as the front-end can go.

--Tim
Tim Cook <tim <at> cook.ms> writes:
> What's the point of arguing about what the back-end can do anyway? This is
> bulk data storage. Their MAX input is ~100MB/sec. The back-end can more
> than satisfy that. Who cares at that point whether it can push 500MB/s or
> 5000MB/s? It's not a database processing transactions. It only needs to be
> able to push as fast as the front-end can go. --Tim

True, what they have is sufficient to match GbE speed. But internal I/O throughput matters for resilvering RAID arrays, scrubbing, local data analysis/processing, etc. In their case they have three 15-drive RAID6 arrays per pod. If their layout is optimal, they put 5 drives on the PCI bus (to minimize this number) and 10 drives behind PCI-E links per array, so the PCI bus's ~100MB/s practical bandwidth is shared by 5 drives, giving 20MB/s per (1.5TB) drive, so it is going to take a minimum of 20.8 hours to resilver one of their arrays.

-mrb
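The resilver arithmetic above as a tiny Python sketch. Note that the 5-drives-on-the-PCI-bus layout is only the hypothesis stated in this thread, and the 100MB/s bus figure is an estimate, not a measurement of the actual pod.

# Minimum resilver-time estimate for the hypothesized layout: 5 of the 15
# drives in a RAID6 array share the legacy 32-bit/33MHz PCI bus.

PCI_BUS_MBPS = 100             # assumed practical bandwidth of the shared PCI bus
DRIVES_ON_PCI_PER_ARRAY = 5    # hypothesized drives of one array on that bus
DRIVE_CAPACITY_MB = 1_500_000  # 1.5TB drive, in MB

per_drive_mbps = PCI_BUS_MBPS / DRIVES_ON_PCI_PER_ARRAY     # 20 MB/s
resilver_hours = DRIVE_CAPACITY_MB / per_drive_mbps / 3600  # full rewrite of one drive

print(f"per-drive rate   : {per_drive_mbps:.0f} MB/s")
print(f"minimum resilver : {resilver_hours:.1f} hours")     # ~20.8 hours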
On Sat, Sep 5, 2009 at 12:30 AM, Marc Bevand <m.bevand at gmail.com> wrote:
> True, what they have is sufficient to match GbE speed. But internal I/O
> throughput matters for resilvering RAID arrays, scrubbing, local data
> analysis/processing, etc. In their case they have three 15-drive RAID6
> arrays per pod. If their layout is optimal, they put 5 drives on the PCI
> bus (to minimize this number) and 10 drives behind PCI-E links per array,
> so the PCI bus's ~100MB/s practical bandwidth is shared by 5 drives,
> giving 20MB/s per (1.5TB) drive, so it is going to take a minimum of
> 20.8 hours to resilver one of their arrays.

But none of that matters. The data is replicated at a higher layer, combined with RAID6. They'd have to see a triple disk failure across multiple arrays at the same time... They aren't concerned with performance; the home users they're backing up aren't ever going to get anything remotely close to GigE speeds. The absolute BEST case scenario *MIGHT* push 20Mbit if the end user is lucky enough to have FiOS or DOCSIS 3.0 in their area and has large files with a clean link. Even rebuilding two failed disks, that setup will push 2MB/sec all day long.

--Tim