Interesting blog:

http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/

Regards,

--
Al Hopper  Logical Approach Inc, Plano, TX  al at logical-approach.com
           Voice: 972.379.2133  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Yeah, I wrote them about it. I said they should sell them, and even better, pair them with their offsite backup service, kind of like a massive appliance-and-service option.

They're not selling them, but they did encourage me to just make a copy of the design. It looks like the only questionable piece in it is the port multipliers - SiI3726 if I recall - which I think is only just becoming supported in the most recent snvs? That's been something I've been wanting forever anyway.

You could also just design your own case that is optimized for a bunch of disks, plus a mobo, as long as it has ECC support and enough PCI/PCI-X/PCIe slots for the number of cards you need to add. You might be able to build one without port multipliers and just use a bunch of 8-, 12-, or 16-port SATA controllers.

I want to design a case that has two layers - an internal layer with all the drives and guts, and an external layer that pushes air around it to exhaust it quietly and adds additional noise dampening...

Sent from my iPhone

On Sep 2, 2009, at 11:01 AM, Al Hopper <al at logical-approach.com> wrote:

> Interesting blog:
>
> http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/
As some Sun folks pointed out:

1) No redundancy at the power or networking side
2) Getting 2TB drives in a x4540 would make the numbers closer
3) Performance isn't going to be that great with their design, but... they might not need it

On 9/2/2009 2:13 PM, Michael Shadle wrote:
> Yeah I wrote them about it. I said they should sell them and even
> better pair it with their offsite backup service kind of like a
> massive appliance and service option.
>
> They're not selling them but did encourage me to just make a copy of
> it. It looks like the only questionable piece in it is the port
> multipliers. SiI3726 if I recall. Which I think just barely is
> becoming supported in the most recent snvs?
> As some Sun folks pointed out
>
> 1) No redundancy at the power or networking side
> 2) Getting 2TB drives in a x4540 would make the numbers closer
> 3) Performance isn't going to be that great with their design but...they
> might not need it.

4) Silicon Image chipsets. Their SATA controller chips, used on a variety of mainboards, are already well known for their unreliability and data corruption. I'd not want a whole bunch of SiI chips handling 67TB.

-mg
Mario Goebbels wrote:
>> As some Sun folks pointed out
>>
>> 1) No redundancy at the power or networking side
>> 2) Getting 2TB drives in a x4540 would make the numbers closer
>> 3) Performance isn't going to be that great with their design but...they
>> might not need it.
>
> 4) Silicon Image chipsets. Their SATA controller chips used on a
> variety of mainboards are already well known for their unreliability
> and data corruption. I'd not want a whole bunch of SiI chips handle 67TB.

5) Where's the ECC RAM?
6) Management interface? Lustre + ZFS... I'm already bouncing around ideas with others about an open "Fishworks". Maybe this is the boost we needed to justify sponsoring some of the development...

Anyone interested?

./C

------
CTO PathScale // Open source developer
Follow me - http://www.twitter.com/CTOPathScale
blog: http://www.codestrom.com
Torrey McMahon wrote:
> 3) Performance isn't going to be that great with their design but...they
> might not need it.

Would you be able to qualify this assertion? Thinking through it a bit, even if the disks are better than average and can achieve 1000Mb/s each, each uplink from the multiplier to the controller will still have 1000Gb/s to spare in the slowest SATA mode out there. With (5) disks per multiplier * (2) multipliers * 1000GB/s each, that's 10000Gb/s at the PCI-e interface, which approximately coincides with a meager 4x PCI-e slot.
IMHO it depends on the usage model. Mine is for home storage. A couple of HD streams at most. 40MB/sec over a gigabit network switch is pretty good with me.

On Wed, Sep 2, 2009 at 11:54 AM, Jacob Ritorto <Jacob.Ritorto at gmail.com> wrote:
> Torrey McMahon wrote:
>> 3) Performance isn't going to be that great with their design but...they
>> might not need it.
>
> Would you be able to qualify this assertion? Thinking through it a bit,
> even if the disks are better than average and can achieve 1000Mb/s each,
> each uplink from the multiplier to the controller will still have 1000Gb/s
> to spare in the slowest SATA mode out there. With (5) disks per multiplier
> * (2) multipliers * 1000GB/s each, that's 10000Gb/s at the PCI-e interface,
> which approximately coincides with a meager 4x PCI-e slot.
Jacob,

Jacob Ritorto schrieb:
> Torrey McMahon wrote:
>> 3) Performance isn't going to be that great with their design
>> but...they might not need it.
>
> Would you be able to qualify this assertion? Thinking through it a bit,
> even if the disks are better than average and can achieve 1000Mb/s each,
> each uplink from the multiplier to the controller will still have
> 1000Gb/s to spare in the slowest SATA mode out there. With (5) disks
> per multiplier * (2) multipliers * 1000GB/s each, that's 10000Gb/s at
> the PCI-e interface, which approximately coincides with a meager 4x
> PCI-e slot.

They use an $85 PC motherboard. That does not have "meager 4x PCI-e slots"; it has one 16x and three *1x* PCIe slots, plus 3 PCI slots (remember, long time ago: 32-bit wide, 33 MHz, probably a shared bus).

Also, it seems that all external traffic uses the single GbE motherboard port.

  -- Roland

--
**********************************************************
Roland Rambau                 Platform Technology Team
Principal Field Technologist  Global Systems Engineering
Phone: +49-89-46008-2520      Mobile: +49-172-84 58 129
Fax:   +49-89-46008-2222      mailto:Roland.Rambau at sun.com
**********************************************************
   Sitz der Gesellschaft: Sun Microsystems GmbH,
   Sonnenallee 1, D-85551 Kirchheim-Heimstetten
   Amtsgericht München: HRB 161028;  Geschäftsführer:
   Thomas Schröder, Wolfgang Engels, Wolf Frenkel
   Vorsitzender des Aufsichtsrates:   Martin Häring
******* UNIX ********* /bin/sh ******** FORTRAN **********
On Wed, Sep 02, 2009 at 02:54:42PM -0400, Jacob Ritorto wrote:
> Torrey McMahon wrote:
>> 3) Performance isn't going to be that great with their design
>> but...they might not need it.
>
> Would you be able to qualify this assertion? Thinking through it a bit,
> even if the disks are better than average and can achieve 1000Mb/s each,
> each uplink from the multiplier to the controller will still have
> 1000Gb/s to spare in the slowest SATA mode out there. With (5) disks
> per multiplier * (2) multipliers * 1000GB/s each, that's 10000Gb/s at
> the PCI-e interface, which approximately coincides with a meager 4x
> PCI-e slot.

Let's look at the math. First, I don't know how 5 * 2 * 1000GB/s equals 10000Gb/s, or how a 4x PCIe-gen2 slot, which can't really push a 10Gb/s Ethernet NIC, can do 1000x that.

Moving on, modern high-capacity SATA drives are in the 100-120MB/s range. Let's call it 125MB/s for easier math. A 5-port port multiplier (PM) has 5 links to the drives and 1 uplink. SATA-II speed is 3Gb/s, which after all the framing overhead can get you 300MB/s on a good day. So 3 drives can more than saturate a PM. 45 disks (9 backplanes at 5 disks + PM each) in the box won't get you more than about 21 drives' worth of performance, tops. So you leave at least half the available drive bandwidth on the table, in the best of circumstances.

That also assumes that the SiI controllers can push 100% of the bandwidth coming into them, which would be 300MB/s * 2 ports = 600MB/s, which is getting close to a 4x PCIe-gen2 slot. Frankly, I'd be surprised. And the card that uses 3 of the 4 ports has to do more like 900MB/s, which is greater than 4x PCIe-gen2 can pull off in the real world.

And I'd re-iterate what myself and others have observed about SiI and silent data corruption over the years.

Most of your data, most of the time, it would seem.

--Bill
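To make the arithmetic above easy to check, here is a minimal back-of-the-envelope sketch in Python. The constants are the figures assumed in this thread (125MB/s per drive, roughly 300MB/s usable per SATA-II port-multiplier uplink, 9 backplanes of 5 drives each), not measured numbers for the Backblaze pod.

# Port-multiplier bottleneck sketch. All constants are this thread's
# assumptions, not measurements.

DRIVE_MBPS = 125        # assumed per high-capacity SATA drive
PM_UPLINK_MBPS = 300    # assumed usable SATA-II bandwidth per PM uplink
DRIVES_PER_PM = 5
BACKPLANES = 9          # 9 backplanes x 5 drives = 45 drives

raw = BACKPLANES * DRIVES_PER_PM * DRIVE_MBPS                           # 5625 MB/s raw
capped = BACKPLANES * min(DRIVES_PER_PM * DRIVE_MBPS, PM_UPLINK_MBPS)   # 2700 MB/s

print(f"raw drive bandwidth   : {raw} MB/s")
print(f"PM-limited ceiling    : {capped} MB/s")
print(f"equivalent drive count: {capped / DRIVE_MBPS:.1f} of 45")   # ~21.6 drives
print(f"bandwidth left unused : {1 - capped / raw:.0%}")            # ~52%

Running it reproduces the "about 21 drives" and "at least half the bandwidth on the table" figures quoted above.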
On Wed, Sep 2, 2009 at 12:12 PM, Roland Rambau <Roland.Rambau at sun.com> wrote:
> They use an $85 PC motherboard. That does not have "meager 4x PCI-e slots";
> it has one 16x and three *1x* PCIe slots, plus 3 PCI slots (remember, long
> time ago: 32-bit wide, 33 MHz, probably a shared bus).
>
> Also, it seems that all external traffic uses the single GbE motherboard
> port.

Probably for their usage patterns, these boxes make sense. But I concur that the reliability and performance would be very suspect to any organization which values its data in any fashion.

Personally, I have some old dual P3 systems still running fine at home, on what were cheap motherboards. But would I advocate such a system to protect business data? Not a chance.

I'm sure at the price they offer storage, this was the only way they could be profitable, and it's a pretty creative solution. For my personal data backups, I'm sure their service would meet all my needs, but that's about as far as I would trust these systems: MP3s and backups of photos for which I already maintain a couple of copies.

--
Brent Jones
brent at servuhome.net
On Sep 2, 2009, at 11:54 AM, Jacob Ritorto wrote:
> Torrey McMahon wrote:
>> 3) Performance isn't going to be that great with their design
>> but...they might not need it.
>
> Would you be able to qualify this assertion? Thinking through it a
> bit, even if the disks are better than average and can achieve
> 1000Mb/s each, each uplink from the multiplier to the controller
> will still have 1000Gb/s to spare in the slowest SATA mode out
> there. With (5) disks per multiplier * (2) multipliers * 1000GB/s
> each, that's 10000Gb/s at the PCI-e interface, which approximately
> coincides with a meager 4x PCI-e slot.

That doesn't matter. It does HTTP PUT/GET, so it is completely limited by the network interface. The advantage to their model is that they are not required to implement a POSIX file system. PUT/GET is very easy to implement and tends to involve large transfers. In other words, they aren't running an OLTP database, no user-level quotas, no directories with millions of files, etc. The simple life can be good :-)

I'd be more interested in seeing their field failure rate data :-)

FWIW, bringing such a product to a global market would raise the list price to be on par with the commercially available products. Testing, qualifying, service, documentation, warranty, marketing, distribution, taxes, sales, and all sorts of other costs add up quickly.

-- richard
On Sep 2, 2009, at 14:48, C. Bergström wrote:
> 5) Where's the ECC RAM?
> 6) Management interface? Lustre + ZFS... I'm already bouncing
> around ideas with others about an open "Fishworks". Maybe this is
> the boost we needed to justify sponsoring some of the development...
> Anyone interested?

Redundancy is handled on the software side (a la Google). From Backblaze's Tim Nufire:

> ... on redundant power, it's easy to swap out the 2 PSUs in the
> current design with a 3+1 redundant unit. This adds a couple hundred
> dollars to the cost and since we built redundancy into our software
> layer we don't need it. Our goal was dumb hardware, smart software.

http://storagemojo.com/2009/09/01/cloud-storage-for-100-a-terabyte/#comment-204892

The design goal was cheap space. The same comment also states that only one of the six fans actually needs to be running to handle cooling.

I think a lot of people seem to be critiquing the "Blazebox Pod" on criteria that it wasn't meant to meet. It solved their problem (oodles of storage) at about a magnitude less cost than the closest alternatives. If you want redundancy and integrity, you do it higher in the stack.
On Sep 2, 2009, at 15:14, Bill Moore wrote:
> And I'd re-iterate what myself and others have observed about SiI and
> silent data corruption over the years.
>
> Most of your data, most of the time, it would seem.

Unless you have two or three or nine of these things and you spread data around. For the $1M that they claim a petabyte from Sun costs, they're able to make nine of their pods.

Just because they don't have redundancy and checksumming on the box doesn't mean it doesn't exist higher up in their stack. :)
> Unless you have two or three or nine of these things and you spread data
> around. For the $1M that they claim a petabyte from Sun costs, they're
> able to make nine of their pods.

It is the claim of the cost from Sun that I am sceptical about. I admit that it will be more expensive, and I know that as someone from academia I end up with discounts, but by my reckoning it is about half that price for bulk purchases from Sun. One day (as far as I know, not yet) Sun will release 1.5 or even 2TB drives and close the gap further. Also, I suspect that I could get something like the Satabeast (regardless of what people think about it) for significantly less than that per petabyte.

> Just because they don't have redundancy and checksumming on the box
> doesn't mean it doesn't exist higher up in their stack. :)

As far as I can work out, the higher up the stack you put the redundancy and checksumming, the more storage you end up needing, which has a cost associated with it.

Overall, the product is what it is. There is nothing wrong with it in the right situation, although they have trimmed some corners that I wouldn't have trimmed in their place. However, comparing it to a NetApp or an EMC is to grossly misrepresent the market. This is the equivalent of seeing how many USB drives you can plug in as a storage solution. I've seen this done.

Julian
--
Julian King
Computer Officer, University of Cambridge, Unix Support
Julian King wrote:
> Overall, the product is what it is. There is nothing wrong with it in the
> right situation although they have trimmed some corners that I wouldn't
> have trimmed in their place. However, comparing it to a NetApp or an EMC
> is to grossly misrepresent the market.

I don't think that is what they were doing. I think they were trying to point out that they had $X budget and wanted to buy Y PB of storage, and building their own was cheaper than buying it. No surprise there! However, they don't show their R&D costs. I'm sure the designers don't work for nothing, although to their credit they do share the hardware design and have made it open source. They also mention that www.protocase.com will make the cases for you, so if you want to build your own then you have no R&D costs.

I would love to know why they did not use ZFS.

> This is the equivalent of seeing how many USB drives you can plug in as a
> storage solution. I've seen this done.

--
Trevor Pretty | +64 9 639 0652 | +64 21 666 161
Eagle Technology Group Ltd.
Gate D, Alexandra Park, Greenlane West, Epsom
Private Bag 93211, Parnell, Auckland
www.eagle.co.nz

This email is confidential and may be legally privileged. If received in error please destroy and immediately notify us.
Probably due to the lack of port multiplier support. Or perhaps they run software for monitoring that only works on Linux.

Sent from my iPhone

On Sep 2, 2009, at 4:33 PM, Trevor Pretty <trevor_pretty at eagle.co.nz> wrote:

> I don't think that is what they were doing. I think they were trying to
> point out they had $X budget and wanted to buy Y PB of storage and
> building their own was cheaper than buying it. No surprise there!
> However they don't show their R&D costs. I'm sure the designers don't
> work for nothing, although to their credit they do share the H/W design
> and have made it open source. They also mention www.protocase.com will
> make them for you so if you want to build your own then you have no
> R&D costs.
>
> I would love to know why they did not use ZFS.
On Sep 2, 2009, at 19:45, Michael Shadle wrote:
> Probably due to the lack of port multiplier support. Or perhaps they
> run software for monitoring that only works on Linux.

Said support was committed only two to three weeks ago:

> PSARC/2009/394 SATA Framework Port Multiplier Support
> 6422924 sata framework has to support port multipliers
> 6691950 ahci driver needs to support SIL3726/4726 SATA port multiplier

http://mail.opensolaris.org/pipermail/onnv-notify/2009-August/010084.html

If the rest of their stack is also Linux, then it would be natural for their storage nodes to run it as well.
Bill Moore <Bill.Moore <at> sun.com> writes:
> Moving on, modern high-capacity SATA drives are in the 100-120MB/s
> range. Let's call it 125MB/s for easier math. A 5-port port multiplier
> (PM) has 5 links to the drives and 1 uplink. SATA-II speed is 3Gb/s,
> which after all the framing overhead can get you 300MB/s on a good day.
> So 3 drives can more than saturate a PM. 45 disks (9 backplanes at 5
> disks + PM each) in the box won't get you more than about 21 drives'
> worth of performance, tops. So you leave at least half the available
> drive bandwidth on the table, in the best of circumstances. That also
> assumes that the SiI controllers can push 100% of the bandwidth coming
> into them, which would be 300MB/s * 2 ports = 600MB/s, which is getting
> close to a 4x PCIe-gen2 slot.

Wrong. The theoretical bandwidth of an x4 PCI-E v2.0 slot is 2GB/s per direction (5Gbit/s before 8b-10b encoding per lane, times 0.8, times 4), amply sufficient to deal with 600MB/s. However, they don't have this kind of slot; they have x2 PCI-E v1.0 slots (500MB/s per direction). Moreover, the SiI3132 defaults to a MAX_PAYLOAD_SIZE of 128 bytes, so my guess is that each 2-port SATA card is only able to provide 60% of the theoretical throughput [1], or about 300MB/s. Then they have 3 such cards: a total throughput of 900MB/s.

Finally, the 4th SATA card (with 4 ports) is in a 32-bit 33MHz PCI slot (not PCI-E). In practice such a bus can only provide a usable throughput of about 100MB/s (out of 133MB/s theoretical).

All the bottlenecks are obviously the PCI-E links and the PCI bus. So in conclusion, my SBNSWAG (scientific but not so wild-ass guess) is that the max I/O throughput when reading from all the disks on one of their storage pods is about 1000MB/s. This is poor compared to a Thumper, for example, but the most important factor for them was GB/$, not GB/sec. And they did a terrific job at that!

> And I'd re-iterate what myself and others have observed about SiI and
> silent data corruption over the years.

Irrelevant, because it seems they have built fault-tolerance higher in the stack, à la Google. Commodity hardware + reliable software = a great combo.

[1] http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/

-mrb
Marc Bevand <m.bevand <at> gmail.com> writes:
> So in conclusion, my SBNSWAG (scientific but not so wild-ass guess)
> is that the max I/O throughput when reading from all the disks on
> one of their storage pods is about 1000MB/s.

Correction: the SiI3132s are on x1 (not x2) links, so my guess as to the aggregate throughput when reading from all the disks is: 3*150 + 100 = 550MB/s. (150MB/s is 60% of the max theoretical 250MB/s bandwidth of an x1 link.)

And if they tuned MAX_PAYLOAD_SIZE to allow the 3 PCI-E SATA cards to exploit closer to the max theoretical bandwidth of an x1 PCI-E link, it would be: 3*250 + 100 = 850MB/s.

-mrb
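For reference, here is the corrected estimate as a small Python sketch. It assumes Marc's figures above (three 2-port SiI3132 cards on x1 PCI-E v1.0 links at 250MB/s theoretical each, roughly 60% efficiency at the default 128-byte MAX_PAYLOAD_SIZE, and about 100MB/s usable on the legacy PCI slot); the efficiency factor and the totals are guesses from this thread, not benchmarks of the actual pod.

# Aggregate pod read-throughput estimate. All constants are this thread's
# guesses about the Backblaze pod, not measured values.

PCIE_X1_MBPS = 250          # theoretical x1 PCI-E v1.0 bandwidth per direction
DEFAULT_EFFICIENCY = 0.60   # assumed with the default 128-byte MAX_PAYLOAD_SIZE
TUNED_EFFICIENCY = 1.00     # optimistic upper bound after tuning the payload size
PCIE_CARDS = 3              # three 2-port SiI3132 cards, one per x1 link
PCI_BUS_MBPS = 100          # practical throughput of the 32-bit/33MHz PCI slot

def pod_read_mbps(efficiency: float) -> float:
    """Sum of the three x1 PCI-E links plus the legacy PCI card."""
    return PCIE_CARDS * PCIE_X1_MBPS * efficiency + PCI_BUS_MBPS

print(f"default MAX_PAYLOAD_SIZE: {pod_read_mbps(DEFAULT_EFFICIENCY):.0f} MB/s")  # ~550
print(f"tuned MAX_PAYLOAD_SIZE  : {pod_read_mbps(TUNED_EFFICIENCY):.0f} MB/s")    # ~850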
On Fri, Sep 4, 2009 at 5:36 AM, Marc Bevand <m.bevand at gmail.com> wrote:
> Correction: the SiI3132s are on x1 (not x2) links, so my guess as to
> the aggregate throughput when reading from all the disks is:
> 3*150 + 100 = 550MB/s.
> (150MB/s is 60% of the max theoretical 250MB/s bandwidth of an x1 link.)
>
> And if they tuned MAX_PAYLOAD_SIZE to allow the 3 PCI-E SATA cards
> to exploit closer to the max theoretical bandwidth of an x1 PCI-E
> link, it would be: 3*250 + 100 = 850MB/s.

What's the point of arguing about what the back-end can do anyway? This is bulk data storage. Their MAX input is ~100MB/sec. The back-end can more than satisfy that. Who cares at that point whether it can push 500MB/s or 5000MB/s? It's not a database processing transactions. It only needs to be able to push as fast as the front-end can go.

--Tim
Tim Cook <tim <at> cook.ms> writes:
> What's the point of arguing about what the back-end can do anyway? This is
> bulk data storage. Their MAX input is ~100MB/sec. The back-end can more
> than satisfy that. Who cares at that point whether it can push 500MB/s or
> 5000MB/s? It's not a database processing transactions. It only needs to be
> able to push as fast as the front-end can go. --Tim

True, what they have is sufficient to match GbE speed. But internal I/O throughput matters for resilvering RAID arrays, scrubbing, local data analysis/processing, etc. In their case they have three 15-drive RAID6 arrays per pod. If their layout is optimal, they put 5 drives on the PCI bus (to minimize this number) and 10 drives behind PCI-E links per array, so the PCI bus's ~100MB/s practical bandwidth is shared by 5 drives, giving 20MB/s per (1.5TB) drive, so it is going to take a minimum of 20.8 hours to resilver one of their arrays.

-mrb
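The resilver arithmetic above as a tiny Python sketch. Note that the 5-drives-on-the-PCI-bus layout is only the hypothesis stated in this thread, and the 100MB/s bus figure is an estimate, not a measurement of the actual pod.

# Minimum resilver-time estimate for the hypothesized layout: 5 of the 15
# drives in a RAID6 array share the legacy 32-bit/33MHz PCI bus.

PCI_BUS_MBPS = 100             # assumed practical bandwidth of the shared PCI bus
DRIVES_ON_PCI_PER_ARRAY = 5    # hypothesized drives of one array on that bus
DRIVE_CAPACITY_MB = 1_500_000  # 1.5TB drive, in MB

per_drive_mbps = PCI_BUS_MBPS / DRIVES_ON_PCI_PER_ARRAY     # 20 MB/s
resilver_hours = DRIVE_CAPACITY_MB / per_drive_mbps / 3600  # full rewrite of one drive

print(f"per-drive rate   : {per_drive_mbps:.0f} MB/s")
print(f"minimum resilver : {resilver_hours:.1f} hours")     # ~20.8 hours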
On Sat, Sep 5, 2009 at 12:30 AM, Marc Bevand <m.bevand at gmail.com> wrote:
> True, what they have is sufficient to match GbE speed. But internal I/O
> throughput matters for resilvering RAID arrays, scrubbing, local data
> analysis/processing, etc. In their case they have three 15-drive RAID6
> arrays per pod. If their layout is optimal, they put 5 drives on the PCI
> bus (to minimize this number) and 10 drives behind PCI-E links per array,
> so the PCI bus's ~100MB/s practical bandwidth is shared by 5 drives,
> giving 20MB/s per (1.5TB) drive, so it is going to take a minimum of
> 20.8 hours to resilver one of their arrays.

But none of that matters. The data is replicated at a higher layer, combined with RAID6. They'd have to see a triple disk failure across multiple arrays at the same time... They aren't concerned with performance; the home users they're backing up aren't ever going to get anything remotely close to GigE speeds. The absolute BEST case scenario *MIGHT* push 20Mbit if the end user is lucky enough to have FiOS or DOCSIS 3.0 in their area and has large files with a clean link. Even rebuilding two failed disks, that setup will push 2MB/sec all day long.

--Tim