Greetings,

Has anyone out there built a 1-petabyte pool?  I've been asked to look
into this, and was told "low performance" is fine, workload is likely
to be write-once, read-occasionally, archive storage of gene sequencing
data.  Probably a single 10Gbit NIC for connectivity is sufficient.

We've had decent success with the 45-slot, 4U SuperMicro SAS disk chassis,
using 4TB "nearline SAS" drives, giving over 100TB usable space (raidz3).
Back-of-the-envelope might suggest stacking up eight to ten of those,
depending if you want a "raw marketing petabyte", or a proper "power-of-two
usable petabyte".

I get a little nervous at the thought of hooking all that up to a single
server, and am a little vague on how much RAM would be advisable, other
than "as much as will fit" (:-).  Then again, I've been waiting for
something like pNFS/NFSv4.1 to be usable for gluing together multiple
NFS servers into a single global namespace, without any sign of that
happening anytime soon.

So, has anyone done this?  Or come close to it?  Thoughts, even if you
haven't done it yourself?

Thanks and regards,

Marion
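As a sanity check on the back-of-the-envelope sizing above: the ~100 TB
usable per chassis is Marion's figure; the rest is just arithmetic, shown
here as a small bc sketch.

    # Chassis count for a "marketing" petabyte (10^15 bytes) vs. a
    # power-of-two petabyte (2^50 bytes), assuming ~100 TB usable per
    # 45-bay chassis of 4TB drives in raidz3.
    usable_tb=100
    echo "scale=1; 1000 / $usable_tb" | bc              # marketing PB: 10.0 chassis
    echo "scale=1; (2^50 / 10^12) / $usable_tb" | bc    # power-of-two PB: ~11.2 chassis

So the "proper" petabyte costs roughly one more chassis than the marketing
one at this drive size and vdev layout.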
On Fri, Mar 15, 2013 at 06:09:34PM -0700, Marion Hakanson wrote:
> Greetings,
>
> Has anyone out there built a 1-petabyte pool?  I've been asked to look
> into this, and was told "low performance" is fine, workload is likely
> to be write-once, read-occasionally, archive storage of gene sequencing
> data.  Probably a single 10Gbit NIC for connectivity is sufficient.
> [...]
>
> So, has anyone done this?  Or come close to it?  Thoughts, even if you
> haven't done it yourself?

We've come close:

admin@mes-str-imgnx-p1:~$ zpool list
NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
datapool   978T   298T   680T    30%  1.00x  ONLINE  -
syspool    278G   104G   174G    37%  1.00x  ONLINE  -

Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual pathed
to a couple of LSI SAS switches.  Using Nexenta but no reason you couldn't
do this w/ $whatever.

We did triple parity and our vdev membership is set up such that we can
lose up to three JBODs and still be functional (one vdev member disk per
JBOD).  This is with 3TB NL-SAS drives.

Ray
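For concreteness, here is a hypothetical sketch of the layout Ray describes,
where each raidz3 vdev takes exactly one disk from each JBOD so that any
three JBODs can fail without faulting the pool.  The device names and the
twelve-JBOD count are invented for illustration, not Ray's actual
configuration.

    # One raidz3 vdev per drive slot, one member disk per JBOD
    # (placeholder names; cN = JBOD number, tN = slot within the JBOD).
    zpool create datapool \
      raidz3 c1t0d0 c2t0d0 c3t0d0 c4t0d0  c5t0d0  c6t0d0 \
             c7t0d0 c8t0d0 c9t0d0 c10t0d0 c11t0d0 c12t0d0 \
      raidz3 c1t1d0 c2t1d0 c3t1d0 c4t1d0  c5t1d0  c6t1d0 \
             c7t1d0 c8t1d0 c9t1d0 c10t1d0 c11t1d0 c12t1d0
      # ...and so on, one raidz3 line per remaining slot

In this scheme the raidz3 width simply equals the JBOD count, which is what
makes the "lose any three shelves" property fall out for free.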
On Fri, Mar 15, 2013 at 7:09 PM, Marion Hakanson <hakansom@ohsu.edu> wrote:
> Has anyone out there built a 1-petabyte pool?

I'm not advising against your building/configuring a system yourself,
but I suggest taking a look at the "Petarack":

http://www.aberdeeninc.com/abcatg/petarack.htm

It shows it's been done with ZFS :-).

Jan
rvandolson@esri.com said:
> We've come close:
>
> admin@mes-str-imgnx-p1:~$ zpool list
> NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
> datapool   978T   298T   680T    30%  1.00x  ONLINE  -
> syspool    278G   104G   174G    37%  1.00x  ONLINE  -
>
> Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual pathed to
> a couple of LSI SAS switches.

Thanks Ray,

We've been looking at those too (we've had good luck with our MD1200's).

How many HBA's in the R720?

Thanks and regards,

Marion
On Fri, Mar 15, 2013 at 06:31:11PM -0700, Marion Hakanson wrote:
> [...]
> We've been looking at those too (we've had good luck with our MD1200's).
>
> How many HBA's in the R720?

We have qty 2 LSI SAS 9201-16e HBA's (Dell resold [1]).

Ray

[1] http://accessories.us.dell.com/sna/productdetail.aspx?c=us&l=en&s=hied&cs=65&sku=a4614101
Ray said:
>>> Using a Dell R720 head unit, plus a bunch of Dell MD1200 JBODs dual pathed
>>> to a couple of LSI SAS switches.

Marion said:
>> How many HBA's in the R720?

Ray said:
> We have qty 2 LSI SAS 9201-16e HBA's (Dell resold [1]).

Sounds similar in approach to the Aberdeen product another sender referred
to, with SAS switch layout:

http://www.aberdeeninc.com/images/1-up-petarack2.jpg

One concern I had is that I compared our SuperMicro JBOD with 40x 4TB drives
in it, connected via a dual-port LSI SAS 9200-8e HBA, to the same pool layout
on a 40-slot server with 40x SATA drives in it.  But the server uses no SAS
expanders, instead using SAS-to-SATA octopus cables to connect the drives
directly to three internal SAS HBA's (2x 9201-16i's, 1x 9211-8i).

What I found was that the internal pool was significantly faster for both
sequential and random I/O than the pool on the external JBOD.

My conclusion was that I would not want to exceed ~48 drives on a single
8-port SAS HBA.  So I thought that running the I/O of all your hundreds
of drives through only two HBA's would be a bottleneck.

LSI's specs say 4800MBytes/sec for an 8-port SAS HBA, but 4000MBytes/sec
for that card in an x8 PCIe-2.0 slot.  Sure, the newer 9207-8e is rated
at 8000MBytes/sec in an x8 PCIe-3.0 slot, but it still has only the same
8 SAS ports going at 4800MBytes/sec.

Yes, I know the disks probably can't go that fast.  But in my tests
above, the internal 40-disk pool measures 2000MBytes/sec sequential
reads and writes, while the external 40-disk JBOD measures at 1500
to 1700 MBytes/sec.  Not a lot slower, but significantly slower, so
I do think the number of HBA's makes a difference.

At the moment, I'm leaning toward piling six, eight, or ten HBA's into
a server, preferably one with dual IOH's (thus two PCIe busses), and
connecting dual-path JBOD's in that manner.

I hadn't looked into SAS switches much, but they do look more reliable
than daisy-chaining a bunch of JBOD's together.  I just haven't seen
how to get more bandwidth through them to a single host.

Regards,

Marion
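Rough numbers on the bottleneck Marion describes above, using only the
datasheet figures already quoted; the ~250-drive count is an assumption for
a petabyte of 4TB drives, and this is a ceiling sketch, not a measurement:

    echo "8 * 600" | bc                 # 8 SAS2 lanes x 600 MB/s = 4800 MB/s per HBA
    echo "2 * 4000" | bc                # two HBAs in x8 PCIe-2.0 slots: ~8000 MB/s ceiling
    echo "scale=0; 2 * 4000 / 250" | bc # ~32 MB/s per drive if ~250 drives share them
    echo "scale=0; 8 * 4000 / 250" | bc # ~128 MB/s per drive with eight such HBAs

Even at a conservative 100 MB/s of streaming bandwidth per NL-SAS drive, two
HBAs would be the choke point for sequential work, which is the motivation
for spreading the JBODs across six to ten HBAs.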
On Mar 15, 2013, at 6:09 PM, Marion Hakanson <hakansom@ohsu.edu> wrote:
> Greetings,
>
> Has anyone out there built a 1-petabyte pool?

Yes, I've done quite a few.

> I've been asked to look
> into this, and was told "low performance" is fine, workload is likely
> to be write-once, read-occasionally, archive storage of gene sequencing
> data.  Probably a single 10Gbit NIC for connectivity is sufficient.
>
> We've had decent success with the 45-slot, 4U SuperMicro SAS disk chassis,
> using 4TB "nearline SAS" drives, giving over 100TB usable space (raidz3).
> Back-of-the-envelope might suggest stacking up eight to ten of those,
> depending if you want a "raw marketing petabyte", or a proper "power-of-two
> usable petabyte".

Yes.  NB, for the PHB, using N^2 is found 2B less effective than N^10.

> I get a little nervous at the thought of hooking all that up to a single
> server, and am a little vague on how much RAM would be advisable, other
> than "as much as will fit" (:-).  Then again, I've been waiting for
> something like pNFS/NFSv4.1 to be usable for gluing together multiple
> NFS servers into a single global namespace, without any sign of that
> happening anytime soon.

NFS v4 or DFS (or even clever sysadmin + automount) offers single namespace
without needing the complexity of NFSv4.1, lustre, glusterfs, etc.

> So, has anyone done this?  Or come close to it?  Thoughts, even if you
> haven't done it yourself?

Don't forget about backups :-)
 -- richard

--
Richard.Elling@RichardElling.com
+1-760-896-4422
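For anyone unfamiliar with the "clever sysadmin + automount" approach
Richard mentions, a minimal sketch of an indirect automounter map follows.
The map name, server names, and export paths are invented for illustration;
clients see one /projects tree regardless of which NFS server actually
holds each project.

    # /etc/auto_master entry:
    /projects   auto_projects   -nobrowse

    # /etc/auto_projects (indirect map); each key can live on a different server:
    genomics1   -rw   nfs1:/export/genomics1
    genomics2   -rw   nfs2:/export/genomics2
    imaging     -rw   nfs3:/export/imaging

NFSv4 referrals, which come up later in the thread, attack the same problem
from the server side by redirecting clients to whichever server holds the
data.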
I know it's heresy these days, but given the I/O throughput you're looking
for and the amount you're going to spend on disks, a T5-2 could make sense
when they're released (I think) later this month.

Crucial sells RAM they guarantee for use in SPARC T-series, and since you're
at an edu the academic discount is 35%.  So a T4-2 with 512GB RAM could be
had for under $35K shortly after release, 4-5 months before the E5 Xeon was
released.  It seemed a surprisingly good deal to me.

The T5-2 has 32x 3.6GHz cores, 256 threads and ~150GB/s aggregate memory
bandwidth.  In my testing a T4-1 can compete with a 12-core E5 box on I/O
and memory bandwidth, and this thing is about 5 times bigger than the T4-1.
It should have at least 10 PCIe slots and will take 32 DIMMs minimum, maybe
64.  And it is likely to cost you less than $50K with aftermarket RAM.

-- Trey

On Mar 15, 2013, at 10:35 PM, Marion Hakanson <hakansom@ohsu.edu> wrote:
> [...]
> At the moment, I'm leaning toward piling six, eight, or ten HBA's into
> a server, preferably one with dual IOH's (thus two PCIe busses), and
> connecting dual-path JBOD's in that manner.
> [...]
hakansom@ohsu.edu said:
>> I get a little nervous at the thought of hooking all that up to a single
>> server, and am a little vague on how much RAM would be advisable, other than
>> "as much as will fit" (:-).  Then again, I've been waiting for something like
>> pNFS/NFSv4.1 to be usable for gluing together multiple NFS servers into a
>> single global namespace, without any sign of that happening anytime soon.

richard.elling@gmail.com said:
> NFS v4 or DFS (or even clever sysadmin + automount) offers single namespace
> without needing the complexity of NFSv4.1, lustre, glusterfs, etc.

Been using NFSv4 since it showed up in Solaris-10 FCS, and it is true that
I've been clever enough (without automount -- I like my computers to be as
deterministic as possible, thank you very much :-) for our NFS clients to
see a single directory-tree namespace which abstracts away the actual
server/location of a particular piece of data.

However, we find it starts getting hard to manage when a single project
(think "directory node") needs more space than their current NFS server
will hold.  Or perhaps what you're getting at above is even more clever
than I have been to date, and is eluding me at the moment.  I did see
someone mention "NFSv4 referrals" recently, maybe that would help.

Plus, believe it or not, some of our customers still insist on having the
server name in their path hierarchy for some reason, like /home/mynfs1/,
/home/mynfs2/, and so on.  Perhaps I've just not been persuasive enough
yet (:-).

richard.elling@gmail.com said:
> Don't forget about backups :-)

I was hoping I could get by with telling them to buy two of everything.

Thanks and regards,

Marion
On 03/16/2013 12:57 AM, Richard Elling wrote:
> On Mar 15, 2013, at 6:09 PM, Marion Hakanson <hakansom@ohsu.edu> wrote:
>> So, has anyone done this?  Or come close to it?  Thoughts, even if you
>> haven't done it yourself?
>
> Don't forget about backups :-)
>  -- richard

Transferring 1 PB over a 10 gigabit link will take at least 10 days when
overhead is taken into account.  The backup system should have a dedicated
10 gigabit link at the minimum and using incremental send/recv will be
extremely important.
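The transfer-time estimate above is easy to verify with line-rate arithmetic
alone; real-world protocol overhead pushes it past the ten-day mark:

    echo "scale=1; 10^15 * 8 / 10^10 / 86400" | bc   # ~9.2 days for 10^15 bytes at a full 10Gb/s

That figure is for the initial full copy; incremental send/recv afterwards
only has to move the daily change rate.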
I just recently built an OpenIndiana 151a7 system that is currently 1/2 PB
and will be expanded to 1 PB as we collect imaging data for the Human
Connectome Project at Washington University in St. Louis.  It is very much
like your use case, as this is an offsite backup system that will write once
and read rarely.  It has displaced a BlueArc DR system because their
mechanisms for syncing over distances could not keep up with our data
generation rate.  The fact that it cost 5x per TB compared to homebrew
helped the decision also.

It is currently 180 4TB SAS Seagate Constellations in 4 Supermicro JBODs.
The JBODs are currently in two branches, only cascading once.  When expanded,
4 JBODs will be on each branch.  The pool is configured as 9 vdevs of 19
drives in raidz3.  The remaining disks are configured as hot spares.
Metadata only is cached in 128GB RAM and 2 480GB Intel 520 SSDs for L2ARC.
Sync (ZIL) is turned off, since the worst that would happen is that we would
need to rerun an rsync job.

Two identical servers were built for a cold-standby configuration.  Since it
is a DR system, the need for a hot standby was ruled out, since even several
hours of downtime would not be an issue.  Each server is fitted with 2 LSI
9207-8e HBAs configured as redundant multipath to the JBODs.

Before putting it into service I ran several iozone tests to benchmark the
pool.  Even with really fat vdevs the performance is impressive.  If you're
interested in that data let me know.  It has many hours of idle time each
day, so additional performance tests are not out of the question either.

Actually, I should say I designed and configured the system.  The system was
assembled by a colleague at UMINN.  If you would like more details on the
hardware, I have a very detailed assembly doc I wrote and would be happy to
share.

The system receives daily rsyncs from our production BlueArc system.  The
rsyncs are split into 120 parallel rsync jobs.  This overcomes the slowdown
TCP suffers from over a high-latency link, and we see total throughput
between 500-700Mb/s.  The BlueArc has 120TB of 15k SAS tiered to NL-SAS.
All metadata is on the SAS pool.  The ZFS system outpaces the BlueArc on
metadata when rsync does its tree walk.

Given all the safeguards built into ZFS, I would not hesitate to build a
production system at the multi-petabyte scale.  If a channel to the disks is
no longer available, it will simply stop writing and the data will be safe.
Given the redundant paths, power supplies, etc., the odds of that happening
are very low.  The single points of failure left when running a single
server remain at the motherboard, CPU and RAM level.  Build a hot standby
server and human error becomes the most likely failure.

-Chip

On Fri, Mar 15, 2013 at 8:09 PM, Marion Hakanson <hakansom@ohsu.edu> wrote:
> Has anyone out there built a 1-petabyte pool?  I've been asked to look
> into this, and was told "low performance" is fine, workload is likely
> to be write-once, read-occasionally, archive storage of gene sequencing
> data.  [...]
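A rough sketch of the parallel rsync fan-out Chip describes above, using
xargs to keep many transfers in flight.  The host name, paths, directory
list, and job count are placeholders, not Chip's actual setup.

    # One rsync per top-level directory, 120 at a time; each job copies
    # bluearc:/vol/data/<dir>/ into the matching directory on the ZFS pool.
    xargs -P 120 -I{} \
        rsync -a bluearc:/vol/data/{}/ /datapool/backup/{}/ \
        < toplevel-dirs.txt

The parallelism, rather than any one fast stream, is what hides the
per-connection TCP latency penalty Chip mentions.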