Greetings all-

I have a new X4200 that I'm getting ready to deploy. It has four 146 GB SAS drives. I'd like to set up the box for maximum redundancy on the data stored on these drives. Unfortunately, it looks like ZFS boot/root aren't really options at this time. The LSI Logic controller in this box only supports either a RAID0 array with all four disks, or a RAID 1 array with two disks--neither of which are very appealing to me.

Ideally I'd like to have at least 300 gigs of storage available to the users, or more if I can do it with something like a RAID 5 setup. My concern, however, is that the boot and root partitions have data redundancy.

How would you set up this box? It's primarily used as a development server, running a myriad of applications.

Thank you-
John
On Tue, 7 Nov 2006, John Tracy wrote:

> Greetings all-
> I have a new X4200 that I'm getting ready to deploy. It has four 146 GB SAS drives. I'd like to set up the box for maximum redundancy on the data stored on these drives. Unfortunately, it looks like ZFS boot/root aren't really options at this time. The LSI Logic controller in this box only supports either a RAID0 array with all four disks, or a RAID 1 array with two disks--neither of which are very appealing to me.
> Ideally I'd like to have at least 300 gigs of storage available to the users, or more if I can do it with something like a RAID 5 setup. My concern, however, is that the boot and root partitions have data redundancy.
> How would you set up this box?
> It's primarily used as a development server, running a myriad of applications.

Since you've posted this to zfs-discuss, I'm assuming that your goal is to find a way to take advantage of ZFS on this box - if at all possible. So I'm going to propose a radical setup that I'm sure many will have issues with and which falls outside conventional/normal best practice, which in this case would be to form 2 mirrors of 2 disks each using the built-in H/W RAID controller and be done. If this is too radical for you, it will at least provide food for thought.

First, your assertion that you want redundancy for root and boot is somewhat flawed. Let me explain: in the "old days", losing root on a box was one of the worst possible user experiences - but that is simply not true today. If you keep the root filesystem pristine (more later) and just save off the config files you modify (/etc/hostname.*, /etc/passwd, /etc/group, /etc/shadow, /etc/hosts, blah, blah) periodically, the root partition can be restored quickly and simply, and then your config files restored. Consider the root partition disposable and replaceable. If you set up the system initially to net boot [1], then your root partition can be restored very quickly from the same set of files you used to load it initially! Since it's being used as a development box, if the root disk dies, you push in a replacement, net boot it and restore your saved config files. Downtime will probably be around 30 minutes, assuming you keep a spare disk handy (in a locked, rack-mount drawer immediately adjacent to the X4200 machine).

Next, I mentioned keeping root pristine. I'll also assume you'll use the Blastwave software repository, which installs software in /opt/csw by default.

So first up, the disk layout config. On the boot disk:

- 16 GB / (root) partition
- 4 to 16 GB swap partition
- 16 GB live upgrade partition
- a small, lightly used /export/home partition
- the rest of this disk will be un-allocated at this time

With the other 3 disks, form a 3-way raidz pool, with the following broad plan for the initial ZFS filesystems you'll place in this pool:

- a filesystem for shared home directories that will be shared into zones
- an additional swap vdev
- a filesystem for your master zone (see below)
- a filesystem for each zone you'll define on this box
- a filesystem for (one or more) junk zone(s)

So now root is still pristine - no supplemental software has been loaded or added. First up, build a "master" zone. It's the master in the sense that it'll be used to clone the real working zones, in which you will do *all* the "real" work. So create a fat zone (create -b), run "netservices limited" within it, add default user accounts, set up DNS, etc.
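(A rough sketch of what the pool-and-master-zone setup above could look like on the command line; the pool name "tank", the dataset names and the device names are placeholders, not taken from Al's message:)

   # 3-way raidz pool on the three non-boot disks
   zpool create tank raidz c0t1d0 c0t2d0 c0t3d0

   # filesystems per the broad plan above
   zfs create tank/home            # shared home directories, shared into zones
   zfs create tank/zones           # parent for the per-zone filesystems
   zfs create tank/zones/master    # the master zone's filesystem

   # the "additional swap vdev": a ZFS volume used as extra swap
   zfs create -V 4g tank/swapvol
   swap -a /dev/zvol/dsk/tank/swapvol

   # a whole-root ("fat") master zone rooted on its own filesystem
   zonecfg -z master "create -b; set zonepath=/tank/zones/master; commit"
   chmod 700 /tank/zones/master
   zoneadm -z master install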
The more effort you put into building/configuring this master zone, the easier it'll be to add work zones to the box.

Now that you have your master zone, use zfs clone to create "fat" zones for use as work areas. Within these work zones you'll install all your Blastwave packages, compilers, tools, etc. You can arrange for the shared home directories to be automatically mounted when a user with a shared home logs into the zone (using the automounter). You'll probably have some users who only have logins in certain zones, etc.

Next, repeat the above and build/clone more work zones on a per-project, per-department or per-whatever-makes-sense basis. You'll apply ZFS quotas on zones where you have concerns about the users gobbling up too much disk space. You'll have one or more junk zones to allow experiments with the system config to be safely isolated. Use zfs send/recv to back up individual zones, or datasets from within zones, to another ZFS server.

If you elect to install Solaris 10, then (my recommendation) wait for Update 3; you'll be able to clone zones by copying them, but not create them from ZFS snapshots. If you install Solaris Express or the latest/greatest OpenSolaris, you'll be able to create zones very quickly from a ZFS snapshot of your master zone - saving you a good deal of time and disk space.

Downsides: There are many. First off, you know that zones on ZFS are not supported (yet). And that applying patches may break the box and render it entirely useless. And that, currently, ZFS does not handle all disk errors gracefully. But all of these downsides will disappear over time, and I believe that the tradeoff, in terms of usability etc., is worth the increased risk of running this radical system config.

In any case, if you want to try this config, it'll take you a couple of hours to build a mockup on the X4200, and you can experiment with it and decide if you can live with it. Email me off-list if you have any questions that you feel are off-topic for the zfs-discuss list.

[1] use JET

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
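(A minimal sketch of the clone-from-snapshot workflow Al describes, on a build where zones can be created from ZFS snapshots; only the ZFS side is shown - the matching zonecfg/zoneadm steps are omitted - and the zone, host and dataset names are placeholders:)

   # snapshot the fully built master zone's dataset
   zfs snapshot tank/zones/master@golden

   # clone it as the starting point for a new per-project work zone
   zfs clone tank/zones/master@golden tank/zones/projA

   # cap the disk space the project zone's users can gobble up
   zfs set quota=40g tank/zones/projA

   # back the zone's dataset up to another ZFS server with send/recv
   zfs snapshot tank/zones/projA@backup1
   zfs send tank/zones/projA@backup1 | ssh backuphost zfs recv tank/backups/projA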
Robert Milkowski
2006-Nov-07 23:22 UTC
[zfs-discuss] Best Practices recommendation on x4200
Hello John,

Tuesday, November 7, 2006, 7:45:46 PM, you wrote:

JT> Greetings all-
JT> I have a new X4200 that I'm getting ready to deploy. It has
JT> four 146 GB SAS drives. I'd like to set up the box for maximum
JT> redundancy on the data stored on these drives. Unfortunately, it
JT> looks like ZFS boot/root aren't really options at this time. The
JT> LSI Logic controller in this box only supports either a RAID0
JT> array with all four disks, or a RAID 1 array with two
JT> disks--neither of which are very appealing to me.
JT> Ideally I'd like to have at least 300 gigs of storage
JT> available to the users, or more if I can do it with something like
JT> a RAID 5 setup. My concern, however, is that the boot
JT> and root partitions have data redundancy.
JT> How would you set up this box?
JT> It's primarily used as a development server, running a myriad of applications.

Use SVM to mirror the system, something like:

   d0   mirror of c0t0d0s0 and c0t1d0s0   /     2GB
   d5   mirror of c0t0d0s1 and c0t1d0s1   /var  2GB
   d10  mirror of c0t2d0s0 and c0t3d0s0   swap  (2+2GB, to match the above)

On all 4 disks create an s4 slice with the rest of the disk; it should be equal on all disks. Then create a raidz pool out of those slices. You should get above 400GB of usable storage.

That way you've got mirrored root disks, mirrored swap on another two disks matching exactly the space used by / and /var, and the rest of the disks for your data on ZFS.

ps. and of course you've got to create small slices for the metadbs.

-- 
Best regards,
 Robert                       mailto:rmilkowski at task.gda.pl
                              http://milek.blogspot.com
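(A rough sketch of the commands behind that layout; the metadevice names follow Robert's example, while the metadb slice (s7), the sizes and the pool name "tank" are placeholders:)

   # state database replicas on a small dedicated slice of each disk
   metadb -a -f -c 2 c0t0d0s7 c0t1d0s7 c0t2d0s7 c0t3d0s7

   # mirrored root (d0): build a one-way mirror, attach the other half after reboot
   metainit -f d1 1 1 c0t0d0s0
   metainit d2 1 1 c0t1d0s0
   metainit d0 -m d1
   metaroot d0          # updates /etc/vfstab and /etc/system for the root mirror
   # ... reboot, then:
   metattach d0 d2
   # repeat the same pattern for /var (d5) and swap (d10)

   # raidz pool from the big s4 slice on all four disks
   zpool create tank raidz c0t0d0s4 c0t1d0s4 c0t2d0s4 c0t3d0s4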
Richard Elling - PAE
2006-Nov-07 23:54 UTC
[zfs-discuss] Best Practices recommendation on x4200
The best thing about best practices is that there are so many of them :-)

Robert Milkowski wrote:
> Hello John,
>
> Tuesday, November 7, 2006, 7:45:46 PM, you wrote:
>
> JT> Greetings all-
> JT> I have a new X4200 that I'm getting ready to deploy. It has
> JT> four 146 GB SAS drives. I'd like to set up the box for maximum
> JT> redundancy on the data stored on these drives. Unfortunately, it
> JT> looks like ZFS boot/root aren't really options at this time. The
> JT> LSI Logic controller in this box only supports either a RAID0
> JT> array with all four disks, or a RAID 1 array with two
> JT> disks--neither of which are very appealing to me.
> JT> Ideally I'd like to have at least 300 gigs of storage
> JT> available to the users, or more if I can do it with something like
> JT> a RAID 5 setup. My concern, however, is that the boot
> JT> and root partitions have data redundancy.
> JT> How would you set up this box?
> JT> It's primarily used as a development server, running a myriad of applications.
>
> Use SVM to mirror the system, something like:
>
>    d0   mirror of c0t0d0s0 and c0t1d0s0   /     2GB
>    d5   mirror of c0t0d0s1 and c0t1d0s1   /var  2GB

IMNSHO, having a separate /var is a complete waste of effort. Also, 2 GBytes is too small.

>    d10  mirror of c0t2d0s0 and c0t3d0s0   swap  (2+2GB, to match the above)

Also a waste, use a swap file. Add a dumpdev if you care about kernel dumps; no need to mirror a dumpdev.

> On all 4 disks create an s4 slice with the rest of the disk; it should
> be equal on all disks. Then create a raidz pool out of those slices.
> You should get above 400GB of usable storage.
>
> That way you've got mirrored root disks, mirrored swap on another
> two disks matching exactly the space used by / and /var, and the
> rest of the disks for your data on ZFS.
>
> ps. and of course you've got to create small slices for the metadbs.

Simple /. Make it big enough to be useful. Keep its changes to a minimum. Make more than one, so that you can use LiveUpgrade. For consistency, you could make each disk look the same:

   s0  /       10G
   s6  zpool   free
   s7  metadb  100M

Use two disks for your BE, the other two for your ABE (assuming all are bootable). The astute observer will note that you could also use the onboard RAID controller for the same, simple configuration, less the metadbs of course.
 -- richard
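(A minimal sketch of how that simple layout might be used, assuming the live BE sits on c0t0d0s0/c0t1d0s0 and the ABE goes onto c0t2d0s0; the BE names and the choice of raidz for the s6 pool are placeholders, not Richard's:)

   # alternate boot environment on one of the other bootable disks
   lucreate -c current -n altroot -m /:/dev/dsk/c0t2d0s0:ufs

   # data pool built from the free s6 slice on each disk
   zpool create tank raidz c0t0d0s6 c0t1d0s6 c0t2d0s6 c0t3d0s6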
On 11/7/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote:
> >    d10  mirror of c0t2d0s0 and c0t3d0s0   swap  (2+2GB, to match the above)
>
> Also a waste, use a swap file. Add a dumpdev if you care about
> kernel dumps, no need to mirror a dumpdev.

How do you figure that allocating space to a swap file is less of a waste than adding space to a swap device?

> Simple /. Make it big enough to be useful. Keep its changes to a
> minimum. Make more than one, so that you can use LiveUpgrade.
> For consistency, you could make each disk look the same:
>    s0  /       10G
>    s6  zpool   free
>    s7  metadb  100M

Since ZFS can get performance boosts from enabling the disk write cache if it has the whole disk, you may want to consider something more like the following for two of the disks (assumes mirroring rather than raidz in the zpool):

   s0  /       10G
   s1  swap    <pick your size>
   s3  alt /   10G
   s6  zpool   free
   s7  metadb  100M

The other pair of disks are given entirely to the zpool.

> Use two disks for your BE, the other two for your ABE (assuming all are
> bootable).

In any case, be sure that your root slices do not start at cylinder 0 (hmmm... maybe this is SPARC-specific advice...). One way to populate an ABE is to mirror slices. However, you cannot mirror between a device that starts at cylinder 0 and one that does not. Consider the following mock-up (output may be a bit skewed):

Starting state...

   # lustatus
   slice0 - active, mounted at d0
   slice3 - may or may not exist; if it exists it is on d30
   # metastat -p
   d0 -m d1 d2 1
   d1 1 1 c0t0d0s0
   d2 1 1 c0t1d0s0
   d30 -m d31 d32 1
   d31 1 1 c0t0d0s3
   d32 1 1 c0t1d0s3

Get rid of the slice3 boot environment, make d31 available to recreate it:

   # ludelete slice3
   # metadetach d30 d31
   # metaclear -r d30

Mirror d0 to d31. Wait for it to complete:

   # metattach d0 d31
   # while metastat -p | grep % ; do sleep 30 ; done

Detach d31 from d0, recreate the d30 mirror:

   # metadetach d0 d31
   # metainit d30 -m d31 1
   # metainit d32 1 1 c0t1d0s3
   # metattach d30 d32

Create a boot environment named slice3:

   # lucreate -n slice3 -m /:d30:ufs,preserve

Now you can manipulate the slice3 boot environment as needed.

Why go through all of this? My reasons have typically been:

1) Normally lucreate uses cpio, which doesn't cope with sparse files well. /var/adm/lastlog is a sparse file that can be problematic if you have users with large UIDs.

2) Lots of file systems mounted and little interest in creating very complex command lines with many -x options.

Mike

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
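(One possible reading of Mike's split, sketched as a single pool: the two untouched disks form one mirror pair - letting ZFS enable their write caches - and the s6 slices of the two boot disks form a second; the pool name and devices are placeholders:)

   zpool create tank mirror c0t2d0 c0t3d0 mirror c0t0d0s6 c0t1d0s6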
Richard Elling - PAE
2006-Nov-08 23:21 UTC
[zfs-discuss] Best Practices recommendation on x4200
Mike Gerdts wrote:
> On 11/7/06, Richard Elling - PAE <Richard.Elling at sun.com> wrote:
>> >    d10  mirror of c0t2d0s0 and c0t3d0s0   swap  (2+2GB, to match the above)
>>
>> Also a waste, use a swap file. Add a dumpdev if you care about
>> kernel dumps, no need to mirror a dumpdev.
>
> How do you figure that allocating space to a swap file is less of a
> waste than adding space to a swap device?

If you ever guess wrong (which you will), you can just make another swap file or redo the existing swap file. If you carve out a slice, then reclaiming the space is much more difficult. Creating more slices tends to also be difficult, so when you guess wrong you may still end up swapping to files.

>> Simple /. Make it big enough to be useful. Keep its changes to a
>> minimum. Make more than one, so that you can use LiveUpgrade.
>> For consistency, you could make each disk look the same:
>>    s0  /       10G
>>    s6  zpool   free
>>    s7  metadb  100M
>
> Since ZFS can get performance boosts from enabling the disk write
> cache if it has the whole disk, you may want to consider something
> more like the following for two of the disks (assumes mirroring rather
> than raidz in the zpool):
>
>    s0  /       10G
>    s1  swap    <pick your size>
>    s3  alt /   10G
>    s6  zpool   free
>    s7  metadb  100M
>
> The other pair of disks are given entirely to the zpool.
>
>> Use two disks for your BE, the other two for your ABE (assuming all are
>> bootable).
>
> In any case, be sure that your root slices do not start at cylinder 0
> (hmmm... maybe this is SPARC-specific advice...).

I think this is folklore. Can you cite a reference? NB: traditionally, block 0 contains the VTOC and, for SPARC systems, blocks 1-15 contain the bootblocks; see installboot(1M). Cylinder 0 may contain thousands of blocks for modern disks. It is a waste not to use them. AFAIK, all Sun software which deals with raw devices is aware of this.

> One way to populate
> an ABE is to mirror slices. However, you cannot mirror between a
> device that starts at cylinder 0 and one that does not.

Where is this restriction documented? It doesn't make sense to me. Maybe you have a scar from running Sybase in a previous life? ;-)
 -- richard
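(A minimal sketch of the swap-file flexibility Richard is describing; the path and size are placeholders:)

   # add a 4 GB swap file
   mkfile 4g /export/swapfile1
   swap -a /export/swapfile1

   # guessed wrong? drop it and reclaim the space
   swap -d /export/swapfile1
   rm /export/swapfile1

   # list the swap devices and files currently in use
   swap -l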
Nathan Kroenert
2006-Nov-09 00:00 UTC
[zfs-discuss] Best Practices recommendation on x4200
On Thu, 2006-11-09 at 10:21, Richard Elling - PAE wrote:
> > One way to populate
> > an ABE is to mirror slices. However, you cannot mirror between a
> > device that starts at cylinder 0 and one that does not.
>
> Where is this restriction documented? It doesn't make sense to me.
> Maybe you have a scar from running Sybase in a previous life? ;-)

IIRC, that's a part of the history of DiskSuite / SVM. More precisely, it was that you cannot mirror a slice that has a VTOC label on it to one that does not (hence the understanding of it being a cylinder 0 issue).

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/lvm/mirror/mirror_ioctl.c#887

Or, perhaps I need more coffee...

Cheers!

Nathan. ;)
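(For anyone wanting to check their own layout: prtvtoc shows where each slice begins, so you can see at a glance whether a root slice starts at sector 0 of cylinder 0; the device name is a placeholder:)

   # the "First Sector" column shows each slice's starting sector
   prtvtoc /dev/rdsk/c0t0d0s2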