Is it possible for ZFS to keep data that belongs together on the same pool (or would these questions be more related to RAID-Z)? That way, if there is a failure, only the data on the pool that failed needs to be replaced. (Or if one pool fails, does that mean all the other pools fail as well, without a way to recover the data?)

I want to be able to expand my array over time by adding pools of either 4 or 8 HDDs. Most of the data will probably never be deleted.

But say I have 1 GB remaining on the first pool and I add an 8 GB file: does this mean the data will then be put onto pool 1 and pool 2 (1 GB on pool 1, 7 GB on pool 2)? Or would ZFS be able to put it onto the second pool instead of splitting it?

The other scenario is folder structure: would ZFS be able to understand that data contained in a folder tree belongs together, and store it on a dedicated pool? If so, that would be great; otherwise you would be spending forever restoring data from backup if something does go wrong.

Sorry if this goes in the wrong spot - I could not find "OpenSolaris Forums > zfs > discuss" in the drop-down menu.
-- 
This message posted from opensolaris.org
On Mon, May 16, 2011 at 8:54 PM, MasterCATZ <MasterCATZ at hotmail.com> wrote:
> Is it possible for ZFS to keep data that belongs together on the same pool?
> That way, if there is a failure, only the data on the pool that failed needs
> to be replaced. (Or if one pool fails, does that mean all the other pools
> fail as well, without a way to recover the data?)
> [...]
> I want to be able to expand my array over time by adding pools of either
> 4 or 8 HDDs.

You can create a single pool and grow it as needed. From that pool, you create filesystems. If you want to create multiple pools (because their redundancy/performance requirements differ), ZFS will keep them separate, and again you will create filesystems/datasets from each one independently.

http://download.oracle.com/docs/cd/E19963-01/html/821-1448/index.html
http://download.oracle.com/docs/cd/E18752_01/html/819-5461/index.html
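For instance, a rough sketch of that workflow; the pool name, dataset names and c#t#d# disk names below are just invented placeholders:

  # build one pool from a redundant group of disks
  zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0

  # carve filesystems (datasets) out of the pool's shared space
  zfs create tank/media
  zfs create tank/backups

  # later, grow the same pool by striping in another group of disks
  zpool add tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0

-- 
Giovanni Tirloni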
Hello, c4ts,

There seems to be some mixup of English and terminology, so I am not sure I understood your question correctly. Still, I'll try to respond ;)

I've recently posted about ZFS terminology here:
http://opensolaris.org/jive/click.jspa?searchID=4607806&messageID=515894

In ZFS terminology, a "pool" is a collection of "top-level vdevs", which in turn are collections of "lower-level vdevs". Lower-level vdevs are usually individual disk partitions or slices; top-level vdevs are redundant groups of disk slices (raidzN, mirrors, or non-redundant single devices); and the pool is a sort of striping across such groups. This is a simplified explanation, because it is not like usual RAID striping, and different types of data may be stored with different allocation algorithms (i.e. AFAIK in a raidzN vdev, metadata may still be mirrored for performance).

Generally, when your data is written, it is striped into relatively small blocks and those are distributed across all top-level vdevs to provide parallel performance. There, incoming writes are striped into yet smaller blocks (it attempts about 512 KB according to one source) so as to use the hardware prefetches efficiently; parities are calculated, and the resulting data and parity blocks are written across different lower-level vdevs to provide redundancy. Each written block has a checksum, so it can easily be tested for validity, and in redundant top-level vdevs an invalid block can be automatically repaired (using other copies in a mirror, or parity data in raidzN) during a read operation.

If a lower-level vdev breaks, its top-level vdev becomes "degraded" - it can still be used, but would provide less redundancy. If any top-level vdev fails completely, the pool is likely to become corrupted and may require lengthy repair, or destruction and restore from backup, because part of the striped data would be inaccessible.

You can address your user data as "datasets": filesystem datasets in the case of the POSIX filesystem interface, or volume datasets for "raw storage" like swap. Datasets are kept as a hierarchy inside the pool. All unallocated space in the pool is available to all its datasets.

Now that we know how things are called, regarding your questions:

> Is it possible for ZFS to keep data that belongs together on the same pool,
> so that if there is a failure only the data on the pool that failed needs
> to be replaced? (Or if one pool fails, does that mean all the other pools
> fail as well, without a way to recover the data?)

If you have many storage disks, you can create several small pools (i.e. many independent mirrors) which would have less aggregate performance and less shared available space, but would break independently. So you'll have a smaller chance of losing and repairing/restoring ALL of your data. But you may have to spend some effort to balance performance and free space when using many pools.

> I want to be able to expand my array over time by adding pools of either
> 4 or 8 HDDs.

Yes, you can do that - either by expanding an existing pool with an additional top-level vdev to stripe across, or by creating a new independent pool.
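To make those two options concrete, a hedged sketch (the pool and disk names are again made up, and the two commands are alternatives for the same new disks):

  # option 1: stripe one more top-level vdev into the existing pool
  zpool add tank mirror c2t0d0 c2t1d0

  # option 2: keep the new disks as a separate, independent pool,
  # which fails or survives on its own
  zpool create tank2 mirror c2t0d0 c2t1d0

  # either way, the resulting layout of top-level vdevs can be checked with
  zpool status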
You can also replace all drives in an existing top-level vdev one by one (waiting each time for the redundant data to be migrated to the new disk - "resilvering"), and when all drives have become larger, you can (auto-)expand the vdev in place.

> Most of the data will probably never be deleted.

However, keep in mind that if you add a new top-level vdev or expand an existing one (keeping the other old vdevs as they were), you would get unbalanced free space. Existing data is not re-balanced automatically (though you can do that manually by copying it around several times), so most of the new writes would go to the new top-level vdev. As recently discussed in many threads on this forum, this is likely to reduce your overall pool performance by a lot (because attempted writes striped across nearly-full devices would lag, and because "fast" writes would effectively go to one top-level vdev rather than across many top-level vdevs in parallel).

> But say I have 1 GB remaining on the first pool and I add an 8 GB file:
> does this mean the data will then be put onto pool 1 and pool 2
> (1 GB on pool 1, 7 GB on pool 2)?
>
> Or would ZFS be able to put it onto the second pool instead of splitting it?

Neither. Pools are independent. If your question actually meant "1 GB free" on a top-level vdev or on a disk, while there are other "free" spots on different devices in the same pool, then the overall free space in this pool is aggregated and you may be able to write a larger file than would fit on any single disk.

> The other scenario is folder structure: would ZFS be able to understand
> that data contained in a folder tree belongs together, and store it on a
> dedicated pool?

That would probably mean a tree of ZFS filesystem datasets. These are kept in a hierarchy like a directory tree, and each FS dataset can store a tree of sub-directories. An FS dataset has a unique mountpoint, so as with other Unix filesystems, your OS's view of the filesystem tree of directories and files is aggregated from different individual filesystems mounted on the branches of a global tree.

These "different individual filesystems" may come from a hierarchy of FS datasets in one pool (often with a mountpoint inherited from a parent dataset, such as "pool/export/home/username" being mounted by default as the subdirectory "username" within the mountpoint of "pool/export/home"), and they may come from different sources - such as other pools or network filesystems. For the different sources you may need to specify mountpoints manually, though (or use tricks like inheriting from a parent dataset with a specified common "mountpoint=/export/home" but which is not mounted by itself).

Thus files in one dataset would be kept together; files in different datasets in one pool would also be kept somewhat together (see below); and files in datasets from different pools would be kept separately. But one way or another, if the datasets are mounted, files are addressable in the common Unix filesystem tree of your OS.

Like other filesystems, FS datasets have unique IDs and private inode numbers within the FS; so, for example, you can't hard-link or fast-move (hardlink the new name, unlink the old name) files between different FS datasets, even in the same pool. But you can often use soft-links and/or mountpoints to achieve a specific needed result.
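As an illustration of such a hierarchy with inherited mountpoints (the pool is hypothetically named "pool" to match the example above, and the user dataset names are made up):

  # create the parent dataset (and any missing parents) with an explicit mountpoint
  zfs create -p -o mountpoint=/export/home pool/export/home

  # children inherit the mountpoint prefix automatically
  zfs create pool/export/home/alice    # mounted at /export/home/alice
  zfs create pool/export/home/bob      # mounted at /export/home/bob

  # free pool space is shared, but can be tuned per dataset
  zfs set quota=200g pool/export/home/alice       # cap this dataset's usage
  zfs set reservation=50g pool/export/home/bob    # guarantee space for this one

  zfs get -r mountpoint,quota,reservation pool/export/home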
Also, you have shared free pool space between different FS datasets in the same pool (which can be further tuned for specific datasets by using quotas and reservations).

> If so, that would be great; otherwise you would be spending forever
> restoring data from backup if something does go wrong.

If you browse this forum in greater detail: for one reason or another, even (or especially) with ZFS things still do go wrong, especially on cheaper under-provisioned hardware, and repairs are often slow or complicated, so for large pools they can indeed take "forever" or close to that.

As always, redundant storage is not a replacement for a backup system (although many people suggest that a backup on another similar server box is superior to using tape backups - while probably using more electricity in real time).

> Sorry if this goes in the wrong spot - I could not find
> "OpenSolaris Forums > zfs > discuss" in the drop-down menu.

Seems to have come through correctly ;)

HTH,
//Jim Klimov
-- 
This message posted from opensolaris.org
On May 17, 2011, at 9:17 AM, Jim Klimov wrote:
> Generally, when your data is written, it is striped into relatively small
> blocks and those are distributed across all top-level vdevs to provide
> parallel performance.

Where "relatively small" is the block size or record size: 512 bytes to 128 KB. A block will not be spread across top-level vdevs.

> If a lower-level vdev breaks, its top-level vdev becomes "degraded" - it
> can still be used, but would provide less redundancy. If any top-level
> vdev fails completely, the pool is likely to become corrupted and may
> require lengthy repair, or destruction and restore from backup, because
> part of the striped data would be inaccessible.

Not quite correct: if a top-level vdev fails, the pool stops writing. It does not compound the problem by trying to write to a faulted pool. The failmode parameter setting dictates what happens when the pool is faulted because of a write operation. There have been many cases where large chunks of storage "disappear" for some reason or another (e.g. someone pulls out the SAS cable to the JBOD). Once reconnected, the pool can pick right up where it left off.
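For reference, failmode is a per-pool property that can be inspected and changed like this (the pool name is a placeholder; the accepted values are wait, continue and panic):

  # show the current setting (the default is "wait")
  zpool get failmode tank

  # wait: block I/O until device connectivity is restored
  zpool set failmode=wait tank

  # continue: return EIO to new write requests instead of blocking
  zpool set failmode=continue tank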
> As always, redundant storage is not a replacement for a backup system
> (although many people suggest that a backup on another similar server box
> is superior to using tape backups - while probably using more electricity
> in real time).

Yep, good advice, Jim
 -- richard