Is it possible for ZFS to keep data that belongs together on the same pool (or would these questions be more related to RAID-Z)? That way, if there is a failure, only the data on the pool that failed needs to be replaced. (Or if one pool fails, does that mean all the other pools fail as well, without a way to recover the data?)

I want to be able to expand my array over time by adding pools of either 4 or 8 HDDs. Most of the data will probably never be deleted.

But say I have 1 GB remaining on the first pool and I add an 8 GB file: does this mean the data will then be put onto pool 1 and pool 2 (1 GB on pool 1, 7 GB on pool 2)? Or would ZFS be able to put it onto the second pool instead of splitting it?

The other scenario is folder structure: would ZFS be able to understand that data contained in a folder tree belongs together, and store it on a dedicated pool? If so, that would be great; otherwise you would be spending forever restoring data from backup if something does go wrong.

Sorry if this goes in the wrong spot - I could not find "OpenSolaris Forums > zfs > discuss" in the drop-down menu.
-- 
This message posted from opensolaris.org
On Mon, May 16, 2011 at 8:54 PM, MasterCATZ <MasterCATZ at hotmail.com> wrote:
> Is it possible for ZFS to keep data that belongs together on the same pool?
> That way, if there is a failure, only the data on the pool that failed needs
> to be replaced. (Or if one pool fails, does that mean all the other pools
> fail as well, without a way to recover the data?)
> [...]
> I want to be able to expand my array over time by adding pools of either
> 4 or 8 HDDs.

You can create a single pool and grow it as needed. From that pool, you create filesystems. If you want to create multiple pools (because their redundancy/performance requirements differ), ZFS will keep them separate, and again you will create filesystems/datasets from each one independently.

http://download.oracle.com/docs/cd/E19963-01/html/821-1448/index.html
http://download.oracle.com/docs/cd/E18752_01/html/819-5461/index.html
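For instance, a rough sketch of that workflow; the pool name, dataset names and c#t#d# disk names below are just invented placeholders:

  # build one pool from a redundant group of disks
  zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0

  # carve filesystems (datasets) out of the pool's shared space
  zfs create tank/media
  zfs create tank/backups

  # later, grow the same pool by striping in another group of disks
  zpool add tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0

-- 
Giovanni Tirloni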
Hello, c4ts,

There seems to be some mixup of English and terminology, so I am not sure I understood your question correctly. Still, I'll try to respond ;)

I've recently posted about ZFS terminology here:
http://opensolaris.org/jive/click.jspa?searchID=4607806&messageID=515894

In ZFS terminology, a "pool" is a collection of "top-level vdevs", which in turn are collections of "lower-level vdevs". Lower-level vdevs are usually individual disk partitions or slices; top-level vdevs are redundant groups of disk slices (raidzN, mirrors, or non-redundant single devices); and the pool is a sort of striping across such groups. This is a simplified explanation, because it is not like usual RAID striping, and different types of data may be stored with different allocation algorithms (i.e. AFAIK in a raidzN vdev, metadata may still be mirrored for performance).

Generally, when your data is written, it is striped into relatively small blocks and those are distributed across all top-level vdevs to provide parallel performance. There, incoming writes are striped into yet smaller blocks (it attempts about 512 KB according to one source) so as to use the hardware prefetches efficiently; parities are calculated, and the resulting data and parity blocks are written across different lower-level vdevs to provide redundancy. Each written block has a checksum, so it can easily be tested for validity, and in redundant top-level vdevs an invalid block can be automatically repaired (using other copies in a mirror, or parity data in raidzN) during a read operation.

If a lower-level vdev breaks, its top-level vdev becomes "degraded" - it can still be used, but would provide less redundancy. If any top-level vdev fails completely, the pool is likely to become corrupted and may require lengthy repair, or destruction and restore from backup, because part of the striped data would be inaccessible.

You can address your user data as "datasets": filesystem datasets in the case of the POSIX filesystem interface, or volume datasets for "raw storage" like swap. Datasets are kept as a hierarchy inside the pool. All unallocated space in the pool is available to all its datasets.

Now that we know how things are called, regarding your questions:

> Is it possible for ZFS to keep data that belongs together on the same pool,
> so that if there is a failure only the data on the pool that failed needs
> to be replaced? (Or if one pool fails, does that mean all the other pools
> fail as well, without a way to recover the data?)

If you have many storage disks, you can create several small pools (i.e. many independent mirrors) which would have less aggregate performance and less shared available space, but would break independently. So you'll have a smaller chance of losing and repairing/restoring ALL of your data. But you may have to spend some effort to balance performance and free space when using many pools.

> I want to be able to expand my array over time by adding pools of either
> 4 or 8 HDDs.

Yes, you can do that - either by expanding an existing pool with an additional top-level vdev to stripe across, or by creating a new independent pool.
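To make those two options concrete, a hedged sketch (the pool and disk names are again made up, and the two commands are alternatives for the same new disks):

  # option 1: stripe one more top-level vdev into the existing pool
  zpool add tank mirror c2t0d0 c2t1d0

  # option 2: keep the new disks as a separate, independent pool,
  # which fails or survives on its own
  zpool create tank2 mirror c2t0d0 c2t1d0

  # either way, the resulting layout of top-level vdevs can be checked with
  zpool status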
You can also replace all drives in an existing top-level vdev one by one (waiting each time for the redundant data to be migrated to the new disk - "resilvering"), and when all drives have become larger, you can (auto-)expand the vdev in place.

> Most of the data will probably never be deleted.

However, keep in mind that if you add a new top-level vdev or expand an existing one (keeping the other old vdevs as they were), you would get unbalanced free space. Existing data is not re-balanced automatically (though you can do that manually by copying it around several times), so most of the new writes would go to the new top-level vdev. As recently discussed in many threads on this forum, this is likely to reduce your overall pool performance by a lot (because attempted writes striped across nearly-full devices would lag, and because "fast" writes would effectively go to one top-level vdev rather than across many top-level vdevs in parallel).

> But say I have 1 GB remaining on the first pool and I add an 8 GB file:
> does this mean the data will then be put onto pool 1 and pool 2
> (1 GB on pool 1, 7 GB on pool 2)?
>
> Or would ZFS be able to put it onto the second pool instead of splitting it?

Neither. Pools are independent. If your question actually meant "1 GB free" on a top-level vdev or on a disk, while there are other "free" spots on different devices in the same pool, then the overall free space in this pool is aggregated and you may be able to write a larger file than would fit on any single disk.

> The other scenario is folder structure: would ZFS be able to understand
> that data contained in a folder tree belongs together, and store it on a
> dedicated pool?

That would probably mean a tree of ZFS filesystem datasets. These are kept in a hierarchy like a directory tree, and each FS dataset can store a tree of sub-directories. An FS dataset has a unique mountpoint, so as with other Unix filesystems, your OS's view of the filesystem tree of directories and files is aggregated from different individual filesystems mounted on the branches of a global tree.

These "different individual filesystems" may come from a hierarchy of FS datasets in one pool (often with a mountpoint inherited from a parent dataset, such as "pool/export/home/username" being mounted by default as the subdirectory "username" within the mountpoint of "pool/export/home"), and they may come from different sources - such as other pools or network filesystems. For the different sources you may need to specify mountpoints manually, though (or use tricks like inheriting from a parent dataset with a specified common "mountpoint=/export/home" but which is not mounted by itself).

Thus files in one dataset would be kept together; files in different datasets in one pool would also be kept somewhat together (see below); and files in datasets from different pools would be kept separately. But one way or another, if the datasets are mounted, files are addressable in the common Unix filesystem tree of your OS.

Like other filesystems, FS datasets have unique IDs and private inode numbers within the FS; so, for example, you can't hard-link or fast-move (hardlink the new name, unlink the old name) files between different FS datasets, even in the same pool. But you can often use soft-links and/or mountpoints to achieve a specific needed result.
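As an illustration of such a hierarchy with inherited mountpoints (the pool is hypothetically named "pool" to match the example above, and the user dataset names are made up):

  # create the parent dataset (and any missing parents) with an explicit mountpoint
  zfs create -p -o mountpoint=/export/home pool/export/home

  # children inherit the mountpoint prefix automatically
  zfs create pool/export/home/alice    # mounted at /export/home/alice
  zfs create pool/export/home/bob      # mounted at /export/home/bob

  # free pool space is shared, but can be tuned per dataset
  zfs set quota=200g pool/export/home/alice       # cap this dataset's usage
  zfs set reservation=50g pool/export/home/bob    # guarantee space for this one

  zfs get -r mountpoint,quota,reservation pool/export/home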
Also, you have shared free pool space between different FS datasets in the same pool (which can be further tuned for specific datasets by using quotas and reservations).

> If so, that would be great; otherwise you would be spending forever
> restoring data from backup if something does go wrong.

If you browse this forum in greater detail: for one reason or another, even (or especially) with ZFS things still do go wrong, especially on cheaper under-provisioned hardware, and repairs are often slow or complicated, so for large pools they can indeed take "forever" or close to that.

As always, redundant storage is not a replacement for a backup system (although many people suggest that a backup on another similar server box is superior to using tape backups - while probably using more electricity in real time).

> Sorry if this goes in the wrong spot - I could not find
> "OpenSolaris Forums > zfs > discuss" in the drop-down menu.

Seems to have come through correctly ;)

HTH,
//Jim Klimov
-- 
This message posted from opensolaris.org
On May 17, 2011, at 9:17 AM, Jim Klimov wrote:
> Generally, when your data is written, it is striped into relatively small
> blocks and those are distributed across all top-level vdevs to provide
> parallel performance.

Where "relatively small" is the block size or record size: 512 bytes to 128 KB. A block will not be spread across top-level vdevs.

> If a lower-level vdev breaks, its top-level vdev becomes "degraded" - it
> can still be used, but would provide less redundancy. If any top-level
> vdev fails completely, the pool is likely to become corrupted and may
> require lengthy repair, or destruction and restore from backup, because
> part of the striped data would be inaccessible.

Not quite correct: if a top-level vdev fails, the pool stops writing. It does not compound the problem by trying to write to a faulted pool. The failmode parameter setting dictates what happens when the pool is faulted because of a write operation. There have been many cases where large chunks of storage "disappear" for some reason or another (e.g. someone pulls out the SAS cable to the JBOD). Once reconnected, the pool can pick right up where it left off.
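For reference, failmode is a per-pool property that can be inspected and changed like this (the pool name is a placeholder; the accepted values are wait, continue and panic):

  # show the current setting (the default is "wait")
  zpool get failmode tank

  # wait: block I/O until device connectivity is restored
  zpool set failmode=wait tank

  # continue: return EIO to new write requests instead of blocking
  zpool set failmode=continue tank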
> As always, redundant storage is not a replacement for a backup system
> (although many people suggest that a backup on another similar server box
> is superior to using tape backups - while probably using more electricity
> in real time).

Yep, good advice, Jim
 -- richard