Welcome to the ZFS Community!

Today, build 27 of OpenSolaris was released to the community. Included in this release is ZFS, Sun's next-generation filesystem. We are proud to announce the creation of the ZFS community to discuss and develop ZFS for the future. You can find the community at:

  http://www.opensolaris.org/os/community/zfs

Be sure to look for blogs relating to ZFS at:

  http://blogs.sun.com

As well as an introductory screencast produced by Dan Price:

  http://www.opensolaris.org/os/community/zfs/demos/basics/

For the developers out there, you can find an overview of the source code at:

  http://www.opensolaris.org/os/community/zfs/source/

Many thanks to the ZFS and OpenSolaris teams for making this a reality.

So what is ZFS? ZFS is a new kind of filesystem that provides simple administration, transactional semantics, end-to-end data integrity, and immense scalability. ZFS is not an incremental improvement to existing technology; it is a fundamentally new approach to data management. We've blown away 20 years of obsolete assumptions, eliminated complexity at the source, and created a storage system that's actually a pleasure to use.

ZFS presents a pooled storage model that completely eliminates the concept of volumes and the associated problems of partitions, provisioning, wasted bandwidth, and stranded storage. Thousands of filesystems can draw from a common storage pool, each one consuming only as much space as it actually needs.

All operations are copy-on-write transactions, so the on-disk state is always valid. There is no need to fsck(1M) a ZFS filesystem, ever. Every block is checksummed to prevent silent data corruption, and the data is self-healing in replicated (mirrored or RAID) configurations.

ZFS provides unlimited constant-time snapshots and clones. A snapshot is a read-only point-in-time copy of a filesystem, while a clone is a writable copy of a snapshot. Clones provide an extremely space-efficient way to store many copies of mostly-shared data such as workspaces, software installations, and diskless clients.

ZFS administration is both simple and powerful. The tools are designed from the ground up to eliminate the traditional headaches of managing filesystems. Storage can be added, disks replaced, and data scrubbed with straightforward commands. Filesystems can be created instantaneously, snapshots and clones taken, and native backups made, and a simplified property mechanism allows quotas, reservations, compression, and more to be set per filesystem.

Give it a spin, and let us know what you think!

- The ZFS Team
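A quick taste of the administrative model described above. This is a minimal sketch: the pool name "tank", the device names, and the dataset names are just examples, not a prescription:

  # create a mirrored pool and a couple of filesystems
  zpool create tank mirror c1t0d0 c1t1d0
  zfs create tank/home
  zfs create tank/home/user1

  # properties: quotas, reservations, compression
  zfs set quota=10g tank/home/user1
  zfs set reservation=5g tank/home/user1
  zfs set compression=on tank/home

  # snapshots and clones
  zfs snapshot tank/home/user1@tuesday
  zfs clone tank/home/user1@tuesday tank/home/user1-clone

  # grow the pool, replace a disk, scrub everything
  zpool add tank mirror c2t0d0 c2t1d0
  zpool replace tank c1t0d0 c3t0d0
  zpool scrub tank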
Thank you for the message. Is b27 already out? Any ideas what the differences between b27a and b27 are? Should we wait for b27?

ZFS should be available in b27, am I right?

Thanks,
stefan
On Wed 16 Nov 2005 at 11:28AM, Stefan Parvu wrote:

> Thank you for the message. Is b27 already out? Any ideas what the
> differences between b27a and b27 are? Should we wait for b27?
>
> ZFS should be available in b27, am I right?

Yes, it's out. b27a is out, I mean. 27a means 27 + a bug fix.

http://www.opensolaris.org/os/downloads

-dp

--
Daniel Price - Solaris Kernel Engineering - dp at eng.sun.com - blogs.sun.com/dp
Casper.Dik at Sun.COM
2005-Nov-16 19:46 UTC
[zfs-discuss] Re: Welcome to the ZFS community!
> Thank you for the message. Is b27 already out? Any ideas what the
> differences between b27a and b27 are? Should we wait for b27?

b27a post-dates b27.

> ZFS should be available in b27, am I right?

It's in b27a; b27 will not be made available externally. Some additional fixes for ZFS were pulled into the first release, so build 27 was respun, giving build 27a. (Most importantly, some changes to improve performance on debug kernels.)

Casper
OK, thanks Casper. I'll start downloading b27a then.

stefan
b27a is just a respin of b27. ZFS was released in b27, so it is also in b27a.

Gary

Stefan Parvu wrote:

> Thank you for the message. Is b27 already out? Any ideas what the
> differences between b27a and b27 are? Should we wait for b27?
>
> ZFS should be available in b27, am I right?
>
> Thanks,
> stefan

--
Gary Combs
Product Architect, Sun Microsystems, Inc.
3295 NW 211th Terrace, Hillsboro, OR 97124 US
Phone x32604 / +1 503 715 3517, Fax 503-715-3517
Email Gary.Combs at Sun.COM
It's nice that ZFS eliminates the concept of volumes, but how does that fit in with a SAN where volume management is performed by the external storage system and not by the Solaris server?
ZFS will be perfectly happy in this setting. However, you may not be getting the best of the features that ZFS has to offer. For example, if you have your SAN box set up with internal mirroring and export it as a single volume to ZFS, the SAN doesn't understand ZFS checksums and self-healing data. A corrupt block would still be detected by ZFS at a higher level, but ZFS would be unable to retrieve the correct data and proactively repair it. Similar smaller features will also be unavailable in other ZFS subsystems (RAID-Z, dynamic striping, I/O load balancing, etc.).

Overall, ZFS will continue to function as designed, but we encourage users to expose ZFS to hardware that is as close to raw as possible, to fully leverage all the available features.

- Eric

On Wed, Nov 16, 2005 at 01:15:58PM -0800, Luke wrote:
> It's nice that ZFS eliminates the concept of volumes, but how does that
> fit in with a SAN where volume management is performed by the external
> storage system and not by the Solaris server?

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
On Wed, Nov 16, 2005 at 01:15:58PM -0800, Luke wrote:
> It's nice that ZFS eliminates the concept of volumes, but how does
> that fit in with a SAN where volume management is performed by the
> external storage system and not by the Solaris server?

It will work fine, since at the bottom-most layer of ZFS we simply read/write blocks to a block storage device.

The thing you lose, however, is the ability to do self-healing. Since ZFS checksums the data, we can detect silent data corruption. Furthermore, since ZFS also does redundancy (mirroring, RAID-Z, etc.), when an error is detected we can look to redundant copies of the data in the hope of finding a good copy.

You can't do this if the filesystem and volume manager are separate products (as they would be if the redundancy were implemented in an external RAID array). There is no interface to a block device to ask it to read the other side of the mirror, for example. And the volume manager itself can't do this, since it has no idea what the data was supposed to look like.

"But wait," you say, "my storage array vendor says that they checksum all of the data, too." That is correct; some do. But there is a whole class of errors that you can't fix with that architecture:

  - Accidental overwrites: The array has no idea whether the block writes
    coming in are from the filesystem or from some rogue agent overwriting
    filesystem blocks.

  - Phantom writes: This is where the block device says it did the
    requested write, but actually dropped the data on the floor.

  - Mis-directed reads/writes: The filesystem requests block X, but
    instead gets block Y. How often can these errors (along with phantom
    writes) actually happen? Well, these are industry-standard terms for
    the phenomenon, so it happens often enough to be given a proper name.
    Also, most folks working on storage for any length of time have
    actually witnessed this sort of thing for real.

  - Data path errors: If the data gets corrupted on the way to the
    storage array, then the storage array nicely checksums corrupt data
    and ensures that it stays corrupt.

  - Software bugs in the array: A storage array is nothing but a big pile
    of software sitting on top of a bunch of disks. Any software product
    has bugs (as does ZFS, I'm sure). And two stacks of software will, by
    definition, have more failures than a single stack. But in ZFS, we
    mitigate this risk by having a very simple, extremely well-tested I/O
    path that verifies the integrity of the data before the rest of ZFS
    even gets to look at it.

So to sum up my answer: ZFS will work correctly on HW RAID devices, but you lose the ability to survive silent data corruption.

--Bill
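To put the self-healing point in concrete terms: if the SAN exports two LUNs rather than one big virtualized volume, the redundancy can live in ZFS, where the checksums are. A minimal sketch, assuming two LUNs visible to Solaris as c4t0d0 and c4t1d0 (the device names and the pool name "tank" are illustrative):

  # mirror the two LUNs so ZFS holds a redundant copy it can repair from
  zpool create tank mirror c4t0d0 c4t1d0

  # walk every block, verify it against its checksum, repair from the mirror
  zpool scrub tank
  zpool status -v tank    # per-device read/write/checksum error counters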
Luke wrote:
> It's nice that ZFS eliminates the concept of volumes, but how does that
> fit in with a SAN where volume management is performed by the external
> storage system and not by the Solaris server?

Luke, as long as you have multiple paths you'll be fine. Besides, not all SAN-resident storage is managed for you (A5200 -> switch -> host, etc.).

What concerns do you have about ZFS and SAN-resident storage?

James C. McPherson
--
Solaris Datapath Engineering
Data Management Group
Sun Microsystems
Luke wrote:
> It's nice that ZFS eliminates the concept of volumes, but how does that
> fit in with a SAN where volume management is performed by the external
> storage system and not by the Solaris server?

Someone from the ZFS team will probably have a better/longer/more-quote-worthy answer :) but the general takeaway is that ZFS operates on the LUNs offered to the host. It has no view into the SAN, storage arrays, etc. on the other side of the LUN.

If you have a storage array that provides virtualization, for example a Sun StorEdge 6920, and it has exported a LUN to the host running ZFS that is really composed of ...

  * a storage pool with five LUNs that come from ...
  * five different 6020 trays, in each of which the LUN is really ...
  * a 7+1+1 RAID-5 stripe of 144 GB disk drives ...

... ZFS isn't going to manage or dive down to any of those levels. It's really up to the customer to determine if and where volume management should be done to best suit their environment. ZFS greatly simplifies many storage management tasks, and it would be advantageous to look at the new features available to see where any simplification can be achieved.
On Nov 16, 2005, at 3:50 PM, Bill Moore wrote:
> On Wed, Nov 16, 2005 at 01:15:58PM -0800, Luke wrote:
>> It's nice that ZFS eliminates the concept of volumes, but how does
>> that fit in with a SAN where volume management is performed by the
>> external storage system and not by the Solaris server?
>
> It will work fine, since at the bottom-most layer of ZFS we simply
> read/write blocks to a block storage device.
>
> The thing you lose, however, is the ability to do self-healing. Since
> ZFS checksums the data, we can detect silent data corruption.
> Furthermore, since ZFS also does redundancy (mirroring, RAID-Z, etc.),
> when an error is detected we can look to redundant copies of the data
> in the hope of finding a good copy.

Just to make sure the point is not lost: you can (and should) still mirror or RAID-Z across multiple LUNs in your SAN so you can get all the features that Bill laid out. The fact that those LUNs may be anything from a dumb disk to a complex RAID is not known to ZFS; each is just a block device.

Doing mirroring in two places is a bit of a belt-and-suspenders duplication; what value, if any, that adds is a subject of interesting discussion.

-David
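For example, if the array exports three LUNs, a single RAID-Z group across them gives ZFS the redundancy it needs for self-healing at roughly one LUN's worth of overhead. A sketch, with illustrative device and pool names:

  # single-parity RAID-Z across three SAN LUNs
  zpool create tank raidz c5t0d0 c5t1d0 c5t2d0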
Thanks everyone for the great responses.

Eric wrote:
> Overall, ZFS will continue to function as designed, but we encourage
> users to expose ZFS to hardware that is as close to raw as possible,
> to fully leverage all the available features.

Bill wrote:
> So to sum up my answer: ZFS will work correctly on HW RAID devices,
> but you lose the ability to survive silent data corruption.

David wrote:
> Just to make sure the point is not lost: you can (and should) still
> mirror or RAID-Z across multiple LUNs in your SAN so you can get all
> the features that Bill laid out.

You guys have explained quite clearly the benefits of doing some volume management on the Solaris server even when using SAN storage with its own volume management. Here are some downsides I see with this:

(1) As David pointed out, two levels of mirroring/RAID involves dubious
    duplication (and is more wasteful of disk space than should be
    necessary);

(2) RAID on the SAN target is a lot more flexible than at the Solaris
    server, because a single RAID group can be shared between multiple
    servers;

(3) Mirroring on the Solaris server will double the SAN bandwidth
    requirements (is this what James was alluding to when he wrote "as
    long as you have multiple paths you'll be fine"?).

All this makes me wonder whether there is an opportunity here to make a SAN target able to collaborate somehow with a server running ZFS.
On Nov 16, 2005, at 6:06 PM, Luke wrote:
> You guys have explained quite clearly the benefits of doing some
> volume management on the Solaris server even when using SAN storage
> with its own volume management. Here are some downsides I see with
> this:
> (1) As David pointed out, two levels of mirroring/RAID involves
> dubious duplication (and is more wasteful of disk space than should
> be necessary);
> (2) RAID on the SAN target is a lot more flexible than at the Solaris
> server, because a single RAID group can be shared between multiple
> servers;
> (3) Mirroring on the Solaris server will double the SAN bandwidth
> requirements (is this what James was alluding to when he wrote "as
> long as you have multiple paths you'll be fine"?).
>
> All this makes me wonder whether there is an opportunity here to make
> a SAN target able to collaborate somehow with a server running ZFS.

From an integrity perspective, there is no real choice: you have to mirror or RAID-Z in the software. Doing it only in the array leaves the entire data path from the array to the CPU open to corruption; even if you exposed the checksums and detected the error, you would still lack a mechanism to recover without a software mirror. With ZFS it is "safe" through to the CPU/memory subsystem, one that is usually built with various hardware protections (ECC, etc.) to minimize internal corruption. But there are added bandwidth and links; integrity is not free.

The contrary argument can be made that you should just buy cheap JBODs: ZFS is all that you need, and everything else is expensive duplication. But there is a middle ground; as you say, in a shared array it may be easier to have one RAID group that can be accessed by systems that are not blessed with ZFS.

-David
I'm not sure if I should continue the thread or start working on a blueprint titled something like "Best Practices for ZFS in a Heterogeneous SAN Environment". More inline ...

Luke wrote:
> (1) As David pointed out, two levels of mirroring/RAID involves
> dubious duplication (and is more wasteful of disk space than should
> be necessary);

True, but as you then state ...

> (2) RAID on the SAN target is a lot more flexible than at the Solaris
> server, because a single RAID group can be shared between multiple
> servers;

Clearly, this depends on your environment. A data center working on heterogeneous storage consolidation between hundreds of systems would take a different approach than a customer running ZFS on a system with a JBOD storage system.

> (3) Mirroring on the Solaris server will double the SAN bandwidth
> requirements (is this what James was alluding to when he wrote "as
> long as you have multiple paths you'll be fine"?).

I believe James's point was that as long as you have transport redundancy, for example Traffic Manager, you negate some of the other data availability concerns. For example, if your FC link goes down between your HBA and the storage array or switch, ZFS can't do much about it. Having a redundant link would alleviate that concern to a large degree.

Mirroring on the server would double the write requirements, but the read bandwidth would be the same. ZFS won't read both sides of a mirror unless it detects an invalid checksum. I can't say for sure, but I would assume (perhaps incorrectly, so someone jump in here if I'm wrong) that reads are spread round-robin across the LUNs in a mirror to increase throughput.

> All this makes me wonder whether there is an opportunity here to make
> a SAN target able to collaborate somehow with a server running ZFS.

It's an idea a few of us have had for some time, but not an easy problem to fix for various reasons.
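One way to watch how reads spread across the sides of a mirror, and whether any checksum errors have been hit, is the pool statistics and status output. A sketch; the pool name "tank" is just an example:

  # per-device operations and bandwidth, sampled every 5 seconds
  zpool iostat -v tank 5

  # per-device read, write, and checksum error counters
  zpool status -v tank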
On Wed, 2005-11-16 at 19:50, Torrey McMahon wrote:
> > All this makes me wonder whether there is an opportunity here to
> > make a SAN target able to collaborate somehow with a server running
> > ZFS.
>
> It's an idea a few of us have had for some time, but not an easy
> problem to fix for various reasons.

Yup. I'll plug my blog post from earlier today:

http://blogs.sun.com/roller/page/sommerfeld?entry=the_end_to_end_argument

And it's worth rereading "End-to-End Arguments in System Design" ...

If you're going to build a shared storage box to provide space on a SAN to systems which will be running ZFS, I'm willing to bet that the answer is likely to look very different from today's SAN systems. They go to immense lengths to provide the illusion of a huge, extremely reliable disk. ZFS doesn't need that, so maybe some of the complexity in there is not actually needed ...

It screams out for taking the same start-with-a-blank-sheet design approach that was taken with ZFS itself. Move the functions around between boxes to where they can best be performed, and tweak the interfaces between them to make this possible.

- Bill