Hi ZFS Team,

I have a couple of questions...

Assume that the maximum slab size that ZFS supports is x. (I am assuming there is a maximum.) An application does a (single) write(2) for 2x bytes. Does ZFS/COW guarantee that either all the 2x bytes are persistent or none at all? Consider a case where there is a panic after x bytes have gone to disk and the change has propagated to the uber-block. Do the uber-block and metadata blocks get updated with the entire write(2) or nothing?

In other words, does ZFS's 'always consistent on disk' guarantee extend to data as well as metadata?

I am _not_ talking about cases that result in ENOSPC, EFBIG, EDQUOT etc., where I think a partial write is probably ok. But I would be interested in knowing ZFS's behaviour in these cases as well.

The other question is about how ZFS does direct IO. Does it do COW for direct IO as well?

Referring me to a manual/doc is good enough. :)

Thanks in advance!!

Cheers
Manoj
> Assume that the maximum slab size that ZFS supports is x. (I am assuming there is a maximum.) An application does a (single) write(2) for 2x bytes. Does ZFS/COW guarantee that either all the 2x bytes are persistent or none at all? Consider a case where there is a panic after x bytes has gone to disk and the change propagated to the uber-block. Do the uber-block and metadata blocks get updated with the entire write(2) or nothing?
>
> In other words, does ZFS's 'always consistent on disk' guarantee extend to data as well as metadata.

An important guarantee ZFS makes is that the data is consistent. That is a guarantee that UFS doesn't make: UFS only makes somewhat sure that the metadata does not contain errors.

Everybody knows that "metadata consistent" buys you next to nothing; well, fsck doesn't fail, but that is about it. How often haven't you seen files that were updated just prior to a crash end up with bogus content? This is especially bad if it happens to /etc/*_* files.

ZFS's consistency guarantee would not be worth much if it did not group the metadata and data in the same transaction, so that is what it does.

Casper
Hi Casper,

Thanks for your quick reply. Some followup questions. :)

Casper.Dik at Sun.COM wrote:
>> Assume that the maximum slab size that ZFS supports is x. (I am assuming there is a maximum.) An application does a (single) write(2) for 2x bytes. Does ZFS/COW guarantee that either all the 2x bytes are persistent or none at all? Consider a case where there is a panic after x bytes has gone to disk and the change propagated to the uber-block. Do the uber-block and metadata blocks get updated with the entire write(2) or nothing?
>>
>> In other words, does ZFS's 'always consistent on disk' guarantee extend to data as well as metadata.
>
> An important guarantee ZFS makes is that the data is consistent; that is a guarantee that ufs doesn't make: it makes somewhat sure that the meta data does not contain errors.
>
> Everybody knows that "meta data consistent" buys you next to nothing; well, fsck doesn't fail but that is about it. How often haven't you seen files updated just prior to a crash with bogus content? Specifically bad if it happens to /etc/*_* files.
>
> ZFS consistency guarantee would not be worth much if it did not group the meta data and data in the same transaction so that is what it does.

I thought so too. ;)

The write(2) man page says the call can return a value less than or equal to 'nbyte'. Can ZFS do this too, i.e. write less than what you asked write(2) to write? Can this happen only when it runs out of space?

Writing less than what you ask the FS to write (but >0) gives you inconsistent data, at least IMHO.

Regards,
Manoj
> man write(2) says it can return with less than or equal to 'nbyte'.
> Can ZFS do this too - write less than what you asked it to write(2)?
> Can this happen only when it runs out of space?

Generally, this only happens for devices and not files.

But the write(2) manual page is pretty clear:

    If a write() requests that more bytes be written than there is
    room for -- for example, if the write would exceed the process
    file size limit (see getrlimit(2) and ulimit(2)), the system
    file size limit, or the free space on the device -- only as
    many bytes as there is room for will be written. For example,
    suppose there is space for 20 bytes more in a file before
    reaching a limit. A write() of 512-bytes returns 20. The next
    write() of a non-zero number of bytes gives a failure return
    (except as noted for pipes and FIFO below).

> Writing less than what you ask the FS to write (but >0) gives you
> inconsistent data - at least IMHO.

ZFS deals with filesystem inconsistencies only, not application-level inconsistencies. The application knows that the condition has arisen (write() returns the short count) and can take the necessary steps to rectify the problem.

The POSIX syscall interface is a given and we cannot change its behaviour.

Casper
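To illustrate how an application can notice a short write and deal with it, here is a minimal sketch in C. The helper name write_all and its out-parameter are made up for this example; only write(2) and errno are standard. It is one common way to handle the situation, not something ZFS requires or provides.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * write_all() is a hypothetical helper, not part of any library.
 * It keeps calling write(2) until all len bytes are written or an
 * error occurs. Returns 0 on success and -1 on failure. In both
 * cases *written (if non-NULL) reports how many bytes got through,
 * so the caller can decide how to recover from a partial write.
 */
int write_all(int fd, const void *buf, size_t len, size_t *written)
{
    const char *p = buf;
    size_t done = 0;

    assert(buf != NULL || len == 0);

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data moved: retry */
            break;          /* real error, e.g. ENOSPC, EFBIG, EDQUOT */
        }
        if (n == 0)
            break;          /* no progress: give up rather than spin */
        done += (size_t)n;  /* short write: loop to send the remainder */
    }

    if (written != NULL)
        *written = done;
    return done == len ? 0 : -1;
}
```

In the man page's example (room for 20 more bytes, a 512-byte request), the first write() inside the loop would return 20, the next would fail, and the caller would get -1 with *written set to 20. What to do then (truncate, roll back, report) is an application decision.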
> man write(2) says it can return with less than or equal to 'nbyte'.
> Can ZFS do this too - write less than what you asked it to write(2)?
> Can this happen only when it runs out of space?

Yes -- any filesystem will do that if it runs out of space.

Regarding atomicity of writes: they're only atomic up to a point. If some application issues a 1TB write, we can't hold up the rest of the system waiting for it to complete. At present, ZFS writes are atomic up to the whole-block level, i.e. a max of 128k. If it were useful, we could add a dataset property that indicates how much stuff we should be willing to batch up in a single tx.

However, there does have to be some limit -- otherwise any ordinary user could cork up the system by issuing a giant write, thereby forcing ZFS to accumulate change until the system ran out of memory.

Jeff
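As a rough illustration of the block-level limit mentioned above, the following C sketch checks whether a write of len bytes at a given file offset stays inside a single block. The function name and the ASSUMED_BLOCK_SIZE constant are invented for this example; 128k is simply the maximum quoted in this thread, and the actual block size of a given file may differ. The sketch does not query ZFS, and staying within one block is an application-side heuristic based on the statement above, not a documented guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the 128k maximum block size quoted in this thread. */
#define ASSUMED_BLOCK_SIZE (128u * 1024u)

/*
 * fits_in_one_block() is a hypothetical helper. It returns true when
 * the byte range [offset, offset + len) lies entirely within one
 * block of block_size bytes, i.e. the write does not cross a block
 * boundary. Offsets close to UINT64_MAX are not handled.
 */
bool fits_in_one_block(uint64_t offset, uint64_t len, uint64_t block_size)
{
    assert(block_size > 0);

    if (len == 0)
        return true;            /* an empty write touches no block */
    if (len > block_size)
        return false;           /* larger than a block: must span several */

    uint64_t first = offset / block_size;             /* block of first byte */
    uint64_t last = (offset + len - 1) / block_size;  /* block of last byte */
    return first == last;
}
```

For instance, a 2-byte write starting at the last byte of the first 128k block crosses into the second block, so the check returns false, while the same write starting at offset 0 returns true.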