Hello

I'm looking into the forensic artefacts of ZFS and came across zpool.cache in /etc/zfs/. Can some wise person confirm my information is correct and help me with some questions? I've had a look through the code on src.opensolaris.org and the on-disk specs (the PDF document "ZFS On-Disk Specification") but haven't found the answers yet.

Many thanks for your help

Mark

INFO

* ZFS has a legacy mode where mounts are handled by /etc/vfstab, or a default mode where pools are automounted by ZFS. If you use /etc/vfstab and switch off the automatic mount at boot, the ZFS partitions won't be touched unless manually mounted.

* zpool.cache appears to be a record of the creation, maintenance and destruction of ZFS pools (spa_config.h).

* According to spa_config.h and libzfs_config.c, zpool.cache is part of the SPA layer and is made up of nvlist objects.

* spa_history.c mentions a history log ring buffer where the creation of an SPA object and subsequent actions on it are recorded.

QUESTIONS

* Can someone confirm whether the info above is correct?

* How persistent is the information in zpool.cache after zpools have been destroyed?

* I'm still looking for a C-struct type definition of the zpool.cache records. Can someone point me in the right direction?

* Are there any *other* records left on the host regarding the creation/maintenance/destruction of a ZFS file system?

* Where does the SPA history log land on disk? Is it in the zpool.cache of the host, somewhere else on the host, or part of the vdev label / ZFS file system?

* "ZFS On-Disk Specification" is described as a draft, dated 22 August 2006. Is there anything more recent available?
On Wed, Jul 25, 2007 at 01:11:54PM +0200, Mark Furner wrote:

> * ZFS has a legacy mode where mounts are handled by /etc/vfstab, or a default
>   mode where pools are automounted by ZFS. If you use /etc/vfstab and switch
>   off the automatic mount at boot, the ZFS partitions won't be touched unless
>   manually mounted.
>
> * zpool.cache appears to be a record of the creation, maintenance and
>   destruction of ZFS pools (spa_config.h).
>
> * According to spa_config.h and libzfs_config.c, zpool.cache is part of the
>   SPA layer and is made up of nvlist objects.
>
> * spa_history.c mentions a history log ring buffer where the creation of an
>   SPA object and subsequent actions on it are recorded.
>
> QUESTIONS
>
> * Can someone confirm whether the info above is correct?

Sort of. As its name implies, it is a "cache" of pool configurations
active on the system. ZFS is designed such that all information needed
to access any pool is present on the disks themselves. However, without
a cache file we would need to inspect (i.e. read the zpool labels from)
every Solaris device on boot, which would be quite expensive. And we
wouldn't be able to have pools based on files. If this file is destroyed
or lost, the pool can still be opened via 'zpool import', but this may
be a slower operation.

> * How persistent is the information in zpool.cache after zpools have been
>   destroyed?

zpool.cache only keeps track of active pools on the system.

> * I'm still looking for a C-struct type definition of the zpool.cache
>   records. Can someone point me in the right direction?

Sorry, there's no good reference here. The contents have grown
haphazardly over time, and there's no definitive reference. You can
easily examine the contents of the cache file with 'zdb -C'. In the
source code, you'll want to look at the ZPOOL_CONFIG_* definitions in
zfs.h and spa_config_generate().

> * Are there any *other* records left on the host regarding the
>   creation/maintenance/destruction of a ZFS file system?

For *filesystems*, you can consult the per-pool history. There is no
record for pool creation or destruction, though there is an open RFE to
have this incorporated in the Solaris auditing system.

> * Where does the SPA history log land on disk? Is it in the zpool.cache of
>   the host, somewhere else on the host, or part of the vdev label / ZFS
>   file system?

It is part of the MOS (meta object set), which is the root of all
pool-wide metadata.

> * "ZFS On-Disk Specification" is described as a draft, dated 22 August 2006.
>   Is there anything more recent available?

Sadly, no. It could probably use some love, but most of the engineers
have been too busy to keep the document up to date with ongoing changes.

- Eric

--
Eric Schrock, Solaris Kernel Development      http://blogs.sun.com/eschrock
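[For readers following along: since the cache file is a single packed nvlist, it can also be dumped programmatically with libnvpair, not just with 'zdb -C'. The following is a minimal sketch, not part of Eric's message; the hard-coded /etc/zfs/zpool.cache path and the bare-bones error handling are illustrative assumptions.]

    /*
     * Sketch: unpack and pretty-print the nvlist stored in zpool.cache.
     * Build on Solaris / OpenSolaris with:  cc -o dumpcache dumpcache.c -lnvpair
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <libnvpair.h>

    int
    main(int argc, char **argv)
    {
            const char *path = (argc > 1) ? argv[1] : "/etc/zfs/zpool.cache";
            struct stat st;
            char *buf;
            nvlist_t *config = NULL;
            int fd;

            if ((fd = open(path, O_RDONLY)) == -1 || fstat(fd, &st) == -1) {
                    perror(path);
                    return (1);
            }
            if ((buf = malloc(st.st_size)) == NULL ||
                read(fd, buf, st.st_size) != st.st_size) {
                    perror("read");
                    return (1);
            }
            (void) close(fd);

            /* The whole file is one packed nvlist; unpack and print it. */
            if (nvlist_unpack(buf, st.st_size, &config, 0) != 0) {
                    (void) fprintf(stderr, "failed to unpack %s\n", path);
                    return (1);
            }
            nvlist_print(stdout, config);

            nvlist_free(config);
            free(buf);
            return (0);
    }

[The output is the same nested name/value structure that 'zdb -C' reports for each active pool, one nvlist per pool keyed by pool name.]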
Darren J Moffat
2007-Jul-25 16:57 UTC
[zfs-code] ZFS file system history log and zpool.cache
Eric Schrock wrote:

>> * "ZFS On-Disk Specification" is described as a draft, dated 22 August 2006.
>>   Is there anything more recent available?
>
> Sadly, no. It could probably use some love, but most of the engineers
> have been too busy to keep the document up to date with ongoing changes.

PSARC really wants this document updated for the ZFS crypto project. If it
is out of date now, it is going to be really hard for me to get it correct
for the commitment review of ZFS crypto. Can we get someone officially
tasked with bringing it up to date, please?

--
Darren J Moffat
Many thanks for the quick reply, Eric. ZFS is quite awesome, but also
awesomely complicated.

What did you mean with

> For *filesystems*, you can consult the per-pool history.

Is this the vdev label uberblock / MOS thing?

> It is part of the MOS (meta object set), which is the root of all
> pool-wide metadata.

I'm still trying to work out where the MOS can be found on disk. The
uberblock points to it, but is it outside the vdev label? The diagram on
p. 32 of the "ZFS On-Disk Specification" is a bit ambiguous here. Where
does the file-system metadata end and the files / file metadata /
directories begin?

Regards

Mark
--
x-x-x-x-x-x-x-x-x-x-x-x-x
Mark Furner, PhD
Lärchenstr. 39
CH 8400 Winterthur
Switzerland
T. 0041 (0)78 641 15 92
E. mark.furner at gmx.net
On Wed, Jul 25, 2007 at 08:19:07PM +0200, Mark Furner wrote:

> Many thanks for the quick reply, Eric. ZFS is quite awesome, but also
> awesomely complicated.
>
> What did you mean with
>
> > For *filesystems*, you can consult the per-pool history.
>
> Is this the vdev label uberblock / MOS thing?

Yes, the history is stored in the MOS.

> > It is part of the MOS (meta object set), which is the root of all
> > pool-wide metadata.
>
> I'm still trying to work out where the MOS can be found on disk. The
> uberblock points to it, but is it outside the vdev label? The diagram on
> p. 32 of the "ZFS On-Disk Specification" is a bit ambiguous here. Where
> does the file-system metadata end and the files / file metadata /
> directories begin?

The uberblock points to the MOS:

    struct uberblock {
            uint64_t        ub_magic;       /* UBERBLOCK_MAGIC */
            uint64_t        ub_version;     /* SPA_VERSION */
            uint64_t        ub_txg;         /* txg of last sync */
            uint64_t        ub_guid_sum;    /* sum of all vdev guids */
            uint64_t        ub_timestamp;   /* UTC time of last sync */
            blkptr_t        ub_rootbp;      /* MOS objset_phys_t */
    };

The 'ub_rootbp' points to the MOS, from which all pool data and metadata
can be discovered. It is a normal block pointer, which means it can be
stored on any of the vdevs (several of them, thanks to ditto blocks).
Only the vdev label (including the uberblock) has a fixed location;
everything else is derived from that.

For what's contained in the MOS, check out the definition for
DMU_POOL_DIRECTORY_OBJECT:

    /*
     * The names of zap entries in the DIRECTORY_OBJECT of the MOS.
     */
    #define DMU_POOL_DIRECTORY_OBJECT       1
    #define DMU_POOL_CONFIG                 "config"
    #define DMU_POOL_ROOT_DATASET           "root_dataset"
    #define DMU_POOL_SYNC_BPLIST            "sync_bplist"
    #define DMU_POOL_ERRLOG_SCRUB           "errlog_scrub"
    #define DMU_POOL_ERRLOG_LAST            "errlog_last"
    #define DMU_POOL_SPARES                 "spares"
    #define DMU_POOL_DEFLATE                "deflate"
    #define DMU_POOL_HISTORY                "history"
    #define DMU_POOL_PROPS                  "pool_props"

Most of this is pool-wide metadata, but the 'root_dataset' is the link
into the DSL (dataset and snapshot layer), which has a list of all the
filesystems (among other things). Each filesystem has its own data
structure that is the root of the actual filesystem data.

Hope that helps,

- Eric

--
Eric Schrock, Solaris Kernel Development      http://blogs.sun.com/eschrock
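[Since the labels and their uberblock rings are the only fixed-location structures, a natural forensic first step is to scan the ring in label 0 and find the active (highest-txg) uberblock. Below is a minimal user-space sketch, not part of the thread; the 256 KB label size, 128 KB ring offset, 1 KB slot size, and magic value are taken from the on-disk spec draft, and the code assumes a 512-byte-sector vdev, native endianness, and that the argument is the vdev itself (a file vdev or the pool's disk slice).]

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <time.h>

    #define UBERBLOCK_MAGIC 0x00bab10cULL   /* "oo-ba-bloc!" */
    #define LABEL_SIZE      (256 * 1024)    /* one of four labels per vdev */
    #define UB_RING_OFFSET  (128 * 1024)    /* ring fills the label's 2nd half */
    #define UB_SLOT_SIZE    1024            /* slot spacing at 512 B sectors */
    #define UB_SLOTS        ((LABEL_SIZE - UB_RING_OFFSET) / UB_SLOT_SIZE)

    /* Only the leading fixed-width uberblock fields are needed here. */
    struct ub_hdr {
            uint64_t ub_magic;
            uint64_t ub_version;
            uint64_t ub_txg;
            uint64_t ub_guid_sum;
            uint64_t ub_timestamp;
    };

    int
    main(int argc, char **argv)
    {
            FILE *fp;
            unsigned char slot[UB_SLOT_SIZE];
            struct ub_hdr hdr;
            uint64_t best_txg = 0;
            time_t t;
            int i, best_slot = -1;

            if (argc != 2) {
                    (void) fprintf(stderr, "usage: %s <vdev-device-or-file>\n",
                        argv[0]);
                    return (1);
            }
            if ((fp = fopen(argv[1], "rb")) == NULL) {
                    perror(argv[1]);
                    return (1);
            }

            for (i = 0; i < UB_SLOTS; i++) {
                    if (fseek(fp, (long)(UB_RING_OFFSET + i * UB_SLOT_SIZE),
                        SEEK_SET) != 0 ||
                        fread(slot, 1, sizeof (slot), fp) != sizeof (slot))
                            break;
                    (void) memcpy(&hdr, slot, sizeof (hdr));
                    if (hdr.ub_magic != UBERBLOCK_MAGIC)
                            continue;       /* empty or byte-swapped slot */
                    t = (time_t)hdr.ub_timestamp;
                    (void) printf("slot %3d: txg %llu  %s", i,
                        (unsigned long long)hdr.ub_txg, ctime(&t));
                    if (hdr.ub_txg >= best_txg) {
                            best_txg = hdr.ub_txg;
                            best_slot = i;
                    }
            }
            (void) fclose(fp);

            if (best_slot >= 0)
                    (void) printf("active uberblock: slot %d, txg %llu\n",
                        best_slot, (unsigned long long)best_txg);
            return (0);
    }

[From the winning slot's ub_rootbp one would then follow the DVAs into the MOS; zdb's -u option reports the active uberblock without any of this manual work.]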
Thanks again Eric. Lots of food for thought here, so much head scratching
I'll get dandruff...

Mark
I'm trying to read the ZFS MOS but am unable to do so. The compression field
is set to 3 and does not change with 'zfs set compression=off <my_pool_name>'.
Does this compression have any effect on the MOS? I'm just reading the block
pointer using the MOS pointer as mentioned above.

-Asim

--
This message posted from opensolaris.org
On Sun, Nov 18, 2007 at 01:19:16AM -0800, Asim Kadav wrote:

> I'm trying to read the ZFS MOS but am unable to do so. The compression
> field is set to 3 and does not change with 'zfs set compression=off
> <my_pool_name>'. Does this compression have any effect on the MOS?

All metadata is compressed using LZJB because it tends to be quite
compressible, and the decreased I/O is usually a win for most workloads.
Search for 'os_md_compress' to see how this is used.

- Eric

--
Eric Schrock, FishWorks                       http://blogs.sun.com/eschrock
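[In practice this means that anyone reading the MOS by hand, as above, has to LZJB-decompress the objset_phys_t block that ub_rootbp points at after reading it, regardless of the pool's 'compression' property. Below is a user-space sketch modeled on the LZJB algorithm ZFS uses; the standalone function name and signature are illustrative, not the kernel's.]

    /*
     * LZJB stream: groups of 8 items preceded by a bitmap byte.  A clear
     * bit means a literal byte; a set bit means a two-byte copy item with
     * a 6-bit match length (minimum 3) and a 10-bit backward offset.
     */
    #include <stddef.h>
    #include <stdint.h>

    #define NBBY            8
    #define MATCH_BITS      6
    #define MATCH_MIN       3
    #define OFFSET_MASK     ((1 << (16 - MATCH_BITS)) - 1)

    /* Returns 0 on success, -1 if a copy item reaches before the buffer. */
    int
    lzjb_decompress_sketch(const uint8_t *src, uint8_t *dst, size_t d_len)
    {
            uint8_t *d_start = dst;
            uint8_t *d_end = dst + d_len;
            uint8_t copymap = 0;
            int copymask = 1 << (NBBY - 1);

            while (dst < d_end) {
                    /* Fetch a fresh literal/copy bitmap every 8 items. */
                    if ((copymask <<= 1) == (1 << NBBY)) {
                            copymask = 1;
                            copymap = *src++;
                    }
                    if (copymap & copymask) {
                            /* Copy item: length and offset, then replay. */
                            int mlen = (src[0] >> (NBBY - MATCH_BITS)) +
                                MATCH_MIN;
                            int offset = ((src[0] << NBBY) | src[1]) &
                                OFFSET_MASK;
                            const uint8_t *cpy;

                            src += 2;
                            if ((cpy = dst - offset) < d_start)
                                    return (-1);
                            while (--mlen >= 0 && dst < d_end)
                                    *dst++ = *cpy++;
                    } else {
                            /* Literal item: copy one byte through. */
                            *dst++ = *src++;
                    }
            }
            return (0);
    }

[The caller supplies the logical (decompressed) size in d_len, which for a block pointer comes from its lsize field; the physical size on disk is the psize field.]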