Attached is a basic write-up of the user-serviceable snapshot feature
design (Avati's). Please take a look and let us know if you have
questions of any sort...
We have a basic implementation up now; reviews and the upstream commit
should follow over the next week.
Cheers,
Anand
User-serviceable Snapshots
=========================
Credits
=======
This brilliant design is Anand Avati's brainchild. The meta xlator is
also to blame to some extent.
Terminology
===========
* gluster volume - a GlusterFS volume created by the "gluster volume create"
  command
* snapshot volume - a snapshot volume created by the "gluster snapshot create"
  command; this is based on the LVM2 thin-LV backend and is itself a thin LV;
  a snapshot thin-LV is accessible as yet another GlusterFS volume in itself
1. Introduction
===============
User-serviceable snapshots (USS) are a quick and easy way to access data stored
in earlier snapshotted volumes. This feature builds on the core snapshot feature
introduced in GlusterFS earlier. The key point is that USS allows end users to
access their older data without any admin intervention. In that sense, this
feature is about ease of use and ease of access to one's past data in snapshot
volumes (which, in the gluster world today, are based on LVM2 thin-LVs as the
backend).
This is not a replacement for bulk data access from an earlier snapshot volume;
for that, the recommendation is to mount the snapshot volume as a regular
GlusterFS volume and access it via the native FUSE client.
Rather, this is targeted at typical home directory scenarios, where individual
users can, at arbitrary points in time, access files and directories in their
own home directories without admin intervention of any sort. The home directory
use case is only an example; several other use cases, including other kinds of
applications, could benefit from this feature.
2. Use-case
===========
Consider a user John with Unix id john and $HOME of /home/john. Say John wants
to access a file /home/john/Important/file_john.txt which existed in his home
directory in November 2013 but was deleted in December 2013. Prior to the
introduction of the user-serviceable snapshot feature, John's only option was to
send a note to the admin to ensure the gluster-snapshotted volume from Nov 2013
was made available (activated and mounted). The admin would then notify John of
the availability of the snapshot volume, at which point John could traverse his
older home directory and copy over the file.
With USS, the need for admin intervention goes away. John is now free to execute
the following steps and access the desired file whenever he needs to:
$pwd
/home/john
$ls
dir1/ dir2/ dir3/ file1 file2 Important/
$cd Important/
$ls
(No files present - this being his current view)
$cd .snaps
$pwd
/home/john/Important/.snaps
$ls
snapshot_jan2014/ snapshot_dec2013/ snapshot_nov2013/ snapshot_oct2013/
snapshot_sep2013/
$cd snapshot_nov2013/
$ls
file_john.txt file_john_1.txt
$cp -p file_john.txt $HOME
As the above steps indicate, it is fairly easy to recover lost files or even
older versions of files or directories using USS.
3. Design
=========
A new server-side xlator (snapview-server) and a client-side xlator
(snapview-client) are introduced. On the client side, the xlator sits above the
DHT xlator in the graph and redirects fops either to the DHT xlator or to the
protocol-client xlator (both of which are children of the snapview-client
xlator). On the server side, the protocol-server xlator and the snapview-server
xlator form a graph hosted inside a separate daemon, snapd (a glusterfsd
process). One such daemon process is spawned for each gluster volume.
We rely on the fact that gfids are unique and are the same across all
snapshotted volumes. Given a volume, we will access a file using its gfid
without knowing the filename. We accomplish this by taking the existing data
filesystem namespace and overlaying a virtual gfid namespace on top.
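To make the gfid overlay concrete, the small sketch below reads a file's gfid
from a regular gluster FUSE mount through the "glusterfs.gfid.string" virtual
xattr (the path and the choice of xattr are illustrative assumptions, not part
of this design). The point is simply that the same gfid identifies the object
in every snapshot of the volume, which is what lets us find it later without
knowing its name.

/* Sketch: read a file's gfid from a gluster FUSE mount via the
 * "glusterfs.gfid.string" virtual xattr (assumed here).
 * Build: gcc gfid.c -o gfid
 */
#include <stdio.h>
#include <sys/xattr.h>

int
main (void)
{
        char    gfid[64] = {0};
        ssize_t len;

        len = getxattr ("/home/john/Important/file_john.txt",
                        "glusterfs.gfid.string", gfid, sizeof (gfid) - 1);
        if (len < 0) {
                perror ("getxattr");
                return 1;
        }

        /* The same gfid names this object in every snapshot of the volume. */
        printf ("gfid = %s\n", gfid);
        return 0;
}
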
All files and directories remain accessible as they are (in the current state
of the gluster volumes). But in every directory we create a "virtual directory"
called ".snaps" on the fly. This ".snaps" directory provides a list of all the
available snapshots for the given volume and acts as a wormhole into all the
available snapshots of that volume, i.e. into the past.
When the .snaps directory is looked up, the client xlator, with its
instrumented lookup(), detects that it is a reference to the virtual directory.
It redirects the request to the snapd daemon and, in turn, to the
snapview-server xlator, which generates a random gfid, fills up a pseudo stat
structure with the necessary info and returns via STACK_UNWIND. Information
about the directory is maintained in the server xlator's inode context, where
inodes are classified as VIRTUAL, REAL or the special "DOT_SNAPS_INODE", so
that this info can be used in subsequent lookups. On the client xlator side
too, such virtual type info is maintained in the inode_ctx.
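To make the classification concrete, here is a small, self-contained model of
the routing decision described above. The enum values and the helper function
are illustrative stand-ins; the real snapview xlators keep this state in their
inode contexts and wind the fop to the corresponding child xlator.

/* Simplified model of the inode classification and fop routing; not the
 * actual xlator code.  Build: gcc route.c -o route
 */
#include <stdio.h>
#include <string.h>

typedef enum {
        REAL_INODE,            /* regular file/dir on the live volume      */
        DOT_SNAPS_INODE,       /* the virtual ".snaps" directory itself    */
        VIRTUAL_INODE          /* anything resolved inside a snapshot view */
} uss_inode_type_t;

/* Decide where a lookup (parent, name) should be wound. */
static const char *
route_lookup (uss_inode_type_t parent_type, const char *name)
{
        if (strcmp (name, ".snaps") == 0)
                return "snapd (snapview-server)";  /* entry into the past */

        if (parent_type == DOT_SNAPS_INODE || parent_type == VIRTUAL_INODE)
                return "snapd (snapview-server)";  /* stay in the snapshot view */

        return "regular graph (dht and below)";    /* normal data path */
}

int
main (void)
{
        printf ("lookup (.snaps)           -> %s\n",
                route_lookup (REAL_INODE, ".snaps"));
        printf ("lookup (file1)            -> %s\n",
                route_lookup (REAL_INODE, "file1"));
        printf ("lookup (snapshot_nov2013) -> %s\n",
                route_lookup (DOT_SNAPS_INODE, "snapshot_nov2013"));
        return 0;
}
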
The user would typically do an "ls", which results in an opendir and a
readdirp() on the inode returned. The server xlator queries the list of
snapshots present in the system and presents each one as an entry in the
directory, in the form of dirent entries. We also need to encode enough info in
each of the respective inodes so that the next time a call happens on that
inode, we can figure out where that inode is in the big picture - whether it is
in a snapshot volume, which one, and so on. And once a user does an ls inside
one of the specific snapshot directories (hourly.0 etc.), we have to figure out
the gfid of the original directory and perform the access on the graph chosen
by that specific snapshot directory, using that gfid.
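The bookkeeping that has to travel with each such virtual inode - which
snapshot the entry belongs to and the gfid of the real object - can be pictured
as below. The struct and helper names are invented for illustration; the real
xlator stores equivalent information in its inode context.

/* Sketch of the per-virtual-inode information needed to replay a later fop
 * on the right snapshot graph.  Build: gcc vinode.c -o vinode -luuid
 */
#include <stdio.h>
#include <string.h>
#include <uuid/uuid.h>          /* gfids are 16-byte uuids */

#define SNAP_NAME_MAX 256       /* assumption, for the sketch only */

typedef struct {
        char   snapname[SNAP_NAME_MAX]; /* e.g. "snapshot_nov2013" */
        uuid_t gfid;                    /* gfid of the object in that snapshot */
} virtual_inode_info_t;

/* Fill the context for an entry returned by readdirp() under ".snaps". */
static void
virtual_inode_fill (virtual_inode_info_t *info, const char *snapname,
                    const uuid_t gfid)
{
        snprintf (info->snapname, sizeof (info->snapname), "%s", snapname);
        uuid_copy (info->gfid, gfid);
}

int
main (void)
{
        virtual_inode_info_t info;
        uuid_t               gfid;
        char                 str[37];

        uuid_generate (gfid);           /* stand-in for a real gfid */
        virtual_inode_fill (&info, "snapshot_nov2013", gfid);

        uuid_unparse (info.gfid, str);
        printf ("entry belongs to %s, gfid %s\n", info.snapname, str);
        return 0;
}
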
The inode information on the server xlator side is mapped to the gfapi world
via the handle-based libgfapi calls, which were introduced for the nfs-ganesha
integration. These handle-based APIs allow a gfapi operation to be performed on
a "gfid" handle, through a glfs-object that encodes the gfid and the inode
returned from the gfapi world.
In this case, once the server xlator allocates an inode, we need to track it
and map it to the corresponding glfs-object in the handle-based gfapi world, so
that any glfs_h_XXX operation can be performed on it.
For example, on the server xlator side, the _stat call would typically need to
check the type of inode stored in the inode_ctx. If it is a ".snaps" inode,
then the iatt structure is filled in directly. If it is a subsequent lookup on
a virtual inode, then we obtain the glfs_t and glfs_object info from the
inode_ctx (where this is already stored). The desired stat is then easily
obtained using the glfs_h_stat (fs, object, &stat) call.
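Putting the two paragraphs above together, the flow inside snapd for a fop on a
virtual inode looks roughly like the following. The volume name, host and gfid
value are placeholders and the error handling is minimal; only the glfs_* calls
are actual libgfapi APIs.

/* Sketch: resolve a gfid to a glfs_object in a snapshot volume's gfapi
 * graph and stat it.  Build: gcc uss_stat.c -o uss_stat -lgfapi
 * (header and library paths may vary by distribution).
 */
#include <stdio.h>
#include <sys/stat.h>
#include <glusterfs/api/glfs.h>
#include <glusterfs/api/glfs-handles.h>

int
main (void)
{
        glfs_t             *fs  = NULL;
        struct glfs_object *obj = NULL;
        struct stat         st;
        unsigned char       gfid[16] = {0};  /* placeholder; in snapd this
                                                comes from the inode ctx */
        int                 ret = 1;

        /* One such gfapi graph is kept per snapshot volume inside snapd. */
        fs = glfs_new ("snap-vol-nov2013");           /* hypothetical name */
        if (!fs)
                return 1;
        glfs_set_volfile_server (fs, "tcp", "localhost", 24007);
        if (glfs_init (fs) != 0)
                goto out;

        /* Map the gfid to a glfs_object handle ... */
        obj = glfs_h_create_from_handle (fs, gfid, sizeof (gfid), &st);
        if (!obj)
                goto out;

        /* ... and perform the fop on it, here the stat described above. */
        if (glfs_h_stat (fs, obj, &st) == 0) {
                printf ("size = %lld\n", (long long) st.st_size);
                ret = 0;
        }

out:
        if (obj)
                glfs_h_close (obj);
        glfs_fini (fs);
        return ret;
}
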
Considerations
==============
- A global option will be available to turn off USS globally. A volume-level
option will also be made available to enable USS per volume; there could be
volumes for which USS access is not desirable at all.
- Disabling this feature removes the client-side graph generation, while snapds
can continue to exist on the server side; they will never be accessed without
the client-side enablement. And since every access to client gfapi graphs etc.
is dynamic, done on the fly and cleaned up, the expectation is that such a
snapd left behind would not hog resources at all.
- Today we are allowing the listing of all available snapshots in each
".snaps" directory. We plan to introduce a configurable option to
limit the number of snapshots visible under the USS feature.
- There is no impact on existing fops from this feature. If enabled, it is just
an extra check in the client-side xlator to decide whether the fop should be
redirected to the server-side xlator.
- With a large number of snapshot volumes made available or visible, one
glfs_t * hangs off snapd for each gfapi client call-graph. Along with that, if
a large number of users start simultaneously accessing files on each of the
snapshot volumes (the maximum number of supported snapshots is 256 today), then
the RSS of snapd could grow large. We are trying to get numbers for this before
we can say for sure whether this is an issue at all (say, with the OOM killer).
- The list of snapshots is refreshed each time a new snapshot is taken or added
to the system. snapd queries glusterd for the new list of snapshots and
refreshes its in-memory list, appropriately cleaning up the glfs_t graphs for
each of the deleted snapshots and clearing up any glfs_objects (a simplified
refresh sketch follows at the end of this section).
- Again, this is not a performance-oriented feature. Rather, the goal is a
seamless user experience, allowing easy and useful access to snapshotted
volumes and the individual data stored in those volumes.
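
For the snapshot-list refresh mentioned above, a simplified version of the
cleanup logic could look like this. The snap_entry_t bookkeeping and the list
handling are assumptions for the sketch; only glfs_fini() (and, on first
access, glfs_new()/glfs_init()) are actual libgfapi calls.

/* Simplified sketch of the snapshot-list refresh.
 * Build: gcc refresh.c -o refresh -lgfapi
 */
#include <stdio.h>
#include <string.h>
#include <glusterfs/api/glfs.h>

typedef struct {
        char    name[256];   /* snapshot name, e.g. "snapshot_nov2013" */
        glfs_t *fs;          /* gfapi graph, created lazily on first access */
} snap_entry_t;

static int
in_list (const char *name, const char *list[], int count)
{
        for (int i = 0; i < count; i++)
                if (strcmp (name, list[i]) == 0)
                        return 1;
        return 0;
}

/* Drop graphs of snapshots that no longer exist.  Newly added snapshots
 * would be appended to the tracking list by the caller and get a gfapi
 * graph lazily on first access (omitted here). */
static int
refresh_snap_list (snap_entry_t snaps[], int count,
                   const char *new_list[], int new_count)
{
        int kept = 0;

        for (int i = 0; i < count; i++) {
                if (in_list (snaps[i].name, new_list, new_count)) {
                        snaps[kept++] = snaps[i];
                        continue;
                }
                if (snaps[i].fs)
                        glfs_fini (snaps[i].fs);  /* tear down deleted snap */
        }
        return kept;
}

int
main (void)
{
        snap_entry_t snaps[2] = {
                { .name = "snapshot_oct2013", .fs = NULL },
                { .name = "snapshot_nov2013", .fs = NULL },
        };
        const char *new_list[] = { "snapshot_nov2013", "snapshot_dec2013" };
        int count = refresh_snap_list (snaps, 2, new_list, 2);

        printf ("snapshots still tracked: %d\n", count);
        return 0;
}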