During the last few months we have received a few questions about our HSM architecture. This email explains the relationship and differences between an Open Source HSM (ADM from Sun) and the Lustre HSM design (which was done between CEA and some folks from the Lustre team). This discussion likely applies equally well when comparing Lustre with other HSMs. Many thanks to Rick Matthews for educating me patiently. The conclusions at the end of this email may be of particular interest.

References:
Lustre HSM architecture: http://arch.lustre.org/index.php?title=HSM_Migration
High level design: https://bugzilla.lustre.org/attachment.cgi?id=16341
CEA slides: attached (please post in a findable place on wiki.lustre.org)
ADM: http://opensolaris.org/os/project/adm/WhatisADM/

Each of the following sections describes how elements of the Lustre HSM architecture relate to the ADM architecture.

Event management

HSMs capture events in the file system for the following purposes:
1. archiving a file
2. restoring a file
3. policy management, such as purging less used files

A general mechanism to manage such events is provided by DMAPI, which effectively burdens a user space daemon with managing the events. Lustre chose not to use DMAPI. Lustre has multiple servers, and events can be detected both on OSS and on MDS nodes. Detecting file I/O on OSS nodes makes it possible to know exactly what region of a file will be read or written and to adjust actions accordingly (e.g. if the first 4K of a purged file were left on disk and that block is read, no HSM restore action is needed). Events are detected by initiators and logged transactionally, so that no scans of the file system are required after a power outage. Note that ADM targets ZFS DMU data stores, and in these stores searches of the file tree can dynamically generate event logs similar to Lustre's logs (Lustre on ZFS will use these searches and abandon its own logging system). A precise mechanism is required to determine which search results are still relevant.

Typically many events might be generated by a single system call, and handling this is called filtering. Some filters are always valuable; others express user policy (e.g. a decision not to archive mp3 files). The initiators filter the events that are always unnecessary, such as multiple clients triggering a restore on the same object, and send the remaining events to coordinators. The coordinators can be a (failover) collection of load-balancing systems, organized so that all events for one file reach the same coordinator. The coordinator applies further filtering; for example, if multiple OSS nodes request the restore of one file, the coordinator makes sure that this leads to a single action. The coordinator also implements the optional filters arising from policy. The discussions on the lustre-devel mailing list have suggested that we would simply extend the policy options by adding them to the coordinator(s); this is not sufficient.

Coordinators dispatch events for HSM action to agents. Lustre allows multiple agents to collaborate on one event, and the coordinator observes completion by each agent. Agents in turn invoke archiving tools, which might run in user space, to move files to and from the file system, and agents can also abort ongoing actions at the request of a coordinator. Another way in which agents can be used is to deliver events to an event manager for a system like ADM. This was not previously considered in the Lustre architecture, but it seems to be a natural way to couple the two systems.
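To make the event flow above a little more concrete, here is a minimal sketch in C of what an initiator-side event record and two of the "always useful" filters might look like. All type and function names (hsm_event, hsm_fid, hsm_read_needs_restore and so on) are invented for illustration; this is not the actual Lustre HSM interface.

    /* Minimal sketch of an initiator-side event record and two of the
     * "always useful" filters discussed above.  All names are hypothetical;
     * this is not the actual Lustre HSM interface. */
    #include <stdbool.h>
    #include <stdint.h>

    enum hsm_event_type { HSM_ARCHIVE, HSM_RESTORE, HSM_PURGE };

    struct hsm_fid { uint64_t seq; uint32_t oid; uint32_t ver; };

    struct hsm_event {
            struct hsm_fid      fid;       /* object the event refers to */
            enum hsm_event_type type;
            uint64_t            offset;    /* extent touched by the I/O  */
            uint64_t            length;
    };

    /* Filter 1: a read that falls entirely within the data still resident
     * on disk (e.g. the first 4K left behind when the file was purged)
     * needs no restore at all. */
    static bool hsm_read_needs_restore(const struct hsm_event *ev,
                                       uint64_t resident_bytes)
    {
            return ev->type == HSM_RESTORE &&
                   ev->offset + ev->length > resident_bytes;
    }

    /* Filter 2: only one restore per object should reach the coordinator,
     * even if many clients touch the purged file at the same time.  A real
     * initiator would key this on its transactional event log; a tiny
     * in-memory table stands in here. */
    #define HSM_MAX_INFLIGHT 128
    static struct hsm_fid hsm_inflight[HSM_MAX_INFLIGHT];
    static unsigned int   hsm_nr_inflight;

    static bool hsm_restore_already_pending(const struct hsm_fid *fid)
    {
            unsigned int i;

            for (i = 0; i < hsm_nr_inflight; i++)
                    if (hsm_inflight[i].seq == fid->seq &&
                        hsm_inflight[i].oid == fid->oid)
                            return true;
            if (hsm_nr_inflight < HSM_MAX_INFLIGHT)
                    hsm_inflight[hsm_nr_inflight++] = *fid;
            return false;
    }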
If the coordinators handle separate subsets of the file system in a load-balanced manner (e.g. by hashing the FIDs), this might be a good way to scale ADM horizontally.

Some events need synchronous handling, principally to restore files. Lustre has not addressed how its adaptive timeout system works with HSM to let a client wait until file restoration is complete. (For another timeout issue see space management below.)

Omissions in the Lustre event management architecture are that (A) no mechanism was introduced to deliver the policy to all the coordinators (probably through the Lustre management server, using the standard configuration lock callbacks), and that (B) no attempt was made to define a policy language with a user interface. Lustre's logging of events seems very desirable, but the cancellation of log entries when they are no longer required has not yet been architected (and there are other consumers of the logs). Strengths of this architecture are: transactional management of events to eliminate scanning, kernel-based filtering close to the source, and scalability in all of its elements: initiators, coordinators and agents.

The ADM architecture has a DMAPI event handling mechanism, which is effectively single-node, and has not yet addressed the multi-server issues. ADM has a clear interface for managing policy, but the details of a policy language remain under discussion. ADM stores some events in a database in user space; Lustre retains them in kernel-generated logs.

HSM metadata

After discussion on this list we have decided to implement minimal metadata in the file system to locate archived copies and manage copy-in. This no longer differs from the ADM design, and it allows for flexible management of multiple stored versions and for searches among HSM objects to be performed in a database with suitable indexes (instead of through file system scans).

The key bits of the attributes are:
1. an indication that there is a copy in the archive
2. an indication that the copy in the archive is current
3. an indication that the file is being restored
4. an offset indicating what extent of the file has already been restored
5. file size and disk usage for use by stat(2) while the file is in the archive

The HSM database will hold:
1. dates, owners etc. associated with the file, for HSM policy (see below)
2. a mapping from a Lustre FID to a primary HSM object
3. a list of other HSM objects associated with the FID, which can be copied back into a new file in the file system
4. striping metadata, for use by restores as in (2) or by bare metal restores

This is similar in the two architectures, but Lustre will store this metadata transactionally on the node that manages the object to which the metadata is attached.

Policy - small files

The ADM policy manager (in user space) can instruct the work list generator to build an archive for a collection of small files. The archive can be transferred to tape instead of the individual small files. In the HSM database the FIDs of many small files will point to one object in the archive. Each of the FIDs in the file system will get its own EA to indicate that the file is archived; this is undesirable as it involves a secondary update of the small files. Lustre currently has no design for building a collection of small files. One key issue is that the list needs to be retained until enough small files are available to form an archival file of a good size; doing this in-kernel may be problematic.
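Since the small file question keeps coming up, here is a rough sketch of the kind of list accumulation this would require. The names (small_file_batch, batch_add, BATCH_TARGET_BYTES) and the target size are invented for illustration; as noted above, Lustre has no actual design for this yet, whether in kernel or user space.

    /* Rough sketch of accumulating small files until the batch is large
     * enough to form a reasonably sized archive object.  All names and
     * sizes are invented for illustration; Lustre has no design for this
     * yet. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    #define BATCH_TARGET_BYTES  (64ULL << 20)   /* e.g. aim for 64 MB archives */
    #define BATCH_MAX_FILES     4096

    struct small_file {
            uint64_t fid_seq;    /* identifies the file                 */
            uint64_t size;       /* bytes it contributes to the archive */
    };

    struct small_file_batch {
            struct small_file files[BATCH_MAX_FILES];
            size_t            count;
            uint64_t          bytes;
    };

    /* Add one newly created small file to the pending batch; return true
     * when the batch is full and should be handed to an agent as a single
     * archive job, after which all member FIDs point at the same archive
     * object in the HSM database. */
    static bool batch_add(struct small_file_batch *b, uint64_t fid_seq,
                          uint64_t size)
    {
            if (b->count < BATCH_MAX_FILES) {
                    b->files[b->count].fid_seq = fid_seq;
                    b->files[b->count].size = size;
                    b->count++;
                    b->bytes += size;
            }
            return b->bytes >= BATCH_TARGET_BYTES || b->count == BATCH_MAX_FILES;
    }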
Archiving many small files can be an Achilles heel for HSM systems. Lustre can distribute the archival work associated with rapid small file creation over many coordinating nodes. A single HSM database cannot scale in performance to keep up with multiple metadata servers creating new archival events, but multiple independent databases could easily be used and clustered just as the coordinators themselves are. If Lustre is coupled to ADM by delivering events from a Lustre agent to the ADM event manager, then small files can be managed by ADM. Kernel-based mechanisms to form filesets of small files to be archived might prove much more efficient. For Lustre's fileset architecture see http://arch.lustre.org/index.php?title=Fileset

Policy - space management

Lustre plans ad-hoc policy for space management. Based on a scan or on least-recently-used kernel log files, files can be selected for purging. Note that such scans rarely need refreshing; a single scan should yield candidates for purging for a long time. The log is efficient to maintain, and ZFS can likely search through its object tree to produce such lists. A better implementation would allow more flexible space management policies to be expressed, using the policy language for this. Both Lustre and ADM will typically migrate files before they need to be purged, which eliminates most performance issues during space management. Both systems have low and high water marks. However, when the file system is really full the space manager may have to invoke archiving. This should be done by asking the coordinator to archive certain files. (A rough sketch of such a space management pass appears at the end of this message.)

Policy - HSM side

A system like ADM will have interfaces to request action on the archived objects, e.g. remove objects archived before 2002. Lustre did not consider such policies, as it intentionally avoids a tight coupling to any particular HSM. This decision remains fine. An administrative interface is present in both systems to pre-stage files based on a work list. This is used to populate the file system with restored copies of all files required by certain jobs. A language to express pre-stage lists is desirable (including user-friendly syntax to state "restore all files in this directory").

Conclusions

1. There are a few issues with both architectures. I will not speak for the ADM project here.
2. There is an excellent opportunity to couple Lustre's HSM with ADM.
3. There is an interim, much simpler HSM architecture for Lustre that can work with ADM; see a separate future email to this list.
4. Lustre should address small file handling when not coupled to the ADM policy manager.
5. Lustre should define a policy language in relation to filtering and space management.
6. Lustre should define a central way to dispatch policy (from the MGS).
7. Lustre should enable adaptive timeouts to assist with "restore on demand" and "space management requires archiving" events.
8. Lustre should define release events for log entries and also manage log entry cancellation in the presence of many consumers.
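Finally, as promised under "Policy - space management", a rough sketch of a water-mark driven pass over least-recently-used purge candidates. The helper names (fs_bytes_used, purge_file_data, coordinator_request_archive) and the candidate structure are assumptions made for illustration only, not Lustre code.

    /* Rough sketch of a space management pass driven by low/high water
     * marks.  The helpers are assumed to be supplied by the surrounding
     * system; this illustrates the policy flow only. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    struct purge_candidate {
            uint64_t fid_seq;          /* identifies the file           */
            uint64_t bytes;            /* space recovered if purged     */
            bool     archive_current;  /* safe to purge without copying */
    };

    /* Assumed helpers, supplied elsewhere in this sketch. */
    extern uint64_t fs_bytes_used(void);
    extern void     purge_file_data(uint64_t fid_seq);
    extern void     coordinator_request_archive(uint64_t fid_seq);

    /* Walk the least-recently-used candidate list until usage drops below
     * the low water mark.  Files whose archive copy is current are purged
     * directly; for the rest the coordinator is asked to archive them
     * first, which is the "file system really full" case above. */
    static void space_manager_run(struct purge_candidate *lru, size_t n,
                                  uint64_t low_mark, uint64_t high_mark)
    {
            size_t i;

            if (fs_bytes_used() < high_mark)
                    return;                          /* nothing to do yet */

            for (i = 0; i < n && fs_bytes_used() > low_mark; i++) {
                    if (lru[i].archive_current)
                            purge_file_data(lru[i].fid_seq);
                    else
                            coordinator_request_archive(lru[i].fid_seq);
            }
    }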