(adding lustre-devel, dropping Bojanic from distro list; if anyone else wants off, let me know.)

Hua Huang and Andreas wrote:
> Nathan,
>
> Thanks for the write-up. A few questions and comments.
>
> SAM-QFS only runs on Solaris, so it is always remotely mounted on a Lustre client via a network connection, right?

QFS has a Linux native client (http://www.sun.com/download/products.xml?id=4429c1d1), so the copy nodes would be Linux nodes acting as clients for both Lustre and QFS. This would generally result in two network hops for the data, but by placing the clients on OST nodes and having the coordinator choose wisely, we can probably save one of the network hops most of the time. This may or may not be a good idea, depending on the load imposed on the OST. The copytool would also require us to pump the data from kernel to userspace and back, potentially resulting in significant bus loading. We could memory map the Lustre side.

> Nathaniel Rutman wrote:
>> Hi all -
>> So we all have a common starting point, I'm going to jump right in and describe the current plan for integrating Lustre's HSM feature (in development) with SAM-QFS and ADM.
>>
>> HSM for Lustre can be broken into two major components, both of which will live in userspace: the policy engine, which decides when files are archived (copied to (logical) tape), punched (removed from OSTs), or deleted; and the copytool, which moves file data to and from tape. A third component that we call the coordinator lives in kernel space and is responsible for relaying HSM requests to various client nodes.
>
> s/tape/the archive/

Yes, I knew my "(logical) tape" statement needed to be clarified :)

>> The policy engine collects filesystem info, maintains a database of files it is interested in, and makes archive and punch decisions that are then communicated back to Lustre. Note that the database is only used to make policy decisions, and is specifically _not_ a database of file/storage location information. Periodically, the policy engine gives a list of file identifiers and operations (via the coordinator) to any number of Lustre clients running copytools.
>
> This work will be done by CEA as part of the HPSS HSM solution.
> This work is generic in the sense that it could be SAM-QFS or any other tape backend on the remote side for archival, right?

Yes. The issue here is that the policy engine is a big part of the "brains" of the HSM, and could be a key differentiator for customers. That's why the ADM integration would likely replace the HPSS policy engine with ADM's Event Manager -- presumably we'll be able to get enhanced features by doing this. The actual benefits need to be investigated.

> Is it expected that a given copytool would be given multiple files to archive at one time? This would allow optimizing the archiving operations to e.g. aggregate small files into a single archive object, but would make identifying and extracting these files from the aggregate harder.

I do expect the coordinator to hand a list of files to each copytool. But SAM-QFS would actually handle small-file aggregation "underneath" the copytool itself; we don't have to worry about identification/extraction.

>> The copytool will take the list of files and perform the requested operation: archive, delete, or restore. (It is potentially possible to have finer-grained archive commands passed from the policy engine, e.g. archive_level_3.)
>> It will then copy the files off to tape/storage using whatever hardware/software-specific commands are necessary. Note that the file identifiers are opaque 16-byte strings. Files are requested using the same identifiers; "paths may change, but the fids remain the same" is the basic philosophy. The copytool may hash the fids into dirs/subdirs to relieve problems with a flat namespace, but this is invisible to Lustre. Having said that, additional information such as the full path name, EAs, etc. may be added by the copytool (using a tar wrapper, for example) for disaster recovery or striping recovery.
>>
>> The initial version of the copytool and policy engine will be written targeting HPSS, but it is likely that the SAM-QFS integration will use the same pieces. Perhaps calling it the "Lustre policy engine" would be more appropriate.
>
> So the initial version will be done by CEA as part of the HPSS.

Part of the "HPSS-compatible Lustre HSM solution", which is our initial target, yes.

> You mentioned other details above, which can be SAM-QFS specific?
> I am trying to figure out if the full version of the copytool used in Lustre/SAM-QFS integration will be implemented specifically for SAM-QFS from the Lustre side.

There are two items that I can think of that may be archive-specific:
1. hash the fids into dirs/subdirs to avoid a big flat namespace
2. inclusion of file extended attributes (EAs)
But in fact, I don't know enough about HPSS to say we don't need these items anyhow. CEA, can you comment? I think current versions of HPSS are able to store EAs automatically, and QFS is not, so that may be one difference.

>> Integration with SAM-QFS
>> The SAM policy engine is tightly tied to the QFS filesystem, and for this reason it is not possible to replace the HPSS policy engine with SAM. However, SAM policies could be layered in at the copytool level. The split as we envision it is this: the existing Lustre policy engine decides which files should be archived and punched and when, and SAM-QFS decides how and where to archive them. The copytool in this case
>
> SAM-QFS already does all these, i.e., "how and where".

Yes. SAM policies would likely have to be written without reference to specific filenames/directories, since that info will not be readily available. If this proves to be performance-limiting (maybe certain file extensions (.mpg) should be stored in a different manner than another (.txt)), then we can probably find a way to pass the full pathname through to SAM, but this would require SAM code changes.

>> is simply the unix "cp" command (or perhaps tar as mentioned above), which copies the file from the Lustre mount point to the QFS mount point on one (of many) clients that has both filesystems mounted. SAM-QFS's file staging and small-file aggregation (as well as parallel operation) would all be used "out of the box" to provide the best performance possible.
>
> The one thing that should be taken into account is that the files being moved from Lustre to SAM are losing the "age" information. This might cause SAM some heartburn, because all of the files being added will be considered "new" but there will be a large enough influx of files that it will need to archive and purge files within hours.
> It may be that the SAM copytool will need to be modified to allow it to pass on some "age" information (if that is something other than atime and mtime) so the SAM policy engine can treat these files sensibly. Alternately, it may be that the SAM copytool will need to be smart enough to mark the new files as "archive & purge immediately" in some manner.

We will just use cp -a to preserve timestamps, ownership, perms, etc.; I don't see what any additional age info could be. As to the heartburn problem, QFS has the disk cache as the first level of archive; as that fills, files are moved off to secondary storage automatically. We can adjust these watermarks to aggressively move files off to tape. If something backs up, the cp command will simply block. It would be nice to have some visibility when this situation occurs, but in fact it's not at all clear what we should do besides change our archiving policy. This is a general issue, not QFS specific.

> Again, SAM-QFS already does all of these. Correct?
> So no code changes are expected at the SAM-QFS side, right?

Correct. As I see it today, no SAM-QFS code changes are necessary, and the QFS copytool will likely be identical or almost identical to the HPSS copytool.

> For Lustre/SAM-QFS integration, could you point out specifically which area (in this write-up) can be done by U.Minn students?

I don't actually see any work to be done at this point. There's the pathname pass-through potential, but I'm not convinced it's at all necessary.

>> Integration with ADM
>> ADM's Event Manager would replace the HPSS policy engine. It would need some minor modifications to be integrated with the Lustre changelogs (instead of DMAPI) and the ioctl interface to the coordinator. It also produces a similar list of files and actions. The ADM core would be the copytool, consuming the list and sending files to tape. We would also need a bit of work to pass communications between ADM's Archive Information Manager and the policy engine and copytools. ADM integration is dependent upon having a Linux ADM implementation, or a Solaris Lustre implementation (potentially Lustre client only).
>>
>> Feel free to question, correct, criticize.
>> Nathan
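As a concrete illustration of the fid-hashing idea mentioned above (a copytool spreading archive objects over a directory tree instead of one flat directory), here is a minimal sketch in C. The archive root, the two-level fan-out, and the FID string format shown are assumptions for illustration only, not part of the design discussed in this thread.

    /* Sketch: map an opaque Lustre FID string onto a two-level directory
     * tree so the archive filesystem never sees one huge flat directory.
     * The archive root, fan-out, and FID format are illustrative
     * assumptions only. */
    #include <errno.h>
    #include <limits.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    #define ARCHIVE_ROOT "/archive/lustre_hsm"  /* hypothetical QFS mount */
    #define FANOUT       256                    /* 256 x 256 subdirectories */

    /* Simple string hash (djb2); any stable hash works here. */
    static unsigned long hash_str(const char *s)
    {
        unsigned long h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    /* Build "<root>/<hh>/<ll>/<fid>" from a FID string such as
     * "0x200000400:0x1a3:0x0" (illustrative format only). */
    static int fid_to_archive_path(const char *fid, char *out, size_t outlen)
    {
        unsigned long h = hash_str(fid);
        unsigned int d1 = h % FANOUT;
        unsigned int d2 = (h / FANOUT) % FANOUT;
        char dir[PATH_MAX];

        snprintf(dir, sizeof(dir), "%s/%02x", ARCHIVE_ROOT, d1);
        if (mkdir(dir, 0750) && errno != EEXIST)
            return -errno;
        snprintf(dir, sizeof(dir), "%s/%02x/%02x", ARCHIVE_ROOT, d1, d2);
        if (mkdir(dir, 0750) && errno != EEXIST)
            return -errno;
        snprintf(out, outlen, "%s/%s", dir, fid);
        return 0;
    }

    int main(int argc, char **argv)
    {
        char path[PATH_MAX];

        if (argc != 2) {
            fprintf(stderr, "usage: %s <fid>\n", argv[0]);
            return 1;
        }
        if (fid_to_archive_path(argv[1], path, sizeof(path)) == 0)
            printf("%s\n", path);
        return 0;
    }

With a fan-out of 256x256, even hundreds of millions of archive objects stay well below the per-directory counts that cause trouble on most filesystems, while the mapping remains purely a function of the FID.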
On Jan 22, 2009 12:46 -0800, Nathaniel Rutman wrote:
> QFS has a Linux native client, so the copy nodes would be Linux nodes acting as clients for both Lustre and QFS. This would generally result in two network hops for the data, but by placing the clients on OST nodes and having the coordinator choose wisely, we can probably save one of the network hops most of the time. This may or may not be a good idea, depending on the load imposed on the OST. The copytool would also require us to pump the data from kernel to userspace and back, potentially resulting in significant bus loading. We could memory map the Lustre side.

I was just wondering to myself if we couldn't make an optimized "cp" command that would do the work in the kernel and be able to use newer APIs like "splice", or just a read-write loop that avoids kernel-user-kernel data copies. Unfortunately, I don't think mmap IO is very fast with Lustre, or memcpy() from mmap Lustre to mmap QFS would give us a single memcpy() operation (which is the best I think we can do).

> There are two items that I can think of that may be archive-specific:
> 1. hash the fids into dirs/subdirs to avoid a big flat namespace
> 2. inclusion of file extended attributes (EAs)
> But in fact, I don't know enough about HPSS to say we don't need these items anyhow. CEA, can you comment?
> I think current versions of HPSS are able to store EAs automatically, and QFS is not, so that may be one difference.

I got a paper from CEA that indicated HPSS was going to implement (or may have already implemented) EA support, but it isn't at all clear if that version of the software would be available at all sites, since AFAIK it is relatively new.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
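Andreas's "optimized cp" idea could look roughly like the sketch below: on Linux, splice(2) moves data between a file descriptor and a pipe without staging it in a user-space buffer, so a copytool can chain two splice calls through a pipe. This is only a sketch under the assumption that the filesystems involved handle splice sensibly, which, as noted above, may not hold for Lustre or QFS; measuring it against a plain read/write loop would be the first step.

    /* Sketch: copy src to dst via a pipe using splice(2), avoiding an
     * explicit user-space data buffer.  Whether this is actually faster
     * than a plain read/write loop on Lustre/QFS is an open question. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static int splice_copy(const char *src, const char *dst)
    {
        int in = open(src, O_RDONLY);
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        int pfd[2];
        ssize_t n;

        if (in < 0 || out < 0 || pipe(pfd) < 0) {
            perror("setup");
            return -1;
        }
        for (;;) {
            /* source file -> pipe */
            n = splice(in, NULL, pfd[1], NULL, 1 << 20, SPLICE_F_MOVE);
            if (n <= 0)
                break;                      /* 0 = EOF, <0 = error */
            /* pipe -> destination file; drain whatever entered the pipe */
            while (n > 0) {
                ssize_t m = splice(pfd[0], NULL, out, NULL, n, SPLICE_F_MOVE);
                if (m <= 0) {
                    perror("splice out");
                    n = -1;
                    break;
                }
                n -= m;
            }
            if (n < 0)
                break;
        }
        close(pfd[0]); close(pfd[1]); close(in); close(out);
        return n < 0 ? -1 : 0;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
            return 1;
        }
        return splice_copy(argv[1], argv[2]) ? 1 : 0;
    }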
Nathan,

On Jan 22, 2009, at 2:46 PM, Nathaniel Rutman wrote:
>>> Integration with SAM-QFS
>>> The SAM policy engine is tightly tied to the QFS filesystem, and for this reason it is not possible to replace the HPSS policy engine with SAM. However, SAM policies could be layered in at the copytool level. The split as we envision it is this: the existing Lustre policy engine decides which files should be archived and punched and when, and SAM-QFS decides how and where to archive them. The copytool in this case
>>
>> SAM-QFS already does all these, i.e., "how and where".
>
> Yes. SAM policies would likely have to be written without reference to specific filenames/directories, since that info will not be readily available. If this proves to be performance-limiting (maybe certain file extensions (.mpg) should be stored in a different manner than another (.txt)), then we can probably find a way to pass the full pathname through to SAM, but this would require SAM code changes.

SAM supports classification policy rules for files: (1) the number of copies, up to 4; (2) where to put the copies (on which VSN pools - disk and/or tape, local and/or remote); and (3) when to make the copies (time-based archiving). You specify the policy in the archiver.cmd file. You can group files for a policy rule by pathname, owner, group, size, wildcard, and access time.

>>> is simply the unix "cp" command (or perhaps tar as mentioned above), which copies the file from the Lustre mount point to the QFS mount point on one (of many) clients that has both filesystems mounted. SAM-QFS's file staging and small-file aggregation (as well as parallel operation) would all be used "out of the box" to provide the best performance possible.
>>
>> The one thing that should be taken into account is that the files being moved from Lustre to SAM are losing the "age" information. This might cause SAM some heartburn, because all of the files being added will be considered "new" but there will be a large enough influx of files that it will need to archive and purge files within hours.
>>
>> It may be that the SAM copytool will need to be modified to allow it to pass on some "age" information (if that is something other than atime and mtime) so the SAM policy engine can treat these files sensibly. Alternately, it may be that the SAM copytool will need to be smart enough to mark the new files as "archive & purge immediately" in some manner.

There is an option to release files from the disk cache after all archive copies have been made. You may want to set this in the archiver.cmd file. The releasing is done automatically. It depends on how you are going to use SAM: if it is just for backup, then, yes, set this. However, in your mail above you are also managing your disk cache; in that case it will be faster to retrieve files that are in the disk cache.

This brings up the question of restore. In case of a Lustre disk failure, how are you going to restore your Lustre file system?

> We will just use cp -a to preserve timestamps, ownership, perms, etc.; I don't see what any additional age info could be. As to the heartburn problem, QFS has the disk cache as the first level of archive; as that fills, files are moved off to secondary storage automatically. We can adjust these watermarks to aggressively move files off to tape. If something backs up, the cp command will simply block.
> It would be nice to have some visibility when this situation occurs, but in fact it's not at all clear what we should do besides change our archiving policy. This is a general issue, not QFS specific.

You will want to set your disk cache thresholds based on the rate of influx of data into the disk cache. We default to a high watermark of 80% and a low of 70%, which means that when the disk cache reaches 80%, we release the oldest archived files until the disk cache reaches 70%. Some of our oil customers set the thresholds to 60%/50% because of the heavy influx. Of course, if SAM does reach 100%, we block the writers until we have space, so this is transparent to the application.

>> Again, SAM-QFS already does all of these. Correct?
>> So no code changes are expected at the SAM-QFS side, right?
>
> Correct. As I see it today, no SAM-QFS code changes are necessary, and the QFS copytool will likely be identical or almost identical to the HPSS copytool.

Agree. I don't see any SAM-QFS code changes required. The Lustre copytool will write to HPSS using the HPSS APIs and write to SAM-QFS with an ftp or pftp interface. This requires minimal changes.

>> For Lustre/SAM-QFS integration, could you point out specifically which area (in this write-up) can be done by U.Minn students?
>
> I don't actually see any work to be done at this point. There's the pathname pass-through potential, but I'm not convinced it's at all necessary.

I do see work to switch the HPSS APIs to ftp or pftp. If this is already supported by HPSS, then, yes, no changes are required.

- Harriet

Harriet G. Coverston
Solaris, Storage Software               | Email: harriet.coverston at sun.com
Sun Microsystems, Inc.                  | AT&T: 651-554-1515
1270 Eagan Industrial Rd., Suite 160    | Fax: 651-554-1540
Eagan, MN 55121-1231
Looks like HPSS will support EAs in 7.1.2.0, June 2009. I have asked Vicky here at ORNL to dig a bit into what the EA features will look like. Do we have a set of requirements for EAs for HSM integration?

- Galen

-----Original Message-----
From: Andreas.Dilger at sun.com on behalf of Andreas Dilger
Sent: Thu 1/22/2009 5:55 PM
To: Nathaniel Rutman
Cc: Hua Huang; lustre-hsm-core-ext at sun.com; lustre-devel at lists.lustre.org; Karen Jourdenais; Erica Dorenkamp; Harriet.Coverston at sun.com; Rick Matthews; karl at tacc.utexas.edu
Subject: Re: SAM-QFS, ADM, and Lustre HSM

[...]
Nathaniel and all,

Thanks for putting this together. Having a mover to put data into QFS is a great idea, and it can easily use the QFS Linux client. I don't think you would necessarily get QFS policy for native Lustre files unless the "moved" files retained the Lustre attributes from which you want policy decisions made. There may be ways to do this. You would automatically gain the file gathering of QFS and its efficient tape handling. I also think there is an "archive then release (purge)" policy that can be established.

The applicable Lustre namespace would be essentially duplicated in the QFS space, and (I think) QFS classification and policy occur on that namespace. Doing so gives you access to rich QFS policy. This also allows QFS to migrate data to/from archive media without I/O or compute load on any Linux clients.

--
---------------------------------------------------------------------
Rick Matthews                      email: Rick.Matthews at sun.com
Sun Microsystems, Inc.             phone: +1 (651) 554-1518
1270 Eagan Industrial Road         phone (internal): 54418
Suite 160                          fax: +1 (651) 554-1540
Eagan, MN 55121-1231 USA           main: +1 (651) 554-1500
---------------------------------------------------------------------
On Jan 23, 2009 13:02 -0600, Rick Matthews wrote:
> Having a mover to put data into QFS is a great idea, and it can easily use the QFS Linux client. I don't think you would necessarily get QFS policy for native Lustre files unless the "moved" files retained the Lustre attributes from which you want policy decisions made.

There will not necessarily be HSM policy data stored with every file from Lustre, though there is a desire to store Lustre layout data in the archive. Is it possible to store extended attributes with each file in QFS?

> The applicable Lustre namespace would be essentially duplicated in the QFS space, and (I think) QFS classification and policy occur on that namespace. Doing so gives you access to rich QFS policy. This also allows QFS to migrate data to/from archive media without I/O or compute load on any Linux clients.

The current Lustre HSM design will not export any of the filesystem namespace to the archive, so that we don't have to track renames in the archive. The archive objects will only be identified by a Lustre FID (128-bit file identifier). IIRC, the HSM-specific copytool would be given the file name (though not necessarily the full pathname) in order to perform the copyout, but the filesystem will be retrieving the file from the archive by FID. Nathan, can you confirm that is right?

Does QFS have name-based policies? Are these policies only on the filename, or on the whole pathname?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Jan 23, 2009 10:46 -0600, Harriet G. Coverston wrote:
> SAM supports classification policy rules for files: (1) the number of copies, up to 4; (2) where to put the copies (on which VSN pools - disk and/or tape, local and/or remote); and (3) when to make the copies (time-based archiving). You specify the policy in the archiver.cmd file. You can group files for a policy rule by pathname, owner, group, size, wildcard, and access time.
>
> This brings up the question of restore. In case of a Lustre disk failure, how are you going to restore your Lustre file system?

The initial HSM implementation is focussed mainly on the space management issues, rather than backup/restore, though of course there is a lot of overlap between the two and we have discussed backup aspects in the past.

There are two main issues that would need to be addressed:

- a Lustre-level policy on the minimum file size that should be sent to the archive. For Lustre, there would be minimal space savings if a small file is moved to the archive, so that would only be useful in the archive-as-backup case.

  We would need to decide whether the HPSS implementation can/should handle aggregating multiple small files into a single archive object. I think that is useful, and this is one reason I advocate being able to pass multiple files at once from the coordinator to the agent.

- since the archive does not contain a copy of the namespace (it only has 128-bit FIDs as identifiers for the files), we would need to make a separate backup of the MDS filesystem (which is all namespace). There are already several mechanisms to do this, either using the ext2 "dump" program to read from the raw device, or making an LVM snapshot and using e.g. tar to make a filesystem-level backup. Both of these need to include a backup of the extended attributes.

> Agree. I don't see any SAM-QFS code changes required. The Lustre copytool will write to HPSS using the HPSS APIs and write to SAM-QFS with an ftp or pftp interface. This requires minimal changes.

We weren't thinking of using an FTP interface to SAM, though I guess this is possible. Rather, we were thinking of just mounting both QFS and Lustre on a Linux client and using "cp" or an equivalent tool. Depending on the performance requirements, it might make sense to use a smarter tool that avoids the kernel-user-kernel memory copies.

> I do see work to switch the HPSS APIs to ftp or pftp. If this is already supported by HPSS, then, yes, no changes are required.

I think CEA is planning on writing a copytool using the HPSS APIs directly. There is also "htar", which is a tar-like interface to HPSS, but I don't think that was anyone's intention to use.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
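The minimum-file-size policy mentioned above is easy to picture as a filter in front of the copytool list: a stat() check drops files below some threshold before they are ever handed out for archiving. The 1 MB value below is purely an assumed example, not a number from this discussion, and the line-per-path interface is a hypothetical one chosen for the sketch.

    /* Sketch: filter an archive candidate list by size.  Reads one path
     * per line on stdin and echoes only regular files at least
     * MIN_ARCHIVE_SIZE bytes long.  The threshold is an arbitrary
     * example value, not part of any agreed policy. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>

    #define MIN_ARCHIVE_SIZE (1024 * 1024)  /* 1 MB, illustrative only */

    int main(void)
    {
        char line[4096];
        struct stat st;

        while (fgets(line, sizeof(line), stdin)) {
            line[strcspn(line, "\n")] = '\0';       /* strip newline */
            if (stat(line, &st) == 0 && S_ISREG(st.st_mode) &&
                st.st_size >= MIN_ARCHIVE_SIZE)
                printf("%s\n", line);
        }
        return 0;
    }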
On Jan 23, 2009 12:39 -0500, Shipman, Galen M. wrote:
> Looks like HPSS will support EAs in 7.1.2.0, June 2009.
> I have asked Vicky here at ORNL to dig a bit into what the EA features will look like. Do we have a set of requirements for EAs for HSM integration?

As yet we don't have a hard requirement for EAs in HSM. We would ideally keep the LOV EA for the file layout in the HSM, so that the file gets (approximately) the same layout when it is restored. This is only really needed for files that were not allocated using the default layout, and we might consider saving e.g. "stripe over all OSTs" instead of "stripe over N OSTs", so that if the number of OSTs increases between when the file was archived and when it is restored, the new file gets the full performance.

In the absence of EAs in the HSM we could fall back to using a tar file format that supports EAs (as in RHEL5.x and star) to store the layout information. We are also considering keeping the layout information on the MDS, but that doesn't help in the "backup" use case where the file was deleted or the MDS is lost.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas Dilger wrote:
> On Jan 23, 2009 10:46 -0600, Harriet G. Coverston wrote:
>> SAM supports classification policy rules for files: (1) the number of copies, up to 4; (2) where to put the copies (on which VSN pools - disk and/or tape, local and/or remote); and (3) when to make the copies (time-based archiving). You specify the policy in the archiver.cmd file. You can group files for a policy rule by pathname, owner, group, size, wildcard, and access time.

My point about this is that files will be stored using the FID as the file name, so name-based policies at the copytool level are worthless. Unless we a.) add the path/filename back to the file (EA?), and b.) modify the SAM policy engine to use the "real" path/filename instead of the FID.

>> This brings up the question of restore. In case of a Lustre disk failure, how are you going to restore your Lustre file system?
>
> ...
>
> - since the archive does not contain a copy of the namespace (it only has 128-bit FIDs as identifiers for the files), we would need to make a separate backup of the MDS filesystem (which is all namespace). There are already several mechanisms to do this, either using the ext2 "dump" program to read from the raw device, or making an LVM snapshot and using e.g. tar to make a filesystem-level backup. Both of these need to include a backup of the extended attributes.

Or include the path/filename in each file, and the restore process uses this to repopulate the filesystem.

>> Agree. I don't see any SAM-QFS code changes required. The Lustre copytool will write to HPSS using the HPSS APIs and write to SAM-QFS with an ftp or pftp interface. This requires minimal changes.
>
> We weren't thinking of using an FTP interface to SAM, though I guess this is possible. Rather, we were thinking of just mounting both QFS and Lustre on a Linux client and using "cp" or an equivalent tool.

Harriet already knew this, she just forgot :)
Andreas Dilger wrote:
> On Jan 23, 2009 13:02 -0600, Rick Matthews wrote:
>> Having a mover to put data into QFS is a great idea, and it can easily use the QFS Linux client. I don't think you would necessarily get QFS policy for native Lustre files unless the "moved" files retained the Lustre attributes from which you want policy decisions made.
>
> There will not necessarily be HSM policy data stored with every file from Lustre, though there is a desire to store Lustre layout data in the archive. Is it possible to store extended attributes with each file in QFS?

We can always store EAs, either natively or as "poor-man's EAs" via mini-tarballs.

>> The applicable Lustre namespace would be essentially duplicated in the QFS space, and (I think) QFS classification and policy occur on that namespace. Doing so gives you access to rich QFS policy. This also allows QFS to migrate data to/from archive media without I/O or compute load on any Linux clients.
>
> The current Lustre HSM design will not export any of the filesystem namespace to the archive, so that we don't have to track renames in the archive. The archive objects will only be identified by a Lustre FID (128-bit file identifier). IIRC, the HSM-specific copytool would be given the file name (though not necessarily the full pathname) in order to perform the copyout, but the filesystem will be retrieving the file from the archive by FID. Nathan, can you confirm that is right?

There is a mechanism to get the current full pathname for a given fid from userspace, so an HSM-specific copytool could find it out, but a central tenet of the design here is that, as far as the HSM is concerned, the entire Lustre FS is a flat namespace of FIDs. You can get a full pathname if you want to for catastrophe recovery, but Lustre itself will only speak to the HSM with FIDs. As I said in the other email, although SAM-QFS can do name-based policies, the "name" as far as QFS is concerned is just the FID, so name-based policies at the copytool level are worthless. Unless we a.) add the path/filename back to the file (EA, or use a tarball wrapper), and b.) modify the SAM policy engine to use the "real" path/filename instead of the FID.

But in the bigger-picture sense, note that all this is simply an optimization to allow SAM-QFS filename-based policies, which ultimately only influences where SAM-QFS stores files, not whether or when the files are archived by Lustre. These "top-level" policy decisions are made by the Lustre policy manager, and so perhaps there is no real need to spend any effort getting b.) above working. Note that a.) is still useful for disaster recovery.
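One minimal variant of the "poor-man's EA" idea above (simpler than a full tar wrapper) is for the copytool to read the striping xattr from the Lustre file and drop it next to the archived data object as a sidecar, so a restore or disaster-recovery pass can recreate a similar layout. The xattr name "lustre.lov" is the striping attribute as exposed on Lustre clients, but treat the exact name, the ".lov" sidecar convention, and the fixed buffer size as assumptions for this sketch rather than settled design.

    /* Sketch: store the Lustre striping xattr as a sidecar file beside
     * the archived copy.  Xattr name, sidecar naming, and buffer size
     * are illustrative assumptions only. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/xattr.h>

    #define LOV_XATTR "lustre.lov"   /* striping layout as seen on a client */

    static int save_layout_sidecar(const char *lustre_path,
                                   const char *archive_path)
    {
        char buf[4096];
        char sidecar[4096];
        ssize_t len;
        FILE *f;

        len = getxattr(lustre_path, LOV_XATTR, buf, sizeof(buf));
        if (len < 0) {
            perror("getxattr");      /* no layout EA: nothing to save */
            return -1;
        }

        snprintf(sidecar, sizeof(sidecar), "%s.lov", archive_path);
        f = fopen(sidecar, "wb");
        if (!f) {
            perror("fopen");
            return -1;
        }
        if (fwrite(buf, 1, (size_t)len, f) != (size_t)len) {
            fclose(f);
            return -1;
        }
        fclose(f);
        return 0;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <lustre_file> <archive_file>\n",
                    argv[0]);
            return 1;
        }
        return save_layout_sidecar(argv[1], argv[2]) ? 1 : 0;
    }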
Andreas,

On Jan 26, 2009, at 1:47 PM, Andreas Dilger wrote:
> On Jan 23, 2009 10:46 -0600, Harriet G. Coverston wrote:
>> [...]
>> This brings up the question of restore. In case of a Lustre disk failure, how are you going to restore your Lustre file system?
>
> The initial HSM implementation is focussed mainly on the space management issues, rather than backup/restore, though of course there is a lot of overlap between the two and we have discussed backup aspects in the past.

I do see below that you are dumping your metadata, which maps the 128-bit FIDs to the full pathnames. In this case, you would be able to restore your Lustre file system from the archive. If you don't have this restore feature, then you would just be using the archive as a disk extender and you would also need a conventional backup.

> There are two main issues that would need to be addressed:
>
> - a Lustre-level policy on the minimum file size that should be sent to the archive. For Lustre, there would be minimal space savings if a small file is moved to the archive, so that would only be useful in the archive-as-backup case.
>
>   We would need to decide whether the HPSS implementation can/should handle aggregating multiple small files into a single archive object.

Last I knew, they still don't build a container for small files. They write a tape mark between each file. This means they are starting/stopping the tape for small files. A lot of sites use SRB, which builds a tar container.

>   I think that is useful, and this is one reason I advocate being able to pass multiple files at once from the coordinator to the agent.

If you decide to build a container, then that will work for both HPSS and SAM.

> - since the archive does not contain a copy of the namespace (it only has 128-bit FIDs as identifiers for the files), we would need to make a separate backup of the MDS filesystem (which is all namespace). There are already several mechanisms to do this, either using the ext2 "dump" program to read from the raw device, or making an LVM snapshot and using e.g. tar to make a filesystem-level backup. Both of these need to include a backup of the extended attributes.
>
>> Agree. I don't see any SAM-QFS code changes required. The Lustre copytool will write to HPSS using the HPSS APIs and write to SAM-QFS with an ftp or pftp interface. This requires minimal changes.
>
> We weren't thinking of using an FTP interface to SAM, though I guess this is possible. Rather, we were thinking of just mounting both QFS and Lustre on a Linux client and using "cp" or an equivalent tool. Depending on the performance requirements, it might make sense to use a smarter tool that avoids the kernel-user-kernel memory copies.

Yes, we support Linux clients, and you can use the datamover architecture. You benefit from direct access to the storage from both the Lustre file system and the SAM file system, with no OTW performance penalty. I would not recommend cp since it is mmap I/O (on Solaris; not sure about Linux). You will want to use direct I/O to avoid the useless data copy.
If you use ftp/pftp/gridftp, that is just a loopback move on the datamover(s); however, any standard file system interface will work to SAM.

>> I do see work to switch the HPSS APIs to ftp or pftp. If this is already supported by HPSS, then, yes, no changes are required.
>
> I think CEA is planning on writing a copytool using the HPSS APIs directly. There is also "htar", which is a tar-like interface to HPSS, but I don't think that was anyone's intention to use.

If they decide to use the non-standard HPSS APIs, then yes, there would be changes required to use a standard file system interface for SAM.

Best regards,
- Harriet

Harriet G. Coverston
Solaris, Storage Software               | Email: harriet.coverston at sun.com
Sun Microsystems, Inc.                  | AT&T: 651-554-1515
1270 Eagan Industrial Rd., Suite 160    | Fax: 651-554-1540
Eagan, MN 55121-1231
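Harriet's direct-I/O suggestion could be sketched as the loop below: both files opened with O_DIRECT and a page-aligned buffer, so the page cache is bypassed on each side of the copy. O_DIRECT alignment and size restrictions vary by filesystem (and a non-aligned tail usually needs a buffered fallback), so this is strictly a sketch of the idea, not a drop-in replacement for cp.

    /* Sketch: a direct-I/O copy loop.  O_DIRECT bypasses the page cache
     * at the cost of alignment requirements; whether Lustre and QFS
     * accept these exact flags and sizes is filesystem-dependent. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BUF_SIZE (4 * 1024 * 1024)   /* 4 MB, multiple of page size */

    static int direct_copy(const char *src, const char *dst)
    {
        int in = open(src, O_RDONLY | O_DIRECT);
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
        void *buf = NULL;
        ssize_t n;
        int rc = 0;

        if (in < 0 || out < 0 || posix_memalign(&buf, 4096, BUF_SIZE)) {
            perror("setup");
            return -1;
        }
        while ((n = read(in, buf, BUF_SIZE)) > 0) {
            /* Note: a partial tail that is not a multiple of the required
             * alignment may need a non-O_DIRECT fallback on some systems. */
            if (write(out, buf, n) != n) {
                perror("write");
                rc = -1;
                break;
            }
        }
        if (n < 0)
            rc = -1;
        free(buf);
        close(in);
        close(out);
        return rc;
    }

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
            return 1;
        }
        return direct_copy(argv[1], argv[2]) ? 1 : 0;
    }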
Nathan, On Jan 26, 2009, at 4:13 PM, Nathaniel Rutman wrote:> Andreas Dilger wrote: >> On Jan 23, 2009 13:02 -0600, Rick Matthews wrote: >> >>> Having a mover to put data into QFS is a great idea, and can >>> easily use the QFS Linux client. I don''t think you would >>> necessarily get QFS >>> policy for native Lustre files unless the "moved" files retained the >>> Lustre attributes, from which you want policy decisions made. >>> >> >> There will not necessarily be HSM policy data stored with every file >> from Lustre, though there is a desire to store Lustre layout data in >> the archive. Is it possible to store extended attributes with each >> file in QFS? >> > We can always store EA''s, either natively or "poor-man''s EA''s" via > mini-tarballs. >> >>> The applicable Lustre namespace would be essentially duplicated in >>> the >>> QFS space, and (I think) QFS classification and policy occur on >>> that name space. Doing so gives you access to rich QFS policy. >>> This also >>> allows QFS to migrate data to/from archive media without I/O or >>> compute load on any Linux clients. >>> >> >> The current Lustre HSM design will not export any of the filesystem >> namespace to the archive, so that we don''t have to track renames in >> the archive. The archive objects will only be identified by a Lustre >> FID (128-bit file identifier). IIRC, the HSM-specific copy tool >> would >> be given the file name (though not necessarily the full pathname) in >> order to perform the copyout, but the filesystem will be retrieving >> the >> file from the archive by FID. Nathan, can you confirm that is right? >> > There is a mechanism to get the current full pathname for a given > fid from userspace, so an HSM-specific copytool could find it out, > but a central tenet of the design here is that as far as the HSM is > concerned, the entire Lustre FS is a flat namespace of FIDs.Be careful here. We are a file system. We don''t have a limit on # of files in one directory, but we don''t recommend more than 500,000 files in one single directory or you will start to see some performance problems. You will have to create a tree, not use a flat namespace.> You can get a full pathname if you want to for catastrophe > recovery, but Lustre itself will only speak to the HSM with FIDs. > As I said in the other email, although SAM-QFS can do name-based > policies, the "name" as far as QFS is concerned is just the FID, so > name-based policies at the copytool level are worthless. Unless we > a.) add the path/filename back to the file (EA, or use a tarball > wrapper), and b.) modify the SAM policy engine to use the "real" > path/filename instead of the FID.Currently, we don''t support policy using EA (extended attributes are in 5.0). We have had lots of requests for this, especially from our digital preservation customers.> > > But in the bigger picture sense, note that all this is simply an > optimization to allow SAM-QFS filename-based policies, which > ultimately only influences where SAM-QFS stores files, not whether > or when the files are archived by Lustre. These "top-level" policy > decisions are made by the Lustre policy manager, and so perhaps > there is no real need to spend any effort getting b.) above > working. Note that a.) is still useful for disaster recovery.Agree. We have lots of customer with only one archive set. This means all files are archived with the same policy -- very simple. - Harriet Harriet G. Coverston Solaris, Storage Software | Email: harriet.coverston at sun.com Sun Microsystems, Inc. 
Hi,

AFAIK, the HPSS distribution includes pftp and gridftp support (also available here for download).

At CEA, we are using our own copytool that directly uses the HPSS API. It already exists and has been in production for years. I think only a few modifications will be needed to adapt it to the Lustre-HSM purpose (basically, add a fid <-> HSM id mapping and backup of attributes, path, stripe...).

Thomas
CEA/DAM

Andreas Dilger wrote:
> I do see work to switch the HPSS APIs to ftp or pftp. If this is
> already supported by HPSS, then, yes, no changes are required.
> I think CEA is planning on writing a copytool using the HPSS APIs
> directly. There is also "htar" which is a tar-like interface to
> HPSS, but I don't think that was anyone's intention to use.
>
> Cheers, Andreas
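For readers less familiar with what that adaptation implies, here is a minimal sketch (Python with sqlite3; the table and column names are purely illustrative and are not CEA's copytool) of the kind of fid <-> HSM id mapping plus attribute backup Thomas is describing:

    # Sketch only: a fid <-> HSM-id mapping with backed-up attributes, of the
    # kind a copytool adaptation would need. Schema and names are illustrative.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE hsm_map (
            lustre_fid  TEXT PRIMARY KEY,   -- opaque 16-byte FID, as text
            hsm_id      TEXT NOT NULL,      -- identifier of the object in HPSS
            path        TEXT,               -- pathname at archive time (recovery only)
            stripe_ea   BLOB,               -- saved striping/layout attributes
            archived_at INTEGER
        )""")

    def record_archive(fid, hsm_id, path, stripe_ea, when):
        conn.execute("INSERT OR REPLACE INTO hsm_map VALUES (?, ?, ?, ?, ?)",
                     (fid, hsm_id, path, stripe_ea, when))

    record_archive("0x200000401:0x5:0x0", "/hpss/lustre/00/0x200000401_0x5_0x0",
                   "/lustre/scratch/foo", b"\x00" * 56, 1233360000)
    print(conn.execute("SELECT hsm_id FROM hsm_map WHERE lustre_fid=?",
                       ("0x200000401:0x5:0x0",)).fetchone())

The only requirement the thread actually imposes is that the mapping is keyed by the Lustre FID and that the backed-up path/stripe data is never needed in normal operation, only for recovery.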
Comments on several messages; I am slowly catching up.> I do see work to switch the HPSS APIs to ftp or pftp. If this is > already supported by HPSS, then, yes, no changes are required. >HPSS supports ftp and pftp. However, this seems to be a moot point as Thomas points out that CEA is using the HPSS client API library for their copy tool:> At CEA, we are using our own copytool that directly uses HPSS API. > This already exists and is in production for years. > I think there will be few modifications to adapt it to Lustre-HSM purpose > (basically, add fid <-> HSM id mapping and backup of attributes, path, > stripe...)> There is also "htar" which is a tar-like interface to > HPSS, but I don''t think that was anyone''s intention to use.htar is a well proven and valuable tool for aggregation to HPSS. It is widely used at HPSS sites as a stand-alone utility and has been incorporated into other interfaces.> Looks like HPSS will support EA in 7.1.2.0, June 2009 > I have asked Vicky here at ORNL to dig a bit into what the EA features will look like. >The last draft of this design I saw was from November. Work on this is picking up right now and has been bumped to a high priority, due for release this June, as Galen says. I am trying to find out if there is a later design and how much about it I can share.> Do we have a set of requirements for EAs for HSM integration?I never saw an answer to Galen''s question above; did I miss it? Now is the time to speak up if we need to influence the design of the HPSS EAs.> > We would need to decide whether the HPSS implementation can/should > > handle aggregating multiple small files into a single archive object. > > I think that is useful, and this is one reason I advocate being able > > to pass multiple files at once from the coordinator to the agent. > Last I knew, they still don''t build a container for small files. They > write > a tape mark between each file. This means they are start/stopping the > tape for small files. A lot of sites use SRB which builds a tar container.As of HPSS 7.1, we build a container for small files before copying them to tape. It''s called Tape Aggregation and we call the container an aggregate. Tape Aggregation is controlled via the HPSS migration policy, where the sysadm can configure whether or not to aggregate, the minimum and maximum files to place in each aggregate, and the maximum size of each aggregate. Vicky
>> Looks like HPSS will support EA in 7.1.2.0, June 2009
>> I have asked Vicky here at ORNL to dig a bit into what the EA
>> features will look like.
>
> The last draft of this design I saw was from November. Work on this
> is picking up right now and has been bumped to a high priority, due
> for release this June, as Galen says. I am trying to find out if
> there is a later design and how much about it I can share.

There is a more recent draft, though the main change seems to be renaming "Extended Attributes" to "User Defined Attributes" (UDAs).

The gist of the current draft is that a new database table would be added to the HPSS schema, consisting of two columns: an object ID and an XML document. The XML document would define all the UDAs for some HPSS name space object (file, directory, symlink, hard link, etc.) in some key/value format. It would take advantage of the new capability in version 9 of DB2 of handling XML columns and being able to index and query them as XML, not just as a text string. The object ID column of the new table would hold the ID of the HPSS name space object to which the extended attribute(s) apply.

The design is intended to handle small UDAs, up to 512 bytes in length for the total XML document, in order to be able to store the data in the same row; larger documents will be accepted but would have to be stored in a large object (LOB) area external to the main table, reducing efficiency. This is something to keep in mind if we start talking about putting full (or even relative) pathnames in as UDAs.

I understand that the CEA folks have a copy of this draft of the design and are in communication with its authors.

Vicky
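To make the 512-byte constraint concrete, here is a minimal sketch (Python; the element and attribute names are entirely hypothetical, since the real UDA schema is defined by the HPSS draft and not by this list) that builds a key/value XML document for one archive object and checks whether it would fit inline or spill into a LOB:

    # Sketch only: hypothetical UDA layout, not the actual HPSS 7.2 schema.
    import xml.etree.ElementTree as ET

    HPSS_UDA_INLINE_LIMIT = 512  # bytes for the whole XML document, per the draft

    def build_uda_doc(attrs):
        """Build a flat key/value XML document for one HPSS namespace object."""
        root = ET.Element("uda")                  # hypothetical root element
        for key, value in attrs.items():
            ET.SubElement(root, "attr", name=key).text = value
        return ET.tostring(root, encoding="utf-8")

    doc = build_uda_doc({
        "lustre_fid": "0x200000400:0x2:0x0",      # Lustre FID rendered as text
        "stripe_count": "4",
        "path": "/lustre/scratch/user/model/output-000123.dat",
    })

    print(len(doc), "bytes;",
          "inline" if len(doc) <= HPSS_UDA_INLINE_LIMIT else "spills to a LOB")

Even a modest pathname plus a few layout fields eats a large fraction of the 512 bytes, which is exactly the concern about putting pathnames into UDAs rather than into the archive object itself.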
Andreas Dilger wrote:> On Jan 23, 2009 12:39 -0500, Shipman, Galen M. wrote: > >> Looks like HPSS will support EA in 7.1.2.0, June 2009 >> I have asked Vicky here at ORNL to dig a bit into what the EA features will >> look like. Do we have a set of requirements for EAs for HSM integration? >> > > As yet we don't have a hard requirement for EAs in HSM. We would ideally > keep the LOV EA for the file layout in the HSM, so that the file gets > (approximately) the same layout when it is restored. This is only really > needed for files that were not allocated using the default layout, and > we might consider saving e.g. "stripe over all OSTs" instead of "stripe > over N OSTs" so that if the number of OSTs increases from when the file > was archived until it is restored the new file gets the full performance. >Also, if for some reason the number of OSTs decreases, the stripe all could just use the new available value.>> I got a paper from CEA that indicated HPSS was going to (or may have >> already) implemented EA support, but it isn't at all clear if that >> version of software would be available at all sites, since AFAIK it >> is relatively new. >> >> Cheers, Andreas >>This version of the HPSS software with the EA support (now called UDA for User Defined Attributes) will be available in the baseline HPSS code, available to all sites. Target availability date is summer 2009. Vicky
>> Looks like HPSS will support EA in 7.1.2.0, June 2009
>> I have asked Vicky here at ORNL to dig a bit into what the EA
>> features will look like.
>
> The last draft of this design I saw was from November. Work on this
> is picking up right now and has been bumped to a high priority, due
> for release this June, as Galen says. I am trying to find out if
> there is a later design and how much about it I can share.

I just realized the June date is internal. That's when HPSS developers are to have their code unit tested. After that comes integration and system testing. This feature will likely not be released until around September.

Vicky
LEIBOVICI Thomas wrote:
> At CEA, we are using our own copytool that directly uses HPSS API.
> This already exists and is in production for years.
> I think there will be few modifications to adapt it to Lustre-HSM purpose
> (basically, add fid <-> HSM id mapping and backup of attributes, path,
> stripe...)

So then the QFS copytool will indeed be a new tool, and should be scheduled accordingly. Features:
1. "cp --preserve"-like functionality (include metadata attributes in the copy)
2. add EAs (create mini-tarball)
3. implement FID hash to subdivide the namespace (see the sketch below)
4. periodic status reporting (via ioctl on the file)

Harriet G. Coverston wrote:
>> There is a mechanism to get the current full pathname for a given fid
>> from userspace, so an HSM-specific copytool could find it out, but a
>> central tenet of the design here is that as far as the HSM is
>> concerned, the entire Lustre FS is a flat namespace of FIDs.
>
> Be careful here. We are a file system. We don't have a limit on # of
> files in one directory, but we don't recommend more than 500,000 files
> in one single directory or you will start to see some performance
> problems. You will have to create a tree, not use a flat namespace.

Yes, a tree based on a hash of the fid.

The other option is to use the actual filename for storage, but from Lustre's point of view this gets extremely tricky. For example: send /foo/bar to the archive. Client A opens /foo/bar. Client B renames /foo/bar to /abc/xyz, but this change hasn't propagated to the archive yet. Client A now tries to read its open file handle, which tells Lustre to read the offline file FID 123, which it translates to /abc/xyz currently, which the archive doesn't know about yet. Not just xyz -- renames on any ancestor path element cause similar misses. Since the FID remains constant throughout the life of a file, we don't have to worry about any namespace changes (file or parents).

If there were an alternate way of bypassing the archive's namespace to directly access a file, we could conceivably store e.g. an archive-specific identifier within the Lustre stripe EA, and pass this down to the copytool when reading an offline file, but this presupposes that such a thing exists, is of reasonable size, has a userspace method to access it, etc.

>> You can get a full pathname if you want to for catastrophe recovery,
>> but Lustre itself will only speak to the HSM with FIDs.
>> As I said in the other email, although SAM-QFS can do name-based
>> policies, the "name" as far as QFS is concerned is just the FID, so
>> name-based policies at the copytool level are worthless. Unless we
>> a.) add the path/filename back to the file (EA, or use a tarball
>> wrapper), and b.) modify the SAM policy engine to use the "real"
>> path/filename instead of the FID.
>
> Currently, we don't support policy using EA (extended attributes are
> in 5.0). We have had lots of requests for this, especially from our
> digital preservation customers.

Ah, policy based on EAs would be the general case, yes.
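To make feature 3 above concrete, here is a minimal sketch (Python; the helper names and fan-out numbers are my own, not taken from any existing copytool) of hashing an opaque FID string into a shallow directory tree so that no archive directory ever approaches the ~500,000-entry range Harriet warns about:

    # Sketch only: hashing an opaque Lustre FID into a shallow directory tree.
    # The fan-out numbers are illustrative, not taken from any real copytool.
    import hashlib
    import os

    FANOUT = 256  # 256 x 256 = 65,536 leaf directories

    def fid_to_archive_path(archive_root, fid):
        """Map an opaque FID string to <root>/<hh>/<hh>/<fid> in the archive."""
        digest = hashlib.sha1(fid.encode("ascii")).digest()
        d1 = format(digest[0] % FANOUT, "02x")
        d2 = format(digest[1] % FANOUT, "02x")
        return os.path.join(archive_root, d1, d2, fid)

    # Example: 1 billion archived files spread over 65,536 leaves is roughly
    # 15,000 entries per directory, comfortably below the 500,000 guideline.
    print(fid_to_archive_path("/qfs/lustre_archive", "0x200000401:0x5:0x0"))

Since Lustre always presents the same FID for the same file, this mapping is stable across renames, which is exactly the property the flat-FID design is after.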
Nathan, On Jan 30, 2009, at 6:21 PM, Nathaniel Rutman wrote:> LEIBOVICI Thomas wrote: >> At CEA, we are using our own copytool that directly uses HPSS API. >> This already exists and is in production for years. >> I think there will be few modifications to adapt it to Lustre-HSM >> purpose >> (basically, add fid <-> HSM id mapping and backup of attributes, >> path, stripe...) > So then the QFS copytool will indeed be a new tool, and should be > scheduled accordingly. > Features: > 1. "cp --preserve" like functionality (include metadata attributes > in cp) > 2. add EA''s (create mini-tarball) > 3. implement FID hash to subdivide namespace > 4. periodic status reporting (via ioctl on file) > > > Harriet G. Coverston wrote: >>> There is a mechanism to get the current full pathname for a given >>> fid from userspace, so an HSM-specific copytool could find it out, >>> but a central tenet of the design here is that as far as the HSM >>> is concerned, the entire Lustre FS is a flat namespace of FIDs. >> >> Be careful here. We are a file system. We don''t have a limit on # >> of files in one directory, but we don''t recommend more than 500,000 >> files in one single directory or you will start to see some >> performance problems. You will have to create a tree, not use a >> flat namespace. > Yes, a tree based on a hash of the fid. > The other option is to use the actual filename for storage, but from > Lustre''s point of view this gets extremely tricky. For example: > Send /foo/bar to archive. Client A opens /foo/bar. Client B > renames /foo/bar to /abc/xyz, but this change hasn''t propagated to > the archive yet. Client A now tries to read its open file handle, > which tells Lustre to read the offline file FID 123, which it > translates to /abc/xyz currently, which the archive doesn''t know > about yet. Not just xyz, but renames on any ancestor path element > cause similar misses. Since the FID remains constant throughout the > life of a file, we don''t have to worry about any namespace changes > (file or parents). If there was an alternate way of bypassing the > archive''s namespace to directly access a file, we could conceivably > store e.g. an archive-specific identifier within the Lustre stripe > EA, and pass this down to the copytool when reading an offline file, > but this presupposes that such a thing exists, is of reasonable > size, has a userspace method to access it, etc.Yes, we have a FID like concept in SAM-QFS. It is called the file ID. It is 64 bits and consists of the inode/generation number. It is unique. You can store it. You can issue an ioctl to open the ID. You can issue an ioctl to do an ID stat, etc. It is much more efficient than using the filename (expensive lookup). This means if you store and use the ID, you can cover the rename window and still be guaranteed that you will get the right file. Note, we don''t rearchive on a rename. I really think a replicated namespace will be much more intuitive and solves restore. If you prefer to build a tar container, that is OK, too. The tar file can have a suffix and then you know it is tar and you can tar it back.> > >> >>> You can get a full pathname if you want to for catastrophe >>> recovery, but Lustre itself will only speak to the HSM with FIDs. >>> As I said in the other email, although SAM-QFS can do name-based >>> policies, the "name" as far as QFS is concerned is just the FID, >>> so name-based policies at the copytool level are worthless. >>> Unless we a.) 
add the path/filename back to the file (EA, or use a >>> tarball wrapper), and b.) modify the SAM policy engine to use the >>> "real" path/filename instead of the FID. >> >> Currently, we don''t support policy using EA (extended attributes >> are in 5.0). We have had lots of requests for this, especially from >> our digital preservation customers. > Ah, policy based on EAs would be the general case, yes.Yes, this would be a nice feature for us. - Harriet Harriet G. Coverston Solaris, Storage Software | Email: harriet.coverston at sun.com Sun Microsystems, Inc. | AT&T: 651-554-1515 1270 Eagan Industrial Rd., Suite 160 | Fax: 651-554-1540 Eagan, MN 55121-1231
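For those less familiar with SAM-QFS internals, the file ID Harriet describes is just a 64-bit value built from the inode and generation numbers. A tiny sketch of how a copytool might pack and unpack such an ID before handing it to Lustre for storage (the 32/32 field split is an assumption on my part, not taken from SAM-QFS headers):

    # Sketch only: packing an inode/generation pair into one 64-bit identifier.
    # The 32/32 split is an assumption for illustration, not the SAM-QFS layout.

    def pack_file_id(inode, generation):
        assert inode < 2**32 and generation < 2**32
        return (inode << 32) | generation

    def unpack_file_id(file_id):
        return file_id >> 32, file_id & 0xFFFFFFFF

    fid = pack_file_id(123456, 7)
    print(hex(fid), unpack_file_id(fid))  # stored opaquely by Lustre, handed back on stage-in

Whatever the exact layout, the operational point from the thread stands: because samfsrestore can change the ID, anything Lustre stores must be treated as a hint the archive may later rewrite (see the syncing discussion further down).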
Harriet G. Coverston wrote: Hi,> Nathan, > > On Jan 30, 2009, at 6:21 PM, Nathaniel Rutman wrote: > >> LEIBOVICI Thomas wrote: >>> At CEA, we are using our own copytool that directly uses HPSS API. >>> This already exists and is in production for years. >>> I think there will be few modifications to adapt it to Lustre-HSM >>> purpose >>> (basically, add fid <-> HSM id mapping and backup of attributes, >>> path, stripe...) >> So then the QFS copytool will indeed be a new tool, and should be >> scheduled accordingly. >> Features: >> 1. "cp --preserve" like functionality (include metadata attributes in >> cp) >> 2. add EA''s (create mini-tarball) >> 3. implement FID hash to subdivide namespace >> 4. periodic status reporting (via ioctl on file) >> >> >> Harriet G. Coverston wrote: >>>> There is a mechanism to get the current full pathname for a given >>>> fid from userspace, so an HSM-specific copytool could find it out, >>>> but a central tenet of the design here is that as far as the HSM is >>>> concerned, the entire Lustre FS is a flat namespace of FIDs. >>> >>> Be careful here. We are a file system. We don''t have a limit on # of >>> files in one directory, but we don''t recommend more than 500,000 >>> files in one single directory or you will start to see some >>> performance problems. You will have to create a tree, not use a flat >>> namespace. >> Yes, a tree based on a hash of the fid. >> The other option is to use the actual filename for storage, but from >> Lustre''s point of view this gets extremely tricky. For example: >> Send /foo/bar to archive. Client A opens /foo/bar. Client B renames >> /foo/bar to /abc/xyz, but this change hasn''t propagated to the >> archive yet. Client A now tries to read its open file handle, which >> tells Lustre to read the offline file FID 123, which it translates to >> /abc/xyz currently, which the archive doesn''t know about yet. Not >> just xyz, but renames on any ancestor path element cause similar >> misses. Since the FID remains constant throughout the life of a >> file, we don''t have to worry about any namespace changes (file or >> parents). If there was an alternate way of bypassing the archive''s >> namespace to directly access a file, we could conceivably store e.g. >> an archive-specific identifier within the Lustre stripe EA, and pass >> this down to the copytool when reading an offline file, but this >> presupposes that such a thing exists, is of reasonable size, has a >> userspace method to access it, etc. > > Yes, we have a FID like concept in SAM-QFS. It is called the file ID. > It is 64 bits and consists of the inode/generation number. It is > unique. You can store it. You can issue an ioctl to open the ID. You > can issue an ioctl to do an ID stat, etc. It is much more efficient > than using the filename (expensive lookup). This means if you store > and use the ID, you can cover the rename window and still be > guaranteed that you will get the right file. Note, we don''t rearchive > on a rename.I believe this facility only exist on the Meta Data Server Node and not on the Linux/Solaris clients. Am I correct? Thanks. colin> > I really think a replicated namespace will be much more intuitive and > solves restore. If you prefer > to build a tar container, that is OK, too. The tar file can have a > suffix and then you know it is tar and > you can tar it back. >> >> >>> >>>> You can get a full pathname if you want to for catastrophe >>>> recovery, but Lustre itself will only speak to the HSM with FIDs. 
>>>> As I said in the other email, although SAM-QFS can do name-based >>>> policies, the "name" as far as QFS is concerned is just the FID, >>>> so name-based policies at the copytool level are worthless. >>>> Unless we a.) add the path/filename back to the file (EA, or use a >>>> tarball wrapper), and b.) modify the SAM policy engine to use the >>>> "real" path/filename instead of the FID. >>> >>> Currently, we don''t support policy using EA (extended attributes are >>> in 5.0). We have had lots of requests for this, especially from our >>> digital preservation customers. >> Ah, policy based on EAs would be the general case, yes. > Yes, this would be a nice feature for us. > > - Harriet > > Harriet G. Coverston > Solaris, Storage Software | Email: harriet.coverston at sun.com > Sun Microsystems, Inc. | AT&T: 651-554-1515 > 1270 Eagan Industrial Rd., Suite 160 | Fax: 651-554-1540 > Eagan, MN 55121-1231 > > > >
Colin, On Feb 2, 2009, at 8:56 AM, Colin Ngam wrote:>>> >>> >>> Harriet G. Coverston wrote: >>>>> There is a mechanism to get the current full pathname for a >>>>> given fid from userspace, so an HSM-specific copytool could find >>>>> it out, but a central tenet of the design here is that as far as >>>>> the HSM is concerned, the entire Lustre FS is a flat namespace >>>>> of FIDs. >>>> >>>> Be careful here. We are a file system. We don''t have a limit on # >>>> of files in one directory, but we don''t recommend more than >>>> 500,000 files in one single directory or you will start to see >>>> some performance problems. You will have to create a tree, not >>>> use a flat namespace. >>> Yes, a tree based on a hash of the fid. >>> The other option is to use the actual filename for storage, but >>> from Lustre''s point of view this gets extremely tricky. For >>> example: >>> Send /foo/bar to archive. Client A opens /foo/bar. Client B >>> renames /foo/bar to /abc/xyz, but this change hasn''t propagated to >>> the archive yet. Client A now tries to read its open file handle, >>> which tells Lustre to read the offline file FID 123, which it >>> translates to /abc/xyz currently, which the archive doesn''t know >>> about yet. Not just xyz, but renames on any ancestor path element >>> cause similar misses. Since the FID remains constant throughout >>> the life of a file, we don''t have to worry about any namespace >>> changes (file or parents). If there was an alternate way of >>> bypassing the archive''s namespace to directly access a file, we >>> could conceivably store e.g. an archive-specific identifier within >>> the Lustre stripe EA, and pass this down to the copytool when >>> reading an offline file, but this presupposes that such a thing >>> exists, is of reasonable size, has a userspace method to access >>> it, etc. >> >> Yes, we have a FID like concept in SAM-QFS. It is called the file >> ID. It is 64 bits and consists of the inode/generation number. It >> is unique. You can store it. You can issue an ioctl to open the ID. >> You >> can issue an ioctl to do an ID stat, etc. It is much more efficient >> than using the filename (expensive lookup). This means if you store >> and use the ID, you can cover the rename window and still be >> guaranteed that you will get the right file. Note, we don''t >> rearchive on a rename. > I believe this facility only exist on the Meta Data Server Node and > not on the Linux/Solaris clients. Am I correct?It is supported on the MDS and the Solaris client nodes, but currently not on Linux. I thought about this a bit. After we do a samfsrestore (reload the metadata after a crash of the SAM-QFS disk cache), the ID is not the same. Therefore, you would not be able to use this after a SAM restore unless the ID that you are storing is updated. We really need to think about this. - Harriet> > > Thanks. > > colin >> >> I really think a replicated namespace will be much more intuitive >> and solves restore. If you prefer >> to build a tar container, that is OK, too. The tar file can have a >> suffix and then you know it is tar and >> you can tar it back. >>> >>> >>>> >>>>> You can get a full pathname if you want to for catastrophe >>>>> recovery, but Lustre itself will only speak to the HSM with FIDs. >>>>> As I said in the other email, although SAM-QFS can do name-based >>>>> policies, the "name" as far as QFS is concerned is just the FID, >>>>> so name-based policies at the copytool level are worthless. >>>>> Unless we a.) 
add the path/filename back to the file (EA, or use >>>>> a tarball wrapper), and b.) modify the SAM policy engine to use >>>>> the "real" path/filename instead of the FID. >>>> >>>> Currently, we don''t support policy using EA (extended attributes >>>> are in 5.0). We have had lots of requests for this, especially >>>> from our digital preservation customers. >>> Ah, policy based on EAs would be the general case, yes. >> Yes, this would be a nice feature for us. >> >> - Harriet >> >> Harriet G. Coverston >> Solaris, Storage Software | Email: harriet.coverston at sun.com >> Sun Microsystems, Inc. | AT&T: >> 651-554-1515 >> 1270 Eagan Industrial Rd., Suite 160 | Fax: 651-554-1540 >> Eagan, MN 55121-1231 >> >> >> >> >- Harriet Harriet G. Coverston Solaris, Storage Software | Email: harriet.coverston at sun.com Sun Microsystems, Inc. | AT&T: 651-554-1515 1270 Eagan Industrial Rd., Suite 160 | Fax: 651-554-1540 Eagan, MN 55121-1231
Hi Nathan,

I wrote up what I think can be done in SAMQFS to support the Lustre HSM effort. These are talking points that will help us move forward quickly. It needs to be sanitized...

Thanks.

colin

[Attachment: Osam-proposal -- http://lists.lustre.org/pipermail/lustre-devel/attachments/20090202/03d17db7/attachment-0001.ksh]
Colin Ngam wrote:
> Before I forget - can someone point me to the HPSS API?

See http://www.hpss-collaboration.org/hpss/users/user_doc.jsp. You would want the most recent (6.2) version of the HPSS Programmer's Reference Guide, Volume 1. The 7.1 version should be out very soon.

Vicky
> 6. No namepace. No namespace. Lustre pathnames can be stored as Extended > Attributes.I realize you are talking about SAMQFS, but do we want to keep the design consistent with what we do for hpss? HPSS will support Extended Attributes in release 7.2, to be available September 2009, but it will be expensive to use these EAs for pathnames, though it can be done. The current design is to specify EAs in XML, and for the most efficient storage of the EA in the database, the containing XML document needs to be 512 bytes or fewer so that it can be stored inline. Larger XML objects will have to be stored externally as LOBs (large objects), which will make queries cost more. So we need to think about what that cost will be when we are considering repositories with millions or billions of files. Vicky
Vicky White wrote:> >> 6. No namepace. No namespace. Lustre pathnames can be stored as >> Extended >> Attributes. > > > I realize you are talking about SAMQFS, but do we want to keep the > design consistent with what we do for hpss? > > HPSS will support Extended Attributes in release 7.2, to be available > September 2009, but it will be expensive to use these EAs for > pathnames, though it can be done. The current design is to specify > EAs in XML, and for the most efficient storage of the EA in the > database, the containing XML document needs to be 512 bytes or fewer > so that it can be stored inline. Larger XML objects will have to be > stored externally as LOBs (large objects), which will make queries > cost more. > > So we need to think about what that cost will be when we are > considering repositories with millions or billions of files. > > VickyHi Vicky, I do not see why we need to query these EAs in normal operation. These EAs will only be accessed when we need to perform Ultimate Disaster Recovery - when you have lost all data on disks and all you have are tapes. I was thinking about XML - but it is "opaque" to Object SAMQFS so, it is up to the Lustre side. Whatever it is, the Applications - Lustre-Restore for example, is the one that has to understand the format. I am not a Tar Header expert - but I assume that these EAs can go with the file in the tar ball. I do not expect to keep any of it on line on disk cache, on the SAMFS side. I see no reason. With respect to whether it should be consistent with HPSS - I would say if that is all we need and it is sufficient - why not. Otherwise, let''s make it better than HPSS. It must be in SUN''s best interest to sell SAMQFS? I do apologize Vicky, are you a SUN employee? Thanks. colin
Colin Ngam wrote:> I do not see why we need to query these EAs in normal operation. > These EAs will only be accessed when we need to perform Ultimate > Disaster Recovery - when you have lost all data on disks and all you > have are tapes.That would help. I don''t know what it would cost to store the EA in a separate object to begin with, though, and that would be incurred on every file. Plus you have to consider the space it takes up.> I was thinking about XML - but it is "opaque" to Object SAMQFS so, it > is up to the Lustre side. Whatever it is, the Applications - > Lustre-Restore for example, is the one that has to understand the > format. I am not a Tar Header expert - but I assume that these EAs > can go with the file in the tar ball.I think what you put in the tar ball is up to you. Putting the EAs in there regardless of what the hsm was might simplify the design, so you wouldn''t have to extract the EA in a different way for each hsm. I was just trying to keep the hpss EA design in front of folks so that if we were considering using that, we knew all the tradeoffs.> I do not expect to keep any of it on line on disk cache, on the SAMFS > side. I see no reason. > > With respect to whether it should be consistent with HPSS - I would > say if that is all we need and it is sufficient - why not. Otherwise, > let''s make it better than HPSS. It must be in SUN''s best interest to > sell SAMQFS?Oh, I''m sure it is.> I do apologize Vicky, are you a SUN employee?No. Were you going to feel sorry for me if I worked for Sun? ;) Vicky
Vicky White wrote: Hi Vicky,> Colin Ngam wrote: >> I do not see why we need to query these EAs in normal operation. >> These EAs will only be accessed when we need to perform Ultimate >> Disaster Recovery - when you have lost all data on disks and all you >> have are tapes. > > > That would help. I don''t know what it would cost to store the EA in a > separate object to begin with, though, and that would be incurred on > every file. Plus you have to consider the space it takes up. > > >> I was thinking about XML - but it is "opaque" to Object SAMQFS so, it >> is up to the Lustre side. Whatever it is, the Applications - >> Lustre-Restore for example, is the one that has to understand the >> format. I am not a Tar Header expert - but I assume that these EAs >> can go with the file in the tar ball. > > > I think what you put in the tar ball is up to you. Putting the EAs > in there regardless of what the hsm was might simplify the design, so > you wouldn''t have to extract the EA in a different way for each hsm. > > I was just trying to keep the hpss EA design in front of folks so that > if we were considering using that, we knew all the tradeoffs.Good point. I guess EA can be anywhere in the tar file, but, best if somehow it is put together for fast access. But then, we do not need it unless it is for Ultimate Disaster Recovery .. that should never happen :-)) With respect to space - how about compression? The problem is, I always thought space is cheap. Does HPSS ever scrub/recycle? Policy driven? If EAs are going to be in a Database, I can see it can be a problem. Path name does not need to be in the EA. It is needed for the tar header only. I guess the EA will consist of everything that Lustre needs to restore a file, completely.> > >> I do not expect to keep any of it on line on disk cache, on the SAMFS >> side. I see no reason. >> >> With respect to whether it should be consistent with HPSS - I would >> say if that is all we need and it is sufficient - why not. >> Otherwise, let''s make it better than HPSS. It must be in SUN''s best >> interest to sell SAMQFS? > > > Oh, I''m sure it is. > > >> I do apologize Vicky, are you a SUN employee? > > > No. Were you going to feel sorry for me if I worked for Sun? ;)No, it''s kind of fun to design with .. and I want to say competitor, but I guess you do not really fall into that category. Say hi to Kim K. or Dave W(Cray folks) for me if they cross your path.> > > Vicky >PS-The programmer''s reference is close to 400 pages! Perhaps I should start with User''s Guide.
> PS-The programmer''s reference is close to 400 pages!Alas, yes. Take two steps backward from it. Think of chapter 2 as "posix on hpss", because that''s basically what it is - a client api interface to map all the posix calls into corresponding hpss calls. The first half of the chapter describes the functions and the second half the relevant data structures. The other chapters are gravy - additional kinds of features you can use but wouldn''t have to right off the bat, and some of which you''d never use. Funny...I thought there used to be some programming examples in the back, but maybe that was in another book.> Perhaps I should start with User''s Guide.I always think of that as just an explanation of the standard user interfaces like ftp and vfs, but you''re right, it does talk about some hpss concepts that would be a useful intro. Vicky
Colin Ngam wrote:

Is OSAM available on Linux?

Object SAMQFS - HSM for Lustre
------------------------------

0. We're basically looking at the HSM as a Repository, right?

yes

2. Object SAMQFS meta data (inodes) is used as a database for files that are
archived etc.

You mean, store the Lustre metadata attributes in these inodes? Or rather that these inodes just keep track of the objects in the archive (like block pointers)?

3. This database can be dumped and restored really quickly using normal meta
data backup of the HSM. The inodes are kept in 1 file. This is not a Lustre
dump but rather a dump of Object SAMQFS. No file data dump is required. Files
not archived yet are irrelevant... Incrementals can be obtained by comparing
2 full dumps and just keeping the diffs. The persistent Object SAMQFS file id
can be preserved if we restore a complete version of the dump. Otherwise,
it can be different. We can update Lustre with the new file id for the given
Lustre File ID. Consider this error recovery path...

If we're already storing archive-specific opaque data (the SamFID), I see no reason why we couldn't allow the archive to modify that value at will. We'd need to put a lock around it...

4. Object SAMQFS should have very simple policies - archive immediate, number
of copies and when copies are to be made, etc. This can actually be passed by
Lustre and executed by Object SAMQFS. The last thing we want to do is to have
to configure 2 Policy engines.

I was envisioning the Lustre "action list" as a list of files and actions. The actions could be semi-complex (e.g. "archive at level 4") which would mean something to the archive.

5. Lustre will store a 16-byte Object SAMQFS identifier: an 8-byte unique
file system ID and an 8-byte Object SAMQFS File ID. An Object SAMQFS can only
support a 32-bit number of files. This will be less if we use inodes for
extended attributes etc. The file system ID will allow us to create multiple
Object SAMQFS "mat" file systems - providing an effectively unlimited number
of files that can be supported.

Do separate filesystems need separate disks? This opens up an inodecount/filesize relation, or we have to create new OSAM filesystems on demand (ENOSPC, create new fs, store file -- hmm, not so hard).

6. No namespace. Lustre pathnames can be stored as Extended Attributes.

No problem except for the disaster recovery scenario. And even in that case we don't need EAs if we're storing mini-tarballs already - just add an empty file to the tarball with the actual filename.

7. Files to be archived and staged in together (associative archiving) to be
given in a list by Lustre. Object SAMQFS will figure out a way to link these
files together and put them on the same tarball - this is not for free.

It's actually not clear that this is useful for Lustre. If the point of Lustre HSM is to extend the filesystem space, it makes little sense to bother archiving small files. Anyhow, this can be a future optimization.

Basic Object SAMQFS - HSM for Lustre Archive Events
---------------------------------------------------

Lustre calls with the following Information:

1. Luster FID
2. Luster Opaque Meta Data
3. Luster Tar File required Data e.g. Path Name
4. Luster Archiving Policy for this file - must be simple.

Lustre gets back:

1. Object SAMQFS Identifier.

Depending on asynchronous or synchronous archiving:

1. Lustre can status with the given "Object SAMQFS Identifier"

Sounds fine. Lustre will always use asynchronous archiving, as far as I can see. (A sketch of this request/response exchange follows below.)

Basic Object SAMQFS - HSM for Lustre Stage In Events (bring data back)
----------------------------------------------------------------------

1. Lustre just reads the file with the given "Object SAMQFS Identifier"

Basic Object SAMQFS - HSM for Lustre status Events (check state)

1. Lustre performs an "sls" command on the Object SAMQFS Client.

PS - We can have both User level command and API capabilities.

Well, technically, Lustre calls with the following information:
1. Luster FID
2. Luster Opaque Meta Data
(BTW, that's Lustre, not Luster)
OSAM ignores the fid and just uses the OSAM identifier.

Basic Object SAMQFS - HSM for Lustre Delete Event
-------------------------------------------------

1. Lustre can effectively do an "rm" on the Object SAMQFS Identifier or
calls an API.

Object SAMQFS Dump and Restore
------------------------------

Independent Administrative event.

Lustre Dump and Restore
-----------------------

Can be an Independent Lustre event.
However, this does have an impact on when we can actually delete a file from
tape if a Lustre Dump has a reference to this file, e.g.
1. Archive file.
2. Dump Lustre.
3. Delete file.
Now you want to restore the deleted file.

Dumping the Lustre metadata isn't something we've really talked about before - or, rather, the restore part isn't :) Effectively, the Lustre metadata is (all the data on) the entire MDT disk. I'm not sure it makes any sense to try to be any more elaborate than that, but maybe. It would be nice to be able to e.g. dump the disk to a regular (big!) file store in OSAM, so we've got everything on 1 set of tapes...

Ultimate Disaster Recovery - Directly from Tapes
------------------------------------------------

Requires the Tar File to be complete with Lustre Meta Data.
Since this is a recreation of both the Lustre FS and the Object SAMQFS "mat" FS,
I would be inclined to believe that at a minimum, we will not require the
Object SAMQFS identifier to be persistent from the previous incarnation. I am
also inclined to believe that if you take regular Object SAMQFS dumps, both
full and also incrementals, and store these safely on tape - you may not need
this procedure... but then, that's why we call it Ultimate Recovery.

If everything is wiped out except the tapes, we would just repopulate a new Lustre fs anyhow. Once the OSAM fs is regenerated, we walk all the objects and create object placeholders in the new Lustre fs referencing the new OSAM fids and marking everything as punched. As users start using files they are pulled back in automatically.

Syncing Object SAMQFS with Lustre
---------------------------------

The Lustre File Identifier and the Object SAMQFS Identifier can get out of
sync - shit happens. We need syncing capabilities.

Only if we stored enough information to mismatch :) If Lustre asks for a FID, and it gets back the wrong file, it doesn't / can't know. Unless we store the FID inside the file it gets back and we verify it.

Object SAMQFS - Freeing space on tapes
--------------------------------------

We will need a way to determine with Lustre - conclusively - that an archive
is no longer needed.

If the Lustre policy manager says "rm", then Lustre has no way to ever get that file back. There's no time-machine-like old versions of directories. Would be a cool feature though. Maybe the archive says "ok" to the rm, but secretly holds on to the file for some time in a special "recently deleted" dir?
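Pulling the "Archive Events" exchange above together, here is a minimal sketch (Python; all field and type names are hypothetical, this is not an existing Lustre or SAM-QFS interface) of the data that would flow in each direction for an asynchronous archive request:

    # Sketch only: illustrative request/response records for the archive-event
    # exchange discussed above. Field names are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class ArchiveRequest:            # Lustre -> OSAM copytool
        lustre_fid: str              # opaque 16-byte FID, here as a hex string
        opaque_metadata: bytes       # Lustre layout EA etc., not interpreted by OSAM
        tar_info: dict               # e.g. {"path": "/lustre/scratch/foo"} for the tar header
        policy: str                  # simple hint, e.g. "archive_now_2_copies"

    @dataclass
    class ArchiveResponse:           # OSAM -> Lustre, returned immediately (asynchronous)
        osam_id: bytes               # 8-byte fs ID + 8-byte OSAM file ID
        state: str = "queued"        # Lustre polls status later using osam_id

    req = ArchiveRequest("0x200000401:0x5:0x0", b"\x00" * 56,
                         {"path": "/lustre/scratch/foo"}, "archive_now_2_copies")
    resp = ArchiveResponse(osam_id=b"\x01" * 16)
    print(resp.osam_id.hex(), resp.state)

The only value Lustre has to persist from this exchange is the 16-byte OSAM identifier, which it later presents on stage-in, status, and delete.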
On Feb 3, 2009, at 6:41 PM, Nathaniel Rutman wrote: Hi, If these are all agreeable, lets start drawing up the Spec.> Colin Ngam wrote: > > Is OSAM available on Linux?Can be access from a Linux Client. It is another file system type to SAMQFS. We have inserted software restrictions to prevent it from being used as a Shared QFS file system type. This is one of those, code is there, needs testing. I did the code so ... Keep in mind that the Meta Data Server is still only Solaris.> > > Object SAMQFS - HSM for Lustre > ------------------------------ > > 0. We re basically looking at the HSM as a Repository right? > > yes > > > 2. Object SAMQFS meta data(inodes) is used as a database for files > that are > archived etc. > > You mean, store the Lustre metadata attributes in these inodes? Or > rather that these inodes just keep track of the objects in the > archive (like block pointers)Inodes on the OSAM nodes are for managing the files in archive and the link to Lustre. I expect to store Lustre Meta Data as EA in tar file. I am assuming that we do not need Lustre Meta data on disk cache. Lustre already has it in Lustre .. only need to access these for Ultimate Disaster Recovery from tape.> > > 3. This database can be dumped and restored really quick using > normal meta > data backup of the HSM. The inodes are kept in 1 file. This is not > a Lustre > dump but rather a dump of Object SAMQFS. No file data dump is > required. Files > not archived yet are irrelevent .. Incrementals can be obtained by > comparing > 2 full dumps and just keeping the diffs. Persistent Object SAMQFS > file id > can be preserved if we restore a complete version of the dump. > Otherwise, > it can be different. We can update Lustre with the new file id for > the given > Lustre File ID. Consider this error recovery path .. > > If we''re already storing archive-specific opaque data (the SamFID), > I see no reason why we couldn''t allow the archive to modify that > value at will. We''d need to put a lock around it...Yes we can. It is just a matter of how do we initiate this change between the archive and Lustre.> > > > 4. Object SAMQFS should have very simple policies - archive > immediate, number > of copies and when copies to be made etc.. This can actually be > passed by > Lustre and executed by Object SAMQFS. Last thing we want to do is > to have to > configure 2 Policy engines. > > I was envisioning the Lustre "action list" as a list of files and > actions. The actions could be semi-complex (e.g. "archive at level > 4") which would mean something to the archive.Yes, this needs to be defined. This should include future action like "made 2nd copy after 24 hours etc. SAMQFS has a standard set of Policies .. if you want to deviate we will have to provide new code. We need to define these actions.> > > > 5. Lustre will store a 16 Bytes Object SAMQFS identifier. A 8 > bytes unique > file system ID and a 8 bytes Object SAMQFS File ID. An Object > SAMQFS can only > support 32 bits number of files. This will be less if we use inodes > for > extended attributes etc. The file system ID will allow us to create > multiple > Object SAMQFS "mat" file system - provide infinite number of files > that can > be supported. > > Do separate filesystems need separate disks? This opens up a > inodecount/filesize relation, or we have to create new OSAM > filesystems on demand (ENOSPC, create new fs, store file -- hmm, not > so hard).No, a file system is configured using slices/partitions. More than 1 FS can reside on the same disk. 
There will not be any inodecount/ filesize relationship because on the SAMQFS node we will release file data space as needed after the file is on Tape. We also do the "punch". Yes FS can be created on demand.> > > > 6. No namepace. Lustre pathnames can be stored as Extended > Attributes. > > No problem except for the disaster recovery scenario. And even in > that case we don''t need EAs if we''re storing mini-tarballs already - > just add an empty file to the tarball with the actual filename.OK.> > > > 7. Files to be archived and staged in together(associative > archiving) to be > given in a list by Lustre. Object SAMQFS will figure out a way to > link these > files together and put them on the same tarball - this is not for > free. > > It''s actually not clear that this is useful for Lustre. If the > point of Lustre HSM is to extend the filesystem space, it makes > little sense to bother archiving small files. Anyhow, this can be a > future optimization.Lustre''s call.> > > > > Basic Object SAMQFS - HSM for Lustre Archive Events > ------------------------------------------- > > Lustre calls with the following Information: > > 1. Luster FID > 2. Luster Opaque Meta Data > 3. Luster Tar File required Data e.g. Path Name > 4. Luster Archiving Policy for this file - must be simple. > > Lustre gets back: > > 1. Object SAMQFS Identifier. > > Depending on asynchronous or synchronous archiving: > > 1. Lustre can status with the given "Object SAMQFS Identifier" > > Sounds fine. Lustre will always use asynchronous archiving, as far > as I can see.Okay.> > > > > Basic Object SAMQFS - HSM for Lustre Stage In Events(bring data back) > --------------------------------------------------------------------- > > 1. Lustre just reads the file with the given "Object SAMQFS > Identifier" > > > Basic Object SAMQFS - HSM for Lustre status Events(check state) > > 1. Lustre perform "sls" command on Object SAMQFS Client. > > PS - We can have both User level command and API capabilities. > > well technically, Lustre calls with the following information > 1. Luster FID > 2. Luster Opaque Meta Data > (BTW, that''s Lustre, not Luster) > OSAM ignores fid and just uses OSAM identifierRight, Fiber/Fibre :-) I am missing something here .. Stage-In is to get a file from archive .. why do we need Item 2? Or is 2 OSAM Identifier? If so, great. I like it. In this case, we should trust Lustre FID. The OSAM ID is for a very fast search - direct index.> > > > > Basic Object SAMQFS - HSM for Lustre Delete Event > ------------------------------------------------- > > 1. Lustre can effectively do an "rm" on the Object SAMQFS > Identifier or > calls an API. > > > Object SAMQFS Dump and Restore > ------------------------------ > > Independent Administrative event. > > Lustre Dump and Restore > ----------------------- > > Can be an Independent Lustre event. > However, this does have impact on when we can actually delete a file > from > tape if a Lustre Dump has a reference to this file e.g. > 1. Archive file. > 2. Dump Lustre. > 3. Delete file. > > Now you want to restore the deleted file. > > Dumping the Lustre metadata isn''t something we''ve really talked > about before - or, rather, the restore part isn''t :) > Effectively, the Lustre metadata is (all the data on) the entire MDT > disk. I''m not sure it makes any sense to try to be any more > elaborate than that, but maybe. It would be nice to be able to e.g. > dump the disk to a regular (big!) 
file store in OSAM, so we''ve got > everything on 1 set of tapes...Lustre''s call.> > > > > Ultimate Disaster Recovery - Directly from Tapes > ------------------------------------------------ > > Requires Tar File to be complete with Lustre Meta Data. > Since this is a recreation of both the Lustre FS and Object SAMQFS > "mat" FS > I would be incline to believe that at a minimum, we will not require > the > Object SAMQFS identifier to be persistent from previous > incantation. I am also > incline to believe that if you take regular Object SAMQFS dumps, > both full and > also incrementals and store this safely on tape - you may not need > this > procedure .. but then, that''s why we call it Ultimate Recovery. > > If everything is wiped out except the tapes, we would just > repopulate a new Lustre fs anyhow. Once the OSAM fs is regenerated, > we walk all the objects and create object placeholders in the new > Lustre fs referencing the new OSAM fids and marking everything as > punched. As users start using files they are pulled back in > automatically.Yes. The chances of both a Lustre and OSAM collapse at the same time is not very good.> > > > > Syncing Object SAMQFS with Lustre > --------------------------------- > > Lustre File Identifier and Object SAMQFS Identifier can get out of > sync - shit > happens. We need syncing capabilities. > > Only if we stored enough information to mismatch :) If Lustre asks > for a FID, and it gets back the wrong file, it doesn''t / can''t > know. Unless we store the FID inside the file it gets back and we > verify it.If you always call with Lustre ID and OSAM ID, if we find that the Lustre ID does not match the OSAM ID, because perhaps we have done OSAM recovery and we are using a different OSAM ID to hold the Lustre ID now, we can search for the inode that match the Lustre ID, fetch the file and also update Lustre with the new OSAM ID.> > > > Object SAMQFS - Freeing space on tapes > -------------------------------------- > > We will need a way to determine with Lustre - conclusively that an > archive is > no longer needed. > > If Lustre policy manager says "rm", then Lustre has no way to ever > get that file back. There''s no time-machine like old versions of > directories. Would be a cool feature though. Maybe archive says > "ok" to the rm, but secretly holds on to the file for some time in a > special "recently deleted" dir?No namespace - no dir. If Lustre removes the file, we can delay the scrub. If Lustre can come back with the Lustre ID and OSAM ID, if it has not been scrubbed, you can get it back. Thanks. colin> > > >
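Since item 6 appears settled as "mini-tarball, with an empty member carrying the real filename", here is a minimal sketch (Python; the member names and layout are my own assumptions, not an agreed-on format) of what such a per-file archive object could look like:

    # Sketch only: one archive object per Lustre file, wrapped in a small tarball.
    # Member names and layout are illustrative, not an agreed-on format.
    import io
    import tarfile

    def build_archive_object(fid, data, lustre_path, layout_ea):
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            # 1. the file data itself, stored under its FID
            info = tarfile.TarInfo(name=fid)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
            # 2. the Lustre layout EA, for restoring striping on stage-in
            ea = tarfile.TarInfo(name=fid + ".lov_ea")
            ea.size = len(layout_ea)
            tar.addfile(ea, io.BytesIO(layout_ea))
            # 3. an empty member whose *name* records the original pathname,
            #    purely for disaster recovery (no EA support needed in the archive)
            marker = tarfile.TarInfo(name="path/" + lustre_path.lstrip("/"))
            marker.size = 0
            tar.addfile(marker)
        return buf.getvalue()

    blob = build_archive_object("0x200000401:0x5:0x0", b"hello world",
                                "/lustre/scratch/foo/bar.dat", b"\x00" * 56)
    print(len(blob), "bytes in the tarball")

On restore, the FID member supplies the data, the .lov_ea member the striping, and the path/ member is only consulted during the "ultimate disaster recovery" walk.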
Colin Ngam wrote:> If these are all agreeable, lets start drawing up the Spec. >sure>> Basic Object SAMQFS - HSM for Lustre status Events(check state) >> >> 1. Lustre perform "sls" command on Object SAMQFS Client. >> >> PS - We can have both User level command and API capabilities. >> >> well technically, Lustre calls with the following information >> 1. Luster FID >> 2. Luster Opaque Meta Data >> (BTW, that''s Lustre, not Luster) >> OSAM ignores fid and just uses OSAM identifier > > Right, Fiber/Fibre :-) > > I am missing something here .. Stage-In is to get a file from archive > .. why do we need Item 2? Or is 2 OSAM Identifier? If so, great. I > like it.Yes, item 2 is archive-specific (e.g. OSAM) identifier.> > In this case, we should trust Lustre FID. The OSAM ID is for a very > fast search - direct index.Agreed, the Lustre FID will be the authoritative value. OSAM would internally use the OSAM identifier for the fast lookup, then verify the Lustre FID matches. If not, we would have to do a slow search for the Lustre FID...>> Syncing Object SAMQFS with Lustre >> --------------------------------- >> >> Lustre File Identifier and Object SAMQFS Identifier can get out of >> sync - shit >> happens. We need syncing capabilities. >> >> Only if we stored enough information to mismatch :) If Lustre asks >> for a FID, and it gets back the wrong file, it doesn''t / can''t >> know. Unless we store the FID inside the file it gets back and we >> verify it. > If you always call with Lustre ID and OSAM ID, if we find that the > Lustre ID does not match the OSAM ID, because perhaps we have done > OSAM recovery and we are using a different OSAM ID to hold the Lustre > ID now, we can search for the inode that match the Lustre ID, fetch > the file and also update Lustre with the new OSAM ID.Perfect. We will also need a way of modifying the Lustre FID stored in the archive for the ultimate disaster recovery, which is equivalent to "pre-populating" a Lustre FS with files from the archive. We create an empty file, mark it as "in archive", add the OSAM fid, and need to set the archive''s Lustre FID to match the new empty file.> >> >> >> >> Object SAMQFS - Freeing space on tapes >> -------------------------------------- >> >> We will need a way to determine with Lustre - conclusively that an >> archive is >> no longer needed. >> >> If Lustre policy manager says "rm", then Lustre has no way to ever >> get that file back. There''s no time-machine like old versions of >> directories. Would be a cool feature though. Maybe archive says >> "ok" to the rm, but secretly holds on to the file for some time in a >> special "recently deleted" dir? > No namespace - no dir. > > If Lustre removes the file, we can delay the scrub. If Lustre can > come back with the Lustre ID and OSAM ID, if it has not been scrubbed, > you can get it back. >Let''s not worry about this for V1. Once a file is rm''ed from Lustre, no way to get back Lustre ID or OSAM ID.
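The lookup-with-verification protocol Colin and Nathan converge on here can be summarized in a few lines. A sketch (Python; the in-memory "archive" and every helper name are placeholders for whatever the real OSAM client API ends up being):

    # Sketch only: stage-in lookup that verifies the Lustre FID and falls back
    # to a slow search, as discussed above. All names here are placeholders.
    from dataclasses import dataclass

    @dataclass
    class ArchivedObject:
        lustre_fid: str
        osam_id: bytes
        data: bytes

    class ToyArchive:
        def __init__(self):
            self.by_osam_id = {}     # fast, direct-index path
            self.by_fid = {}         # slow search path, keyed by Lustre FID

        def store(self, obj):
            self.by_osam_id[obj.osam_id] = obj
            self.by_fid[obj.lustre_fid] = obj

    def stage_in(archive, lustre_fid, osam_id):
        """Fetch by OSAM ID, verify the FID, fall back to a search if they disagree."""
        obj = archive.by_osam_id.get(osam_id)
        if obj is not None and obj.lustre_fid == lustre_fid:
            return obj.data, osam_id                  # common case: still in sync
        obj = archive.by_fid.get(lustre_fid)          # OSAM ID stale, e.g. after samfsrestore
        if obj is None:
            raise FileNotFoundError(lustre_fid)
        return obj.data, obj.osam_id                  # caller updates Lustre with the new ID

    arc = ToyArchive()
    arc.store(ArchivedObject("0x200000401:0x5:0x0", b"\x02" * 16, b"payload"))
    data, fresh_id = stage_in(arc, "0x200000401:0x5:0x0", b"\x99" * 16)  # stale OSAM ID
    print(data, fresh_id.hex())

The same shape also covers the "pre-populating" disaster-recovery case Nathan mentions: whichever side finds the IDs out of sync hands back the corrected identifier, and the other side updates its stored copy.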