This is intended to be a starting point for discussion; the concepts here have been hashed through a few times and hopefully represent the best current thinking.

Baseline concepts
1. all single-file coherency issues are in kernel space (file locking, recovery)
2. all policy decisions are in user space (using changelogs, df, etc)
3. coordinator/mover communication will use LNET
4. "simple" refers to
   a. integration with HPSS only
   b. depends on changelog for policy decisions
   c. restore on file open, not data read/write
5. HSM tracks entire files, not stripe objects
6. HSM namespace is flat, all files are addressed by FID only
7. Desired: coordinator and movers can be reused by (non-HSM) replication

Components
1. Mover
   a. combined kernel (LNET comms) and userspace processes
   b. userspace processes will use Lustre clients for data i/o
   c. will use the special fid directory for file access (.lustre/fid/XXXX)
   d. interfaces with a hardware-specific copy tool to access HSM files
   e. the kernel process encompasses service threads listening for coordinator requests, and passes these up to the userspace process via upcall. No interaction with the client is needed; this is a simple message passing service.
2. Coordinator
   a. decides and dispatches copyin and copyout requests to movers
   b. consolidates repeat requests
   c. re-queues requests to a new agent if an agent becomes unresponsive
   d. kernel space, associated with the MDT for cache-miss handling
   e. ioctl interface for copyout and purge requests from the policy engine
3. Policy engine (aka Space Manager)
   a. makes policy decisions for copyout, purge
   b. normally uses changelogs and "df" for input; rarely is allowed to scan the filesystem
   c. userspace process; requests copyout and purge via ioctl to the coordinator
4. MDT changes
   a. Per-file layout lock
      A new layout lock is created for every file. A private writer lock is taken by the MDT when allocating/changing the file layout (LOV EA). Shared reader locks are taken by anyone reading the layout (client opens, lfs getstripe). Anyone taking a new extent lock anywhere in the file must first hold the layout lock. Problem: the layout lock can't be held by liblustre during i/o?
   b. LOV EA changes
      i. flags: file_is_purged "purged", copyout_begin, file_in_HSM_is_out_of_date "hsm_dirty", copyout_complete. The purged flag is always manipulated under a write layout lock; the other flags are not.
      ii. "window" EA: range of non-purged data (rev2)
   c. new file ioctls: HSMCopyOut, HSMPurge, HSMCopyinDone
      (A minimal sketch of these flags and ioctls follows this outline.)

Algorithms
1. copyout
   a. Policy engine decides to copy a file to HSM, executes HSMCopyOut ioctl on the file
   b. ioctl handled by MDT, which passes the request to the Coordinator
   c. coordinator dispatches the request to a mover. The request should include file extents (for future purposes)
   d. a normal extent read lock is taken by the mover running on a client
   e. mover sets the "copyout_begin" bit and clears the "hsm_dirty" bit in the EA.
   f. any writes to the file set the "hsm_dirty" bit (may be lazy/delayed with mtime or filesize change updates on the MDT). Note that file writes need not cancel copyout; for a fs with a single big file, we don't want to keep interrupting copyout or it will never finish.
   g. when done, the mover checks the hsm_dirty bit. If set, it clears copyout_begin, indicating the current file is not in HSM. If not set, the mover sets the "copyout_complete" bit. The file layout write lock is not taken during mover flag manipulation. (Note: file modifications after copyout is complete will have both copyout_complete and hsm_dirty bits set.)
2. purge (aka punch)
   a. Policy engine decides to purge a file, executes HSMPurge ioctl on the file
   b. ioctl handled by MDT
   c. MDT takes a write lock on the file layout lock
   d. MDT enqueues write locks on all extents of the file. After these are granted, no client has any dirty cache and no client can take new extent locks until the layout lock is released. MDT drops all extent locks.
   e. MDT verifies that the hsm_dirty bit is clear and the copyout_complete bit is set
   f. MDT marks the LOV EA as "purged"
   g. MDT destroys the OST objects, using destroy llog entries to guard against object leakage during OST failover
   h. MDT drops the layout lock.
3. restore (aka copyin aka cache miss)
   a. Client open intent enqueues a layout read lock.
   b. MDT checks the "purged" bit; if purged, the lock request response includes a "wait forever" flag, causing the client to block the open.
   c. MDT creates a new layout with a similar stripe pattern to the original, allocating new objects on new OSTs. (We should try to respect specific layout settings (pool, stripecount, stripesize), but be flexible if e.g. the pool doesn't exist anymore. Maybe we want to ignore offset and/or specific OST allocations in order to rebalance.)
   d. MDT sends a request to the coordinator requesting copyin of the file to .lustre/fid/XXXX with extents 0-EOF. Extents may be used in the future to (a) copy in part of a file, in low-disk-space situations; (b) copy in individual stripes simultaneously on multiple OSTs.
   e. Coordinator distributes that request to an appropriate mover.
   f. Writes into .lustre/fid/* are not required to hold the layout read lock (or a special flag is passed to open, or a group write lock on the layout is passed to the mover)
   g. Mover copies data from HSM
   h. When finished, the mover calls ioctl HSM_COPYIN_DONE on the file
   i. MDT clears the "purged" bit from the LOV EA
   j. MDT releases the layout lock
   k. This sends a completion AST to the original client, who now completes his open.

State machines
TBD - I think there's enough in here to chew on for a while

Things requiring a more detailed look
1. configuration of HSM/movers
2. policy engine
3. "complex" HSM roadmap
   a. partial access to files during restore
   b. partial purging for file type identification, image thumbnails, ??
   c. integration with other HSM backends (ADM, ??)
4. layout locks and liblustre
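For concreteness, here is a minimal C sketch of the LOV EA flags and the new file ioctls described above. The bit values, ioctl numbers, and struct/macro names are illustrative assumptions for discussion, not the actual Lustre definitions.

/*
 * Illustrative only: flag bits, ioctl numbers and names are assumptions,
 * not the real Lustre interface.
 */
#include <stdint.h>
#include <sys/ioctl.h>

/* HSM state flags kept in the LOV EA (see "LOV EA changes" above) */
#define HSM_FLAG_PURGED           0x0001  /* OST objects destroyed; data only in HSM */
#define HSM_FLAG_COPYOUT_BEGIN    0x0002  /* a mover has started copying this file out */
#define HSM_FLAG_HSM_DIRTY        0x0004  /* file modified since copyout started */
#define HSM_FLAG_COPYOUT_COMPLETE 0x0008  /* HSM holds a complete copy of the file */

/* Extent argument, so future versions can copy out/in part of a file */
struct hsm_extent {
        uint64_t he_start;
        uint64_t he_end;                  /* 0 - EOF for the "simple" HSM */
};

/* New file ioctls issued by the policy engine and the mover */
#define LL_IOC_HSM_COPYOUT      _IOW('f', 221, struct hsm_extent)  /* policy engine */
#define LL_IOC_HSM_PURGE        _IO('f', 222)                      /* policy engine */
#define LL_IOC_HSM_COPYIN_DONE  _IO('f', 223)                      /* mover, on completion */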
Nathaniel Rutman wrote:
> c. restore on file open, not data read/write

Take care of the difficulties of moving this behaviour to restore-on-first-I/O later.

> d. interfaces with a hardware-specific copy tool to access HSM files

Rather "HSM-specific".

> e. the kernel process encompasses service threads listening for coordinator requests, and passes these up to the userspace process via upcall. No interaction with the client is needed; this is a simple message passing service.

It depends on how you manage a user-space process, but, AFAIK, to be able to manage the copy tool process, that means:
- start this process
- send it a signal
- get its output (for "complex" HSM, we will need feedback from the copy tool process)
- wait for the process to end
- ...
All of this is easily doable from userspace, and very hard in kernel space (we cannot use the fire-and-forget call call_usermodehelper). So I rather imagine:
- a kernel-space mover, simply getting LNET messages and passing them to the user-space mover
- a user-space mover, forking, spawning and managing the copy tool process (a minimal sketch follows this message).

Maybe it will need to manage several copy tool processes, so it will need queues, a process list, etc. So I think this tool needs a bit more than just a "simple message passing service".

> b. LOV EA changes
>    i. flags: file_is_purged "purged", copyout_begin, file_in_HSM_is_out_of_date "hsm_dirty", copyout_complete. The purged flag is always manipulated under a write layout lock; the other flags are not.
>    ii. "window" EA: range of non-purged data (rev2)

If you add a window EA (it will be needed anyway for HSM v2), you do not need a purged flag: window.start == window.end is comparable to a purged flag unset (or window.end == 0).

> c. new file ioctls: HSMCopyOut, HSMPurge, HSMCopyinDone
>
> Algorithms
> 1. copyout
>    a. Policy engine decides to copy a file to HSM, executes HSMCopyOut ioctl on the file
>    b. ioctl handled by MDT, which passes the request to the Coordinator
>    c. coordinator dispatches the request to a mover. The request should include file extents (for future purposes)
>    d. a normal extent read lock is taken by the mover running on a client
>    e. mover sets the "copyout_begin" bit and clears the "hsm_dirty" bit in the EA.
>    f. any writes to the file set the "hsm_dirty" bit (may be lazy/delayed with mtime or filesize change updates on the MDT). Note that file writes need not cancel copyout; for a fs with a single big file, we don't want to keep interrupting copyout or it will never finish.

Is it interesting to have a file that is outdated and possibly inconsistent?

>    g. when done, the mover checks the hsm_dirty bit. If set, it clears copyout_begin, indicating the current file is not in HSM. If not set, the mover sets the "copyout_complete" bit. The file layout write lock is not taken during mover flag manipulation. (Note: file modifications after copyout is complete will have both copyout_complete and hsm_dirty bits set.)
>
> 2. purge (aka punch)
>    a. Policy engine decides to purge a file, executes HSMPurge ioctl on the file
>    b. ioctl handled by MDT
>    c. MDT takes a write lock on the file layout lock
>    d. MDT enqueues write locks on all extents of the file. After these are granted, no client has any dirty cache and no client can take new extent locks until the layout lock is released. MDT drops all extent locks.
>    e. MDT verifies that the hsm_dirty bit is clear and the copyout_complete bit is set
>    f. MDT marks the LOV EA as "purged"
>    g. MDT destroys the OST objects, using destroy llog entries to guard against object leakage during OST failover

Are you sure you want to remove those objects if we will need them later, in "complex" HSM? As this mechanism will need to change a lot when we implement the restore-in-place feature, I'm not sure this is the best idea.

>    h. MDT drops the layout lock.
>
> 3. restore (aka copyin aka cache miss)
>    a. Client open intent enqueues a layout read lock.
>    b. MDT checks the "purged" bit; if purged, the lock request response includes a "wait forever" flag, causing the client to block the open.
>    c. MDT creates a new layout with a similar stripe pattern to the original, allocating new objects on new OSTs. (We should try to respect specific layout settings (pool, stripecount, stripesize), but be flexible if e.g. the pool doesn't exist anymore. Maybe we want to ignore offset and/or specific OST allocations in order to rebalance.)
>    d. MDT sends a request to the coordinator requesting copyin of the file to .lustre/fid/XXXX with extents 0-EOF. Extents may be used in the future to (a) copy in part of a file, in low-disk-space situations; (b) copy in individual stripes simultaneously on multiple OSTs.
>    e. Coordinator distributes that request to an appropriate mover.
>    f. Writes into .lustre/fid/* are not required to hold the layout read lock (or a special flag is passed to open, or a group write lock on the layout is passed to the mover)
>    g. Mover copies data from HSM
>    h. When finished, the mover calls ioctl HSM_COPYIN_DONE on the file
>    i. MDT clears the "purged" bit from the LOV EA
>    j. MDT releases the layout lock
>    k. This sends a completion AST to the original client, who now completes his open.

Concerning the new copyout_begin/copyout_complete flags: I'm not an ldlm/recovery specialist, but is it possible to have the mover take a kind of write extent lock on the area it has to copy in/out and downgrade it to a smaller range as the copy tool goes along?

Copy-out:
- Mover takes a specific lock on a range (0-EOF for the moment)
- On this range, reads pass, writes raise a callback on the mover.
- On receiving this callback, if the mover releases its lock, the copyout is cancelled; if not, the write i/o is blocked
- When the mover has copied [0 - cursor], it can downgrade its lock to [cursor - EOF] and release the lock on [0 - cursor].

The same thing could be done for copy-in. The two key points are:
- Could we have a layout lock on a specific range?
- Is it possible to downgrade a range lock with ldlm?

--
Aurelien Degremont
CEA
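To illustrate the split described above (a kernel mover that only passes messages, and a user-space mover that owns the copy tool lifecycle), here is a minimal user-space sketch using standard POSIX process management. The request format and the "copytool" command line are hypothetical; a real mover would also keep per-request state so an abort can signal the right child.

/*
 * Minimal sketch of the user-space half of the mover, assuming the kernel
 * half delivers one request per line on stdin (e.g. via a pipe or char
 * device).  The request format and the "copytool" command are hypothetical.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int run_copytool(const char *op, const char *fid)
{
        pid_t pid = fork();

        if (pid == 0) {
                /* child: exec the HSM-specific copy tool */
                execlp("copytool", "copytool", op, fid, (char *)NULL);
                _exit(127);
        }
        if (pid < 0)
                return -1;

        /* A real mover would record (pid, fid) in a table so that an
         * "abort" request can signal the right child; here we just wait. */
        int status;
        waitpid(pid, &status, 0);
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

int main(void)
{
        char op[16], fid[64];

        /* Requests forwarded by the kernel mover, e.g. "copyout <FID>" */
        while (scanf("%15s %63s", op, fid) == 2)
                run_copytool(op, fid);
        return 0;
}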
Aurelien Degremont wrote:
> Nathaniel Rutman wrote:
>> c. restore on file open, not data read/write
>
> Take care of the difficulties of moving this behaviour to restore-on-first-I/O later.

Indeed. From a client point of view, it only changes which locks it's waiting on, but from a server point of view the OSTs would need to become involved in HSM knowledge. It is more work, but I don't think there would be much "throwaway" code from the former to the latter.

>> d. interfaces with a hardware-specific copy tool to access HSM files
> Rather "HSM-specific".
>
>> e. the kernel process encompasses service threads listening for coordinator requests, and passes these up to the userspace process via upcall. No interaction with the client is needed; this is a simple message passing service.
>
> It depends on how you manage a user-space process, but, AFAIK, to be able to manage the copy tool process, that means:
> - start this process
> - send it a signal
> - get its output (for "complex" HSM, we will need feedback from the copy tool process)
> - wait for the process to end
> - ...
> All of this is easily doable from userspace, and very hard in kernel space (we cannot use the fire-and-forget call call_usermodehelper). So I rather imagine:
> - a kernel-space mover, simply getting LNET messages and passing them to the user-space mover
> - a user-space mover, forking, spawning and managing the copy tool process.
> Maybe it will need to manage several copy tool processes, so it will need queues, a process list, etc.
> So I think this tool needs a bit more than just a "simple message passing service".

As we discussed in the HSM concall this morning, the return path can mostly take place through the file itself via ioctl calls. The mover will open the destination file location in Lustre and then can indicate status through an ioctl: starting, waiting for HSM, periodic pinging or "% complete" messages, copyin complete. This is the "fire-and-forget" model, and can be started from call_usermodehelper. The in-kernel code will only have to deal with one-way requests from coordinator to mover. We also specified 4 types of requests from the coordinator:
1. copyin FID
2. copyout FID
3. abort copy(in|out) FID
4. purge FID from HSM
(A sketch of these request and progress messages follows at the end of this message.) To accomplish 3, it might make sense to store the PID of the process started from the upcall in the kernel (this is returned by the upcall). Closing the file could clear the PID from the kernel list.

>> b. LOV EA changes
>>    i. flags: file_is_purged "purged", copyout_begin, file_in_HSM_is_out_of_date "hsm_dirty", copyout_complete. The purged flag is always manipulated under a write layout lock; the other flags are not.
>>    ii. "window" EA: range of non-purged data (rev2)
>
> If you add a window EA (it will be needed anyway for HSM v2), you do not need a purged flag:
> window.start == window.end is comparable to a purged flag unset (or window.end == 0).

True, but I don't really see a large market for partially purged files, so I don't really believe that it is worth the effort. One of the important points here is that we are deleting stripes off the OSTs, freeing up space, and we won't necessarily restore to those same OSTs. As soon as we have partially purged files that's no longer the case, and I think it complicates things too much.

>> c. new file ioctls: HSMCopyOut, HSMPurge, HSMCopyinDone
>>
>> Algorithms
>> 1. copyout
>>    a. Policy engine decides to copy a file to HSM, executes HSMCopyOut ioctl on the file
>>    b. ioctl handled by MDT, which passes the request to the Coordinator
>>    c. coordinator dispatches the request to a mover. The request should include file extents (for future purposes)
>>    d. a normal extent read lock is taken by the mover running on a client
>>    e. mover sets the "copyout_begin" bit and clears the "hsm_dirty" bit in the EA.
>>    f. any writes to the file set the "hsm_dirty" bit (may be lazy/delayed with mtime or filesize change updates on the MDT). Note that file writes need not cancel copyout; for a fs with a single big file, we don't want to keep interrupting copyout or it will never finish.
>
> Is it interesting to have a file that is outdated and possibly inconsistent?

It is probably useful in some cases -- simulation checkpoints maybe.

>>    g. when done, the mover checks the hsm_dirty bit. If set, it clears copyout_begin, indicating the current file is not in HSM. If not set, the mover sets the "copyout_complete" bit. The file layout write lock is not taken during mover flag manipulation. (Note: file modifications after copyout is complete will have both copyout_complete and hsm_dirty bits set.)
>>
>> 2. purge (aka punch)
>>    a. Policy engine decides to purge a file, executes HSMPurge ioctl on the file
>>    b. ioctl handled by MDT
>>    c. MDT takes a write lock on the file layout lock
>>    d. MDT enqueues write locks on all extents of the file. After these are granted, no client has any dirty cache and no client can take new extent locks until the layout lock is released. MDT drops all extent locks.
>>    e. MDT verifies that the hsm_dirty bit is clear and the copyout_complete bit is set
>>    f. MDT marks the LOV EA as "purged"
>>    g. MDT destroys the OST objects, using destroy llog entries to guard against object leakage during OST failover
>
> Are you sure you want to remove those objects if we will need them later, in "complex" HSM? As this mechanism will need to change a lot when we implement the restore-in-place feature, I'm not sure this is the best idea.

Ah, I think it is important that we do NOT restore in place to the old OST objects. The OSTs may now be full, or indeed not exist anymore. Restore in place for complex HSM is at the file level; the objects may move around. "Complex" in this case just means that clients will have access to partially restored files.

>>    h. MDT drops the layout lock.
>>
>> 3. restore (aka copyin aka cache miss)
>>    a. Client open intent enqueues a layout read lock.
>>    b. MDT checks the "purged" bit; if purged, the lock request response includes a "wait forever" flag, causing the client to block the open.
>>    c. MDT creates a new layout with a similar stripe pattern to the original, allocating new objects on new OSTs. (We should try to respect specific layout settings (pool, stripecount, stripesize), but be flexible if e.g. the pool doesn't exist anymore. Maybe we want to ignore offset and/or specific OST allocations in order to rebalance.)
>>    d. MDT sends a request to the coordinator requesting copyin of the file to .lustre/fid/XXXX with extents 0-EOF. Extents may be used in the future to (a) copy in part of a file, in low-disk-space situations; (b) copy in individual stripes simultaneously on multiple OSTs.
>>    e. Coordinator distributes that request to an appropriate mover.
>>    f. Writes into .lustre/fid/* are not required to hold the layout read lock (or a special flag is passed to open, or a group write lock on the layout is passed to the mover)
>>    g. Mover copies data from HSM
>>    h. When finished, the mover calls ioctl HSM_COPYIN_DONE on the file
>>    i. MDT clears the "purged" bit from the LOV EA
>>    j. MDT releases the layout lock
>>    k. This sends a completion AST to the original client, who now completes his open.
>
> Concerning the new copyout_begin/copyout_complete flags: I'm not an ldlm/recovery specialist, but is it possible to have the mover take a kind of write extent lock on the area it has to copy in/out and downgrade it to a smaller range as the copy tool goes along?

This is called "lock conversion" and is not yet implemented, but has been a general Lustre design goal for some time. So yes, for "complex" HSM this is what we would want to do.

> Copy-out:
> - Mover takes a specific lock on a range (0-EOF for the moment)
> - On this range, reads pass, writes raise a callback on the mover.
> - On receiving this callback, if the mover releases its lock, the copyout is cancelled; if not, the write i/o is blocked

I don't think we want to block the write just because the HSM copy isn't done yet. If the data is changing, then the policy engine shouldn't have started a copyout process in the first place. If the customer's goal is to do a coherent checkpoint, then it should explicitly wait for the copyout to be done. If it's just the policy engine that got it wrong, it doesn't matter whether it finishes or not; the file will be marked "hsm_dirty", so the policy engine should re-queue it for copyout again later, and it can't be purged in the meantime since the dirty bit is set.

> - When the mover has copied [0 - cursor], it can downgrade its lock to [cursor - EOF] and release the lock on [0 - cursor].
> The same thing could be done for copy-in. The two key points are:
> - Could we have a layout lock on a specific range?

Not the layout lock - layout means the striping pattern, and it must be held before any extent locks can be taken. So what you are asking for we plan to do with two locks: the layout lock plus another extent lock.

> - Is it possible to downgrade a range lock with ldlm?

Not yet, but as I said, lock conversion is a general Lustre goal.
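As promised above, a minimal C sketch of the one-way coordinator-to-mover request and of the progress states a mover might report back through the file ioctl. The enum names, struct layout, and ioctl number are assumptions for discussion; only the four request types and the FID addressing come from the design above.

/*
 * Illustrative only: names, values and the ioctl number are assumptions,
 * not the real Lustre interface.
 */
#include <stdint.h>
#include <sys/ioctl.h>

struct lu_fid {                 /* mirrors the Lustre FID: sequence, object id, version */
        uint64_t f_seq;
        uint32_t f_oid;
        uint32_t f_ver;
};

/* The four one-way requests the coordinator sends to a mover */
enum hsm_request {
        HSMR_COPYIN     = 1,    /* restore FID from the HSM */
        HSMR_COPYOUT    = 2,    /* archive FID to the HSM */
        HSMR_ABORT_COPY = 3,    /* abort an in-progress copyin/copyout of FID */
        HSMR_PURGE      = 4,    /* remove FID from the HSM itself */
};

struct hsm_mover_request {
        uint32_t        hmr_request;        /* enum hsm_request */
        struct lu_fid   hmr_fid;            /* the only name the HSM knows the file by */
        uint64_t        hmr_extent_start;   /* 0 .. */
        uint64_t        hmr_extent_end;     /* .. EOF for the "simple" HSM */
};

/* Status the mover reports on the open file (the "fire-and-forget" return path) */
enum hsm_progress_state {
        HSMP_STARTING,          /* upcall process has started */
        HSMP_WAITING_FOR_HSM,   /* e.g. waiting for a tape mount */
        HSMP_RUNNING,           /* accompanied by a percent-complete value */
        HSMP_DONE,              /* copy finished; for copyin, followed by HSM_COPYIN_DONE */
};

struct hsm_progress {
        uint32_t        hp_state;       /* enum hsm_progress_state */
        uint32_t        hp_percent;     /* valid when hp_state == HSMP_RUNNING */
};

#define LL_IOC_HSM_PROGRESS  _IOW('f', 224, struct hsm_progress)  /* hypothetical */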
Nathaniel Rutman wrote:
>>> b. LOV EA changes
>>>    i. flags: file_is_purged "purged", copyout_begin, file_in_HSM_is_out_of_date "hsm_dirty", copyout_complete. The purged flag is always manipulated under a write layout lock; the other flags are not.
>>>    ii. "window" EA: range of non-purged data (rev2)
>>
>> If you add a window EA (it will be needed anyway for HSM v2), you do not need a purged flag:
>> window.start == window.end is comparable to a purged flag unset (or window.end == 0).
>
> True, but I don't really see a large market for partially purged files, so I don't really believe that it is worth the effort. One of the important points here is that we are deleting stripes off the OSTs, freeing up space, and we won't necessarily restore to those same OSTs. As soon as we have partially purged files that's no longer the case, and I think it complicates things too much.

Ok, I've been told I'm dead wrong here, and this will absolutely be required for "complex" HSM (not "simple"), so we should at least think about the arch now. Supposedly we need to keep X bytes at the beginning of the file for the unix "file" command and supposedly icon/preview data, and Y bytes at the end of the file, not sure exactly why. We would still plan on deleting the OST objects in the middle. And clearly, a simple beginning/ending byte count is insufficient for the final "complex" requirement of enabling partial file reads while doing a copyin (where we would at a minimum need a per-object cursor). Anyhow, as I write this, none of it sounds like something that can't be implemented at a later time, so I think we should stick with the simplest of the simple options for now. (A minimal sketch of such a window EA follows this message.)
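For the record, one possible shape for the rev2 "window" EA discussed above, in C. This is purely illustrative and not part of the "simple" design; the struct and field names are assumptions.

/*
 * Hypothetical rev2 "window" EA: the range of non-purged data.  The
 * keep-the-beginning-and-end requirement above would need either two such
 * ranges or, for partial reads during copyin, per-object cursors.
 */
#include <stdint.h>

struct hsm_window {
        uint64_t hw_start;      /* first resident (non-purged) byte */
        uint64_t hw_end;        /* end of resident data; hw_start == hw_end means fully purged */
};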
Nathaniel Rutman wrote:
> Ok, I've been told I'm dead wrong here, and this will absolutely be required for "complex" HSM (not "simple"), so we should at least think about the arch now. Supposedly we need to keep X bytes at the beginning of the file for the unix "file" command and supposedly icon/preview data, and Y bytes at the end of the file, not sure exactly why. We would still plan on deleting the OST objects in the middle. And clearly, a simple beginning/ending byte count is insufficient for the final "complex" requirement of enabling partial file reads while doing a copyin (where we would at a minimum need a per-object cursor). Anyhow, as I write this, none of it sounds like something that can't be implemented at a later time, so I think we should stick with the simplest of the simple options for now.

Ok. Can you just sum up the in-place copy-in mechanism that has been decided on (between the Menlo Park version and the other ones)?

--
Aurelien Degremont
CEA
Nathan,

> True, but I don't really see a large market for partially purged files, so I don't really believe that it is worth the effort. One of the important points here is that we are deleting stripes off the OSTs, freeing up space, and we won't necessarily restore to those same OSTs. As soon as we have partially purged files that's no longer the case, and I think it complicates things too much.

Partially purged files are a requirement to allow graphical file browsers to retrieve icons from within the file. It's OK to leave this out in the first version, but it has to be there for the full product.

>>> Algorithms
>>> 1. copyout
>>>    a. Policy engine decides to copy a file to HSM, executes HSMCopyOut ioctl on the file
>>>    b. ioctl handled by MDT, which passes the request to the Coordinator
>>>    c. coordinator dispatches the request to a mover. The request should include file extents (for future purposes)
>>>    d. a normal extent read lock is taken by the mover running on a client
>>>    e. mover sets the "copyout_begin" bit and clears the "hsm_dirty" bit in the EA.
>>>    f. any writes to the file set the "hsm_dirty" bit (may be lazy/delayed with mtime or filesize change updates on the MDT). Note that file writes need not cancel copyout; for a fs with a single big file, we don't want to keep interrupting copyout or it will never finish.
>>
>> Is it interesting to have a file that is outdated and possibly inconsistent?
>
> It is probably useful in some cases -- simulation checkpoints maybe.

A corrupt simulation checkpoint is useless. We _must_ provide a way to ensure the HSM copy of a file is a known good snapshot. We don't necessarily have to abort the copyout if there is an update that could mean the HSM copy would be corrupt, since we can always just copy it out again, but it doesn't seem hugely complicated to notify the backend, if not the agent, and let it decide.

>> Are you sure you want to remove those objects if we will need them later, in "complex" HSM? As this mechanism will need to change a lot when we implement the restore-in-place feature, I'm not sure this is the best idea.
>
> Ah, I think it is important that we do NOT restore in place to the old OST objects. The OSTs may now be full, or indeed not exist anymore. Restore in place for complex HSM is at the file level; the objects may move around. "Complex" in this case just means that clients will have access to partially restored files.

Can't the "complex" HSM restore to new objects? It just depends on when the new-being-restored objects become the new contents of the file, doesn't it?

>> Copy-out:
>> - Mover takes a specific lock on a range (0-EOF for the moment)
>> - On this range, reads pass, writes raise a callback on the mover.
>> - On receiving this callback, if the mover releases its lock, the copyout is cancelled; if not, the write i/o is blocked
>
> I don't think we want to block the write just because the HSM copy isn't done yet. If the data is changing, then the policy engine shouldn't have started a copyout process in the first place.

Indeed.

> If the customer's goal is to do a coherent checkpoint, then it should explicitly wait for the copyout to be done.

Disagree - the customer doesn't have to know a copyout is in progress. The HSM should abort the copyout or mark the copy corrupt.

> If it's just the policy engine that got it wrong, it doesn't matter whether it finishes or not; the file will be marked "hsm_dirty", so the policy engine should re-queue it for copyout again later, and it can't be purged in the meantime since the dirty bit is set.

Indeed.

Cheers,
Eric
Eric Barton wrote:
> Partially purged files are a requirement to allow graphical file browsers to retrieve icons from within the file. It's OK to leave this out in the first version, but it has to be there for the full product.

Think also of commands like:

$ file foo*

>>> Is it interesting to have a file that is outdated and possibly inconsistent?
>> It is probably useful in some cases -- simulation checkpoints maybe.
>
> A corrupt simulation checkpoint is useless. We _must_ provide a way to ensure the HSM copy of a file is a known good snapshot. We don't necessarily have to abort the copyout if there is an update that could mean the HSM copy would be corrupt, since we can always just copy it out again, but it doesn't seem hugely complicated to notify the backend, if not the agent, and let it decide.

Indeed, this is important.

>> I don't think we want to block the write just because the HSM copy isn't done yet. If the data is changing, then the policy engine shouldn't have started a copyout process in the first place.
>
> Indeed.

You were speaking of a FS with only one big file, and so we need a way to be sure it will be copied at least once, even if people are writing to it. In this case, with a classical policy engine, this file will never be copied out because the data is constantly changing.

--
Aurelien Degremont
CEA
Aurelien,

>>> I don't think we want to block the write just because the HSM copy isn't done yet. If the data is changing, then the policy engine shouldn't have started a copyout process in the first place.
>>
>> Indeed.
>
> You were speaking of a FS with only one big file, and so we need a way to be sure it will be copied at least once, even if people are writing to it. In this case, with a classical policy engine, this file will never be copied out because the data is constantly changing.

I'm not so sure that's a realistic case. If this file is so active that it's impossible to take a consistent copy of it without some sort of snapshot facility, does that really make it a candidate for archiving?

Cheers,
Eric
On 10/17/08 3:47 AM, "Aurelien Degremont" <aurelien.degremont at cea.fr> wrote:
> Eric Barton wrote:
>> Partially purged files are a requirement to allow graphical file browsers to retrieve icons from within the file. It's OK to leave this out in the first version, but it has to be there for the full product.
>
> Think also of commands like
>
> $ file foo*
>
>>>> Is it interesting to have a file that is outdated and possibly inconsistent?

99.99% (probably more 9's) of backup systems do work this way, with relatively little harm. Also remember that many files are append-only - for those it might be fine. Philosophically it is a disaster, of course. I would offer archiving of files that are active, and in due course use snapshots.

Peter

>>> It is probably useful in some cases -- simulation checkpoints maybe.
>>
>> A corrupt simulation checkpoint is useless. We _must_ provide a way to ensure the HSM copy of a file is a known good snapshot. We don't necessarily have to abort the copyout if there is an update that could mean the HSM copy would be corrupt, since we can always just copy it out again, but it doesn't seem hugely complicated to notify the backend, if not the agent, and let it decide.
>
> Indeed, this is important.
>
>>> I don't think we want to block the write just because the HSM copy isn't done yet. If the data is changing, then the policy engine shouldn't have started a copyout process in the first place.
>>
>> Indeed.
>
> You were speaking of a FS with only one big file, and so we need a way to be sure it will be copied at least once, even if people are writing to it. In this case, with a classical policy engine, this file will never be copied out because the data is constantly changing.
High-level architecture page for the Lustre HSM project http://arch.lustre.org/index.php?title=HSM_Migration HSM core team - this is intended to be sufficient to write a full HLD/DLD from. What is it missing?
Nathaniel Rutman wrote:
> High-level architecture page for the Lustre HSM project
> http://arch.lustre.org/index.php?title=HSM_Migration
>
> HSM core team - this is intended to be sufficient to write a full HLD/DLD from. What is it missing?

I think that after Tuesday's conf call, all the main elements for writing the HLD will be there. Maybe some small points will be missing, but we can discuss them by e-mail or at the conf calls. I think most of it is already there.

--
Aurelien Degremont
CEA
A few questions:

- For a large existing archive of tapes (~10,000,000 files) it is desirable to import file metadata into the Lustre fs without actually copying the files to disk. The import should be done in a reasonable time (hours rather than months), or online.

- To provide bandwidth to tape it is desirable to have multiple migrator nodes connected to the HSM. What element of the proposed design distributes copy-out processes across migrator nodes to provide scalability? Is that functionality in the HSM-specific copy tool, or does the Lustre agent provide it?

- A "smart" HSM system can reorder requests to optimize tape access. It is common to have 2000 requests pending in the queue with tens or hundreds of IO transfers actually being served. The current limit on pending requests is about 30,000. We found that implementing pending requests as processes (one copy-out tool process per request waiting for IO) is resource consuming and does not scale. What is the way to serve ~100,000 requests waiting for transfer?

- How do we prestage files, i.e. send an asynchronous request to copy in a file from tape without blocking on the wait? This is needed to stage large data sets for future processing. Prestaging "file sets" is desirable.

- What is the proposed scenario for handling an OST that is down? Suppose a file is present on one OST and that OST goes down (stripe count is one). My understanding is that the client will wait for the OST to come back (case [1]) and the file will not be staged from tape automatically. If the file is not present on any OST, it will be staged immediately (case [2]). Is it possible to stage the file automatically (case [1]) to another OST and mark the copy on the old OST for removal?

We discussed some of these questions with Peter; he suggested asking on the devel list.

Best regards, Alex.

Nathaniel Rutman wrote:
> High-level architecture page for the Lustre HSM project
> http://arch.lustre.org/index.php?title=HSM_Migration
>
> HSM core team - this is intended to be sufficient to write a full HLD/DLD from. What is it missing?
Alex Kulyavtsev wrote:
> - For a large existing archive of tapes (~10,000,000 files) it is desirable to import file metadata into the Lustre fs without actually copying the files to disk. The import should be done in a reasonable time (hours rather than months), or online.

Agreed. Probably best done via a special ioctl that would create a stub file and populate the metadata.

> - To provide bandwidth to tape it is desirable to have multiple migrator nodes connected to the HSM. What element of the proposed design distributes copy-out processes across migrator nodes to provide scalability? Is that functionality in the HSM-specific copy tool, or does the Lustre agent provide it?

Lustre agents can run on multiple Lustre clients in parallel. The coordinator distributes copyout jobs to the different agents.

> - A "smart" HSM system can reorder requests to optimize tape access. It is common to have 2000 requests pending in the queue with tens or hundreds of IO transfers actually being served. The current limit on pending requests is about 30,000. We found that implementing pending requests as processes (one copy-out tool process per request waiting for IO) is resource consuming and does not scale. What is the way to serve ~100,000 requests waiting for transfer?

The coordinator decides when to request copyin/out jobs, and could throttle the total number of concurrent accesses.

> - How do we prestage files, i.e. send an asynchronous request to copy in a file from tape without blocking on the wait? This is needed to stage large data sets for future processing. Prestaging "file sets" is desirable.

The policy engine would request copyin of files before a cache miss on open. Policy could define file sets.

> - What is the proposed scenario for handling an OST that is down? Suppose a file is present on one OST and that OST goes down (stripe count is one). My understanding is that the client will wait for the OST to come back (case [1]) and the file will not be staged from tape automatically. If the file is not present on any OST, it will be staged immediately (case [2]). Is it possible to stage the file automatically (case [1]) to another OST and mark the copy on the old OST for removal?

With our V2 HSM, we will have the ability to keep more detailed layouts; this optimization could be part of those changes.

> We discussed some of these questions with Peter; he suggested asking on the devel list.

We greatly appreciate it! Please ask/suggest away.

> Best regards, Alex.
On Nov 25, 2008 10:59 -0600, Alex Kulyavtsev wrote:
> - For a large existing archive of tapes (~10,000,000 files) it is desirable to import file metadata into the Lustre fs without actually copying the files to disk. The import should be done in a reasonable time (hours rather than months), or online.

Conceivably this could be done with "mknod" and "setxattr" to store the striping information into the Lustre inode (a minimal sketch follows at the end of this message). However, one issue will be how to identify this new file to the HSM. The current plan is that the Lustre HSM policy engine database will contain the mapping between the Lustre FID (~= inode number) and the file in the archive. Since this is a new file (FID), we would also need to add an entry to the policy engine database containing the mapping from FID to archive file.

> - A "smart" HSM system can reorder requests to optimize tape access. It is common to have 2000 requests pending in the queue with tens or hundreds of IO transfers actually being served. The current limit on pending requests is about 30,000. We found that implementing pending requests as processes (one copy-out tool process per request waiting for IO) is resource consuming and does not scale. What is the way to serve ~100,000 requests waiting for transfer?

The Lustre HSM design has the policy engine as a mediator between the copyin/copyout/purge requests and the userspace agents that are specific to the HSM and do the actual work. The policy engine is free to reorder all of the requests as it sees fit. CEA is supplying their existing policy engine as a starting point for Lustre HSM+HPSS, and this could be made available to interested parties sooner rather than later.

> - What is the proposed scenario for handling an OST that is down? Suppose a file is present on one OST and that OST goes down (stripe count is one). My understanding is that the client will wait for the OST to come back (case [1]) and the file will not be staged from tape automatically. If the file is not present on any OST, it will be staged immediately (case [2]). Is it possible to stage the file automatically (case [1]) to another OST and mark the copy on the old OST for removal?

Since the OST objects will be removed when the file is purged, there is no requirement to store the file on a particular OST during copyin. The HSM will store the striping attributes (probably only if they do not match the filesystem defaults) to ensure that wide-striped files retain this property when returned to the filesystem. In addition to not saving the striping for files that match the default layout, we may also consider saving the layout of files with "stripe_count == target_count" as having stripe_count = -1 (stripe over all OSTs), so that if there are more OSTs available when the file is restored it takes advantage of the additional bandwidth. We might also consider having a (policy engine?) tunable so that files with > N stripes are restriped over all OSTs when restored.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
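A minimal C sketch of the import path described above (create a purged stub without copying data). The EA name, the flag encoding (reusing the illustrative values from the earlier sketch), and the struct layout are assumptions, not defined interfaces; the FID-to-archive mapping would additionally have to be registered with the policy engine database.

/*
 * Sketch of importing one archived file as a purged stub, per the
 * mknod + setxattr approach above.  The "trusted.lustre.hsm" EA name and
 * the hsm_stub_ea layout are hypothetical.
 */
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/xattr.h>

struct hsm_stub_ea {
        uint32_t flags;          /* e.g. purged | copyout_complete (illustrative values) */
        uint32_t stripe_count;   /* layout hint to use when the file is restored */
        uint64_t size;           /* file size as recorded in the archive */
};

int import_stub(const char *path, uint64_t size, uint32_t stripe_count)
{
        struct hsm_stub_ea ea = {
                .flags        = 0x0001 | 0x0008,   /* HSM_FLAG_PURGED | HSM_FLAG_COPYOUT_COMPLETE */
                .stripe_count = stripe_count,
                .size         = size,
        };

        /* Create the metadata-only file (no OST objects are allocated)... */
        if (mknod(path, S_IFREG | 0644, 0) < 0)
                return -1;

        /* ...and record the HSM state and layout hint in an EA.  A real
         * implementation would also add the FID -> archive-file mapping to
         * the policy engine database. */
        if (setxattr(path, "trusted.lustre.hsm", &ea, sizeof(ea), 0) < 0)
                return -1;

        return 0;
}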