Andreas Dilger wrote:
> Nathan,
> Eric and I had a lengthy discussion today about HSM and the copy-in
> process. This was largely driven by Braam's assertion that having a
> copy-in process that blocks all access to the file data is not sufficient
> to meet customer demands. Some customers require processes be able to
> access the file data as soon as it is present in the objects.
>
> Eric and I both agreed that we want to start with as simple an HSM solution
> as possible and incrementally provide improvements, so long as the early
> implementation is not a "throw-away" that consumes significant developer
> resources but doesn't provide long term benefits. In both the "simple"
> and the "complex" copy-in the client has no knowledge/participation
> of the process being done by the HSM/coordinator.
>
> We both agreed that the simplest copy-in process is a reasonable starting
> point and can be used by many customers. To review the simple case
> (I hope this also matches your recollection):
>
> 1) client tries to access a file that has been purged
>    a) if client is only doing getattr, attributes can be returned from MDS
>       - MDS holds file size[*]
>       - client may get MDS attribute read locks, but not layout lock
>       -> DONE
>    b) if client is trying to do an open (read or write)
>       - layout lock is required by client to do any read/write of the file
>       - client enqueues layout lock request
>       - MDS notices that file is purged, does upcall to coordinator to
>         start copy-in on FID N

s/does upcall/asks/. We expect the coordinator to be in-kernel for LNET
comms to agents.

> 2) client is blocked waiting for layout lock
>    - if MDS crashes at this point, client will resend open to MDS, goto 1b
>    - MDS should send early replies indicating lock will be slow to grant

The reply to the layout lock request includes a "wait forever" flag (this
is the one client code change required for HSM at this point). There are
no early replies for lock enqueue requests. Maybe indefinite ongoing
early replies for lock enqueues are a requirement for HSM copy-in?

> ? need to have a mechanism to ensure copy-in hasn't failed?

The coordinator needs to decide if copy-in has failed, and redistribute
the request to a new agent. (Needs detail: timer? progress messages from
agent?) There's nothing the client or MDT can do at this point (except
fail the request), so we may as well just wait.
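To make the "timer? progress messages?" question concrete, here is a
rough userspace-style sketch of how the coordinator might track an
outstanding copy-in. All names are invented for illustration; none of
this is existing Lustre code. It assumes agents send periodic progress
messages that refresh a per-request deadline:

#include <stdbool.h>
#include <time.h>

#define COPYIN_PROGRESS_TIMEOUT 300     /* seconds without progress */

struct copyin_req {
        unsigned long long cr_fid;      /* FID being restored */
        int                cr_agent;    /* agent currently assigned */
        time_t             cr_deadline; /* refreshed by progress messages */
};

/* Called when a progress message arrives from the assigned agent. */
static void copyin_progress(struct copyin_req *req)
{
        req->cr_deadline = time(NULL) + COPYIN_PROGRESS_TIMEOUT;
}

/* Called periodically by the coordinator; returns true if the request
 * was reassigned to a new agent. */
static bool copyin_check(struct copyin_req *req, int next_agent)
{
        if (time(NULL) <= req->cr_deadline)
                return false;           /* still making progress */

        /* Stalled: restart the copy-in on another agent.  The old agent
         * must be fenced (e.g. evicted) before the new one writes. */
        req->cr_agent = next_agent;
        copyin_progress(req);
        return true;
}

Progress messages seem preferable to a single timer, since one timeout
would have to be sized for the largest possible restore.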
> 3) coordinator contacts agent(s) to retrieve FID N from HSM
>    - agent(s) create temp file X (new or backed-up layout parameters) [!]

Backed up in an EA with the original copyout request. We should try to
respect specific layout settings (pool, stripecount, stripesize), but be
flexible if e.g. the pool doesn't exist anymore. Maybe we want to ignore
offset and/or specific OST allocations in order to rebalance.

>    - agent(s) restore data into temp file
>    - agent or coordinator do ioctl on file to move file X objects to
>      file N, old objects are destroyed on file close, or
>    - agent or coordinator do ioctl on file to notify MDS copy-in is done

I was thinking the latter, and the MDT moves the layout from X to N.

> 4) MDS handles ioctl, drops layout lock
> 5) client(s) waiting on layout lock are granted the layout lock by MDS
>    - client(s) get OST extent locks
>    - client(s) read/write file data
>    -> DONE
>
> [*] The MDS will already store the file size today, even without SOM, if
>     the file does not have any objects/striping. If SOM is not implemented
>     then the "purged" state and object removal (with destroy llog entries)
>     would need to be a synchronous operation BEFORE the objects are
>     actually destroyed. Otherwise, SOM-like recovery of the object purge
>     state is needed. Avoiding the sync is desirable, but making HSM
>     dependent upon SOM is undesirable.

All we really have to do is ensure that the destroy llog entry is
committed, right? Then the OSTs should eventually purge the objects
during orphan recovery, yes?

> [!] If MDS kept the layout then it could pre-create the temp file and
>     pass the restore-to FID to the coordinator/agent, to keep the agent
>     more similar to the "complex" case where it is restoring directly
>     into the real file. The only reason the agent is restoring into the
>     temp file is to avoid needing to open the file while the MDS is
>     blocking layout lock access, but maybe that isn't a big obstacle
>     (e.g. open flag).

You mean an open flag like O_IGNORE_LAYOUT_LOCK? The one problem I see
with this is the case of a stuck agent - if we want to start another
agent doing copy-in we have to ensure that the first agent doesn't try to
write anything else. Or we give them two separate temp files, but this
remains a problem for the direct restore into the real file case.
Although I suppose this is already handled by write extent locks and
eviction...

> In the "complex" case, the clients should be able to read/write the file
> data as soon as possible and the OSTs need to prevent access to the parts
> of the file which have not yet been restored.
>
> 1) client tries to access a file that has been purged
>    a) if client is only doing getattr, attributes can be returned from MDS
>       - MDS holds file size[*]
>       - client may get MDS attribute read locks, but not layout lock
>       -> DONE
>    b) if client is trying to do an open (read or write)
>       - layout lock is required by client to do any read/write of the file
>       - client enqueues layout lock request

- MDT generates a new layout based on the old LOV EA, assigning newly
created OST objects.

> - MDS grants layout lock to client
> 2) client enqueues extent lock on OST
>    - object was previously marked fully/partly invalid during purge
>    - object may have persistent invalid map of extent(s) that indicate
>      which parts of object require copy-in

I'll read this as if you're proposing your 2,3 (call it "per-object
invalid ranges held on OSTs") as a new method to do the copy-in in-place.
This is not the original in-place idea proposed in Menlo Park (see
below), so I'll comment with an eye toward the differences.

I think we can't assume we're restoring back to the original OSTs.
Therefore the MDT must create new empty objects on the OSTs and have the
OSTs mark them purged before the layout lock can be granted to the
clients.

> - access to invalid parts of object trigger copy-in upcall to coordinator

Now we need to figure out how to map the object back to a particular
extent of a particular file (are we storing this in an EA with each
object now?) We also need to initiate OST->coordinator communication, so
either the coordinator becomes a distributed function on the OSTs or we
need new services going the reverse of the normal mdt->ost direction.
Maybe the coordinator-as-distributed-function works - the coordinators
must all choose the same agent for objects belonging to the same file,
yet distribute load among agents: I think the coordinator just got a lot
more complicated.
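For what it's worth, if the upcall carries the stripe parameters from
the LOV EA, mapping an object offset back to a file-logical offset is
just the usual RAID-0 striping arithmetic. A sketch (struct and function
names invented for illustration):

#include <stdint.h>

struct stripe_info {
        uint64_t stripe_size;   /* bytes per stripe chunk */
        uint32_t stripe_count;  /* number of objects in the layout */
        uint32_t stripe_index;  /* which stripe this object is */
};

/* Convert an offset within one OST object to a file-logical offset. */
static uint64_t obj_off_to_file_off(const struct stripe_info *si,
                                    uint64_t obj_off)
{
        uint64_t chunk = obj_off / si->stripe_size; /* chunk # in object */
        uint64_t rem   = obj_off % si->stripe_size;

        return (chunk * si->stripe_count + si->stripe_index) *
               si->stripe_size + rem;
}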
> ? group locks on invalid part of file block writes to missing data

The issue here is that we can't allow any client to write and then have
the agent overwrite the new data with old data being restored. So we
could have the OST give a group lock to the agent via the coordinator,
preventing all other writes. But it seems that we can check the special
"clear invalid" flag used by the agent (see (3) below), and silently drop
agent writes into areas not in the "invalid extents" list. Any client
write to any extent will clear the invalid flag for those extents. And
then we only ever need to block on reading.

What about reads to missing data? The OST refuses to grant read locks on
invalid extents, and needs clients to wait forever.

> - clients block waiting on extent locks for invalid parts of objects

We'll have to set this extent lock enqueue timeout to wait forever.

> - OST crash at this time restarts enqueue process

An agent crash will still have to be detected and restarted by the
coordinator.

> 3) coordinator contacts agent(s) to retrieve FID N from HSM
>    - agents write to actual object to be restored with "clear invalid"
>      flag
>    - writes by agent shrink invalid extent, periodically update on-disk
>      invalid extent and release locks on that part of file (on commit?)

The OST should keep track of all invalid extents. Invalid extents list
changes should be stored on disk, transactionally with the data write.
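A toy sketch of the bookkeeping proposed here, with invented names; a
real implementation would keep the list with the object itself and
update it in the same transaction as the data write:

#include <stdbool.h>
#include <stdint.h>

struct inval_ext {
        uint64_t start, end;            /* [start, end) still in HSM */
};

struct inval_map {
        struct inval_ext ext[64];       /* toy fixed-size list */
        int              nr;
};

/* Does [start, end) overlap any not-yet-restored range? */
static bool inval_overlaps(const struct inval_map *m,
                           uint64_t start, uint64_t end)
{
        for (int i = 0; i < m->nr; i++)
                if (start < m->ext[i].end && end > m->ext[i].start)
                        return true;
        return false;
}

/* Write-path policy sketched above:
 *  - agent writes (sent with the "clear invalid" flag) only take effect
 *    where the range is still invalid, so a late or stuck agent can
 *    never clobber newly written user data;
 *  - normal client writes always proceed, and clear the invalid bits
 *    for the range they cover (the list surgery is omitted in this toy).
 */
static bool write_allowed(const struct inval_map *m, bool agent_write,
                          uint64_t start, uint64_t end)
{
        if (agent_write)
                return inval_overlaps(m, start, end);
        return true;
}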
> - client or agent agent crash doesn't want to access parts of multi-
>   part archive it will

??

The invalid extents list will be accurate regardless of client, agent, or
OST crash. I hope. Subsequent requests for missing data will result in
new OST requests to the coordinator.

> 4) client is granted extent lock when that part of file is copied in

So that actually doesn't sound too bad. I think the original idea of
keeping the locking (and the coordinator) on the MDT (below) is still
simpler, but I think it's going to be the recovery issues that decide
this one way or the other.

Original in-place copy-in idea: when the MDT generates the new layout, it
takes PW write locks on all extents of every stripe on behalf of the
agent, and then somehow transfers these locks to the agent (this
transferability was the point of using the group lock). The agent then
releases extent locks as it copies in data. This was the first design we
discussed in Menlo Park:

(older idea, for posterity)
Open intent enqueues the layout lock. The MDT checks the "purged" bit; if
purged, the MDT selects a new layout and populates the MD. The MDT takes
group extent locks on all objects, then grants the layout read lock to
the client, allowing the open to finish successfully and quickly. (Client
reads/writes will block forever on extent enqueues until the group lock
has been dropped.) The MDT then sends a request to the coordinator
requesting copy-in of FID XXXX with group lock id YYYY (and extents
0-end). The coordinator distributes that request to an appropriate agent.
The agent retrieves the file from the HSM and writes into
/.lustre/fid/XXXX:XXXX using group lock YYYY. The agent takes the group
lock; the MDT still holds the group lock. When finished, the agent clears
the "purged" bit from the EA and drops the group lock. Clearing the
purged bit causes the MDT to drop its group lock as well, allowing the
client to read/write.

It gets fuzzy at the end there, about exactly when the MDT drops the
group lock in order to handle the dead agent case. It seems the safe
thing to do is for the MDT to keep it until the agent is done, but then
this blocks access to completed extents. If the MDT drops the group lock
as soon as the agent takes it, then somehow the agent converts the group
lock to a regular write lock, and other clients can get read/write locks
on released extents. But if the agent dies, the extent locks will be
freed at eviction, and other clients are free to start reading (missing)
data.
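Just to spell out that last failure window, a toy model (names invented)
of when a client could read unrestored data under the "MDT drops early"
scheme:

#include <stdbool.h>

/* Returns true if clients could acquire extent locks over unrestored
 * data.  With the MDT holding its group lock to the end, this can never
 * happen (but completed extents stay blocked).  If the MDT drops early,
 * only an OST-side invalid-extent map can still protect the holes after
 * the agent is evicted. */
static bool missing_data_readable(bool mdt_holds_group_lock,
                                  bool agent_alive,
                                  bool ost_tracks_invalid_extents)
{
        if (mdt_holds_group_lock)
                return false;   /* MDT's group lock blocks other clients */
        if (agent_alive)
                return false;   /* agent's locks cover unrestored extents */
        return !ost_tracks_invalid_extents;
}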
Note that Andreas' simple vs. complex case seems to fundamentally affect
the design of the coordinator (whether it is associated with the MDT or
the OSTs), and so I don't see a clear non-throw-away path from one to the
other. I think the "original in-place copy-in" idea is more compatible
with the simple case. Also note that Braam posited that copy-in at open
is a desired simplification for the "Simplified HSM for Lustre"
(lustre-devel 7/16).
On Oct 09, 2008 12:11 -0700, Nathaniel Rutman wrote:
> Andreas Dilger wrote:
>> The only reason the agent is restoring into the temp file is to avoid
>> needing to open the file while the MDS is blocking layout lock access,
>> but maybe that isn't a big obstacle (e.g. open flag).
>
> You mean an open flag like O_IGNORE_LAYOUT_LOCK? The one problem I see
> with this is the case of a stuck agent - if we want to start another
> agent doing copy-in we have to ensure that the first agent doesn't try
> to write anything else.

Having two agents on the same file wouldn't itself be harmful, because
they should both be restoring the same data to the same place. That
said, we would still want to be able to kill the stuck agent so it does
not continue to "restore" the file over new user data after the second
agent has reported "file is available" and the user process has started
writing to it.

>> 2) client enqueues extent lock on OST
>>    - object was previously marked fully/partly invalid during purge
>>    - object may have persistent invalid map of extent(s) that indicate
>>      which parts of object require copy-in
>
> I'll read this as if you're proposing your 2,3 (call it "per-object
> invalid ranges held on OSTs") as a new method to do the copy-in
> in-place. This is not the original in-place idea proposed in Menlo Park
> (see below), so I'll comment with an eye toward the differences.

Correct, this is something Eric and I recently discussed in the context
of being able to begin using a file before copy-in has completed.

> I think we can't assume we're restoring back to the original OSTs.

Definitely not.

> Therefore the MDT must create new empty objects on the OSTs and have the
> OSTs mark them purged before the layout lock can be granted to the
> clients.

Correct.

>> - access to invalid parts of object trigger copy-in upcall to coordinator
>
> Now we need to figure out how to map the object back to a particular
> extent of a particular file (are we storing this in an EA with each
> object now?)

We had also discussed the need for this with migration. The OSTs already
store the MDS FID on each object, and even if the OSTs cannot do the
object->file extent mapping, their upcall to the coordinator can do this
with the LOV EA and the object extent.

> We also need to initiate OST->coordinator communication, so either the
> coordinator becomes a distributed function on the OSTs or we need new
> services going the reverse of the normal mdt->ost direction. Maybe the
> coordinator-as-distributed-function works - the coordinators must all
> choose the same agent for objects belonging to the same file, yet
> distribute load among agents: I think the coordinator just got a lot
> more complicated.

I don't think this implies the need for a distributed coordinator. The
OSTs would contact the coordinator (as the MDS does at file access in the
"simple" model) with the MDS FID (+OST extent?) and the coordinator
determines whether there is an existing copy-in for that FID or not.
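Something like this (toy linear table, names invented) may be all the
coordinator-side state that is needed to fold duplicate requests into
one active copy-in per FID:

#include <stdbool.h>
#include <stdint.h>

struct copyin_entry {
        uint64_t fid;           /* MDS FID under restore */
        uint64_t start, end;    /* union of requested extents so far */
        bool     active;
};

#define MAX_COPYINS 128
static struct copyin_entry table[MAX_COPYINS];

/* Returns true if a new copy-in must be dispatched to an agent, false
 * if the request was folded into an already-active one. */
static bool coordinator_request(uint64_t fid, uint64_t start, uint64_t end)
{
        struct copyin_entry *slot = NULL;

        for (int i = 0; i < MAX_COPYINS; i++) {
                if (table[i].active && table[i].fid == fid) {
                        /* Existing copy-in for this FID: widen extent. */
                        if (start < table[i].start)
                                table[i].start = start;
                        if (end > table[i].end)
                                table[i].end = end;
                        return false;
                }
                if (!table[i].active && slot == NULL)
                        slot = &table[i];
        }
        if (slot != NULL) {
                *slot = (struct copyin_entry){ fid, start, end, true };
                return true;
        }
        return false;           /* toy: table full, caller must retry */
}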
>> ? group locks on invalid part of file block writes to missing data
>
> The issue here is that we can't allow any client to write and then have
> the agent overwrite the new data with old data being restored. So we
> could have the OST give a group lock to the agent via the coordinator,
> preventing all other writes. But it seems that we can check the special
> "clear invalid" flag used by the agent (see (3) below), and silently
> drop agent writes into areas not in the "invalid extents" list. Any
> client write to any extent will clear the invalid flag for those
> extents. And then we only ever need to block on reading.

Eric and I discussed this at length. The solution we came up with is to
have "agent" writes that are restoring the file be flagged as such and
only be allowed for parts of the file which are still marked "in HSM".
This allows normal writes to proceed without danger of being overwritten,
and for operations like "truncate" it would remove the need to restore
some/any of the file data, because truncate would also clear the "in HSM"
marker from the truncated parts of the file. NB: we haven't discussed
truncates/unlinks in the context of HSM, but these should _definitely_
not start a copy-in of the file data.

> What about reads to missing data? The OST refuses to grant read locks
> on invalid extents, and needs clients to wait forever.

This would also trigger HSM copy-in. If the HSM decides this data is
permanently inaccessible then the object (or the parts thereof) should be
marked as such and client reads should get -EIO.

>> 3) coordinator contacts agent(s) to retrieve FID N from HSM
>>    - agents write to actual object to be restored with "clear invalid"
>>      flag
>>    - writes by agent shrink invalid extent, periodically update on-disk
>>      invalid extent and release locks on that part of file (on commit?)
>
> The OST should keep track of all invalid extents. Invalid extents list
> changes should be stored on disk, transactionally with the data write.

Yes, it definitely needs to be stored on disk, and it should be kept with
the object itself. For completely purged objects, the MDS needs to mark
the whole file as "in HSM", and it would also truncate the objects to the
right size as soon as they are created (this already happens today when
the MDS file has no objects and is storing the size).

Remember this is all in the "complex" case, where we want concurrent file
access during HSM copy-in; in the simple case the client will just block
until the copy-in is finished. Similarly, if copy-in crashes in the
middle, it would have to start at the beginning, but that should be rare
enough to ignore until the full solution is implemented.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.