Andreas Dilger
2009-Jan-13 00:26 UTC
[Lustre-devel] RAID-1 SNS / migration discussion summary
Further discussion on migration using RAID-1 uncovered a number of issues that need careful attention. I don't think I've captured all of them here, so I'd welcome some review of this document.

While full RAID-1 coherency with ongoing IO to the file is much nicer technically, it will be significantly more complex to implement (more time, more bugs), and we could have basic functionality earlier with the "simple space balance migration". That is similar to the proposal to have a "basic HSM" (blocks IO during copyin) and a "complex HSM" (allows file IO during copyin as soon as the data is available). As with "basic HSM", "simple space balance migration" would block clients accessing a file while it is being migrated, with the option of killing the migration if it is estimated to take too long. The clients would also be blocked on the MDS layout lock during migration, as with HSM.

Below is the description of migration using RAID-1. Most of the mechanism is in the RAID-1 functionality; very little of it relates to migration itself. It would also be desirable if the implementation of RAID-1 were agnostic to the number of data copies, because migrating a RAID-1 object might need 3 copies of the data at one time, and some environments may want to have multiple copies of the data (e.g. remote caches, many replicas of binaries).

-------------------------------------

A client initiates migration by asking the MDT to change the LOV EA layout for the file to instantiate a second mirror copy. When a file migration is requested, the MDS handles this by revoking the file layout from all clients and adding a new RAID-1 mirror to the layout. We didn't discuss specifics on how this mirror file should be created, but in light of the later discussion about HSM copy-in I'll suggest that the file be created by the client using normal file striping parameters, and that the client then request that the MDT "attach" the new file as an additional mirror copy.

The new objects of the RAID-1 mirror would be marked "stale" in some manner (in the MDS layout, or on the objects themselves as is proposed for HSM). Eric and I also discussed a "stale map" for each object that is persistent on disk, so that an object can be partially updated and reads can be satisfied from the valid parts of the disk even in the case of multiple OST failures. A simplifying assumption was to keep the stripe size the same on both copies of the file, so that a chunk on one OST maps directly to a chunk on another OST, instead of possibly being split in the middle.

All reads from the file will be handled by the valid mirror(s) only. It should be possible to do reads from either copy of the file by getting a single lock on that object+extent only. Writes need to hold a write lock over the same extents on all copies while writing. This will allow the file to continue being used while it is being resynced.

The writes in the filesystem are done via COW (as in ldiskfs hardening), and the llog records are atomically committed with the object's metadata describing the newly-allocated extent update, to ensure that if the OST crashes the old file data is not overwritten. This implies that non-COW backing filesystems cannot participate in RAID-1. Writes to each stripe will cause the local OST to generate an llog record that describes what part of the object was modified, and the llog cookie will be sent back to the client in the reply. These llog records will in essence be "stale data map" updates for the _remote_ objects.
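As a rough illustration of the per-object "stale map" and the equal-stripe-size assumption above, here is a minimal sketch in C. Everything in it (the structure layout, the CHUNKS_MAX limit, the helper names) is made up for illustration and is not an existing Lustre on-disk format; the point is just that with identical stripe sizes a written byte range maps to exactly the same chunk indices on the mirror object.

    /*
     * Illustrative sketch only -- hypothetical names, not a real Lustre
     * on-disk format.  One bit per stripe-size chunk of the object; a set
     * bit means the corresponding chunk of the _remote_ mirror object is
     * stale and must not be used to satisfy reads.
     */
    #include <stdint.h>

    #define CHUNKS_MAX 4096                 /* arbitrary example limit */

    struct stale_map {
            uint64_t sm_chunk_size;         /* == stripe size on both copies */
            uint8_t  sm_bits[CHUNKS_MAX / 8];
    };

    static inline void sm_set(struct stale_map *sm, uint64_t chunk)
    {
            sm->sm_bits[chunk / 8] |= 1 << (chunk % 8);
    }

    static inline void sm_clear(struct stale_map *sm, uint64_t chunk)
    {
            sm->sm_bits[chunk / 8] &= ~(1 << (chunk % 8));
    }

    static inline int sm_is_stale(const struct stale_map *sm, uint64_t chunk)
    {
            return sm->sm_bits[chunk / 8] & (1 << (chunk % 8));
    }

    /*
     * Because the stripe size is the same on both copies, the byte range
     * [start, end) written on one object covers exactly the same chunk
     * indices on the mirror object, with no chunk split down the middle.
     */
    static void sm_mark_write(struct stale_map *sm, uint64_t start, uint64_t end)
    {
            uint64_t first, last, c;

            if (end <= start)
                    return;

            first = start / sm->sm_chunk_size;
            last  = (end - 1) / sm->sm_chunk_size;

            for (c = first; c <= last && c < CHUNKS_MAX; c++)
                    sm_set(sm, c);
    }

A read on a mirror object would consult sm_is_stale() before serving data locally, which is what lets reads be satisfied from the valid parts of a partially-updated object.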
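To tie the pieces together, here is a hedged sketch of the client write path, combining the write-locking and llog-cookie mechanics described above with the cookie hand-off, tag, and cancellation points discussed in the questions below. Every type and function in it is a hypothetical placeholder, not an existing Lustre interface.

    /*
     * Client-side write to a RAID-1 mirrored file -- an illustrative sketch.
     * Every type and function here (mirror_lock_extent, ost_write, ...) is a
     * hypothetical placeholder, not an existing Lustre API.
     */
    #include <stddef.h>
    #include <stdint.h>

    struct ost_object;                      /* one object (stripe) on an OST */
    struct llog_cookie;                     /* identifies one llog "stale" record */

    struct file_mirrors {
            struct ost_object *fm_primary;  /* up-to-date copy */
            struct ost_object *fm_mirror;   /* second mirror copy */
    };

    /* Hypothetical helpers (declarations only). */
    int  mirror_lock_extent(struct file_mirrors *fm, uint64_t start, uint64_t end);
    void mirror_unlock_extent(struct file_mirrors *fm, uint64_t start, uint64_t end);
    int  ost_write(struct ost_object *o, uint64_t tag, uint64_t start,
                   uint64_t end, const void *buf, struct llog_cookie **cookie);
    int  ost_write_with_cookie(struct ost_object *o, uint64_t tag, uint64_t start,
                               uint64_t end, const void *buf,
                               struct llog_cookie *cookie);
    int  ost_cancel_cookie(struct ost_object *o, struct llog_cookie *cookie);

    int mirrored_write(struct file_mirrors *fm, uint64_t tag,
                       uint64_t start, uint64_t end, const void *buf)
    {
            struct llog_cookie *cookie = NULL;
            int rc;

            /* Write-lock the same extent on all copies before touching data. */
            rc = mirror_lock_extent(fm, start, end);
            if (rc)
                    return rc;

            /*
             * Write the primary copy.  The OST logs an llog record describing
             * the modified extent (a "stale data map" update for the remote
             * object) and returns the llog cookie in its reply; the
             * client-generated tag ties the record to this particular write.
             */
            rc = ost_write(fm->fm_primary, tag, start, end, buf, &cookie);
            if (rc)
                    goto out_unlock;

            /*
             * Write the mirror copy, carrying the cookie along, then cancel
             * the cookie to indicate both sides are up to date.  On a
             * write-caching OST the cancel would have to wait until the data
             * is persistent on disk (one transaction later), not happen here.
             */
            rc = ost_write_with_cookie(fm->fm_mirror, tag, start, end, buf, cookie);
            if (rc == 0)
                    rc = ost_cancel_cookie(fm->fm_primary, cookie);

    out_unlock:
            mirror_unlock_extent(fm, start, end);
            return rc;
    }

This is only one possible sequencing; the open questions below are about exactly where the cookie is cancelled and how the tags stay unique.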
[Q] We discussed having "tags" that are sent with each write, so that the secondary copy knows which llog cookies are cancelled with each write. We would need a way for tags to be (relatively) unique and generated by the clients, because false collisions could result in missed data updates on the backup objects.

[Q] For lockless IO the "tags" on the writes are critical, because there is no coherent locking of the object between OSTs, unless the OST itself is doing the mirroring while locking the remote object? How would we detect racing overlapping IO to different copies of the file?

We said during our discussion that when the write is complete on the mirror, the client will cancel the llog cookie to indicate that both sides of the write are up-to-date.

[Q] What happens on a write-caching OST? The initial writes will generate an llog cookie on one side, and the cookie will be cancelled by the client. Instead it seems that the client needs to pass the cookie on to the mirror OST, and the cookies are only cancelled when the data is persistent on disk (one transaction later).

The resync of the new copy proceeds by the client/agent reading the file data (from the non-stale copy only) and writing it to the mirror copy. If a client gets a timeout when writing to one stripe after having written to the partner stripe, then it is up to the OSTs to do recovery of the stale parts of the file. <begin hand waving> The object on the updated OST needs to be able to detect, independently of the client (presumably by a timeout), that the other copy was not updated, and then cause the other OST to replay its llog records. A similar mechanism will be needed in case an up-to-date mirror's OST goes offline and writes are not being sent there.

We have currently mandated a restriction that the stripe size of both copies be the same, in order to facilitate logging of updates. If the stripe size is the same, then a write to one chunk of an object will map to a whole chunk on the mirror copy. Nikita has suggested that the OSTs keep a copy of the LOV EA locally, so that each OST can generate appropriate llog update records.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.