A quick note on my setup: 40 OSSs, each with an md0 RAID 1 as the OST (2x 500 GB, AMD X2 5400, 8 GB DDR2), and a single MDT/MGS/MDS (dual quad-core Xeon, 32 GB DDR2, 4 TB RAID 6). I have everything mounted and working properly and am still going through the basic performance tests, but I am confused about the safety of the cluster.

I'm not concerned about the OSTs, because they are RAID 1 which I can recover quickly and monitor, but my question is: what if an OSS goes down? Will that cause corruption of the data, and what happens when it comes back up? I would also like to know whether you can dynamically add new OSS/OSTs to the cluster, or whether you have to unmount the clients and then remount after doing so.

Thanks for your help in advance.
On Mon, 2009-04-06 at 15:27 -0400, Christopher Deneen wrote:
> A quick note on my setup: 40 OSSs, each with an md0 RAID 1 as the OST
> (2x 500 GB, AMD X2 5400, 8 GB DDR2), and a single MDT/MGS/MDS (dual
> quad-core Xeon, 32 GB DDR2, 4 TB RAID 6). I have everything mounted and
> working properly and am still going through the basic performance tests,
> but I am confused about the safety of the cluster. I'm not concerned
> about the OSTs, because they are RAID 1 which I can recover quickly and
> monitor, but my question is: what if an OSS goes down? Will that cause
> corruption of the data?

Unless you also lose clients, no. In the event of an OSS going down, the client will not have gotten the reply back from the OST saying that its data was actually written to disk. Until the client gets such a reply, it holds on to that data so that if an OSS does crash, it can "replay" that transaction. Thus, all data is either physically on disk, or in client memory ready to be replayed to disk.

The one exception to this is if you have some cache between the OST and the disk that the OSS doesn't know about. The OSS might think data has been written to disk when in fact it has only reached the disk's cache. Should that disk go down, that is lost data and possible corruption. This is why we typically recommend disabling write caching on disk arrays unless they can survive a power event and recover, so that the disk is fully coherent with what the host thinks should be there.

> I would also like to know whether you can dynamically add new OSS/OSTs
> to the cluster, or whether you have to unmount the clients and then
> remount after doing so.

No, you don't have to remount the clients. You just add them as you need (a rough sketch of the commands is below).

b.
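For reference, a minimal sketch of adding a new OST to a live 1.6.x filesystem; the filesystem name, MGS NID, mount points, and disk names below are examples, not details from this thread:

    # Format the new RAID 1 device as an OST for the existing filesystem
    # (fsname and MGS NID are placeholders -- use your own).
    mkfs.lustre --ost --fsname=lustre --mgsnode=mgs@tcp0 /dev/md0

    # Mounting the OST registers it with the MGS; already-mounted clients
    # pick up the new target without being remounted.
    mount -t lustre /dev/md0 /mnt/lustre/ost0

    # Optionally disable the volatile write cache on the member disks,
    # since the OSS cannot account for data sitting only in a drive cache.
    hdparm -W 0 /dev/sda /dev/sdb

Clients should show the new OST in the output of "lfs df" once it is up.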
Brian,

> Unless you also lose clients, no. In the event of an OSS going down, the
> client will not have gotten the reply back from the OST saying that its
> data was actually written to disk. Until the client gets such a reply, it
> holds on to that data so that if an OSS does crash, it can "replay" that
> transaction. Thus, all data is either physically on disk, or in client
> memory ready to be replayed to disk.

What about if you have multiple clients, all with transactions open against the OSS? Now the OSS goes down and comes back. From what I understand, the server goes into recovery and rejects new connections until recovery is finished (correct?). What if all but one client reconnect, i.e. you lose one client: are the transactions of the successfully reconnected clients replayed, or are they discarded?

> > I would also like to know whether you can dynamically add new OSS/OSTs
> > to the cluster, or whether you have to unmount the clients and then
> > remount after doing so.
>
> No, you don't have to remount the clients. You just add them as you need.

Independent of the load? I think the 'official' statement was that the cluster has to be quiescent, i.e. no client activity. Is that (still) true?

TIA,
 Arne
On Apr 06, 2009 15:27 -0400, Christopher Deneen wrote:
> A quick note on my setup: 40 OSSs, each with an md0 RAID 1 as the OST
> (2x 500 GB, AMD X2 5400, 8 GB DDR2), and a single MDT/MGS/MDS (dual
> quad-core Xeon, 32 GB DDR2, 4 TB RAID 6).

There is no benefit at all to having such a large MDT filesystem, and in fact using RAID 6 will make metadata performance very bad. You should instead make this a RAID 1+0 with four (or however many) drives you have.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
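As a rough sketch of what that could look like with mdadm, under the assumption of four spare partitions and a fresh format (the device names and filesystem name here are illustrative, and rebuilding the MDT this way means backing up and reformatting the existing one):

    # Build a 4-disk RAID 1+0 array for the MDT (device names are examples).
    mdadm --create /dev/md1 --level=10 --raid-devices=4 \
          /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

    # Format it as a combined MGS/MDT for the filesystem.
    mkfs.lustre --fsname=lustre --mgs --mdt /dev/md1

    # Mount it to bring the metadata server up.
    mount -t lustre /dev/md1 /mnt/lustre/mdt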
On Tue, 2009-04-07 at 09:20 +0200, Arne Wiebalck wrote:
> Brian,

Arne,

> What about if you have multiple clients, all with transactions open
> against the OSS? Now the OSS goes down and comes back. From what I
> understand, the server goes into recovery and rejects new connections
> until recovery is finished (correct?).

Correct.

> What if all but one client reconnect, i.e. you lose one client: are the
> transactions of the successfully reconnected clients replayed, or are
> they discarded?

If the lost client has a transaction that needs to be replayed, all of the transactions up to that missing transaction are replayed, but all subsequent transactions are discarded, and when the recovery timer expires, recovery is aborted.

The semantics of this will change when VBR (version-based recovery) becomes available, in 1.8.something, where "something" might even be 0. In that case, only transactions actually dependent on the missing transactions will be discarded.

> Independent of the load? I think the 'official' statement was that the
> cluster has to be quiescent, i.e. no client activity. Is that (still)
> true?

Yes, that is the official statement, and I don't think any further testing has been done to change it officially, but I think the general feeling is that quiescence should not be necessary; we just don't have the scientific testing to be assured of that.

So if you want to be safe, quiesce the filesystem first. :-)

b.
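For what it's worth, a rough sketch of watching this from the OSS side on 1.6.x; the target name lustre-OST0000 is an example, and the /proc path assumes the usual 1.6 layout:

    # Show the recovery state of an OST while its OSS is coming back up;
    # this reports whether recovery is in progress or complete and how
    # many clients have reconnected so far.
    cat /proc/fs/lustre/obdfilter/lustre-OST0000/recovery_status

    # If a missing client is known to be gone for good, recovery can be
    # cut short instead of waiting for the timer to expire
    # (<devno> is the device number shown by "lctl dl").
    lctl --device <devno> abort_recovery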
Brian,

[snip]
> If the lost client has a transaction that needs to be replayed, all of
> the transactions up to that missing transaction are replayed, but all
> subsequent transactions are discarded, and when the recovery timer
> expires, recovery is aborted.

Thanks a lot for the details. So even clients working on disjoint files/objects will find that their data are lost when one client is lost during recovery? Until 1.8.something, that is :)

[snip]
> Yes, that is the official statement, and I don't think any further
> testing has been done to change it officially, but I think the general
> feeling is that quiescence should not be necessary; we just don't have
> the scientific testing to be assured of that.
>
> So if you want to be safe, quiesce the filesystem first. :-)

Well, if I can do that :-)

Thanks again!
 Arne
On Tue, Apr 07, 2009 at 08:14:55AM -0400, Brian J. Murrell wrote:

[snip]
> If the lost client has a transaction that needs to be replayed, all of
> the transactions up to that missing transaction are replayed, but all
> subsequent transactions are discarded, and when the recovery timer
> expires, recovery is aborted.
[snip]

Discarding all transactions causes a lot of collateral damage in a multi-cluster, mixed parallel job environment where "file-per-process" style I/O predominates. Could somebody remind me of the use cases protected by this behavior?

In the case of I/O to a shared file, aren't Lustre's error handling obligations met by evicting the single offending client? Perhaps I am thinking too provincially, because in our environment I/O to shared files generally (always?) takes place in the context of a parallel job, and the single client eviction and EIO (or reboot of the client) should be sufficient to terminate the whole job with an error.

Thanks,
Jim
On Tue, 2009-04-07 at 08:34 -0700, Jim Garlick wrote:
>
> Discarding all transactions

Only transactions subsequent to a missing transaction.

> causes a lot of collateral damage in a multi-cluster, mixed parallel job
> environment where "file-per-process" style I/O predominates.

Indeed, depending on where the AWOL client's transaction sits in the replay stream. If it was the last transaction, the loss is absolutely minimal, but if it was the first transaction, the loss is absolutely maximal.

> Could somebody remind me of the use cases protected by this behavior?

Simply transactional dependency.

If you don't know what the AWOL client did to a given file, you cannot reliably process any further updates to that file, and if you don't have the AWOL client to ask which files it has transactions for, everything subsequent to that client's transactions has to be suspect. While I don't have any examples off-hand, I am sure one of the devs who constantly have their fingers in replay can cite many actual scenarios where this is a problem.

> In the case of I/O to a shared file, aren't Lustre's error handling
> obligations met by evicting the single offending client?

No. All clients subsequently have to be evicted, per the above.

> Perhaps I am thinking too provincially, because in our environment I/O to
> shared files generally (always?) takes place in the context of a parallel
> job, and the single client eviction and EIO (or reboot of the client)
> should be sufficient to terminate the whole job with an error.

Yours is probably a scenario where VBR will do really well, then, given that VBR only serializes replay on truly dependent transactions rather than the single serial stream (of assumed dependent transactions) that replay currently operates with.

b.
On Tue, Apr 07, 2009 at 11:43:19AM -0400, Brian J. Murrell wrote:
> On Tue, 2009-04-07 at 08:34 -0700, Jim Garlick wrote:
> >
> > Discarding all transactions
>
> Only transactions subsequent to a missing transaction.
>
> > causes a lot of collateral damage in a multi-cluster, mixed parallel
> > job environment where "file-per-process" style I/O predominates.
>
> Indeed, depending on where the AWOL client's transaction sits in the
> replay stream. If it was the last transaction, the loss is absolutely
> minimal, but if it was the first transaction, the loss is absolutely
> maximal.
>
> > Could somebody remind me of the use cases protected by this behavior?
>
> Simply transactional dependency.
>
> If you don't know what the AWOL client did to a given file, you cannot
> reliably process any further updates to that file, and if you don't have
> the AWOL client to ask which files it has transactions for, everything
> subsequent to that client's transactions has to be suspect. While I
> don't have any examples off-hand, I am sure one of the devs who
> constantly have their fingers in replay can cite many actual scenarios
> where this is a problem.

For us, error handling is at best: abort the parallel job on EIO, throw away the output, and restart from the last checkpoint. I am virtually certain that nobody around here tries to recover from an EIO.

Also, if the AWOL node has actually rebooted, it will cause our resource manager to terminate the whole parallel job. This is good in the sense that it gives codes with poor I/O error handling a second chance of noticing the error before bad physics data has to be analyzed and explained. Not so with the collateral evictions.

So, it would be pretty easy for us to patch our 1.6.6-based Lustre to allow the transactions after the missed one to be committed and avoid the collateral evictions. We suspect this is a bad idea, but we are having a hard time imagining why. Any insight would be appreciated.

> > In the case of I/O to a shared file, aren't Lustre's error handling
> > obligations met by evicting the single offending client?
>
> No. All clients subsequently have to be evicted, per the above.
>
> > Perhaps I am thinking too provincially, because in our environment I/O
> > to shared files generally (always?) takes place in the context of a
> > parallel job, and the single client eviction and EIO (or reboot of the
> > client) should be sufficient to terminate the whole job with an error.
>
> Yours is probably a scenario where VBR will do really well, then, given
> that VBR only serializes replay on truly dependent transactions rather
> than the single serial stream (of assumed dependent transactions) that
> replay currently operates with.
>
> b.
On Tue, 2009-04-07 at 14:29 +0200, Arne Wiebalck wrote:
> Brian,

Arne,

> Thanks a lot for the details.

NP.

> So even clients working on disjoint files/objects will find that their
> data are lost when one client is lost during recovery?

Yes. Because, given the existing recovery design, there is no way to know that AWOL clients were working on disjoint files/objects.

> Until 1.8.something, that is :)

Right. That is what VBR is all about.

> Well, if I can do that :-)

Good. I have opened a bug to formalize the testing of that, to see if we can be more positive about our situation there. I don't recall the number, but it should be easy to find.

b.