behlendorf1@llnl.gov
2007-May-16 13:54 UTC
[Lustre-devel] [Bug 11710] use PG_writeback in lustre
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11710

(In reply to comment #60)
> Created an attachment (id=10547)
> --> (https://bugzilla.lustre.org/attachment.cgi?id=10547&action=view)
> handling recoverable error. v2

Thanks Alexey, I had a chance to review the latest proposed patch and DLD. In general I like the refined approach; in particular, breaking the recoverable errors into two levels at the rpc and osc layers is nice. The optimization of tracking the read-only state in the import is also nice, but the side effect of having a client evicted is a higher price than I think we're willing to pay. After auditing the patch I still have other concerns.

1) Minor: The proposed patch is against 1.6, not 1.4, so I have not yet tried the patch; I've only read through it. It will need to be backported to 1.4 because this is a serious flaw for 1.4-based releases as well.

2) I'm not thrilled with how fatal errors are now handled by simply flagging the page as in error in the client page cache. This approach is wasteful of the client cache and, to my paranoid mind, a little bit dangerous for client cache coherency. There's no reason to leave this page around; it really needs to be discarded as soon as possible. Perhaps the best solution, which now occurs to me, is to keep a list of these pages as was done in the LLNL patch. But instead of adding another ll_ap_truncate() method (which was clunky) we could integrate this cache reclamation into llap_shrink_cache(). This function would have access to the fatal page list, and all pages in error could be disposed of first before touching any other pages. We could then simply make sure llap_shrink_cache() is called from brw_interpret_oap() after osc_ap_completion() has run on all the pages. We also need to add a multinode test to ensure clients mounted on different physical nodes see exactly the same behavior in the fatal and recoverable cases. Consistency is very important to us.

3) With this patch ll_ap_completion cannot fail, so we should take a look at the other users, such as obdecho/liblustre, and see if it makes sense to make this a void return. If they can't fail either, then the conditional check in osc_ap_completion() is pointless.

I'm sure we'll give this a spin at LLNL soon. But just as a heads up, there is a good chance we'll stick with our local patch for our 1.4.8-based release. We have many weeks of successful testing at scale with this patch, and our deadline for getting a new production release ready is almost here.
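To make the reclamation idea in point 2 concrete, here is a minimal user-space sketch of a shrink routine that disposes of pages flagged with a fatal error before evicting anything else. It is not Lustre code: every `model_*` name, the fixed-size array, and the slot-order eviction policy are assumptions made for illustration, and the real llap_shrink_cache() works on kernel page structures with its own locking and LRU logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CACHE_MAX 16

/* Hypothetical stand-in for a cached page; not a Lustre structure. */
struct model_page {
    int  id;
    bool in_use;
    bool fatal_error;   /* set when a write failed unrecoverably */
};

/* Hypothetical cache: a fixed array of slots plus a live-page count. */
struct model_cache {
    struct model_page pages[MODEL_CACHE_MAX];
    size_t count;
};

/* Place a page in the first free slot; returns false if the cache is full. */
bool model_cache_add(struct model_cache *c, int id)
{
    assert(c != NULL);
    for (size_t i = 0; i < MODEL_CACHE_MAX; i++) {
        if (!c->pages[i].in_use) {
            c->pages[i].id = id;
            c->pages[i].in_use = true;
            c->pages[i].fatal_error = false;
            c->count++;
            return true;
        }
    }
    return false;
}

/* Report whether a page with this id is still cached. */
bool model_cache_contains(const struct model_cache *c, int id)
{
    assert(c != NULL);
    for (size_t i = 0; i < MODEL_CACHE_MAX; i++)
        if (c->pages[i].in_use && c->pages[i].id == id)
            return true;
    return false;
}

/* Flag a cached page as failed; returns false if the page is not cached. */
bool model_cache_mark_fatal(struct model_cache *c, int id)
{
    assert(c != NULL);
    for (size_t i = 0; i < MODEL_CACHE_MAX; i++) {
        if (c->pages[i].in_use && c->pages[i].id == id) {
            c->pages[i].fatal_error = true;
            return true;
        }
    }
    return false;
}

/*
 * Shrink the cache to at most `target` pages.
 * Pass 1 discards every page in fatal error, regardless of `target`.
 * Pass 2 evicts healthy pages in slot order until the target is met.
 * Returns the number of pages discarded.
 */
size_t model_cache_shrink(struct model_cache *c, size_t target)
{
    size_t dropped = 0;

    assert(c != NULL);
    for (size_t i = 0; i < MODEL_CACHE_MAX; i++) {
        if (c->pages[i].in_use && c->pages[i].fatal_error) {
            c->pages[i].in_use = false;
            c->count--;
            dropped++;
        }
    }
    for (size_t i = 0; i < MODEL_CACHE_MAX && c->count > target; i++) {
        if (c->pages[i].in_use) {
            c->pages[i].in_use = false;
            c->count--;
            dropped++;
        }
    }
    return dropped;
}
```

In this model, calling `model_cache_shrink()` with a target at or above the current page count still removes the failed pages. That mirrors the intent of calling the shrinker from the completion path purely to flush pages in error, without evicting healthy ones.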
shadow@clusterfs.com
2007-May-28 00:20 UTC
[Lustre-devel] [Bug 11710] use PG_writeback in lustre
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11710

(In reply to comment #85)
> Created an attachment (id=10733)
> --> (https://bugzilla.lustre.org/attachment.cgi?id=10733&action=view)
> 10547: handling recoverable error. v2
>
> update handling recoverable error.

The new patch simplifies the code by merging the handling of recoverable errors with the handling of checksum errors. It fixes the handling of recoverable errors for the direct I/O case and for partial-page writes (not present in the LLNL patch). It also fixes a bug in releasing a page from the cache when checksum control is in use and the resend limit is exceeded.