Hello,

On 03 Feb 2009 Jody McIntyre wrote:
> It is possible to significantly reduce resync (but not recovery) times
> using bitmaps, but these have been shown to hurt performance
> significantly. Another approach, journal-guided resynchronization, was
> studied in a 2005 paper but has never been merged into the kernel. The
> paper shows improvements in resync times from 254 seconds to 0.21
> seconds (for a 1 GB test array) with under 5% performance impact. This
> is an option if we're willing to develop and maintain the patches to do
> it.

What is the status of this? Is ext3 guided resync code (RHEL 5 version
was posted on lkml in October) used by Lustre?

> Cheers,
> Jody

Thank you,
Nikita.
Hi Nikita,

On Tue, Dec 01, 2009 at 06:00:39PM +0300, Nikita Danilov wrote:
> What is the status of this? Is ext3 guided resync code (RHEL 5 version
> was posted on lkml in October) used by Lustre?

This is covered by bug 19932.

Cheers,
Jody

> > Cheers,
> > Jody
>
> Thank you,
> Nikita.
2009/12/2 Jody McIntyre <scjody at sun.com>:
> Hi Nikita,

Hello Jody,

> On Tue, Dec 01, 2009 at 06:00:39PM +0300, Nikita Danilov wrote:
>
>> What is the status of this? Is ext3 guided resync code (RHEL 5 version
>> was posted on lkml in October) used by Lustre?
>
> This is covered by bug 19932.

The last (43rd) comment there is rather intriguing. Can you elaborate
on why guided resync cannot work with the Lustre IO stack?

> Cheers,
> Jody

Thank you,
Nikita.
Hi Nikita!

On 2009-12-02, at 07:13, Nikita Danilov wrote:
>> On Tue, Dec 01, 2009 at 06:00:39PM +0300, Nikita Danilov wrote:
>>> What is the status of this? Is ext3 guided resync code (RHEL 5
>>> version was posted on lkml in October) used by Lustre?
>>
>> This is covered by bug 19932.
>
> The last (43rd) comment there is rather intriguing. Can you elaborate
> on why guided resync cannot work with the Lustre IO stack?

The problem lies in the way that obdfilter submits IO. Since it is not
using the normal buffer cache to track "data=ordered" (or, in the case
of this patch, "data=declared") mode, submit_bio() will likely start
modifying the MD device before the corresponding declare blocks are
committed to the journal. This breaks the whole validity of declared
mode in case of a crash, since we can no longer be certain that the
declare blocks contain all of the locations in the MD RAID that may
need to have parity rebuilt.

It would be possible to fix this by having the OST use the normal VFS
methods to order the IO to disk, but I'm sure you're well aware of the
performance impact of this. It wouldn't be so bad with older versions
of Lustre, where we had to wait for the journal commit before returning
to the client anyway, but in 1.8.2 there is a (disabled by default)
async journal commit option that allows the client to get RPC replies
before the bulk IO is committed.

Accommodating declared mode would mean implementing full write-cached
IO on the OST. That wouldn't be impossible, given that 1.8 already uses
the page cache for reading, but given the amount of change and risk
this would introduce, it wasn't thought worthwhile to implement for the
short lifespan it would have. It wouldn't be practical to introduce
such a major change any sooner than the DMU OSD in the 2.1 release, at
which point it is largely obsolete.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
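[Editorial sketch, not from the thread] The ordering problem described
above can be illustrated with a toy model. This is only a sketch under
stated assumptions: the event names ("declare", "commit", "write"), the
stripe numbers, and the function name are hypothetical and do not
correspond to the actual patch or to any Lustre or kernel API. It also
assumes, pessimistically, that any write issued before the crash may
have left parity stale.

```python
def stripes_missed_by_guided_resync(events):
    """Replay events up to a crash and return the stripes a guided
    resync would fail to rebuild.

    events is a list of tuples (hypothetical names, for illustration):
      ("declare", stripe)  a declare block is added to the running transaction
      ("commit",)          the journal commits all pending declare blocks
      ("write", stripe)    data is submitted to the MD device for that stripe
    The crash is assumed to happen right after the last event.
    """
    pending = set()     # declared, but not yet durable in the journal
    committed = set()   # declared and durable: resync will check these
    touched = set()     # stripes the MD device may have partially written

    for event in events:
        kind = event[0]
        if kind == "declare":
            pending.add(event[1])
        elif kind == "commit":
            committed |= pending
            pending.clear()
        elif kind == "write":
            touched.add(event[1])
        else:
            raise ValueError(f"unknown event: {kind!r}")

    # Guided resync only looks at committed declare blocks, so any
    # touched stripe outside that set keeps possibly stale parity.
    return touched - committed


# Ordered path: the declare block is committed before the data write.
ordered = [("declare", 7), ("commit",), ("write", 7)]

# Direct submission: the write reaches the MD device before the commit.
direct = [("declare", 7), ("write", 7)]

print(stripes_missed_by_guided_resync(ordered))
print(stripes_missed_by_guided_resync(direct))
```

In the ordered sequence nothing is missed, while in the direct sequence
stripe 7 is modified without a durable declare block, which is the case
where the journal can no longer be trusted to list every location that
may need its parity rebuilt.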
2009/12/2 Andreas Dilger <adilger at sun.com>:
> Hi Nikita!

Hello Andreas!

> On 2009-12-02, at 07:13, Nikita Danilov wrote:
>>> On Tue, Dec 01, 2009 at 06:00:39PM +0300, Nikita Danilov wrote:
>>>> What is the status of this? Is ext3 guided resync code (RHEL 5 version
>>>> was posted on lkml in October) used by Lustre?
>>>
>>> This is covered by bug 19932.
>>
>> The last (43rd) comment there is rather intriguing. Can you elaborate
>> on why guided resync cannot work with the Lustre IO stack?
>
> The problem lies in the way that obdfilter submits IO. Since it is not
> using the normal buffer cache to track "data=ordered" (or, in the case
> of this patch, "data=declared") mode, submit_bio() will likely start
> modifying the MD device before the corresponding declare blocks are
> committed to the journal.

Thank you for the detailed explanation; the data path completely
escaped my mind. Still, on the MDT side, osd goes through the normal
VFS paths and data=declared should work, right?

[...]

> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

Thank you,
Nikita.
On 2009-12-02, at 13:58, Nikita Danilov wrote:
>> The problem lies in the way that obdfilter submits IO. Since it is not
>> using the normal buffer cache to track "data=ordered" (or, in the case
>> of this patch, "data=declared") mode, submit_bio() will likely start
>> modifying the MD device before the corresponding declare blocks are
>> committed to the journal.
>
> Thank you for the detailed explanation; the data path completely
> escaped my mind. Still, on the MDT side, osd goes through the normal
> VFS paths and data=declared should work, right?

Yes, though in general the MDT is a lot smaller than the OSTs, fails
less often, has RAID-1 instead of RAID-6 (so the rebuild goes
considerably faster), and has metadata journaling for everything (so it
doesn't get inconsistent in the first place). There would likely be
some improvement, but we haven't benchmarked it; the main concern was
for the OSTs.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.