thr3ads.net - Lustre devel - [Lustre-devel] Fwd: Disk rebuild [Dec 2009]

Nikita Danilov

2009-Dec-01 15:00 UTC

[Lustre-devel] Fwd: Disk rebuild

Hello,

On 03 Feb 2009 Jody McIntyre wrote:
 > It is possible to significantly reduce resync (but not recovery) times
 > using bitmaps, but these have been shown to hurt performance
 > significantly. Another approach, journal-guided resynchronization, was
 > studied in a 2005 paper but has never been merged into the kernel. The
 > paper shows improvements in resync times from 254 seconds to 0.21
 > seconds (for a 1 GB test array) with under 5% performance impact. This
 > is an option if we're willing to develop and maintain the patches to do
 > it.

what is the status of this? Is ext3 guided resync code (RHEL 5 version
was posted on lkml in October) used by Lustre?

 >
 > Cheers,
 > Jody

Thank you,
Nikita.

Jody McIntyre

2009-Dec-01 23:57 UTC

[Lustre-devel] Fwd: Disk rebuild

Hi Nikita,

On Tue, Dec 01, 2009 at 06:00:39PM +0300, Nikita Danilov wrote:
> what is the status of this? Is ext3 guided resync code (RHEL 5 version
> was posted on lkml in October) used by Lustre?
This is covered by bug 19932.

Cheers,
Jody
> 
>  >
>  > Cheers,
>  > Jody
> 
> Thank you,
> Nikita.

Nikita Danilov

2009-Dec-02 14:13 UTC

[Lustre-devel] Fwd: Disk rebuild

2009/12/2 Jody McIntyre <scjody at sun.com>:
> Hi Nikita,

Hello Jody,
>
> On Tue, Dec 01, 2009 at 06:00:39PM +0300, Nikita Danilov wrote:
>
>> what is the status of this? Is ext3 guided resync code (RHEL 5 version
>> was posted on lkml in October) used by Lustre?
>
> This is covered by bug 19932.
The last (43rd) comment there is rather intriguing. Can you elaborate
on why guided resync cannot work with the Lustre IO stack?
>
> Cheers,
> Jody
Thank you,
Nikita.

Andreas Dilger

2009-Dec-02 19:43 UTC

[Lustre-devel] Fwd: Disk rebuild

Hi Nikita!

On 2009-12-02, at 07:13, Nikita Danilov wrote:
>> On Tue, Dec 01, 2009 at 06:00:39PM +0300, Nikita Danilov wrote:
>>> what is the status of this? Is ext3 guided resync code (RHEL 5  
>>> version was posted on lkml in October) used by Lustre?
>>
>> This is covered by bug 19932.
>
> The last (43rd) comment there is rather intriguing. Can you elaborate
> on why guided resync cannot work with the Lustre IO stack?

The problem lies in the way that obdfilter submits IO.  Since it is
not using the normal buffer cache to track "data=ordered" (or in the
case of this patch "data=declared") mode, the bio_submit() will likely
start modifying the MD device before the corresponding declare blocks
are committed to the journal.

This breaks the whole validity of declared mode in case of a crash,  
since we can no longer be certain that the declare blocks contain all  
of the locations in the MD RAID that may need to have parity rebuilt.
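
The ordering requirement described above can be illustrated with a toy model. This is only a minimal sketch under stated assumptions: it is not Lustre, ext3, or MD code, and every name in it (ToyArray, declare, commit_journal, write, crash_and_resync) is hypothetical. It assumes that a guided resync repairs exactly the stripes named in declare blocks that reached the journal before the crash.

```python
# Toy model of declared-mode ordering. Not Lustre or kernel code:
# all class and method names are hypothetical, chosen for illustration.

class ToyArray:
    def __init__(self):
        self.committed_declares = set()  # stripes named in committed journal blocks
        self.pending_declares = set()    # declared, but journal commit not yet on disk
        self.written = set()             # stripes whose data was modified (parity may be stale)

    def declare(self, stripe):
        """Record the intent to modify a stripe (not yet durable)."""
        self.pending_declares.add(stripe)

    def commit_journal(self):
        """Make all pending declare blocks durable."""
        self.committed_declares |= self.pending_declares
        self.pending_declares.clear()

    def write(self, stripe):
        """Start modifying the array for this stripe."""
        self.written.add(stripe)

    def crash_and_resync(self):
        """Crash, then run a guided resync; return stripes left unrepaired."""
        # Pending declares are lost in the crash, so only committed
        # declare blocks can guide the parity rebuild.
        return self.written - self.committed_declares


def ordered_path(stripe):
    # Declare blocks are committed before the array is touched.
    a = ToyArray()
    a.declare(stripe)
    a.commit_journal()
    a.write(stripe)
    return a.crash_and_resync()


def unordered_path(stripe):
    # The write reaches the array before the declare block commits.
    a = ToyArray()
    a.declare(stripe)
    a.write(stripe)
    return a.crash_and_resync()


print(ordered_path(7))    # set()
print(unordered_path(7))  # {7}
```

In the unordered case the stripe was modified but never appears in a committed declare block, so a guided resync would skip it and its parity could stay stale.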

It would be possible to fix this by having the OST use the normal VFS
methods to order the IO to disk, but I'm sure you're well aware of the
performance impact of this.  It wouldn't be so bad with older versions
of Lustre, where we had to wait for the journal commit before
returning to the client anyway, but in 1.8.2 there is a (disabled by
default) async journal commit option that allows the client to get RPC
replies before the bulk IO is committed.

In order to accommodate declared mode, it would mean that we need to
implement full write-cached IO on the OST, which wouldn't be
impossible given that 1.8 already uses the page cache for reading, but
given the amount of change and risk this would introduce it wasn't
thought worthwhile to implement for the short lifespan it would have.
It wouldn't be practical to introduce such a major change any sooner
than the DMU OSD in the 2.1 release, at which point it is largely
obsolete.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Nikita Danilov

2009-Dec-02 20:58 UTC

[Lustre-devel] Fwd: Disk rebuild

2009/12/2 Andreas Dilger <adilger at sun.com>:
> Hi Nikita!

Hello Andreas!
>
> On 2009-12-02, at 07:13, Nikita Danilov wrote:
>>>
>>> On Tue, Dec 01, 2009 at 06:00:39PM +0300, Nikita Danilov wrote:
>>>>
>>>> what is the status of this? Is ext3 guided resync code (RHEL 5 version
>>>> was posted on lkml in October) used by Lustre?
>>>
>>> This is covered by bug 19932.
>>
>> The last (43rd) comment there is rather intriguing. Can you elaborate
>> on why guided resync cannot work with the Lustre IO stack?
>
>
> The problem lies in the way that obdfilter submits IO. Since it is not
> using the normal buffer cache to track "data=ordered" (or in the case of
> this patch "data=declared") mode the bio_submit() will likely start
> modifying the MD device before the corresponding declare blocks are
> committed to the journal.
Thank you for the detailed explanation, data-path completely escaped
my mind. Still, on the mdt side, osd goes through the normal VFS paths
and data=declared should work, right?

[...]
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
Thank you,
Nikita.

Andreas Dilger

2009-Dec-02 22:48 UTC

[Lustre-devel] Fwd: Disk rebuild

On 2009-12-02, at 13:58, Nikita Danilov wrote:
>> The problem lies in the way that obdfilter submits IO.  Since it is not
>> using the normal buffer cache to track "data=ordered" (or in the case of
>> this patch "data=declared") mode the bio_submit() will likely start
>> modifying the MD device before the corresponding declare blocks are
>> committed to the journal.
>
> Thank you for the detailed explanation, data-path completely escaped
> my mind. Still, on the mdt side, osd goes through the normal VFS paths
> and data=declared should work, right?
Yes, though in general the MDT is a lot smaller than the OSTs, fails
less often, has RAID-1 instead of RAID-6 so the rebuild goes
considerably faster, and has metadata journaling for everything (so
it doesn't get inconsistent in the first place).

There would likely be some improvement, but we haven't benchmarked it
- the main concern was for the OSTs.


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
