hi all,

Our Lustre filesystem was corrupted when an OSS was accidentally powered off
via IPMI. I get the following messages after that OSS restarts:

...
(fs/jbd/recovery.c, 256): journal_recover: JBD: recovery, exit status 0, recovered transactions 1875142 to 1885541
(fs/jbd/recovery.c, 258): journal_recover: JBD: Replayed 48361 and revoked 0/0 blocks
kjournald starting.  Commit interval 5 seconds
LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
LDISKFS FS on sda, internal journal
LDISKFS-fs: recovery complete.
LDISKFS-fs: mounted filesystem with ordered data mode.
LDISKFS-fs error (device sda): ldiskfs_check_descriptors: Block bitmap for group 43776 not in group (block 222298112)!
Remounting filesystem read-only
LDISKFS-fs: group descriptors corrupted!
LustreError: 3901:0:(obd_mount.c:1320:server_kernel_mount()) ll_kern_mount failed: rc = -22
LustreError: 3901:0:(obd_mount.c:1590:server_fill_super()) Unable to mount device /dev/sda: -22
LustreError: 3901:0:(obd_mount.c:1993:lustre_fill_super()) Unable to mount (-22)

Our OSS server is running lustre-1.8.0 and is equipped with an Areca RAID
adapter with write cache enabled. I am worried that data on Lustre may be
lost. What is the cause of such a problem, and how can I fix it?

Thanks in advance!
On Sep 03, 2009 07:16 +0800, ????? wrote:
> Our Lustre filesystem was corrupted when an OSS was accidentally powered
> off via IPMI. I get the following messages after that OSS restarts.
> ...
> LDISKFS-fs error (device sda): ldiskfs_check_descriptors: Block bitmap for
> group 43776 not in group (block 222298112)!
> Remounting filesystem read-only
> LDISKFS-fs: group descriptors corrupted!
> LustreError: 3901:0:(obd_mount.c:1320:server_kernel_mount()) ll_kern_mount
> failed: rc = -22
> LustreError: 3901:0:(obd_mount.c:1590:server_fill_super()) Unable to mount
> device /dev/sda: -22
> LustreError: 3901:0:(obd_mount.c:1993:lustre_fill_super()) Unable to mount
> (-22)
>
> Our OSS server is running lustre-1.8.0 and is equipped with an Areca RAID
> adapter with write cache enabled. I am worried that data on Lustre may be
> lost. What is the cause of such a problem, and how can I fix it?

Running with write cache enabled is dangerous and can cause corruption like
this.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Thursday 03 September 2009, ??? wrote:
> hi all,
>
> Our Lustre filesystem was corrupted when an OSS was accidentally powered
> off via IPMI. I get the following messages after that OSS restarts.
...
> Our OSS server is running lustre-1.8.0 and is equipped with an Areca RAID
> adapter with write cache enabled.

Drive write-back cache is dangerous; controller write-back cache is
dangerous if you don't have a battery backup unit on the card. Which of
those two groups are you in?

Either way, the next step is probably fsck while keeping your fingers
crossed.

/Peter

> I am worried that data on Lustre may be lost.
> What is the cause of such a problem, and how can I fix it?
>
> Thanks in advance!
It's really dangerous! e2fsck brought it back.

2009/9/3 Peter Kjellstrom <cap at nsc.liu.se>:
> On Thursday 03 September 2009, ??? wrote:
> > hi all,
> >
> > Our Lustre filesystem was corrupted when an OSS was accidentally
> > powered off via IPMI. I get the following messages after that OSS
> > restarts.
> ...
> > Our OSS server is running lustre-1.8.0 and is equipped with an Areca
> > RAID adapter with write cache enabled.
>
> Drive write-back cache is dangerous; controller write-back cache is
> dangerous if you don't have a battery backup unit on the card. Which of
> those two groups are you in?
>
> Either way, the next step is probably fsck while keeping your fingers
> crossed.
>
> /Peter
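For anyone landing here with the same "group descriptors corrupted" symptom,
the repair amounts to running e2fsck against the unmounted OST device. This
is only a sketch: the device name /dev/sda comes from the log above, while
the particular flags and the /mnt/ost0 mount point are assumptions of mine,
not something the poster confirmed; check the Lustre manual for your version
before running anything destructive.

```shell
# Make sure the OST is not mounted before checking it.
umount /dev/sda 2>/dev/null

# First pass: read-only check to see how bad the damage is.
# -f forces a full check, -n answers "no" to every repair prompt.
e2fsck -fn /dev/sda

# If the damage looks repairable, run a fixing pass.
# -p performs safe automatic ("preen") fixes; on heavily damaged
# filesystems it bails out, so fall back to -y (answer "yes" to all).
e2fsck -fp /dev/sda || e2fsck -fy /dev/sda

# Then try to bring the OST back online.
mount -t lustre /dev/sda /mnt/ost0
```

The read-only pass first is the important habit: it lets you gauge the scale
of the damage (and decide whether to image the device) before e2fsck starts
rewriting metadata.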
So, I have to ask: why disable write-back cache on the controller?

2009/9/4 ??? <eqzhou at gmail.com>:
> It's really dangerous! e2fsck brought it back.
>
> 2009/9/3 Peter Kjellstrom <cap at nsc.liu.se>:
> > Drive write-back cache is dangerous; controller write-back cache is
> > dangerous if you don't have a battery backup unit on the card.
> >
> > Either way, the next step is probably fsck while keeping your fingers
> > crossed.
> >
> > /Peter
This subject has been discussed many times...

Not just the controller, but the drives as well.

The problem is with write-back caches that _lie_ about the data being in
persistent store. The drive itself, with write-back cache enabled, lies and
says the data is on disk. RAID controllers likewise use write-back cache to
lie about the data being on disk.

So why do they lie? Because it makes the operating system run faster, as it
doesn't have to wait as long for the data to be "on disk".

What is the problem? The reason the OS is waiting for the data to be "on
disk" is to ensure consistency of the filesystem. If the controller/drive
says the data is in persistent store, but it is not actually there, and the
system loses power, crashes, or experiences some other problem, then when
the filesystem comes up things aren't in a consistent state.

With ext3, the journal is used to ensure the filesystem is recoverable --
assuming the controller does not lie -- even if the outstanding writes do
not complete. So while there may be loss of data, the filesystem is not
mangled by a hard crash. [Journaling is only one of many approaches taken
over the years to improve performance; see also Kirk McKusick's work on
soft updates for the BSD FFS filesystem --
http://www.ece.cmu.edu/~ganger/papers/mckusick99.pdf]

Note that write-back caches do not always lie about being in stable
storage -- _some_ HW RAID controllers do have special features to turn the
controller cache into non-volatile storage, with mirrored write cache and
battery backup. Battery backup makes it less likely the controller is
lying, at least until the system loses power for several days and the
battery dies.

Kevin

Mag Gam wrote:
> So, I have to ask: why disable write-back cache on the controller?
>
> 2009/9/4 ??? <eqzhou at gmail.com>:
> > It's really dangerous! e2fsck brought it back.
Kevin Van Maren wrote:
> This subject has been discussed many times...
>
> Not just the controller, but the drives as well.
>
> The problem is with write-back caches that _lie_ about the data being in
> persistent store. The drive itself, with write-back cache enabled, lies
> and says the data is on disk. RAID controllers likewise use write-back
> cache to lie about the data being on disk.

<snip>

I'm not convinced there's any lie involved. SCSI permits data to be written
back only as far as a cache and have a GOOD status returned at that point.
If for any reason a guarantee is required that the data really is on media,
then my understanding is that that's what the SYNCHRONIZE CACHE command
and/or the FUA (Force Unit Access) control bit is for.

What's not so clear to me is under what circumstances either technique is
triggered: whether an fsync, for example, is sufficient to propagate the
request down to the low-level device driver. It sounds like it would be
device-driver-specific.

--
===========================================================================
 ,-_|\  Richard Smith - Staff Engineer PAE
/     \ Sun Microsystems            Phone  : +61 3 9869 6200
richard.smith at Sun.COM            Direct : +61 3 9869 6224
\_,-._/ 476 St Kilda Road           Fax    : +61 3 9869 6290
   v    Melbourne Vic 3004 Australia
===========================================================================
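On the fsync question: from userspace, fsync(2) (or conv=fsync in dd) is the
portable way to ask for durability, and this tiny sketch shows what it does
and does not promise. The file name is just an example; the key point, as
this thread argues, is that a successful fsync only means the kernel pushed
the data to the device -- whether it is on the platter still depends on the
drive/controller cache settings.

```shell
# Write 4 KiB and ask dd to fsync(2) the output file before exiting.
# Without conv=fsync, dd may return while the data is still only in
# the OS page cache, not even handed to the device yet.
dd if=/dev/zero of=/tmp/wb_demo.dat bs=4096 count=1 conv=fsync 2>/dev/null

# At this point the kernel has issued the write to the device, but if
# the drive or RAID controller acknowledges writes out of a volatile
# write-back cache, the data may still not be on stable media.
ls -l /tmp/wb_demo.dat
```

In other words, fsync gets you as far as the device's word is good; closing
the last gap is exactly the SYNCHRONIZE CACHE / FUA / barrier discussion
above.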
On Monday 07 September 2009, Richard Smith wrote:
> I'm not convinced there's any lie involved. SCSI permits data to be
> written back only as far as a cache and have a GOOD status returned at
> that point. If for any reason a guarantee is required that the data
> really is on media, then my understanding is that that's what the
> SYNCHRONIZE CACHE command and/or the FUA (Force Unit Access) control
> bit is for.

I also feel that "lie" is a bit incorrect as a description. However, while
FUA does exist, it's typically ignored by controllers. If you're lucky you
have an "ignore FUA: enable/disable" setting.

/Peter

> What's not so clear to me is under what circumstances either technique
> is triggered: whether an fsync, for example, is sufficient to propagate
> the request down to the low-level device driver. It sounds like it
> would be device-driver-specific.
Peter Kjellstrom wrote:
> I also feel that "lie" is a bit incorrect as a description. However,
> while FUA does exist, it's typically ignored by controllers. If you're
> lucky you have an "ignore FUA: enable/disable" setting.

This is a concern: that FUA might be ignored silently. I wasn't aware that
implementing FUA was optional. If it is optional, then I would have
expected a mode page to describe whether a given device implements it.
Sounds like it would be safer to use SYNCHRONIZE CACHE, unless devices lie
about that too.

I realise that synchronisation isn't by itself sufficient to avoid
corruption: there is still a need for techniques, such as a journal, to
provide the atomic-update semantics required when a change involves
multiple non-contiguous blocks.

--
Richard Smith - Staff Engineer PAE, Sun Microsystems
On Mon, 2009-09-07 at 21:54 +1000, Richard Smith wrote:
> I'm not convinced there's any lie involved.

Well, whatever you want to call it... when the hardware tells the software
(Lustre) that something is on a platter, then in order for Lustre to work
properly it MUST be physically on the platter, or be able to make it there
in the face of other environmental issues, such as power outages, etc.
(i.e. this does allow for the battery-backed-cache case, so long as the
battery cannot drain before the disk unit is powered back up to receive
the writes held in the battery-backed cache).

b.
Brian J. Murrell wrote:
> Well, whatever you want to call it... when the hardware tells the
> software (Lustre) that something is on a platter, in order for Lustre to
> work properly, it MUST be physically on the platter, or be able to make
> it there in the face of other environmental issues, such as power ...

I don't think it's in dispute that there is a need at various times to
ensure that data has been written to non-volatile storage, at least not by
me. Where I was coming from is that high-performance software should be
encouraged to take full advantage of the capabilities of the underlying
hardware, provided it can do so safely. [And under some circumstances
people may even be prepared to sacrifice safety for a performance benefit,
but that's a separate issue.]

At least in the case of SCSI, the hardware doesn't tell the software
(Lustre) that something is on a platter. The hardware receives requests
and tells the software that it has obeyed them, or has failed in the
attempt. A WRITE carries with it no guarantee that the data is on
non-volatile media, hence my comment about also using SYNCHRONIZE CACHE or
the FUA bit if that is really what is wanted.

Neither SYNCHRONIZE CACHE nor the FUA bit is exposed at the application
level, but I think there's a reasonable expectation that the underlying
software will do whatever is necessary to maintain the integrity of a
filesystem. The way I interpret this is that the combination of filesystem
and device driver(s) should establish what the device is capable of, and
then use those capabilities to maintain integrity while maximizing
performance. Does it implement FUA? I'm caught out here: I was unaware
that there were devices that silently ignored FUA, and didn't know whether
SCSI permitted that. If FUA can't be relied upon, then I'd expect [system]
software to use SYNCHRONIZE CACHE instead. Admittedly I am out of my depth
here.

The block layer for a device is supposed to implement the concept of a
barrier request, and should take steps to force a drive to write data to
the media. Maybe some drivers do, and others don't. I expect the
implementation to require SYNCHRONIZE CACHE or FUA. At a higher level,
then, all that should be required is the appropriate generation of barrier
requests, assuming the underlying layer implements them.

The final piece of the puzzle, unless there's something I've overlooked, is
for appropriate warnings to be generated in the case where the software
stack cannot verify that it can implement barriers. This could mean [and
I'm only guessing here] a situation where a device has a write cache
enabled but provides no means of informing higher layers of how to ensure
data is written to physical media, in order. Adopting the same "fail-safe"
principle that railways/railroads in theory use, this should probably be
inverted: unless you see a message stating positively that the conditions
for using a write cache have been met, it would be prudent not to use one.
At the same time, I think there are circumstances in which the use of
write caches should be acceptable.

There's a bunch of other things that can go wrong with I/Os that none of
the above addresses, so nothing is completely risk-free.

--
Richard Smith - Staff Engineer PAE, Sun Microsystems
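As a practical footnote to this thread, the knobs being discussed can all
be inspected and set from userspace. A sketch follows; the device name is
an example, hdparm applies to ATA drives while sdparm (from sg3_utils)
speaks the SCSI caching mode page, and whether your particular controller
honors these requests rather than silently ignoring them is exactly the
open question debated above.

```shell
# ATA drives: query and disable the on-drive write-back cache.
hdparm -W /dev/sda          # show the current write-cache setting
hdparm -W0 /dev/sda         # turn the drive's write-back cache off

# SCSI drives: the same via the WCE bit in the caching mode page.
sdparm --get=WCE /dev/sda   # WCE: 1 means write-back caching is on
sdparm --clear=WCE /dev/sda # switch the drive to write-through

# Flush any cached data to media right now (SYNCHRONIZE CACHE).
sg_sync /dev/sda

# ext3 (and hence ldiskfs) can issue barrier requests at journal
# commits so cached writes reach media in the right order:
mount -o barrier=1 /dev/sda /mnt/ost0
```

Disabling the cache at the drive while leaving the controller's
battery-backed cache enabled is the usual compromise: the controller's
cache is (in principle) non-volatile, while the drive's is not.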