hi all,

Our Lustre filesystem was corrupted when an OSS was accidentally powered off
via IPMI. I get the following messages after that OSS restarts:

...
(fs/jbd/recovery.c, 256): journal_recover: JBD: recovery, exit status 0, recovered transactions 1875142 to 1885541
(fs/jbd/recovery.c, 258): journal_recover: JBD: Replayed 48361 and revoked 0/0 blocks
kjournald starting.  Commit interval 5 seconds
LDISKFS-fs warning: maximal mount count reached, running e2fsck is recommended
LDISKFS FS on sda, internal journal
LDISKFS-fs: recovery complete.
LDISKFS-fs: mounted filesystem with ordered data mode.
LDISKFS-fs error (device sda): ldiskfs_check_descriptors: Block bitmap for group 43776 not in group (block 222298112)!
Remounting filesystem read-only
LDISKFS-fs: group descriptors corrupted!
LustreError: 3901:0:(obd_mount.c:1320:server_kernel_mount()) ll_kern_mount failed: rc = -22
LustreError: 3901:0:(obd_mount.c:1590:server_fill_super()) Unable to mount device /dev/sda: -22
LustreError: 3901:0:(obd_mount.c:1993:lustre_fill_super()) Unable to mount (-22)

Our OSS server is running lustre-1.8.0 and is equipped with an Areca RAID
adapter with write cache enabled. I am worried that data on Lustre may be
lost. What is the cause of such a problem, and how can I fix it?

Thanks in advance!
On Sep 03, 2009 07:16 +0800, ????? wrote:
> Our Lustre filesystem was corrupted when an OSS was accidentally powered
> off via IPMI. I get the following messages after that OSS restarts.
> ...
> LDISKFS-fs error (device sda): ldiskfs_check_descriptors: Block bitmap for
> group 43776 not in group (block 222298112)!
> Remounting filesystem read-only
> LDISKFS-fs: group descriptors corrupted!
> LustreError: 3901:0:(obd_mount.c:1320:server_kernel_mount()) ll_kern_mount
> failed: rc = -22
> LustreError: 3901:0:(obd_mount.c:1590:server_fill_super()) Unable to mount
> device /dev/sda: -22
> LustreError: 3901:0:(obd_mount.c:1993:lustre_fill_super()) Unable to mount
> (-22)
>
> Our OSS server is running lustre-1.8.0 and is equipped with an Areca RAID
> adapter with write cache enabled. I am worried that data on Lustre may be
> lost. What is the cause of such a problem, and how can I fix it?

Running with write cache enabled is dangerous and can cause corruption like
this.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Thursday 03 September 2009, ??? wrote:
> hi all,
>
> Our Lustre filesystem was corrupted when an OSS was accidentally powered
> off via IPMI. I get the following messages after that OSS restarts.
...
> Our OSS server is running lustre-1.8.0 and is equipped with an Areca RAID
> adapter with write cache enabled.

Drive write-back cache is dangerous; controller write-back cache is
dangerous if you don't have a battery backup unit on the card. Which of
those two groups are you in?

Either way, the next step is probably fsck while keeping your fingers
crossed.

/Peter

> I am worried that data on Lustre may be lost.
> What is the cause of such a problem, and how can I fix it?
>
> Thanks in advance!
It's really dangerous! e2fsck brought it back.

2009/9/3 Peter Kjellstrom <cap at nsc.liu.se>:
> On Thursday 03 September 2009, ??? wrote:
> > hi all,
> >
> > Our Lustre filesystem was corrupted when an OSS was accidentally
> > powered off via IPMI. I get the following messages after that OSS
> > restarts.
> ...
> > Our OSS server is running lustre-1.8.0 and is equipped with an Areca
> > RAID adapter with write cache enabled.
>
> Drive write-back cache is dangerous; controller write-back cache is
> dangerous if you don't have a battery backup unit on the card. Which of
> those two groups are you in?
>
> Either way, the next step is probably fsck while keeping your fingers
> crossed.
>
> /Peter
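For anyone landing here with the same "group descriptors corrupted" symptom,
the repair amounts to running e2fsck against the unmounted OST device. This
is only a sketch: the device name /dev/sda comes from the log above, while
the particular flags and the /mnt/ost0 mount point are assumptions of mine,
not something the poster confirmed; check the Lustre manual for your version
before running anything destructive.

```shell
# Make sure the OST is not mounted before checking it.
umount /dev/sda 2>/dev/null

# First pass: read-only check to see how bad the damage is.
# -f forces a full check, -n answers "no" to every repair prompt.
e2fsck -fn /dev/sda

# If the damage looks repairable, run a fixing pass.
# -p performs safe automatic ("preen") fixes; on heavily damaged
# filesystems it bails out, so fall back to -y (answer "yes" to all).
e2fsck -fp /dev/sda || e2fsck -fy /dev/sda

# Then try to bring the OST back online.
mount -t lustre /dev/sda /mnt/ost0
```

The read-only pass first is the important habit: it lets you gauge the scale
of the damage (and decide whether to image the device) before e2fsck starts
rewriting metadata.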
So, I have to ask: why disable write-back cache on the controller?

2009/9/4 ??? <eqzhou at gmail.com>:
> It's really dangerous! e2fsck brought it back.
>
> 2009/9/3 Peter Kjellstrom <cap at nsc.liu.se>:
> > Drive write-back cache is dangerous; controller write-back cache is
> > dangerous if you don't have a battery backup unit on the card.
> >
> > Either way, the next step is probably fsck while keeping your fingers
> > crossed.
> >
> > /Peter
This subject has been discussed many times...

Not just the controller, but the drives as well.

The problem is with write-back caches that _lie_ about the data being in
persistent store. The drive itself, with write-back cache enabled, lies and
says the data is on disk. RAID controllers likewise use write-back cache to
lie about the data being on disk.

So why do they lie? Because it makes the operating system run faster, as it
doesn't have to wait as long for the data to be "on disk".

What is the problem? The reason the OS is waiting for the data to be "on
disk" is to ensure consistency of the filesystem. If the controller/drive
says the data is in persistent store, but it is not actually there, and the
system loses power, crashes, or experiences some other problem, then when
the filesystem comes up things aren't in a consistent state.

With ext3, the journal is used to ensure the filesystem is recoverable --
assuming the controller does not lie -- even if the outstanding writes do
not complete. So while there may be loss of data, the filesystem is not
mangled by a hard crash. [Journaling is only one of many approaches taken
over the years to improve performance; see also Kirk McKusick's work on
soft updates for the BSD FFS filesystem --
http://www.ece.cmu.edu/~ganger/papers/mckusick99.pdf]

Note that write-back caches do not always lie about being in stable
storage -- _some_ HW RAID controllers do have special features to turn the
controller cache into non-volatile storage, with mirrored write cache and
battery backup. Battery backup makes it less likely the controller is
lying, at least until the system loses power for several days and the
battery dies.

Kevin

Mag Gam wrote:
> So, I have to ask: why disable write-back cache on the controller?
>
> 2009/9/4 ??? <eqzhou at gmail.com>:
> > It's really dangerous! e2fsck brought it back.
Kevin Van Maren wrote:
> This subject has been discussed many times...
>
> Not just the controller, but the drives as well.
>
> The problem is with write-back caches that _lie_ about the data being in
> persistent store. The drive itself, with write-back cache enabled, lies
> and says the data is on disk. RAID controllers likewise use write-back
> cache to lie about the data being on disk.

<snip>

I'm not convinced there's any lie involved. SCSI permits data to be written
back only as far as a cache and have a GOOD status returned at that point.
If for any reason a guarantee is required that the data really is on media,
then my understanding is that that's what the SYNCHRONIZE CACHE command
and/or the FUA (Force Unit Access) control bit is for.

What's not so clear to me is under what circumstances either technique is
triggered: whether an fsync, for example, is sufficient to propagate the
request down to the low-level device driver. It sounds like it would be
device-driver-specific.

--
===========================================================================
 ,-_|\  Richard Smith - Staff Engineer PAE
/     \ Sun Microsystems            Phone  : +61 3 9869 6200
richard.smith at Sun.COM            Direct : +61 3 9869 6224
\_,-._/ 476 St Kilda Road           Fax    : +61 3 9869 6290
   v    Melbourne Vic 3004 Australia
===========================================================================
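On the fsync question: from userspace, fsync(2) (or conv=fsync in dd) is the
portable way to ask for durability, and this tiny sketch shows what it does
and does not promise. The file name is just an example; the key point, as
this thread argues, is that a successful fsync only means the kernel pushed
the data to the device -- whether it is on the platter still depends on the
drive/controller cache settings.

```shell
# Write 4 KiB and ask dd to fsync(2) the output file before exiting.
# Without conv=fsync, dd may return while the data is still only in
# the OS page cache, not even handed to the device yet.
dd if=/dev/zero of=/tmp/wb_demo.dat bs=4096 count=1 conv=fsync 2>/dev/null

# At this point the kernel has issued the write to the device, but if
# the drive or RAID controller acknowledges writes out of a volatile
# write-back cache, the data may still not be on stable media.
ls -l /tmp/wb_demo.dat
```

In other words, fsync gets you as far as the device's word is good; closing
the last gap is exactly the SYNCHRONIZE CACHE / FUA / barrier discussion
above.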
On Monday 07 September 2009, Richard Smith wrote:
> I'm not convinced there's any lie involved. SCSI permits data to be
> written back only as far as a cache and have a GOOD status returned at
> that point. If for any reason a guarantee is required that the data
> really is on media, then my understanding is that that's what the
> SYNCHRONIZE CACHE command and/or the FUA (Force Unit Access) control
> bit is for.

I also feel that "lie" is a bit incorrect as a description. However, while
FUA does exist, it's typically ignored by controllers. If you're lucky you
have an "ignore FUA: enable/disable" setting.

/Peter

> What's not so clear to me is under what circumstances either technique
> is triggered: whether an fsync, for example, is sufficient to propagate
> the request down to the low-level device driver. It sounds like it
> would be device-driver-specific.
Peter Kjellstrom wrote:
> I also feel that "lie" is a bit incorrect as a description. However,
> while FUA does exist, it's typically ignored by controllers. If you're
> lucky you have an "ignore FUA: enable/disable" setting.

This is a concern: that FUA might be ignored silently. I wasn't aware that
implementing FUA was optional. If it is optional, then I would have
expected a mode page to describe whether a given device implements it.
Sounds like it would be safer to use SYNCHRONIZE CACHE, unless devices lie
about that too.

I realise that synchronisation isn't by itself sufficient to avoid
corruption: there is still a need for techniques, such as a journal, to
provide the atomic-update semantics required when a change involves
multiple non-contiguous blocks.

--
Richard Smith - Staff Engineer PAE, Sun Microsystems
On Mon, 2009-09-07 at 21:54 +1000, Richard Smith wrote:
> I'm not convinced there's any lie involved.

Well, whatever you want to call it... when the hardware tells the software
(Lustre) that something is on a platter, then in order for Lustre to work
properly it MUST be physically on the platter, or be able to make it there
in the face of other environmental issues, such as power outages, etc.
(i.e. this does allow for the battery-backed-cache case, so long as the
battery cannot drain before the disk unit is powered back up to receive
the writes held in the battery-backed cache).

b.
Brian J. Murrell wrote:
> Well, whatever you want to call it... when the hardware tells the
> software (Lustre) that something is on a platter, in order for Lustre to
> work properly, it MUST be physically on the platter, or be able to make
> it there in the face of other environmental issues, such as power ...

I don't think it's in dispute that there is a need at various times to
ensure that data has been written to non-volatile storage, at least not by
me. Where I was coming from is that high-performance software should be
encouraged to take full advantage of the capabilities of the underlying
hardware, provided it can do so safely. [And under some circumstances
people may even be prepared to sacrifice safety for a performance benefit,
but that's a separate issue.]

At least in the case of SCSI, the hardware doesn't tell the software
(Lustre) that something is on a platter. The hardware receives requests
and tells the software that it has obeyed them, or has failed in the
attempt. A WRITE carries with it no guarantee that the data is on
non-volatile media, hence my comment about also using SYNCHRONIZE CACHE or
the FUA bit if that is really what is wanted.

Neither SYNCHRONIZE CACHE nor the FUA bit is exposed at the application
level, but I think there's a reasonable expectation that the underlying
software will do whatever is necessary to maintain the integrity of a
filesystem. The way I interpret this is that the combination of filesystem
and device driver(s) should establish what the device is capable of, and
then use those capabilities to maintain integrity while maximizing
performance. Does it implement FUA? I'm caught out here: I was unaware
that there were devices that silently ignored FUA, and didn't know whether
SCSI permitted that. If FUA can't be relied upon, then I'd expect [system]
software to use SYNCHRONIZE CACHE instead. Admittedly I am out of my depth
here.

The block layer for a device is supposed to implement the concept of a
barrier request, and should take steps to force a drive to write data to
the media. Maybe some drivers do, and others don't. I expect the
implementation to require SYNCHRONIZE CACHE or FUA. At a higher level,
then, all that should be required is the appropriate generation of barrier
requests, assuming the underlying layer implements them.

The final piece of the puzzle, unless there's something I've overlooked, is
for appropriate warnings to be generated in the case where the software
stack cannot verify that it can implement barriers. This could mean [and
I'm only guessing here] a situation where a device has a write cache
enabled but provides no means of informing higher layers of how to ensure
data is written to physical media, in order. Adopting the same "fail-safe"
principle that railways/railroads in theory use, this should probably be
inverted: unless you see a message stating positively that the conditions
for using a write cache have been met, it would be prudent not to use one.
At the same time, I think there are circumstances in which the use of
write caches should be acceptable.

There's a bunch of other things that can go wrong with I/Os that none of
the above addresses, so nothing is completely risk-free.

--
Richard Smith - Staff Engineer PAE, Sun Microsystems
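As a practical footnote to this thread, the knobs being discussed can all
be inspected and set from userspace. A sketch follows; the device name is
an example, hdparm applies to ATA drives while sdparm (from sg3_utils)
speaks the SCSI caching mode page, and whether your particular controller
honors these requests rather than silently ignoring them is exactly the
open question debated above.

```shell
# ATA drives: query and disable the on-drive write-back cache.
hdparm -W /dev/sda          # show the current write-cache setting
hdparm -W0 /dev/sda         # turn the drive's write-back cache off

# SCSI drives: the same via the WCE bit in the caching mode page.
sdparm --get=WCE /dev/sda   # WCE: 1 means write-back caching is on
sdparm --clear=WCE /dev/sda # switch the drive to write-through

# Flush any cached data to media right now (SYNCHRONIZE CACHE).
sg_sync /dev/sda

# ext3 (and hence ldiskfs) can issue barrier requests at journal
# commits so cached writes reach media in the right order:
mount -o barrier=1 /dev/sda /mnt/ost0
```

Disabling the cache at the drive while leaving the controller's
battery-backed cache enabled is the usual compromise: the controller's
cache is (in principle) non-volatile, while the drive's is not.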