Hello,

Many warm thank yous to Bill Rugolsky Jr. and Stephen Tweedie for their help on this one. Both pointed out that since the file system is journaled, if the primary box (nas1) were to crash, the secondary box should mount the ext3 file system without any problems. Depending on the nature of the journal (metadata journaling and/or data journaling), we may have little or no data loss.

Bill Rugolsky also pointed out that I may get some performance benefit from the data=journal option, since I am exporting the ext3 filesystem with the "rw,sync,no_wdelay" options, thus forcing NFS to do synchronous commits. His reasoning is that with the "data=ordered" option for ext3 and the "sync" option for NFS, the system will have to work harder to write out data blocks and may need to "seek all over the disk to do so." This will decrease throughput. With "data=journal", however, the NFS forced syncs will write the data to a (likely) contiguous journal (less disk seeking, less latency, increased throughput) and allow the kernel to do its actual disk commits at its own pace.

Best Regards,
Bill Antoniadis

-------------------------------------------------------------------------------

Following is my original email:
-------------------------------

Setup:
------
I have two RedHat 7.2 (2.4.9-31) boxes that are attached to one external RAID unit. Both boxes are able to see the RAID unit as /dev/sdb1, but only one box mounts the unit at any given time (cat /proc/mounts yields: /dev/sdb1 /nas ext3 rw 0 0). The other box listens, via heartbeat (linux-ha), waiting to mount the RAID unit should its sibling crash (actually, should its heartbeat no longer be heard via serial and ethernet). The /nas directory is NFS exported with the rw,sync,no_wdelay options to several Linux and Tru64 boxes.

Questions:
----------
What will I encounter should the primary (i.e. the box currently mounting /dev/sdb1) crash, and the backup take over?
From my simulations, I see the backup mount /dev/sdb1, but I get the following in its /var/log/messages:

nas2 kernel: kjournald starting. Commit interval 5 seconds
nas2 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
nas2 kernel: EXT3 FS 2.4-0.9.11, 3 Oct 2001 on sd(8,17), internal journal
nas2 kernel: EXT3-fs: recovery complete.
nas2 kernel: EXT3-fs mounted filesystem with ordered data mode.

My limited understanding is that, since both the primary box (named "nas1") and the secondary (named "nas2") keep a metadata-only journal, data updates were flushed to disk (on nas1) but the metadata changes were not committed, and thus nas2 sees an inconsistent filesystem when mounting. Am I correct?

If we run with the nas2 box for a while and then decide to switch back to nas1, how will nas1 and its journal playback react to the changes committed by nas2 since the crash?

Would it be safer to always run e2fsck on nas2 takeover, prior to mounting /dev/sdb1?

Am I wrong in choosing EXT3 over EXT2 in this setup?

Any help is greatly appreciated.

Thanks in advance,
Bill Antoniadis
Stephen Tweedie
2002-Apr-05 07:33 UTC
Re: [SUMMARY] 2 Linux boxes, failover, & 1 EXT3 RAID
Hi,

On Tue, Apr 02, 2002 at 04:24:16PM -0500, Bill Antoniadis wrote:

> Many warm thank yous to Bill Rugolsky Jr. and Stephen Tweedie for their help on
> this one. Both pointed out that since the file system is journaled, if the
> primary box (nas1) were to crash, the secondary box should mount the ext3 file
> system without any problems. Depending on the nature of the journal (metadata
> journaling and/or data journaling), we may have little or no data loss.

More than that --- think of the failover as a simple system crash. The only difference is that the "reboot" involves bringing up the filesystem on a different node, rather than the original node. Thinking about it this way makes data integrity much easier to visualise.

Any time you want to make data persistent over a reboot at a certain point in your application, it's up to your application to ensure that it tells the filesystem so by calling fsync() or by using synchronised IO. The result of the fsync is *exactly* the same regardless of whether you are doing a single-node reboot or a two-node failover.

Unix performs universal write-behind data caching for local disk writes, so any application which assumes data integrity on disk without asking for that explicitly is simply broken.

Cheers,
 Stephen
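
[Archive note] As an illustration of the point about fsync() above, here is a minimal Python sketch of an application explicitly asking for persistence before it treats a write as committed. It is not from the original thread. The helper name write_durably and the file name in the usage note are hypothetical, and the sketch assumes a POSIX system where a directory can be opened read-only and passed to fsync().

```python
import os


def write_durably(path: str, data: bytes) -> None:
    """Write data to path; return only after fsync() has succeeded.

    Hypothetical helper for illustration. Until fsync() returns, the data
    may exist only in the kernel's write-behind cache and can be lost in a
    crash (or, equivalently, a failover to another node).
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push the file's data and metadata to disk.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Assumption: POSIX semantics. Syncing the containing directory makes
    # the directory entry of a newly created file persistent as well.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A caller would invoke, for example, write_durably("/nas/record.dat", b"payload") and acknowledge the update to its own clients only after the call returns. An application that skips the fsync() step has no such guarantee, whether the restart happens on the same node or on the failover node.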