Alexander Zarochentsev
2008-Aug-20 11:23 UTC
[Lustre-devel] [RFC] "lctl readonly" modification proposal
Hello, Currently "lctl readonly" command is used in recovery-related acceptance test as a disk write barrier, any server operations after such a barrier do not change disk state. replay-dual.sh:test_14 is an example, it uses test-framework.sh:replay_barrier which calls "lctl readonly". However "lctl readonly" cannot immediately "freeze" disk state because it does device sync right before setting device r/o. The additional sync breaks COS unit tests: COS-effect (syncs between dependent transactions) is hidden before the R/O barrier because the barrier does device sync anyway and COS-effect is invisible after the barrier because all disk writes are disabled already. Currently syncs are issued twice at different levels of call stack: lctl readonly | mdt_iocontrol -> dt_sync | osd_ro | __lvfs_set_rdonly -> lvfs_sbdev_sync(dev) | dev_set_rdonly(dev, jdev) Is there any requirement to have a device sync before setting the device r/o? If not, syncless variant of "lctl readonly" would be more useful for testing. If changing of "lctl readonly" behavior is not possible, it would be good to have a syncless set R/O operation under another ioctl. syncless R/O and a network barrier might be a good simulation of server crash without a need of machine restart. Thanks, -- Alexander "Zam" Zarochentsev Staff Engineer Lustre Group, Sun Microsystems
Peter Braam
2008-Aug-20 17:39 UTC
[Lustre-devel] [RFC] "lctl readonly" modification proposal
If I remember correctly the flush is only there to try to reduce rollback. However, given that failover may happen on a system where the software is not fully responsive, one could question the wisdom of this reduction. In any case having more replay due to more rollback is harmless. I do not see a problem removing this, and I would avoid making it an option. Peter On 8/20/08 5:23 AM, "Alexander Zarochentsev" <Alexander.Zarochentsev at Sun.COM> wrote:> Hello, > > Currently "lctl readonly" command is used in recovery-related acceptance > test as a disk write barrier, any server operations after such a > barrier do not change disk state. replay-dual.sh:test_14 is an example, > it uses test-framework.sh:replay_barrier which calls "lctl readonly". > > However "lctl readonly" cannot immediately "freeze" disk state because > it does device sync right before setting device r/o. > > The additional sync breaks COS unit tests: COS-effect (syncs between > dependent transactions) is hidden before the R/O barrier because the > barrier does device sync anyway and COS-effect is invisible after the > barrier because all disk writes are disabled already. > > Currently syncs are issued twice at different levels of call stack: > > lctl readonly > | > mdt_iocontrol -> dt_sync > | > osd_ro > | > __lvfs_set_rdonly -> lvfs_sbdev_sync(dev) > | > dev_set_rdonly(dev, jdev) > > Is there any requirement to have a device sync before setting the device > r/o? > > If not, syncless variant of "lctl readonly" would be more useful for > testing. If changing of "lctl readonly" behavior is not possible, it > would be good to have a syncless set R/O operation under another ioctl. > > syncless R/O and a network barrier might be a good simulation of server > crash without a need of machine restart. > > Thanks,
Andreas Dilger
2008-Aug-20 19:29 UTC
[Lustre-devel] [RFC] "lctl readonly" modification proposal
On Aug 20, 2008 11:39 -0600, Peter J. Braam wrote:> If I remember correctly the flush is only there to try to reduce rollback. > However, given that failover may happen on a system where the software is > not fully responsive, one could question the wisdom of this reduction. In > any case having more replay due to more rollback is harmless. > > I do not see a problem removing this, and I would avoid making it an option.One major caveat is that with mountconf we ALWAYS mark the device as "readonly" when it is being unmounted. If we don''t have the sync there I fear data loss after a clean server unmount, when all clients are also being unmounted and cannot do replay. I''d be thrilled if this was fixed so a normal shutdown did not do a "force" unmount and set the device read-only, because that would also avoid leaving the journal needing recovery.> On 8/20/08 5:23 AM, "Alexander Zarochentsev" > <Alexander.Zarochentsev at Sun.COM> wrote: > > > Hello, > > > > Currently "lctl readonly" command is used in recovery-related acceptance > > test as a disk write barrier, any server operations after such a > > barrier do not change disk state. replay-dual.sh:test_14 is an example, > > it uses test-framework.sh:replay_barrier which calls "lctl readonly". > > > > However "lctl readonly" cannot immediately "freeze" disk state because > > it does device sync right before setting device r/o. > > > > The additional sync breaks COS unit tests: COS-effect (syncs between > > dependent transactions) is hidden before the R/O barrier because the > > barrier does device sync anyway and COS-effect is invisible after the > > barrier because all disk writes are disabled already. > > > > Currently syncs are issued twice at different levels of call stack: > > > > lctl readonly > > | > > mdt_iocontrol -> dt_sync > > | > > osd_ro > > | > > __lvfs_set_rdonly -> lvfs_sbdev_sync(dev) > > | > > dev_set_rdonly(dev, jdev) > > > > Is there any requirement to have a device sync before setting the device > > r/o? > > > > If not, syncless variant of "lctl readonly" would be more useful for > > testing. If changing of "lctl readonly" behavior is not possible, it > > would be good to have a syncless set R/O operation under another ioctl. > > > > syncless R/O and a network barrier might be a good simulation of server > > crash without a need of machine restart. > > > > Thanks, > > > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-develCheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc.
Mikhail Pershin
2008-Aug-20 19:42 UTC
[Lustre-devel] [RFC] "lctl readonly" modification proposal
On Wed, 20 Aug 2008 21:39:13 +0400, Peter Braam <Peter.Braam at sun.com> wrote:> If I remember correctly the flush is only there to try to reduce > rollback.Also to be sure that all before barrier is on disk, many test cases about recovery rely on the fact that operations before barrier are on disk. This is important with HARD failure case especially, without sync before barrier the operations before barrier may be lost and affect the tests. I agree that syncless variant of barrier is needed but we also need barrier with sync depending on test case. -- Mikhail Pershin Staff Engineer Lustre Group Sun Microsystems, Inc.
Peter Braam
2008-Aug-20 19:59 UTC
[Lustre-devel] [RFC] "lctl readonly" modification proposal
Ah yes - testing requires the sync and shutdown also. The former unfortunately does demand an option, but the justification is now very clear. Peter On 8/20/08 1:42 PM, "Mikhail Pershin" <Mikhail.Pershin at Sun.COM> wrote:> On Wed, 20 Aug 2008 21:39:13 +0400, Peter Braam <Peter.Braam at sun.com> > wrote: > >> If I remember correctly the flush is only there to try to reduce >> rollback. > > Also to be sure that all before barrier is on disk, many test cases about > recovery rely on the fact that operations before barrier are on disk. This > is important with HARD failure case especially, without sync before > barrier the operations before barrier may be lost and affect the tests. > I agree that syncless variant of barrier is needed but we also need > barrier with sync depending on test case.
Alexander Zarochentsev
2008-Aug-21 09:24 UTC
[Lustre-devel] [RFC] "lctl readonly" modification proposal
On 20 August 2008 23:42:39 Mikhail Pershin wrote:> On Wed, 20 Aug 2008 21:39:13 +0400, Peter Braam <Peter.Braam at sun.com> > > wrote: > > If I remember correctly the flush is only there to try to reduce > > rollback. > > Also to be sure that all before barrier is on disk, many test cases > about recovery rely on the fact that operations before barrier are on > disk. This is important with HARD failure case especially, without > sync before barrier the operations before barrier may be lost and > affect the tests. I agree that syncless variant of barrier is needed > but we also need barrier with sync depending on test case.You don''t need lctl readonly to do a sync, the replay_barrier function from test-framework.sh adds a sync explicitly: replay_barrier() { local facet=$1 do_facet $facet sync df $MOUNT local svc=${facet}_svc do_facet $facet $LCTL --device %${!svc} readonly do_facet $facet $LCTL --device %${!svc} notransno do_facet $facet $LCTL mark "$facet REPLAY BARRIER on ${!svc}" $LCTL mark "local REPLAY BARRIER on ${!svc}" } WBR, -- Alexander "Zam" Zarochentsev Staff Engineer Lustre Group, Sun Microsystems
Mikhail Pershin
2008-Aug-21 10:03 UTC
[Lustre-devel] [RFC] "lctl readonly" modification proposal
On Thu, 21 Aug 2008 13:24:08 +0400, Alexander Zarochentsev <Alexander.Zarochentsev at sun.com> wrote:> On 20 August 2008 23:42:39 Mikhail Pershin wrote: >> Also to be sure that all before barrier is on disk, many test cases >> about recovery rely on the fact that operations before barrier are on >> disk. This is important with HARD failure case especially, without >> sync before barrier the operations before barrier may be lost and >> affect the tests. I agree that syncless variant of barrier is needed >> but we also need barrier with sync depending on test case. > > You don''t need lctl readonly to do a sync, the replay_barrier function > from test-framework.sh adds a sync explicitlyYou are right, it is safe in that case. -- Mikhail Pershin Staff Engineer Lustre Group Sun Microsystems, Inc.
Alex Zhuravlev
2008-Aug-21 10:43 UTC
[Lustre-devel] [RFC] "lctl readonly" modification proposal
my 2c: __lvfs_set_rdonly() -> fsync_bdev() -> fsync_super() -> ext3_sync_fs(): if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) { if (wait) log_wait_commit(EXT3_SB(sb)->s_journal, target); if this is called by process started transaction - deadlock before rdonly if some another process holding transaction hits assertion - again deadlock before rdonly. thanks, Alex> On Thu, 21 Aug 2008 13:24:08 +0400, Alexander Zarochentsev > <Alexander.Zarochentsev at sun.com> wrote: > > > On 20 August 2008 23:42:39 Mikhail Pershin wrote: > >> Also to be sure that all before barrier is on disk, many test cases > >> about recovery rely on the fact that operations before barrier are > on > >> disk. This is important with HARD failure case especially, without > >> sync before barrier the operations before barrier may be lost and > >> affect the tests. I agree that syncless variant of barrier is needed > >> but we also need barrier with sync depending on test case. > > > > You don''t need lctl readonly to do a sync, the replay_barrier function > > from test-framework.sh adds a sync explicitly > > You are right, it is safe in that case. > > > -- > Mikhail Pershin > Staff Engineer > Lustre Group > Sun Microsystems, Inc. > _______________________________________________ > Lustre-devel mailing list > Lustre-devel at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-devel
Alexander Zarochentsev
2008-Aug-22 15:45 UTC
[Lustre-devel] [RFC] "lctl readonly" modification proposal
On 20 August 2008 23:29:22 Andreas Dilger wrote:> On Aug 20, 2008 11:39 -0600, Peter J. Braam wrote: > > If I remember correctly the flush is only there to try to reduce > > rollback. However, given that failover may happen on a system where > > the software is not fully responsive, one could question the wisdom > > of this reduction. In any case having more replay due to more > > rollback is harmless. > > > > I do not see a problem removing this, and I would avoid making it > > an option. > > One major caveat is that with mountconf we ALWAYS mark the device as > "readonly" when it is being unmounted. > If we don''t have the sync > there I fear data loss after a clean server unmount, when all clients > are also being unmounted and cannot do replay. > > I''d be thrilled if this was fixed so a normal shutdown did not do a > "force" unmount and set the device read-only, because that would also > avoid leaving the journal needing recovery.I think the proposed changes do not touch umount or mountconf. The affected procedures are: 1. obd_fail_write calls lvfs_set_rdonly -- IMO it might be better to not sync device there. 2. filter_iocontrol -> lvfs_set_rdonly -- not changed, there is filter_sync() call right before lvfs_set_readonly 3. mdt_iocontrol -> dt_ro/osd_ro -> __lvfs_set_rdonly -- it is what I want to change 4. mdt_fail_write -> dt_ro/osd_ro -> __lvfs_set_rdonly -- sync is needed? Here is the patch: --- Index: lustre/lvfs/lvfs_linux.c ==================================================================RCS file: /cvsroot/cfs/lustre-core/lvfs/lvfs_linux.c,v retrieving revision 1.41 diff -u -p -r1.41 lvfs_linux.c --- lustre/lvfs/lvfs_linux.c 7 Aug 2008 14:50:49 -0000 1.41 +++ lustre/lvfs/lvfs_linux.c 22 Aug 2008 14:35:23 -0000 @@ -476,7 +476,6 @@ int dev_check_rdonly(lvfs_sbdev_type dev void __lvfs_set_rdonly(lvfs_sbdev_type dev, lvfs_sbdev_type jdev) { - lvfs_sbdev_sync(dev); if (jdev && (jdev != dev)) { CDEBUG(D_IOCTL | D_HA, "set journal dev %lx rdonly\n", (long)jdev); Index: lustre/mdt/mdt_handler.c ==================================================================RCS file: /cvsroot/cfs/lustre-core/mdt/mdt_handler.c,v retrieving revision 1.46 diff -u -p -r1.46 mdt_handler.c --- lustre/mdt/mdt_handler.c 7 Aug 2008 16:52:04 -0000 1.46 +++ lustre/mdt/mdt_handler.c 22 Aug 2008 14:35:26 -0000 @@ -4630,7 +4716,6 @@ static int mdt_iocontrol(unsigned int cm rc = dt->dd_ops->dt_sync(&env, dt); break; case OBD_IOC_SET_READONLY: - rc = dt->dd_ops->dt_sync(&env, dt); dt->dd_ops->dt_ro(&env, dt); break; case OBD_IOC_ABORT_RECOVERY: Index: lustre/tests/test-framework.sh ==================================================================RCS file: /cvsroot/cfs/lustre-core/tests/test-framework.sh,v retrieving revision 1.119 diff -u -p -r1.119 test-framework.sh --- lustre/tests/test-framework.sh 5 Aug 2008 21:31:26 -0000 1.119 +++ lustre/tests/test-framework.sh 22 Aug 2008 14:35:34 -0000 @@ -770,6 +770,16 @@ replay_barrier_nodf() { $LCTL mark "local REPLAY BARRIER on ${!svc}" } +replay_barrier_nosync() { + local facet=$1 echo running=${running} + local svc=${facet}_svc + echo Replay barrier on ${!svc} + do_facet $facet $LCTL --device %${!svc} readonly + do_facet $facet $LCTL --device %${!svc} notransno + do_facet $facet $LCTL mark "$facet REPLAY BARRIER on ${!svc}" + $LCTL mark "local REPLAY BARRIER on ${!svc}" +} + mds_evict_client() { UUID=`lctl get_param -n mdc.${mds1_svc}-mdc-*.uuid` do_facet mds1 "lctl set_param -n mdt.${mds1_svc}.evict_client $UUID" -- Alexander "Zam" Zarochentsev Staff Engineer Lustre Group, Sun Microsystems
Andreas Dilger
2008-Aug-24 02:02 UTC
[Lustre-devel] [RFC] "lctl readonly" modification proposal
On Aug 22, 2008 19:45 +0400, Alexander Zarochentsev wrote:> On 20 August 2008 23:29:22 Andreas Dilger wrote: > > On Aug 20, 2008 11:39 -0600, Peter J. Braam wrote: > > > If I remember correctly the flush is only there to try to reduce > > > rollback. However, given that failover may happen on a system where > > > the software is not fully responsive, one could question the wisdom > > > of this reduction. In any case having more replay due to more > > > rollback is harmless. > > > > One major caveat is that with mountconf we ALWAYS mark the device as > > "readonly" when it is being unmounted. > > If we don''t have the sync > > there I fear data loss after a clean server unmount, when all clients > > are also being unmounted and cannot do replay. > > > > I''d be thrilled if this was fixed so a normal shutdown did not do a > > "force" unmount and set the device read-only, because that would also > > avoid leaving the journal needing recovery. > > I think the proposed changes do not touch umount or mountconf. > The affected procedures are: > > 1. obd_fail_write calls lvfs_set_rdonly > -- IMO it might be better to not sync device there. > 2. filter_iocontrol -> lvfs_set_rdonly > -- not changed, there is filter_sync() call right before > lvfs_set_readonly > 3. mdt_iocontrol -> dt_ro/osd_ro -> __lvfs_set_rdonly > -- it is what I want to change > 4. mdt_fail_write -> dt_ro/osd_ro -> __lvfs_set_rdonly > -- sync is needed?I can''t see the details now, but in the case of a "normal" unmount of an MDT or OST it does the shutdown in "failover" mode, in order to preserve the client state instead of evicting the clients. Otherwise, an accidental unmount/shutdown would cause application errors on the clients. Nathan, can you please look into this? You know the twisty-turny mountconf callbacks better than anyone. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc.
Nathaniel Rutman
2008-Aug-28 17:11 UTC
[Lustre-devel] [RFC] "lctl readonly" modification proposal
Andreas Dilger wrote:> On Aug 22, 2008 19:45 +0400, Alexander Zarochentsev wrote: > >> On 20 August 2008 23:29:22 Andreas Dilger wrote: >> >>> On Aug 20, 2008 11:39 -0600, Peter J. Braam wrote: >>> >>>> If I remember correctly the flush is only there to try to reduce >>>> rollback. However, given that failover may happen on a system where >>>> the software is not fully responsive, one could question the wisdom >>>> of this reduction. In any case having more replay due to more >>>> rollback is harmless. >>>>IIRC one other reason for the flush is that loopback disks tend not to "really" flush everything to disk when asked, and additional sync calls seem to help. So beware when running loopback disks...>>> One major caveat is that with mountconf we ALWAYS mark the device as >>> "readonly" when it is being unmounted. >>> If we don''t have the sync >>> there I fear data loss after a clean server unmount, when all clients >>> are also being unmounted and cannot do replay. >>> >>> I''d be thrilled if this was fixed so a normal shutdown did not do a >>> "force" unmount and set the device read-only, because that would also >>> avoid leaving the journal needing recovery. >>> >umount does either force or failover shutdown; failover sets readonly but force does not. Test-framework regularly does both. Andreas, if you want to avoid journal recovery, use umount -f. Really read-only is intended to simulate a power loss, so I think sync before it is a bit of a cheat. Having said that, I think there were real issues that prompted us to include the sync in the first place, and some heavy recovery testing (including loopback devs) is in order if it is removed.
Andreas Dilger
2008-Aug-28 20:25 UTC
[Lustre-devel] [RFC] "lctl readonly" modification proposal
On Aug 28, 2008 10:11 -0700, Nathaniel Rutman wrote:> Andreas Dilger wrote: > >>> I''d be thrilled if this was fixed so a normal shutdown did not do a > >>> "force" unmount and set the device read-only, because that would also > >>> avoid leaving the journal needing recovery. > > umount does either force or failover shutdown; failover sets readonly > but force does not. Test-framework regularly does both. Andreas, if > you want to avoid journal recovery, use umount -f.Well, issue is that with normal mount/unmount the filesystem is always dirty and needs journal recovery. Also, for the DMU, where we will likely not have "read-only" support we need to be able to just block all client transactions and flush everything to disk, doing a normal shutdown. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc.
Nathaniel Rutman
2008-Aug-29 18:46 UTC
[Lustre-devel] [RFC] "lctl readonly" modification proposal
Andreas Dilger wrote:> On Aug 28, 2008 10:11 -0700, Nathaniel Rutman wrote: > >> Andreas Dilger wrote: >> >>>>> I''d be thrilled if this was fixed so a normal shutdown did not do a >>>>> "force" unmount and set the device read-only, because that would also >>>>> avoid leaving the journal needing recovery. >>>>> >> umount does either force or failover shutdown; failover sets readonly >> but force does not. Test-framework regularly does both. Andreas, if >> you want to avoid journal recovery, use umount -f. >> > > Well, issue is that with normal mount/unmount the filesystem is always > dirty and needs journal recovery. Also, for the DMU, where we will > likely not have "read-only" support we need to be able to just block > all client transactions and flush everything to disk, doing a normal > shutdown. >Perhaps we should remove the sync/readonly for failover shutdown, and instead add a OBD_FAIL_LOC (or just use the ioctl) for "power loss" failover testing. One possible gotcha here is that sometimes it takes a long time for the umount to complete, and we want to make sure the failover partner does not try to start writing to the disk until the unmounting one is done.