thr3ads.net - Lustre devel - [Lustre-devel] [RFC] "lctl readonly" modification proposal [Aug 2008]

If this information is useful, please help other people find it:
Share via:

Alexander Zarochentsev

2008-Aug-20 11:23 UTC

[Lustre-devel] [RFC] "lctl readonly" modification proposal

Hello,

Currently "lctl readonly" command is used in recovery-related
acceptance
test as a disk write barrier, any server operations after such a 
barrier do not change disk state. replay-dual.sh:test_14 is an example, 
it uses test-framework.sh:replay_barrier which calls "lctl readonly".

However "lctl readonly" cannot immediately "freeze" disk
state because
it does device sync right before setting device r/o.

The additional sync breaks COS unit tests: COS-effect (syncs between 
dependent transactions) is hidden before the R/O barrier because the 
barrier does device sync anyway and COS-effect is invisible after the 
barrier because all disk writes are disabled already.

Currently syncs are issued twice at different levels of call stack:

lctl readonly
      |
mdt_iocontrol -> dt_sync
      |
   osd_ro
      |
__lvfs_set_rdonly -> lvfs_sbdev_sync(dev)
      |
dev_set_rdonly(dev, jdev)

Is there any requirement to have a device sync before setting the device 
r/o?  

If not, syncless variant of "lctl readonly" would be more useful for 
testing. If changing of "lctl readonly" behavior is not possible, it 
would be good to have a syncless set R/O operation under another ioctl.

syncless R/O and a network barrier might be a good simulation of server 
crash without a need of machine restart.

Thanks,
-- 
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems

Peter Braam

2008-Aug-20 17:39 UTC

head link

[Lustre-devel] [RFC] "lctl readonly" modification proposal

If I remember correctly the flush is only there to try to reduce rollback.
However, given that failover may happen on a system where the software is
not fully responsive, one could question the wisdom of this reduction.  In
any case having more replay due to more rollback is harmless.

I do not see a problem removing this, and I would avoid making it an option.

Peter


On 8/20/08 5:23 AM, "Alexander Zarochentsev"
<Alexander.Zarochentsev at Sun.COM> wrote:
> Hello,
> 
> Currently "lctl readonly" command is used in recovery-related
acceptance
> test as a disk write barrier, any server operations after such a
> barrier do not change disk state. replay-dual.sh:test_14 is an example,
> it uses test-framework.sh:replay_barrier which calls "lctl
readonly".
> 
> However "lctl readonly" cannot immediately "freeze"
disk state because
> it does device sync right before setting device r/o.
> 
> The additional sync breaks COS unit tests: COS-effect (syncs between
> dependent transactions) is hidden before the R/O barrier because the
> barrier does device sync anyway and COS-effect is invisible after the
> barrier because all disk writes are disabled already.
> 
> Currently syncs are issued twice at different levels of call stack:
> 
> lctl readonly
>       |
> mdt_iocontrol -> dt_sync
>       |
>    osd_ro
>       |
> __lvfs_set_rdonly -> lvfs_sbdev_sync(dev)
>       |
> dev_set_rdonly(dev, jdev)
> 
> Is there any requirement to have a device sync before setting the device
> r/o?  
> 
> If not, syncless variant of "lctl readonly" would be more useful
for
> testing. If changing of "lctl readonly" behavior is not possible,
it
> would be good to have a syncless set R/O operation under another ioctl.
> 
> syncless R/O and a network barrier might be a good simulation of server
> crash without a need of machine restart.
> 
> Thanks,

Andreas Dilger

2008-Aug-20 19:29 UTC

head link

[Lustre-devel] [RFC] "lctl readonly" modification proposal

On Aug 20, 2008  11:39 -0600, Peter J. Braam wrote:> If I remember correctly the flush is only there to try to reduce rollback.
> However, given that failover may happen on a system where the software is
> not fully responsive, one could question the wisdom of this reduction.  In
> any case having more replay due to more rollback is harmless.
> 
> I do not see a problem removing this, and I would avoid making it an
option.
One major caveat is that with mountconf we ALWAYS mark the device as
"readonly" when it is being unmounted.  If we don''t have the
sync there
I fear data loss after a clean server unmount, when all clients are also
being unmounted and cannot do replay.

I''d be thrilled if this was fixed so a normal shutdown did not do a
"force" unmount and set the device read-only, because that would also
avoid leaving the journal needing recovery.
> On 8/20/08 5:23 AM, "Alexander Zarochentsev"
> <Alexander.Zarochentsev at Sun.COM> wrote:
> 
> > Hello,
> > 
> > Currently "lctl readonly" command is used in
recovery-related acceptance
> > test as a disk write barrier, any server operations after such a
> > barrier do not change disk state. replay-dual.sh:test_14 is an
example,
> > it uses test-framework.sh:replay_barrier which calls "lctl
readonly".
> > 
> > However "lctl readonly" cannot immediately
"freeze" disk state because
> > it does device sync right before setting device r/o.
> > 
> > The additional sync breaks COS unit tests: COS-effect (syncs between
> > dependent transactions) is hidden before the R/O barrier because the
> > barrier does device sync anyway and COS-effect is invisible after the
> > barrier because all disk writes are disabled already.
> > 
> > Currently syncs are issued twice at different levels of call stack:
> > 
> > lctl readonly
> >       |
> > mdt_iocontrol -> dt_sync
> >       |
> >    osd_ro
> >       |
> > __lvfs_set_rdonly -> lvfs_sbdev_sync(dev)
> >       |
> > dev_set_rdonly(dev, jdev)
> > 
> > Is there any requirement to have a device sync before setting the
device
> > r/o?  
> > 
> > If not, syncless variant of "lctl readonly" would be more
useful for
> > testing. If changing of "lctl readonly" behavior is not
possible, it
> > would be good to have a syncless set R/O operation under another
ioctl.
> > 
> > syncless R/O and a network barrier might be a good simulation of
server
> > crash without a need of machine restart.
> > 
> > Thanks,
> 
> 
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Mikhail Pershin

2008-Aug-20 19:42 UTC

head link

[Lustre-devel] [RFC] "lctl readonly" modification proposal

On Wed, 20 Aug 2008 21:39:13 +0400, Peter Braam <Peter.Braam at sun.com>  
wrote:
> If I remember correctly the flush is only there to try to reduce  
> rollback.
Also to be sure that all before barrier is on disk, many test cases about  
recovery rely on the fact that operations before barrier are on disk. This  
is important with HARD failure case especially, without sync before  
barrier the operations before barrier may be lost and affect the tests.
I agree that syncless variant of barrier is needed but we also need  
barrier with sync depending on test case.

-- 
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.

Peter Braam

2008-Aug-20 19:59 UTC

head link

[Lustre-devel] [RFC] "lctl readonly" modification proposal

Ah yes - testing requires the sync and shutdown also.  The former
unfortunately does demand an option, but the justification is now very
clear.

Peter


On 8/20/08 1:42 PM, "Mikhail Pershin" <Mikhail.Pershin at
Sun.COM> wrote:
> On Wed, 20 Aug 2008 21:39:13 +0400, Peter Braam <Peter.Braam at
sun.com>
> wrote:
> 
>> If I remember correctly the flush is only there to try to reduce
>> rollback.
> 
> Also to be sure that all before barrier is on disk, many test cases about
> recovery rely on the fact that operations before barrier are on disk. This
> is important with HARD failure case especially, without sync before
> barrier the operations before barrier may be lost and affect the tests.
> I agree that syncless variant of barrier is needed but we also need
> barrier with sync depending on test case.

Alexander Zarochentsev

2008-Aug-21 09:24 UTC

head link

[Lustre-devel] [RFC] "lctl readonly" modification proposal

On 20 August 2008 23:42:39 Mikhail Pershin wrote:> On Wed, 20 Aug 2008 21:39:13 +0400, Peter Braam <Peter.Braam at
sun.com>
>
> wrote:
> > If I remember correctly the flush is only there to try to reduce
> > rollback.
>
> Also to be sure that all before barrier is on disk, many test cases
> about recovery rely on the fact that operations before barrier are on
> disk. This is important with HARD failure case especially, without
> sync before barrier the operations before barrier may be lost and
> affect the tests. I agree that syncless variant of barrier is needed
> but we also need barrier with sync depending on test case.
You don''t need lctl readonly to do a sync, the replay_barrier function 
from test-framework.sh adds a sync explicitly:

replay_barrier() {
    local facet=$1
    do_facet $facet sync
    df $MOUNT
    local svc=${facet}_svc
    do_facet $facet $LCTL --device %${!svc} readonly
    do_facet $facet $LCTL --device %${!svc} notransno
    do_facet $facet $LCTL mark "$facet REPLAY BARRIER on ${!svc}"
    $LCTL mark "local REPLAY BARRIER on ${!svc}"
}

WBR,
-- 
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems

Mikhail Pershin

2008-Aug-21 10:03 UTC

head link

[Lustre-devel] [RFC] "lctl readonly" modification proposal

On Thu, 21 Aug 2008 13:24:08 +0400, Alexander Zarochentsev  
<Alexander.Zarochentsev at sun.com> wrote:
> On 20 August 2008 23:42:39 Mikhail Pershin wrote:
>> Also to be sure that all before barrier is on disk, many test cases
>> about recovery rely on the fact that operations before barrier are on
>> disk. This is important with HARD failure case especially, without
>> sync before barrier the operations before barrier may be lost and
>> affect the tests. I agree that syncless variant of barrier is needed
>> but we also need barrier with sync depending on test case.
>
> You don''t need lctl readonly to do a sync, the replay_barrier
function
> from test-framework.sh adds a sync explicitly
You are right, it is safe in that case.


-- 
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.

Alex Zhuravlev

2008-Aug-21 10:43 UTC

head link

[Lustre-devel] [RFC] "lctl readonly" modification proposal

my 2c:

__lvfs_set_rdonly() -> fsync_bdev() -> fsync_super() -> ext3_sync_fs():

if (journal_start_commit(EXT3_SB(sb)->s_journal, &target)) {
if (wait)
  log_wait_commit(EXT3_SB(sb)->s_journal, target);

if this is called by process started transaction - deadlock before rdonly
if some another process holding transaction hits assertion - again deadlock
before rdonly.

thanks, Alex

> On Thu, 21 Aug 2008 13:24:08 +0400, Alexander Zarochentsev  
> <Alexander.Zarochentsev at sun.com> wrote:
> 
> > On 20 August 2008 23:42:39 Mikhail Pershin wrote:
> >> Also to be sure that all before barrier is on disk, many test
cases
> >> about recovery rely on the fact that operations before barrier are
> on
> >> disk. This is important with HARD failure case especially, without
> >> sync before barrier the operations before barrier may be lost and
> >> affect the tests. I agree that syncless variant of barrier is
needed
> >> but we also need barrier with sync depending on test case.
> >
> > You don''t need lctl readonly to do a sync, the replay_barrier
function
> > from test-framework.sh adds a sync explicitly
> 
> You are right, it is safe in that case.
> 
> 
> -- 
> Mikhail Pershin
> Staff Engineer
> Lustre Group
> Sun Microsystems, Inc.
> _______________________________________________
> Lustre-devel mailing list
> Lustre-devel at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel

Alexander Zarochentsev

2008-Aug-22 15:45 UTC

head link

[Lustre-devel] [RFC] "lctl readonly" modification proposal

On 20 August 2008 23:29:22 Andreas Dilger wrote:> On Aug 20, 2008  11:39 -0600, Peter J. Braam wrote:
> > If I remember correctly the flush is only there to try to reduce
> > rollback. However, given that failover may happen on a system where
> > the software is not fully responsive, one could question the wisdom
> > of this reduction.  In any case having more replay due to more
> > rollback is harmless.
> >
> > I do not see a problem removing this, and I would avoid making it
> > an option.
>
> One major caveat is that with mountconf we ALWAYS mark the device as
> "readonly" when it is being unmounted.
> If we don''t have the sync 
> there I fear data loss after a clean server unmount, when all clients
> are also being unmounted and cannot do replay.
>
> I''d be thrilled if this was fixed so a normal shutdown did not do
a
> "force" unmount and set the device read-only, because that would
also
> avoid leaving the journal needing recovery.
I think the proposed changes do not touch umount or mountconf.
The affected procedures are:

1. obd_fail_write calls lvfs_set_rdonly    
	-- IMO it might be better to not sync device there.
2. filter_iocontrol -> lvfs_set_rdonly  
	-- not changed, there is filter_sync() call right before
           lvfs_set_readonly
3. mdt_iocontrol -> dt_ro/osd_ro -> __lvfs_set_rdonly
        -- it is what I want to change
4. mdt_fail_write -> dt_ro/osd_ro -> __lvfs_set_rdonly
        -- sync is needed?

Here is the patch:
---
Index: lustre/lvfs/lvfs_linux.c
==================================================================RCS file:
/cvsroot/cfs/lustre-core/lvfs/lvfs_linux.c,v
retrieving revision 1.41
diff -u -p -r1.41 lvfs_linux.c
--- lustre/lvfs/lvfs_linux.c	7 Aug 2008 14:50:49 -0000	1.41
+++ lustre/lvfs/lvfs_linux.c	22 Aug 2008 14:35:23 -0000
@@ -476,7 +476,6 @@ int dev_check_rdonly(lvfs_sbdev_type dev
 
 void __lvfs_set_rdonly(lvfs_sbdev_type dev, lvfs_sbdev_type jdev)
 {
-        lvfs_sbdev_sync(dev);
         if (jdev && (jdev != dev)) {
                 CDEBUG(D_IOCTL | D_HA, "set journal dev %lx
rdonly\n",
                        (long)jdev);
Index: lustre/mdt/mdt_handler.c
==================================================================RCS file:
/cvsroot/cfs/lustre-core/mdt/mdt_handler.c,v
retrieving revision 1.46
diff -u -p -r1.46 mdt_handler.c
--- lustre/mdt/mdt_handler.c	7 Aug 2008 16:52:04 -0000	1.46
+++ lustre/mdt/mdt_handler.c	22 Aug 2008 14:35:26 -0000
@@ -4630,7 +4716,6 @@ static int mdt_iocontrol(unsigned int cm
                 rc = dt->dd_ops->dt_sync(&env, dt);
                 break;
         case OBD_IOC_SET_READONLY:
-                rc = dt->dd_ops->dt_sync(&env, dt);
                 dt->dd_ops->dt_ro(&env, dt);
                 break;
         case OBD_IOC_ABORT_RECOVERY:
Index: lustre/tests/test-framework.sh
==================================================================RCS file:
/cvsroot/cfs/lustre-core/tests/test-framework.sh,v
retrieving revision 1.119
diff -u -p -r1.119 test-framework.sh
--- lustre/tests/test-framework.sh	5 Aug 2008 21:31:26 -0000	1.119
+++ lustre/tests/test-framework.sh	22 Aug 2008 14:35:34 -0000
@@ -770,6 +770,16 @@ replay_barrier_nodf() {
     $LCTL mark "local REPLAY BARRIER on ${!svc}"
 }
 
+replay_barrier_nosync() {
+    local facet=$1    echo running=${running}
+    local svc=${facet}_svc
+    echo Replay barrier on ${!svc}
+    do_facet $facet $LCTL --device %${!svc} readonly
+    do_facet $facet $LCTL --device %${!svc} notransno
+    do_facet $facet $LCTL mark "$facet REPLAY BARRIER on ${!svc}"
+    $LCTL mark "local REPLAY BARRIER on ${!svc}"
+}
+
 mds_evict_client() {
     UUID=`lctl get_param -n mdc.${mds1_svc}-mdc-*.uuid`
     do_facet mds1 "lctl set_param -n mdt.${mds1_svc}.evict_client 
$UUID"

-- 
Alexander "Zam" Zarochentsev
Staff Engineer
Lustre Group, Sun Microsystems

Andreas Dilger

2008-Aug-24 02:02 UTC

head link

[Lustre-devel] [RFC] "lctl readonly" modification proposal

On Aug 22, 2008  19:45 +0400, Alexander Zarochentsev
wrote:> On 20 August 2008 23:29:22 Andreas Dilger wrote:
> > On Aug 20, 2008  11:39 -0600, Peter J. Braam wrote:
> > > If I remember correctly the flush is only there to try to reduce
> > > rollback. However, given that failover may happen on a system
where
> > > the software is not fully responsive, one could question the
wisdom
> > > of this reduction.  In any case having more replay due to more
> > > rollback is harmless.
> >
> > One major caveat is that with mountconf we ALWAYS mark the device as
> > "readonly" when it is being unmounted.
> > If we don''t have the sync 
> > there I fear data loss after a clean server unmount, when all clients
> > are also being unmounted and cannot do replay.
> >
> > I''d be thrilled if this was fixed so a normal shutdown did
not do a
> > "force" unmount and set the device read-only, because that
would also
> > avoid leaving the journal needing recovery.
> 
> I think the proposed changes do not touch umount or mountconf.
> The affected procedures are:
> 
> 1. obd_fail_write calls lvfs_set_rdonly    
> 	-- IMO it might be better to not sync device there.
> 2. filter_iocontrol -> lvfs_set_rdonly  
> 	-- not changed, there is filter_sync() call right before
>            lvfs_set_readonly
> 3. mdt_iocontrol -> dt_ro/osd_ro -> __lvfs_set_rdonly
>         -- it is what I want to change
> 4. mdt_fail_write -> dt_ro/osd_ro -> __lvfs_set_rdonly
>         -- sync is needed?
I can''t see the details now, but in the case of a "normal"
unmount
of an MDT or OST it does the shutdown in "failover" mode, in order
to preserve the client state instead of evicting the clients.
Otherwise, an accidental unmount/shutdown would cause application
errors on the clients.

Nathan,
can you please look into this?  You know the twisty-turny mountconf
callbacks better than anyone.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Nathaniel Rutman

2008-Aug-28 17:11 UTC

head link

[Lustre-devel] [RFC] "lctl readonly" modification proposal

Andreas Dilger wrote:> On Aug 22, 2008  19:45 +0400, Alexander Zarochentsev wrote:
>   
>> On 20 August 2008 23:29:22 Andreas Dilger wrote:
>>     
>>> On Aug 20, 2008  11:39 -0600, Peter J. Braam wrote:
>>>       
>>>> If I remember correctly the flush is only there to try to
reduce
>>>> rollback. However, given that failover may happen on a system
where
>>>> the software is not fully responsive, one could question the
wisdom
>>>> of this reduction.  In any case having more replay due to more
>>>> rollback is harmless.
>>>>         IIRC one other reason for the flush is that loopback disks tend not to 
"really" flush everything to disk when asked, and additional sync
calls
seem to help.  So beware when running loopback disks...>>> One major caveat is that with mountconf we ALWAYS mark the device
as
>>> "readonly" when it is being unmounted.
>>> If we don''t have the sync 
>>> there I fear data loss after a clean server unmount, when all
clients
>>> are also being unmounted and cannot do replay.
>>>
>>> I''d be thrilled if this was fixed so a normal shutdown did
not do a
>>> "force" unmount and set the device read-only, because
that would also
>>> avoid leaving the journal needing recovery.
>>>       
>umount does either force or failover shutdown; failover sets readonly 
but force does not.  Test-framework regularly does both.  Andreas, if 
you want to avoid journal recovery, use umount -f.
Really read-only is intended to simulate a power loss, so I think sync 
before it is a bit of a cheat.    Having said that, I think there were 
real issues that prompted us to include the sync in the first place, and 
some heavy recovery testing (including loopback devs) is in order if it 
is removed.

Andreas Dilger

2008-Aug-28 20:25 UTC

head link

[Lustre-devel] [RFC] "lctl readonly" modification proposal

On Aug 28, 2008  10:11 -0700, Nathaniel Rutman wrote:> Andreas Dilger wrote:
> >>> I''d be thrilled if this was fixed so a normal
shutdown did not do a
> >>> "force" unmount and set the device read-only,
because that would also
> >>> avoid leaving the journal needing recovery.
>
> umount does either force or failover shutdown; failover sets readonly 
> but force does not.  Test-framework regularly does both.  Andreas, if 
> you want to avoid journal recovery, use umount -f.
Well, issue is that with normal mount/unmount the filesystem is always
dirty and needs journal recovery.  Also, for the DMU, where we will
likely not have "read-only" support we need to be able to just block
all client transactions and flush everything to disk, doing a normal
shutdown.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Nathaniel Rutman

2008-Aug-29 18:46 UTC

head link

[Lustre-devel] [RFC] "lctl readonly" modification proposal

Andreas Dilger wrote:> On Aug 28, 2008  10:11 -0700, Nathaniel Rutman wrote:
>   
>> Andreas Dilger wrote:
>>     
>>>>> I''d be thrilled if this was fixed so a normal
shutdown did not do a
>>>>> "force" unmount and set the device read-only,
because that would also
>>>>> avoid leaving the journal needing recovery.
>>>>>           
>> umount does either force or failover shutdown; failover sets readonly 
>> but force does not.  Test-framework regularly does both.  Andreas, if 
>> you want to avoid journal recovery, use umount -f.
>>     
>
> Well, issue is that with normal mount/unmount the filesystem is always
> dirty and needs journal recovery.  Also, for the DMU, where we will
> likely not have "read-only" support we need to be able to just
block
> all client transactions and flush everything to disk, doing a normal
> shutdown.
>   Perhaps we should remove the sync/readonly for failover shutdown, and 
instead add a OBD_FAIL_LOC (or just use the ioctl) for "power loss" 
failover testing.  One possible gotcha here is that sometimes it takes a 
long time for the umount to complete, and we want to make sure the 
failover partner does not try to start writing to the disk until the 
unmounting one is done.

Lustre devel - Aug 2008 - [RFC] "lctl readonly" modification proposal

[Lustre-devel] [RFC] "lctl readonly" modification proposal

[Lustre-devel] [RFC] "lctl readonly" modification proposal

[Lustre-devel] [RFC] "lctl readonly" modification proposal

[Lustre-devel] [RFC] "lctl readonly" modification proposal

[Lustre-devel] [RFC] "lctl readonly" modification proposal

[Lustre-devel] [RFC] "lctl readonly" modification proposal

[Lustre-devel] [RFC] "lctl readonly" modification proposal

[Lustre-devel] [RFC] "lctl readonly" modification proposal

[Lustre-devel] [RFC] "lctl readonly" modification proposal

[Lustre-devel] [RFC] "lctl readonly" modification proposal

[Lustre-devel] [RFC] "lctl readonly" modification proposal

[Lustre-devel] [RFC] "lctl readonly" modification proposal

[Lustre-devel] [RFC] "lctl readonly" modification proposal