A quick note on my setup: 40 OSSs, each with an md0 RAID 1 as the OST (2x 500 GB, AMD X2 5400, 8 GB DDR2), and a single MDT/MGS/MDS (dual quad-core Xeon, 32 GB DDR2, 4 TB RAID 6). I have everything mounted and working properly and am still going through the basic performance tests, but I am confused about the safety of the cluster.

I'm not concerned about the OSTs, because they are RAID 1 which I can recover quickly and monitor, but my question is: what if an OSS goes down? Will that cause corruption of the data, and what happens when it comes back up? I would also like to know whether you can dynamically add new OSS/OSTs to the cluster, or whether you have to unmount the clients and then remount after doing so.

Thanks for your help in advance.
On Mon, 2009-04-06 at 15:27 -0400, Christopher Deneen wrote:
> A quick note on my setup: 40 OSSs, each with an md0 RAID 1 as the OST
> (2x 500 GB, AMD X2 5400, 8 GB DDR2), and a single MDT/MGS/MDS (dual
> quad-core Xeon, 32 GB DDR2, 4 TB RAID 6). I have everything mounted and
> working properly and am still going through the basic performance tests,
> but I am confused about the safety of the cluster. I'm not concerned
> about the OSTs, because they are RAID 1 which I can recover quickly and
> monitor, but my question is: what if an OSS goes down? Will that cause
> corruption of the data?

Unless you also lose clients, no. In the event of an OSS going down, the client will not have gotten the reply back from the OST saying that its data was actually written to disk. Until the client gets such a reply, it holds on to that data so that if an OSS does crash, it can "replay" that transaction. Thus, all data is either physically on disk, or in client memory ready to be replayed to disk.

The one exception to this is if you have some cache between the OST and the disk that the OSS doesn't know about. The OSS might think data has been written to disk when in fact it has only reached the disk's cache. Should that disk go down, that is lost data and possible corruption. This is why we typically recommend disabling write caching on disk arrays unless they can survive a power event and recover, so that the disk is fully coherent with what the host thinks should be there.

> I would also like to know whether you can dynamically add new OSS/OSTs
> to the cluster, or whether you have to unmount the clients and then
> remount after doing so.

No, you don't have to remount the clients. You just add them as you need (a rough sketch of the commands is below).

b.
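For reference, a minimal sketch of adding a new OST to a live 1.6.x filesystem; the filesystem name, MGS NID, mount points, and disk names below are examples, not details from this thread:

    # Format the new RAID 1 device as an OST for the existing filesystem
    # (fsname and MGS NID are placeholders -- use your own).
    mkfs.lustre --ost --fsname=lustre --mgsnode=mgs@tcp0 /dev/md0

    # Mounting the OST registers it with the MGS; already-mounted clients
    # pick up the new target without being remounted.
    mount -t lustre /dev/md0 /mnt/lustre/ost0

    # Optionally disable the volatile write cache on the member disks,
    # since the OSS cannot account for data sitting only in a drive cache.
    hdparm -W 0 /dev/sda /dev/sdb

Clients should show the new OST in the output of "lfs df" once it is up.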
Brian,

> Unless you also lose clients, no. In the event of an OSS going down, the
> client will not have gotten the reply back from the OST saying that its
> data was actually written to disk. Until the client gets such a reply, it
> holds on to that data so that if an OSS does crash, it can "replay" that
> transaction. Thus, all data is either physically on disk, or in client
> memory ready to be replayed to disk.

What about if you have multiple clients, all with transactions open against the OSS? Now the OSS goes down and comes back. From what I understand, the server goes into recovery and rejects new connections until recovery is finished (correct?). What if all but one client reconnect, i.e. you lose one client: are the transactions of the successfully reconnected clients replayed, or are they discarded?

> > I would also like to know whether you can dynamically add new OSS/OSTs
> > to the cluster, or whether you have to unmount the clients and then
> > remount after doing so.
>
> No, you don't have to remount the clients. You just add them as you need.

Independent of the load? I think the 'official' statement was that the cluster has to be quiescent, i.e. no client activity. Is that (still) true?

TIA,
 Arne
On Apr 06, 2009 15:27 -0400, Christopher Deneen wrote:
> A quick note on my setup: 40 OSSs, each with an md0 RAID 1 as the OST
> (2x 500 GB, AMD X2 5400, 8 GB DDR2), and a single MDT/MGS/MDS (dual
> quad-core Xeon, 32 GB DDR2, 4 TB RAID 6).

There is no benefit at all to having such a large MDT filesystem, and in fact using RAID 6 will make metadata performance very bad. You should instead make this a RAID 1+0 with four (or however many) drives you have.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
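As a rough sketch of what that could look like with mdadm, under the assumption of four spare partitions and a fresh format (the device names and filesystem name here are illustrative, and rebuilding the MDT this way means backing up and reformatting the existing one):

    # Build a 4-disk RAID 1+0 array for the MDT (device names are examples).
    mdadm --create /dev/md1 --level=10 --raid-devices=4 \
          /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

    # Format it as a combined MGS/MDT for the filesystem.
    mkfs.lustre --fsname=lustre --mgs --mdt /dev/md1

    # Mount it to bring the metadata server up.
    mount -t lustre /dev/md1 /mnt/lustre/mdt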
On Tue, 2009-04-07 at 09:20 +0200, Arne Wiebalck wrote:
> Brian,

Arne,

> What about if you have multiple clients, all with transactions open
> against the OSS? Now the OSS goes down and comes back. From what I
> understand, the server goes into recovery and rejects new connections
> until recovery is finished (correct?).

Correct.

> What if all but one client reconnect, i.e. you lose one client: are the
> transactions of the successfully reconnected clients replayed, or are
> they discarded?

If the lost client has a transaction that needs to be replayed, all of the transactions up to that missing transaction are replayed, but all subsequent transactions are discarded, and when the recovery timer expires, recovery is aborted.

The semantics of this will change when VBR (version-based recovery) becomes available, in 1.8.something, where "something" might even be 0. In that case, only transactions actually dependent on the missing transactions will be discarded.

> Independent of the load? I think the 'official' statement was that the
> cluster has to be quiescent, i.e. no client activity. Is that (still)
> true?

Yes, that is the official statement, and I don't think any further testing has been done to change it officially, but I think the general feeling is that quiescence should not be necessary; we just don't have the scientific testing to be assured of that.

So if you want to be safe, quiesce the filesystem first. :-)

b.
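For what it's worth, a rough sketch of watching this from the OSS side on 1.6.x; the target name lustre-OST0000 is an example, and the /proc path assumes the usual 1.6 layout:

    # Show the recovery state of an OST while its OSS is coming back up;
    # this reports whether recovery is in progress or complete and how
    # many clients have reconnected so far.
    cat /proc/fs/lustre/obdfilter/lustre-OST0000/recovery_status

    # If a missing client is known to be gone for good, recovery can be
    # cut short instead of waiting for the timer to expire
    # (<devno> is the device number shown by "lctl dl").
    lctl --device <devno> abort_recovery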
Brian,

[snip]
> If the lost client has a transaction that needs to be replayed, all of
> the transactions up to that missing transaction are replayed, but all
> subsequent transactions are discarded, and when the recovery timer
> expires, recovery is aborted.

Thanks a lot for the details. So even clients working on disjoint files/objects will find that their data are lost when one client is lost during recovery? Until 1.8.something, that is :)

[snip]
> Yes, that is the official statement, and I don't think any further
> testing has been done to change it officially, but I think the general
> feeling is that quiescence should not be necessary; we just don't have
> the scientific testing to be assured of that.
>
> So if you want to be safe, quiesce the filesystem first. :-)

Well, if I can do that :-)

Thanks again!
 Arne
On Tue, Apr 07, 2009 at 08:14:55AM -0400, Brian J. Murrell wrote:

[snip]
> If the lost client has a transaction that needs to be replayed, all of
> the transactions up to that missing transaction are replayed, but all
> subsequent transactions are discarded, and when the recovery timer
> expires, recovery is aborted.
[snip]

Discarding all transactions causes a lot of collateral damage in a multi-cluster, mixed parallel job environment where "file-per-process" style I/O predominates. Could somebody remind me of the use cases protected by this behavior?

In the case of I/O to a shared file, aren't Lustre's error handling obligations met by evicting the single offending client? Perhaps I am thinking too provincially, because in our environment I/O to shared files generally (always?) takes place in the context of a parallel job, and the single client eviction and EIO (or reboot of the client) should be sufficient to terminate the whole job with an error.

Thanks,
Jim
On Tue, 2009-04-07 at 08:34 -0700, Jim Garlick wrote:
>
> Discarding all transactions

Only transactions subsequent to a missing transaction.

> causes a lot of collateral damage in a multi-cluster, mixed parallel job
> environment where "file-per-process" style I/O predominates.

Indeed, depending on where the AWOL client's transaction sits in the replay stream. If it was the last transaction, the loss is absolutely minimal, but if it was the first transaction, the loss is absolutely maximal.

> Could somebody remind me of the use cases protected by this behavior?

Simply transactional dependency.

If you don't know what the AWOL client did to a given file, you cannot reliably process any further updates to that file, and if you don't have the AWOL client to ask which files it has transactions for, everything subsequent to that client's transactions has to be suspect. While I don't have any examples off-hand, I am sure one of the devs who constantly have their fingers in replay can cite many actual scenarios where this is a problem.

> In the case of I/O to a shared file, aren't Lustre's error handling
> obligations met by evicting the single offending client?

No. All clients subsequently have to be evicted, per the above.

> Perhaps I am thinking too provincially, because in our environment I/O to
> shared files generally (always?) takes place in the context of a parallel
> job, and the single client eviction and EIO (or reboot of the client)
> should be sufficient to terminate the whole job with an error.

Yours is probably a scenario where VBR will do really well, then, given that VBR only serializes replay on truly dependent transactions rather than the single serial stream (of assumed dependent transactions) that replay currently operates with.

b.
On Tue, Apr 07, 2009 at 11:43:19AM -0400, Brian J. Murrell wrote:
> On Tue, 2009-04-07 at 08:34 -0700, Jim Garlick wrote:
> >
> > Discarding all transactions
>
> Only transactions subsequent to a missing transaction.
>
> > causes a lot of collateral damage in a multi-cluster, mixed parallel
> > job environment where "file-per-process" style I/O predominates.
>
> Indeed, depending on where the AWOL client's transaction sits in the
> replay stream. If it was the last transaction, the loss is absolutely
> minimal, but if it was the first transaction, the loss is absolutely
> maximal.
>
> > Could somebody remind me of the use cases protected by this behavior?
>
> Simply transactional dependency.
>
> If you don't know what the AWOL client did to a given file, you cannot
> reliably process any further updates to that file, and if you don't have
> the AWOL client to ask which files it has transactions for, everything
> subsequent to that client's transactions has to be suspect. While I
> don't have any examples off-hand, I am sure one of the devs who
> constantly have their fingers in replay can cite many actual scenarios
> where this is a problem.

For us, error handling is at best: abort the parallel job on EIO, throw away the output, and restart from the last checkpoint. I am virtually certain that nobody around here tries to recover from an EIO.

Also, if the AWOL node has actually rebooted, it will cause our resource manager to terminate the whole parallel job. This is good in the sense that it gives codes with poor I/O error handling a second chance of noticing the error before bad physics data has to be analyzed and explained. Not so with the collateral evictions.

So, it would be pretty easy for us to patch our 1.6.6-based Lustre to allow the transactions after the missed one to be committed and avoid the collateral evictions. We suspect this is a bad idea, but we are having a hard time imagining why. Any insight would be appreciated.

> > In the case of I/O to a shared file, aren't Lustre's error handling
> > obligations met by evicting the single offending client?
>
> No. All clients subsequently have to be evicted, per the above.
>
> > Perhaps I am thinking too provincially, because in our environment I/O
> > to shared files generally (always?) takes place in the context of a
> > parallel job, and the single client eviction and EIO (or reboot of the
> > client) should be sufficient to terminate the whole job with an error.
>
> Yours is probably a scenario where VBR will do really well, then, given
> that VBR only serializes replay on truly dependent transactions rather
> than the single serial stream (of assumed dependent transactions) that
> replay currently operates with.
>
> b.
On Tue, 2009-04-07 at 14:29 +0200, Arne Wiebalck wrote:
> Brian,

Arne,

> Thanks a lot for the details.

NP.

> So even clients working on disjoint files/objects will find that their
> data are lost when one client is lost during recovery?

Yes. Because, given the existing recovery design, there is no way to know that AWOL clients were working on disjoint files/objects.

> Until 1.8.something, that is :)

Right. That is what VBR is all about.

> Well, if I can do that :-)

Good. I have opened a bug to formalize the testing of that, to see if we can be more positive about our situation there. I don't recall the number, but it should be easy to find.

b.