Hi,

We're setting up fairly large Lustre 2.1.2 filesystems, each with 18 nodes and 159 resources all in one Corosync/Pacemaker cluster, as suggested by our vendor. We're getting mixed messages between our vendor and others on how large a Corosync/Pacemaker cluster will work well.

1. Are there Lustre Corosync/Pacemaker clusters out there of this size or larger?
2. If so, what tuning needed to be done to get them to work well?
3. Should we be looking more seriously into splitting this Corosync/Pacemaker cluster into pairs or sets of 4 nodes?

Right now, our configuration takes a long time to start/stop all resources (~30-45 minutes), and failing back OSTs puts a heavy load on the cib process on every node in the cluster. Under heavy IO load, many of the nodes will show as "unclean/offline" and many OST resources will show as inactive in crm status, despite the fact that every single MDT and OST is still mounted in the appropriate place. We are running 2 corosync rings, each on a private 1 GbE network. We have a bonded 10 GbE network for LNET.

Thanks,
Shawn
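For context, each of the 159 resources in a setup like this is typically a Pacemaker Filesystem primitive mounting one MDT or OST, plus location constraints naming its primary and failover server. A minimal sketch in crm shell syntax; the device path, mount point, scores, and node names below are made up for illustration:

    primitive ost0001 ocf:heartbeat:Filesystem \
        params device="/dev/mapper/ost0001" directory="/lustre/ost0001" fstype="lustre" \
        op monitor interval="120s" timeout="60s" \
        op start timeout="300s" \
        op stop timeout="300s"
    location ost0001-primary ost0001 100: oss01
    location ost0001-secondary ost0001 50: oss02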
FWIW, we are running HA Lustre using corosync/pacemaker. We broke our OSSs and MDSs out into individual HA *pairs*. We thought about other configurations, but this was our first step into corosync/pacemaker, so we decided to keep it as simple as possible. It seems to work well. I'm not sure I would attempt what you are doing, though it may be perfectly fine. When HA is a requirement, it probably makes sense to avoid pushing the limits of what works.

This doesn't really help you much other than to provide a data point with regard to what other sites are doing.

Good luck and report back.

Charlie Taylor
UF HPC Center
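For a standalone two-node pair like this, the usual Pacemaker settings are to ignore loss of quorum (a dead peer would otherwise freeze the survivor) and to require fencing. A rough sketch, not Charlie's actual configuration:

    crm configure property no-quorum-policy=ignore
    crm configure property stonith-enabled=true
    crm configure rsc_defaults resource-stickiness=100

The stickiness keeps targets from migrating back automatically the moment the peer returns, so failback can be done deliberately at a quiet time.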
Shawn,

In my opinion you shouldn't be running corosync on any more than two machines. They should be configured in self-contained pairs (MDS pair, OSS pairs). Anything beyond that would be chaos to manage, even if it worked.

Don't forget the STONITH portion; not every block storage implementation respects MMP protection.

--Jeff

--
Jeff Johnson
Co-Founder, Aeon Computing
jeff.johnson at aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x101  f: 858-412-3845  m: 619-204-9061
4170 Morena Boulevard, Suite D - San Diego, CA 92117
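A sketch of the fencing piece for one node of such a pair, using the external/ipmi STONITH agent in crm shell syntax; the hostnames, BMC address, and credentials are placeholders:

    primitive st-oss01 stonith:external/ipmi \
        params hostname="oss01" ipaddr="10.0.0.101" userid="admin" passwd="secret" interface="lanplus" \
        op monitor interval="300s"
    location st-oss01-placement st-oss01 -inf: oss01
    property stonith-enabled="true"

The -inf location constraint keeps the fencing resource off the node it is meant to fence, so the surviving peer is the one that pulls the trigger.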
Thanks for the replies. We've worked on the HA setup and have it to a satisfactory point where we can put it into production. We broke it into an MDS pair and 4 groups of 4 OSS nodes. From our perspective, it's actually easier to manage groups of 4 than groups of 2, since it's half as many configurations to keep track of.

After splitting the cluster into 5 pieces it has become much more responsive and stable. It's more difficult to manage than one large cluster, but the stability is obviously worth it. We've been performing heavy load testing and have not been able to "break" the cluster. We did a few more things to get to this point:

- Lowered the nice value of the corosync process to make it more responsive under load and prevent a node from getting kicked out due to unresponsiveness.
- Increased vm.min_free_kbytes to give TCP/IP with jumbo frames room to move around. Without this, certain nodes would hit low-memory issues related to networking and would get stonithed due to unresponsiveness.

Thanks,
Shawn
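A rough sketch of those two tweaks; Shawn doesn't give exact numbers, so the values here are placeholders that only show the mechanism:

    # Raise corosync's priority so it keeps getting CPU time under heavy IO load
    renice -n -10 -p $(pidof corosync)

    # Reserve more free memory for (jumbo-frame) network allocations
    echo "vm.min_free_kbytes = 262144" >> /etc/sysctl.conf
    sysctl -w vm.min_free_kbytes=262144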
Hi,

I'm also setting up a highly available Lustre system. I configured pairs for the OSSes and MDSes, redundant Corosync rings (two separate rings: IB and Eth), and STONITH is enabled. The current configuration seems to work fine; however, yesterday we experienced a problem when 4 OSSes got rebooted by STONITH. I suspect that Corosync missed a heartbeat due to a kernel/corosync hang, rather than a network problem. I will try the "renice" solution you proposed.

I have been thinking that I could increase the "token" timeout value in /etc/corosync/corosync.conf to prevent short "hiccups". Did you specify a value for this parameter, or did you leave the default 1000 ms?

Marco
Hi,

Our vendor actually has several of the parameters in corosync.conf increased by default, and we have not touched them. These are:

    token: 10000
    retransmits_before_loss: 25
    consensus: 12000
    join: 1000
    merge: 400
    downcheck: 2000

We also have secauth turned off, as it can consume 75% of your CPU cycles and cut bandwidth by a third, according to the corosync.conf man page.

I'm not sure these parameters are still necessary now that we have split our cluster up, but they haven't seemed to hurt anything either.

Hope this helps,
Shawn
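For reference, these values all live in the totem section of corosync.conf; the retransmit count is spelled token_retransmits_before_loss_const in the corosync.conf(5) man page. A sketch of how the settings above would look in place (interface and logging sections omitted):

    totem {
        version: 2
        secauth: off
        token: 10000
        token_retransmits_before_loss_const: 25
        consensus: 12000
        join: 1000
        merge: 400
        downcheck: 2000
        # interface { ... } subsections with the two ring addresses go here
    }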
> I will try the "renice" solution you proposed.

Re-nicing corosync should not be required, as the process is supposed to run with real-time priority anyway.

> I have been thinking that I could increase the "token" timeout value in
> /etc/corosync/corosync.conf to prevent short "hiccups". Did you
> specify a value for this parameter or did you leave the default 1000 ms?

We configured the token timeout to 17 seconds:

    totem {
        [....]
        transport: udpu
        rrp_mode: passive
        token: 17000
    }

This configuration has worked just fine for us for months: we haven't seen a single false-positive STONITH with it.

Regards,
Adrian
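A quick way to check whether corosync really is running with a real-time scheduling class on a given node (chrt is part of util-linux):

    chrt -p $(pidof corosync)

A SCHED_RR policy with a non-zero priority in the output means the scheduler already favors the process, in which case renicing adds little; if it reports SCHED_OTHER, the renice workaround is worth trying.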