Phil Schwan
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
Hi Chris--

Chris Samuel wrote:
> I've got a couple of quick questions:
>
> 1) how are people finding Lustre under 2.6 ?

I don't think many people are using it yet. It will very soon become a
first-class citizen of our testing infrastructure, with daily reports on
Buffalo and our staff using 2.6 for our daily work.

> 2) If I create an OST exporting an OSD on a SAN, can I later add another OST
> exporting that same OSD for failover/performance or will I have to reformat ?

For a "hot/warm" failover pair (where only one object storage server is
exporting that data at a time, with the other waiting patiently), you can
leave the data as it is. Of course, that gives you no performance benefit,
only redundancy.

For a "hot/hot" pair, where both servers export half of the SAN for
performance, you would still not have to reformat. You can use the ext3
resizer to shrink that partition and add a second alongside it.

-Phil
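[The resize step Phil mentions is not spelled out in the thread. A minimal
sketch of what it might look like follows, assuming the OST backing store is
/dev/sda1 and that a reasonably recent e2fsprogs (resize2fs) is available;
the device name and target size are purely illustrative, and the OST must be
stopped and the filesystem unmounted before shrinking.]

    # stop the OST and make sure nothing still has the filesystem mounted
    umount /dev/sda1
    # resize2fs refuses to shrink a filesystem that has not been checked
    e2fsck -f /dev/sda1
    # shrink the ext3 filesystem to (for example) half the device
    resize2fs /dev/sda1 500G
    # then shrink the sda1 partition and create sda2 in the freed space with
    # fdisk or parted, re-run e2fsck on sda1, and format sda2 for the new OST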
Phil Schwan
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
Chris Samuel wrote:
>> For a "hot/warm" failover pair (where only one object storage server is
>> exporting that data at a time, with the other waiting patiently), you
>> can leave the data as it is. Of course, that gives you no performance
>> benefit, only redundancy.
>>
>> For a "hot/hot" pair, where both servers export half of the SAN for
>> performance, you would still not have to reformat. You can use the ext3
>> resizer to shrink that partition and add a second alongside it.
>
> Does that mean the two are mutually exclusive ? I'd got the impression that
> you could have a situation where the two OSTs shared the load as in your
> hot/hot example, except running from the same partition, and if one of the
> two failed then the other would take over seamlessly.

They are not mutually exclusive, but they do not share a single partition --
that is the key.

What you describe is a "hot/hot" pair, but they don't share a single
partition, they share two partitions (which can be on the same SAN). When
one of the servers fails, the other can take over and start serving both
partitions.

-Phil
Chris Samuel
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
On Fri, 27 Feb 2004 04:27 am, Phil Schwan wrote:
> They are not mutually exclusive, but they do not share a single
> partition -- that is the key.
>
> What you describe is a "hot/hot" pair, but they don't share a single
> partition, they share two partitions (which can be on the same SAN).
> When one of the servers fails, the other can take over and start serving
> both partitions.

Ahhh.. mea culpa, now I understand. :-)

Thanks for that Phil!

--
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
Chris Samuel
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
On Fri, 27 Feb 2004 08:22 am, Phil Schwan wrote:
> There is more to worry about, too, in a hot/hot configuration. To give
> you one graphic example, we found in one installation that we had to
> reload the fibrechannel driver on the new OSS after the other node had
> failed, or we would read very old data from the disk. Reloading the FC
> driver is of course not an option when the node is already serving data
> from another partition.

Ouch! What was the kernel/module/controller combination ?

For the record we've got dual IBM FC HBAs (rebadged QLogic QLA2312s) on one
storage node, driven by the qla2300 driver.

The other storage node is in service (NFS serving local SCSI disks) and won't
get its two FC cards until we can shut the cluster down on the 15th, hence
asking about adding other nodes later. :)

--
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
Chris Samuel
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
On Fri, 27 Feb 2004 09:15 am, Phil Schwan wrote:
> It was a qlogic FC controller and qla2300 driver (the precise version
> escapes me), backed by a DataDirect raid array.

That's pretty much what we've got on the host side, with an IBM FAStT 600 and
an Apple XServe RAID on the SAN.

We're going to be using the IBM drivers for the IBM rebadged QLogics (source
rather than binary modules, fortunately) as they support multi-path failover
between controllers.

Worrying though, and no real way to test. :-(

--
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
Daire Byrne
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
One very last question on this subject - once you've failed over to one
machine, how do you go about bringing the failed machine back up again? As
you now have one machine serving both partitions/drives, I assume you have
to actually bring the cluster down and restart Lustre on the failover pair?
Or is there some sort of inbuilt recovery mode for this?

Daire

> Daire Byrne wrote:
> >
> > Now I'm really confused!
>
> "hot" nodes are actively serving data; "warm" nodes are idle, standing
> by to take over in the event of a failure.
>
> A "hot/hot" configuration has two nodes serving data all the time, and a
> "hot/warm" configuration has only one node serving data at any time. In
> both cases, one node will take over for the other in the event of a
> failure. I hope that helps.
>
> > So just to make this clear, for hot/hot
> > configuration :- I have two servers and one fibre channel RAID array
> > connected to both of them. On the RAID I make two partitions sda1 and
> > sda2. I set up one server to use sda1 and the other server to use sda2? I
> > group them with lconf so that either one can take over the serving of
> > the other server's partition?
>
> Correct. As you point out below, the precise configuration mechanics
> may not be terribly well documented yet.
>
> > Is this because file-locking is controlled by an OSS and not by the
> > metadata controller? So if you had two OST machines accessing the same
> > partition they wouldn't know about each other's locks?
>
> It is more fundamental than that -- imagine what would happen if you had
> a shared block device which contained an ext3 file system, and you
> mounted that file system read/write from two nodes at the same time.
> You'd have complete chaos and a very corrupt file system, because each
> node would be writing to the journal, and updating metadata, and so on,
> without the knowledge of the other.
>
> This is exactly what would happen if you configured two OSTs to share
> one physical partition, because Lustre uses ext3 as the backend disk
> file system.
>
> > In the failover config guides I've read for Lustre it always seems to
> > refer to setting the same partition for each OST, i.e. sda1 on both
> > servers. Maybe that was for "hot/warm". I'm probably too confused to be
> > making any sense now!
>
> That is almost certainly true. We designed the system originally for
> "hot/warm" failover, and the documentation is lagging.
>
> There is more to worry about, too, in a hot/hot configuration. To give
> you one graphic example, we found in one installation that we had to
> reload the fibrechannel driver on the new OSS after the other node had
> failed, or we would read very old data from the disk. Reloading the FC
> driver is of course not an option when the node is already serving data
> from another partition.
>
> -Phil
Chris Samuel
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
On Thu, 26 Feb 2004 01:50 pm, Phil Schwan wrote:
> Hi Chris--

Hi Phil,

> Chris Samuel wrote:
> > I've got a couple of quick questions:
> >
> > 1) how are people finding Lustre under 2.6 ?
>
> I don't think many people are using it yet. It will very soon become a
> first-class citizen of our testing infrastructure, with daily reports on
> Buffalo and our staff using 2.6 for our daily work.

That's excellent news. For the moment I think we want to stick with 2.4
(especially as one of the clusters I'd like to see use Lustre is running
RH-7.3).

> > 2) If I create an OST exporting an OSD on a SAN, can I later add another
> > OST exporting that same OSD for failover/performance or will I have to
> > reformat ?
>
> For a "hot/warm" failover pair (where only one object storage server is
> exporting that data at a time, with the other waiting patiently), you
> can leave the data as it is. Of course, that gives you no performance
> benefit, only redundancy.
>
> For a "hot/hot" pair, where both servers export half of the SAN for
> performance, you would still not have to reformat. You can use the ext3
> resizer to shrink that partition and add a second alongside it.

Does that mean the two are mutually exclusive ? I'd got the impression that
you could have a situation where the two OSTs shared the load as in your
hot/hot example, except running from the same partition, and if one of the
two failed then the other would take over seamlessly.

cheers,
Chris

--
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
Daire Byrne
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
On Thu, 26 Feb 2004, Phil Schwan wrote:
> Chris Samuel wrote:
> >
> >> For a "hot/warm" failover pair (where only one object storage server is
> >> exporting that data at a time, with the other waiting patiently), you
> >> can leave the data as it is. Of course, that gives you no performance
> >> benefit, only redundancy.
> >>
> >> For a "hot/hot" pair, where both servers export half of the SAN for
> >> performance, you would still not have to reformat. You can use the ext3
> >> resizer to shrink that partition and add a second alongside it.
> >
> > Does that mean the two are mutually exclusive ? I'd got the impression that
> > you could have a situation where the two OSTs shared the load as in your
> > hot/hot example, except running from the same partition, and if one of the
> > two failed then the other would take over seamlessly.
>
> They are not mutually exclusive, but they do not share a single
> partition -- that is the key.
>
> What you describe is a "hot/hot" pair, but they don't share a single
> partition, they share two partitions (which can be on the same SAN).
> When one of the servers fails, the other can take over and start serving
> both partitions.

Now I'm really confused!

So just to make this clear, for hot/hot configuration :- I have two servers
and one fibre channel RAID array connected to both of them. On the RAID I
make two partitions sda1 and sda2. I set up one server to use sda1 and the
other server to use sda2? I group them with lconf so that either one can
take over the serving of the other server's partition?

Is this because file-locking is controlled by an OSS and not by the metadata
controller? So if you had two OST machines accessing the same partition they
wouldn't know about each other's locks?

In the failover config guides I've read for Lustre it always seems to refer
to setting the same partition for each OST, i.e. sda1 on both servers. Maybe
that was for "hot/warm". I'm probably too confused to be making any sense
now!

Daire
Phil Schwan
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
Daire Byrne wrote:
>
> Now I'm really confused!

"hot" nodes are actively serving data; "warm" nodes are idle, standing by to
take over in the event of a failure.

A "hot/hot" configuration has two nodes serving data all the time, and a
"hot/warm" configuration has only one node serving data at any time. In both
cases, one node will take over for the other in the event of a failure. I
hope that helps.

> So just to make this clear, for hot/hot
> configuration :- I have two servers and one fibre channel RAID array
> connected to both of them. On the RAID I make two partitions sda1 and
> sda2. I set up one server to use sda1 and the other server to use sda2? I
> group them with lconf so that either one can take over the serving of
> the other server's partition?

Correct. As you point out below, the precise configuration mechanics may not
be terribly well documented yet.

> Is this because file-locking is controlled by an OSS and not by the
> metadata controller? So if you had two OST machines accessing the same
> partition they wouldn't know about each other's locks?

It is more fundamental than that -- imagine what would happen if you had a
shared block device which contained an ext3 file system, and you mounted
that file system read/write from two nodes at the same time. You'd have
complete chaos and a very corrupt file system, because each node would be
writing to the journal, and updating metadata, and so on, without the
knowledge of the other.

This is exactly what would happen if you configured two OSTs to share one
physical partition, because Lustre uses ext3 as the backend disk file
system.

> In the failover config guides I've read for Lustre it always seems to
> refer to setting the same partition for each OST, i.e. sda1 on both
> servers. Maybe that was for "hot/warm". I'm probably too confused to be
> making any sense now!

That is almost certainly true. We designed the system originally for
"hot/warm" failover, and the documentation is lagging.

There is more to worry about, too, in a hot/hot configuration. To give you
one graphic example, we found in one installation that we had to reload the
fibrechannel driver on the new OSS after the other node had failed, or we
would read very old data from the disk. Reloading the FC driver is of course
not an option when the node is already serving data from another partition.

-Phil
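[A rough sketch of the "group them with lconf" step discussed above, in the
style of the lmc/lconf tools of the Lustre 1.x era. The node names, device
names and exact option spellings here are assumptions for illustration, not
taken from the thread; the MDS and LOV definitions a complete config needs
are omitted, and you should check lmc/lconf for your release before using
anything like this.]

    # build the XML config: two OSS nodes, each primarily serving one
    # partition of the shared SAN, each flagged as a failover service
    lmc -o config.xml --add node --node oss1
    lmc -m config.xml --add net  --node oss1 --nid oss1 --nettype tcp
    lmc -m config.xml --add node --node oss2
    lmc -m config.xml --add net  --node oss2 --nid oss2 --nettype tcp
    lmc -m config.xml --add ost  --node oss1 --lov lov1 \
        --fstype ext3 --dev /dev/sda1 --failover
    lmc -m config.xml --add ost  --node oss2 --lov lov1 \
        --fstype ext3 --dev /dev/sda2 --failover

    # normal ("hot/hot") operation: each node starts its own services
    lconf --node oss1 config.xml     # run on oss1
    lconf --node oss2 config.xml     # run on oss2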
Phil Schwan
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
Chris Samuel wrote:
> On Fri, 27 Feb 2004 08:22 am, Phil Schwan wrote:
>
>> There is more to worry about, too, in a hot/hot configuration. To give
>> you one graphic example, we found in one installation that we had to
>> reload the fibrechannel driver on the new OSS after the other node had
>> failed, or we would read very old data from the disk. Reloading the FC
>> driver is of course not an option when the node is already serving data
>> from another partition.
>
> Ouch! What was the kernel/module/controller combination ?

It was a qlogic FC controller and qla2300 driver (the precise version
escapes me), backed by a DataDirect raid array.

The quick-fix of reloading the qlogic driver was enough to get us moving, so
we never determined the precise origin of the problem. It could have been
block layer caching in Linux, or some very strange behaviour of the
controller, or a misconfigured cache on the DataDirect. Or perhaps something
else.

My tale was mostly cautionary; if another customer encounters this and it
turns out to be a problem in Linux, we can probably fix it. If it turns out
to be in the controller or HBA, maybe not.

-Phil
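[For reference, "reloading the qlogic driver" amounts to something like the
following on the takeover node, before it starts serving the failed
partner's partition. This is a generic sketch, not a sequence quoted from
the thread, and it only works if nothing on the node is still using the
qla2300 module.]

    # unload and reload the FC HBA driver to drop any stale cached state
    modprobe -r qla2300
    modprobe qla2300
    # confirm the SAN LUNs have reappeared before starting the failed OST
    cat /proc/scsi/scsi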
Robert Read
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
Hi,

On Mar 3, 2004, at 02:41, Daire Byrne wrote:
>
> One very last question on this subject - once you've failed over to one
> machine how do you go about bringing the failed machine back up again? As
> you now have one machine serving both partitions/drives I assume you have
> to actually bring the cluster down and restart Lustre on the failover
> pair? Or is there some sort of inbuilt recovery mode for this?

Yes, we do have support for failing back to the original load-balanced
configuration without having to bring the cluster down. You first have to
stop the device in failover mode, and then start it up on the original node.
The clients will see it as another failover and recover in the same way they
did for the original failover.

robert
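[A sketch of the failback sequence Robert describes, continuing the
hypothetical oss1/oss2 configuration from earlier in the thread. The exact
lconf options varied between 1.x releases, so treat the flags and the
service-group name below as assumptions and confirm them against your
version's lconf documentation.]

    # on oss2, which has been serving oss1's OST since the failover:
    # stop that OST "in failover mode" so clients keep their state and wait
    lconf --cleanup --failover --group ost-oss1 config.xml

    # on the recovered oss1: start the OST again on its original node;
    # clients treat this as another failover and replay outstanding requests
    lconf --group ost-oss1 config.xml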
Chris Samuel
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
Hi all,

We're looking at experimenting with Lustre for our new storage after seeing
Phil do his stuff at APAC in Canberra (I was the one asking questions over
the Access Grid, thanks for the answers!).

I've got a couple of quick questions:

1) how are people finding Lustre under 2.6 ?

2) If I create an OST exporting an OSD on a SAN, can I later add another OST
exporting that same OSD for failover/performance or will I have to reformat ?

cheers,
Chris

--
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia