Phil Schwan
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
Hi Chris--

Chris Samuel wrote:
> I've got a couple of quick questions:
>
> 1) how are people finding Lustre under 2.6 ?

I don't think many people are using it yet. It will very soon become a
first-class citizen of our testing infrastructure, with daily reports on
Buffalo and our staff using 2.6 for our daily work.

> 2) If I create an OST exporting an OSD on a SAN, can I later add another OST
> exporting that same OSD for failover/performance or will I have to reformat ?

For a "hot/warm" failover pair (where only one object storage server is
exporting that data at a time, with the other waiting patiently), you can
leave the data as it is. Of course, that gives you no performance benefit,
only redundancy.

For a "hot/hot" pair, where both servers export half of the SAN for
performance, you would still not have to reformat. You can use the ext3
resizer to shrink that partition and add a second alongside it.

-Phil
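[The resize step Phil mentions is not spelled out in the thread. A minimal
sketch of what it might look like follows, assuming the OST backing store is
/dev/sda1 and that a reasonably recent e2fsprogs (resize2fs) is available;
the device name and target size are purely illustrative, and the OST must be
stopped and the filesystem unmounted before shrinking.]

    # stop the OST and make sure nothing still has the filesystem mounted
    umount /dev/sda1
    # resize2fs refuses to shrink a filesystem that has not been checked
    e2fsck -f /dev/sda1
    # shrink the ext3 filesystem to (for example) half the device
    resize2fs /dev/sda1 500G
    # then shrink the sda1 partition and create sda2 in the freed space with
    # fdisk or parted, re-run e2fsck on sda1, and format sda2 for the new OST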
Phil Schwan
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
Chris Samuel wrote:
>> For a "hot/warm" failover pair (where only one object storage server is
>> exporting that data at a time, with the other waiting patiently), you
>> can leave the data as it is. Of course, that gives you no performance
>> benefit, only redundancy.
>>
>> For a "hot/hot" pair, where both servers export half of the SAN for
>> performance, you would still not have to reformat. You can use the ext3
>> resizer to shrink that partition and add a second alongside it.
>
> Does that mean the two are mutually exclusive ? I'd got the impression that
> you could have a situation where the two OSTs shared the load as in your
> hot/hot example, except running from the same partition, and if one of the
> two failed then the other would take over seamlessly.

They are not mutually exclusive, but they do not share a single partition --
that is the key.

What you describe is a "hot/hot" pair, but they don't share a single
partition, they share two partitions (which can be on the same SAN). When
one of the servers fails, the other can take over and start serving both
partitions.

-Phil
Chris Samuel
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
On Fri, 27 Feb 2004 04:27 am, Phil Schwan wrote:
> They are not mutually exclusive, but they do not share a single
> partition -- that is the key.
>
> What you describe is a "hot/hot" pair, but they don't share a single
> partition, they share two partitions (which can be on the same SAN).
> When one of the servers fails, the other can take over and start serving
> both partitions.

Ahhh.. mea culpa, now I understand. :-)

Thanks for that Phil!

--
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
Chris Samuel
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
On Fri, 27 Feb 2004 08:22 am, Phil Schwan wrote:
> There is more to worry about, too, in a hot/hot configuration. To give
> you one graphic example, we found in one installation that we had to
> reload the fibrechannel driver on the new OSS after the other node had
> failed, or we would read very old data from the disk. Reloading the FC
> driver is of course not an option when the node is already serving data
> from another partition.

Ouch! What was the kernel/module/controller combination ?

For the record we've got dual IBM FC HBAs (rebadged QLogic QLA2312s) on one
storage node, driven by the qla2300 driver.

The other storage node is in service (NFS serving local SCSI disks) and won't
get its two FC cards until we can shut the cluster down on the 15th, hence
asking about adding other nodes later. :)

--
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
Chris Samuel
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
On Fri, 27 Feb 2004 09:15 am, Phil Schwan wrote:
> It was a qlogic FC controller and qla2300 driver (the precise version
> escapes me), backed by a DataDirect raid array.

That's pretty much what we've got on the host side, with an IBM FAStT 600 and
an Apple XServe RAID on the SAN.

We're going to be using the IBM drivers for the IBM rebadged QLogics (source
rather than binary modules, fortunately) as they support multi-path failover
between controllers.

Worrying though, and no real way to test. :-(

--
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
Daire Byrne
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
One very last question on this subject - once you've failed over to one
machine, how do you go about bringing the failed machine back up again? As
you now have one machine serving both partitions/drives, I assume you have
to actually bring the cluster down and restart Lustre on the failover pair?
Or is there some sort of inbuilt recovery mode for this?

Daire

> Daire Byrne wrote:
> >
> > Now I'm really confused!
>
> "hot" nodes are actively serving data; "warm" nodes are idle, standing
> by to take over in the event of a failure.
>
> A "hot/hot" configuration has two nodes serving data all the time, and a
> "hot/warm" configuration has only one node serving data at any time. In
> both cases, one node will take over for the other in the event of a
> failure. I hope that helps.
>
> > So just to make this clear, for hot/hot
> > configuration :- I have two servers and one fibre channel RAID array
> > connected to both of them. On the RAID I make two partitions sda1 and
> > sda2. I set up one server to use sda1 and the other server to use sda2? I
> > group them with lconf so that either one can take over the serving of
> > the other server's partition?
>
> Correct. As you point out below, the precise configuration mechanics
> may not be terribly well documented yet.
>
> > Is this because file-locking is controlled by an OSS and not by the
> > metadata controller? So if you had two OST machines accessing the same
> > partition they wouldn't know about each other's locks?
>
> It is more fundamental than that -- imagine what would happen if you had
> a shared block device which contained an ext3 file system, and you
> mounted that file system read/write from two nodes at the same time.
> You'd have complete chaos and a very corrupt file system, because each
> node would be writing to the journal, and updating metadata, and so on,
> without the knowledge of the other.
>
> This is exactly what would happen if you configured two OSTs to share
> one physical partition, because Lustre uses ext3 as the backend disk
> file system.
>
> > In the failover config guides I've read for Lustre it always seems to
> > refer to setting the same partition for each OST, i.e. sda1 on both
> > servers. Maybe that was for "hot/warm". I'm probably too confused to be
> > making any sense now!
>
> That is almost certainly true. We designed the system originally for
> "hot/warm" failover, and the documentation is lagging.
>
> There is more to worry about, too, in a hot/hot configuration. To give
> you one graphic example, we found in one installation that we had to
> reload the fibrechannel driver on the new OSS after the other node had
> failed, or we would read very old data from the disk. Reloading the FC
> driver is of course not an option when the node is already serving data
> from another partition.
>
> -Phil
Chris Samuel
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
On Thu, 26 Feb 2004 01:50 pm, Phil Schwan wrote:
> Hi Chris--

Hi Phil,

> Chris Samuel wrote:
> > I've got a couple of quick questions:
> >
> > 1) how are people finding Lustre under 2.6 ?
>
> I don't think many people are using it yet. It will very soon become a
> first-class citizen of our testing infrastructure, with daily reports on
> Buffalo and our staff using 2.6 for our daily work.

That's excellent news. For the moment I think we want to stick with 2.4
(especially as one of the clusters I'd like to see use Lustre is running
RH-7.3).

> > 2) If I create an OST exporting an OSD on a SAN, can I later add another
> > OST exporting that same OSD for failover/performance or will I have to
> > reformat ?
>
> For a "hot/warm" failover pair (where only one object storage server is
> exporting that data at a time, with the other waiting patiently), you
> can leave the data as it is. Of course, that gives you no performance
> benefit, only redundancy.
>
> For a "hot/hot" pair, where both servers export half of the SAN for
> performance, you would still not have to reformat. You can use the ext3
> resizer to shrink that partition and add a second alongside it.

Does that mean the two are mutually exclusive ? I'd got the impression that
you could have a situation where the two OSTs shared the load as in your
hot/hot example, except running from the same partition, and if one of the
two failed then the other would take over seamlessly.

cheers,
Chris

--
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
Daire Byrne
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
On Thu, 26 Feb 2004, Phil Schwan wrote:
> Chris Samuel wrote:
> >
> >> For a "hot/warm" failover pair (where only one object storage server is
> >> exporting that data at a time, with the other waiting patiently), you
> >> can leave the data as it is. Of course, that gives you no performance
> >> benefit, only redundancy.
> >>
> >> For a "hot/hot" pair, where both servers export half of the SAN for
> >> performance, you would still not have to reformat. You can use the ext3
> >> resizer to shrink that partition and add a second alongside it.
> >
> > Does that mean the two are mutually exclusive ? I'd got the impression that
> > you could have a situation where the two OSTs shared the load as in your
> > hot/hot example, except running from the same partition, and if one of the
> > two failed then the other would take over seamlessly.
>
> They are not mutually exclusive, but they do not share a single
> partition -- that is the key.
>
> What you describe is a "hot/hot" pair, but they don't share a single
> partition, they share two partitions (which can be on the same SAN).
> When one of the servers fails, the other can take over and start serving
> both partitions.

Now I'm really confused!

So just to make this clear, for hot/hot configuration :- I have two servers
and one fibre channel RAID array connected to both of them. On the RAID I
make two partitions sda1 and sda2. I set up one server to use sda1 and the
other server to use sda2? I group them with lconf so that either one can
take over the serving of the other server's partition?

Is this because file-locking is controlled by an OSS and not by the metadata
controller? So if you had two OST machines accessing the same partition they
wouldn't know about each other's locks?

In the failover config guides I've read for Lustre it always seems to refer
to setting the same partition for each OST, i.e. sda1 on both servers. Maybe
that was for "hot/warm". I'm probably too confused to be making any sense
now!

Daire
Phil Schwan
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
Daire Byrne wrote:
>
> Now I'm really confused!

"hot" nodes are actively serving data; "warm" nodes are idle, standing by to
take over in the event of a failure.

A "hot/hot" configuration has two nodes serving data all the time, and a
"hot/warm" configuration has only one node serving data at any time. In both
cases, one node will take over for the other in the event of a failure. I
hope that helps.

> So just to make this clear, for hot/hot
> configuration :- I have two servers and one fibre channel RAID array
> connected to both of them. On the RAID I make two partitions sda1 and
> sda2. I set up one server to use sda1 and the other server to use sda2? I
> group them with lconf so that either one can take over the serving of
> the other server's partition?

Correct. As you point out below, the precise configuration mechanics may not
be terribly well documented yet.

> Is this because file-locking is controlled by an OSS and not by the
> metadata controller? So if you had two OST machines accessing the same
> partition they wouldn't know about each other's locks?

It is more fundamental than that -- imagine what would happen if you had a
shared block device which contained an ext3 file system, and you mounted
that file system read/write from two nodes at the same time. You'd have
complete chaos and a very corrupt file system, because each node would be
writing to the journal, and updating metadata, and so on, without the
knowledge of the other.

This is exactly what would happen if you configured two OSTs to share one
physical partition, because Lustre uses ext3 as the backend disk file
system.

> In the failover config guides I've read for Lustre it always seems to
> refer to setting the same partition for each OST, i.e. sda1 on both
> servers. Maybe that was for "hot/warm". I'm probably too confused to be
> making any sense now!

That is almost certainly true. We designed the system originally for
"hot/warm" failover, and the documentation is lagging.

There is more to worry about, too, in a hot/hot configuration. To give you
one graphic example, we found in one installation that we had to reload the
fibrechannel driver on the new OSS after the other node had failed, or we
would read very old data from the disk. Reloading the FC driver is of course
not an option when the node is already serving data from another partition.

-Phil
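[A rough sketch of the "group them with lconf" step discussed above, in the
style of the lmc/lconf tools of the Lustre 1.x era. The node names, device
names and exact option spellings here are assumptions for illustration, not
taken from the thread; the MDS and LOV definitions a complete config needs
are omitted, and you should check lmc/lconf for your release before using
anything like this.]

    # build the XML config: two OSS nodes, each primarily serving one
    # partition of the shared SAN, each flagged as a failover service
    lmc -o config.xml --add node --node oss1
    lmc -m config.xml --add net  --node oss1 --nid oss1 --nettype tcp
    lmc -m config.xml --add node --node oss2
    lmc -m config.xml --add net  --node oss2 --nid oss2 --nettype tcp
    lmc -m config.xml --add ost  --node oss1 --lov lov1 \
        --fstype ext3 --dev /dev/sda1 --failover
    lmc -m config.xml --add ost  --node oss2 --lov lov1 \
        --fstype ext3 --dev /dev/sda2 --failover

    # normal ("hot/hot") operation: each node starts its own services
    lconf --node oss1 config.xml     # run on oss1
    lconf --node oss2 config.xml     # run on oss2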
Phil Schwan
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
Chris Samuel wrote:
> On Fri, 27 Feb 2004 08:22 am, Phil Schwan wrote:
>
>> There is more to worry about, too, in a hot/hot configuration. To give
>> you one graphic example, we found in one installation that we had to
>> reload the fibrechannel driver on the new OSS after the other node had
>> failed, or we would read very old data from the disk. Reloading the FC
>> driver is of course not an option when the node is already serving data
>> from another partition.
>
> Ouch! What was the kernel/module/controller combination ?

It was a qlogic FC controller and qla2300 driver (the precise version
escapes me), backed by a DataDirect raid array.

The quick-fix of reloading the qlogic driver was enough to get us moving, so
we never determined the precise origin of the problem. It could have been
block layer caching in Linux, or some very strange behaviour of the
controller, or a misconfigured cache on the DataDirect. Or perhaps something
else.

My tale was mostly cautionary; if another customer encounters this and it
turns out to be a problem in Linux, we can probably fix it. If it turns out
to be in the controller or HBA, maybe not.

-Phil
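[For reference, "reloading the qlogic driver" amounts to something like the
following on the takeover node, before it starts serving the failed
partner's partition. This is a generic sketch, not a sequence quoted from
the thread, and it only works if nothing on the node is still using the
qla2300 module.]

    # unload and reload the FC HBA driver to drop any stale cached state
    modprobe -r qla2300
    modprobe qla2300
    # confirm the SAN LUNs have reappeared before starting the failed OST
    cat /proc/scsi/scsi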
Robert Read
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
Hi,

On Mar 3, 2004, at 02:41, Daire Byrne wrote:
>
> One very last question on this subject - once you've failed over to one
> machine how do you go about bringing the failed machine back up again? As
> you now have one machine serving both partitions/drives I assume you have
> to actually bring the cluster down and restart Lustre on the failover
> pair? Or is there some sort of inbuilt recovery mode for this?

Yes, we do have support for failing back to the original load-balanced
configuration without having to bring the cluster down. You first have to
stop the device in failover mode, and then start it up on the original node.
The clients will see it as another failover and recover in the same way they
did for the original failover.

robert
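[A sketch of the failback sequence Robert describes, continuing the
hypothetical oss1/oss2 configuration from earlier in the thread. The exact
lconf options varied between 1.x releases, so treat the flags and the
service-group name below as assumptions and confirm them against your
version's lconf documentation.]

    # on oss2, which has been serving oss1's OST since the failover:
    # stop that OST "in failover mode" so clients keep their state and wait
    lconf --cleanup --failover --group ost-oss1 config.xml

    # on the recovered oss1: start the OST again on its original node;
    # clients treat this as another failover and replay outstanding requests
    lconf --group ost-oss1 config.xml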
Chris Samuel
2006-May-19 07:36 UTC
[Lustre-discuss] Linux 2.6 stability & adding an OST to an OSD ?
Hi all,

We're looking at experimenting with Lustre for our new storage after seeing
Phil do his stuff at APAC in Canberra (I was the one asking questions over
the Access Grid, thanks for the answers!).

I've got a couple of quick questions:

1) how are people finding Lustre under 2.6 ?

2) If I create an OST exporting an OSD on a SAN, can I later add another OST
exporting that same OSD for failover/performance or will I have to reformat ?

cheers,
Chris

--
Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia