On Mar 2, 2004, at 06:05, Jan Bruvoll wrote:

> Specifically, I am wondering about the following:
>
> - in the case of a master node disappearing, what is the correct
> procedure for starting the OST on the slave? Anything in particular I
> should think of?

The specific command depends on whether you are storing the config in
LDAP. If you are, you change the currently active server for the OST with
the lactive command, and then start lustre on the active server with a
normal lconf invocation for that node, i.e.:

  lconf --node <node name> --config <config> --ldapurl ldap://url

If you are not using LDAP, you need to specify the active node on the
lconf command line:

  lconf --node <node name> --select <ost-service>=<node name> config.xml

You need to be careful here that lustre has completely stopped on the
failed node before starting the second node. It's a good idea to power
off the failed OST just to be sure it's down.

> - is the case of an OST briefly disappearing from the network and then
> reappearing, with all connections reset, a problem for the cluster, or
> is this scenario already covered?

The clients will reconnect to the current active OST, so this should not
be a problem.

robert
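To make that concrete with purely hypothetical names (oss-a, oss-b, ost1
and fs-config are not from this thread): if the OST service ost1 normally
runs on oss-a and has to be brought up on its failover partner oss-b, the
two variants above become:

  # without LDAP: point the ost1 service at oss-b and start that node
  lconf --node oss-b --select ost1=oss-b config.xml

  # with LDAP: after switching the active server with lactive,
  # start lustre on oss-b against the stored config
  lconf --node oss-b --config fs-config --ldapurl ldap://ldap-host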
Hi Robert,

Robert Read wrote:

> The specific command depends on whether you are storing the config in
> LDAP. If you are, you change the currently active server for the OST
> with the lactive command, and then start lustre on the active server
> with a normal lconf invocation for that node, i.e.:
>
>   lconf --node <node name> --config <config> --ldapurl ldap://url
>
> If you are not using LDAP, you need to specify the active node on the
> lconf command line:
>
>   lconf --node <node name> --select <ost-service>=<node name> config.xml
>
> You need to be careful here that lustre has completely stopped on the
> failed node before starting the second node. It's a good idea to power
> off the failed OST just to be sure it's down.

Does this mean that I will have to alert all clients of the failed node,
i.e. this would not be handled by the cluster itself?

Hmm - one thing I definitely should mention: my set-up is such that I
have, for each "unit" of storage, two machines mirrored using drbd. These
machines also use heartbeat to decide among themselves which one of them
is to serve on the IP address of the "storage unit"; in my case I have
servers 172.16.3.1 and 172.16.3.2, but the lustre service (OST) is to be
found on IP 172.16.3.51. Will this confuse and/or simplify things?

With regards to power-down, the good thing about drbd is that read-only
access is handled transparently to lustre - so I don't -really- have to
know exactly which node is up or not. From the cluster's (lustre's)
perspective, on the other hand, I should keep track of things, I guess.

> The clients will reconnect to the current active OST, so this should
> not be a problem.

Again, I guess somebody has to tell them?

Thanks for your help!
Jan
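A minimal sketch of how such a pair might hand the service address around
with heartbeat v1 (the netmask and the preferred-owner choice are
assumptions, not taken from Jan's actual config):

  # /etc/ha.d/haresources - identical file on node-1a and node-1b
  # node-1a is the preferred owner; heartbeat moves 172.16.3.51 to
  # node-1b if node-1a stops answering, and drbd's own scripts decide
  # which side is primary for the mirrored device.
  node-1a IPaddr::172.16.3.51/24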
On Tue, Mar 02, 2004 at 02:05:38PM +0000, Jan Bruvoll wrote:

> Dear all,
>
> I got in at the deep end and I am now trying to set up a cluster of
> "non-stop" storage nodes using pairs of drbd-ed servers, each providing
> an OST to the cluster, all this without knowing too much about lustre.
> I'm having fun, though!

Sounds very much like what I'm doing. I'm at about the same level as you
at the moment.

> Ah, yes - my goal is to set up a storage cluster for a web farm, where
> the storage requirement is quite high, probably reaching into 50Tb
> quite soon. All storage needs to be accessible at all times, hence the
> use of drbd (awaiting lustre-internal mirroring). The cluster will
> further down the line be completed with an LVS setup to have all
> storage nodes, hot or warm, also contribute with CPU cycles for
> Apaches, application servers, etc.
>
> If you would like me to contribute back in the form of a HOWTO when
> everything's running, please let me know.

This is similar to what I'm doing. Maybe we can help each other out and
work on a HOWTO together? Anyone else interested in this?

I think the current HOWTO, while it gets you going, is lacking in detail.
I'm about half-way through the lustre manual now and only just starting
to get an idea of what's going on.

Cheers,
Paul.
Dear all,

I got in at the deep end and I am now trying to set up a cluster of
"non-stop" storage nodes using pairs of drbd-ed servers, each providing
an OST to the cluster, all this without knowing too much about lustre.
I'm having fun, though!

However, before I start paddling, I have a couple of questions I am
wondering if anybody has answers to. Maybe somebody already tried this -
I am sorry if this has been covered here already, but I couldn't find it
in the archives.

Specifically, I am wondering about the following:

- in the case of a master node disappearing, what is the correct
procedure for starting the OST on the slave? Anything in particular I
should think of?

- is the case of an OST briefly disappearing from the network and then
reappearing, with all connections reset, a problem for the cluster, or
is this scenario already covered?

If you have any other pointers, those would be most appreciated.

Ah, yes - my goal is to set up a storage cluster for a web farm, where
the storage requirement is quite high, probably reaching into 50Tb quite
soon. All storage needs to be accessible at all times, hence the use of
drbd (awaiting lustre-internal mirroring). The cluster will further down
the line be completed with an LVS setup to have all storage nodes, hot
or warm, also contribute with CPU cycles for Apaches, application
servers, etc.

If you would like me to contribute back in the form of a HOWTO when
everything's running, please let me know.

Best regards
Jan

--
Mr Jan Bruvoll        BRVL technology Ltd       Office: +44 7005 94 3430
Managing Director     Unit 303                  Fax:    +44 7005 93 8363
jan@brvl.com          5 King Edward's Road      Mobile: +44 7740 29 1600
www.brvl.com          London E9 7SG, UK
cc'ed the list for anyone else who's interested.

On Tue, Mar 23, 2004 at 12:47:45PM +0100, Sture Lygren wrote:

> Hello to both of you,
>
> I saw your posts on the lustre-discuss list, and it seems like what you
> are doing is exactly what I try to implement for a disk-failover
> solution here.
>
> Now to the question - have any of you had any luck so far? Could you
> let me know how you plan on implementing it? Maybe we could help each
> other out?

I've been thinking about it in a couple of ways. Ideally what I'd like is
some sort of network RAID-5. The problem there, I think, is that
everything has to go through the RAID controller systems, so you lose the
performance benefit of parallel I/O to the OSTs. I'm not sure how you'd
implement RAID-5 and keep that performance benefit, which is essential in
some applications. So, lustre is the way I've picked to go instead.

The main problem I have with lustre, though, is the shared back-end
storage. You have redundancy on the OSTs but not on the storage, so you
use fibre-channel or some sort of RAID box or whatever. I don't have
enough money for those types of solutions, and for a few reasons I'd like
to do it using commodity hardware. The rationale here is that you're not
tied to any particular vendor and off-the-shelf hardware is everywhere.
It's also cheaper in many cases! So, enter DRBD.

My thoughts are to possibly do it the way Jan is planning, though that
seems to have limitations. Namely, you have OST pairs sharing their
drives via DRBD. One is a failover for the other, but lustre also
supports OST failover. In this case, however, lustre's own OST failover
is not used at all: the two boxes appear to lustre as a single system,
with the failover controlled by some sort of heartbeat.

I'd rather have any OST available to fail over for any OST, i.e. I don't
want a specific OST designated as the failover for a particular OST. So,
we come back to the concept of shared storage. Using DRBD I can have
redundant back-end storage on separate boxes from the OSTs. Then, I put
several OSTs in front of each DRBD pair. Okay, it's not quite what I was
aiming for, but it might work until something better comes along (e.g.
RAID-1 in lustre). So, essentially I'm using lustre with the shared
storage model, but cutting costs by using DRBD instead of, say, a RAID
box. I'm also using commodity hardware to do it.

I don't think this is very scalable, though. The problem is that to
increase capacity, you need to set up a new DRBD system with more OSTs in
front of it. It would be better if you could add more DRBD storage
dynamically to an existing pair. I'm not sure what the best way round
that is. Maybe lustre will help there, maybe DRBD or some derivative will
come along with a way to do it. Maybe it can already be done with PVFS or
LVM or something similar.

Sorry for the long response, but I was trying to get my thoughts arranged
as well as replying to your question.

> Appreciate your responses.

Waiting on yours too!

> Best regards,
> Sture
>
> --
> Sture Lygren
> Computer Systems Administrator
> Andoya Rocket Range
> Work: +4776144451 / Fax: +4776144401

Cheers,
Paul.
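Whichever way the boxes are arranged, the DRBD piece itself is just a
two-node mirrored resource. A sketch only, written in the later 0.7-style
drbd.conf syntax rather than whatever release was current at the time,
and with invented hostnames, devices and addresses:

  resource ost-disk {
    protocol C;              # synchronous: a write is acknowledged only
                             # once both sides have it on disk
    on storage-a {
      device    /dev/drbd0;  # the device the OST actually sits on
      disk      /dev/sda3;   # local partition being mirrored
      address   172.16.3.1:7788;
      meta-disk internal;
    }
    on storage-b {
      device    /dev/drbd0;
      disk      /dev/sda3;
      address   172.16.3.2:7788;
      meta-disk internal;
    }
  }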
Sture Lygren wrote:

> Hello to both of you,
>
> I saw your posts on the lustre-discuss list, and it seems like what you
> are doing is exactly what I try to implement for a disk-failover
> solution here.
>
> Now to the question - have any of you had any luck so far? Could you
> let me know how you plan on implementing it? Maybe we could help each
> other out?
>
> Appreciate your responses.
>
> Best regards,
> Sture

Hello Sture,

first of all - hello to my old town - I grew up in Andenes - left about
20 years ago and sadly haven't been back (yet)!

So, back to lustre + DRBD: my setup now consists of 4 test nodes (all
running within VMware), where two and two are fail-over OSTs. The setup
is actually quite simple - the only thing I need to keep track of is the
DRBD failover, and all scripts are ready out of the box for that. On the
active and passive OST I just run the same config - the fact that no
requests come in on the passive node (because the clients go for the
service IP, as handled by heartbeat), as well as DRBD making sure that no
writes could ever take place on the inactive node, takes care of the
fail-over and STONITH bit.

Setup detail:

node-1a: 172.16.3.1, owns a storage device of 500Mb
node-1b: 172.16.3.2, also owns a storage device of 500Mb

These two use heartbeat to agree on who provides the storage device
(DRBD) service on IP address 172.16.3.51.

node-2a: 172.16.254.1, storage device of 500Mb
node-2b: 172.16.254.2, storage device of 500Mb

Again, these two agree on the service on 172.16.254.51.

The two OSTs on 172.16.3.51 and 172.16.254.51 jointly provide a LOV of
1Gb, and as far as I can see, this actually works quite well. Reads and
writes are only slightly delayed when I pull the plug on the master node,
and recoveries are quite smooth.

Next step is to do the same thing with the MDS - so far I've connected
node-1a and node-2a and used the same heartbeat "trick", but for obvious
reasons this needs looking at (mainly because the nodes are not in the
same network). The cleanest solution would most probably be to just use
node pair 1 for this, but that would again create a SPOF on the switch
connecting node pair 1 to the rest of the network. For the final setup
that's not so much of a problem, since all nodes will be hanging off the
same switch anyway.

Unfortunately, I still have a lot to learn about lustre itself, but it
seems I've so far been able to avoid and/or conceal this problem by using
my knowledge about supporting applications, technologies & approaches.
Hopefully this thing will work as well when scaled - this is going to
support a web-based application with large-scale (at least in our
understanding of "large") storage, i.e. in the range of hopefully up to
50-60Tb.

Hope this helps - I'd be happy to help out if there's anything I can do.

Best regards
Jan
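For anyone trying to reproduce this, the lustre side of the layout Jan
describes boils down to a 1.x-era lmc configuration script along these
lines. Everything in it - node names, device paths, sizes, mount point,
stripe settings - is invented for illustration and is not Jan's actual
config; the key point is that lustre only ever sees the heartbeat service
addresses, so the drbd pairs behind them are invisible to it:

  #!/bin/sh
  # build config.xml for one MDS plus a two-OST LOV
  rm -f config.xml

  # MDS (here placed on node-1a; where the MDS should live is still open)
  lmc -m config.xml --add node --node mds-node
  lmc -m config.xml --add net --node mds-node --nid 172.16.3.1 --nettype tcp
  lmc -m config.xml --add mds --node mds-node --mds mds1 --fstype ext3 \
      --dev /dev/drbd1 --size 500000

  # the LOV that stripes across both OSTs
  lmc -m config.xml --add lov --lov lov1 --mds mds1 \
      --stripe_sz 1048576 --stripe_cnt 2 --stripe_pattern 0

  # one "node" per heartbeat service address; whichever physical box
  # currently holds the address answers for that OST
  lmc -m config.xml --add node --node ost-svc-1
  lmc -m config.xml --add net --node ost-svc-1 --nid 172.16.3.51 --nettype tcp
  lmc -m config.xml --add ost --node ost-svc-1 --lov lov1 --ost ost1 \
      --fstype ext3 --dev /dev/drbd0 --size 500000

  lmc -m config.xml --add node --node ost-svc-2
  lmc -m config.xml --add net --node ost-svc-2 --nid 172.16.254.51 --nettype tcp
  lmc -m config.xml --add ost --node ost-svc-2 --lov lov1 --ost ost2 \
      --fstype ext3 --dev /dev/drbd0 --size 500000

  # generic client profile
  lmc -m config.xml --add node --node client
  lmc -m config.xml --add net --node client --nid '*' --nettype tcp
  lmc -m config.xml --add mtpt --node client --path /mnt/lustre \
      --mds mds1 --lov lov1

Because both boxes in a drbd pair run the same config against the same
service node name, a failover only has to move the service IP and the
drbd primary role; nothing in config.xml changes.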