Dam Thanh Tung
2009-Nov-18 15:54 UTC
[Lustre-discuss] MDS doesn't switch to failover OST node
Hi list,

I am encountering a problem with the OST-MDS connection. Because of a RAID card hang, our OST went down this morning, and when I tried to mount the failover node of that OST, a problem occurred: the MDS only sent requests to the OST which was down and didn't connect to our backup (failover) OST, so our backup solution was useless and we lost all data from that OST. It's really a disaster for me because we already lost all of our data once before with the same kind of problem: OST can't connect to MDS!

We use drbd between the OSTs to synchronize data. The backup (failover) node was mounted successfully without any error, but no client connected to it for recovery:

cat /proc/fs/lustre/obdfilter/lustre-OST0006/recovery_status
    status: RECOVERING
    recovery_start: 0
    time_remaining: 0
    connected_clients: 0/1
    delayed_clients: 0/1
    completed_clients: 0/1
    replayed_requests: 0/??
    queued_requests: 0
    next_transno: 30064771073

In the MDS's message log, we only saw the connection to our dead OST:

    Nov 18 22:44:03 MDS1 kernel: Lustre: Request x1314965674069373 sent from lustre-OST0006-osc to NID 192.168.1.66@tcp 56s ago has timed out (limit 56s).
    ......

The output of the lctl dl command on the MDS:

lctl dl
    0 UP mgs MGS MGS 25
    1 UP mgc MGC192.168.1.78@tcp 0681a267-849f-350c-5b2c-6869c794550f 5
    2 UP mdt MDS MDS_uuid 3
    3 UP lov lustre-mdtlov lustre-mdtlov_UUID 4
    4 UP mds lustre-MDT0000 lustre-MDT0000_UUID 15
    5 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5
    6 UP osc lustre-OST0003-osc lustre-mdtlov_UUID 5
    7 IN osc lustre-OST0006-osc lustre-mdtlov_UUID 5
    8 UP osc lustre-OST0004-osc lustre-mdtlov_UUID 5
    9 UP osc lustre-OST0005-osc lustre-mdtlov_UUID 5

I activated OST6 (lctl --device 7 activate), but it didn't help.

Could anyone tell me how to make the MDS connect to our backup OST (with IP address 192.168.1.67, for example) to bring our OST up? Any help would be really appreciated!
I hope to receive your answers or suggestions as soon as possible.

Best Regards
Brian J. Murrell
2009-Nov-18 16:10 UTC
[Lustre-discuss] MDS doesn't switch to failover OST node
On Wed, 2009-11-18 at 22:54 +0700, Dam Thanh Tung wrote:
> Hi list

Hi,

> MDS only sent request to the OST which was down and didn't connect to
> our backup (failover) OST, so our backup solution was useless, we lost
> all data from that OST.

I don't think you have actually lost any data. It's there. Your clients (and the MDS is one of them) just don't know to use the failover OSS that you have set up (but not told Lustre about).

> It's really a disaster for me because we even lost all of our data
> before with the same kind of problem: OST can't connect to MDS !!!!

Failures to connect between nodes do not result in data loss. The data is still there. You just need to have your clients access it.

> Could anyone tell me how to route MDS to connect to our backup OST
> ( with ip address 192.168.1.67 , for example ) ? , to bring our OST
> up ?

It sounds like you need to review the failover section of the manual. In summary, you need to tell the clients about failover nodes (--failnode) when you create the filesystem. You can add this feature after the fact with tunefs.lustre.

b.
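A minimal sketch of the after-the-fact route mentioned above, under stated assumptions: the device path /dev/drbd0 is a placeholder (the thread never names the OST's block device), 192.168.1.67@tcp is only the example failover address from the question, and the run helper with its DRY_RUN switch is a hypothetical wrapper added here so the commands are printed rather than executed by default. Whether the configuration logs also need regenerating (--writeconf) depends on the Lustre version and setup, so check the failover section of the manual before running anything for real.

```shell
#!/bin/sh
# Sketch only: register a failover NID on an existing, unmounted OST.
# ASSUMPTION: /dev/drbd0 is a placeholder for the OST block device.
# ASSUMPTION: 192.168.1.67@tcp is the failover OSS NID (example from the question).
OST_DEV="${OST_DEV:-/dev/drbd0}"
FAILNODE_NID="${FAILNODE_NID:-192.168.1.67@tcp}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical safety switch; set to 0 to really execute

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Record the failover NID in the target's on-disk configuration.
run tunefs.lustre --failnode="$FAILNODE_NID" "$OST_DEV"

# Show the stored parameters without changing anything, to confirm the result.
run tunefs.lustre --dryrun "$OST_DEV"
```

With the defaults, the script only prints the two tunefs.lustre invocations. Running it with DRY_RUN=0 on the OSS, while the OST is unmounted, would apply the change; the target then has to be remounted so the other nodes can learn about the failover NID.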