Lisa Giacchetti
2010-Oct-14 19:22 UTC
[Lustre-discuss] steps to take to replace a failed ost with permanent data loss
Hi,

I am looking for a definitive list of steps a Lustre cluster admin should take to recover from the following scenario:

1) An OST in the cluster has had a permanent data failure. The data cannot be recovered, but the device itself will be fixed. Please assume that the device is NOT mounted any more on the OSS it was being served from and therefore is NOT listed in the output of "lctl dl" on that OSS.
2) The lost data is not needed and there are no backups of it.
3) It would be beneficial to be able to replace the OST with the same device (i.e. reuse the index), but please include what is used in the "--index" parameter of each command, as the documentation on this is severely lacking.
4) The MGS and MDT are running on two separate servers.
5) There is no failover of any kind set up.

I have tried to find the appropriate steps to take and commands to use in the docs and have been unsuccessful. So unsuccessful that I have had to remake my entire cluster. If you need more clarification on the scenario before being able to tell me what steps to take, please ask for the info you need.

Anyone?

Lisa Giacchetti
Andreas Dilger
2010-Oct-16 05:31 UTC
[Lustre-discuss] steps to take to replace a failed ost with permanent data loss
If the old OST is still accessible, you can copy its last_rcvd file and O/0/LAST_ID file over to the reformatted OST, and it should take on the identity of the old OST. The only other thing that identifies the filesystem is the label, which should be set by mkfs.lustre if the index is specified.

Cheers, Andreas
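The copy described above could be sketched as follows. This is a minimal illustration, not a tested recovery procedure: it assumes both the old and the reformatted OST devices are already mounted as ldiskfs, and the function name, device names, and mount points are hypothetical.

```sh
# Sketch: copy the two files named above from an old OST to a freshly
# formatted one. Both arguments are directories where the OST block
# devices are already mounted as ldiskfs (hypothetical mount points).
copy_ost_identity() {
    old_mnt=$1
    new_mnt=$2
    # Refuse to do anything unless both source files exist.
    for f in last_rcvd O/0/LAST_ID; do
        if [ ! -f "$old_mnt/$f" ]; then
            echo "missing $old_mnt/$f" >&2
            return 1
        fi
    done
    mkdir -p "$new_mnt/O/0" || return 1
    # -p preserves mode, ownership and timestamps where possible.
    cp -p "$old_mnt/last_rcvd" "$new_mnt/last_rcvd" || return 1
    cp -p "$old_mnt/O/0/LAST_ID" "$new_mnt/O/0/LAST_ID" || return 1
}

# Example (device names and mount points are placeholders):
#   mount -t ldiskfs /dev/OLD_OST /mnt/ost_old
#   mount -t ldiskfs /dev/NEW_OST /mnt/ost_new
#   copy_ost_identity /mnt/ost_old /mnt/ost_new
#   umount /mnt/ost_old
#   umount /mnt/ost_new
```

The function only copies files between two directories, so it can be tried on scratch directories before it is pointed at real OST mounts.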
Lisa Giacchetti
2010-Oct-18 12:49 UTC
[Lustre-discuss] steps to take to replace a failed ost with permanent data loss
On 10/16/10 12:31 AM, Andreas Dilger wrote:
> If the old OST is still accessible, you can copy the last_rcvd file and O/0/LAST_ID file, copy them over to the reformatted OST, and it should take on the identity of the old OST.

The disk is no longer accessible, and I am going to reuse the same OST once it is repaired.

> The only other thing that identifies the filesystem is the label, which should be set by mkfs.lustre if the index is specified.

What exactly do you use for this index parameter? Can you give me a concrete example?

lisa
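As a hedged illustration of the index question: the OST index is the OST's number within the filesystem, and it appears as the four-digit hexadecimal suffix of the target label (for example, index 3 in a filesystem named testfs gives testfs-OST0003). The sketch below derives that label and prints, without running it, a mkfs.lustre command line for a given index. The function names, the filesystem name testfs, the MGS NID 10.0.0.1@tcp, and the device /dev/sdX are placeholders, and the thread does not settle whether further steps are needed before the MGS accepts a reused index.

```sh
# Sketch: map an OST index to the label mkfs.lustre would set.
ost_label() {
    # $1 = filesystem name, $2 = OST index as a decimal integer
    printf '%s-OST%04x\n' "$1" "$2"
}

# Sketch: print (not run) a format command for a replacement OST.
# --reformat destroys whatever is on the device, so this only echoes.
print_mkfs_command() {
    # $1 = fsname, $2 = MGS NID, $3 = OST index, $4 = block device
    echo "mkfs.lustre --ost --reformat --fsname=$1 --mgsnode=$2 --index=$3 $4"
}

# Example (all values are placeholders):
#   ost_label testfs 3
#   print_mkfs_command testfs 10.0.0.1@tcp 3 /dev/sdX
```

Printing the command first makes it possible to check the index and device by eye before anything destructive is executed.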
Hello,

I need the Lustre SNMP module to monitor my Lustre system with nagios-snmp. I tried to compile it with this command:

"./configure --with-linux=/usr/src/kernels/2.6.18-92.el5-x86_64/ --enable-snmp"

But at some point I get this error:

"checking for register_mib... no"

I have CentOS 5.2 and Lustre version 1.8.0. Is there any package I need to install?

Thanks!
Brian J. Murrell
2010-Oct-18 14:41 UTC
[Lustre-discuss] Compiling lustre with snmp feature
On Mon, 2010-10-18 at 16:37 +0200, Alfonso Pardo wrote:
> Hello,

Hi,

> I try to compile with command:
>
> "./configure --with-linux=/usr/src/kernels/2.6.18-92.el5-x86_64/ --enable-snmp"
>
> But I get the error in some point:
>
> "checking for register_mib... no"

You need to look in config.log and see why it's failing to find that.

> I have Centos 5.2 and lustre version 1.8.0

1.8.0 had a subsequent 1.8.0.1 release, which means that it fixed a critical bug. I would strongly advise upgrading, and since you are going to upgrade, it might as well be to 1.8.4, the latest release, where you will likely get more people's attention with questions and bug reports/fixes.

> Any package to install?

I don't know off-hand, which is why I gave you instructions to discover what the problem is exactly.

b.
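One way to follow the config.log advice is sketched below. Autoconf records each check in config.log as a "checking for ..." line followed by the compiler or linker command it ran, any error output, and a "result:" line, so printing the lines after the check usually shows the underlying failure. The function name is hypothetical, the 25-line context is an arbitrary choice, and the -A option assumes GNU or BSD grep.

```sh
# Sketch: show what configure actually ran, and what failed, for one check.
show_configure_check() {
    # $1 = path to config.log, $2 = symbol the check looks for
    # -n prints line numbers; -A 25 prints the 25 lines after each match.
    grep -n -A 25 "checking for $2" "$1"
}

# Example, run from the directory where ./configure was executed:
#   show_configure_check config.log register_mib
```

The lines between the "checking for register_mib" entry and its "result: no" entry are the part worth reading or posting to the list.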
Which critical bug does 1.8.0 have that was fixed in 1.8.0.1? I know 1.8.0 has bug #19528, but I don't know whether it was fixed in 1.8.0.1.
Hmm. My experience is that 20560 was the most disruptive issue in early 1.8.x releases, but that was fixed in 1.8.1.1.

Larry, 19528 was fixed in 1.8.1. You can check by looking at the patch and noting that the first release with a landed+ flag is 1.8.1. HTH

Larry wrote:
> which critical bug does 1.8.0 have and fixed in 1.8.0.1? I know 1.8.0 has bug #19528, but I don't know whether it fixed or not in 1.8.0.1
Brian J. Murrell
2010-Oct-19 14:40 UTC
[Lustre-discuss] Compiling lustre with snmp feature
On Tue, 2010-10-19 at 10:16 +0800, Larry wrote:
> which critical bug does 1.8.0 have and fixed in 1.8.0.1?

Our changelog (http://wiki.lustre.org/index.php/Use:Change_Log_1.8) for any given release always details the fixes that went into that release (http://wiki.lustre.org/index.php/Use:Change_Log_1.8#Changes_from_v1.8.0_to_v1.8.0.1). You can also typically use the wxyz-resolved bug alias to see what landed for release w.x.y.z. In this case that is 1801-resolved, which is bug 19394 (https://bugzilla.lustre.org/show_bug.cgi?id=19394).

Fixes for 4 major bugs went into 1.8.0.1, but none of them were "critical".

> I know 1.8.0 has bug #19528, but I don't know whether it fixed or not in 1.8.0.1

According to bugzilla, 1.8.1 was the first release it went into.

b.
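The alias naming rule mentioned above is simple enough to express as a one-line helper. This is only a sketch of the w.x.y.z convention as described in this message; the function name is hypothetical, and whether three-part release numbers such as 1.8.1 follow the same pattern is not stated in the thread.

```sh
# Sketch: turn a release number into the bugzilla alias described above.
resolved_alias() {
    # $1 = release number in w.x.y.z form, e.g. 1.8.0.1
    # Strip the dots, then append "-resolved".
    printf '%s-resolved\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Example:
#   resolved_alias 1.8.0.1
```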
Fortunately we don't need checksums, so we haven't seen 20560 so far. But we hit 19528 several days ago, and we have two choices:

1. Update to a later version, e.g. 1.8.1 or 1.8.4. I'd like to update to 1.8.4, but does that mean I would have to update the OFED driver as well? We are trying to avoid changing OFED.
2. Patch the current version with attachments 23648 and 23751 from bz 19528.

Because of the OFED driver, we may have to patch 1.8.0 instead of updating it.

On Tue, Oct 19, 2010 at 9:03 PM, Peter Jones <peter.x.jones at oracle.com> wrote:
> Hmm. My experience is that 20560 was the most disruptive issue in early 1.8.x releases, but that was fixed in 1.8.1.1. Larry, 19528 was fixed in 1.8.1. You can check by checking the patch and noting that the first release with a landed+ flag is 1.8.1. HTH
Thanks so much. I will update my Lustre to 1.8.4 in a few weeks. I have tried compiling the SNMP module for both Lustre 1.8.0 and 1.8.4, and I get the same error with both versions:

checking whether to try to build SNMP support... yes
checking for net-snmp-config... net-snmp-config
checking net-snmp/net-snmp-config.h usability... yes
checking net-snmp/net-snmp-config.h presence... yes
checking for net-snmp/net-snmp-config.h... yes
checking for register_mib... no
checking for register_mib... no
checking for SNMP support... no (see config.log for errors)
configure: error: SNMP support was requested, but unavailable

I have installed "net-snmp" "net-snmp-util" "net-snmp-devel" "lm_sensor" "lm_sensor-devel" on my CentOS 5.2 with kernel "2.6.18-92.el5-x86_64", and I have tested with "2.6.18-194.17.1.el5-x86_64" too.

I have checked config.log for errors, but I did not find any error related to the SNMP module.

Thanks

--
Alfonso Pardo Díaz
Unidad de Sistemas y Explotacion (USE)
CETA-CIEMAT
Calle Sola nº 1, Trujillo (CACERES)
Tel. 927 65 93 17
www.ceta-ciemat.es
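When configure output is long, it can help to pull out only the checks that came back negative before turning to config.log. The sketch below filters saved configure output for "checking ... no" lines and removes repeats. The function name and the file name configure.out are hypothetical.

```sh
# Sketch: list the configure checks that answered "no".
failed_checks() {
    # Reads configure output on stdin.
    # Matches lines such as "checking for register_mib... no" and
    # "checking for SNMP support... no (see config.log for errors)",
    # then drops duplicate lines.
    grep -E '^checking .*\.\.\. no( |$)' | sort -u
}

# Example (configure.out is a placeholder file name):
#   ./configure --with-linux=/usr/src/kernels/2.6.18-92.el5-x86_64/ --enable-snmp 2>&1 | tee configure.out
#   failed_checks < configure.out
```

Each check name printed this way can then be looked up in config.log to see the command and error behind it.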
OK, the Lustre SNMP module is compiled! I installed the newest net-snmp (v5.6) and the module compiled without problems. When I run snmpwalk I don't get any information, but my module is compiled and loaded with dlmod.

--
Alfonso Pardo Díaz