megan
2008-Jun-16 22:37 UTC
[Lustre-discuss] How do I recover files from partial lustre disk?
Greetings!

I am using the Lustre 2.6.18-53.1.13.el5_lustre.1.6.4.3smp kernel on a CentOS 5 x86_64 Linux box. We had a hardware problem that caused the underlying ext3 partition table to completely blow up. As a result, only three of five OSTs are mountable. The main Lustre filesystem of this unit cannot be mounted because the MDS knows that two of its parts are missing.

The underlying set-up is JBOD hardware passed to the Linux OS (via an LSI 8888ELP card in this case) as simple devices, i.e. sde, sdf, ... The simple devices were partitioned using parted and formatted ext3, then Lustre was built on top of the five ext3 units. There was no striping done across units/JBODs. Three of the five units passed an e2fsck and an lfsck. Those remaining units are mounted as follows:

    /dev/sdc   13T  6.3T  5.7T  53%  /srv/lustre/OST/crew4-OST0003
    /dev/sdd   13T  6.3T  5.7T  53%  /srv/lustre/OST/crew4-OST0004
    /dev/sdf   13T  6.2T  5.8T  52%  /srv/lustre/OST/crew4-OST0001

Given that it is unlikely we shall be able to recover the underlying ext3 on the other two units, is there some method by which I might try to rescue the data from the three units currently mounted on the OSS?

Any and all suggestions genuinely appreciated.

megan
Andreas Dilger
2008-Jun-18 04:48 UTC
[Lustre-discuss] How do I recover files from partial lustre disk?
On Jun 16, 2008  15:37 -0700, megan wrote:
> I am using Lustre 2.6.18-53.1.13.el5_lustre.1.6.4.3smp kernel on a
> CentOS 5 linux x86_64 linux box.
> We had a hardware problem that caused the underlying ext3 partition
> table to completely blow up.  This is resulting in only three of five
> OSTs being mountable.  The main lustre disk of this unit cannot be
> mounted because the MDS knows that two of its parts are missing.

It should be possible to mount a Lustre filesystem with OSTs that are not available. However, access to files on the unavailable OSTs will cause the process to wait on OST recovery.

> [...set-up details and df output trimmed...]
>
> Being that it is unlikely that we shall be able to recover the
> underlying ext3 on the other two units, is there some method by which
> I might try to rescue the data from these last three units mounted
> currently on the OSS?

The recoverability of your data depends heavily on the striping of the individual files (i.e. the default striping). If your files have a default stripe_count = 1, then you can probably recover 3/5 of the files in the filesystem. If your default stripe_count = 2, then you can probably only recover 1/5 of the files, and if you have a higher stripe_count you probably can't recover any files.
What you need to do is mount one of the clients and mark the corresponding OSTs inactive with:

    lctl dl    # get device numbers for OSC 0000 and OSC 0002
    lctl --device N deactivate

Then, instead of the clients waiting for the OSTs to recover, the client will get an IO error when it accesses files on the failed OSTs.

To get a list of the files that are on the good OSTs, run:

    lfs find --ost crew4-OST0001_UUID --ost crew4-OST0003_UUID \
             --ost crew4-OST0004_UUID {mountpoint}

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
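[Editor's note: the two lctl steps above can be scripted. Below is a minimal, hypothetical sketch of picking the OSC device numbers out of `lctl dl` output with awk; the sample output text and the OST names are assumptions standing in for a real client's `lctl dl` listing, and the real `lctl` invocation is left as a comment.]

```shell
#!/bin/sh
# Hypothetical sketch: find the OSC device numbers for the failed OSTs
# in `lctl dl` output, then build the deactivate commands.
# The sample text below stands in for real `lctl dl` output on a client.
LCTL_DL_OUTPUT='  7 UP osc crew4-OST0000-osc-ffff8100 crew4-mdtlov_UUID 5
  8 UP osc crew4-OST0001-osc-ffff8100 crew4-mdtlov_UUID 5
  9 UP osc crew4-OST0002-osc-ffff8100 crew4-mdtlov_UUID 5'

FAILED_OSTS="OST0000 OST0002"   # the two unrecoverable OSTs (assumed names)

DEACTIVATE_CMDS=""
for ost in $FAILED_OSTS; do
    # device number is the first field of the matching "osc" line
    dev=$(printf '%s\n' "$LCTL_DL_OUTPUT" |
          awk -v o="$ost" '$3 == "osc" && $4 ~ o { print $1 }')
    DEACTIVATE_CMDS="$DEACTIVATE_CMDS
lctl --device $dev deactivate"
done

# On a real client you would now run each command; here we only print them.
echo "$DEACTIVATE_CMDS"
```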
megan
2008-Jun-18 21:33 UTC
[Lustre-discuss] How do I recover files from partial lustre disk?
Thank you, Andreas! Your information is wonderful. I did the following:

I logged into my MDS (same as the MGS) and issued the commands:

    shell-prompt> mount -t lustre /dev/md1 /srv/lustre/mds/crew4-MDT0000

No errors so far.

    shell-prompt> lctl
    dl                  (found my nids of failed JBODs)
    device 14
    deactivate
    device 16
    deactivate
    quit

On one of our servers, I mounted the lustre disk /crew4. The disk will hang a UNIX df or ls command. However...

    lfs find --ost crew4-OST0001_UUID --ost crew4-OST0003_UUID \
             --ost crew4-OST0004_UUID -print /crew4

did indeed provide a list of files. I saved the list to a text file. I will next see if I am able to copy a single file to a new location.

Thank you again, Andreas, for this incredibly useful information. Do you/Sun do paid Lustre consulting by any chance?

Later,
megan

On Jun 18, 12:48 am, Andreas Dilger <adil... at sun.com> wrote:
> [...full quote of previous message trimmed...]
Andreas Dilger
2008-Jun-19 05:31 UTC
[Lustre-discuss] How do I recover files from partial lustre disk?
On Jun 18, 2008  14:33 -0700, megan wrote:
> shell-prompt> mount -t lustre /dev/md1 /srv/lustre/mds/crew4-MDT0000
>
> No errors so far.
>
> shell-prompt> lctl
> dl                  (found my nids of failed JBODs)
> device 14
> deactivate
> device 16
> deactivate
> quit
>
> On one of our servers, I mounted the lustre disk /crew4.
> The disk will hang a UNIX df or ls command.

You actually need to do the "deactivate" step on the client. Then "ls" will get EIO on the file, and "df" will return data only from the available OSTs.

> However....
> lfs find --ost crew4-OST0001_UUID --ost crew4-OST0003_UUID \
>          --ost crew4-OST0004_UUID -print /crew4
>
> Did indeed provide a list of files.  I saved the list to a text
> file.  I will next see if I am able to copy a single file to a new
> location.
>
> Thank you again Andreas for this incredibly useful information.  Do
> you/Sun do paid Lustre consulting by any chance?

Yes, in fact we do...

> [...remaining quoted text trimmed...]

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Charles Taylor
2008-Jun-19 10:42 UTC
[Lustre-discuss] How do I recover files from partial lustre disk?
Just some feedback on the item below... When we were getting started with Lustre about seven or eight months ago, we were having stability problems and losing a lot of time to Lustre-related issues. We were newbies and willing to pay someone to help us clear some initial hurdles. I contacted (by phone and email) ClusterFS (I guess this was just prior to the Sun purchase, I don't know) to ask about some consulting help and never heard back from anyone. We all want you to be successful, and this seems like a lost revenue opportunity.

FWIW, we are doing pretty well now and are mostly very happy with Lustre, and this list has been invaluable (to wit, "The Dilger Procedure"). However, I'm sure there is plenty that we still don't know, and we plan to attend some training at the next opportunity.

Regards,

Charlie Taylor
UF HPC Center

>> Thank you again Andreas for this incredibly useful information.  Do
>> you/Sun do paid Lustre consulting by any chance?
>
> Yes, in fact we do...
Ms. Megan Larko
2008-Jun-20 21:27 UTC
[Lustre-discuss] How do I recover files from partial lustre disk?
This is a follow-up from Megan on 20 June 2008: success getting file information from the remaining OSTs.

Per the advice of Andreas, I mounted my good OSTs on my OSS. I went to the MDT and mounted /srv/lustre/mds/crew4-MDT0000.

On a compute node (not a Lustre data OSS node), I mounted the disk (/crew4), then used lctl to identify the known bad nids in /crew4 and ran "device {bad-nid}" then "deactivate" for each one. Finally, I used Andreas's suggestion of:

    lfs find --ost crew4-OST0001_UUID --ost crew4-OST0003_UUID \
             --ost crew4-OST0004_UUID --print /crew4 >& crew4.find.20Jun08

I received a 759 MB text output file of the names of files still resident on the remaining OSTs. (...and there was great rejoicing!)

So: I want to cp those known/found files from the read-only mounted device named /crew4 onto some good space. May I just use the Linux system "cp" command, or is there a better Lustre command that should be used for this specific task?

Thanks bunches!
megan

On Thu, Jun 19, 2008 at 12:52 PM, Ms. Megan Larko <dobsonunit at gmail.com> wrote:
> Howdy,
>
> WRT consulting, our experience at CREW (my company) is similar.  Our
> company president contacted Sun about Lustre consulting and the Sun
> people to whom he spoke knew nothing about it.  We are still
> interested.  I have signed up for the Lustre class to be taught at
> Sun in San Jose, California on July 15-17, 2008.  I am still learning
> how to set up and manage my Lustre filesystem.
>
> Our company has also purchased and received (yesterday) seven Xstore
> 16-bay JBODs and 110 Hitachi Ultrastar 1TB SATA hard drives to add to
> our current Lustre system.  We do use InfiniBand.  I don't know if
> the current system (I inherited it) used quotas.  We have two new
> servers coming for the new disk space.  I am following the new Lustre
> 1.6 release thread with great interest, as I believe that is what I
> will put onto the new servers to serve the new disk space (Lustre
> format) we have just purchased.
>
> Can't get to that Lustre class soon enough.
>
> megan
>
> On Thu, Jun 19, 2008 at 6:42 AM, Charles Taylor <taylor at hpc.ufl.edu> wrote:
>> [...quoted text trimmed...]
Andreas Dilger
2008-Jun-23 20:54 UTC
[Lustre-discuss] How do I recover files from partial lustre disk?
On Jun 20, 2008  17:27 -0400, Ms. Megan Larko wrote:
> This is a follow-up from Megan on 20 June 2008:
> Success getting file information from remaining OSTs.
>
> [...recovery steps trimmed...]
>
> I received a 759 MB text output file of the names of files still
> resident on the remaining OSTs.  (...and there was great rejoicing!)
> So--- I want to cp those known/found file names from the read-only
> mounted device named /crew4 onto some good space.  May I just use a
> linux system "cp" command or is there a better lustre command that
> should be used for this specific task?

If the files are single-stripe files then "cp" is fine. If the files have multiple stripes (you can check with "lfs getstripe filename ...") then you should probably just skip them.

If there is data in a striped file that is valuable even if you only have e.g. every other 1MB of the file, then you can recover the readable parts of the file with:

    COUNT=$((($(stat -c '%s' {filename}) + 65535) / 65536))
    dd if={filename} of={savefilename} bs=64k count=$COUNT conv=sync,noerror

The unreadable parts of the file will be filled with binary 0 (NUL) bytes.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
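[Editor's note: the COUNT expression in the dd recipe above is a ceiling division — it rounds the file size up to a whole number of 64 KiB blocks so dd copies the final partial block too. A small self-contained sketch of just that arithmetic (the file paths in the trailing comments are made up for illustration):]

```shell
#!/bin/sh
# Ceiling division from the dd recipe: number of 64 KiB (65536-byte)
# blocks needed to cover a file of SIZE bytes.
blocks_for() {
    size=$1
    echo $(( (size + 65535) / 65536 ))
}

blocks_for 1        # a 1-byte file needs 1 block
blocks_for 65536    # an exactly-one-block file still needs 1
blocks_for 65537    # one byte over a block boundary needs 2

# In practice SIZE comes from stat, e.g. (hypothetical paths):
#   COUNT=$(blocks_for "$(stat -c '%s' /crew4/somefile)")
#   dd if=/crew4/somefile of=/backup/somefile bs=64k count=$COUNT conv=sync,noerror
```

With conv=sync,noerror, dd pads short or unreadable blocks with NUL bytes instead of aborting, which is why the copy can proceed past stripes that lived on the dead OSTs.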