I am emptying a set of OST so that I can reformat the underlying RAID-6
more efficiently.  Two questions:
1. Is there a quick way to tell if the OST is really empty?  lfs_find
takes many hours to run.
2. When I reformat, I want it to retain the same ID so as to not make
"holes" in the list.  From the following information, am I correct to
assume that the id is 24?  If not, how do I determine the correct ID to
use when we re-create the file system?

/dev/sdj               3.5T  3.1T  222G   94%  /mnt/ost51
 10 UP obdfilter umt3-OST0018 umt3-OST0018_UUID 547
umt3-OST0018_UUID      3.4T  3.0T  221.1G  88%  /lustre/umt3[OST:24]
 20 IN osc umt3-OST0018-osc umt3-mdtlov_UUID 5

Thanks much,
bob
On 6 Nov 2010, at 14:24, Bob Ball wrote:
> I am emptying a set of OST so that I can reformat the underlying RAID-6
> more efficiently.  Two questions:
> 1. Is there a quick way to tell if the OST is really empty?  lfs_find
> takes many hours to run.

"lfs df -i" from a client, or simply "df -i" on the OSS node, will tell you while the OST is online; when it is offline you can remount it as type ldiskfs and use ls to verify that it is really empty.

> 2. When I reformat, I want it to retain the same ID so as to not make
> "holes" in the list.  From the following information, am I correct to
> assume that the id is 24?  If not, how do I determine the correct ID to
> use when we re-create the file system?

"tunefs.lustre --print /dev/sdj" will tell you the index in base 10.

Ashley.
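For concreteness, a minimal sketch of those checks, assuming the /dev/sdj device and /lustre/umt3 mount point from the original post; the temporary mount point /mnt/tmp is an assumption here, and exact output varies by Lustre release:

  # On a client: inode counts per OST; an emptied OST shows only a small
  # number of internal objects
  lfs df -i /lustre/umt3

  # On the OSS, with the OST stopped: mount read-only as ldiskfs and inspect
  mount -t ldiskfs -o ro /dev/sdj /mnt/tmp
  ls -lR /mnt/tmp/O/0 | head
  umount /mnt/tmp

  # Print the target configuration, including the index in decimal
  tunefs.lustre --print /dev/sdj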
On 11/06/2010 10:24 AM, Bob Ball wrote:
> I am emptying a set of OST so that I can reformat the underlying RAID-6
> more efficiently.  Two questions:
> 1. Is there a quick way to tell if the OST is really empty?  lfs_find
> takes many hours to run.

Yes ... df -H /path/to/OST/mount_point -- if you have more than a few megabytes in the used column, it is probably in use.  There are other mechanisms, but we find this one works quickest to get a baseline read of how full the OST is.

[...]

> /dev/sdj               3.5T  3.1T  222G   94%  /mnt/ost51
>  10 UP obdfilter umt3-OST0018 umt3-OST0018_UUID 547
> umt3-OST0018_UUID      3.4T  3.0T  221.1G  88%  /lustre/umt3[OST:24]
>  20 IN osc umt3-OST0018-osc umt3-mdtlov_UUID 5

This suggests you are at about 90% utilization of this OST (/dev/sdj).

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
On 2010-11-06, at 8:24, Bob Ball <ball at umich.edu> wrote:
> I am emptying a set of OST so that I can reformat the underlying RAID-6
> more efficiently.  Two questions:
> 1. Is there a quick way to tell if the OST is really empty?  lfs_find
> takes many hours to run.

If you mount the OST as type ldiskfs and look in the O/0/d* directories (capital-O, zero) there should be a few hundred zero-length objects owned by root.

> 2. When I reformat, I want it to retain the same ID so as to not make
> "holes" in the list.  From the following information, am I correct to
> assume that the id is 24?  If not, how do I determine the correct ID to
> use when we re-create the file system?

If you still have the existing OST, the easiest way to do this is to save the files last_rcvd, O/0/LAST_ID, and CONFIGS/*, and copy them into the reformatted OST.

> /dev/sdj               3.5T  3.1T  222G   94%  /mnt/ost51
>  10 UP obdfilter umt3-OST0018 umt3-OST0018_UUID 547
> umt3-OST0018_UUID      3.4T  3.0T  221.1G  88%  /lustre/umt3[OST:24]
>  20 IN osc umt3-OST0018-osc umt3-mdtlov_UUID 5

The OST index is indeed 24 (18 hex).  As for /dev/sdj, it is hard to know from the above info.  If you run "e2label /dev/sdj" the filesystem label should match the OST name umt3-OST0018.

Cheers, Andreas
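A rough sketch of that inspection and backup step, assuming the /dev/sdj device from the thread; the mount point /mnt/ost_ldiskfs and save directory /root/ost0018-save are assumed names:

  mkdir -p /mnt/ost_ldiskfs /root/ost0018-save/CONFIGS
  mount -t ldiskfs /dev/sdj /mnt/ost_ldiskfs

  # objects live under O/0/d0 .. d31; on an emptied OST anything left should
  # be zero-length and owned by root, so this should print nothing
  find /mnt/ost_ldiskfs/O/0 -type f ! -size 0 | head

  # save the per-target state before reformatting
  cp -a /mnt/ost_ldiskfs/last_rcvd    /root/ost0018-save/
  cp -a /mnt/ost_ldiskfs/O/0/LAST_ID  /root/ost0018-save/
  cp -a /mnt/ost_ldiskfs/CONFIGS/.    /root/ost0018-save/CONFIGS/
  umount /mnt/ost_ldiskfs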
Responding to everyone.... (and thanks to all)

> lfs df -i from a client or simply df -i from the OSS node ...
This still shows on the order of 100 inodes after the OST was emptied.

> "tunefs.lustre --print /dev/sdj" will tell you the index in base 10.
Yes, this worked.

> df -H /path/to/OST/mount_point
This still shows several hundred MB as well, in disparate amounts per OST I'm emptying.  An indicator, but not perfect.  Yes, the OST were near full to begin with.

More below.

On 11/6/2010 11:09 AM, Andreas Dilger wrote:
> On 2010-11-06, at 8:24, Bob Ball <ball at umich.edu> wrote:
>> I am emptying a set of OST so that I can reformat the underlying RAID-6
>> more efficiently.  Two questions:
>> 1. Is there a quick way to tell if the OST is really empty?  lfs_find
>> takes many hours to run.
> If you mount the OST as type ldiskfs and look in the O/0/d* directories (capital-O, zero) there should be a few hundred zero-length objects owned by root.

I assume this is ALL I should see.  Certainly the used inode count agrees with this.  I will look shortly.

>> 2. When I reformat, I want it to retain the same ID so as to not make
>> "holes" in the list.  From the following information, am I correct to
>> assume that the id is 24?  If not, how do I determine the correct ID to
>> use when we re-create the file system?
> If you still have the existing OST, the easiest way to do this is to save the files last_rcvd, O/0/LAST_ID, and CONFIGS/*, and copy them into the reformatted OST.

I will do that, thanks.

>> /dev/sdj               3.5T  3.1T  222G   94%  /mnt/ost51
>>  10 UP obdfilter umt3-OST0018 umt3-OST0018_UUID 547
>> umt3-OST0018_UUID      3.4T  3.0T  221.1G  88%  /lustre/umt3[OST:24]
>>  20 IN osc umt3-OST0018-osc umt3-mdtlov_UUID 5
> The OST index is indeed 24 (18 hex).  As for /dev/sdj, it is hard to know from the above info.  If you run "e2label /dev/sdj" the filesystem label should match the OST name umt3-OST0018.

I intend to be VERY careful with this.  Thank you all.  Any further advice before I do this, likely on Monday, will be greatly appreciated.

bob

> Cheers, Andreas
On 6 Nov 2010, at 19:28, Bob Ball wrote:
> I intend to be VERY careful with this.  Thank you all.  Any further
> advice before I do this, likely on Monday, will be greatly appreciated.

I believe it is possible to use udev to assign device names to devices, which would make the pathname both consistent across boots and configurable.  If it isn't possible with udev then I know it's possible with multipath, as we always use this on DDN systems (I work for DDN); it both handles failover between devices and allows them to be named as we choose.  I'm not saying we're not careful ourselves, but you need to be a lot less careful if you are working with /dev/mapper/ost_24 than if you are working with /dev/sdj.

Basically, install device-mapper-multipath and add the entries to /etc/multipath.conf.  On one of our test systems in the lab we have the following currently; this is a virtual test system, so only a single path to each device.

/dev/mapper/ost_sab_1      4128448   1579584   2339152  41% /lustre/sab/ost_1
/dev/mapper/ost_sab_0      4128448   1278200   2640536  33% /lustre/sab/ost_0

Ashley.

-- 
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
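A minimal sketch of the kind of /etc/multipath.conf entry involved, assuming device-mapper-multipath is installed; the WWID below is a placeholder that you would take from "multipath -ll" for the LUN backing that OST:

  multipaths {
      multipath {
          wwid   3600a0b8000123456000067890abcdef0   # placeholder WWID
          alias  ost_24
      }
  }

After reloading the multipathd service, the device then shows up as /dev/mapper/ost_24 regardless of which /dev/sdX name the kernel assigned on that boot.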
Hi, Andreas.

Tomorrow, we will redo all 8 OST on the first file server we are redoing.  I am very nervous about this, as a lot is riding on us doing this correctly.  For example, on a client now, if I umount one of the ost without first taking some (unknown to me) action on the MDT, then the client will hang on the "df" command.

So, while we are doing the reformat, is there any way to avoid this "hang" situation?

Is the --index=XX argument to mkfs.lustre hex, or decimal?  Seems from your comment below that this must be hex?

Finally, does supplying the --index even matter if we restore the files below that you mention?  That seems to be what you are saying.

Thanks much,
bob

On 11/6/2010 11:09 AM, Andreas Dilger wrote:
> On 2010-11-06, at 8:24, Bob Ball <ball at umich.edu> wrote:
>> [...]
> If you mount the OST as type ldiskfs and look in the O/0/d* directories (capital-O, zero) there should be a few hundred zero-length objects owned by root.
>
> If you still have the existing OST, the easiest way to do this is to save the files last_rcvd, O/0/LAST_ID, and CONFIGS/*, and copy them into the reformatted OST.
>
> The OST index is indeed 24 (18 hex).  As for /dev/sdj, it is hard to know from the above info.  If you run "e2label /dev/sdj" the filesystem label should match the OST name umt3-OST0018.
>
> Cheers, Andreas
BTW, the new OST sizes will be much different from the original OST sizes.  Is the "copy the old file" method below still valid in this case?

bob

On 11/7/2010 2:32 PM, Bob Ball wrote:
> Hi, Andreas.
>
> Tomorrow, we will redo all 8 OST on the first file server we are
> redoing.  I am very nervous about this, as a lot is riding on us doing
> this correctly.
> [...]
>
> On 11/6/2010 11:09 AM, Andreas Dilger wrote:
>> [...]
>> If you still have the existing OST, the easiest way to do this is to save the files last_rcvd, O/0/LAST_ID, and CONFIGS/*, and copy them into the reformatted OST.
>> [...]
On 7 Nov 2010, at 19:32, Bob Ball wrote:
> So, while we are doing the reformat, is there any way to avoid this
> "hang" situation?

I believe there is, but it escapes me at the minute.

> Is the --index=XX argument to mkfs.lustre hex, or decimal?  Seems from
> your comment below that this must be hex?

Decimal.

> Finally, does supplying the --index even matter if we restore the files
> below that you mention?  That seems to be what you are saying.

That's my understanding as well.

Use tunefs.lustre --print after the format to verify all the options are the same, including the index but also the flags, mount options and nodelist.  One thing I've seen people forget is to enable quotas on new (or replacement) OSTs, which has the effect of disabling quotas until you are able to run quotacheck - in a live filesystem this can be effectively permanent.

Ashley.

-- 
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
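A short sketch of that verification pass, using the device and target names from this thread; the file under /root is an assumed location for the pre-reformat record:

  # Record the configuration before the reformat ...
  tunefs.lustre --print /dev/sdj > /root/ost0018-tunefs.before

  # ... and compare after recreating the target
  tunefs.lustre --print /dev/sdj | diff -u /root/ost0018-tunefs.before -

  # The filesystem label should also match the OST name
  e2label /dev/sdj    # expect umt3-OST0018 for index 24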
Thanks, Ashley.  No quotas, fortunately.  Tomorrow will be "fun".

bob

On 11/7/2010 3:44 PM, Ashley Pittman wrote:
> On 7 Nov 2010, at 19:32, Bob Ball wrote:
>> So, while we are doing the reformat, is there any way to avoid this
>> "hang" situation?
> I believe there is, but it escapes me at the minute.
>
>> Is the --index=XX argument to mkfs.lustre hex, or decimal?  Seems from
>> your comment below that this must be hex?
> Decimal.
>
>> Finally, does supplying the --index even matter if we restore the files
>> below that you mention?  That seems to be what you are saying.
> That's my understanding as well.
>
> Use tunefs.lustre --print after the format to verify all the options are the same, including the index but also the flags, mount options and nodelist.  One thing I've seen people forget is to enable quotas on new (or replacement) OSTs, which has the effect of disabling quotas until you are able to run quotacheck - in a live filesystem this can be effectively permanent.
>
> Ashley.
None of the Lustre config files stores the OST size, so it should be fine.

Note that even if your OST isn't empty, you can just copy over all of the files into the newly-formatted filesystem, so long as you copy the xattrs with them.

Cheers, Andreas

On 2010-11-07, at 12:43, Bob Ball <ball at umich.edu> wrote:
> BTW, the new OST sizes will be much different from the original OST sizes.  Is the "copy the old file" method below still valid in this case?
>
> bob
>
> [...]
On 2010-11-07, at 12:32, Bob Ball <ball at umich.edu> wrote:
> Tomorrow, we will redo all 8 OST on the first file server we are redoing.  I am very nervous about this, as a lot is riding on us doing this correctly.  For example, on a client now, if I umount one of the ost, without first taking some (unknown to me) action on the MDT, then the client will hang on the "df" command.
>
> So, while we are doing the reformat, is there any way to avoid this "hang" situation?

If you issue "lctl --device %{OSC UUID} deactivate" on the MDS and clients, then any operations on those OSTs will immediately fail with an IO error.  If you are migrating objects off those OSTs, I would have imagined you already did this on the MDS, or new objects would have continued to be allocated there.

> Is the --index=XX argument to mkfs.lustre hex, or decimal?  Seems from your comment below that this must be hex?

Decimal, though it may also accept hex (I can't check right now).

> Finally, does supplying the --index even matter if we restore the files below that you mention?  That seems to be what you are saying.

Well, you still need to set the filesystem label.  This could be done with tune2fs, but you may as well specify the right index from the beginning.

> On 11/6/2010 11:09 AM, Andreas Dilger wrote:
>> [...]
>> If you still have the existing OST, the easiest way to do this is to save the files last_rcvd, O/0/LAST_ID, and CONFIGS/*, and copy them into the reformatted OST.
>> [...]
>> The OST index is indeed 24 (18 hex).  As for /dev/sdj, it is hard to know from the above info.  If you run "e2label /dev/sdj" the filesystem label should match the OST name umt3-OST0018.
>>
>> Cheers, Andreas
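A hedged sketch of what those two steps can look like on this system; the OSC device number 20 is taken from the lctl dl output in the original post, the fsname and index come from the thread, and the MGS NID is a placeholder:

  # On the MDS (and on clients): deactivate the OSC for the OST being replaced
  lctl dl | grep OST0018         # e.g. "20 IN osc umt3-OST0018-osc umt3-mdtlov_UUID 5"
  lctl --device 20 deactivate    # device number (or name) taken from lctl dl

  # Reformat the new RAID device, keeping the same (decimal) index
  mkfs.lustre --ost --fsname=umt3 --index=24 \
      --mgsnode=192.168.1.10@tcp0 /dev/sdj    # placeholder MGS NID

  # The label should come out matching the OST name
  e2label /dev/sdj               # expect umt3-OST0018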
OK, made new raid, made file system with same index, but they won't mount.  This is the error.  What can I do here?

bob

mounting device /dev/sdc at /mnt/ost12, flags=0 options=device=/dev/sdc
mount.lustre: mount /dev/sdc at /mnt/ost12 failed: Address already in use  retries left: 0
mount.lustre: mount /dev/sdc at /mnt/ost12 failed: Address already in use
The target service's index is already in use. (/dev/sdc)

On 11/8/2010 5:01 AM, Andreas Dilger wrote:
> [...]
Don't know if I sent to the whole list.  One of those days.

Remade the raid device, remade the lustre fs on it, but the disks won't mount.  Error is below.  How do I overcome this?

Thanks,
bob

mounting device /dev/sdc at /mnt/ost12, flags=0 options=device=/dev/sdc
mount.lustre: mount /dev/sdc at /mnt/ost12 failed: Address already in use  retries left: 0
mount.lustre: mount /dev/sdc at /mnt/ost12 failed: Address already in use
The target service's index is already in use. (/dev/sdc)

On 11/8/2010 5:01 AM, Andreas Dilger wrote:
> [...]
On 2010-11-08, at 11:39, Bob Ball wrote:
> Don't know if I sent to the whole list.  One of those days.
>
> Remade the raid device, remade the lustre fs on it, but the disks won't mount.  Error is below.  How do I overcome this?
>
> mounting device /dev/sdc at /mnt/ost12, flags=0 options=device=/dev/sdc
> mount.lustre: mount /dev/sdc at /mnt/ost12 failed: Address already in use  retries left: 0
> mount.lustre: mount /dev/sdc at /mnt/ost12 failed: Address already in use
> The target service's index is already in use. (/dev/sdc)

Looks like you didn't copy the old "CONFIGS/mountdata" file over the new one.  You can also use "--writeconf" (described in the manual and several times on the list) to have the MGS re-generate the configuration, which should fix this as well.

> On 11/8/2010 5:01 AM, Andreas Dilger wrote:
>> [...]

Cheers, Andreas
-- 
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
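A sketch of the two fixes Andreas describes, assuming the per-target files were saved beforehand into a directory such as /root/ost-save (an assumed path), and using the /dev/sdc device from the error above:

  # Fix 1: copy the saved files back onto the freshly formatted OST
  mount -t ldiskfs /dev/sdc /mnt/ost_ldiskfs
  mkdir -p /mnt/ost_ldiskfs/O/0              # in case O/0 does not exist yet
  cp -a /root/ost-save/last_rcvd   /mnt/ost_ldiskfs/
  cp -a /root/ost-save/LAST_ID     /mnt/ost_ldiskfs/O/0/LAST_ID
  cp -a /root/ost-save/CONFIGS/.   /mnt/ost_ldiskfs/CONFIGS/
  umount /mnt/ost_ldiskfs

  # Fix 2: have the MGS regenerate the configuration logs instead; see the
  # operations manual for the full writeconf procedure (it is normally done
  # on all targets with the filesystem stopped)
  tunefs.lustre --writeconf /dev/sdc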
On 08/11/2010 21:04, Andreas Dilger wrote:
> Looks like you didn't copy the old "CONFIGS/mountdata" file over the new one.  You can also use "--writeconf" (described in the manual and several times on the list) to have the MGS re-generate the configuration, which should fix this as well.

Tell me if I'm wrong regarding this OST update.  AFAIK, there are two ways to replace an OST with a new one:

Hot replace:
1 - Disable your OST on the MDT (lctl deactivate)
2 - Empty your OST
3 - Backup the magic files (last_rcvd, LAST_ID, CONFIGS/*)
4 - Deactivate the OST on all clients also
5 - Unmount the OST
6 - Replace, reformat using the same index
7 - Put back the backed-up magic files
8 - Restart the OST
9 - Activate the OST everywhere
(a command-level sketch of this sequence follows below)

Cold replace:
1 - Empty your OST
2 - Stop your filesystem
3 - Replace/reformat using the same index
4 - Restart using --writeconf
5 - Remount the clients

Did I miss something?

As far as I understand this, the important point here is to have the OST internal information in sync with what the MGS (CONFIGS/*) and the MDT (last_rcvd, LAST_ID) know.

What currently prevents a freshly formatted OST with the same index from registering itself properly (using the first_time flag) to the MGS and MDT when remounting, and:
- refreshing its CONFIGS from the MGS internal cache
- telling the MDT to reset the last_rcvd/LAST_ID it knows for this OST?
That way, we could have an easy way to hot-replace an OST.  How do you think this can be achieved?

Thanks

Aurélien
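For reference, a hedged command-level sketch of the hot-replace sequence above, reusing the names from this thread (fsname umt3, index 24, device /dev/sdj, mount point /mnt/ost51); the MGS NID is a placeholder, and this is a sketch of the steps rather than a tested recipe:

  # steps 1 and 4: deactivate the OSC on the MDS and on clients
  lctl --device umt3-OST0018-osc deactivate

  # steps 2-3: empty the OST, then save last_rcvd, O/0/LAST_ID and CONFIGS/*
  # (see the backup sketch earlier in the thread)

  # steps 5-6: unmount and reformat with the same index; --reformat is needed
  # because the device still carries the old target
  umount /mnt/ost51
  mkfs.lustre --ost --reformat --fsname=umt3 --index=24 \
      --mgsnode=192.168.1.10@tcp0 /dev/sdj

  # steps 7-9: restore the saved files, remount the OST, reactivate the OSC
  mount -t lustre /dev/sdj /mnt/ost51
  lctl --device umt3-OST0018-osc activate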
Yes, you are correct.  That was the key here, did not put that file back in place.  Back up and (so far) operating cleanly.

Thanks,
bob

On 11/8/2010 3:04 PM, Andreas Dilger wrote:
> On 2010-11-08, at 11:39, Bob Ball wrote:
>> Don't know if I sent to the whole list.  One of those days.
>>
>> Remade the raid device, remade the lustre fs on it, but the disks won't mount.  Error is below.  How do I overcome this?
>>
>> mounting device /dev/sdc at /mnt/ost12, flags=0 options=device=/dev/sdc
>> mount.lustre: mount /dev/sdc at /mnt/ost12 failed: Address already in use  retries left: 0
>> mount.lustre: mount /dev/sdc at /mnt/ost12 failed: Address already in use
>> The target service's index is already in use. (/dev/sdc)
> Looks like you didn't copy the old "CONFIGS/mountdata" file over the new one.  You can also use "--writeconf" (described in the manual and several times on the list) to have the MGS re-generate the configuration, which should fix this as well.
>
> [...]
>
> Cheers, Andreas
> -- 
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.
On 2010-11-08, at 14:18, Aurélien Degrémont wrote:
> Tell me if I'm wrong regarding this OST update.
> AFAIK, there are two ways to replace an OST with a new one:
>
> Hot replace:
> 1 - Disable your OST on the MDT (lctl deactivate)
> 2 - Empty your OST
> 3 - Backup the magic files (last_rcvd, LAST_ID, CONFIGS/*)
> 4 - Deactivate the OST on all clients also
> 5 - Unmount the OST
> 6 - Replace, reformat using the same index
> 7 - Put back the backed-up magic files
> 8 - Restart the OST
> 9 - Activate the OST everywhere
>
> Cold replace:
> 1 - Empty your OST
> 2 - Stop your filesystem
> 3 - Replace/reformat using the same index
> 4 - Restart using --writeconf
> 5 - Remount the clients

6 - fix up the MDS's idea of the OST's last-allocated object.

> Did I miss something?

Other than #6, it looks correct.

> As far as I understand this, the important point here is to have the OST
> internal information in sync with what the MGS (CONFIGS/*) and the MDT
> (last_rcvd, LAST_ID) know.

Right.

> What currently prevents a freshly formatted OST with the same
> index from registering itself properly (using the first_time flag) to the
> MGS and MDT when remounting, and:
> - refreshing its CONFIGS from the MGS internal cache
> - telling the MDT to reset the last_rcvd/LAST_ID it knows for this OST?
> That way, we could have an easy way to hot-replace an OST.
> How do you think this can be achieved?

It probably wouldn't be impossible to have a new OST gracefully replace an old one, if that is what the administrator wanted.  Some "special" action would need to be taken on the OST and/or MDT to ensure that this is what the admin wanted, instead of e.g. accidentally inserting some other OST with the same index and corrupting the filesystem because of duplicate object IDs, or not being able to access existing objects on the "real" OST at that index.

- the new OST would be best off to start allocating objects at the LAST_ID
  of the old OST, so that there is no risk of confusion between objects
- the MDT contains the old LAST_ID in its lov_objids file, and it sends this
  to the OST at connection time, this is no problem
- currently the new OST will refuse to allow the MDT to connect, because it
  detects that the old LAST_ID value from the MDT is inconsistent with its
  own value
- it would be relatively straightforward to have the OST detect if the local
  LAST_ID value was "new" and use the MDT value instead
- the danger is if the LAST_ID file was lost for some reason (e.g. corruption
  causes e2fsck to erase it).  in that case, the OST startup code should be
  smart enough to regenerate LAST_ID based on walking the object directories,
  which would also avoid the need to do this in e2fsck/lfsck (which can only
  run offline)
- in cases where the on-disk LAST_ID is much lower than the MDT-supplied
  value, the OST should just skip precreation of all the intermediate objects
  and just start using the new MDT value
- the only other thing is to avoid the case where a "new" OST is accidentally
  assigned the same index, when that isn't what is wanted.  There needs to be
  some way to "prime" the new OST (that is NOT the default for a newly
  formatted OST), or conversely tell the MDT that it should signal the new
  OST to take the place of the old one, so that there are not any mistakes

In conclusion, most of this is already close to working, but needs some amount of effort to get it tested and working smoothly.  Since this is something that has come up on this list a number of times in the last year, I guess it means that a Lustre filesystem is now outliving the hardware on which it runs, so it would definitely be worthwhile for someone to look at this.  I filed bug 24128 on this, in case anyone wants to work on it.

Cheers, Andreas
-- 
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
Hi Andreas Dilger a ?crit :>> Cold replace: >> 1 - Empty your OST >> 2 - Stop your filesystem >> 3 - Replace/reformat using the same index >> 4 - Restart using --writeconf >> 5 - Remount the clients > 6 - fix up the MDS''s idea of the OST''s last-allocated object. > >> Did I miss something ? > > Other than #6, it looks correct. >How do you fix #6? What are the actions needed for that?>> What is currently preventing, a freshly formatted OST with the same >> index, to register itself properly (using first_time flag) to MGS and >> MDT when remounting and: >> - refreshing its CONFIG from MGS internal cache >> - telling MDT to reset last_rcvd/LAST_ID it knows for this OST. >> That way, we could have an easy way to hot replace an OST. >> How do you think this can be achieved ? > > It probably wouldn''t be impossible to have a new OST gracefully replace an old one, if that is what the administrator wanted. Some "special" action would need to be taken on the OST and/or MDT to ensure that this is what the admin wanted, instead of e.g. accidentally inserting some other OST with the same index and corrupting the filesystem because of duplicate object IDs, or not being able to access existing objects on the "real" OST at that index. > > - the new OST would be best off to start allocating objects at the LAST_ID > of the old OST, so that there is no risk of confusion between objects > - the MDT contains the old LAST_ID in it''s lov_objids file, and it sends this > to the OST at connection time, this is no problem > - currently the new OST will refuse to allow the MDT to connect, because it > detects that the old LAST_ID value from the MDT is inconsistent with its > own value > - it would be relatively straight forward to have the OST detect if the local > LAST_ID value was "new" and use the MDT value insteadCan we based this check on ''first_time'' flag. I mean, OST update its LAST_ID based on what MDT tell it only if it has the ''first_time'' flag set.> - the danger is if the LAST_ID file was lost for some reason (e.g. corruption > causes e2fsck to erase it). in that case, the OST startup code should be > smart enough to regenerate LAST_ID based on walking the object directories, > which would also avoid the need to do this in e2fsck/lfsck (which can only > run offline) > - in cases where the on-disk LAST_ID is much lower than the MDT-supplied > value, the OST should just skip precreation of all the intermediate objects > and just start using the new MDT valueThis seems a different feature, even if related, which is "Better handling of LAST_ID corruption".> - the only other thing is to avoid the case where a "new" OST is accidentally > assigned the same index, when that isn''t what is wanted. There needs to be > some way to "prime" the new OST (that is NOT the default for a newly > formatted OST), or conversely tell the MDT that it should signal the new > OST to take the place of the old one, so that there are not any mistakesIndeed, this is important. And if we want to have this supports online replace. Another option when formatting OST? --replace ? Which is only accepted when --index is set?> Since this is something that has come up on this list a number of times in the last year, I guess it means that a Lustre filesystem is now outliving the hardware on which it runs, so it would definitely be worthwhile for someone to look at this. I filed bug 24128 on this, in case anyone wants to work on it.Can you also add it to Community project list? Thanks -- Aurelien Degremont CEA
On 2010-11-09, at 03:07, Aurelien Degremont wrote:
> Andreas Dilger wrote:
>>> Cold replace:
>>> [...]
>>> 5 - Remount the clients
>> 6 - fix up the MDS's idea of the OST's last-allocated object.
>>> Did I miss something?
>> Other than #6, it looks correct.
>
> How do you fix #6?  What are the actions needed for that?

That is what is described in the rest of this email...

>> - it would be relatively straightforward to have the OST detect if the local
>>   LAST_ID value was "new" and use the MDT value instead
>
> Can we base this check on the 'first_time' flag?
> I mean, the OST updates its LAST_ID based on what the MDT tells it only if it has the 'first_time' flag set.

The problem is that if the 'first_time' flag is always set on a new OST, then any OST accidentally claiming the same index (e.g. from a test filesystem of the same name, or from user error) could replace the valid OST.  This 'first_time' flag could not be the default.

>> - the danger is if the LAST_ID file was lost for some reason (e.g. corruption
>>   causes e2fsck to erase it). [...]
>> - in cases where the on-disk LAST_ID is much lower than the MDT-supplied
>>   value, the OST should just skip precreation of all the intermediate objects
>>   and just start using the new MDT value
>
> This seems a different feature, even if related, which is "Better handling of LAST_ID corruption".

Partly, yes.

>> - the only other thing is to avoid the case where a "new" OST is accidentally
>>   assigned the same index, when that isn't what is wanted.  There needs to be
>>   some way to "prime" the new OST (that is NOT the default for a newly
>>   formatted OST), or conversely tell the MDT that it should signal the new
>>   OST to take the place of the old one, so that there are not any mistakes
>
> Indeed, this is important, and it is needed if we want this to support online replace.  Another option when formatting the OST?
> --replace?  Which is only accepted when --index is set?

Yes, that would probably be a good way to handle it from the user interface.  The other question is how to handle this internally.  Probably a flag stored in the mountinfo or last_rcvd file.

>> Since this is something that has come up on this list a number of times in the last year, [...] I filed bug 24128 on this, in case anyone wants to work on it.
>
> Can you also add it to the Community project list?

Done.

Cheers, Andreas
-- 
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
Well, we ran 2 days, migrating files off OST, then this morning the MDT crashed.  Could not get all clients reconnected before seeing another kernel panic on the mdt.  Did an e2fsck of the mdt db and tried again.  Crashed again, but this time the logged message is:

2010-11-10T12:40:26-05:00 lmd01.local kernel: [12307.325340] Lustre: 6243:0:(mds_lov.c:330:mds_lov_update_objids()) Unexpected gap in objids
2010-11-10T12:40:27-05:00 lmd01.local kernel: [12308.347087] Lustre: 6243:0:(mds_lov.c:330:mds_lov_update_objids()) Unexpected gap in objids

I've seen this message elsewhere, but can't seem to find anything on it now, or what to do about it.

help?

bob

On 11/8/2010 4:27 PM, Bob Ball wrote:
> Yes, you are correct.  That was the key here, did not put that file back
> in place.  Back up and (so far) operating cleanly.
>
> [...]
If this helps, the console shows this stuff at the kernel panic, leaving out most of the addresses and offsets for this "retyping".

bob

:ptlrpc:ldlm_handle_enqueue
:mds:mds_handle
:lnet:lnet_match_blocked_msg
:ptlrpc:lustre_msg_get_conn_cnt
:ptlrpc:ptlrpc_server_handle_request
__activate_task
try_to_wake_up
lock_timer_base
__mod_timer
:ptlrpc:ptlrpc_main
default_wake_function
audit_syscall_exit
child_rip
:ptlrpc:ptlrpc_main
child_rip
Code: 41 8b 14 d3 89 54 24 54 31 d2 29 c5 89 6c 24 58 0f 84 bf 00
RIP [<ffffffff88c644ef>] :ldiskfs:do_split
RSP <ffff810422ae53b0>
CR2: ffff810acc143e38
<0> Kernel panic - not syncing: Fatal exception

On 11/10/2010 1:01 PM, Bob Ball wrote:
> Well, we ran 2 days, migrating files off OST, then this morning the MDT
> crashed.  Could not get all clients reconnected before seeing another
> kernel panic on the mdt.  Did an e2fsck of the mdt db and tried again.
> Crashed again, but this time the logged message is:
>
> 2010-11-10T12:40:26-05:00 lmd01.local kernel: [12307.325340] Lustre:
> 6243:0:(mds_lov.c:330:mds_lov_update_objids()) Unexpected gap in objids
> 2010-11-10T12:40:27-05:00 lmd01.local kernel: [12308.347087] Lustre:
> 6243:0:(mds_lov.c:330:mds_lov_update_objids()) Unexpected gap in objids
>
> I've seen this message elsewhere, but can't seem to find anything on it
> now, or what to do about it.
>
> help?
>
> bob
>
> [...]
On 2010-11-10, at 11:01, Bob Ball wrote:
> Well, we ran 2 days, migrating files off OST, then this morning, the MDT
> crashed. Could not get all clients reconnected before seeing another
> kernel panic on the MDT. Did an e2fsck of the MDT db and tried again.
> Crashed again, but this time the logged message is:
>
> 2010-11-10T12:40:26-05:00 lmd01.local kernel: [12307.325340] Lustre:
> 6243:0:(mds_lov.c:330:mds_lov_update_objids()) Unexpected gap in objids
> 2010-11-10T12:40:27-05:00 lmd01.local kernel: [12308.347087] Lustre:
> 6243:0:(mds_lov.c:330:mds_lov_update_objids()) Unexpected gap in objids
>
> I've seen this message elsewhere, but can't seem to find anything on it
> now, or what to do about it.

This might be a recovery-only problem. Try mounting the MDS with the mount option "-o abort_recovery".

[...]

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
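As a concrete sketch of that suggestion (the MDT device path and mount point below are placeholders, not taken from Bob's setup; the mount.lustre man page also lists a shorter spelling of the option, abort_recov):

    # Unmount the MDT if it is currently mounted, then remount it while
    # skipping the recovery window. All clients are evicted rather than
    # replaying their in-flight requests.
    umount /mnt/mdt
    mount -t lustre -o abort_recovery /dev/mdt_device /mnt/mdt

The trade-off is that any client operations that were in progress at crash time fail with an error instead of being replayed, which is why this is normally a last resort when recovery itself keeps panicking the MDS.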
Yes, this brought us back up (sorry, took us a while). Clients see the system, and I can read and write files. But......

What have we lost by doing this? Can we now let it go and recover as usual? What is the next step here?

bob

On 11/10/2010 3:00 PM, Andreas Dilger wrote:
> On 2010-11-10, at 11:01, Bob Ball wrote:
>> Well, we ran 2 days, migrating files off OST, then this morning, the MDT
>> crashed. Could not get all clients reconnected before seeing another
>> kernel panic on the MDT.
>> [...]
>> I've seen this message elsewhere, but can't seem to find anything on it
>> now, or what to do about it.
>
> This might be a recovery-only problem. Try mounting the MDS with the mount option "-o abort_recovery".
>
> [...]
On 2010-11-10, at 14:40, Bob Ball wrote:
> Yes, this brought us back up (sorry, took us a while). Clients see the system, and I can read and write files. But......
>
> What have we lost by doing this? Can we now let it go and recover as usual? What is the next step here?

The abort_recovery option evicted all of the clients, so any of their in-progress operations would have failed. They have all since reconnected and no action is needed.

[...]

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
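For anyone who wants to double-check that state after the fact, a quick look on the MDS should show whether recovery is complete and how many clients reconnected versus were evicted. The parameter path below is from the 1.8-era proc layout, so treat it as an assumption rather than gospel:

    # On the MDS: report recovery state (COMPLETE/INACTIVE/RECOVERING),
    # connected clients, and evicted clients. The exact path may differ
    # on other Lustre versions.
    lctl get_param mds.*.recovery_status

Clients that were evicted simply reconnect on their next RPC, so a clean df or lfs df from a client node is also a reasonable sanity check that the filesystem is fully back.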