Hi,

I need help getting my filesystem back online. I've followed the instructions in section 27.2, Recovering from Corruption in the Lustre File System, and in section 27.2.1, Working with Orphaned Objects. lctl dl shows all of my OSTs as IN instead of UP. If I try to activate them they are immediately disabled again with a message like 'oscc recovery failed: -22'. I've been working on this all day and believe that I need to reset the LAST_ID for one or more OSTs. Section 23.3.9 explains this process but stops short of showing an example, and I am not following what it is asking me to do. I really need to get this resolved ASAP. Can someone help?

Cheers,

David
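P.S. For reference, this is roughly what I have been doing on the MDS (the device number is only a placeholder taken from my lctl dl listing):

# list configured devices; the OSC entries for the OSTs all show IN instead of UP
lctl dl

# try to reactivate one of them; moments later it is deactivated again
# with 'oscc recovery failed: -22'
lctl --device 11 activate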
What version? Also can you post logging?

David Lee Braun <dbraun-Wuw85uim5zDR7s880joybQ@public.gmane.org> wrote:
> Hi,
>
> I need help getting my filesystem back online. I've followed the
> instructions in section 27.2 Recovering from Corruption in the Lustre
> File System and in section 27.2.1 Working with Orphaned Objects.
> [...]
Hi Colin,

I have lustre 2.3.0 from the git repository.

My issues started when a user submitted ~13,000 jobs that each compare ~200 1GB files to ~200 other 1GB files line by line. There were approximately 400 of these jobs running and the metadata server had a load between 300 and 400. Once I realized what was happening I limited the number of jobs that the user could run concurrently. A week went by, the user stated that the most data-intensive jobs had completed, and asked to have the limits removed. The next morning the lustre filesystem was read-only.

Here is a log extract:

Nov 27 21:55:48 storage-00 kernel: LustreError: 4504:0:(filter.c:3759:filter_handle_precreate()) scratch-OST0000: ignoring bogus orphan destroy request: obdid 413850672 last_id 413998151
Nov 27 21:55:48 storage-00 kernel: LustreError: 15778:0:(osc_create.c:610:osc_create()) scratch-OST0000-osc-MDT0000: oscc recovery failed: -22
Nov 27 21:55:48 storage-00 kernel: LustreError: 15778:0:(lov_obd.c:1063:lov_clear_orphans()) error in orphan recovery on OST idx 0/5: rc = -22
Nov 27 21:55:48 storage-00 kernel: LustreError: 15778:0:(lov_obd.c:1063:lov_clear_orphans()) Skipped 3 previous similar messages
Nov 27 21:55:48 storage-00 kernel: LustreError: 15778:0:(mds_lov.c:883:__mds_lov_synchronize()) scratch-OST0000_UUID failed at mds_lov_clear_orphans: -22
Nov 27 21:55:48 storage-00 kernel: LustreError: 15778:0:(mds_lov.c:883:__mds_lov_synchronize()) Skipped 3 previous similar messages
Nov 27 21:55:48 storage-00 kernel: LustreError: 15778:0:(mds_lov.c:903:__mds_lov_synchronize()) scratch-OST0000_UUID sync failed -22, deactivating
Nov 27 21:55:48 storage-00 kernel: LustreError: 15778:0:(mds_lov.c:903:__mds_lov_synchronize()) Skipped 4 previous similar messages
Nov 27 21:58:35 storage-00 kernel: LustreError: 15998:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Nov 27 21:58:35 storage-00 kernel: LustreError: 15998:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Skipped 1 previous similar message
Nov 27 21:58:35 storage-00 kernel: LustreError: 15998:0:(ldlm_request.c:1792:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Nov 27 21:58:35 storage-00 kernel: LustreError: 15998:0:(ldlm_request.c:1792:ldlm_cli_cancel_list()) Skipped 1 previous similar message
Nov 27 21:58:35 storage-00 kernel: LustreError: 15998:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Nov 27 21:58:35 storage-00 kernel: LustreError: 15998:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Skipped 9 previous similar messages
Nov 27 21:58:35 storage-00 kernel: LustreError: 15998:0:(ldlm_request.c:1792:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108

Any ideas?

Cheers,

David
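P.S. If I am reading that first log line correctly, the MDS's record of the last object created on OST0000 (obdid 413850672) is behind the OST's own LAST_ID (413998151), and I assume that mismatch is what comes back as the -22 from the orphan cleanup. This is roughly how I have been trying to compare the two values, based on my reading of section 23.3.9 (both targets are stopped first; /dev/mdtdev, /dev/sdX and /mnt/mdt are placeholders for my setup):

# on the MDS, with the MDT stopped: mount it read-only as ldiskfs
mount -t ldiskfs -o ro /dev/mdtdev /mnt/mdt
# lov_objid holds one 64-bit last-allocated object id per OST index
od -Ax -td8 /mnt/mdt/lov_objid

# on the OSS, with the OST device unmounted: read the on-disk LAST_ID for object group 0
debugfs -c -R 'dump /O/0/LAST_ID /tmp/LAST_ID' /dev/sdX
od -Ax -td8 /tmp/LAST_ID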
On 2013/12/02 12:31 PM, "David Lee Braun" <dbraun-Wuw85uim5zDR7s880joybQ@public.gmane.org> wrote:
> Hi Colin,
>
> I have lustre 2.3.0 from the git repository.

I would say that 2.3.0 from git is not a good version to be running in the long term. This is a feature release that does not get any bug fixes. I'd really recommend that you update to 2.4.1 (2.4.2 is coming soon also) since this is a supported maintenance release.

None of the error messages you posted indicate why the filesystem went read-only. That will normally happen right before the OST or MDT prints a log like:

{some kind of serious error or corruption message}
LDiskfs-FS: remounting filesystem read-only
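If you want to confirm that, something like the following should turn up the remount event and the lines leading up to it in the syslog on the OSS and MDS nodes (the log path is a guess, adjust for your distribution):

# look for the ldiskfs remount-read-only event and its preceding context
grep -i -B 5 "remounting filesystem read-only" /var/log/messages*
# or check the kernel ring buffer if it has not rotated out yet
dmesg | grep -i -B 5 "read-only"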
Cheers, Andreas

--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division