Hi,

I need help getting my filesystem back online. I've followed the instructions in section 27.2, Recovering from Corruption in the Lustre File System, and in section 27.2.1, Working with Orphaned Objects. lctl dl shows all of my OSTs as IN instead of UP. If I try to activate them they are immediately disabled again with a message like 'oscc recovery failed: -22'. I've been working on this all day and believe that I need to reset the LAST_ID for one or more OSTs. Section 23.3.9 explains this process but stops short of showing an example, and I am not following what it is asking me to do. I really need to get this resolved ASAP. Can someone help?

Cheers,

David
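P.S. For reference, this is roughly what I have been doing on the MDS (the device number is only a placeholder taken from my lctl dl listing):

# list configured devices; the OSC entries for the OSTs all show IN instead of UP
lctl dl

# try to reactivate one of them; moments later it is deactivated again
# with 'oscc recovery failed: -22'
lctl --device 11 activate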
What version? Also can you post logging?

David Lee Braun <dbraun-Wuw85uim5zDR7s880joybQ@public.gmane.org> wrote:
> Hi,
>
> I need help getting my filesystem back online. I've followed the
> instructions in section 27.2 Recovering from Corruption in the Lustre
> File System and in section 27.2.1 Working with Orphaned Objects.
> [...]
Hi Colin,

I have lustre 2.3.0 from the git repository.

My issues started when a user submitted ~13,000 jobs that each compare ~200 1GB files to ~200 other 1GB files line by line. There were approximately 400 of these jobs running and the metadata server had a load between 300 and 400. Once I realized what was happening I limited the number of jobs that the user could run concurrently. A week went by, the user stated that the most data-intensive jobs had completed, and asked to have the limits removed. The next morning the lustre filesystem was read-only.

Here is a log extract:

Nov 27 21:55:48 storage-00 kernel: LustreError: 4504:0:(filter.c:3759:filter_handle_precreate()) scratch-OST0000: ignoring bogus orphan destroy request: obdid 413850672 last_id 413998151
Nov 27 21:55:48 storage-00 kernel: LustreError: 15778:0:(osc_create.c:610:osc_create()) scratch-OST0000-osc-MDT0000: oscc recovery failed: -22
Nov 27 21:55:48 storage-00 kernel: LustreError: 15778:0:(lov_obd.c:1063:lov_clear_orphans()) error in orphan recovery on OST idx 0/5: rc = -22
Nov 27 21:55:48 storage-00 kernel: LustreError: 15778:0:(lov_obd.c:1063:lov_clear_orphans()) Skipped 3 previous similar messages
Nov 27 21:55:48 storage-00 kernel: LustreError: 15778:0:(mds_lov.c:883:__mds_lov_synchronize()) scratch-OST0000_UUID failed at mds_lov_clear_orphans: -22
Nov 27 21:55:48 storage-00 kernel: LustreError: 15778:0:(mds_lov.c:883:__mds_lov_synchronize()) Skipped 3 previous similar messages
Nov 27 21:55:48 storage-00 kernel: LustreError: 15778:0:(mds_lov.c:903:__mds_lov_synchronize()) scratch-OST0000_UUID sync failed -22, deactivating
Nov 27 21:55:48 storage-00 kernel: LustreError: 15778:0:(mds_lov.c:903:__mds_lov_synchronize()) Skipped 4 previous similar messages
Nov 27 21:58:35 storage-00 kernel: LustreError: 15998:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Nov 27 21:58:35 storage-00 kernel: LustreError: 15998:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Skipped 1 previous similar message
Nov 27 21:58:35 storage-00 kernel: LustreError: 15998:0:(ldlm_request.c:1792:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
Nov 27 21:58:35 storage-00 kernel: LustreError: 15998:0:(ldlm_request.c:1792:ldlm_cli_cancel_list()) Skipped 1 previous similar message
Nov 27 21:58:35 storage-00 kernel: LustreError: 15998:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
Nov 27 21:58:35 storage-00 kernel: LustreError: 15998:0:(ldlm_request.c:1166:ldlm_cli_cancel_req()) Skipped 9 previous similar messages
Nov 27 21:58:35 storage-00 kernel: LustreError: 15998:0:(ldlm_request.c:1792:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108

Any ideas?

Cheers,

David
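P.S. If I am reading that first log line correctly, the MDS's record of the last object created on OST0000 (obdid 413850672) is behind the OST's own LAST_ID (413998151), and I assume that mismatch is what comes back as the -22 from the orphan cleanup. This is roughly how I have been trying to compare the two values, based on my reading of section 23.3.9 (both targets are stopped first; /dev/mdtdev, /dev/sdX and /mnt/mdt are placeholders for my setup):

# on the MDS, with the MDT stopped: mount it read-only as ldiskfs
mount -t ldiskfs -o ro /dev/mdtdev /mnt/mdt
# lov_objid holds one 64-bit last-allocated object id per OST index
od -Ax -td8 /mnt/mdt/lov_objid

# on the OSS, with the OST device unmounted: read the on-disk LAST_ID for object group 0
debugfs -c -R 'dump /O/0/LAST_ID /tmp/LAST_ID' /dev/sdX
od -Ax -td8 /tmp/LAST_ID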
On 2013/12/02 12:31 PM, "David Lee Braun" <dbraun-Wuw85uim5zDR7s880joybQ@public.gmane.org> wrote:
> Hi Colin,
>
> I have lustre 2.3.0 from the git repository.

I would say that 2.3.0 from git is not a good version to be running in the long term. This is a feature release that does not get any bug fixes. I'd really recommend that you update to 2.4.1 (2.4.2 is coming soon also) since this is a supported maintenance release.

None of the error messages you posted indicate why the filesystem went read-only. That will normally happen right before the OST or MDT prints a log like:

{some kind of serious error or corruption message}
LDiskfs-FS: remounting filesystem read-only
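If you want to confirm that, something like the following should turn up the remount event and the lines leading up to it in the syslog on the OSS and MDS nodes (the log path is a guess, adjust for your distribution):

# look for the ldiskfs remount-read-only event and its preceding context
grep -i -B 5 "remounting filesystem read-only" /var/log/messages*
# or check the kernel ring buffer if it has not rotated out yet
dmesg | grep -i -B 5 "read-only"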
Cheers, Andreas

--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division