Jakob Goldbach
2009-Jul-30 06:29 UTC
[Lustre-discuss] OST went back in time to 0 (bug 9646)
Hi,

I have a question on bug 9646 - Server went back in time.

I had an OSS crash and had to pull the power. After mounting Lustre again I see the following on one of my clients:

(import.c:909:ptlrpc_connect_interpret()) b-OST0010_UUID went back in time (transno 12901362807 was previously committed, server now claims 0)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646

The bug description suggests that commits were lost in the hardware cache - but how can it lose all commits (transno is zero)? (By the way, the cache is battery backed.)

On the client where I saw this, I had previously deactivated the import because of the crash. Is this the reason I'm seeing this transno as zero? (Full dmesg below.)

Thanks,
Jakob

2860:0:(import.c:508:import_select_connection()) b-OST0010-osc-ffff81022ce89800: tried all connections, increasing latency to 27s

setting import backup-OST0010_UUID INACTIVE by administrator request

8281:0:(import.c:508:import_select_connection()) b-OST0010-osc-ffff81022ce89800: tried all connections, increasing latency to 32s

167-0: This client was evicted by b-OST0010; in progress operations using this service will fail.

b-OST0010-osc-ffff81022ce89800: Connection restored to service b-OST0010 using nid 172.16.14.36 at tcp.

11-0: an error occurred while communicating with 172.16.14.36 at tcp. The ost_statfs operation failed with -11
...
11-0: an error occurred while communicating with 172.16.14.36 at tcp. The obd_ping operation failed with -107

b-OST0010-osc-ffff81022ce89800: Connection to service backup-OST0010 via nid 172.16.14.36 at tcp was lost; in progress operations using this service will wait for recovery to complete.

2859:0:(import.c:909:ptlrpc_connect_interpret()) b-OST0010_UUID went back in time (transno 12901362807 was previously committed, server now claims 0)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646

167-0: This client was evicted by backup-OST0010; in progress operations using this service will fail.

b-OST0010-osc-ffff81022ce89800: Connection restored to service b-OST0010 using nid 172.16.14.36 at tcp.
Andreas Dilger
2009-Jul-31 20:57 UTC
[Lustre-discuss] OST went back in time to 0 (bug 9646)
On Jul 30, 2009 08:29 +0200, Jakob Goldbach wrote:
> I have a question on bug 9646 - Server went back in time.
>
> I had an OSS crash and had to pull the power. After mounting lustre
> again I see the following on one of my clients:
>
> (import.c:909:ptlrpc_connect_interpret()) b-OST0010_UUID went back in
> time (transno 12901362807 was previously committed, server now claims
> 0)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
>
> This bug description suggests that there are commits lost in hardware
> cache - but how can it lose all commits (transno is zero)? (btw, cache
> is battery backed)
>
> On the client that I saw this I had previously deactivated the import
> because of the crash. Is this the reason I'm seeing this transno as
> zero? (full dmesg below)

The error message is a bit misleading. Bug 9646 describes the situation where the last_committed transaction number rolled back to some previous non-zero value. That indicates a serious problem in the storage.

In this case the client is not getting a complete reply and is looking at a last_committed value that was never properly filled in. That should probably be clarified in the manual.

This should probably also be filed as a bug: no check should be done if the reply was an error, and no message should be printed on the console.

> 2860:0:(import.c:508:import_select_connection())
> b-OST0010-osc-ffff81022ce89800: tried all connections, increasing
> latency to 27s
>
> setting import backup-OST0010_UUID INACTIVE by administrator request
>
> 8281:0:(import.c:508:import_select_connection())
> b-OST0010-osc-ffff81022ce89800: tried all connections, increasing
> latency to 32s
>
> 167-0: This client was evicted by b-OST0010; in progress operations
> using this service will fail.
>
> b-OST0010-osc-ffff81022ce89800: Connection restored to service b-OST0010
> using nid 172.16.14.36 at tcp.
>
> 11-0: an error occurred while communicating with 172.16.14.36 at tcp. The
> ost_statfs operation failed with -11
> ...
> 11-0: an error occurred while communicating with 172.16.14.36 at tcp. The
> obd_ping operation failed with -107
>
> b-OST0010-osc-ffff81022ce89800: Connection to service backup-OST0010 via
> nid 172.16.14.36 at tcp was lost; in progress operations using this service
> will wait for recovery to complete.
>
> 2859:0:(import.c:909:ptlrpc_connect_interpret()) b-OST0010_UUID went
> back in time (transno 12901362807 was previously committed, server now
> claims 0)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
>
> 167-0: This client was evicted by backup-OST0010; in progress operations
> using this service will fail.
>
> b-OST0010-osc-ffff81022ce89800: Connection restored to service b-OST0010
> using nid 172.16.14.36 at tcp.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
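The distinction drawn in this reply, between a real rollback to a lower non-zero transno and a last_committed field that was never filled in because the reply was an error, can be illustrated with a small sketch. This is not the Lustre code in import.c; the `ConnectReply` type, its fields, and the `went_back_in_time` function are hypothetical names invented for illustration, and the rule "skip the check when the reply carries an error status" is simply the behaviour suggested in the message.

```python
from dataclasses import dataclass


@dataclass
class ConnectReply:
    """Hypothetical stand-in for a server connect reply (not a Lustre type)."""
    status: int          # 0 on success, negative errno-style value on failure
    last_committed: int  # only meaningful when status == 0


def went_back_in_time(peer_committed: int, reply: ConnectReply) -> bool:
    """Return True only when a valid reply reports a lower committed transno."""
    if reply.status != 0:
        # Error reply: last_committed was never filled in, so do not compare.
        return False
    return reply.last_committed < peer_committed


# Transno taken from the console message quoted in this thread.
previously_committed = 12901362807

# Failed reconnect with an unfilled field: no alert should be raised.
print(went_back_in_time(previously_committed, ConnectReply(-107, 0)))
# Valid reply with a lower non-zero transno: the case bug 9646 describes.
print(went_back_in_time(previously_committed, ConnectReply(0, 12901000000)))
# Valid reply at the same transno: normal reconnect.
print(went_back_in_time(previously_committed, ConnectReply(0, 12901362807)))
```

The status value -107 and the lower transno 12901000000 are illustrative inputs only; the point is that the comparison is guarded by the reply status, so an unfilled zero does not trigger the warning.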
Mikhail Pershin
2009-Aug-01 13:29 UTC
[Lustre-discuss] OST went back in time to 0 (bug 9646)
On Thu, 30 Jul 2009 10:29:17 +0400, Jakob Goldbach <jakob at goldbach.dk> wrote:
> Hi,
>
> I have a question on bug 9646 - Server went back in time.
>
> I had an OSS crash and had to pull the power. After mounting lustre
> again I see the following on one of my clients:
>
> (import.c:909:ptlrpc_connect_interpret()) b-OST0010_UUID went back in
> time (transno 12901362807 was previously committed, server now claims
> 0)! See https://bugzilla.lustre.org/show_bug.cgi?id=9646
>
> This bug description suggests that there are commits lost in hardware
> cache - but how can it lose all commits (transno is zero)? (btw, cache
> is battery backed)
>
> On the client that I saw this I had previously deactivated the import
> because of the crash. Is this the reason I'm seeing this transno as
> zero? (full dmesg below)
>

Please see bug 19579. This is not a critical error and it is harmless, but it is indeed confusing to see such an error message. It is a false alert and will be fixed.

--
Mikhail Pershin
Staff Engineer
Lustre Group
Sun Microsystems, Inc.