Kevin M. Myer
2005-Dec-17 16:05 UTC
[Fedora-directory-users] Replication - consumer failed to replay change
Hello,

I have three instances of FDS running. Two are in a multimaster configuration (directory1 and directory2) with all subtrees replicated, and one is a dedicated slave (garnet) with half a dozen subtrees replicated. directory1 and directory2 are running FDS 7.1; garnet is running FDS 1.0.1. Most of the writes go to directory1, and although I have not tested writing to every subtree from directory1 -> directory2 and directory2 -> directory1, replication seems to be working fine for the most part.

However, yesterday morning I noticed the following entry:

[16/Dec/2005:09:06:16 -0500] NSMMReplicationPlugin - agmt="cn=IU13" (directory2:636): Consumer failed to replay change (uniqueid 6a76611a-1dd211b2-8027b642-689d0000, CSN 43a2c9d8000000010000): Operations error. Will retry later.

This has been repeated every five minutes up to the present time. Is there a way to look at the changelog entries to see what modification caused this problem? And if not, what's the best way to go about clearing up the error?

Also, is FDS smart enough that, if you have a two-server multimaster replication setup and you use one master to initialize the other, which has an existing replication agreement with the first master, it won't send the changes back? In other words, if directory1 and directory2 are set up in multimaster, with replication agreements in place for a subtree, and there's a problem in the subtree on directory2, can I use directory1 to initialize directory2, or will directory2 then turn around and try to initialize directory1?

Kevin

--
Kevin M. Myer
Senior Systems Administrator
Lancaster-Lebanon Intermediate Unit 13
http://www.iu13.org
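For readers of the archive: the uniqueid and CSN in the error-log entry above can be extracted mechanically, which makes it easier to search the consumer's logs for the same change. The following is only a minimal sketch, not a tool shipped with FDS. The names `parse_replay_failure` and `decode_csn` are hypothetical, the regular expression is written against the single log line quoted above, and the CSN layout (8 hex digits of timestamp, then 4 each for sequence number, replica ID, and sub-sequence number) is an assumption about the format. The assumption is at least consistent with this thread: the timestamp field decodes to 14:06:16 UTC on 16 Dec 2005, which is 09:06:16 -0500.

```python
import re
from datetime import datetime, timezone

# Pattern written against the one error-log line quoted in this thread;
# other server versions may word the message differently.
REPLAY_RE = re.compile(
    r'agmt="(?P<agmt>[^"]+)" \((?P<consumer>[^)]+)\): '
    r'Consumer failed to replay change '
    r'\(uniqueid (?P<uniqueid>[0-9a-f-]+), CSN (?P<csn>[0-9a-f]{20})\)'
)


def parse_replay_failure(line):
    """Return agreement, consumer, uniqueid and CSN from a replay-failure line."""
    match = REPLAY_RE.search(line)
    return match.groupdict() if match else None


def decode_csn(csn):
    """Split a 20-hex-digit CSN into its fields (layout is an assumption)."""
    return {
        "time": datetime.fromtimestamp(int(csn[0:8], 16), tz=timezone.utc),
        "seq": int(csn[8:12], 16),
        "replica_id": int(csn[12:16], 16),
        "subseq": int(csn[16:20], 16),
    }


line = (
    '[16/Dec/2005:09:06:16 -0500] NSMMReplicationPlugin - agmt="cn=IU13" '
    '(directory2:636): Consumer failed to replay change (uniqueid '
    '6a76611a-1dd211b2-8027b642-689d0000, CSN 43a2c9d8000000010000): '
    'Operations error. Will retry later.'
)
info = parse_replay_failure(line)
fields = decode_csn(info["csn"])
print(info)
print(fields)
```

The CSN is the useful key here, because the consumer's access log records the same value in a `csn=` field on the RESULT line of the replayed operation.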
David Boreham
2005-Dec-17 17:40 UTC
Re: [Fedora-directory-users] Replication - consumer failed to replay change
Kevin M. Myer wrote:

> [16/Dec/2005:09:06:16 -0500] NSMMReplicationPlugin - agmt="cn=IU13"
> (directory2:636): Consumer failed to replay change (uniqueid
> 6a76611a-1dd211b2-8027b642-689d0000, CSN 43a2c9d8000000010000):
> Operations error. Will retry later.
>
> This is repeated every five minutes to the present time. Is there a
> way to look at the changelog entries to see what modification caused
> this problem? And if not, what's the best way to go about clearing up
> the error?

Try looking in the access and error logs on the replica server (the server that is receiving this update). That should tell us which operation is failing. I'm not sure exactly what is going on; I've not seen a problem like this before. Perhaps someone else on the list has.

> Also, is FDS smart enough so that if you have a two-server multimaster
> replication setup, and you use one master to initialize the other,
> which has an existing replication setup with the master, that it won't
> send the changes back? In other words, if I have directory1 and
> directory2, and they are setup in multimaster, with replication
> agreements in place for a subtree, and there's a problem in the
> subtree on directory2, can I use directory1 to initialize directory2,
> or will directory2 then turn around and try to initialize directory1?

No, directory2 won't turn around and initialize directory1; it's smart enough not to do that.
Kevin M. Myer
2005-Dec-17 18:57 UTC
Re: [Fedora-directory-users] Replication - consumer failed to replay change
Quoting David Boreham <david_list@boreham.org>:

> Try looking in the access and error logs on the replica server (the
> server that is receiving this update).
> That should tell us which operation is failing. Exactly what is going
> on I'm not sure, I've not seen a
> problem like this before. Perhaps someone else on the list has.

Here's the action it's trying to perform:

[16/Dec/2005:09:06:16 -0500] conn=900959 op=3 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"
[16/Dec/2005:09:06:16 -0500] conn=900959 op=3 RESULT err=0 tag=120 nentries=0 etime=0
[16/Dec/2005:09:06:16 -0500] conn=900959 op=4 DEL dn="uid=<username>,ou=people,dc=base"
[16/Dec/2005:09:06:16 -0500] conn=900959 op=4 RESULT err=1 tag=107 nentries=0 etime=0 csn=43a2c9d8000000010000
[16/Dec/2005:09:06:18 -0500] conn=900959 op=5 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"

The replication to the slave (garnet) did occur properly for the account that was being deleted. It's also not inhibiting other changes from occurring in the same replication session: I just made a minor modification to my account and it replicated, while the deletion of the account giving errors still failed.

I restarted the server that was receiving the changes, and now the deletion operation that was failing isn't occurring at all :/ So I guess I'll just manually delete the account, since the one master seems to be convinced that the change went through.

Kevin

--
Kevin M. Myer
Senior Systems Administrator
Lancaster-Lebanon Intermediate Unit 13
http://www.iu13.org
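For readers of the archive: the step of matching the failing RESULT line back to the request that caused it can be automated when the access log is large. This is a rough sketch under stated assumptions, not part of FDS. The function name `failed_ops` is hypothetical, and the regular expressions are written only against the access-log lines quoted in this thread. The one fact relied on beyond the thread is that `err=1` is the LDAP operationsError result code, which is what the supplier reports as "Operations error".

```python
import re

# Matches lines such as: conn=900959 op=4 DEL dn="..."
# Connection-level lines (e.g. "conn=471 fd=210 slot=210 ...") do not match.
OP_RE = re.compile(r'conn=(\d+) op=(\d+) ([A-Z]+)\b(.*)')
ERR_RE = re.compile(r'\berr=(\d+)')
CSN_RE = re.compile(r'\bcsn=([0-9a-f]+)')


def failed_ops(lines):
    """Return one dict per RESULT line whose err code is non-zero,
    paired with the request line that has the same conn and op numbers."""
    requests = {}
    failures = []
    for line in lines:
        match = OP_RE.search(line)
        if not match:
            continue
        conn, op, kind, rest = match.groups()
        rest = rest.strip()
        if kind != "RESULT":
            # Remember the request so its RESULT can be matched later.
            requests[(conn, op)] = f"{kind} {rest}".strip()
            continue
        err = ERR_RE.search(rest)
        if err is None or int(err.group(1)) == 0:
            continue
        csn = CSN_RE.search(rest)
        failures.append({
            "conn": int(conn),
            "op": int(op),
            "request": requests.get((conn, op)),
            "err": int(err.group(1)),
            "csn": csn.group(1) if csn else None,
        })
    return failures


ACCESS_LOG = '''\
[16/Dec/2005:09:06:16 -0500] conn=900959 op=3 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"
[16/Dec/2005:09:06:16 -0500] conn=900959 op=3 RESULT err=0 tag=120 nentries=0 etime=0
[16/Dec/2005:09:06:16 -0500] conn=900959 op=4 DEL dn="uid=<username>,ou=people,dc=base"
[16/Dec/2005:09:06:16 -0500] conn=900959 op=4 RESULT err=1 tag=107 nentries=0 etime=0 csn=43a2c9d8000000010000
[16/Dec/2005:09:06:18 -0500] conn=900959 op=5 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
'''

failures = failed_ops(ACCESS_LOG.splitlines())
for failure in failures:
    print(failure)
```

Filtering the result on the `csn` value from the supplier's error log narrows the output to the one change the supplier keeps retrying.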
Richard Megginson
2005-Dec-19 15:31 UTC
Re: [Fedora-directory-users] Replication - consumer failed to replay change
Kevin M. Myer wrote:

> Quoting David Boreham <david_list@boreham.org>:
>
>> Try looking in the access and error logs on the replica server (the
>> server that is receiving this update).
>> That should tell us which operation is failing. Exactly what is going
>> on I'm not sure, I've not seen a
>> problem like this before. Perhaps someone else on the list has.
>
> Here's the action it's trying to perform:
>
> [16/Dec/2005:09:06:16 -0500] conn=900959 op=3 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"
> [16/Dec/2005:09:06:16 -0500] conn=900959 op=3 RESULT err=0 tag=120 nentries=0 etime=0
> [16/Dec/2005:09:06:16 -0500] conn=900959 op=4 DEL dn="uid=<username>,ou=people,dc=base"
> [16/Dec/2005:09:06:16 -0500] conn=900959 op=4 RESULT err=1 tag=107 nentries=0 etime=0 csn=43a2c9d8000000010000
> [16/Dec/2005:09:06:18 -0500] conn=900959 op=5 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
>
> The replication to the slave (garnet) did occur properly for the
> account that was being deleted.

Is this the access log from one of the masters?

> It's also not inhibiting other changes from occurring in the same
> replication session. I just made a minor modification to my account
> and it replicated while the deletion of the account giving errors
> failed. I restarted the server that was receiving the changes, and
> now the deletion operation that was failing isn't occurring at all :/
> So I guess I'll just manually delete the account, since the one master
> seems to be convinced that the change went through.

So after the restart, everything is ok?

>
> Kevin
Kevin M. Myer
2005-Dec-19 19:54 UTC
Re: [Fedora-directory-users] Replication - consumer failed to replay change
Quoting Richard Megginson <rmeggins@redhat.com>:

> Kevin M. Myer wrote:
>
>> Quoting David Boreham <david_list@boreham.org>:
>>
>>> Try looking in the access and error logs on the replica server (the
>>> server that is receiving this update).
>>> That should tell us which operation is failing. Exactly what is
>>> going on I'm not sure, I've not seen a
>>> problem like this before. Perhaps someone else on the list has.
>>
>> Here's the action it's trying to perform:
>>
>> [16/Dec/2005:09:06:16 -0500] conn=900959 op=3 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"
>> [16/Dec/2005:09:06:16 -0500] conn=900959 op=3 RESULT err=0 tag=120 nentries=0 etime=0
>> [16/Dec/2005:09:06:16 -0500] conn=900959 op=4 DEL dn="uid=<username>,ou=people,dc=base"
>> [16/Dec/2005:09:06:16 -0500] conn=900959 op=4 RESULT err=1 tag=107 nentries=0 etime=0 csn=43a2c9d8000000010000
>> [16/Dec/2005:09:06:18 -0500] conn=900959 op=5 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
>>
>> The replication to the slave (garnet) did occur properly for the
>> account that was being deleted.
>
> Is this the access log from one of the masters?

Yes, it's from the master that the changes were sent to.

>> It's also not inhibiting other changes from occurring in the same
>> replication session. I just made a minor modification to my account
>> and it replicated while the deletion of the account giving errors
>> failed. I restarted the server that was receiving the changes, and
>> now the deletion operation that was failing isn't occurring at all :/
>> So I guess I'll just manually delete the account, since the one
>> master seems to be convinced that the change went through.
>
> So after the restart, everything is ok?

Unfortunately, no. What has stopped is the attempt to do the replication from the master where the initial change was committed. Further, if I try to manually delete the entry from the master the changes were to be replicated to, I get the same operations error:

[17/Dec/2005:14:07:41 -0500] conn=471 fd=210 slot=210 connection from XX.XX.XX.XX to XX.XX.XX.XX
[17/Dec/2005:14:07:41 -0500] conn=471 op=0 BIND dn="cn=Directory Manager" method=128 version=3
[17/Dec/2005:14:07:41 -0500] conn=471 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=directory manager"
[17/Dec/2005:14:07:41 -0500] conn=471 op=1 DEL dn="uid=<username>,ou=People,dc=base"
[17/Dec/2005:14:07:41 -0500] conn=471 op=1 RESULT err=1 tag=107 nentries=0 etime=0 csn=43a461fe000000650000
[17/Dec/2005:14:07:41 -0500] conn=471 op=2 UNBIND
[17/Dec/2005:14:07:41 -0500] conn=471 op=2 fd=210 closed - U1

To the best of my knowledge, this server has not gone down uncleanly, and it's only this one entry that is causing problems. Any ideas on what to try next, or how I might fix it?

Kevin

--
Kevin M. Myer
Senior Systems Administrator
Lancaster-Lebanon Intermediate Unit 13
http://www.iu13.org
Richard Megginson
2005-Dec-19 20:27 UTC
Re: [Fedora-directory-users] Replication - consumer failed to replay change
Kevin M. Myer wrote:

> Quoting Richard Megginson <rmeggins@redhat.com>:
>
>> Kevin M. Myer wrote:
>>
>>> Quoting David Boreham <david_list@boreham.org>:
>>>
>>>> Try looking in the access and error logs on the replica server (the
>>>> server that is receiving this update).
>>>> That should tell us which operation is failing. Exactly what is
>>>> going on I'm not sure, I've not seen a
>>>> problem like this before. Perhaps someone else on the list has.
>>>
>>> Here's the action it's trying to perform:
>>>
>>> [16/Dec/2005:09:06:16 -0500] conn=900959 op=3 EXT oid="2.16.840.1.113730.3.5.3" name="Netscape Replication Start Session"
>>> [16/Dec/2005:09:06:16 -0500] conn=900959 op=3 RESULT err=0 tag=120 nentries=0 etime=0
>>> [16/Dec/2005:09:06:16 -0500] conn=900959 op=4 DEL dn="uid=<username>,ou=people,dc=base"
>>> [16/Dec/2005:09:06:16 -0500] conn=900959 op=4 RESULT err=1 tag=107 nentries=0 etime=0 csn=43a2c9d8000000010000
>>> [16/Dec/2005:09:06:18 -0500] conn=900959 op=5 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
>>>
>>> The replication to the slave (garnet) did occur properly for the
>>> account that was being deleted.
>>
>> Is this the access log from one of the masters?
>
> Yes, it's from the master that the changes were sent to.
>
>>> It's also not inhibiting other changes from occurring in the same
>>> replication session. I just made a minor modification to my account
>>> and it replicated while the deletion of the account giving errors
>>> failed. I restarted the server that was receiving the changes, and
>>> now the deletion operation that was failing isn't occurring at all :/
>>> So I guess I'll just manually delete the account, since the one
>>> master seems to be convinced that the change went through.
>>
>> So after the restart, everything is ok?
>
> Unfortunately, no. What has stopped is the attempt to do the
> replication from the master where the initial change was committed.
> Further, if I try to manually delete the entry from the master the
> changes were to be replicated to, I get the same operation error.
>
> [17/Dec/2005:14:07:41 -0500] conn=471 fd=210 slot=210 connection from XX.XX.XX.XX to XX.XX.XX.XX
> [17/Dec/2005:14:07:41 -0500] conn=471 op=0 BIND dn="cn=Directory Manager" method=128 version=3
> [17/Dec/2005:14:07:41 -0500] conn=471 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=directory manager"
> [17/Dec/2005:14:07:41 -0500] conn=471 op=1 DEL dn="uid=<username>,ou=People,dc=base"
> [17/Dec/2005:14:07:41 -0500] conn=471 op=1 RESULT err=1 tag=107 nentries=0 etime=0 csn=43a461fe000000650000
> [17/Dec/2005:14:07:41 -0500] conn=471 op=2 UNBIND
> [17/Dec/2005:14:07:41 -0500] conn=471 op=2 fd=210 closed - U1
>
> Now to the best of my knowledge, this server has not gone down
> uncleanly, and it's only this one entry that is causing problems. So
> ideas on what to try next, or how I might fix it?

I think you should just re-initialize it, e.g. reinit this master from the other master.

>
> Kevin