thr3ads.net - Gluster users - [Gluster-users] Geo-replication: Entry not present on master. Fixing gfid mismatch in slave [Jun 2020]

If this information is useful, please help other people find it:
Share via:

David Cunningham

2020-Jun-02 03:14 UTC

[Gluster-users] Geo-replication: Entry not present on master. Fixing gfid mismatch in slave

Hi Strahil and Sunny,

Thank you for the replies. I checked the gfid on the master and slaves and
they are the same. After moving the file away and back again it doesn't
seem to be having the issue with that file any more.

We are still getting higher CPU usage on one of the master nodes than the
others. It logs this every few seconds:

[2020-06-02 03:10:15.637815] I [master(worker
/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time Taken
       MKD=0   MKN=0   LIN=0   SYM=0   REN=0   RMD=0CRE=0   duration=0.0000
UNL=0
[2020-06-02 03:10:15.638010] I [master(worker
/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata
Time Taken        SETA=0  SETX=0  meta_duration=0.0000data_duration=12.7878
   DATA=4  XATT=0
[2020-06-02 03:10:15.638286] I [master(worker
/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch Completed
changelog_end=1591067378        entry_stime=(1591067167, 0)
 changelog_start=1591067364      stime=(1591067377, 0)   duration=12.8068
     num_changelogs=2        mode=live_changelog
[2020-06-02 03:10:20.658601] I [master(worker
/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time
 stime=(1591067377, 0)
[2020-06-02 03:10:34.21799] I [master(worker
/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken
 duration=0.3826 num_files=8     job=1   return_code=0
[2020-06-02 03:10:46.440535] I [master(worker
/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time Taken
       MKD=0   MKN=0   LIN=0   SYM=0   REN=1   RMD=0CRE=2   duration=0.1314
UNL=1
[2020-06-02 03:10:46.440809] I [master(worker
/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata
Time Taken        SETA=0  SETX=0  meta_duration=0.0000data_duration=13.0171
   DATA=14 XATT=0
[2020-06-02 03:10:46.441205] I [master(worker
/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch Completed
changelog_end=1591067420        entry_stime=(1591067419, 0)
 changelog_start=1591067392      stime=(1591067419, 0)   duration=13.0322
     num_changelogs=3        mode=live_changelog
[2020-06-02 03:10:51.460925] I [master(worker
/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time
 stime=(1591067419, 0)

[2020-06-02 03:11:04.448913] I [master(worker
/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken
duration=0.3466 num_files=3     job=1   return_code=0

Whereas the other master nodes only log this:

[2020-06-02 03:11:33.886938] I [gsyncd(config-get):308:main] <top>: Using
session config file
path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
[2020-06-02 03:11:33.993175] I [gsyncd(status):308:main] <top>: Using
session config file
path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf

Can anyone help with what might cause the high CPU usage on one master
node? The process is this one, and is using 70-100% of CPU:

python2 /usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py
worker gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path
/nodirectwritedata/gluster/gvol0 --local-node cafs30 --local-node-id
b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id
cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 --subvol-num 1
--resource-remote nvfs30 --resource-remote-id
1e698ccd-aeec-4ec4-96fe-383da8fc3b78

Thank you in advance!




On Sat, 30 May 2020 at 20:20, Strahil Nikolov <hunter86_bg at yahoo.com>
wrote:
> Hey David,
>
> for me a gfid  mismatch means  that the file  was  replaced/recreated  -
> just like  vim in linux does (and it is expected for config file).
>
> Have  you checked the gfid  of  the file on both source and destination,
> do they really match or they are different ?
>
> What happens  when you move away the file  from the slave ,  does it fixes
> the issue ?
>
> Best Regards,
> Strahil Nikolov
>
> ?? 30 ??? 2020 ?. 1:10:56 GMT+03:00, David Cunningham <
> dcunningham at voisonics.com> ??????:
> >Hello,
> >
> >We're having an issue with a geo-replication process with unusually
> >high
> >CPU use and giving "Entry not present on master. Fixing gfid
mismatch
> >in
> >slave" errors. Can anyone help on this?
> >
> >We have 3 GlusterFS replica nodes (we'll call the master), which
also
> >push
> >data to a remote server (slave) using geo-replication. This has been
> >running fine for a couple of months, but yesterday one of the master
> >nodes
> >started having unusually high CPU use. It's this process:
> >
> >root at cafs30:/var/log/glusterfs# ps aux | grep 32048
> >root     32048 68.7  0.6 1843140 845756 ?      Rl   02:51 493:51
> >python2
> >/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py worker
> >gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path
> >/nodirectwritedata/gluster/gvol0 --local-node cafs30 --local-node-id
> >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id
> >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 --subvol-num 1
> >--resource-remote nvfs30 --resource-remote-id
> >1e698ccd-aeec-4ec4-96fe-383da8fc3b78
> >
> >Here's what is being logged in
> >/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log:
> >
> >[2020-05-29 21:57:18.843524] I [master(worker
> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's
time
> > stime=(1590789408, 0)
> >[2020-05-29 21:57:30.626172] I [master(worker
> >/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
> >_GMaster: Entry not present on master. Fixing gfid mismatch in slave.
> >Deleting the entry    retry_count=1   entry=({u'uid': 108,
u'gfid':
> >u'7c0b75e5-d8b7-454f-8010-112d613c599e', u'gid': 117,
u'mode': 33204,
> >u'entry':
> >u'.gfid/c5422396-1578-4b50-a29d-315be2a9c5d8/00a859f7xxxx.cfg',
> >u'op': u'CREATE'}, 17, {u'slave_isdir': False,
u'gfid_mismatch': True,
> >u'slave_name': None, u'slave_gfid':
> >u'ec4b0ace-2ec4-4ea5-adbc-9f519b81917c',
u'name_mismatch': False,
> >u'dst':
> >False})
> >[2020-05-29 21:57:30.627893] I [master(worker
> >/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
> >_GMaster: Entry not present on master. Fixing gfid mismatch in slave.
> >Deleting the entry    retry_count=1   entry=({u'uid': 108,
u'gfid':
> >u'a4d52e40-2e2f-4885-be5f-65fe95a8ebd7', u'gid': 117,
u'mode': 33204,
> >u'entry':
>
>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/polycom_00a859f7xxxx.cfg',
> >u'op': u'CREATE'}, 17, {u'slave_isdir': False,
u'gfid_mismatch': True,
> >u'slave_name': None, u'slave_gfid':
> >u'ece8da77-b5ea-45a7-9af7-7d4d8f55f74a',
u'name_mismatch': False,
> >u'dst':
> >False})
> >[2020-05-29 21:57:30.629532] I [master(worker
> >/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
> >_GMaster: Entry not present on master. Fixing gfid mismatch in slave.
> >Deleting the entry    retry_count=1   entry=({u'uid': 108,
u'gfid':
> >u'3c525ad8-aeb2-46b6-9c41-7fb4987916f8', u'gid': 117,
u'mode': 33204,
> >u'entry':
>
>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/00a859f7xxxx-directory.xml',
> >u'op': u'CREATE'}, 17, {u'slave_isdir': False,
u'gfid_mismatch': True,
> >u'slave_name': None, u'slave_gfid':
> >u'06717b5a-d842-495d-bd25-aab9cd454490',
u'name_mismatch': False,
> >u'dst':
> >False})
> >[2020-05-29 21:57:30.659123] I [master(worker
> >/nodirectwritedata/gluster/gvol0):942:handle_entry_failures] _GMaster:
> >Sucessfully fixed entry ops with gfid mismatch     retry_count=1
> >[2020-05-29 21:57:30.659343] I [master(worker
> >/nodirectwritedata/gluster/gvol0):1194:process_change] _GMaster: Retry
> >original entries. count = 1
> >[2020-05-29 21:57:30.725810] I [master(worker
> >/nodirectwritedata/gluster/gvol0):1197:process_change] _GMaster:
> >Sucessfully fixed all entry ops with gfid mismatch
> >[2020-05-29 21:57:31.747319] I [master(worker
> >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken
> >duration=0.7409 num_files=18    job=1   return_code=0
> >
> >We've verified that the files like polycom_00a859f7xxxx.cfg
referred to
> >in
> >the error do exist on the master nodes and slave.
> >
> >We found this bug fix:
> >https://bugzilla.redhat.com/show_bug.cgi?id=1642865
> >
> >However that fix went in 5.1, and we're running 5.12 on the master
> >nodes
> >and slave. A couple of GlusterFS clients connected to the master nodes
> >are
> >running 5.13.
> >
> >Would anyone have any suggestions? Thank you in advance.
>

-- 
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.gluster.org/pipermail/gluster-users/attachments/20200602/8fb756a0/attachment.html>

Strahil Nikolov

2020-Jun-02 09:17 UTC

head link

[Gluster-users] Geo-replication: Entry not present on master. Fixing gfid mismatch in slave

Hi David,

in which log do you see the entries ?

I think I got an explanation why you see the process  only on one of the master
nodes -  geo-rep session is established from only 1 master node  /I hope someone
corrects me if I'm wrong/ to one slave node. Thus it will be natural to see
the high CPU  usage on only 1 master node in your situation.

Do you see anything else  in the :
var/log/glusterfs/geo-replication/<logs> (master nodes) or in
/var/log/glusterfs/geo-replication-slaves (slaves) that could hint of the exact
issue. I have  a vague feeling that that python script is constantly  looping
over some data causing the CPU hog.

Sadly, I can't find an instruction for increasing the log level of the geo
rep log .


Best  Regards,
Strahil  Nikolov
 

?? 2 ??? 2020 ?. 6:14:46 GMT+03:00, David Cunningham <dcunningham at
voisonics.com> ??????:>Hi Strahil and Sunny,
>
>Thank you for the replies. I checked the gfid on the master and slaves
>and
>they are the same. After moving the file away and back again it doesn't
>seem to be having the issue with that file any more.
>
>We are still getting higher CPU usage on one of the master nodes than
>the
>others. It logs this every few seconds:
>
>[2020-06-02 03:10:15.637815] I [master(worker
>/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time
>Taken
>   MKD=0   MKN=0   LIN=0   SYM=0   REN=0   RMD=0CRE=0   duration=0.0000
>UNL=0
>[2020-06-02 03:10:15.638010] I [master(worker
>/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata
>Time Taken        SETA=0  SETX=0 
>meta_duration=0.0000data_duration=12.7878
>   DATA=4  XATT=0
>[2020-06-02 03:10:15.638286] I [master(worker
>/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch
>Completed
>changelog_end=1591067378        entry_stime=(1591067167, 0)
>changelog_start=1591067364      stime=(1591067377, 0)  
>duration=12.8068
>     num_changelogs=2        mode=live_changelog
>[2020-06-02 03:10:20.658601] I [master(worker
>/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time
> stime=(1591067377, 0)
>[2020-06-02 03:10:34.21799] I [master(worker
>/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken
> duration=0.3826 num_files=8     job=1   return_code=0
>[2020-06-02 03:10:46.440535] I [master(worker
>/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time
>Taken
>   MKD=0   MKN=0   LIN=0   SYM=0   REN=1   RMD=0CRE=2   duration=0.1314
>UNL=1
>[2020-06-02 03:10:46.440809] I [master(worker
>/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata
>Time Taken        SETA=0  SETX=0 
>meta_duration=0.0000data_duration=13.0171
>   DATA=14 XATT=0
>[2020-06-02 03:10:46.441205] I [master(worker
>/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch
>Completed
>changelog_end=1591067420        entry_stime=(1591067419, 0)
>changelog_start=1591067392      stime=(1591067419, 0)  
>duration=13.0322
>     num_changelogs=3        mode=live_changelog
>[2020-06-02 03:10:51.460925] I [master(worker
>/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time
> stime=(1591067419, 0)
>
>[2020-06-02 03:11:04.448913] I [master(worker
>/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken
>duration=0.3466 num_files=3     job=1   return_code=0
>
>Whereas the other master nodes only log this:
>
>[2020-06-02 03:11:33.886938] I [gsyncd(config-get):308:main] <top>:
>Using
>session config file
>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
>[2020-06-02 03:11:33.993175] I [gsyncd(status):308:main] <top>: Using
>session config file
>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
>
>Can anyone help with what might cause the high CPU usage on one master
>node? The process is this one, and is using 70-100% of CPU:
>
>python2 /usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py
>worker gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path
>/nodirectwritedata/gluster/gvol0 --local-node cafs30 --local-node-id
>b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id
>cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 --subvol-num 1
>--resource-remote nvfs30 --resource-remote-id
>1e698ccd-aeec-4ec4-96fe-383da8fc3b78
>
>Thank you in advance!
>
>
>
>
>On Sat, 30 May 2020 at 20:20, Strahil Nikolov <hunter86_bg at
yahoo.com>
>wrote:
>
>> Hey David,
>>
>> for me a gfid  mismatch means  that the file  was  replaced/recreated
> -
>> just like  vim in linux does (and it is expected for config file).
>>
>> Have  you checked the gfid  of  the file on both source and
>destination,
>> do they really match or they are different ?
>>
>> What happens  when you move away the file  from the slave ,  does it
>fixes
>> the issue ?
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> ?? 30 ??? 2020 ?. 1:10:56 GMT+03:00, David Cunningham <
>> dcunningham at voisonics.com> ??????:
>> >Hello,
>> >
>> >We're having an issue with a geo-replication process with
unusually
>> >high
>> >CPU use and giving "Entry not present on master. Fixing gfid
>mismatch
>> >in
>> >slave" errors. Can anyone help on this?
>> >
>> >We have 3 GlusterFS replica nodes (we'll call the master),
which
>also
>> >push
>> >data to a remote server (slave) using geo-replication. This has
been
>> >running fine for a couple of months, but yesterday one of the
master
>> >nodes
>> >started having unusually high CPU use. It's this process:
>> >
>> >root at cafs30:/var/log/glusterfs# ps aux | grep 32048
>> >root     32048 68.7  0.6 1843140 845756 ?      Rl   02:51 493:51
>> >python2
>> >/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py
>worker
>> >gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path
>> >/nodirectwritedata/gluster/gvol0 --local-node cafs30
--local-node-id
>> >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id
>> >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13
>--subvol-num 1
>> >--resource-remote nvfs30 --resource-remote-id
>> >1e698ccd-aeec-4ec4-96fe-383da8fc3b78
>> >
>> >Here's what is being logged in
>> >/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log:
>> >
>> >[2020-05-29 21:57:18.843524] I [master(worker
>> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's
time
>> > stime=(1590789408, 0)
>> >[2020-05-29 21:57:30.626172] I [master(worker
>> >/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
>> >_GMaster: Entry not present on master. Fixing gfid mismatch in
>slave.
>> >Deleting the entry    retry_count=1   entry=({u'uid': 108,
u'gfid':
>> >u'7c0b75e5-d8b7-454f-8010-112d613c599e', u'gid':
117, u'mode':
>33204,
>> >u'entry':
>>
>u'.gfid/c5422396-1578-4b50-a29d-315be2a9c5d8/00a859f7xxxx.cfg',
>> >u'op': u'CREATE'}, 17, {u'slave_isdir':
False, u'gfid_mismatch':
>True,
>> >u'slave_name': None, u'slave_gfid':
>> >u'ec4b0ace-2ec4-4ea5-adbc-9f519b81917c',
u'name_mismatch': False,
>> >u'dst':
>> >False})
>> >[2020-05-29 21:57:30.627893] I [master(worker
>> >/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
>> >_GMaster: Entry not present on master. Fixing gfid mismatch in
>slave.
>> >Deleting the entry    retry_count=1   entry=({u'uid': 108,
u'gfid':
>> >u'a4d52e40-2e2f-4885-be5f-65fe95a8ebd7', u'gid':
117, u'mode':
>33204,
>> >u'entry':
>>
>>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/polycom_00a859f7xxxx.cfg',
>> >u'op': u'CREATE'}, 17, {u'slave_isdir':
False, u'gfid_mismatch':
>True,
>> >u'slave_name': None, u'slave_gfid':
>> >u'ece8da77-b5ea-45a7-9af7-7d4d8f55f74a',
u'name_mismatch': False,
>> >u'dst':
>> >False})
>> >[2020-05-29 21:57:30.629532] I [master(worker
>> >/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
>> >_GMaster: Entry not present on master. Fixing gfid mismatch in
>slave.
>> >Deleting the entry    retry_count=1   entry=({u'uid': 108,
u'gfid':
>> >u'3c525ad8-aeb2-46b6-9c41-7fb4987916f8', u'gid':
117, u'mode':
>33204,
>> >u'entry':
>>
>>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/00a859f7xxxx-directory.xml',
>> >u'op': u'CREATE'}, 17, {u'slave_isdir':
False, u'gfid_mismatch':
>True,
>> >u'slave_name': None, u'slave_gfid':
>> >u'06717b5a-d842-495d-bd25-aab9cd454490',
u'name_mismatch': False,
>> >u'dst':
>> >False})
>> >[2020-05-29 21:57:30.659123] I [master(worker
>> >/nodirectwritedata/gluster/gvol0):942:handle_entry_failures]
>_GMaster:
>> >Sucessfully fixed entry ops with gfid mismatch     retry_count=1
>> >[2020-05-29 21:57:30.659343] I [master(worker
>> >/nodirectwritedata/gluster/gvol0):1194:process_change] _GMaster:
>Retry
>> >original entries. count = 1
>> >[2020-05-29 21:57:30.725810] I [master(worker
>> >/nodirectwritedata/gluster/gvol0):1197:process_change] _GMaster:
>> >Sucessfully fixed all entry ops with gfid mismatch
>> >[2020-05-29 21:57:31.747319] I [master(worker
>> >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time
>Taken
>> >duration=0.7409 num_files=18    job=1   return_code=0
>> >
>> >We've verified that the files like polycom_00a859f7xxxx.cfg
referred
>to
>> >in
>> >the error do exist on the master nodes and slave.
>> >
>> >We found this bug fix:
>> >https://bugzilla.redhat.com/show_bug.cgi?id=1642865
>> >
>> >However that fix went in 5.1, and we're running 5.12 on the
master
>> >nodes
>> >and slave. A couple of GlusterFS clients connected to the master
>nodes
>> >are
>> >running 5.13.
>> >
>> >Would anyone have any suggestions? Thank you in advance.
>>

Gluster users - Jun 2020 - Geo-replication: Entry not present on master. Fixing gfid mismatch in slave

[Gluster-users] Geo-replication: Entry not present on master. Fixing gfid mismatch in slave

[Gluster-users] Geo-replication: Entry not present on master. Fixing gfid mismatch in slave