David Cunningham
2020-May-29 22:10 UTC
[Gluster-users] Geo-replication: Entry not present on master. Fixing gfid mismatch in slave
Hello,

We're having an issue with a geo-replication process that is using an
unusually large amount of CPU and is logging "Entry not present on master.
Fixing gfid mismatch in slave" errors. Can anyone help with this?

We have 3 GlusterFS replica nodes (which we'll call the master), which also
push data to a remote server (the slave) using geo-replication. This has been
running fine for a couple of months, but yesterday one of the master nodes
started having unusually high CPU use. It's this process:

root at cafs30:/var/log/glusterfs# ps aux | grep 32048
root 32048 68.7 0.6 1843140 845756 ? Rl 02:51 493:51 python2 /usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py worker gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path /nodirectwritedata/gluster/gvol0 --local-node cafs30 --local-node-id b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 --subvol-num 1 --resource-remote nvfs30 --resource-remote-id 1e698ccd-aeec-4ec4-96fe-383da8fc3b78

Here's what is being logged in /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log:

[2020-05-29 21:57:18.843524] I [master(worker /nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time stime=(1590789408, 0)
[2020-05-29 21:57:30.626172] I [master(worker /nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry retry_count=1 entry=({u'uid': 108, u'gfid': u'7c0b75e5-d8b7-454f-8010-112d613c599e', u'gid': 117, u'mode': 33204, u'entry': u'.gfid/c5422396-1578-4b50-a29d-315be2a9c5d8/00a859f7xxxx.cfg', u'op': u'CREATE'}, 17, {u'slave_isdir': False, u'gfid_mismatch': True, u'slave_name': None, u'slave_gfid': u'ec4b0ace-2ec4-4ea5-adbc-9f519b81917c', u'name_mismatch': False, u'dst': False})
[2020-05-29 21:57:30.627893] I [master(worker /nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry retry_count=1 entry=({u'uid': 108, u'gfid': u'a4d52e40-2e2f-4885-be5f-65fe95a8ebd7', u'gid': 117, u'mode': 33204, u'entry': u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/polycom_00a859f7xxxx.cfg', u'op': u'CREATE'}, 17, {u'slave_isdir': False, u'gfid_mismatch': True, u'slave_name': None, u'slave_gfid': u'ece8da77-b5ea-45a7-9af7-7d4d8f55f74a', u'name_mismatch': False, u'dst': False})
[2020-05-29 21:57:30.629532] I [master(worker /nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry retry_count=1 entry=({u'uid': 108, u'gfid': u'3c525ad8-aeb2-46b6-9c41-7fb4987916f8', u'gid': 117, u'mode': 33204, u'entry': u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/00a859f7xxxx-directory.xml', u'op': u'CREATE'}, 17, {u'slave_isdir': False, u'gfid_mismatch': True, u'slave_name': None, u'slave_gfid': u'06717b5a-d842-495d-bd25-aab9cd454490', u'name_mismatch': False, u'dst': False})
[2020-05-29 21:57:30.659123] I [master(worker /nodirectwritedata/gluster/gvol0):942:handle_entry_failures] _GMaster: Sucessfully fixed entry ops with gfid mismatch retry_count=1
[2020-05-29 21:57:30.659343] I [master(worker /nodirectwritedata/gluster/gvol0):1194:process_change] _GMaster: Retry original entries. count = 1
[2020-05-29 21:57:30.725810] I [master(worker /nodirectwritedata/gluster/gvol0):1197:process_change] _GMaster: Sucessfully fixed all entry ops with gfid mismatch
[2020-05-29 21:57:31.747319] I [master(worker /nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken duration=0.7409 num_files=18 job=1 return_code=0

We've verified that the files like polycom_00a859f7xxxx.cfg referred to in
the errors do exist on the master nodes and on the slave.

We found this bug fix:
https://bugzilla.redhat.com/show_bug.cgi?id=1642865

However that fix went in at 5.1, and we're running 5.12 on the master nodes
and the slave. A couple of GlusterFS clients connected to the master nodes
are running 5.13.

Would anyone have any suggestions? Thank you in advance.

-- 
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782
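One quick way to see how much of the worker's time is going into these fix-ups
is to count the "Fixing gfid mismatch" entries per minute in the log quoted
above; a minimal shell sketch, assuming the standard "[YYYY-MM-DD HH:MM:SS...]"
prefix shown in the log lines:

# Count gfid-mismatch fix-ups per minute in the geo-rep worker log
grep 'Fixing gfid mismatch' /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log \
  | awk '{print substr($1 " " $2, 2, 16)}' \
  | sort | uniq -c | tail -20

A steadily high count suggests the same entries are being retried on every
changelog batch rather than being cleaned up once.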
Strahil Nikolov
2020-May-30 08:20 UTC
[Gluster-users] Geo-replication: Entry not present on master. Fixing gfid mismatch in slave
Hey David,

For me a gfid mismatch means that the file was replaced/recreated - just like
vim on Linux does (and that is expected for a config file).

Have you checked the gfid of the file on both source and destination - do
they really match, or are they different?

What happens when you move the file away from the slave - does that fix the
issue?

Best Regards,
Strahil Nikolov

On 30 May 2020, at 1:10:56 GMT+03:00, David Cunningham <dcunningham at voisonics.com> wrote:
>Hello,
>
>We're having an issue with a geo-replication process that is using an
>unusually large amount of CPU and is logging "Entry not present on master.
>Fixing gfid mismatch in slave" errors. Can anyone help with this?
>
>[snip]
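For anyone comparing gfids as Strahil suggests: the value is stored in the
trusted.gfid extended attribute on each brick and can be read with getfattr.
A minimal sketch, using the master brick path from this thread and a
hypothetical slave brick path and directory (adjust to the real layout):

# On a master node, against the brick (not the FUSE mount)
getfattr -n trusted.gfid -e hex /nodirectwritedata/gluster/gvol0/some/dir/polycom_00a859f7xxxx.cfg

# On the slave node, against its brick (path is only an example)
getfattr -n trusted.gfid -e hex /path/to/slave/brick/some/dir/polycom_00a859f7xxxx.cfg

If geo-replication is healthy the two hex values should be identical; a
different value on the slave is the gfid mismatch the worker keeps trying to
fix.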
David Cunningham
2020-Jun-02 03:14 UTC
[Gluster-users] Geo-replication: Entry not present on master. Fixing gfid mismatch in slave
Hi Strahil and Sunny,

Thank you for the replies. I checked the gfid on the master and slaves and
they are the same. After moving the file away and back again, it doesn't seem
to be having the issue with that file any more.

We are still getting higher CPU usage on one of the master nodes than on the
others. It logs this every few seconds:

[2020-06-02 03:10:15.637815] I [master(worker /nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time Taken MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0 CRE=0 duration=0.0000 UNL=0
[2020-06-02 03:10:15.638010] I [master(worker /nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata Time Taken SETA=0 SETX=0 meta_duration=0.0000 data_duration=12.7878 DATA=4 XATT=0
[2020-06-02 03:10:15.638286] I [master(worker /nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch Completed changelog_end=1591067378 entry_stime=(1591067167, 0) changelog_start=1591067364 stime=(1591067377, 0) duration=12.8068 num_changelogs=2 mode=live_changelog
[2020-06-02 03:10:20.658601] I [master(worker /nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time stime=(1591067377, 0)
[2020-06-02 03:10:34.21799] I [master(worker /nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken duration=0.3826 num_files=8 job=1 return_code=0
[2020-06-02 03:10:46.440535] I [master(worker /nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time Taken MKD=0 MKN=0 LIN=0 SYM=0 REN=1 RMD=0 CRE=2 duration=0.1314 UNL=1
[2020-06-02 03:10:46.440809] I [master(worker /nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata Time Taken SETA=0 SETX=0 meta_duration=0.0000 data_duration=13.0171 DATA=14 XATT=0
[2020-06-02 03:10:46.441205] I [master(worker /nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch Completed changelog_end=1591067420 entry_stime=(1591067419, 0) changelog_start=1591067392 stime=(1591067419, 0) duration=13.0322 num_changelogs=3 mode=live_changelog
[2020-06-02 03:10:51.460925] I [master(worker /nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time stime=(1591067419, 0)
[2020-06-02 03:11:04.448913] I [master(worker /nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken duration=0.3466 num_files=3 job=1 return_code=0

Whereas the other master nodes only log this:

[2020-06-02 03:11:33.886938] I [gsyncd(config-get):308:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
[2020-06-02 03:11:33.993175] I [gsyncd(status):308:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf

Can anyone help with what might cause the high CPU usage on one master node?
The process is this one, and it is using 70-100% of CPU:

python2 /usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py worker gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path /nodirectwritedata/gluster/gvol0 --local-node cafs30 --local-node-id b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 --subvol-num 1 --resource-remote nvfs30 --resource-remote-id 1e698ccd-aeec-4ec4-96fe-383da8fc3b78

Thank you in advance!

On Sat, 30 May 2020 at 20:20, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
> Hey David,
>
> For me a gfid mismatch means that the file was replaced/recreated - just
> like vim on Linux does (and that is expected for a config file).
>
> Have you checked the gfid of the file on both source and destination - do
> they really match, or are they different?
>
> [snip]
-- 
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782
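Two things that may help narrow down where that worker is spending its time.
This is a sketch rather than a recipe, and the py-spy step assumes that tool
is installed and can attach to the python2 interpreter in use:

# Overall session health: crawl status, last synced time, failure counters
gluster volume geo-replication gvol0 nvfs10::gvol0 status detail

# Sample the busy gsyncd worker (PID 32048 in the earlier ps output)
strace -c -f -p 32048      # syscall summary; interrupt with Ctrl-C after ~30 seconds
py-spy top --pid 32048     # Python-level view of where the worker burns CPU, if py-spy is available

If most of the time turns out to be in the rsync/ssh child processes the cost
is in data transfer; if it stays inside the Python worker, changelog
processing (for example repeated gfid-mismatch retries) is the more likely
culprit.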