Strahil Nikolov
2020-Jun-02 09:17 UTC
[Gluster-users] Geo-replication: Entry not present on master. Fixing gfid mismatch in slave
Hi David,

in which log do you see the entries?

I think I got an explanation why you see the process only on one of the master nodes - the geo-rep session is established from only 1 master node /I hope someone corrects me if I'm wrong/ to one slave node. Thus it will be natural to see high CPU usage on only 1 master node in your situation.

Do you see anything else in /var/log/glusterfs/geo-replication/<logs> (master nodes) or in /var/log/glusterfs/geo-replication-slaves (slaves) that could hint at the exact issue? I have a vague feeling that the python script is constantly looping over some data, causing the CPU hog.

Sadly, I can't find an instruction for increasing the log level of the geo-rep log.

Best Regards,
Strahil Nikolov

On 2 June 2020 at 6:14:46 GMT+03:00, David Cunningham <dcunningham at voisonics.com> wrote:
>Hi Strahil and Sunny,
>
>Thank you for the replies. I checked the gfid on the master and slaves
>and they are the same. After moving the file away and back again it
>doesn't seem to be having the issue with that file any more.
>
>We are still getting higher CPU usage on one of the master nodes than
>the others. It logs this every few seconds:
>
>[2020-06-02 03:10:15.637815] I [master(worker /nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time Taken MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0 CRE=0 duration=0.0000 UNL=0
>[2020-06-02 03:10:15.638010] I [master(worker /nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata Time Taken SETA=0 SETX=0 meta_duration=0.0000 data_duration=12.7878 DATA=4 XATT=0
>[2020-06-02 03:10:15.638286] I [master(worker /nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch Completed changelog_end=1591067378 entry_stime=(1591067167, 0) changelog_start=1591067364 stime=(1591067377, 0) duration=12.8068 num_changelogs=2 mode=live_changelog
>[2020-06-02 03:10:20.658601] I [master(worker /nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time stime=(1591067377, 0)
>[2020-06-02 03:10:34.21799] I [master(worker /nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken duration=0.3826 num_files=8 job=1 return_code=0
>[2020-06-02 03:10:46.440535] I [master(worker /nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time Taken MKD=0 MKN=0 LIN=0 SYM=0 REN=1 RMD=0 CRE=2 duration=0.1314 UNL=1
>[2020-06-02 03:10:46.440809] I [master(worker /nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata Time Taken SETA=0 SETX=0 meta_duration=0.0000 data_duration=13.0171 DATA=14 XATT=0
>[2020-06-02 03:10:46.441205] I [master(worker /nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch Completed changelog_end=1591067420 entry_stime=(1591067419, 0) changelog_start=1591067392 stime=(1591067419, 0) duration=13.0322 num_changelogs=3 mode=live_changelog
>[2020-06-02 03:10:51.460925] I [master(worker /nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time stime=(1591067419, 0)
>[2020-06-02 03:11:04.448913] I [master(worker /nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken duration=0.3466 num_files=3 job=1 return_code=0
>
>Whereas the other master nodes only log this:
>
>[2020-06-02 03:11:33.886938] I [gsyncd(config-get):308:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
>[2020-06-02 03:11:33.993175] I [gsyncd(status):308:main] <top>: Using session config file path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
>
>Can anyone help with what might cause the high CPU usage on one master
>node? The process is this one, and is using 70-100% of CPU:
>
>python2 /usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py worker gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path /nodirectwritedata/gluster/gvol0 --local-node cafs30 --local-node-id b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 --subvol-num 1 --resource-remote nvfs30 --resource-remote-id 1e698ccd-aeec-4ec4-96fe-383da8fc3b78
>
>Thank you in advance!
>
>On Sat, 30 May 2020 at 20:20, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
>
>> Hey David,
>>
>> for me a gfid mismatch means that the file was replaced/recreated -
>> just like vim in linux does (and it is expected for a config file).
>>
>> Have you checked the gfid of the file on both source and destination -
>> do they really match, or are they different?
>>
>> What happens when you move the file away from the slave - does it fix
>> the issue?
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> On 30 May 2020 at 1:10:56 GMT+03:00, David Cunningham <dcunningham at voisonics.com> wrote:
>> >Hello,
>> >
>> >We're having an issue with a geo-replication process with unusually
>> >high CPU use and giving "Entry not present on master. Fixing gfid
>> >mismatch in slave" errors. Can anyone help with this?
>> >
>> >We have 3 GlusterFS replica nodes (we'll call them the master), which
>> >also push data to a remote server (slave) using geo-replication. This
>> >has been running fine for a couple of months, but yesterday one of the
>> >master nodes started having unusually high CPU use. It's this process:
>> >
>> >root at cafs30:/var/log/glusterfs# ps aux | grep 32048
>> >root 32048 68.7 0.6 1843140 845756 ? Rl 02:51 493:51 python2 /usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py worker gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path /nodirectwritedata/gluster/gvol0 --local-node cafs30 --local-node-id b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 --subvol-num 1 --resource-remote nvfs30 --resource-remote-id 1e698ccd-aeec-4ec4-96fe-383da8fc3b78
>> >
>> >Here's what is being logged in
>> >/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log:
>> >
>> >[2020-05-29 21:57:18.843524] I [master(worker /nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time stime=(1590789408, 0)
>> >[2020-05-29 21:57:30.626172] I [master(worker /nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry retry_count=1 entry=({u'uid': 108, u'gfid': u'7c0b75e5-d8b7-454f-8010-112d613c599e', u'gid': 117, u'mode': 33204, u'entry': u'.gfid/c5422396-1578-4b50-a29d-315be2a9c5d8/00a859f7xxxx.cfg', u'op': u'CREATE'}, 17, {u'slave_isdir': False, u'gfid_mismatch': True, u'slave_name': None, u'slave_gfid': u'ec4b0ace-2ec4-4ea5-adbc-9f519b81917c', u'name_mismatch': False, u'dst': False})
>> >[2020-05-29 21:57:30.627893] I [master(worker /nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry retry_count=1 entry=({u'uid': 108, u'gfid': u'a4d52e40-2e2f-4885-be5f-65fe95a8ebd7', u'gid': 117, u'mode': 33204, u'entry': u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/polycom_00a859f7xxxx.cfg', u'op': u'CREATE'}, 17, {u'slave_isdir': False, u'gfid_mismatch': True, u'slave_name': None, u'slave_gfid': u'ece8da77-b5ea-45a7-9af7-7d4d8f55f74a', u'name_mismatch': False, u'dst': False})
>> >[2020-05-29 21:57:30.629532] I [master(worker /nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures] _GMaster: Entry not present on master. Fixing gfid mismatch in slave. Deleting the entry retry_count=1 entry=({u'uid': 108, u'gfid': u'3c525ad8-aeb2-46b6-9c41-7fb4987916f8', u'gid': 117, u'mode': 33204, u'entry': u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/00a859f7xxxx-directory.xml', u'op': u'CREATE'}, 17, {u'slave_isdir': False, u'gfid_mismatch': True, u'slave_name': None, u'slave_gfid': u'06717b5a-d842-495d-bd25-aab9cd454490', u'name_mismatch': False, u'dst': False})
>> >[2020-05-29 21:57:30.659123] I [master(worker /nodirectwritedata/gluster/gvol0):942:handle_entry_failures] _GMaster: Sucessfully fixed entry ops with gfid mismatch retry_count=1
>> >[2020-05-29 21:57:30.659343] I [master(worker /nodirectwritedata/gluster/gvol0):1194:process_change] _GMaster: Retry original entries. count = 1
>> >[2020-05-29 21:57:30.725810] I [master(worker /nodirectwritedata/gluster/gvol0):1197:process_change] _GMaster: Sucessfully fixed all entry ops with gfid mismatch
>> >[2020-05-29 21:57:31.747319] I [master(worker /nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken duration=0.7409 num_files=18 job=1 return_code=0
>> >
>> >We've verified that the files like polycom_00a859f7xxxx.cfg referred
>> >to in the error do exist on the master nodes and slave.
>> >
>> >We found this bug fix:
>> >https://bugzilla.redhat.com/show_bug.cgi?id=1642865
>> >
>> >However that fix went into 5.1, and we're running 5.12 on the master
>> >nodes and slave. A couple of GlusterFS clients connected to the master
>> >nodes are running 5.13.
>> >
>> >Would anyone have any suggestions? Thank you in advance.
>>
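[Editor's note: two quick checks related to the points above - a minimal sketch only, using the session and brick names visible in the logs (master volume gvol0, slave nvfs10::gvol0, brick /nodirectwritedata/gluster/gvol0). The file path under the brick is a hypothetical example and should be replaced with a real one.]

    # Show which master node runs the Active worker for this session
    # (only the Active worker syncs data, so it is the node expected to be busy).
    gluster volume geo-replication gvol0 nvfs10::gvol0 status

    # Compare gfids of one file: run on a master node and on the slave,
    # against the file's path on the brick (not the FUSE mount), and check
    # that the trusted.gfid hex values match on both sides.
    getfattr -n trusted.gfid -e hex /nodirectwritedata/gluster/gvol0/path/to/polycom_00a859f7xxxx.cfg
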
Sunny Kumar
2020-Jun-02 09:57 UTC
[Gluster-users] Geo-replication: Entry not present on master. Fixing gfid mismatch in slave
Hi David,

You haven't answered my previous question regarding the type of your workload.

---

You can use the command below to enable debug logging:

`gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config log-level DEBUG`

and after capturing the log, switch back to info mode:

`gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config log-level INFO`

Please share the debug log and the geo-rep config to debug further. For the config:

`gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config`

/sunny

On Tue, Jun 2, 2020 at 10:18 AM Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
>
> Hi David,
>
> in which log do you see the entries?
>
> I think I got an explanation why you see the process only on one of the master nodes - the geo-rep session is established from only 1 master node /I hope someone corrects me if I'm wrong/ to one slave node. Thus it will be natural to see high CPU usage on only 1 master node in your situation.
>
> Do you see anything else in /var/log/glusterfs/geo-replication/<logs> (master nodes) or in /var/log/glusterfs/geo-replication-slaves (slaves) that could hint at the exact issue? I have a vague feeling that the python script is constantly looping over some data, causing the CPU hog.
>
> Sadly, I can't find an instruction for increasing the log level of the geo-rep log.
>
> Best Regards,
> Strahil Nikolov
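[Editor's note: the same sequence Sunny gives, filled in with the session names that appear in the log paths earlier in the thread (master volume gvol0, slave nvfs10::gvol0) - a sketch only, to be adjusted if the real session differs.]

    # Raise geo-replication logging to DEBUG for this session
    gluster vol geo-rep gvol0 nvfs10::gvol0 config log-level DEBUG

    # ... reproduce the high CPU usage for a few minutes, then switch back ...
    gluster vol geo-rep gvol0 nvfs10::gvol0 config log-level INFO

    # Dump the current geo-replication config for the session
    gluster vol geo-rep gvol0 nvfs10::gvol0 config

The extra detail is written to gsyncd.log under /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/ on the master side and under /var/log/glusterfs/geo-replication-slaves/ on the slave, the same locations mentioned earlier in the thread.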