David Cunningham
2020-Jun-10 03:37 UTC
[Gluster-users] Geo-replication: Entry not present on master. Fixing gfid mismatch in slave
Hi Strahil,

Thank you for that. Do you know if these "Stale file handle" errors on the geo-replication slave could be related?

[2020-06-10 01:02:32.268989] E [MSGID: 109040] [dht-helper.c:1332:dht_migration_complete_check_task] 0-gvol0-dht: /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba: failed to lookup the file on gvol0-dht [Stale file handle]
[2020-06-10 01:02:32.269092] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434237: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
[2020-06-10 01:02:32.329280] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434251: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
[2020-06-10 01:02:32.387129] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434264: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
[2020-06-10 01:02:32.448838] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434277: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
[2020-06-10 01:02:32.507196] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434290: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
[2020-06-10 01:02:32.566033] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434303: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
[2020-06-10 01:02:32.625168] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434316: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
[2020-06-10 01:02:32.772442] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434329: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
[2020-06-10 01:02:32.832481] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434342: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
[2020-06-10 01:02:32.891835] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434403: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)

On Tue, 9 Jun 2020 at 16:31, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
> Hey David,
>
> Can you check the cpu usage in the sar on the rest of the cluster (going
> backwards from the day you found the high cpu usage), so we can know if
> this behaviour was observed on other nodes.
>
> Maybe that behaviour was "normal" for the push node (which could be
> another one).
>
> As this script is python, I guess you can put some debug print
> statements in it.
>
> Best Regards,
> Strahil Nikolov
>
> On 9 June 2020 at 5:07:11 GMT+03:00, David Cunningham <dcunningham at voisonics.com> wrote:
> >Hi Sankarshan,
> >
> >Thanks for that. So what should we look for to figure out what this process is doing?
In > >/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log we see > >something like the following logged regularly: > > > > > >[[2020-06-09 02:01:19.670595] D [master(worker > >/nodirectwritedata/gluster/gvol0):1454:changelogs_batch_process] > >_GMaster: > >processing changes > > >batch=['/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668040', > > >'/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668055'] > >[2020-06-09 02:01:19.674927] D [master(worker > >/nodirectwritedata/gluster/gvol0):1289:process] _GMaster: processing > >change > > > > >changelog=/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668040 > >[2020-06-09 02:01:19.683098] D [master(worker > >/nodirectwritedata/gluster/gvol0):1170:process_change] _GMaster: > >entries: [] > >[2020-06-09 02:01:19.695125] D [master(worker > >/nodirectwritedata/gluster/gvol0):312:a_syncdata] _GMaster: files > >files=set(['.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17', > >'.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77', > >'.gfid/779cd2b3-1571-446a-8903-48d6183d3dd0', > >'.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435']) > >[2020-06-09 02:01:19.695344] D [master(worker > >/nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: candidate > >for > >syncing file=.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17 > >[2020-06-09 02:01:19.695508] D [master(worker > >/nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: candidate > >for > >syncing file=.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77 > >[2020-06-09 02:01:19.695638] D [master(worker > >/nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: candidate > >for > >syncing file=.gfid/779cd2b3-1571-446a-8903-48d6183d3dd0 > >[2020-06-09 02:01:19.695759] D [master(worker > >/nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: candidate > >for > >syncing file=.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435 > >[2020-06-09 02:01:19.695883] D [master(worker > >/nodirectwritedata/gluster/gvol0):1289:process] _GMaster: processing > >change > > > > >changelog=/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668055 > >[2020-06-09 02:01:19.696170] D [master(worker > >/nodirectwritedata/gluster/gvol0):1170:process_change] _GMaster: > >entries: [] > >[2020-06-09 02:01:19.714097] D [master(worker > >/nodirectwritedata/gluster/gvol0):312:a_syncdata] _GMaster: files > >files=set(['.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17', > >'.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77', > >'.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435']) > >[2020-06-09 02:01:19.714286] D [master(worker > >/nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: candidate > >for > >syncing file=.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17 > >[2020-06-09 02:01:19.714433] D [master(worker > >/nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: candidate > >for > >syncing file=.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77 > >[2020-06-09 02:01:19.714577] D [master(worker > >/nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: candidate > >for > >syncing file=.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435 > >[2020-06-09 02:01:20.179656] D [resource(worker > >/nodirectwritedata/gluster/gvol0):1419:rsync] SSH: files: > >.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17, > >.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77, > >.gfid/779cd2b3-1571-446a-8903-48d6183d3dd0, > >.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435, > 
>.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17, > >.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77, > >.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435 > >[2020-06-09 02:01:20.738632] I [master(worker > >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken > >duration=0.5588 num_files=7 job=2 return_code=0 > >[2020-06-09 02:01:20.739650] D [master(worker > >/nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced > > file=.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17 > >[2020-06-09 02:01:20.740041] D [master(worker > >/nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced > > file=.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77 > >[2020-06-09 02:01:20.740200] D [master(worker > >/nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced > > file=.gfid/779cd2b3-1571-446a-8903-48d6183d3dd0 > >[2020-06-09 02:01:20.740343] D [master(worker > >/nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced > > file=.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435 > >[2020-06-09 02:01:20.740482] D [master(worker > >/nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced > > file=.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17 > >[2020-06-09 02:01:20.740616] D [master(worker > >/nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced > > file=.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77 > >[2020-06-09 02:01:20.740741] D [master(worker > >/nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced > > file=.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435 > >[2020-06-09 02:01:20.745385] D [repce(worker > >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call > >22499:140085349934848:1591668080.75 > > >done('/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668040',) > >... > >[2020-06-09 02:01:20.746252] D [repce(worker > >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call > >22499:140085349934848:1591668080.75 done -> None > >[2020-06-09 02:01:20.746515] D [repce(worker > >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call > >22499:140085349934848:1591668080.75 > > >done('/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668055',) > >... 
> >[2020-06-09 02:01:20.747045] D [repce(worker > >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call > >22499:140085349934848:1591668080.75 done -> None > >[2020-06-09 02:01:31.364780] I [gsyncd(config-get):308:main] <top>: > >Using > >session config file > >path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf > >[2020-06-09 02:01:31.458915] I [gsyncd(status):308:main] <top>: Using > >session config file > >path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf > >[2020-06-09 02:01:32.156028] I [gsyncd(config-get):308:main] <top>: > >Using > >session config file > >path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf > >[2020-06-09 02:01:32.250082] I [gsyncd(status):308:main] <top>: Using > >session config file > >path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf > >[2020-06-09 02:01:33.778866] I [master(worker > >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time > >Taken > > MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0 CRE=0 > >duration=0.0000 UNL=0 > >[2020-06-09 02:01:33.779000] I [master(worker > >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata > >Time Taken SETA=0 SETX=0 meta_duration=0.0000 > > data_duration=14.0839 DATA=11 XATT=0 > >[2020-06-09 02:01:33.779195] I [master(worker > >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch > >Completed > >changelog_end=1591668055 entry_stime=(1591668024, 0) > >changelog_start=1591668040 stime=(1591668054, 0) > >duration=14.1043 > > num_changelogs=2 mode=live_changelog > >[2020-06-09 02:01:35.337511] I [gsyncd(config-get):308:main] <top>: > >Using > >session config file > >path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf > >[2020-06-09 02:01:35.432674] I [gsyncd(status):308:main] <top>: Using > >session config file > >path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf > >[2020-06-09 02:01:38.784449] D [master(worker > >/nodirectwritedata/gluster/gvol0):551:crawlwrap] _GMaster: ... crawl > >#18195 > >done, took 32.533401 seconds > >[2020-06-09 02:01:38.790709] D [repce(worker > >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call > >22499:140085349934848:1591668098.79 scan() ... > >[2020-06-09 02:01:38.791342] D [repce(worker > >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call > >22499:140085349934848:1591668098.79 scan -> None > >[2020-06-09 02:01:38.791479] D [repce(worker > >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call > >22499:140085349934848:1591668098.79 getchanges() ... > >[2020-06-09 02:01:38.791989] D [repce(worker > >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call > >22499:140085349934848:1591668098.79 getchanges -> > > >['/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668069', > > >'/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668083', > > >'/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668097'] > >[2020-06-09 02:01:38.792114] I [master(worker > >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time > > stime=(1591668054, 0) > > > >Something very similar is logged around every 32 seconds, and I notice > >the > >log above says the crawl took 32 seconds. 
> > > > > >On Mon, 8 Jun 2020 at 16:22, sankarshan <sankarshan at kadalu.io> wrote: > > > >> Reading through the thread it occurs to me that it would be a > >stronger > >> approach to understand the workload (a general description of the > >> application) and in terms of the releases of GlusterFS running, > >assess > >> if there are new issues to be addressed or if existing sets of > >patches > >> tend to work. Thanks for setting up the debug level in the log. High > >> CPU usage by a geo-replication process would need to be traced back > >to > >> why it really requires that %-age of CPU if it was not doing so > >> previously. > >> > >> On Mon, 8 Jun 2020 at 05:29, David Cunningham > ><dcunningham at voisonics.com> > >> wrote: > >> > > >> > Hi Strahil, > >> > > >> > The CPU is still quite high, with "top" regularly showing 100% CPU > >usage > >> by that process. However it's not clear whether this is really a > >problem, > >> or if it's just normal geo-replication activity. While CPU usage was > >not > >> previously as high on this server, it's not clear whether GlusterFS > >might > >> have made this the "push" node for geo-replication when it used to be > >some > >> other server. Having said that, no other server shows a sudden drop > >in CPU > >> usage. > >> > > >> > Ideally we could find out what the gsyncd process is doing, and > >> therefore whether it's expected or not. Any ideas on that? We did set > >the > >> log level to DEBUG as Sunny suggested. > >> > > >> > Looking at > >> > > >/var/log/glusterfs/geo-replication-slaves/gvol0_nvfs10_gvol0/mnt-cafs30-nodirectwritedata-gluster-gvol0.log > >> on the geo-replication slave, we have a lot of lines like the > >following > >> logged. > >> > > >> > [2020-06-06 00:32:43.155856] W [fuse-bridge.c:897:fuse_attr_cbk] > >> 0-glusterfs-fuse: 875853: STAT() > >> /.gfid/25d78a6c-41a9-4364-84a8-31f5571223ac => -1 (Stale file handle) > >> > [2020-06-06 00:32:43.219759] W [fuse-bridge.c:897:fuse_attr_cbk] > >> 0-glusterfs-fuse: 875923: STAT() > >> /.gfid/25d78a6c-41a9-4364-84a8-31f5571223ac => -1 (Stale file handle) > >> > [2020-06-06 00:32:43.280357] W [fuse-bridge.c:897:fuse_attr_cbk] > >> 0-glusterfs-fuse: 876001: STAT() > >> /.gfid/25d78a6c-41a9-4364-84a8-31f5571223ac => -1 (Stale file handle) > >> > The message "E [MSGID: 109040] > >> [dht-helper.c:1332:dht_migration_complete_check_task] 0-gvol0-dht: > >> /.gfid/25d78a6c-41a9-4364-84a8-31f5571223ac: failed to lookup the > >file on > >> gvol0-dht [Stale file handle]" repeated 9 times between [2020-06-06 > >> 00:32:42.689780] and [2020-06-06 00:32:43.280322] > >> > [2020-06-06 09:03:03.660956] E [MSGID: 109040] > >> [dht-helper.c:1332:dht_migration_complete_check_task] 0-gvol0-dht: > >> /.gfid/decdd552-d58b-4ddf-b27a-0da9a6fbc38b: failed to lookup the > >file on > >> gvol0-dht [Stale file handle] > >> > [2020-06-06 09:03:03.661057] W [fuse-bridge.c:897:fuse_attr_cbk] > >> 0-glusterfs-fuse: 965375: STAT() > >> /.gfid/decdd552-d58b-4ddf-b27a-0da9a6fbc38b => -1 (Stale file handle) > >> > [2020-06-06 09:03:10.258798] E [MSGID: 109040] > >> [dht-helper.c:1332:dht_migration_complete_check_task] 0-gvol0-dht: > >> /.gfid/1d90e129-d40f-4ca4-bea8-0bb3c4c7985a: failed to lookup the > >file on > >> gvol0-dht [Stale file handle] > >> > [2020-06-06 09:03:10.258880] W [fuse-bridge.c:897:fuse_attr_cbk] > >> 0-glusterfs-fuse: 969455: STAT() > >> /.gfid/1d90e129-d40f-4ca4-bea8-0bb3c4c7985a => -1 (Stale file handle) > >> > [2020-06-06 09:09:41.259362] E [MSGID: 109040] > >> 
[dht-helper.c:1332:dht_migration_complete_check_task] 0-gvol0-dht: > >> /.gfid/740efff6-74dd-45be-847f-66919d1179e0: failed to lookup the > >file on > >> gvol0-dht [Stale file handle] > >> > [2020-06-06 09:09:41.259458] W [fuse-bridge.c:897:fuse_attr_cbk] > >> 0-glusterfs-fuse: 1040904: STAT() > >> /.gfid/740efff6-74dd-45be-847f-66919d1179e0 => -1 (Stale file handle) > >> > > >> > Could these errors be part of the problem? > >> > > >> > Thanks again. > >> > > >> > > >> > On Sat, 6 Jun 2020 at 21:21, Strahil Nikolov > ><hunter86_bg at yahoo.com> > >> wrote: > >> >> > >> >> Hey David, > >> >> > >> >> can you check the old logs for gfid mismatch and get a list of > >files > >> that were causing the high cpu . > >> >> Maybe they are related somehow (maybe created by the same > >software > >> , same client version or something else) which could help about > >that. > >> >> > >> >> Also take a look in geo-replication-slave logs. > >> >> > >> >> Does the issue still occurs ? > >> >> > >> >> Best Regards, > >> >> Strahil Nikolov > >> >> > >> >> ?? 6 ??? 2020 ?. 1:21:55 GMT+03:00, David Cunningham < > >> dcunningham at voisonics.com> ??????: > >> >> >Hi Sunny and Strahil, > >> >> > > >> >> >Thanks again for your responses. We don't have a lot of renaming > >> >> >activity - > >> >> >maybe some, but not a lot. We do have files which are open for > >writing > >> >> >for > >> >> >quite a while - they're call recordings being written as the call > >> >> >happens. > >> >> > > >> >> >We've installed GlusterFS using the Ubuntu packages and I'd > >really like > >> >> >to > >> >> >avoid compiling and applying patches. > >> >> > > >> >> >After enabling DEBUG the log at > >> >> >/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log > >> >> >doesn't > >> >> >seem to show anything very unusual: > >> >> > > >> >> >[2020-06-03 02:49:01.992177] I [master(worker > >> >> >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry > >Time > >> >> >Taken > >> >> > MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0CRE=0 > >duration=0.0000 > >> >> >UNL=0 > >> >> >[2020-06-03 02:49:01.992465] I [master(worker > >> >> >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: > >Data/Metadata > >> >> >Time Taken SETA=0 SETX=0 > >> >> >meta_duration=0.0000data_duration=13.0954 > >> >> > DATA=8 XATT=0 > >> >> >[2020-06-03 02:49:01.992863] I [master(worker > >> >> >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch > >> >> >Completed > >> >> >changelog_end=1591152508 entry_stime=(1591152352, 0) > >> >> >changelog_start=1591152494 stime=(1591152507, 0) > >> >> >duration=13.1077 > >> >> > num_changelogs=2 mode=live_changelog > >> >> >[2020-06-03 02:49:02.958687] D [repce(worker > >> >> >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call > >> >> >19017:139678812452608:1591152542.96 keep_alive({'version': (1, > >0), > >> >> >'retval': 0, 'uuid': '8ee85fae-f3aa-4285-ad48-67a1dc17ed73', > >'timeout': > >> >> >1591152662, 'volume_mark': (1583043396, 161632)},) ... > >> >> >[2020-06-03 02:49:02.979139] D [repce(worker > >> >> >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call > >> >> >19017:139678812452608:1591152542.96 keep_alive -> 28 > >> >> >[2020-06-03 02:49:06.998127] D [master(worker > >> >> >/nodirectwritedata/gluster/gvol0):551:crawlwrap] _GMaster: ... 
> >crawl > >> >> >#114 > >> >> >done, took 30.180089 seconds > >> >> >[2020-06-03 02:49:07.10132] D [repce(worker > >> >> >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call > >> >> >19017:139679441716992:1591152547.01 scan() ... > >> >> >[2020-06-03 02:49:07.10781] D [repce(worker > >> >> >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call > >> >> >19017:139679441716992:1591152547.01 scan -> None > >> >> >[2020-06-03 02:49:07.10935] D [repce(worker > >> >> >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call > >> >> >19017:139679441716992:1591152547.01 getchanges() ... > >> >> >[2020-06-03 02:49:07.11579] D [repce(worker > >> >> >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call > >> >> >19017:139679441716992:1591152547.01 getchanges -> > >> >> > >> > > >>['/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591152522'] > >> >> >[2020-06-03 02:49:07.11720] I [master(worker > >> >> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's > >time > >> >> >stime=(1591152507, 0) > >> >> > > >> >> >Am I looking in the right place to find out what that gsyncd.py > >process > >> >> >is > >> >> >doing? > >> >> > > >> >> > > >> >> >On Tue, 2 Jun 2020 at 21:58, Sunny Kumar <sunkumar at redhat.com> > >wrote: > >> >> > > >> >> >> Hi David, > >> >> >> > >> >> >> You haven't answered my previous question regarding the type of > >your > >> >> >> workload. > >> >> >> --- > >> >> >> You can use the below command to enable debug log. > >> >> >> > >> >> >> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config > >> >> >log-level > >> >> >> DEBUG` > >> >> >> > >> >> >> and after capturing log again switch back to info mode: > >> >> >> > >> >> >> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config > >> >> >log-level > >> >> >> INFO` > >> >> >> > >> >> >> Please share the debug log and geo-rep config to debug further: > >> >> >> for config: > >> >> >> > >> >> >> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> > >config` > >> >> >> > >> >> >> /sunny > >> >> >> > >> >> >> > >> >> >> On Tue, Jun 2, 2020 at 10:18 AM Strahil Nikolov > >> >> ><hunter86_bg at yahoo.com> > >> >> >> wrote: > >> >> >> > > >> >> >> > Hi David, > >> >> >> > > >> >> >> > in which log do you see the entries ? > >> >> >> > > >> >> >> > I think I got an explanation why you see the process only on > >one > >> >> >of the > >> >> >> master nodes - geo-rep session is established from only 1 > >master > >> >> >node /I > >> >> >> hope someone corrects me if I'm wrong/ to one slave node. Thus > >it > >> >> >will be > >> >> >> natural to see the high CPU usage on only 1 master node in > >your > >> >> >situation. > >> >> >> > > >> >> >> > Do you see anything else in the : > >> >> >> var/log/glusterfs/geo-replication/<logs> (master nodes) or in > >> >> >> /var/log/glusterfs/geo-replication-slaves (slaves) that could > >hint of > >> >> >the > >> >> >> exact issue. I have a vague feeling that that python script is > >> >> >constantly > >> >> >> looping over some data causing the CPU hog. > >> >> >> > > >> >> >> > Sadly, I can't find an instruction for increasing the log > >level of > >> >> >the > >> >> >> geo rep log . > >> >> >> > > >> >> >> > > >> >> >> > Best Regards, > >> >> >> > Strahil Nikolov > >> >> >> > > >> >> >> > > >> >> >> > ?? 2 ??? 2020 ?. 
6:14:46 GMT+03:00, David Cunningham < > >> >> >> dcunningham at voisonics.com> ??????: > >> >> >> > >Hi Strahil and Sunny, > >> >> >> > > > >> >> >> > >Thank you for the replies. I checked the gfid on the master > >and > >> >> >slaves > >> >> >> > >and > >> >> >> > >they are the same. After moving the file away and back again > >it > >> >> >doesn't > >> >> >> > >seem to be having the issue with that file any more. > >> >> >> > > > >> >> >> > >We are still getting higher CPU usage on one of the master > >nodes > >> >> >than > >> >> >> > >the > >> >> >> > >others. It logs this every few seconds: > >> >> >> > > > >> >> >> > >[2020-06-02 03:10:15.637815] I [master(worker > >> >> >> > >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: > >Entry > >> >> >Time > >> >> >> > >Taken > >> >> >> > > MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0CRE=0 > >> >> >duration=0.0000 > >> >> >> > >UNL=0 > >> >> >> > >[2020-06-02 03:10:15.638010] I [master(worker > >> >> >> > >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: > >> >> >Data/Metadata > >> >> >> > >Time Taken SETA=0 SETX=0 > >> >> >> > >meta_duration=0.0000data_duration=12.7878 > >> >> >> > > DATA=4 XATT=0 > >> >> >> > >[2020-06-02 03:10:15.638286] I [master(worker > >> >> >> > >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: > >Batch > >> >> >> > >Completed > >> >> >> > >changelog_end=1591067378 entry_stime=(1591067167, 0) > >> >> >> > >changelog_start=1591067364 stime=(1591067377, 0) > >> >> >> > >duration=12.8068 > >> >> >> > > num_changelogs=2 mode=live_changelog > >> >> >> > >[2020-06-02 03:10:20.658601] I [master(worker > >> >> >> > >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: > >slave's > >> >> >time > >> >> >> > > stime=(1591067377, 0) > >> >> >> > >[2020-06-02 03:10:34.21799] I [master(worker > >> >> >> > >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync > >Time > >> >> >Taken > >> >> >> > > duration=0.3826 num_files=8 job=1 return_code=0 > >> >> >> > >[2020-06-02 03:10:46.440535] I [master(worker > >> >> >> > >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: > >Entry > >> >> >Time > >> >> >> > >Taken > >> >> >> > > MKD=0 MKN=0 LIN=0 SYM=0 REN=1 RMD=0CRE=2 > >> >> >duration=0.1314 > >> >> >> > >UNL=1 > >> >> >> > >[2020-06-02 03:10:46.440809] I [master(worker > >> >> >> > >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: > >> >> >Data/Metadata > >> >> >> > >Time Taken SETA=0 SETX=0 > >> >> >> > >meta_duration=0.0000data_duration=13.0171 > >> >> >> > > DATA=14 XATT=0 > >> >> >> > >[2020-06-02 03:10:46.441205] I [master(worker > >> >> >> > >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: > >Batch > >> >> >> > >Completed > >> >> >> > >changelog_end=1591067420 entry_stime=(1591067419, 0) > >> >> >> > >changelog_start=1591067392 stime=(1591067419, 0) > >> >> >> > >duration=13.0322 > >> >> >> > > num_changelogs=3 mode=live_changelog > >> >> >> > >[2020-06-02 03:10:51.460925] I [master(worker > >> >> >> > >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: > >slave's > >> >> >time > >> >> >> > > stime=(1591067419, 0) > >> >> >> > > > >> >> >> > >[2020-06-02 03:11:04.448913] I [master(worker > >> >> >> > >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync > >Time > >> >> >Taken > >> >> >> > >duration=0.3466 num_files=3 job=1 return_code=0 > >> >> >> > > > >> >> >> > >Whereas the other master nodes only log this: > >> >> >> > > > >> >> >> > >[2020-06-02 03:11:33.886938] I [gsyncd(config-get):308:main] > >> >> ><top>: > >> >> >> > >Using > >> >> >> > >session config 
file > >> >> >> > > >> >> > >>>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf > >> >> >> > >[2020-06-02 03:11:33.993175] I [gsyncd(status):308:main] > ><top>: > >> >> >Using > >> >> >> > >session config file > >> >> >> > > >> >> > >>>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf > >> >> >> > > > >> >> >> > >Can anyone help with what might cause the high CPU usage on > >one > >> >> >master > >> >> >> > >node? The process is this one, and is using 70-100% of CPU: > >> >> >> > > > >> >> >> > >python2 > >> >> >/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py > >> >> >> > >worker gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path > >> >> >> > >/nodirectwritedata/gluster/gvol0 --local-node cafs30 > >> >> >--local-node-id > >> >> >> > >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id > >> >> >> > >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 > >> >> >--subvol-num 1 > >> >> >> > >--resource-remote nvfs30 --resource-remote-id > >> >> >> > >1e698ccd-aeec-4ec4-96fe-383da8fc3b78 > >> >> >> > > > >> >> >> > >Thank you in advance! > >> >> >> > > > >> >> >> > > > >> >> >> > > > >> >> >> > > > >> >> >> > >On Sat, 30 May 2020 at 20:20, Strahil Nikolov > >> >> ><hunter86_bg at yahoo.com> > >> >> >> > >wrote: > >> >> >> > > > >> >> >> > >> Hey David, > >> >> >> > >> > >> >> >> > >> for me a gfid mismatch means that the file was > >> >> >replaced/recreated > >> >> >> > > - > >> >> >> > >> just like vim in linux does (and it is expected for > >config > >> >> >file). > >> >> >> > >> > >> >> >> > >> Have you checked the gfid of the file on both source > >and > >> >> >> > >destination, > >> >> >> > >> do they really match or they are different ? > >> >> >> > >> > >> >> >> > >> What happens when you move away the file from the slave > >, > >> >> >does it > >> >> >> > >fixes > >> >> >> > >> the issue ? > >> >> >> > >> > >> >> >> > >> Best Regards, > >> >> >> > >> Strahil Nikolov > >> >> >> > >> > >> >> >> > >> ?? 30 ??? 2020 ?. 1:10:56 GMT+03:00, David Cunningham < > >> >> >> > >> dcunningham at voisonics.com> ??????: > >> >> >> > >> >Hello, > >> >> >> > >> > > >> >> >> > >> >We're having an issue with a geo-replication process with > >> >> >unusually > >> >> >> > >> >high > >> >> >> > >> >CPU use and giving "Entry not present on master. Fixing > >gfid > >> >> >> > >mismatch > >> >> >> > >> >in > >> >> >> > >> >slave" errors. Can anyone help on this? > >> >> >> > >> > > >> >> >> > >> >We have 3 GlusterFS replica nodes (we'll call the > >master), > >> >> >which > >> >> >> > >also > >> >> >> > >> >push > >> >> >> > >> >data to a remote server (slave) using geo-replication. > >This has > >> >> >been > >> >> >> > >> >running fine for a couple of months, but yesterday one of > >the > >> >> >master > >> >> >> > >> >nodes > >> >> >> > >> >started having unusually high CPU use. It's this process: > >> >> >> > >> > > >> >> >> > >> >root at cafs30:/var/log/glusterfs# ps aux | grep 32048 > >> >> >> > >> >root 32048 68.7 0.6 1843140 845756 ? 
Rl 02:51 > >> >> >493:51 > >> >> >> > >> >python2 > >> >> >> > >> > >>/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py > >> >> >> > >worker > >> >> >> > >> >gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path > >> >> >> > >> >/nodirectwritedata/gluster/gvol0 --local-node cafs30 > >> >> >--local-node-id > >> >> >> > >> >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id > >> >> >> > >> >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 > >> >> >> > >--subvol-num 1 > >> >> >> > >> >--resource-remote nvfs30 --resource-remote-id > >> >> >> > >> >1e698ccd-aeec-4ec4-96fe-383da8fc3b78 > >> >> >> > >> > > >> >> >> > >> >Here's what is being logged in > >> >> >> > >> > >> >> > >>>/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log: > >> >> >> > >> > > >> >> >> > >> >[2020-05-29 21:57:18.843524] I [master(worker > >> >> >> > >> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: > >slave's > >> >> >time > >> >> >> > >> > stime=(1590789408, 0) > >> >> >> > >> >[2020-05-29 21:57:30.626172] I [master(worker > >> >> >> > >> > >> >> > >>>/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures] > >> >> >> > >> >_GMaster: Entry not present on master. Fixing gfid > >mismatch in > >> >> >> > >slave. > >> >> >> > >> >Deleting the entry retry_count=1 entry=({u'uid': > >108, > >> >> >u'gfid': > >> >> >> > >> >u'7c0b75e5-d8b7-454f-8010-112d613c599e', u'gid': 117, > >u'mode': > >> >> >> > >33204, > >> >> >> > >> >u'entry': > >> >> >> > >> > >>u'.gfid/c5422396-1578-4b50-a29d-315be2a9c5d8/00a859f7xxxx.cfg', > >> >> >> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, > >> >> >u'gfid_mismatch': > >> >> >> > >True, > >> >> >> > >> >u'slave_name': None, u'slave_gfid': > >> >> >> > >> >u'ec4b0ace-2ec4-4ea5-adbc-9f519b81917c', > >u'name_mismatch': > >> >> >False, > >> >> >> > >> >u'dst': > >> >> >> > >> >False}) > >> >> >> > >> >[2020-05-29 21:57:30.627893] I [master(worker > >> >> >> > >> > >> >> > >>>/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures] > >> >> >> > >> >_GMaster: Entry not present on master. Fixing gfid > >mismatch in > >> >> >> > >slave. > >> >> >> > >> >Deleting the entry retry_count=1 entry=({u'uid': > >108, > >> >> >u'gfid': > >> >> >> > >> >u'a4d52e40-2e2f-4885-be5f-65fe95a8ebd7', u'gid': 117, > >u'mode': > >> >> >> > >33204, > >> >> >> > >> >u'entry': > >> >> >> > >> > >> >> >> > > >> >> > >> > >>>>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/polycom_00a859f7xxxx.cfg', > >> >> >> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, > >> >> >u'gfid_mismatch': > >> >> >> > >True, > >> >> >> > >> >u'slave_name': None, u'slave_gfid': > >> >> >> > >> >u'ece8da77-b5ea-45a7-9af7-7d4d8f55f74a', > >u'name_mismatch': > >> >> >False, > >> >> >> > >> >u'dst': > >> >> >> > >> >False}) > >> >> >> > >> >[2020-05-29 21:57:30.629532] I [master(worker > >> >> >> > >> > >> >> > >>>/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures] > >> >> >> > >> >_GMaster: Entry not present on master. Fixing gfid > >mismatch in > >> >> >> > >slave. 
> >> >> >> > >> >Deleting the entry retry_count=1 entry=({u'uid': > >108, > >> >> >u'gfid': > >> >> >> > >> >u'3c525ad8-aeb2-46b6-9c41-7fb4987916f8', u'gid': 117, > >u'mode': > >> >> >> > >33204, > >> >> >> > >> >u'entry': > >> >> >> > >> > >> >> >> > > >> >> >> > >> >> > >> > > >>>>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/00a859f7xxxx-directory.xml', > >> >> >> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, > >> >> >u'gfid_mismatch': > >> >> >> > >True, > >> >> >> > >> >u'slave_name': None, u'slave_gfid': > >> >> >> > >> >u'06717b5a-d842-495d-bd25-aab9cd454490', > >u'name_mismatch': > >> >> >False, > >> >> >> > >> >u'dst': > >> >> >> > >> >False}) > >> >> >> > >> >[2020-05-29 21:57:30.659123] I [master(worker > >> >> >> > >> > >>/nodirectwritedata/gluster/gvol0):942:handle_entry_failures] > >> >> >> > >_GMaster: > >> >> >> > >> >Sucessfully fixed entry ops with gfid mismatch > >> >> >retry_count=1 > >> >> >> > >> >[2020-05-29 21:57:30.659343] I [master(worker > >> >> >> > >> >/nodirectwritedata/gluster/gvol0):1194:process_change] > >> >> >_GMaster: > >> >> >> > >Retry > >> >> >> > >> >original entries. count = 1 > >> >> >> > >> >[2020-05-29 21:57:30.725810] I [master(worker > >> >> >> > >> >/nodirectwritedata/gluster/gvol0):1197:process_change] > >> >> >_GMaster: > >> >> >> > >> >Sucessfully fixed all entry ops with gfid mismatch > >> >> >> > >> >[2020-05-29 21:57:31.747319] I [master(worker > >> >> >> > >> >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: > >Sync > >> >> >Time > >> >> >> > >Taken > >> >> >> > >> >duration=0.7409 num_files=18 job=1 return_code=0 > >> >> >> > >> > > >> >> >> > >> >We've verified that the files like > >polycom_00a859f7xxxx.cfg > >> >> >referred > >> >> >> > >to > >> >> >> > >> >in > >> >> >> > >> >the error do exist on the master nodes and slave. > >> >> >> > >> > > >> >> >> > >> >We found this bug fix: > >> >> >> > >> >https://bugzilla.redhat.com/show_bug.cgi?id=1642865 > >> >> >> > >> > > >> >> >> > >> >However that fix went in 5.1, and we're running 5.12 on > >the > >> >> >master > >> >> >> > >> >nodes > >> >> >> > >> >and slave. A couple of GlusterFS clients connected to the > >> >> >master > >> >> >> > >nodes > >> >> >> > >> >are > >> >> >> > >> >running 5.13. > >> >> >> > >> > > >> >> >> > >> >Would anyone have any suggestions? Thank you in advance. > >> >> >> > >> > >> >> >> > ________ > >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > Community Meeting Calendar: > >> >> >> > > >> >> >> > Schedule - > >> >> >> > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC > >> >> >> > Bridge: https://bluejeans.com/441850968 > >> >> >> > > >> >> >> > Gluster-users mailing list > >> >> >> > Gluster-users at gluster.org > >> >> >> > https://lists.gluster.org/mailman/listinfo/gluster-users > >> >> >> > >> >> >> > >> > > >> > > >> > > >> > -- > >> > David Cunningham, Voisonics Limited > >> > http://voisonics.com/ > >> > USA: +1 213 221 1092 > >> > New Zealand: +64 (0)28 2558 3782 > >> > ________ > >> > > >> > > >> > > >> > Community Meeting Calendar: > >> > > >> > Schedule - > >> > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC > >> > Bridge: https://bluejeans.com/441850968 > >> > > >> > Gluster-users mailing list > >> > Gluster-users at gluster.org > >> > https://lists.gluster.org/mailman/listinfo/gluster-users > >> > >> > >> > >> -- > >> sankarshan at kadalu.io | TZ: UTC+0530 | +91 99606 03294 > >> kadalu.io : Making it easy to provision storage in k8s! 
> >> ________ > >> > >> > >> > >> Community Meeting Calendar: > >> > >> Schedule - > >> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC > >> Bridge: https://bluejeans.com/441850968 > >> > >> Gluster-users mailing list > >> Gluster-users at gluster.org > >> https://lists.gluster.org/mailman/listinfo/gluster-users > >> >-- David Cunningham, Voisonics Limited http://voisonics.com/ USA: +1 213 221 1092 New Zealand: +64 (0)28 2558 3782 -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20200610/c4a644a5/attachment.html>
Strahil Nikolov
2020-Jun-10 04:06 UTC
[Gluster-users] Geo-replication: Entry not present on master. Fixing gfid mismatch in slave
Hey David,

Sadly, I just have a feeling that on one of the bricks there is a gfid mismatch, but I could be wrong.

As you have the gfid list, please check on all bricks (both master and slave) that the file exists (not the one in .glusterfs, but the real one) and that it has the same gfid. You can find the inode via ls and then run a find (don't forget the ionice) against the brick for that inode number.

Once you have the full path to the file, test:
- Mount with FUSE
- Check the file exists (no '??????' for permissions, size, etc.) and can be manipulated (maybe 'touch' can be used?)
- Find the file (on all replica sets) and check the gfid
- Check for heals pending for that gfid

Best Regards,
Strahil Nikolov

On 10 June 2020 at 6:37:35 GMT+03:00, David Cunningham <dcunningham at voisonics.com> wrote:
>Hi Strahil,
>
>Thank you for that. Do you know if these "Stale file handle" errors on
>the geo-replication slave could be related?
>
>[2020-06-10 01:02:32.268989] E [MSGID: 109040] [dht-helper.c:1332:dht_migration_complete_check_task] 0-gvol0-dht: /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba: failed to lookup the file on gvol0-dht [Stale file handle]
>[2020-06-10 01:02:32.269092] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434237: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
>[2020-06-10 01:02:32.329280] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434251: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
>[2020-06-10 01:02:32.387129] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434264: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
>[2020-06-10 01:02:32.448838] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434277: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
>[2020-06-10 01:02:32.507196] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434290: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
>[2020-06-10 01:02:32.566033] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434303: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
>[2020-06-10 01:02:32.625168] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434316: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
>[2020-06-10 01:02:32.772442] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434329: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
>[2020-06-10 01:02:32.832481] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434342: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
>[2020-06-10 01:02:32.891835] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 7434403: STAT() /.gfid/d4265a0c-d881-48d8-8ca1-0920ab5ae9ba => -1 (Stale file handle)
>
>On Tue, 9 Jun 2020 at 16:31, Strahil Nikolov <hunter86_bg at yahoo.com> wrote:
>> Hey David,
>>
>> Can you check the cpu usage in the sar on the rest of the cluster (going
>> backwards from the day you found the high cpu usage), so we can know if
>> this behaviour was observed on other nodes.
>>
>> Maybe that behaviour was "normal" for the push node (which could be
>> another one).
>>
>> As this script is python, I guess you can put some debug print
>> statements in it.
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> On 9 June 2020 at 5:07:11 GMT+03:00, David Cunningham <dcunningham at voisonics.com> wrote:
>> >Hi Sankarshan,
>> >
>> >Thanks for that.
So what should we look for to figure out what this >> >process >> >is doing? In >> >/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log we >see >> >something like the following logged regularly: >> > >> > >> >[[2020-06-09 02:01:19.670595] D [master(worker >> >/nodirectwritedata/gluster/gvol0):1454:changelogs_batch_process] >> >_GMaster: >> >processing changes >> >> >>batch=['/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668040', >> >> >>'/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668055'] >> >[2020-06-09 02:01:19.674927] D [master(worker >> >/nodirectwritedata/gluster/gvol0):1289:process] _GMaster: processing >> >change >> > >> >> >>changelog=/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668040 >> >[2020-06-09 02:01:19.683098] D [master(worker >> >/nodirectwritedata/gluster/gvol0):1170:process_change] _GMaster: >> >entries: [] >> >[2020-06-09 02:01:19.695125] D [master(worker >> >/nodirectwritedata/gluster/gvol0):312:a_syncdata] _GMaster: files >> >files=set(['.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17', >> >'.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77', >> >'.gfid/779cd2b3-1571-446a-8903-48d6183d3dd0', >> >'.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435']) >> >[2020-06-09 02:01:19.695344] D [master(worker >> >/nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: >candidate >> >for >> >syncing file=.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17 >> >[2020-06-09 02:01:19.695508] D [master(worker >> >/nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: >candidate >> >for >> >syncing file=.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77 >> >[2020-06-09 02:01:19.695638] D [master(worker >> >/nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: >candidate >> >for >> >syncing file=.gfid/779cd2b3-1571-446a-8903-48d6183d3dd0 >> >[2020-06-09 02:01:19.695759] D [master(worker >> >/nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: >candidate >> >for >> >syncing file=.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435 >> >[2020-06-09 02:01:19.695883] D [master(worker >> >/nodirectwritedata/gluster/gvol0):1289:process] _GMaster: processing >> >change >> > >> >> >>changelog=/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668055 >> >[2020-06-09 02:01:19.696170] D [master(worker >> >/nodirectwritedata/gluster/gvol0):1170:process_change] _GMaster: >> >entries: [] >> >[2020-06-09 02:01:19.714097] D [master(worker >> >/nodirectwritedata/gluster/gvol0):312:a_syncdata] _GMaster: files >> >files=set(['.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17', >> >'.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77', >> >'.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435']) >> >[2020-06-09 02:01:19.714286] D [master(worker >> >/nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: >candidate >> >for >> >syncing file=.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17 >> >[2020-06-09 02:01:19.714433] D [master(worker >> >/nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: >candidate >> >for >> >syncing file=.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77 >> >[2020-06-09 02:01:19.714577] D [master(worker >> >/nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: >candidate >> >for >> >syncing file=.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435 >> >[2020-06-09 02:01:20.179656] D [resource(worker >> >/nodirectwritedata/gluster/gvol0):1419:rsync] SSH: files: >> 
>.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17, >> >.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77, >> >.gfid/779cd2b3-1571-446a-8903-48d6183d3dd0, >> >.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435, >> >.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17, >> >.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77, >> >.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435 >> >[2020-06-09 02:01:20.738632] I [master(worker >> >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time >Taken >> >duration=0.5588 num_files=7 job=2 return_code=0 >> >[2020-06-09 02:01:20.739650] D [master(worker >> >/nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced >> > file=.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17 >> >[2020-06-09 02:01:20.740041] D [master(worker >> >/nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced >> > file=.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77 >> >[2020-06-09 02:01:20.740200] D [master(worker >> >/nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced >> > file=.gfid/779cd2b3-1571-446a-8903-48d6183d3dd0 >> >[2020-06-09 02:01:20.740343] D [master(worker >> >/nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced >> > file=.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435 >> >[2020-06-09 02:01:20.740482] D [master(worker >> >/nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced >> > file=.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17 >> >[2020-06-09 02:01:20.740616] D [master(worker >> >/nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced >> > file=.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77 >> >[2020-06-09 02:01:20.740741] D [master(worker >> >/nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced >> > file=.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435 >> >[2020-06-09 02:01:20.745385] D [repce(worker >> >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call >> >22499:140085349934848:1591668080.75 >> >> >>done('/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668040',) >> >... >> >[2020-06-09 02:01:20.746252] D [repce(worker >> >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call >> >22499:140085349934848:1591668080.75 done -> None >> >[2020-06-09 02:01:20.746515] D [repce(worker >> >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call >> >22499:140085349934848:1591668080.75 >> >> >>done('/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668055',) >> >... 
>> >[2020-06-09 02:01:20.747045] D [repce(worker >> >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call >> >22499:140085349934848:1591668080.75 done -> None >> >[2020-06-09 02:01:31.364780] I [gsyncd(config-get):308:main] <top>: >> >Using >> >session config file >> >>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf >> >[2020-06-09 02:01:31.458915] I [gsyncd(status):308:main] <top>: >Using >> >session config file >> >>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf >> >[2020-06-09 02:01:32.156028] I [gsyncd(config-get):308:main] <top>: >> >Using >> >session config file >> >>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf >> >[2020-06-09 02:01:32.250082] I [gsyncd(status):308:main] <top>: >Using >> >session config file >> >>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf >> >[2020-06-09 02:01:33.778866] I [master(worker >> >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time >> >Taken >> > MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0 CRE=0 >> >duration=0.0000 UNL=0 >> >[2020-06-09 02:01:33.779000] I [master(worker >> >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: >Data/Metadata >> >Time Taken SETA=0 SETX=0 meta_duration=0.0000 >> > data_duration=14.0839 DATA=11 XATT=0 >> >[2020-06-09 02:01:33.779195] I [master(worker >> >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch >> >Completed >> >changelog_end=1591668055 entry_stime=(1591668024, 0) >> >changelog_start=1591668040 stime=(1591668054, 0) >> >duration=14.1043 >> > num_changelogs=2 mode=live_changelog >> >[2020-06-09 02:01:35.337511] I [gsyncd(config-get):308:main] <top>: >> >Using >> >session config file >> >>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf >> >[2020-06-09 02:01:35.432674] I [gsyncd(status):308:main] <top>: >Using >> >session config file >> >>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf >> >[2020-06-09 02:01:38.784449] D [master(worker >> >/nodirectwritedata/gluster/gvol0):551:crawlwrap] _GMaster: ... crawl >> >#18195 >> >done, took 32.533401 seconds >> >[2020-06-09 02:01:38.790709] D [repce(worker >> >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call >> >22499:140085349934848:1591668098.79 scan() ... >> >[2020-06-09 02:01:38.791342] D [repce(worker >> >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call >> >22499:140085349934848:1591668098.79 scan -> None >> >[2020-06-09 02:01:38.791479] D [repce(worker >> >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call >> >22499:140085349934848:1591668098.79 getchanges() ... >> >[2020-06-09 02:01:38.791989] D [repce(worker >> >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call >> >22499:140085349934848:1591668098.79 getchanges -> >> >> >>['/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668069', >> >> >>'/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668083', >> >> >>'/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668097'] >> >[2020-06-09 02:01:38.792114] I [master(worker >> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time >> > stime=(1591668054, 0) >> > >> >Something very similar is logged around every 32 seconds, and I >notice >> >the >> >log above says the crawl took 32 seconds. 
>> > >> > >> >On Mon, 8 Jun 2020 at 16:22, sankarshan <sankarshan at kadalu.io> >wrote: >> > >> >> Reading through the thread it occurs to me that it would be a >> >stronger >> >> approach to understand the workload (a general description of the >> >> application) and in terms of the releases of GlusterFS running, >> >assess >> >> if there are new issues to be addressed or if existing sets of >> >patches >> >> tend to work. Thanks for setting up the debug level in the log. >High >> >> CPU usage by a geo-replication process would need to be traced >back >> >to >> >> why it really requires that %-age of CPU if it was not doing so >> >> previously. >> >> >> >> On Mon, 8 Jun 2020 at 05:29, David Cunningham >> ><dcunningham at voisonics.com> >> >> wrote: >> >> > >> >> > Hi Strahil, >> >> > >> >> > The CPU is still quite high, with "top" regularly showing 100% >CPU >> >usage >> >> by that process. However it's not clear whether this is really a >> >problem, >> >> or if it's just normal geo-replication activity. While CPU usage >was >> >not >> >> previously as high on this server, it's not clear whether >GlusterFS >> >might >> >> have made this the "push" node for geo-replication when it used to >be >> >some >> >> other server. Having said that, no other server shows a sudden >drop >> >in CPU >> >> usage. >> >> > >> >> > Ideally we could find out what the gsyncd process is doing, and >> >> therefore whether it's expected or not. Any ideas on that? We did >set >> >the >> >> log level to DEBUG as Sunny suggested. >> >> > >> >> > Looking at >> >> >> >> >>/var/log/glusterfs/geo-replication-slaves/gvol0_nvfs10_gvol0/mnt-cafs30-nodirectwritedata-gluster-gvol0.log >> >> on the geo-replication slave, we have a lot of lines like the >> >following >> >> logged. >> >> > >> >> > [2020-06-06 00:32:43.155856] W [fuse-bridge.c:897:fuse_attr_cbk] >> >> 0-glusterfs-fuse: 875853: STAT() >> >> /.gfid/25d78a6c-41a9-4364-84a8-31f5571223ac => -1 (Stale file >handle) >> >> > [2020-06-06 00:32:43.219759] W [fuse-bridge.c:897:fuse_attr_cbk] >> >> 0-glusterfs-fuse: 875923: STAT() >> >> /.gfid/25d78a6c-41a9-4364-84a8-31f5571223ac => -1 (Stale file >handle) >> >> > [2020-06-06 00:32:43.280357] W [fuse-bridge.c:897:fuse_attr_cbk] >> >> 0-glusterfs-fuse: 876001: STAT() >> >> /.gfid/25d78a6c-41a9-4364-84a8-31f5571223ac => -1 (Stale file >handle) >> >> > The message "E [MSGID: 109040] >> >> [dht-helper.c:1332:dht_migration_complete_check_task] 0-gvol0-dht: >> >> /.gfid/25d78a6c-41a9-4364-84a8-31f5571223ac: failed to lookup the >> >file on >> >> gvol0-dht [Stale file handle]" repeated 9 times between >[2020-06-06 >> >> 00:32:42.689780] and [2020-06-06 00:32:43.280322] >> >> > [2020-06-06 09:03:03.660956] E [MSGID: 109040] >> >> [dht-helper.c:1332:dht_migration_complete_check_task] 0-gvol0-dht: >> >> /.gfid/decdd552-d58b-4ddf-b27a-0da9a6fbc38b: failed to lookup the >> >file on >> >> gvol0-dht [Stale file handle] >> >> > [2020-06-06 09:03:03.661057] W [fuse-bridge.c:897:fuse_attr_cbk] >> >> 0-glusterfs-fuse: 965375: STAT() >> >> /.gfid/decdd552-d58b-4ddf-b27a-0da9a6fbc38b => -1 (Stale file >handle) >> >> > [2020-06-06 09:03:10.258798] E [MSGID: 109040] >> >> [dht-helper.c:1332:dht_migration_complete_check_task] 0-gvol0-dht: >> >> /.gfid/1d90e129-d40f-4ca4-bea8-0bb3c4c7985a: failed to lookup the >> >file on >> >> gvol0-dht [Stale file handle] >> >> > [2020-06-06 09:03:10.258880] W [fuse-bridge.c:897:fuse_attr_cbk] >> >> 0-glusterfs-fuse: 969455: STAT() >> >> /.gfid/1d90e129-d40f-4ca4-bea8-0bb3c4c7985a => -1 (Stale file >handle) >> 
>> > [2020-06-06 09:09:41.259362] E [MSGID: 109040] >> >> [dht-helper.c:1332:dht_migration_complete_check_task] 0-gvol0-dht: >> >> /.gfid/740efff6-74dd-45be-847f-66919d1179e0: failed to lookup the >> >file on >> >> gvol0-dht [Stale file handle] >> >> > [2020-06-06 09:09:41.259458] W [fuse-bridge.c:897:fuse_attr_cbk] >> >> 0-glusterfs-fuse: 1040904: STAT() >> >> /.gfid/740efff6-74dd-45be-847f-66919d1179e0 => -1 (Stale file >handle) >> >> > >> >> > Could these errors be part of the problem? >> >> > >> >> > Thanks again. >> >> > >> >> > >> >> > On Sat, 6 Jun 2020 at 21:21, Strahil Nikolov >> ><hunter86_bg at yahoo.com> >> >> wrote: >> >> >> >> >> >> Hey David, >> >> >> >> >> >> can you check the old logs for gfid mismatch and get a list of >> >files >> >> that were causing the high cpu . >> >> >> Maybe they are related somehow (maybe created by the same >> >software >> >> , same client version or something else) which could help about >> >that. >> >> >> >> >> >> Also take a look in geo-replication-slave logs. >> >> >> >> >> >> Does the issue still occurs ? >> >> >> >> >> >> Best Regards, >> >> >> Strahil Nikolov >> >> >> >> >> >> ?? 6 ??? 2020 ?. 1:21:55 GMT+03:00, David Cunningham < >> >> dcunningham at voisonics.com> ??????: >> >> >> >Hi Sunny and Strahil, >> >> >> > >> >> >> >Thanks again for your responses. We don't have a lot of >renaming >> >> >> >activity - >> >> >> >maybe some, but not a lot. We do have files which are open for >> >writing >> >> >> >for >> >> >> >quite a while - they're call recordings being written as the >call >> >> >> >happens. >> >> >> > >> >> >> >We've installed GlusterFS using the Ubuntu packages and I'd >> >really like >> >> >> >to >> >> >> >avoid compiling and applying patches. >> >> >> > >> >> >> >After enabling DEBUG the log at >> >> >> >>/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log >> >> >> >doesn't >> >> >> >seem to show anything very unusual: >> >> >> > >> >> >> >[2020-06-03 02:49:01.992177] I [master(worker >> >> >> >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: >Entry >> >Time >> >> >> >Taken >> >> >> > MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0CRE=0 >> >duration=0.0000 >> >> >> >UNL=0 >> >> >> >[2020-06-03 02:49:01.992465] I [master(worker >> >> >> >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: >> >Data/Metadata >> >> >> >Time Taken SETA=0 SETX=0 >> >> >> >meta_duration=0.0000data_duration=13.0954 >> >> >> > DATA=8 XATT=0 >> >> >> >[2020-06-03 02:49:01.992863] I [master(worker >> >> >> >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: >Batch >> >> >> >Completed >> >> >> >changelog_end=1591152508 entry_stime=(1591152352, 0) >> >> >> >changelog_start=1591152494 stime=(1591152507, 0) >> >> >> >duration=13.1077 >> >> >> > num_changelogs=2 mode=live_changelog >> >> >> >[2020-06-03 02:49:02.958687] D [repce(worker >> >> >> >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call >> >> >> >19017:139678812452608:1591152542.96 keep_alive({'version': (1, >> >0), >> >> >> >'retval': 0, 'uuid': '8ee85fae-f3aa-4285-ad48-67a1dc17ed73', >> >'timeout': >> >> >> >1591152662, 'volume_mark': (1583043396, 161632)},) ... >> >> >> >[2020-06-03 02:49:02.979139] D [repce(worker >> >> >> >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: >call >> >> >> >19017:139678812452608:1591152542.96 keep_alive -> 28 >> >> >> >[2020-06-03 02:49:06.998127] D [master(worker >> >> >> >/nodirectwritedata/gluster/gvol0):551:crawlwrap] _GMaster: ... 
>> >crawl >> >> >> >#114 >> >> >> >done, took 30.180089 seconds >> >> >> >[2020-06-03 02:49:07.10132] D [repce(worker >> >> >> >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call >> >> >> >19017:139679441716992:1591152547.01 scan() ... >> >> >> >[2020-06-03 02:49:07.10781] D [repce(worker >> >> >> >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: >call >> >> >> >19017:139679441716992:1591152547.01 scan -> None >> >> >> >[2020-06-03 02:49:07.10935] D [repce(worker >> >> >> >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call >> >> >> >19017:139679441716992:1591152547.01 getchanges() ... >> >> >> >[2020-06-03 02:49:07.11579] D [repce(worker >> >> >> >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: >call >> >> >> >19017:139679441716992:1591152547.01 getchanges -> >> >> >> >> >> >> >> >>>['/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591152522'] >> >> >> >[2020-06-03 02:49:07.11720] I [master(worker >> >> >> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: >slave's >> >time >> >> >> >stime=(1591152507, 0) >> >> >> > >> >> >> >Am I looking in the right place to find out what that >gsyncd.py >> >process >> >> >> >is >> >> >> >doing? >> >> >> > >> >> >> > >> >> >> >On Tue, 2 Jun 2020 at 21:58, Sunny Kumar <sunkumar at redhat.com> >> >wrote: >> >> >> > >> >> >> >> Hi David, >> >> >> >> >> >> >> >> You haven't answered my previous question regarding the type >of >> >your >> >> >> >> workload. >> >> >> >> --- >> >> >> >> You can use the below command to enable debug log. >> >> >> >> >> >> >> >> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> >config >> >> >> >log-level >> >> >> >> DEBUG` >> >> >> >> >> >> >> >> and after capturing log again switch back to info mode: >> >> >> >> >> >> >> >> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> >config >> >> >> >log-level >> >> >> >> INFO` >> >> >> >> >> >> >> >> Please share the debug log and geo-rep config to debug >further: >> >> >> >> for config: >> >> >> >> >> >> >> >> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> >> >config` >> >> >> >> >> >> >> >> /sunny >> >> >> >> >> >> >> >> >> >> >> >> On Tue, Jun 2, 2020 at 10:18 AM Strahil Nikolov >> >> >> ><hunter86_bg at yahoo.com> >> >> >> >> wrote: >> >> >> >> > >> >> >> >> > Hi David, >> >> >> >> > >> >> >> >> > in which log do you see the entries ? >> >> >> >> > >> >> >> >> > I think I got an explanation why you see the process only >on >> >one >> >> >> >of the >> >> >> >> master nodes - geo-rep session is established from only 1 >> >master >> >> >> >node /I >> >> >> >> hope someone corrects me if I'm wrong/ to one slave node. >Thus >> >it >> >> >> >will be >> >> >> >> natural to see the high CPU usage on only 1 master node in >> >your >> >> >> >situation. >> >> >> >> > >> >> >> >> > Do you see anything else in the : >> >> >> >> var/log/glusterfs/geo-replication/<logs> (master nodes) or >in >> >> >> >> /var/log/glusterfs/geo-replication-slaves (slaves) that >could >> >hint of >> >> >> >the >> >> >> >> exact issue. I have a vague feeling that that python script >is >> >> >> >constantly >> >> >> >> looping over some data causing the CPU hog. >> >> >> >> > >> >> >> >> > Sadly, I can't find an instruction for increasing the log >> >level of >> >> >> >the >> >> >> >> geo rep log . >> >> >> >> > >> >> >> >> > >> >> >> >> > Best Regards, >> >> >> >> > Strahil Nikolov >> >> >> >> > >> >> >> >> > >> >> >> >> > ?? 2 ??? 2020 ?. 
6:14:46 GMT+03:00, David Cunningham < >> >> >> >> dcunningham at voisonics.com> ??????: >> >> >> >> > >Hi Strahil and Sunny, >> >> >> >> > > >> >> >> >> > >Thank you for the replies. I checked the gfid on the >master >> >and >> >> >> >slaves >> >> >> >> > >and >> >> >> >> > >they are the same. After moving the file away and back >again >> >it >> >> >> >doesn't >> >> >> >> > >seem to be having the issue with that file any more. >> >> >> >> > > >> >> >> >> > >We are still getting higher CPU usage on one of the >master >> >nodes >> >> >> >than >> >> >> >> > >the >> >> >> >> > >others. It logs this every few seconds: >> >> >> >> > > >> >> >> >> > >[2020-06-02 03:10:15.637815] I [master(worker >> >> >> >> > >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: >> >Entry >> >> >> >Time >> >> >> >> > >Taken >> >> >> >> > > MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0CRE=0 >> >> >> >duration=0.0000 >> >> >> >> > >UNL=0 >> >> >> >> > >[2020-06-02 03:10:15.638010] I [master(worker >> >> >> >> > >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: >> >> >> >Data/Metadata >> >> >> >> > >Time Taken SETA=0 SETX=0 >> >> >> >> > >meta_duration=0.0000data_duration=12.7878 >> >> >> >> > > DATA=4 XATT=0 >> >> >> >> > >[2020-06-02 03:10:15.638286] I [master(worker >> >> >> >> > >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: >> >Batch >> >> >> >> > >Completed >> >> >> >> > >changelog_end=1591067378 entry_stime=(1591067167, >0) >> >> >> >> > >changelog_start=1591067364 stime=(1591067377, 0) >> >> >> >> > >duration=12.8068 >> >> >> >> > > num_changelogs=2 mode=live_changelog >> >> >> >> > >[2020-06-02 03:10:20.658601] I [master(worker >> >> >> >> > >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: >> >slave's >> >> >> >time >> >> >> >> > > stime=(1591067377, 0) >> >> >> >> > >[2020-06-02 03:10:34.21799] I [master(worker >> >> >> >> > >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: >Sync >> >Time >> >> >> >Taken >> >> >> >> > > duration=0.3826 num_files=8 job=1 return_code=0 >> >> >> >> > >[2020-06-02 03:10:46.440535] I [master(worker >> >> >> >> > >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: >> >Entry >> >> >> >Time >> >> >> >> > >Taken >> >> >> >> > > MKD=0 MKN=0 LIN=0 SYM=0 REN=1 RMD=0CRE=2 >> >> >> >duration=0.1314 >> >> >> >> > >UNL=1 >> >> >> >> > >[2020-06-02 03:10:46.440809] I [master(worker >> >> >> >> > >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: >> >> >> >Data/Metadata >> >> >> >> > >Time Taken SETA=0 SETX=0 >> >> >> >> > >meta_duration=0.0000data_duration=13.0171 >> >> >> >> > > DATA=14 XATT=0 >> >> >> >> > >[2020-06-02 03:10:46.441205] I [master(worker >> >> >> >> > >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: >> >Batch >> >> >> >> > >Completed >> >> >> >> > >changelog_end=1591067420 entry_stime=(1591067419, >0) >> >> >> >> > >changelog_start=1591067392 stime=(1591067419, 0) >> >> >> >> > >duration=13.0322 >> >> >> >> > > num_changelogs=3 mode=live_changelog >> >> >> >> > >[2020-06-02 03:10:51.460925] I [master(worker >> >> >> >> > >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: >> >slave's >> >> >> >time >> >> >> >> > > stime=(1591067419, 0) >> >> >> >> > > >> >> >> >> > >[2020-06-02 03:11:04.448913] I [master(worker >> >> >> >> > >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: >Sync >> >Time >> >> >> >Taken >> >> >> >> > >duration=0.3466 num_files=3 job=1 return_code=0 >> >> >> >> > > >> >> >> >> > >Whereas the other master nodes only log this: >> >> >> >> > > >> >> >> >> > >[2020-06-02 03:11:33.886938] I 
>[gsyncd(config-get):308:main] >> >> >> ><top>: >> >> >> >> > >Using >> >> >> >> > >session config file >> >> >> >> > >> >> >> >> >>>>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf >> >> >> >> > >[2020-06-02 03:11:33.993175] I [gsyncd(status):308:main] >> ><top>: >> >> >> >Using >> >> >> >> > >session config file >> >> >> >> > >> >> >> >> >>>>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf >> >> >> >> > > >> >> >> >> > >Can anyone help with what might cause the high CPU usage >on >> >one >> >> >> >master >> >> >> >> > >node? The process is this one, and is using 70-100% of >CPU: >> >> >> >> > > >> >> >> >> > >python2 >> >> >> >>/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py >> >> >> >> > >worker gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path >> >> >> >> > >/nodirectwritedata/gluster/gvol0 --local-node cafs30 >> >> >> >--local-node-id >> >> >> >> > >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id >> >> >> >> > >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 >> >> >> >--subvol-num 1 >> >> >> >> > >--resource-remote nvfs30 --resource-remote-id >> >> >> >> > >1e698ccd-aeec-4ec4-96fe-383da8fc3b78 >> >> >> >> > > >> >> >> >> > >Thank you in advance! >> >> >> >> > > >> >> >> >> > > >> >> >> >> > > >> >> >> >> > > >> >> >> >> > >On Sat, 30 May 2020 at 20:20, Strahil Nikolov >> >> >> ><hunter86_bg at yahoo.com> >> >> >> >> > >wrote: >> >> >> >> > > >> >> >> >> > >> Hey David, >> >> >> >> > >> >> >> >> >> > >> for me a gfid mismatch means that the file was >> >> >> >replaced/recreated >> >> >> >> > > - >> >> >> >> > >> just like vim in linux does (and it is expected for >> >config >> >> >> >file). >> >> >> >> > >> >> >> >> >> > >> Have you checked the gfid of the file on both source >> >and >> >> >> >> > >destination, >> >> >> >> > >> do they really match or they are different ? >> >> >> >> > >> >> >> >> >> > >> What happens when you move away the file from the >slave >> >, >> >> >> >does it >> >> >> >> > >fixes >> >> >> >> > >> the issue ? >> >> >> >> > >> >> >> >> >> > >> Best Regards, >> >> >> >> > >> Strahil Nikolov >> >> >> >> > >> >> >> >> >> > >> ?? 30 ??? 2020 ?. 1:10:56 GMT+03:00, David Cunningham < >> >> >> >> > >> dcunningham at voisonics.com> ??????: >> >> >> >> > >> >Hello, >> >> >> >> > >> > >> >> >> >> > >> >We're having an issue with a geo-replication process >with >> >> >> >unusually >> >> >> >> > >> >high >> >> >> >> > >> >CPU use and giving "Entry not present on master. >Fixing >> >gfid >> >> >> >> > >mismatch >> >> >> >> > >> >in >> >> >> >> > >> >slave" errors. Can anyone help on this? >> >> >> >> > >> > >> >> >> >> > >> >We have 3 GlusterFS replica nodes (we'll call the >> >master), >> >> >> >which >> >> >> >> > >also >> >> >> >> > >> >push >> >> >> >> > >> >data to a remote server (slave) using geo-replication. >> >This has >> >> >> >been >> >> >> >> > >> >running fine for a couple of months, but yesterday one >of >> >the >> >> >> >master >> >> >> >> > >> >nodes >> >> >> >> > >> >started having unusually high CPU use. It's this >process: >> >> >> >> > >> > >> >> >> >> > >> >root at cafs30:/var/log/glusterfs# ps aux | grep 32048 >> >> >> >> > >> >root 32048 68.7 0.6 1843140 845756 ? 
Rl >02:51 >> >> >> >493:51 >> >> >> >> > >> >python2 >> >> >> >> > >> >> >>/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py >> >> >> >> > >worker >> >> >> >> > >> >gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path >> >> >> >> > >> >/nodirectwritedata/gluster/gvol0 --local-node cafs30 >> >> >> >--local-node-id >> >> >> >> > >> >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id >> >> >> >> > >> >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd >12,11,9,13 >> >> >> >> > >--subvol-num 1 >> >> >> >> > >> >--resource-remote nvfs30 --resource-remote-id >> >> >> >> > >> >1e698ccd-aeec-4ec4-96fe-383da8fc3b78 >> >> >> >> > >> > >> >> >> >> > >> >Here's what is being logged in >> >> >> >> > >> >> >> >> >> >>>/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log: >> >> >> >> > >> > >> >> >> >> > >> >[2020-05-29 21:57:18.843524] I [master(worker >> >> >> >> > >> >/nodirectwritedata/gluster/gvol0):1470:crawl] >_GMaster: >> >slave's >> >> >> >time >> >> >> >> > >> > stime=(1590789408, 0) >> >> >> >> > >> >[2020-05-29 21:57:30.626172] I [master(worker >> >> >> >> > >> >> >> >> >> >>>/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures] >> >> >> >> > >> >_GMaster: Entry not present on master. Fixing gfid >> >mismatch in >> >> >> >> > >slave. >> >> >> >> > >> >Deleting the entry retry_count=1 entry=({u'uid': >> >108, >> >> >> >u'gfid': >> >> >> >> > >> >u'7c0b75e5-d8b7-454f-8010-112d613c599e', u'gid': 117, >> >u'mode': >> >> >> >> > >33204, >> >> >> >> > >> >u'entry': >> >> >> >> > >> >> >>u'.gfid/c5422396-1578-4b50-a29d-315be2a9c5d8/00a859f7xxxx.cfg', >> >> >> >> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, >> >> >> >u'gfid_mismatch': >> >> >> >> > >True, >> >> >> >> > >> >u'slave_name': None, u'slave_gfid': >> >> >> >> > >> >u'ec4b0ace-2ec4-4ea5-adbc-9f519b81917c', >> >u'name_mismatch': >> >> >> >False, >> >> >> >> > >> >u'dst': >> >> >> >> > >> >False}) >> >> >> >> > >> >[2020-05-29 21:57:30.627893] I [master(worker >> >> >> >> > >> >> >> >> >> >>>/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures] >> >> >> >> > >> >_GMaster: Entry not present on master. Fixing gfid >> >mismatch in >> >> >> >> > >slave. >> >> >> >> > >> >Deleting the entry retry_count=1 entry=({u'uid': >> >108, >> >> >> >u'gfid': >> >> >> >> > >> >u'a4d52e40-2e2f-4885-be5f-65fe95a8ebd7', u'gid': 117, >> >u'mode': >> >> >> >> > >33204, >> >> >> >> > >> >u'entry': >> >> >> >> > >> >> >> >> >> > >> >> >> >> >> >> >>>>>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/polycom_00a859f7xxxx.cfg', >> >> >> >> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, >> >> >> >u'gfid_mismatch': >> >> >> >> > >True, >> >> >> >> > >> >u'slave_name': None, u'slave_gfid': >> >> >> >> > >> >u'ece8da77-b5ea-45a7-9af7-7d4d8f55f74a', >> >u'name_mismatch': >> >> >> >False, >> >> >> >> > >> >u'dst': >> >> >> >> > >> >False}) >> >> >> >> > >> >[2020-05-29 21:57:30.629532] I [master(worker >> >> >> >> > >> >> >> >> >> >>>/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures] >> >> >> >> > >> >_GMaster: Entry not present on master. Fixing gfid >> >mismatch in >> >> >> >> > >slave. 
>> >> >> >> > >> >Deleting the entry retry_count=1 entry=({u'uid': >> >108, >> >> >> >u'gfid': >> >> >> >> > >> >u'3c525ad8-aeb2-46b6-9c41-7fb4987916f8', u'gid': 117, >> >u'mode': >> >> >> >> > >33204, >> >> >> >> > >> >u'entry': >> >> >> >> > >> >> >> >> >> > >> >> >> >> >> >> >> >> >> >> >> >>>>>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/00a859f7xxxx-directory.xml', >> >> >> >> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, >> >> >> >u'gfid_mismatch': >> >> >> >> > >True, >> >> >> >> > >> >u'slave_name': None, u'slave_gfid': >> >> >> >> > >> >u'06717b5a-d842-495d-bd25-aab9cd454490', >> >u'name_mismatch': >> >> >> >False, >> >> >> >> > >> >u'dst': >> >> >> >> > >> >False}) >> >> >> >> > >> >[2020-05-29 21:57:30.659123] I [master(worker >> >> >> >> > >> >> >>/nodirectwritedata/gluster/gvol0):942:handle_entry_failures] >> >> >> >> > >_GMaster: >> >> >> >> > >> >Sucessfully fixed entry ops with gfid mismatch >> >> >> >retry_count=1 >> >> >> >> > >> >[2020-05-29 21:57:30.659343] I [master(worker >> >> >> >> > >> >/nodirectwritedata/gluster/gvol0):1194:process_change] >> >> >> >_GMaster: >> >> >> >> > >Retry >> >> >> >> > >> >original entries. count = 1 >> >> >> >> > >> >[2020-05-29 21:57:30.725810] I [master(worker >> >> >> >> > >> >/nodirectwritedata/gluster/gvol0):1197:process_change] >> >> >> >_GMaster: >> >> >> >> > >> >Sucessfully fixed all entry ops with gfid mismatch >> >> >> >> > >> >[2020-05-29 21:57:31.747319] I [master(worker >> >> >> >> > >> >/nodirectwritedata/gluster/gvol0):1954:syncjob] >Syncer: >> >Sync >> >> >> >Time >> >> >> >> > >Taken >> >> >> >> > >> >duration=0.7409 num_files=18 job=1 return_code=0 >> >> >> >> > >> > >> >> >> >> > >> >We've verified that the files like >> >polycom_00a859f7xxxx.cfg >> >> >> >referred >> >> >> >> > >to >> >> >> >> > >> >in >> >> >> >> > >> >the error do exist on the master nodes and slave. >> >> >> >> > >> > >> >> >> >> > >> >We found this bug fix: >> >> >> >> > >> >https://bugzilla.redhat.com/show_bug.cgi?id=1642865 >> >> >> >> > >> > >> >> >> >> > >> >However that fix went in 5.1, and we're running 5.12 >on >> >the >> >> >> >master >> >> >> >> > >> >nodes >> >> >> >> > >> >and slave. A couple of GlusterFS clients connected to >the >> >> >> >master >> >> >> >> > >nodes >> >> >> >> > >> >are >> >> >> >> > >> >running 5.13. >> >> >> >> > >> > >> >> >> >> > >> >Would anyone have any suggestions? Thank you in >advance. 
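For reference, here is a minimal sketch of how the gfid comparison discussed above could be done, assuming root access to the brick directories on a master node and on the slave. The master brick path below comes from the worker command line quoted in this thread; the slave brick path and the file's location under the brick are placeholders, and the file name keeps the redaction used above:

# On a master node, read the gfid xattr directly from the brick
# (trusted.* xattrs are only visible on the brick, not via the FUSE mount):
getfattr -n trusted.gfid -e hex /nodirectwritedata/gluster/gvol0/<path-under-brick>/polycom_00a859f7xxxx.cfg

# On the slave, run the same check against the slave's brick path (placeholder):
getfattr -n trusted.gfid -e hex /<slave-brick-path>/<path-under-brick>/polycom_00a859f7xxxx.cfg

# If the two hex values differ, the file was recreated or replaced at some
# point, which matches the "Fixing gfid mismatch in slave" entries above.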
>> >> > --
>> >> > David Cunningham, Voisonics Limited
>> >> > http://voisonics.com/
>> >> > USA: +1 213 221 1092
>> >> > New Zealand: +64 (0)28 2558 3782
>> >>
>> >> --
>> >> sankarshan at kadalu.io | TZ: UTC+0530 | +91 99606 03294
>> >> kadalu.io : Making it easy to provision storage in k8s!
>> >>
>>
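On the question of why only one master node shows the busy gsyncd.py worker: in a replicated master volume only one brick's worker per replica set is Active while the others stay Passive, so the sync load lands on a single node. A quick way to confirm which node that is, using the volume and slave names from the examples in this thread:

# Prints one row per master brick with a STATUS column (Active/Passive/Faulty),
# plus crawl status and last-synced time for the Active worker:
gluster volume geo-replication gvol0 nvfs10::gvol0 status detail

If the Active role sits on the node with the high CPU usage, that alone would explain why the other master nodes only log the occasional config-get/status lines.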