thr3ads.net - Gluster users - [Gluster-users] Geo-Replication memory leak on slave node [Jul 2018]

Sunny Kumar
2018-Jul-13 11:35 UTC
[Gluster-users] Geo-Replication memory leak on slave node

Hi Mark,

Currently I am looking at this issue (Kotresh is busy with some other
work) so can you please share the latest log with me.

Thanks,
Sunny


On Fri, Jul 13, 2018 at 12:41 PM Mark Betham <mark.betham at
partnerize.com> wrote:
>
> Hi Kotresh,
>
> I was wondering if you had found any time to take a look at the issue I am
> currently experiencing with geo-replication and memory usage.
>
> If you require any further information then please do not hesitate to ask.
>
> Many thanks,
>
> Mark Betham
>
>
> On Wed, 20 Jun 2018 at 11:27, Mark Betham <mark.betham at
performancehorizon.com> wrote:
>>
>> Hi Kotresh,
>>
>> Many thanks for your prompt response.  No need to apologise, any help
you can provide is greatly appreciated.
>>
>> I look forward to receiving your update next week.
>>
>> Many thanks,
>>
>> Mark Betham
>>
>> On Wed, 20 Jun 2018 at 10:55, Kotresh Hiremath Ravishankar <khiremat
at redhat.com> wrote:
>>>
>>> Hi Mark,
>>>
>>> Sorry, I was busy and could not take a serious look at the logs. I
can update you on Monday.
>>>
>>> Thanks,
>>> Kotresh HR
>>>
>>> On Wed, Jun 20, 2018 at 12:32 PM, Mark Betham <mark.betham at
performancehorizon.com> wrote:
>>>>
>>>> Hi Kotresh,
>>>>
>>>> I was wondering if you had made any progress with regards to
the issue I am currently experiencing with geo-replication.
>>>>
>>>> For info the fault remains and effectively requires a restart
of the geo-replication service on a daily basis to reclaim the used memory on
the slave node.
>>>>
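The daily restart described above could be tied to an actual memory measurement rather than a fixed schedule. The following is a minimal sketch only, assuming a Linux slave node where the text of /proc/<pid>/status is available for the geo-replication process. The function names and the threshold are hypothetical, and how the PID is located is left out.

```python
import re
from typing import Optional


def rss_kib(status_text: str) -> Optional[int]:
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def over_limit(status_text: str, limit_kib: int) -> bool:
    """True when the resident set size is known and exceeds limit_kib."""
    rss = rss_kib(status_text)
    return rss is not None and rss > limit_kib


if __name__ == "__main__":
    # Hypothetical sample; on a real node, read /proc/<pid>/status instead.
    sample = "Name:\tpython2\nVmRSS:\t 1048576 kB\nThreads:\t4\n"
    print(rss_kib(sample), over_limit(sample, limit_kib=524288))
```

Logging this value at intervals would also give a growth curve to attach alongside the memory graphs mentioned in the thread.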
>>>> If you require any further information then please do not
hesitate to ask.
>>>>
>>>> Many thanks,
>>>>
>>>> Mark Betham
>>>>
>>>>
>>>> On Mon, 11 Jun 2018 at 08:24, Mark Betham <mark.betham at
performancehorizon.com> wrote:
>>>>>
>>>>> Hi Kotresh,
>>>>>
>>>>> Many thanks.  I will shortly set up a share on my Google Drive and
>>>>> send the link directly to you.
>>>>>
>>>>> For Info;
>>>>> The Geo-Rep slave failed again over the weekend but it did
not recover this time.  It looks to have become unresponsive at around 14:40 UTC
on 9th June.  I have attached an image showing the mem usage and you can see
from this when the system failed.  The system was totally unresponsive and
required a cold power off and then power on in order to recover the server.
>>>>>
>>>>> Many thanks for your help.
>>>>>
>>>>> Mark Betham.
>>>>>
>>>>> On 11 June 2018 at 05:53, Kotresh Hiremath Ravishankar
<khiremat at redhat.com> wrote:
>>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> Google drive works for me.
>>>>>>
>>>>>> Thanks,
>>>>>> Kotresh HR
>>>>>>
>>>>>> On Fri, Jun 8, 2018 at 3:00 PM, Mark Betham
<mark.betham at performancehorizon.com> wrote:
>>>>>>>
>>>>>>> Hi Kotresh,
>>>>>>>
>>>>>>> The memory issue has recurred.  This suggests it will occur
>>>>>>> around once a day.
>>>>>>>
>>>>>>> Again no traceback listed in the log, the only
update in the log was as follows;
>>>>>>> [2018-06-08 08:26:43.404261] I [resource(slave):1020:service_loop] GLUSTER: connection inactive, stopping timeout=120
>>>>>>> [2018-06-08 08:29:19.357615] I [syncdutils(slave):271:finalize] <top>: exiting.
>>>>>>> [2018-06-08 08:31:02.432002] I [resource(slave):1502:connect] GLUSTER: Mounting gluster volume locally...
>>>>>>> [2018-06-08 08:31:03.716967] I [resource(slave):1515:connect] GLUSTER: Mounted gluster volume duration=1.2729
>>>>>>> [2018-06-08 08:31:03.717411] I [resource(slave):1012:service_loop] GLUSTER: slave listening
>>>>>>>
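For correlating log entries like these with the memory graphs, the lines can be split into fields. This is a rough sketch, assuming the layout seen in the excerpt above (timestamp, level letter, module(role):line:function, tag, message). The regular expression is inferred from those few lines only and is not taken from the gsyncd source.

```python
import re
from datetime import datetime

# Pattern inferred from the log excerpt in this thread (an assumption).
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]) "
    r"\[(?P<module>\w+)\((?P<role>\w+)\):(?P<lineno>\d+):(?P<func>\w+)\] "
    r"(?P<tag>[^:]+): (?P<msg>.*)$"
)


def parse(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return record


if __name__ == "__main__":
    exiting = parse("[2018-06-08 08:29:19.357615] I "
                    "[syncdutils(slave):271:finalize] <top>: exiting.")
    listening = parse("[2018-06-08 08:31:03.717411] I "
                      "[resource(slave):1012:service_loop] GLUSTER: slave listening")
    # Time between the slave worker exiting and listening again.
    print((listening["ts"] - exiting["ts"]).total_seconds())
```

For the two lines used here, the gap between "exiting." and "slave listening" works out to roughly 104 seconds.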
>>>>>>> I have attached an image showing the latest memory
usage pattern.
>>>>>>>
>>>>>>> Can you please advise how I can pass the log data
across to you?  As soon as I know this I will get the data uploaded for your
review.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Mark Betham
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 7 June 2018 at 08:19, Mark Betham
<mark.betham at performancehorizon.com> wrote:
>>>>>>>>
>>>>>>>> Hi Kotresh,
>>>>>>>>
>>>>>>>> Many thanks for your prompt response.
>>>>>>>>
>>>>>>>> Below are my responses to your questions;
>>>>>>>>
>>>>>>>> 1. Is this trace back consistently hit? I just
wanted to confirm whether it's transient which occurs once in a while and
gets back to normal?
>>>>>>>> It appears not.  As soon as the geo-rep
recovered yesterday from the high memory usage it immediately began rising again
until it consumed all of the available ram.  But this time nothing was committed
to the log file.
>>>>>>>> I would like to add here that this current
instance of geo-rep was only brought online at the start of this week due to the
issues with glibc on CentOS 7.5.  This is the first time I have had geo-rep
running with Gluster ver 3.12.9, both storage clusters at each physical site
were only rebuilt approx. 4 weeks ago, due to the previous version in use going
EOL.  Prior to this I had been running 3.13.2 (3.13.X now EOL) at each of the
sites and it is worth noting that the same behaviour was also seen on this
version of Gluster, unfortunately I do not have any of the log data from then
but I do not recall seeing any instances of the trace back message mentioned.
>>>>>>>>
>>>>>>>> 2. Please upload the complete geo-rep logs from
both master and slave.
>>>>>>>> I have the log files, just checking to make
sure there is no confidential info inside.  The logfiles are too big to send via
email, even when compressed.  Do you have a preferred method to allow me to
share this data with you or would a share from my Google drive be sufficient?
>>>>>>>>
>>>>>>>> 3. Are the gluster versions same across master
and slave?
>>>>>>>> Yes, all gluster versions are the same across
the two sites for all storage nodes.  See below for version info taken from the
current geo-rep master.
>>>>>>>>
>>>>>>>> glusterfs 3.12.9
>>>>>>>> Repository revision: git://git.gluster.org/glusterfs.git
>>>>>>>> Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
>>>>>>>> GlusterFS comes with ABSOLUTELY NO WARRANTY.
>>>>>>>> It is licensed to you under your choice of the GNU Lesser
>>>>>>>> General Public License, version 3 or any later version (LGPLv3
>>>>>>>> or later), or the GNU General Public License, version 2 (GPLv2),
>>>>>>>> in all cases as published by the Free Software Foundation.
>>>>>>>>
>>>>>>>> glusterfs-geo-replication-3.12.9-1.el7.x86_64
>>>>>>>> glusterfs-gnfs-3.12.9-1.el7.x86_64
>>>>>>>> glusterfs-libs-3.12.9-1.el7.x86_64
>>>>>>>> glusterfs-server-3.12.9-1.el7.x86_64
>>>>>>>> glusterfs-3.12.9-1.el7.x86_64
>>>>>>>> glusterfs-api-3.12.9-1.el7.x86_64
>>>>>>>> glusterfs-events-3.12.9-1.el7.x86_64
>>>>>>>> centos-release-gluster312-1.0-1.el7.centos.noarch
>>>>>>>> glusterfs-client-xlators-3.12.9-1.el7.x86_64
>>>>>>>> glusterfs-cli-3.12.9-1.el7.x86_64
>>>>>>>> python2-gluster-3.12.9-1.el7.x86_64
>>>>>>>> glusterfs-rdma-3.12.9-1.el7.x86_64
>>>>>>>> glusterfs-fuse-3.12.9-1.el7.x86_64
>>>>>>>>
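A package list like the one above can be checked mechanically when comparing master and slave nodes. This is a small sketch, assuming the usual name-version-release.arch layout of RPM package strings. The helper names are hypothetical, and the list is a subset copied from the message above.

```python
def split_nvra(package):
    """Split 'name-version-release.arch' into its four parts."""
    rest, arch = package.rsplit(".", 1)
    name, version, release = rest.rsplit("-", 2)
    return name, version, release, arch


def mismatched(packages, expected_version):
    """Return Gluster packages whose version differs from expected_version."""
    prefixes = ("glusterfs", "python2-gluster")
    return [
        p for p in packages
        if split_nvra(p)[0].startswith(prefixes)
        and split_nvra(p)[1] != expected_version
    ]


PACKAGES = [
    "glusterfs-geo-replication-3.12.9-1.el7.x86_64",
    "glusterfs-server-3.12.9-1.el7.x86_64",
    "glusterfs-3.12.9-1.el7.x86_64",
    "centos-release-gluster312-1.0-1.el7.centos.noarch",
    "python2-gluster-3.12.9-1.el7.x86_64",
    "glusterfs-fuse-3.12.9-1.el7.x86_64",
]

if __name__ == "__main__":
    print(mismatched(PACKAGES, "3.12.9"))  # an empty list means all match
```

Running the same check against the output from each storage node is one way to confirm the answer to question 3 without comparing the lists by eye.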
>>>>>>>> I have also attached another screenshot showing
the memory usage from the Gluster slave for the last 48 hours.  This shows
memory saturation from yesterday, which correlates with the trace back sent
yesterday, and the subsequent memory saturation which occurred over the last 24
hours.  For info, all times are in UTC.
>>>>>>>>
>>>>>>>> Please advise the preferred method to get the
log data across to you and also if you require any further information.
>>>>>>>>
>>>>>>>> Many thanks,
>>>>>>>>
>>>>>>>> Mark Betham
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7 June 2018 at 04:42, Kotresh Hiremath
Ravishankar <khiremat at redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi Mark,
>>>>>>>>>
>>>>>>>>> Few questions.
>>>>>>>>>
>>>>>>>>> 1. Is this trace back consistently hit? I
just wanted to confirm whether it's transient which occurs once in a while
and gets back to normal?
>>>>>>>>> 2. Please upload the complete geo-rep logs
from both master and slave.
>>>>>>>>> 3. Are the gluster versions same across
master and slave?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Kotresh HR
>>>>>>>>>
>>>>>>>>> On Wed, Jun 6, 2018 at 7:10 PM, Mark Betham
<mark.betham at performancehorizon.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Dear Gluster-Users,
>>>>>>>>>>
>>>>>>>>>> I have geo-replication set up and configured between 2 Gluster pools
>>>>>>>>>> located at different sites.  What I am seeing is an error being
>>>>>>>>>> reported within the geo-replication slave log, as follows:
>>>>>>>>>>
>>>>>>>>>> [2018-06-05 12:05:26.767615] E [syncdutils(slave):331:log_raise_exception] <top>: FAIL:
>>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>>   File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 361, in twrap
>>>>>>>>>>     tf(*aa)
>>>>>>>>>>   File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1009, in <lambda>
>>>>>>>>>>     t = syncdutils.Thread(target=lambda: (repce.service_loop(),
>>>>>>>>>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 90, in service_loop
>>>>>>>>>>     self.q.put(recv(self.inf))
>>>>>>>>>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 61, in recv
>>>>>>>>>>     return pickle.load(inf)
>>>>>>>>>> ImportError: No module named h_2013-04-26-04:02:49-2013-04-26_11:02:53.gz.15WBuUh
>>>>>>>>>> [2018-06-05 12:05:26.768085] E [repce(slave):117:worker] <top>: call failed:
>>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>>   File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
>>>>>>>>>>     res = getattr(self.obj, rmeth)(*in_data[2:])
>>>>>>>>>> TypeError: getattr(): attribute name must be string
>>>>>>>>>>
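The ImportError in the traceback above is raised inside pickle.load, and the "module" it names looks like part of a file name. One way that can happen in general is when the unpickler reads bytes that are not the start of a valid pickled message: a stray "c" byte is the protocol 0 GLOBAL opcode, and the text following it is treated as a module name to import. The sketch below only illustrates that pickle behaviour on Python 3. Whether this is what actually happened on the slave is an assumption, and the module name used is a made-up placeholder.

```python
import pickle


def load_stray_bytes(module_name):
    """Unpickle a GLOBAL opcode that names a module which does not exist."""
    # b"c" is the GLOBAL opcode: c<module>\n<attribute>\n, and b"." is STOP.
    payload = b"c" + module_name.encode("ascii") + b"\nanything\n."
    try:
        pickle.loads(payload)
    except ImportError as exc:  # ModuleNotFoundError is a subclass
        return "{}: {}".format(type(exc).__name__, exc)
    return "loaded without error"


if __name__ == "__main__":
    # Hypothetical name standing in for file-name text in the stream.
    print(load_stray_bytes("h_2013_04_26_example"))
```

If the stream between master and slave did lose alignment in this way, later reads would also return garbage, which would be consistent with the second traceback, where the method name passed to getattr() is not a string.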
>>>>>>>>>> From this point in time the slave
server begins to consume all of its available RAM until it becomes
non-responsive.  Eventually the gluster service seems to kill off the offending
process and the memory is returned to the system.  Once the memory has been
returned to the remote slave system the geo-replication often recovers and data
transfer resumes.
>>>>>>>>>>
>>>>>>>>>> I have attached the full
geo-replication slave log containing the error shown above.  I have also
attached an image file showing the memory usage of the affected storage server.
>>>>>>>>>>
>>>>>>>>>> We are currently running Gluster
version 3.12.9 on top of CentOS 7.5 x86_64.  The system has been fully patched
and is running the latest software, excluding glibc which had to be downgraded
to get geo-replication working.
>>>>>>>>>>
>>>>>>>>>> The Gluster volume runs on a dedicated
partition using the XFS filesystem which in turn is running on a LVM thin
volume.  The physical storage is presented as a single drive due to the
underlying disks being part of a raid 10 array.
>>>>>>>>>>
>>>>>>>>>> The Master volume which is being
replicated has a total of 2.2 TB of data to be replicated.  The total size of
the volume fluctuates very little as data being removed equals the new data
coming in.  This data is made up of many thousands of files across many
separate directories.  Data file sizes vary from the very small (<1K) to the
large (>1GB).  The Gluster service itself is running with a single volume in
a replicated configuration across 3 bricks at each of the sites.  The delta
changes being replicated are on average about 100GB per day, where this includes
file creation / deletion / modification.
>>>>>>>>>>
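To put the figures above in proportion, the daily churn and the average transfer rate they imply can be worked out directly. This is a back-of-the-envelope sketch, assuming decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and changes spread evenly over 24 hours, which the real workload almost certainly is not.

```python
TOTAL_BYTES = 2.2e12        # 2.2 TB on the master volume
DAILY_DELTA_BYTES = 100e9   # about 100 GB of changes per day
SECONDS_PER_DAY = 24 * 60 * 60


def daily_churn_percent():
    """Share of the volume that changes per day, in percent."""
    return DAILY_DELTA_BYTES / TOTAL_BYTES * 100


def average_mbit_per_second():
    """Average link rate needed to ship one day's delta in one day."""
    return DAILY_DELTA_BYTES * 8 / SECONDS_PER_DAY / 1e6


if __name__ == "__main__":
    print(round(daily_churn_percent(), 1))      # 4.5
    print(round(average_mbit_per_second(), 2))  # 9.26
```

So roughly 4.5% of the volume changes each day, and the sustained average rate is a little over 9 Mbit/s, before any rsync or SSH overhead.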
>>>>>>>>>> The config for the geo-replication
session is as follows, taken from the current source server;
>>>>>>>>>>
>>>>>>>>>> special_sync_mode: partial
>>>>>>>>>> gluster_log_file: /var/log/glusterfs/geo-replication/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1.gluster.log
>>>>>>>>>> ssh_command: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
>>>>>>>>>> change_detector: changelog
>>>>>>>>>> session_owner: 40e9e77a-034c-44a2-896e-59eec47e8a84
>>>>>>>>>> state_file: /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/monitor.status
>>>>>>>>>> gluster_params: aux-gfid-mount acl
>>>>>>>>>> log_rsync_performance: true
>>>>>>>>>> remote_gsyncd: /nonexistent/gsyncd
>>>>>>>>>> working_dir: /var/lib/misc/glusterfsd/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1
>>>>>>>>>> state_detail_file: /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1-detail.status
>>>>>>>>>> gluster_command_dir: /usr/sbin/
>>>>>>>>>> pid_file: /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/monitor.pid
>>>>>>>>>> georep_session_working_dir: /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/
>>>>>>>>>> ssh_command_tar: ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
>>>>>>>>>> master.stime_xattr_name: trusted.glusterfs.40e9e77a-034c-44a2-896e-59eec47e8a84.ccfaed9b-ff4b-4a55-acfa-03f092cdf460.stime
>>>>>>>>>> changelog_log_file: /var/log/glusterfs/geo-replication/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1-changes.log
>>>>>>>>>> socketdir: /var/run/gluster
>>>>>>>>>> volume_id: 40e9e77a-034c-44a2-896e-59eec47e8a84
>>>>>>>>>> ignore_deletes: false
>>>>>>>>>> state_socket_unencoded: /var/lib/glusterd/geo-replication/glustervol0_storage-server.local_glustervol1/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1.socket
>>>>>>>>>> log_file: /var/log/glusterfs/geo-replication/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1.log
>>>>>>>>>>
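Several of the paths in the config listing above embed the slave URL in percent-encoded form, which makes them hard to read. As a small sketch, assuming the listing is plain "key: value" text as shown, the following parses it into a dict and decodes the file-name part of a path. The function names are hypothetical, and the sample is a three-line subset of the listing.

```python
import posixpath
from urllib.parse import unquote


def parse_config(text):
    """Parse 'key: value' lines into a dict, ignoring anything else."""
    config = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            config[key.strip()] = value.strip()
    return config


def decoded_basename(path):
    """Percent-decode only the final component of a path."""
    return unquote(posixpath.basename(path))


SAMPLE = """\
change_detector: changelog
log_file: /var/log/glusterfs/geo-replication/glustervol0/ssh%3A%2F%2Froot%40storage-server.local%3Agluster%3A%2F%2F127.0.0.1%3Aglustervol1.log
ignore_deletes: false
"""

if __name__ == "__main__":
    config = parse_config(SAMPLE)
    print(decoded_basename(config["log_file"]))
```

Decoded, the log file name reads ssh://root@storage-server.local:gluster://127.0.0.1:glustervol1.log, which shows the slave host and volume the session points at.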
>>>>>>>>>> If any further information is required
in order to troubleshoot this issue then please let me know.
>>>>>>>>>>
>>>>>>>>>> I would be very grateful for any help
or guidance received.
>>>>>>>>>>
>>>>>>>>>> Many thanks,
>>>>>>>>>>
>>>>>>>>>> Mark Betham.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This email may contain confidential
material; unintended recipients must not disseminate, use, or act upon any
information in it. If you received this email in error, please contact the
sender and permanently delete the email.
>>>>>>>>>> Performance Horizon Group Limited |
Registered in England & Wales 07188234 | Level 8, West One, Forth Banks,
Newcastle upon Tyne, NE1 3PA
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
_______________________________________________
>>>>>>>>>> Gluster-users mailing list
>>>>>>>>>> Gluster-users at gluster.org
>>>>>>>>>>
http://lists.gluster.org/mailman/listinfo/gluster-users
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks and Regards,
>>>>>>>>> Kotresh H R
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> MARK BETHAM
>>>>>>>> Senior System Administrator
>>>>>>>> +44 (0) 191 261 2444
>>>>>>>> performancehorizon.com
>>>>>>>> PerformanceHorizon
>>>>>>>> tweetphg
>>>>>>>> performance-horizon-group