On 01/25/2018 11:04 PM, Dan Ragle wrote:
> *sigh* trying again to correct formatting ... apologize for the
> earlier mess.
>
> Having a memory issue with Gluster 3.12.4 and not sure how to
> troubleshoot. I don't *think* this is expected behavior.
>
> This is on an updated CentOS 7 box. The setup is a simple two node
> replicated layout where the two nodes act as both server and client.
>
> The volume in question:
>
> Volume Name: GlusterWWW
> Type: Replicate
> Volume ID: 8e9b0e79-f309-4d9b-a5bb-45d065faaaa3
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: vs1dlan.mydomain.com:/glusterfs_bricks/brick1/www
> Brick2: vs2dlan.mydomain.com:/glusterfs_bricks/brick1/www
> Options Reconfigured:
> nfs.disable: on
> cluster.favorite-child-policy: mtime
> transport.address-family: inet
>
> I had some other performance options in there, (increased cache-size,
> md invalidation, etc) but stripped them out in an attempt to isolate
> the issue. Still got the problem without them.
>
> The volume currently contains over 1M files.
>
> When mounting the volume, I get (among other things) a process as such:
>
> /usr/sbin/glusterfs --volfile-server=localhost
> --volfile-id=/GlusterWWW /var/www
>
> This process begins with little memory, but then as files are accessed
> in the volume the memory increases. I setup a script that simply reads
> the files in the volume one at a time (no writes). It's been running
> on and off about 12 hours now and the resident memory of the above
> process is already at 7.5G and continues to grow slowly. If I stop the
> test script the memory stops growing, but does not reduce. Restart the
> test script and the memory begins slowly growing again.
>
> This is obviously a contrived app environment. With my intended
> application load it takes about a week or so for the memory to get
> high enough to invoke the oom killer.

Can you try debugging with the statedump
(https://gluster.readthedocs.io/en/latest/Troubleshooting/statedump/#read-a-statedump)
of the fuse mount process and see what member is leaking? Take the
statedumps in succession, maybe once initially during the I/O and once
the memory gets high enough to hit the OOM mark. Share the dumps here.

Regards,
Ravi

> Is there potentially something misconfigured here?
>
> I did see a reference to a memory leak in another thread in this list,
> but that had to do with the setting of quotas, I don't have any quotas
> set on my system.
>
> Thanks,
>
> Dan Ragle
> daniel at Biblestuph.com

_______________________________________________
Gluster-users mailing list
Gluster-users at gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-users
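A minimal sketch of how the "statedumps in succession" suggested above might be automated. It assumes Linux, and it assumes (per the Gluster statedump documentation) that a glusterfs client process writes a statedump when it receives SIGUSR1, by default under /var/run/gluster. The function names, the count of four dumps, and the two-hour interval are illustrative choices, not something taken from the thread.

```python
import os
import signal
import time


def parse_vmrss_kb(status_text):
    """Return the VmRSS value (in kB) from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:    7864320 kB".
            return int(line.split()[1])
    return None


def read_rss_kb(pid):
    """Read the resident set size of a process from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_vmrss_kb(handle.read())


def request_statedump(pid):
    """Ask a glusterfs client process for a statedump by sending SIGUSR1.

    Assumption: the target really is a glusterfs mount process. Other
    programs may terminate on SIGUSR1, so double-check the pid first.
    """
    os.kill(pid, signal.SIGUSR1)


def sample(pid, count=4, interval_seconds=2 * 60 * 60):
    """Request `count` statedumps, `interval_seconds` apart, logging RSS each time."""
    for index in range(count):
        rss = read_rss_kb(pid)
        request_statedump(pid)
        print(f"{time.ctime()}: dump {index + 1}/{count}, RSS {rss} kB")
        if index + 1 < count:
            time.sleep(interval_seconds)


# Example (hypothetical pid of the /usr/sbin/glusterfs mount process):
# sample(12345)
```

Logging the resident size next to each dump makes it easier to line up memory growth with whatever counters change between dumps.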
On 1/25/2018 8:21 PM, Ravishankar N wrote:
> Can you try debugging with the statedump
> (https://gluster.readthedocs.io/en/latest/Troubleshooting/statedump/#read-a-statedump)
> of the fuse mount process and see what member is leaking? Take the
> statedumps in succession, maybe once initially during the I/O and
> once the memory gets high enough to hit the OOM mark.
> Share the dumps here.
>
> Regards,
> Ravi

Thanks for the reply. I noticed yesterday that an update (3.12.5) had
been posted so I went ahead and updated and repeated the test
overnight. The memory usage does not appear to be growing as quickly
as it was with 3.12.4, but does still appear to be growing.

I should also mention that there is another process beyond my test app
that is reading the files from the volume. Specifically, there is an
rsync that runs from the second node 2-4 times an hour that reads from
the GlusterWWW volume mounted on node 1. Since none of the files in
that mount are changing it doesn't actually rsync anything, but
nonetheless it is running and reading the files in addition to my test
script. (It's a part of my intended production setup that I forgot was
still running.)

The mount process appears to be gaining memory at a rate of about 1GB
every 4 hours or so. At that rate it'll take several days before it
runs the box out of memory. But I took your suggestion and made some
statedumps today anyway, about 2 hours apart, 4 total so far. It looks
like there may already be some actionable information. These are the
only registers where the num_allocs have grown with each of the four
samples:

[mount/fuse.fuse - usage-type gf_fuse_mt_gids_t memusage]
 ---> num_allocs at Fri Jan 26 08:57:31 2018: 784
 ---> num_allocs at Fri Jan 26 10:55:50 2018: 831
 ---> num_allocs at Fri Jan 26 12:55:15 2018: 877
 ---> num_allocs at Fri Jan 26 14:58:27 2018: 908

[mount/fuse.fuse - usage-type gf_common_mt_fd_lk_ctx_t memusage]
 ---> num_allocs at Fri Jan 26 08:57:31 2018: 5
 ---> num_allocs at Fri Jan 26 10:55:50 2018: 10
 ---> num_allocs at Fri Jan 26 12:55:15 2018: 15
 ---> num_allocs at Fri Jan 26 14:58:27 2018: 17

[cluster/distribute.GlusterWWW-dht - usage-type gf_dht_mt_dht_layout_t memusage]
 ---> num_allocs at Fri Jan 26 08:57:31 2018: 24243596
 ---> num_allocs at Fri Jan 26 10:55:50 2018: 27902622
 ---> num_allocs at Fri Jan 26 12:55:15 2018: 30678066
 ---> num_allocs at Fri Jan 26 14:58:27 2018: 33801036

Not sure the best way to get you the full dumps. They're pretty big,
over 1G for all four. Also, I noticed some filepath information in
there that I'd rather not share. What's the recommended next step?

Cheers!

Dan
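The comparison described in the message above (finding the memusage sections whose num_allocs grew in every sample) could be scripted roughly as follows. This is a sketch only: it assumes each statedump contains sections headed "[... memusage]" followed by "key=value" lines that include "num_allocs=", which is how the memory-accounting sections are laid out in the dumps I have seen. The function names are made up for illustration.

```python
import re

# Matches section headers such as
# "[mount/fuse.fuse - usage-type gf_fuse_mt_gids_t memusage]".
SECTION_RE = re.compile(r"^\[(?P<name>.+ memusage)\]$")


def parse_num_allocs(dump_text):
    """Map each '[... memusage]' section name to its num_allocs value."""
    counts = {}
    current = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Any new section header ends the previous memusage section.
            match = SECTION_RE.match(line)
            current = match.group("name") if match else None
        elif current and line.startswith("num_allocs="):
            counts[current] = int(line.split("=", 1)[1])
    return counts


def steadily_growing(dumps):
    """Return {section: [counts...]} for sections whose num_allocs rose in every successive dump.

    `dumps` is a list of statedump texts in chronological order.
    """
    parsed = [parse_num_allocs(text) for text in dumps]
    if len(parsed) < 2:
        return {}
    # Only sections present in every dump can be compared across all samples.
    common = set(parsed[0]).intersection(*parsed[1:])
    growing = {}
    for name in sorted(common):
        series = [counts[name] for counts in parsed]
        if all(earlier < later for earlier, later in zip(series, series[1:])):
            growing[name] = series
    return growing


# Example (hypothetical file names, oldest first):
# texts = [open(path).read() for path in ["dump1", "dump2", "dump3", "dump4"]]
# for name, series in steadily_growing(texts).items():
#     print(name, series)
```

A counter that rises in every sample is only a hint, not proof of a leak, since caches can also grow for a while before levelling off. It does narrow down which data types are worth asking the developers about.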
On 01/27/2018 02:29 AM, Dan Ragle wrote:
> The mount process appears to be gaining memory at a rate of about 1GB
> every 4 hours or so. At that rate it'll take several days before it
> runs the box out of memory. But I took your suggestion and made some
> statedumps today anyway, about 2 hours apart, 4 total so far. It looks
> like there may already be some actionable information. These are the
> only registers where the num_allocs have grown with each of the four
> samples:
>
> [mount/fuse.fuse - usage-type gf_fuse_mt_gids_t memusage]
>  ---> num_allocs at Fri Jan 26 08:57:31 2018: 784
>  ---> num_allocs at Fri Jan 26 10:55:50 2018: 831
>  ---> num_allocs at Fri Jan 26 12:55:15 2018: 877
>  ---> num_allocs at Fri Jan 26 14:58:27 2018: 908
>
> [mount/fuse.fuse - usage-type gf_common_mt_fd_lk_ctx_t memusage]
>  ---> num_allocs at Fri Jan 26 08:57:31 2018: 5
>  ---> num_allocs at Fri Jan 26 10:55:50 2018: 10
>  ---> num_allocs at Fri Jan 26 12:55:15 2018: 15
>  ---> num_allocs at Fri Jan 26 14:58:27 2018: 17
>
> [cluster/distribute.GlusterWWW-dht - usage-type gf_dht_mt_dht_layout_t
> memusage]
>  ---> num_allocs at Fri Jan 26 08:57:31 2018: 24243596
>  ---> num_allocs at Fri Jan 26 10:55:50 2018: 27902622
>  ---> num_allocs at Fri Jan 26 12:55:15 2018: 30678066
>  ---> num_allocs at Fri Jan 26 14:58:27 2018: 33801036
>
> Not sure the best way to get you the full dumps. They're pretty big,
> over 1G for all four. Also, I noticed some filepath information in
> there that I'd rather not share. What's the recommended next step?

I've CC'd the fuse/dht devs to see if these data types have potential
leaks. Could you raise a bug with the volume info and a (dropbox?) link
from which we can download the dumps? You can remove/replace the
filepaths from them.

Regards.
Ravi
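One possible way to remove or replace the filepaths before uploading the dumps is sketched below. It assumes the sensitive values appear on lines of the form "path=/some/file" (or another key ending in "path"). That is an assumption about the dump layout, so the result should still be checked by eye for paths in other fields. The salt value and the function names are hypothetical.

```python
import hashlib
import re

# Assumed layout: "<key ending in 'path'>=<value>", e.g. "path=/var/www/index.html".
PATH_LINE_RE = re.compile(r"^(?P<key>\s*[\w.-]*path)=(?P<value>.+)$")


def redact_path(value, salt="change-me"):
    """Replace a path with a short salted hash so identical paths stay comparable."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]
    return f"<redacted-{digest}>"


def redact_dump(dump_text, salt="change-me"):
    """Return the dump text with every 'path=' style value replaced.

    Note: a trailing newline in the input is not preserved.
    """
    cleaned = []
    for line in dump_text.splitlines():
        match = PATH_LINE_RE.match(line)
        if match:
            line = f"{match.group('key')}={redact_path(match.group('value'), salt)}"
        cleaned.append(line)
    return "\n".join(cleaned)


# Example (hypothetical file names):
# with open("statedump.txt") as src, open("statedump.redacted.txt", "w") as dst:
#     dst.write(redact_dump(src.read(), salt="pick-a-private-salt"))
```

Hashing rather than blanking the value keeps it possible to see that the same file shows up in several dumps without revealing its name. Using a private salt makes it harder to guess a path by hashing likely candidates.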