Christian Rice
2015-Sep-01 19:15 UTC
[Gluster-users] Gluster 3.6.3 performance.cache-size not working as expected in some cases
This is still an issue for me. I don't need anyone to tear the code apart, but I'd be grateful if someone would even chime in and say "yeah, we've seen that too."

From: Christian Rice <crice at pandora.com>
Date: Sunday, August 30, 2015 at 11:18 PM
To: "gluster-users at gluster.org" <gluster-users at gluster.org>
Subject: [Gluster-users] Gluster 3.6.3 performance.cache-size not working as expected in some cases

I am confused about my caching problem. I'll try to keep this as straightforward as possible and include the basic details...

I have a sixteen-node distributed volume, one brick per node, XFS isize=512, Debian 7/Wheezy, 32GB RAM minimum. Every brick node is also a gluster client and, importantly, an HTTP server. We use a back-end 1GbE network for gluster traffic (eth1). A couple dozen gluster client-only systems access this volume as well.

We had a really hot spot on one brick due to an oft-requested file: every time any httpd process on any gluster client was asked to deliver the file, it physically fetched it over the network (we could see this traffic using, say, "iftop -i eth1"), so we thought to increase the volume cache timeout and cache size. We set the following values for testing:

performance.cache-size: 16GB
performance.cache-refresh-timeout: 30

This test was run from a node that didn't have the requested file on its local brick:

while(true); do cat /path/to/file > /dev/null; done

and what had been very high traffic on the gluster backend network, delivering the data repeatedly to my requesting node, dropped to nothing visible.

I thought: good, problem fixed, caching works. My colleague had run a test early on to demonstrate this performance issue, so he ran it again to sign off. His testing used curl, because all the real front-end traffic is HTTP, and all the gluster nodes are web servers, which of course use the FUSE mount to access the document root. Even with our performance tuning, the traffic on the gluster backend subnet was continuous and undiminished. I saw no evidence of caching (again using "iftop -i eth1", which showed a steady 75+% of line rate on a 1GbE link).

Does that make sense at all? We had theorized that we wouldn't get to use the VFS/kernel page cache on any node except perhaps the one holding the data on its local brick. That's what drove us to set the gluster performance cache. But it doesn't seem to come into play with HTTP access.

Volume info:
Volume Name: DOCROOT
Type: Distribute
Volume ID: 3aecd277-4d26-44cd-879d-cffbb1fec6ba
Status: Started
Number of Bricks: 16
Transport-type: tcp
Bricks:
<snipped list of bricks>
Options Reconfigured:
performance.cache-refresh-timeout: 30
performance.cache-size: 16GB

The net result of being overwhelmed by a hot spot is that all the gluster client nodes lose access to the gluster volume; it becomes so busy it hangs. When the traffic goes away (failing health checks cause the load balancers to redirect requests elsewhere), the volume eventually unfreezes and life goes on.

I wish I could type ALL that into a Google query and get a lucid answer :)

Regards,
Christian
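For readers reproducing this, a minimal sketch of the two test paths described above. The gluster volume set invocations mirror the "Options Reconfigured" output; the URL in the curl loop is a hypothetical placeholder for a file served from the FUSE-mounted document root:

    # Cache tuning as applied to the volume
    gluster volume set DOCROOT performance.cache-size 16GB
    gluster volume set DOCROOT performance.cache-refresh-timeout 30

    # FUSE-path test: backend traffic on eth1 dropped to nothing visible
    while true; do cat /path/to/file > /dev/null; done

    # HTTP-path test: backend traffic stayed at 75+% of line rate
    # (URL is a placeholder, not from the thread)
    while true; do curl -s -o /dev/null http://localhost/path/to/file; done

    # Watch the gluster backend link while either loop runs
    iftop -i eth1

Note that both loops ultimately read the same file through the same FUSE mount, since httpd serves out of the FUSE-mounted docroot; that is what makes the divergent cache behavior puzzling.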
Raghavendra Bhat
2015-Sep-02 07:15 UTC
[Gluster-users] Gluster 3.6.3 performance.cache-size not working as expected in some cases
Hi Christian,

I have been working on this for the last couple of days. I have not been able to recreate the issue. I will keep trying to reproduce it and get back to you in a day or two.

Regards,
Raghavendra Bhat

On 09/02/2015 12:45 AM, Christian Rice wrote:
> This is still an issue for me. I don't need anyone to tear the code
> apart, but I'd be grateful if someone would even chime in and say
> "yeah, we've seen that too."
>
> <snipped: full quote of the original message, reproduced above>
Mathieu Chateau
2015-Sep-02 08:46 UTC
[Gluster-users] Gluster 3.6.3 performance.cache-size not working as expected in some cases
Hello,

What I can say from my limited knowledge of gluster:

- If each server mounts the volume from itself (its own name in fstab), it should only ask the other servers for metadata and read file data locally from its own brick. Use backupvolfile-server to provide an alternate volfile server in case of an issue at boot.
- Did you disable the atime flag? Maybe the file gets this attribute updated on each read, which could invalidate the cache (just a hunch).
- If you have a lot of small files and good server hardware, you can increase the number of threads via performance.io-thread-count.

You also have these settings, which control what does or does not get into the cache:

performance.cache-max-file-size 0
performance.cache-min-file-size 0

(A sketch of these suggestions follows at the end of this message.)

Just my 2 cents.

Cordialement,
Mathieu CHATEAU
http://www.lotp.fr

2015-09-01 21:15 GMT+02:00 Christian Rice <crice at pandora.com>:
> This is still an issue for me. I don't need anyone to tear the code apart,
> but I'd be grateful if someone would even chime in and say "yeah, we've
> seen that too."
>
> <snipped: full quote of the original message, reproduced above>
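A minimal sketch of Mathieu's suggestions. The node names, mount points, brick device, and tuning values below are illustrative assumptions, not details from the thread:

    # /etc/fstab on each brick node: mount the volume from itself,
    # with a fallback volfile server for boot-time issues
    # (node01/node02 and the mount point are hypothetical)
    node01:/DOCROOT  /var/www/docroot  glusterfs  defaults,_netdev,backupvolfile-server=node02  0 0

    # Brick filesystem mounted without atime updates (device and path hypothetical)
    /dev/sdb1  /bricks/DOCROOT  xfs  defaults,noatime  0 0

    # More io-threads for small-file workloads, and explicit io-cache size bounds
    # (values are illustrative, not recommendations)
    gluster volume set DOCROOT performance.io-thread-count 32
    gluster volume set DOCROOT performance.cache-min-file-size 0
    gluster volume set DOCROOT performance.cache-max-file-size 64MB

As Mathieu's listing shows, both cache-*-file-size options default to 0, which applies no size-based filtering; setting an explicit maximum is one way to keep very large files from crowding a 16GB io-cache.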