Am I misunderstanding cluster.read-subvolume/cluster.read-subvolume-index?

I have two regions, "A" and "B", with servers "a" and "b" in each region, respectively. I have clients in both regions. Intra-region communication is fast, but the pipe between the regions is terrible. I'd like to keep inter-region communication as close to glusterfs write operations only as possible, and have reads go to the server in the region the client is running in.

I have created a replica volume as:

gluster volume create gv0 replica 2 a:/data/brick1/gv0 b:/data/brick1/gv0 force

As a baseline, if I use scp to copy from the brick directly, I get -- for a 100M file -- times of about 6s if the client scps from the server in the same region and anywhere from 3 to 5 minutes if the client scps from the server in the other region.

I was under the impression (from something I read but can't now find) that glusterfs automatically picks the fastest replica, but that has not been my experience; glusterfs seems to generally prefer the server in the other region over the "local" one, with times usually in excess of 4 minutes.

I've also tried having clients mount the volume using the "xlator" options cluster.read-subvolume and cluster.read-subvolume-index, but neither seems to have any impact. Here are sample mount commands to show what I'm attempting:

mount -t glusterfs -o xlator-option=cluster.read-subvolume=gv0-client-<0 or 1> a:/gv0 /mnt/glusterfs
mount -t glusterfs -o xlator-option=cluster.read-subvolume-index=<0 or 1> a:/gv0 /mnt/glusterfs

Am I misunderstanding how glusterfs works, particularly when trying to "read locally"? Is it possible to configure glusterfs to use a local replica (or the "fastest replica") for reads?
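(For anyone reproducing this, a simple way to compare the glusterfs read path against the scp baseline is to time a read through the mount itself. The sketch below is illustrative only; the file name and mount point are assumptions.)

# Illustrative only -- file name and mount point are assumptions.
# Drop the client's page cache (as root) so the data really comes from a brick,
# then time a read of a 100M file through the glusterfs mount.
echo 3 > /proc/sys/vm/drop_caches
time dd if=/mnt/glusterfs/test-100M of=/dev/null bs=1M

If the time through the mount tracks the slow inter-region scp, the read is being served by the remote brick.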
On 6/8/2015 5:55 PM, Brian Ericson wrote:
> Am I misunderstanding cluster.read-subvolume/cluster.read-subvolume-index?
>
> I have two regions, "A" and "B", with servers "a" and "b" in each region, respectively. I have clients in both regions. Intra-region communication is fast, but the pipe between the regions is terrible. I'd like to keep inter-region communication as close to glusterfs write operations only as possible, and have reads go to the server in the region the client is running in.
>
> I have created a replica volume as:
>
> gluster volume create gv0 replica 2 a:/data/brick1/gv0 b:/data/brick1/gv0 force
>
> As a baseline, if I use scp to copy from the brick directly, I get -- for a 100M file -- times of about 6s if the client scps from the server in the same region and anywhere from 3 to 5 minutes if the client scps from the server in the other region.
>
> I was under the impression (from something I read but can't now find) that glusterfs automatically picks the fastest replica, but that has not been my experience; glusterfs seems to generally prefer the server in the other region over the "local" one, with times usually in excess of 4 minutes.
>
> I've also tried having clients mount the volume using the "xlator" options cluster.read-subvolume and cluster.read-subvolume-index, but neither seems to have any impact. Here are sample mount commands to show what I'm attempting:
>
> mount -t glusterfs -o xlator-option=cluster.read-subvolume=gv0-client-<0 or 1> a:/gv0 /mnt/glusterfs
> mount -t glusterfs -o xlator-option=cluster.read-subvolume-index=<0 or 1> a:/gv0 /mnt/glusterfs
>
> Am I misunderstanding how glusterfs works, particularly when trying to "read locally"? Is it possible to configure glusterfs to use a local replica (or the "fastest replica") for reads?

I am not a developer, nor intimately familiar with the insides of glusterfs, but here is how I understand that glusterfs-fuse file reads work.

First, all replica bricks are checked to make sure they are consistent. (If not, gluster tries to make them consistent before proceeding.) After consistency is established, the actual read occurs from the brick with the shortest response time. I don't know when or how the response time is measured, but it seems to work for most people most of the time. (If the client is on one of the brick hosts, it will almost always read from the local brick.)

If the reads involve a lot of small files, the consistency check may be what is killing your response times, rather than the read of the file itself. Even over a fast LAN, the consistency checks can take many times the actual read time of the file.

Hopefully others will chime in with more information, but if you can supply more detail about what you are reading, that will help too. Are you reading entire files, or just reading a lot of "snippets", or what?

Ted Miller
Elkhart, IN, USA
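(If per-file overhead is the suspect, a rough and purely illustrative comparison is to time one large read against many small ones through the mount; the paths below are assumptions.)

# Illustrative only -- file and directory names are assumptions.
# One 100M file: time is dominated by data transfer.
time dd if=/mnt/glusterfs/big-100M of=/dev/null bs=1M
# Many small files totalling roughly the same size: time is dominated by
# per-file lookups/consistency checks if those are the bottleneck.
time cat /mnt/glusterfs/small-files/* > /dev/null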
> Am I misunderstanding cluster.read-subvolume/cluster.read-subvolume-index?
>
> I have two regions, "A" and "B", with servers "a" and "b" in each region, respectively. I have clients in both regions. Intra-region communication is fast, but the pipe between the regions is terrible. I'd like to keep inter-region communication as close to glusterfs write operations only as possible, and have reads go to the server in the region the client is running in.
>
> I have created a replica volume as:
>
> gluster volume create gv0 replica 2 a:/data/brick1/gv0 b:/data/brick1/gv0 force
>
> As a baseline, if I use scp to copy from the brick directly, I get -- for a 100M file -- times of about 6s if the client scps from the server in the same region and anywhere from 3 to 5 minutes if the client scps from the server in the other region.
>
> I was under the impression (from something I read but can't now find) that glusterfs automatically picks the fastest replica, but that has not been my experience; glusterfs seems to generally prefer the server in the other region over the "local" one, with times usually in excess of 4 minutes.

The choice of which replica to read from has become rather complicated over time. The first parameter that matters is cluster.read-hash-mode, which selects between dynamic and (two forms of) static selection. For the default mode, we try to spread the read load across replicas based on both the file's ID and the client's. For read-hash-mode=0 *only*, we do this:

 * If "choose-local" is set (as it is by default) and there's a local replica, use that.

 * Otherwise, select a replica based on fastest *initial* response.

Note that these are both a bit prone to hot spots, which is why this method is not the default. Also, re-evaluating response times is as likely to lead to "mobile hotspot" behavior as anything else - clients keep following each other around to previously idle but now overloaded replicas, moving the congestion around but never resolving it. Thus, we only tend to re-evaluate in response to brick up/down events. Probably some room for improvement here.

That brings us to read-subvolume and read-subvolume-index. The difference between them is that read-subvolume takes a translator *name* (which you'd have to get from the volfile) and only applies to one replica set within a volume, so it's really only useful for testing and debugging. By contrast, read-subvolume-index applies to all replica sets in a volume and doesn't require any knowledge of translator names. Either one is applied *before* read-hash-mode; if it's set, and if the corresponding replica is up, that replica will be chosen. Yes, it's a bit of a mess. However, as you've clearly guessed, this is a pretty critical decision, so it's nice to have many different ways to control it.

> I've also tried having clients mount the volume using the "xlator" options cluster.read-subvolume and cluster.read-subvolume-index, but neither seems to have any impact. Here are sample mount commands to show what I'm attempting:
>
> mount -t glusterfs -o xlator-option=cluster.read-subvolume=gv0-client-<0 or 1> a:/gv0 /mnt/glusterfs
> mount -t glusterfs -o xlator-option=cluster.read-subvolume-index=<0 or 1> a:/gv0 /mnt/glusterfs

I would guess that the translator options are somehow not being passed all the way through to the translator that actually makes the decision. If they are being passed, they definitely should "force the decision" as described above.
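(Two things that may be worth trying, sketched below but not verified on this setup. The volume-set option names are assumed to be accepted by "gluster volume set" on your GlusterFS version, and "gv0-replicate-0" is whatever the AFR translator is actually called in your client volfile.)

# Volume-wide knobs -- these affect *all* clients, so by themselves they
# cannot pick a different "local" brick per region:
gluster volume set gv0 cluster.choose-local on
gluster volume set gv0 cluster.read-hash-mode 0

# Per-client: pass the option at mount time using the translator's actual
# name from the client volfile (or a wildcard), in case the bare "cluster."
# prefix is not reaching AFR:
mount -t glusterfs -o xlator-option=gv0-replicate-0.read-subvolume-index=0 a:/gv0 /mnt/glusterfs
mount -t glusterfs -o "xlator-option=*replicate*.read-subvolume-index=1" b:/gv0 /mnt/glusterfs

Whether the mount-time form actually reaches AFR is exactly the open question above; checking the client log after mounting should show whether the option was accepted or rejected.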
There might be a bug here, or perhaps I'm just misunderstanding code I haven't read in a while.

Also, please note that synchronous replication (AFR) isn't really intended or expected to work over long distances. Anything over 5ms RTT is risky territory; that's why we have separate geo-replication.
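(For reference, a geo-replication session between the two regions would look roughly like the following. This is only a sketch: the slave volume name is an assumption, a separate slave volume must already exist in region B, and the prerequisite passwordless-ssh/pem setup between the sites is not shown.)

# Illustrative only -- "gv0-slave" is an assumed slave volume on host b.
gluster volume geo-replication gv0 b::gv0-slave create push-pem
gluster volume geo-replication gv0 b::gv0-slave start
gluster volume geo-replication gv0 b::gv0-slave status

Geo-replication is asynchronous and one-way, so clients in region B would read a (possibly slightly stale) copy rather than a synchronous replica.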