thr3ads.net - Gluster users - [Gluster-users] disperse volume file to subvolume mapping [Apr 2016]

Serkan Çoban

2016-Apr-21 12:23 UTC

[Gluster-users] disperse volume file to subvolume mapping

> Has the rebalance operation finished successfully? Has it skipped any files?

Yes, according to gluster v rebalance status it completed without any errors.
The rebalance status report looks like this:

Node        Rebalanced files   Size     Scanned   Failures   Skipped
1.1.1.185   158                29GB     1720      0          314
1.1.1.205   93                 46.5GB   761       0          95
1.1.1.225   74                 37GB     779       0          94


All other hosts show 0 for every column.

I double-checked that the files with '---------T' attributes are still there;
maybe some of them were deleted, but I still see them in the bricks...
I am also concerned about why the part files are not distributed to all 60 nodes.
Shouldn't rebalance do that?
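
The '---------T' entries mentioned above are DHT link files: zero-length regular files whose only mode bit is the sticky bit. A minimal Python sketch for listing such candidates under a brick directory follows. The function names are hypothetical, and the check uses only mode and size; as far as I know, real link files also carry a trusted.glusterfs.dht.linkto extended attribute, which this sketch does not inspect.

```python
import os
import stat


def looks_like_linkfile(mode, size):
    """True for a zero-length regular file whose mode string is '---------T'."""
    return stat.S_ISREG(mode) and stat.S_IMODE(mode) == stat.S_ISVTX and size == 0


def find_linkfile_candidates(brick_root):
    """Walk a brick directory and collect paths that look like link files."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(brick_root):
        # Skip gluster's internal metadata directory at the brick root.
        if ".glusterfs" in dirnames:
            dirnames.remove(".glusterfs")
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if looks_like_linkfile(st.st_mode, st.st_size):
                hits.append(path)
    return hits


if __name__ == "__main__":
    # Brick path taken from the volume layout quoted in this thread.
    for path in find_linkfile_candidates("/bricks/02"):
        print(path)
```

Run on each node, a count of the returned paths before and after a rebalance would show whether the link files are actually being cleaned up.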

On Thu, Apr 21, 2016 at 1:55 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
> Hi Serkan,
>
> On 21/04/16 12:39, Serkan Çoban wrote:
>>
>> I started a gluster v rebalance v0 start command hoping that it would
>> redistribute files equally across the 60 nodes, but it did not do that...
>> Why did it not redistribute the files? Any thoughts?
>
>
> Has the rebalance operation finished successfully? Has it skipped any
> files?
>
> After a successful rebalance, all files with attributes '---------T'
> should have disappeared.
>
>
>>
>> On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez
>> <xhernandez at datalab.es> wrote:
>>>
>>> Hi Serkan,
>>>
>>> On 21/04/16 10:07, Serkan Çoban wrote:
>>>>>
>>>>>
>>>>> I think the problem is in the temporary name that distcp gives to the
>>>>> file while it's being copied before renaming it to the real name. Do
>>>>> you know what the structure of this name is?
>>>>
>>>>
>>>> The distcp temporary file name format is
>>>> ".distcp.tmp.attempt_1460381790773_0248_m_000001_0", and the same
>>>> temporary file name is reused by one map process. For example, I see in
>>>> the logs that one map copies the files part-m-00031, part-m-00047 and
>>>> part-m-00063 sequentially, and they all use the temporary file name
>>>> above. So the original file name does not appear in the temporary file
>>>> name.
>>>
>>>
>>>
>>> This explains the problem. With the default options, DHT sends all files
>>> to the subvolume that should store a file named 'distcp.tmp'.
>>>
>>> With this temporary name format, little can be done.
>>>
>>>>
>>>> I will check if we can modify distcp behaviour, or whether we have to
>>>> write our own mapreduce procedures instead of using distcp.
>>>>
>>>>> 2. define the option 'extra-hash-regex' to an expression that matches
>>>>> your temporary file names and returns the same name that they will
>>>>> finally have. Depending on the differences between original and
>>>>> temporary file names, this option could be useless.
>>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
>>>>> name conversion, so the files will be evenly distributed. However this
>>>>> will cause a lot of files to be placed in incorrect subvolumes,
>>>>> creating a lot of link files until a rebalance is executed.
>>>>
>>>>
>>>>
>>>> How can I set these options?
>>>
>>>
>>>
>>> You can set gluster options using:
>>>
>>> gluster volume set <volname> <option> <value>
>>>
>>> for example:
>>>
>>> gluster volume set v0 rsync-hash-regex none
>>>
>>> Xavi
>>>
>>>
>>>>
>>>>
>>>>
>>>> On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez
>>>> <xhernandez at datalab.es> wrote:
>>>>>
>>>>>
>>>>> Hi Serkan,
>>>>>
>>>>> I think the problem is in the temporary name that distcp gives to the
>>>>> file while it's being copied before renaming it to the real name. Do
>>>>> you know what the structure of this name is?
>>>>>
>>>>> DHT selects the subvolume (in this case the ec set) on which the file
>>>>> will be stored based on the name of the file. This is a problem when a
>>>>> file is renamed, because the rename could change the subvolume where
>>>>> the file should be found.
>>>>>
>>>>> DHT has a feature to avoid incorrect file placements when executing
>>>>> renames for the rsync case. It checks whether the file name matches the
>>>>> following regular expression:
>>>>>
>>>>>       ^\.(.+)\.[^.]+$
>>>>>
>>>>> If a match is found, it only considers the part between parentheses to
>>>>> calculate the destination subvolume.
>>>>>
>>>>> This is useful for rsync because temporary file names are constructed
>>>>> in the following way: suppose the original filename is 'test'. The
>>>>> temporary filename while rsync is being executed is made by prepending
>>>>> a dot and appending '.<random chars>': .test.712hd
>>>>>
>>>>> As you can see, the original name and the part of the name between
>>>>> parentheses that matches the regular expression are the same. As a
>>>>> result, after renaming the temporary file to its original filename,
>>>>> DHT considers both names to belong to the same subvolume.
>>>>>
>>>>> In your case it's very probable that distcp uses a temporary name like
>>>>> '.part.<number>'. In this case the portion of the name used to select
>>>>> the subvolume is always 'part'. This would explain why all files go to
>>>>> the same subvolume. Once the file is renamed to another name, DHT
>>>>> realizes that it should go to another subvolume. At this point it
>>>>> creates a link file (those files with access rights = '---------T') in
>>>>> the correct subvolume but it doesn't move the file itself. As you can
>>>>> see, these link files are better balanced.
>>>>>
>>>>> To solve this problem you have three options:
>>>>>
>>>>> 1. change the temporary filename used by distcp to correctly match the
>>>>> regular expression. I'm not sure if this can be configured, but if it
>>>>> is possible, this is the best option.
>>>>>
>>>>> 2. define the option 'extra-hash-regex' to an expression that matches
>>>>> your temporary file names and returns the same name that they will
>>>>> finally have. Depending on the differences between original and
>>>>> temporary file names, this option could be useless.
>>>>>
>>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
>>>>> name conversion, so the files will be evenly distributed. However this
>>>>> will cause a lot of files to be placed in incorrect subvolumes,
>>>>> creating a lot of link files until a rebalance is executed.
>>>>>
>>>>> Xavi
>>>>>
>>>>>
>>>>> On 20/04/16 14:13, Serkan Çoban wrote:
>>>>>>
>>>>>> Here are the steps that I do in detail, and the relevant output from
>>>>>> the bricks:
>>>>>>
>>>>>> I am using the command below for volume creation:
>>>>>> gluster volume create v0 disperse 20 redundancy 4 \
>>>>>> 1.1.1.{185..204}:/bricks/02 \
>>>>>> 1.1.1.{205..224}:/bricks/02 \
>>>>>> 1.1.1.{225..244}:/bricks/02 \
>>>>>> 1.1.1.{185..204}:/bricks/03 \
>>>>>> 1.1.1.{205..224}:/bricks/03 \
>>>>>> 1.1.1.{225..244}:/bricks/03 \
>>>>>> 1.1.1.{185..204}:/bricks/04 \
>>>>>> 1.1.1.{205..224}:/bricks/04 \
>>>>>> 1.1.1.{225..244}:/bricks/04 \
>>>>>> 1.1.1.{185..204}:/bricks/05 \
>>>>>> 1.1.1.{205..224}:/bricks/05 \
>>>>>> 1.1.1.{225..244}:/bricks/05 \
>>>>>> 1.1.1.{185..204}:/bricks/06 \
>>>>>> 1.1.1.{205..224}:/bricks/06 \
>>>>>> 1.1.1.{225..244}:/bricks/06 \
>>>>>> 1.1.1.{185..204}:/bricks/07 \
>>>>>> 1.1.1.{205..224}:/bricks/07 \
>>>>>> 1.1.1.{225..244}:/bricks/07 \
>>>>>> 1.1.1.{185..204}:/bricks/08 \
>>>>>> 1.1.1.{205..224}:/bricks/08 \
>>>>>> 1.1.1.{225..244}:/bricks/08 \
>>>>>> 1.1.1.{185..204}:/bricks/09 \
>>>>>> 1.1.1.{205..224}:/bricks/09 \
>>>>>> 1.1.1.{225..244}:/bricks/09 \
>>>>>> 1.1.1.{185..204}:/bricks/10 \
>>>>>> 1.1.1.{205..224}:/bricks/10 \
>>>>>> 1.1.1.{225..244}:/bricks/10 \
>>>>>> 1.1.1.{185..204}:/bricks/11 \
>>>>>> 1.1.1.{205..224}:/bricks/11 \
>>>>>> 1.1.1.{225..244}:/bricks/11 \
>>>>>> 1.1.1.{185..204}:/bricks/12 \
>>>>>> 1.1.1.{205..224}:/bricks/12 \
>>>>>> 1.1.1.{225..244}:/bricks/12 \
>>>>>> 1.1.1.{185..204}:/bricks/13 \
>>>>>> 1.1.1.{205..224}:/bricks/13 \
>>>>>> 1.1.1.{225..244}:/bricks/13 \
>>>>>> 1.1.1.{185..204}:/bricks/14 \
>>>>>> 1.1.1.{205..224}:/bricks/14 \
>>>>>> 1.1.1.{225..244}:/bricks/14 \
>>>>>> 1.1.1.{185..204}:/bricks/15 \
>>>>>> 1.1.1.{205..224}:/bricks/15 \
>>>>>> 1.1.1.{225..244}:/bricks/15 \
>>>>>> 1.1.1.{185..204}:/bricks/16 \
>>>>>> 1.1.1.{205..224}:/bricks/16 \
>>>>>> 1.1.1.{225..244}:/bricks/16 \
>>>>>> 1.1.1.{185..204}:/bricks/17 \
>>>>>> 1.1.1.{205..224}:/bricks/17 \
>>>>>> 1.1.1.{225..244}:/bricks/17 \
>>>>>> 1.1.1.{185..204}:/bricks/18 \
>>>>>> 1.1.1.{205..224}:/bricks/18 \
>>>>>> 1.1.1.{225..244}:/bricks/18 \
>>>>>> 1.1.1.{185..204}:/bricks/19 \
>>>>>> 1.1.1.{205..224}:/bricks/19 \
>>>>>> 1.1.1.{225..244}:/bricks/19 \
>>>>>> 1.1.1.{185..204}:/bricks/20 \
>>>>>> 1.1.1.{205..224}:/bricks/20 \
>>>>>> 1.1.1.{225..244}:/bricks/20 \
>>>>>> 1.1.1.{185..204}:/bricks/21 \
>>>>>> 1.1.1.{205..224}:/bricks/21 \
>>>>>> 1.1.1.{225..244}:/bricks/21 \
>>>>>> 1.1.1.{185..204}:/bricks/22 \
>>>>>> 1.1.1.{205..224}:/bricks/22 \
>>>>>> 1.1.1.{225..244}:/bricks/22 \
>>>>>> 1.1.1.{185..204}:/bricks/23 \
>>>>>> 1.1.1.{205..224}:/bricks/23 \
>>>>>> 1.1.1.{225..244}:/bricks/23 \
>>>>>> 1.1.1.{185..204}:/bricks/24 \
>>>>>> 1.1.1.{205..224}:/bricks/24 \
>>>>>> 1.1.1.{225..244}:/bricks/24 \
>>>>>> 1.1.1.{185..204}:/bricks/25 \
>>>>>> 1.1.1.{205..224}:/bricks/25 \
>>>>>> 1.1.1.{225..244}:/bricks/25 \
>>>>>> 1.1.1.{185..204}:/bricks/26 \
>>>>>> 1.1.1.{205..224}:/bricks/26 \
>>>>>> 1.1.1.{225..244}:/bricks/26 \
>>>>>> 1.1.1.{185..204}:/bricks/27 \
>>>>>> 1.1.1.{205..224}:/bricks/27 \
>>>>>> 1.1.1.{225..244}:/bricks/27 force
>>>>>>
>>>>>> then I mount volume on 50 clients:
>>>>>> mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster
>>>>>>
>>>>>> Then I create a directory from one of the clients and chmod it:
>>>>>> mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1
>>>>>>
>>>>>> Then I start distcp on the clients. There are 1059 files of 8.8GB each
>>>>>> in one folder, and they are copied to /mnt/gluster/s1 with 100
>>>>>> parallel maps, which means 2 copy jobs per client at the same time:
>>>>>> hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb file:///mnt/gluster/s1
>>>>>>
>>>>>> After the job finished, here is the status of the s1 directory on the
>>>>>> bricks:
>>>>>> The s1 directory is present in all 1560 bricks.
>>>>>> The s1/teragen-10tb folder is present in all 1560 bricks.
>>>>>>
>>>>>> Full listing of files in the bricks:
>>>>>> https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0
>>>>>>
>>>>>> You can ignore the .crc files in the brick output above; they are
>>>>>> checksum files...
>>>>>>
>>>>>> As you can see, the part-m-xxxx files were written only to some bricks
>>>>>> on nodes 0205..0224.
>>>>>> All bricks have some files, but they have zero size.
>>>>>>
>>>>>> I increased the file descriptor limit to 65k, so that is not the issue...
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez
>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>
>>>>>>> Hi Serkan,
>>>>>>>
>>>>>>> On 19/04/16 15:16, Serkan Çoban wrote:
>>>>>>>>>>>
>>>>>>>>>>> I assume that gluster is used to store the intermediate files
>>>>>>>>>>> before the reduce phase
>>>>>>>>
>>>>>>>> Nope, gluster is the destination for the distcp command:
>>>>>>>> hadoop distcp -m 50 http://nn1:8020/path/to/folder file:///mnt/gluster
>>>>>>>> This runs maps on datanodes, all of which have /mnt/gluster mounted.
>>>>>>>
>>>>>>> I don't know hadoop, so I'm of little help here. However it seems that
>>>>>>> -m 50 means to execute 50 copies in parallel. This means that even if
>>>>>>> the distribution worked fine, at most 50 (very probably fewer) of the
>>>>>>> 78 ec sets would be used in parallel.
>>>>>>>
>>>>>>>>
>>>>>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>>>>>> mapreduce.
>>>>>>>>
>>>>>>>> Yes, but how can a client write 500 files to the gluster mount and
>>>>>>>> have those files written only to a subset of subvolumes? I cannot use
>>>>>>>> gluster as a backup cluster if I cannot write with distcp.
>>>>>>>>
>>>>>>>
>>>>>>> All 500 files were created only on one of the 78 ec sets and the
>>>>>>> remaining 77 stayed empty?
>>>>>>>
>>>>>>>>>>> You should look which files are created in each brick and how
>>>>>>>>>>> many while the process is running.
>>>>>>>>
>>>>>>>> Files are only created on nodes 185..204 or 205..224 or 225..244.
>>>>>>>> Only on 20 nodes in each test.
>>>>>>>
>>>>>>> How many files were there in each brick?
>>>>>>>
>>>>>>> Not sure if this can be related, but standard linux distributions have
>>>>>>> a default limit of 1024 open file descriptors. With such a big volume
>>>>>>> and a massive copy, maybe this limit is affecting something?
>>>>>>>
>>>>>>> Are there any error or warning messages in the mount or bricks logs?
>>>>>>>
>>>>>>> Xavi
>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Apr 19, 2016 at 1:05 PM, Xavier Hernandez
>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>
>>>>>>>>> Hi Serkan,
>>>>>>>>>
>>>>>>>>> moved to gluster-users since this doesn't belong to devel list.
>>>>>>>>>
>>>>>>>>> On 19/04/16 11:24, Serkan Çoban wrote:
>>>>>>>>>>
>>>>>>>>>> I am copying 10,000 files to a gluster volume using mapreduce on
>>>>>>>>>> the clients. Each map process takes one file at a time and copies
>>>>>>>>>> it to the gluster volume.
>>>>>>>>>
>>>>>>>>> I assume that gluster is used to store the intermediate files before
>>>>>>>>> the reduce phase.
>>>>>>>>>
>>>>>>>>>> My disperse volume consists of 78 subvolumes of 16+4 disks each. So
>>>>>>>>>> if I copy >78 files in parallel, I expect each file to go to a
>>>>>>>>>> different subvolume, right?
>>>>>>>>>
>>>>>>>>> If you only copy 78 files, most probably you will get some
>>>>>>>>> subvolumes empty and some others with more than one or two files.
>>>>>>>>> It's not an exact distribution, it's a statistically balanced
>>>>>>>>> distribution: over time and with enough files, each brick will
>>>>>>>>> contain an amount of files in the same order of magnitude, but they
>>>>>>>>> won't have the *same* number of files.
>>>>>>>>>
>>>>>>>>>> In my tests with fio I can see every file goes to a different
>>>>>>>>>> subvolume, but when I start the mapreduce process from the clients
>>>>>>>>>> only 78/3=26 subvolumes are used for writing files.
>>>>>>>>>
>>>>>>>>> This means that this is caused by some peculiarity of the mapreduce.
>>>>>>>>>
>>>>>>>>>> I see that clearly from the network traffic. Mapreduce on the
>>>>>>>>>> client side can run multi-threaded. I tested with 1, 5 and 10
>>>>>>>>>> threads on each client, but every time only 26 subvolumes were used.
>>>>>>>>>> How can I debug the issue further?
>>>>>>>>>
>>>>>>>>> You should look which files are created in each brick and how many
>>>>>>>>> while the process is running.
>>>>>>>>>
>>>>>>>>> Xavi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez
>>>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>
>>>>>>>>>>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi, I just reinstalled a fresh 3.7.11 and I am seeing the same
>>>>>>>>>>>> behavior. 50 clients are copying files named part-0-xxxx to
>>>>>>>>>>>> gluster using mapreduce, with one thread per server, and they
>>>>>>>>>>>> are using only 20 servers out of 60. On the other hand, fio
>>>>>>>>>>>> tests use all the servers. Is there anything I can do to solve
>>>>>>>>>>>> the issue?
>>>>>>>>>>>
>>>>>>>>>>> Distribution of files to ec sets is done by dht. In theory, if you
>>>>>>>>>>> create many files, each ec set will receive the same amount of
>>>>>>>>>>> files. However, when the number of files is small enough,
>>>>>>>>>>> statistics can fail.
>>>>>>>>>>>
>>>>>>>>>>> Not sure what you are doing exactly, but a mapreduce procedure
>>>>>>>>>>> generally only creates a single output. In that case it makes
>>>>>>>>>>> sense that only one ec set is used. If you want to use all ec
>>>>>>>>>>> sets for a single file, you should enable sharding (I haven't
>>>>>>>>>>> tested that) or split the result in multiple files.
>>>>>>>>>>>
>>>>>>>>>>> Xavi
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Serkan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>>>> From: Serkan Çoban <cobanserkan at gmail.com>
>>>>>>>>>>>> Date: Mon, Apr 18, 2016 at 2:39 PM
>>>>>>>>>>>> Subject: disperse volume file to subvolume mapping
>>>>>>>>>>>> To: Gluster Users <gluster-users at gluster.org>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi, I have a problem where clients are using only 1/3 of the
>>>>>>>>>>>> nodes in a disperse volume for writing.
>>>>>>>>>>>> I am testing from 50 clients using 1 to 10 threads with file
>>>>>>>>>>>> names part-0-xxxx.
>>>>>>>>>>>> What I see is that clients only use 20 nodes for writing. How is
>>>>>>>>>>>> the file name to subvolume hashing done? Is this related to the
>>>>>>>>>>>> file names being similar?
>>>>>>>>>>>>
>>>>>>>>>>>> My cluster is 3.7.10 with 60 nodes, each with 26 disks. The
>>>>>>>>>>>> disperse volume is 78 x (16+4). Only 26 out of 78 subvolumes are
>>>>>>>>>>>> used during writes.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>
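
The name handling described in the quoted messages above can be sketched in Python. This only illustrates which part of a file name would be used for placement under the default rsync-style expression quoted in the thread. hash_key is a hypothetical helper, not a GlusterFS function, and Python's re module is assumed to treat this pattern the same way as the regex engine DHT uses.

```python
import re

# Default expression quoted in the thread for rsync-style temporary names.
RSYNC_HASH_REGEX = re.compile(r"^\.(.+)\.[^.]+$")


def hash_key(filename):
    """Return the part of the name that would be hashed (hypothetical helper)."""
    match = RSYNC_HASH_REGEX.match(filename)
    # On a match only the parenthesised group is used; otherwise the full name.
    return match.group(1) if match else filename


if __name__ == "__main__":
    names = [
        ".test.712hd",  # rsync temporary name for 'test'
        ".distcp.tmp.attempt_1460381790773_0248_m_000001_0",  # distcp temporary name
        "part-m-00031",  # final name after the rename
    ]
    for name in names:
        print(f"{name!r} -> {hash_key(name)!r}")
```

Every distcp temporary name reduces to the same key, 'distcp.tmp', while the final part-m-* names are hashed unchanged, which is consistent with all files landing on one subvolume and link files appearing after the rename.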

Xavier Hernandez

2016-Apr-21 12:34 UTC

[Gluster-users] disperse volume file to subvolume mapping

Can you try a 'gluster volume rebalance v0 start force' ?
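
After a forced rebalance like the one suggested above, the per-node status rows can be totalled to see whether files are still being skipped. A minimal Python sketch follows, assuming rows in the trimmed column order shown in this thread (node, rebalanced files, size, scanned, failures, skipped). Actual 'gluster volume rebalance <volname> status' output may contain additional columns, so the parsing is an assumption, and summarize_rebalance is a hypothetical helper.

```python
def summarize_rebalance(rows):
    """Sum the numeric columns of whitespace-separated status rows."""
    totals = {"rebalanced": 0, "scanned": 0, "failures": 0, "skipped": 0}
    for row in rows:
        fields = row.split()
        if len(fields) != 6:
            continue  # ignore headers, blank lines and unexpected layouts
        _node, rebalanced, _size, scanned, failures, skipped = fields
        try:
            values = [int(rebalanced), int(scanned), int(failures), int(skipped)]
        except ValueError:
            continue  # not a data row
        for key, value in zip(totals, values):
            totals[key] += value
    return totals


if __name__ == "__main__":
    # Rows copied from the status report quoted in this thread.
    report = [
        "Node        Rebalanced files   Size     Scanned   Failures   Skipped",
        "1.1.1.185   158                29GB     1720      0          314",
        "1.1.1.205   93                 46.5GB   761       0          95",
        "1.1.1.225   74                 37GB     779       0          94",
    ]
    print(summarize_rebalance(report))
```

A non-zero skipped total after the forced run would suggest that some files are still not being migrated.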

On 21/04/16 14:23, Serkan Çoban wrote:
>> Has the rebalance operation finished successfully? Has it skipped any
>> files?
> Yes, according to gluster v rebalance status it completed without any
> errors.
> The rebalance status report looks like this:
>
> Node        Rebalanced files   Size     Scanned   Failures   Skipped
> 1.1.1.185   158                29GB     1720      0          314
> 1.1.1.205   93                 46.5GB   761       0          95
> 1.1.1.225   74                 37GB     779       0          94
>
> All other hosts show 0 for every column.
>
> I double-checked that the files with '---------T' attributes are still
> there; maybe some of them were deleted, but I still see them in the
> bricks...
> I am also concerned about why the part files are not distributed to all
> 60 nodes. Shouldn't rebalance do that?
>
> On Thu, Apr 21, 2016 at 1:55 PM, Xavier Hernandez <xhernandez at
datalab.es> wrote:
>> Hi Serkan,
>>
>> On 21/04/16 12:39, Serkan ?oban wrote:
>>>
>>> I started a gluster v rebalance v0 start command hoping that it
will
>>> equally redistribute files across 60 nodes but it did not do
that...
>>> why it did not redistribute files? any thoughts?
>>
>>
>> Has the rebalance operation finished successfully ? has it skipped any
files
>> ?
>>
>> After a successful rebalance all files with attributes
'---------T' should
>> have disappeared.
>>
>>
>>>
>>> On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez
>>> <xhernandez at datalab.es> wrote:
>>>>
>>>> Hi Serkan,
>>>>
>>>> On 21/04/16 10:07, Serkan ?oban wrote:
>>>>>>
>>>>>>
>>>>>> I think the problem is in the temporary name that
distcp gives to the
>>>>>> file while it's being copied before renaming it to
the real name. Do
>>>>>> you
>>>>>> know what is the structure of this name ?
>>>>>
>>>>>
>>>>> Distcp temporary file name format is:
>>>>>
".distcp.tmp.attempt_1460381790773_0248_m_000001_0" and the same
>>>>> temporary file name used by one map process. For example I
see in the
>>>>> logs that one map copies files
part-m-00031,part-m-00047,part-m-00063
>>>>> sequentially and they all use same temporary file name
above. So no
>>>>> original file name appears in temporary file name.
>>>>
>>>>
>>>>
>>>> This explains the problem. With the default options, DHT sends
all files
>>>> to
>>>> the subvolume that should store a file named
'distcp.tmp'.
>>>>
>>>> With this temporary name format, little can be done.
>>>>
>>>>>
>>>>> I will check if we can modify distcp behaviour, or we have
to write
>>>>> our mapreduce procedures instead of using distcp.
>>>>>
>>>>>> 2. define the option 'extra-hash-regex' to an
expression that matches
>>>>>> your temporary file names and returns the same name
that will finally
>>>>>> have.
>>>>>> Depending on the differences between original and
temporary file names,
>>>>>> this
>>>>>> option could be useless.
>>>>>> 3. set the option 'rsync-hash-regex' to
'none'. This will prevent the
>>>>>> name conversion, so the files will be evenly
distributed. However this
>>>>>> will
>>>>>> cause a lot of files placed in incorrect subvolumes,
creating a lot of
>>>>>> link
>>>>>> files until a rebalance is executed.
>>>>>
>>>>>
>>>>>
>>>>> How can I set these options?
>>>>
>>>>
>>>>
>>>> You can set gluster options using:
>>>>
>>>> gluster volume set <volname> <option> <value>
>>>>
>>>> for example:
>>>>
>>>> gluster volume set v0 rsync-hash-regex none
>>>>
>>>> Xavi
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez
>>>>> <xhernandez at datalab.es> wrote:
>>>>>>
>>>>>>
>>>>>> Hi Serkan,
>>>>>>
>>>>>> I think the problem is in the temporary name that
distcp gives to the
>>>>>> file
>>>>>> while it's being copied before renaming it to the
real name. Do you
>>>>>> know
>>>>>> what is the structure of this name ?
>>>>>>
>>>>>> DHT selects the subvolume (in this case the ec set) on
which the file
>>>>>> will
>>>>>> be stored based on the name of the file. This has a
problem when a file
>>>>>> is
>>>>>> being renamed, because this could change the subvolume
where the file
>>>>>> should
>>>>>> be found.
>>>>>>
>>>>>> DHT has a feature to avoid incorrect file placements
when executing
>>>>>> renames
>>>>>> for the rsync case. What it does is to check if the
file matches the
>>>>>> following regular expression:
>>>>>>
>>>>>>        ^\.(.+)\.[^.]+$
>>>>>>
>>>>>> If a match is found, it only considers the part between
parenthesis to
>>>>>> calculate the destination subvolume.
>>>>>>
>>>>>> This is useful for rsync because temporary file names
are constructed
>>>>>> in
>>>>>> the
>>>>>> following way: suppose the original filename is
'test'. The temporary
>>>>>> filename while rsync is being executed is made by
prepending a dot and
>>>>>> appending '.<random chars>': .test.712hd
>>>>>>
>>>>>> As you can see, the original name and the part of the
name between
>>>>>> parenthesis that matches the regular expression are the
same. This
>>>>>> causes
>>>>>> that, after renaming the temporary file to its original
filename, both
>>>>>> files
>>>>>> will be considered to belong to the same subvolume by
DHT.
>>>>>>
>>>>>> In your case it's very probable that distcp uses a
temporary name like
>>>>>> '.part.<number>'. In this case the
portion of the name used to select
>>>>>> the
>>>>>> subvolume is always 'part'. This would explain
why all files go to the
>>>>>> same
>>>>>> subvolume. Once the file is renamed to another name,
DHT realizes that
>>>>>> it
>>>>>> should go to another subvolume. At this point it
creates a link file
>>>>>> (those
>>>>>> files with access rights = '---------T') in the
correct subvolume but
>>>>>> it
>>>>>> doesn't move it. As you can see, this kind of files
are better
>>>>>> balanced.
>>>>>>
>>>>>> To solve this problem you have three options:
>>>>>>
>>>>>> 1. change the temporary filename used by distcp to
correctly match the
>>>>>> regular expression. I'm not sure if this can be
configured, but if this
>>>>>> is
>>>>>> possible, this is the best option.
>>>>>>
>>>>>> 2. define the option 'extra-hash-regex' to an
expression that matches
>>>>>> your
>>>>>> temporary file names and returns the same name that
will finally have.
>>>>>> Depending on the differences between original and
temporary file names,
>>>>>> this
>>>>>> option could be useless.
>>>>>>
>>>>>> 3. set the option 'rsync-hash-regex' to
'none'. This will prevent the
>>>>>> name
>>>>>> conversion, so the files will be evenly distributed.
However this will
>>>>>> cause
>>>>>> a lot of files placed in incorrect subvolumes, creating
a lot of link
>>>>>> files
>>>>>> until a rebalance is executed.
>>>>>>
>>>>>> Xavi
>>>>>>
>>>>>>
>>>>>> On 20/04/16 14:13, Serkan ?oban wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Here is the steps that I do in detail and relevant
output from bricks:
>>>>>>>
>>>>>>> I am using below command for volume creation:
>>>>>>> gluster volume create v0 disperse 20 redundancy 4 \
>>>>>>> 1.1.1.{185..204}:/bricks/02 \
>>>>>>> 1.1.1.{205..224}:/bricks/02 \
>>>>>>> 1.1.1.{225..244}:/bricks/02 \
>>>>>>> 1.1.1.{185..204}:/bricks/03 \
>>>>>>> 1.1.1.{205..224}:/bricks/03 \
>>>>>>> 1.1.1.{225..244}:/bricks/03 \
>>>>>>> 1.1.1.{185..204}:/bricks/04 \
>>>>>>> 1.1.1.{205..224}:/bricks/04 \
>>>>>>> 1.1.1.{225..244}:/bricks/04 \
>>>>>>> 1.1.1.{185..204}:/bricks/05 \
>>>>>>> 1.1.1.{205..224}:/bricks/05 \
>>>>>>> 1.1.1.{225..244}:/bricks/05 \
>>>>>>> 1.1.1.{185..204}:/bricks/06 \
>>>>>>> 1.1.1.{205..224}:/bricks/06 \
>>>>>>> 1.1.1.{225..244}:/bricks/06 \
>>>>>>> 1.1.1.{185..204}:/bricks/07 \
>>>>>>> 1.1.1.{205..224}:/bricks/07 \
>>>>>>> 1.1.1.{225..244}:/bricks/07 \
>>>>>>> 1.1.1.{185..204}:/bricks/08 \
>>>>>>> 1.1.1.{205..224}:/bricks/08 \
>>>>>>> 1.1.1.{225..244}:/bricks/08 \
>>>>>>> 1.1.1.{185..204}:/bricks/09 \
>>>>>>> 1.1.1.{205..224}:/bricks/09 \
>>>>>>> 1.1.1.{225..244}:/bricks/09 \
>>>>>>> 1.1.1.{185..204}:/bricks/10 \
>>>>>>> 1.1.1.{205..224}:/bricks/10 \
>>>>>>> 1.1.1.{225..244}:/bricks/10 \
>>>>>>> 1.1.1.{185..204}:/bricks/11 \
>>>>>>> 1.1.1.{205..224}:/bricks/11 \
>>>>>>> 1.1.1.{225..244}:/bricks/11 \
>>>>>>> 1.1.1.{185..204}:/bricks/12 \
>>>>>>> 1.1.1.{205..224}:/bricks/12 \
>>>>>>> 1.1.1.{225..244}:/bricks/12 \
>>>>>>> 1.1.1.{185..204}:/bricks/13 \
>>>>>>> 1.1.1.{205..224}:/bricks/13 \
>>>>>>> 1.1.1.{225..244}:/bricks/13 \
>>>>>>> 1.1.1.{185..204}:/bricks/14 \
>>>>>>> 1.1.1.{205..224}:/bricks/14 \
>>>>>>> 1.1.1.{225..244}:/bricks/14 \
>>>>>>> 1.1.1.{185..204}:/bricks/15 \
>>>>>>> 1.1.1.{205..224}:/bricks/15 \
>>>>>>> 1.1.1.{225..244}:/bricks/15 \
>>>>>>> 1.1.1.{185..204}:/bricks/16 \
>>>>>>> 1.1.1.{205..224}:/bricks/16 \
>>>>>>> 1.1.1.{225..244}:/bricks/16 \
>>>>>>> 1.1.1.{185..204}:/bricks/17 \
>>>>>>> 1.1.1.{205..224}:/bricks/17 \
>>>>>>> 1.1.1.{225..244}:/bricks/17 \
>>>>>>> 1.1.1.{185..204}:/bricks/18 \
>>>>>>> 1.1.1.{205..224}:/bricks/18 \
>>>>>>> 1.1.1.{225..244}:/bricks/18 \
>>>>>>> 1.1.1.{185..204}:/bricks/19 \
>>>>>>> 1.1.1.{205..224}:/bricks/19 \
>>>>>>> 1.1.1.{225..244}:/bricks/19 \
>>>>>>> 1.1.1.{185..204}:/bricks/20 \
>>>>>>> 1.1.1.{205..224}:/bricks/20 \
>>>>>>> 1.1.1.{225..244}:/bricks/20 \
>>>>>>> 1.1.1.{185..204}:/bricks/21 \
>>>>>>> 1.1.1.{205..224}:/bricks/21 \
>>>>>>> 1.1.1.{225..244}:/bricks/21 \
>>>>>>> 1.1.1.{185..204}:/bricks/22 \
>>>>>>> 1.1.1.{205..224}:/bricks/22 \
>>>>>>> 1.1.1.{225..244}:/bricks/22 \
>>>>>>> 1.1.1.{185..204}:/bricks/23 \
>>>>>>> 1.1.1.{205..224}:/bricks/23 \
>>>>>>> 1.1.1.{225..244}:/bricks/23 \
>>>>>>> 1.1.1.{185..204}:/bricks/24 \
>>>>>>> 1.1.1.{205..224}:/bricks/24 \
>>>>>>> 1.1.1.{225..244}:/bricks/24 \
>>>>>>> 1.1.1.{185..204}:/bricks/25 \
>>>>>>> 1.1.1.{205..224}:/bricks/25 \
>>>>>>> 1.1.1.{225..244}:/bricks/25 \
>>>>>>> 1.1.1.{185..204}:/bricks/26 \
>>>>>>> 1.1.1.{205..224}:/bricks/26 \
>>>>>>> 1.1.1.{225..244}:/bricks/26 \
>>>>>>> 1.1.1.{185..204}:/bricks/27 \
>>>>>>> 1.1.1.{205..224}:/bricks/27 \
>>>>>>> 1.1.1.{225..244}:/bricks/27 force
>>>>>>>
>>>>>>> then I mount volume on 50 clients:
>>>>>>> mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster
>>>>>>>
>>>>>>> then I make a directory from one of the clients and
chmod it.
>>>>>>> mkdir /mnt/gluster/s1 && chmod 777
/mnt/gluster/s1
>>>>>>>
>>>>>>> then I start distcp on clients, there are
1059X8.8GB files in one
>>>>>>> folder
>>>>>>> and
>>>>>>> they will be copied to /mnt/gluster/s1 with 100
parallel which means 2
>>>>>>> copy jobs per client at same time.
>>>>>>> hadoop distcp -m 100
http://nn1:8020/path/to/teragen-10tb
>>>>>>> file:///mnt/gluster/s1
>>>>>>>
>>>>>>> After job finished here is the status of s1
directory from bricks:
>>>>>>> s1 directory is present in all 1560 brick.
>>>>>>> s1/teragen-10tb folder is present in all 1560
brick.
>>>>>>>
>>>>>>> full listing of files in bricks:
>>>>>>>
https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0
>>>>>>>
>>>>>>> You can ignore the .crc files in the brick output
above, they are
>>>>>>> checksum files...
>>>>>>>
>>>>>>> As you can see part-m-xxxx files written only some
bricks in nodes
>>>>>>> 0205..0224
>>>>>>> All bricks have some files but they have zero size.
>>>>>>>
>>>>>>> I increase file descriptors to 65k so it is not the
issue...
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez
>>>>>>> <xhernandez at datalab.es>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Serkan,
>>>>>>>>
>>>>>>>> On 19/04/16 15:16, Serkan ?oban wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I assume that gluster is used
to store the intermediate files
>>>>>>>>>>>> before
>>>>>>>>>>>> the reduce phase
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Nope, gluster is the destination for distcp
command. hadoop distcp
>>>>>>>>> -m
>>>>>>>>> 50 http://nn1:8020/path/to/folder
file:///mnt/gluster
>>>>>>>>> This run maps on datanodes which have
/mnt/gluster mounted on all of
>>>>>>>>> them.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I don't know hadoop, so I'm of little
help here. However it seems
>>>>>>>> that
>>>>>>>> -m
>>>>>>>> 50
>>>>>>>> means to execute 50 copies in parallel. This
means that even if the
>>>>>>>> distribution worked fine, at most 50 (much
probably less) of the 78
>>>>>>>> ec
>>>>>>>> sets
>>>>>>>> would be used in parallel.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> This means that this is caused
by some peculiarity of the
>>>>>>>>>>>> mapreduce.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Yes but how a client write 500 files to
gluster mount and those file
>>>>>>>>> just written only to subset of subvolumes?
I cannot use gluster as a
>>>>>>>>> backup cluster if I cannot write with
distcp.
>>>>>>>>>
>>>>>>>>
>>>>>>>> All 500 files were created only on one of the
78 ec sets and the
>>>>>>>> remaining
>>>>>>>> 77 got empty ?
>>>>>>>>
>>>>>>>>>>>> You should look which files are
created in each brick and how
>>>>>>>>>>>> many
>>>>>>>>>>>> while the process is running.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Files only created on nodes 185..204 or
205..224 or 225..244. Only
>>>>>>>>> on
>>>>>>>>> 20 nodes in each test.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> How many files were there in each brick?
>>>>>>>>
>>>>>>>> Not sure if this can be related, but standard linux distributions
>>>>>>>> have a default limit of 1024 open file descriptors. With such a big
>>>>>>>> volume and a massive copy, maybe this limit is affecting something?
>>>>>>>>
>>>>>>>> Are there any error or warning messages in the mount or brick logs?
>>>>>>>>
>>>>>>>>
>>>>>>>> Xavi
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Apr 19, 2016 at 1:05 PM, Xavier Hernandez
>>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Serkan,
>>>>>>>>>>
>>>>>>>>>> Moved to gluster-users since this doesn't belong on the devel list.
>>>>>>>>>>
>>>>>>>>>> On 19/04/16 11:24, Serkan Çoban wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I am copying 10,000 files to a gluster volume using mapreduce on
>>>>>>>>>>> clients. Each map process takes one file at a time and copies it
>>>>>>>>>>> to the gluster volume.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I assume that gluster is used to store the intermediate files
>>>>>>>>>> before the reduce phase.
>>>>>>>>>>
>>>>>>>>>>> My disperse volume consists of 78 subvolumes of 16+4 disks each.
>>>>>>>>>>> So if I copy >78 files in parallel, I expect each file to go to
>>>>>>>>>>> a different subvolume, right?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> If you only copy 78 files, most probably you will get some
>>>>>>>>>> subvolumes empty and some others with more than one or two files.
>>>>>>>>>> It's not an exact distribution, it's a statistically balanced
>>>>>>>>>> distribution: over time and with enough files, each brick will
>>>>>>>>>> contain a number of files in the same order of magnitude, but
>>>>>>>>>> they won't have the *same* number of files.
>>>>>>>>>>
>>>>>>>>>>> In my tests with fio I can see every file goes to a different
>>>>>>>>>>> subvolume, but when I start the mapreduce process from clients
>>>>>>>>>>> only 78/3=26 subvolumes are used for writing files.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>>>>> mapreduce.
>>>>>>>>>>
>>>>>>>>>>> I see that clearly from network traffic. Mapreduce on the client
>>>>>>>>>>> side can be run multi-threaded. I tested with 1, 5 and 10 threads
>>>>>>>>>>> on each client, but every time only 26 subvolumes were used.
>>>>>>>>>>> How can I debug the issue further?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> You should look which files are created in each brick and how
>>>>>>>>>> many while the process is running.
>>>>>>>>>>
>>>>>>>>>> Xavi
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez
>>>>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>>
>>>>>>>>>>>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, I just reinstalled a fresh 3.7.11 and I am seeing the same
>>>>>>>>>>>>> behavior. 50 clients are copying files named part-0-xxxx to
>>>>>>>>>>>>> gluster using mapreduce, with one thread per server, and they
>>>>>>>>>>>>> are using only 20 servers out of 60. On the other hand, fio
>>>>>>>>>>>>> tests use all the servers. Anything I can do to solve the
>>>>>>>>>>>>> issue?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Distribution of files to ec sets is done by dht. In theory, if
>>>>>>>>>>>> you create many files, each ec set will receive the same number
>>>>>>>>>>>> of files. However, when the number of files is small enough,
>>>>>>>>>>>> statistics can fail.
>>>>>>>>>>>>
>>>>>>>>>>>> Not sure what you are doing exactly, but a mapreduce procedure
>>>>>>>>>>>> generally only creates a single output. In that case it makes
>>>>>>>>>>>> sense that only one ec set is used. If you want to use all ec
>>>>>>>>>>>> sets for a single file, you should enable sharding (I haven't
>>>>>>>>>>>> tested that) or split the result in multiple files.
>>>>>>>>>>>>
>>>>>>>>>>>> Xavi
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Serkan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>>>>> From: Serkan Çoban <cobanserkan at gmail.com>
>>>>>>>>>>>>> Date: Mon, Apr 18, 2016 at 2:39 PM
>>>>>>>>>>>>> Subject: disperse volume file to subvolume mapping
>>>>>>>>>>>>> To: Gluster Users <gluster-users at gluster.org>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, I have a problem where clients are using only 1/3 of the
>>>>>>>>>>>>> nodes in a disperse volume for writing.
>>>>>>>>>>>>> I am testing from 50 clients using 1 to 10 threads with file
>>>>>>>>>>>>> names part-0-xxxx.
>>>>>>>>>>>>> What I see is that clients only use 20 nodes for writing. How
>>>>>>>>>>>>> is the file name to subvolume hashing done? Is this related to
>>>>>>>>>>>>> the file names being similar?
>>>>>>>>>>>>>
>>>>>>>>>>>>> My cluster is 3.7.10 with 60 nodes, each with 26 disks. The
>>>>>>>>>>>>> disperse volume is 78 x (16+4). Only 26 out of 78 subvolumes
>>>>>>>>>>>>> are used during writes.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
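
Note on the placement statistics discussed in the thread: Xavi's point that copying 78 files to 78 subvolumes will most probably leave some subvolumes empty can be checked with a small simulation. The sketch below is only an illustration under assumptions: it uses SHA-256 plus a modulo as a generic uniform hash, which is not how gluster's dht hashes names or lays out directories, and the part-0-xxxx names are made up to resemble the ones mentioned above.

```python
import hashlib
from collections import Counter


def subvolume_for(name, subvolumes):
    # Generic uniform hash plus modulo: a stand-in, NOT gluster's dht hash.
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % subvolumes


def placement(names, subvolumes):
    # Count how many files land on each subvolume index.
    return Counter(subvolume_for(n, subvolumes) for n in names)


def expected_empty(files, subvolumes):
    # Expected number of subvolumes that receive no file at all
    # when each file is placed uniformly at random.
    return subvolumes * (1 - 1 / subvolumes) ** files


if __name__ == "__main__":
    subvolumes = 78
    for files in (78, 500, 10000):
        # Hypothetical file names, modelled on the part-0-xxxx pattern.
        names = [f"part-0-{i:04d}" for i in range(files)]
        counts = placement(names, subvolumes)
        print(files, "files:", len(counts), "subvolumes used,",
              "expected empty ~", round(expected_empty(files, subvolumes), 1))
```

For 78 files the formula gives between 28 and 29 empty subvolumes on average, so an uneven spread is normal at that scale. With thousands of files, however, a uniform hash should touch essentially every subvolume, which is why seeing exactly 26 of 78 used points at something other than plain statistics.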

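Xavi's suggestion to look at which files are created in each brick, and how many, while the copy is running could be scripted along these lines. This is a rough sketch, not a gluster tool: the brick paths in the usage comment are hypothetical, and the function simply walks each brick directory on a server while skipping the .glusterfs metadata directory that gluster keeps inside every brick.

```python
import os


def count_files_per_brick(brick_dirs):
    # Return {brick_path: number of regular directory entries found under it}.
    counts = {}
    for brick in brick_dirs:
        total = 0
        for _root, dirs, files in os.walk(brick):
            # Prune gluster's internal metadata tree so it is not counted.
            dirs[:] = [d for d in dirs if d != ".glusterfs"]
            total += len(files)
        counts[brick] = total
    return counts


# Usage (hypothetical brick paths, run on each storage node):
#   for brick, n in sorted(count_files_per_brick(["/bricks/01", "/bricks/02"]).items()):
#       print(brick, n)
```

Running it repeatedly during the copy and comparing the counts across nodes would show which ec sets are actually receiving files.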