Xavier Hernandez
2016-Apr-22 06:15 UTC
[Gluster-users] disperse volume file to subvolume mapping
When you execute a rebalance 'force', the skipped column should be 0 for all nodes and all '---------T' files must have disappeared. Otherwise something failed. Is this true in your case?

On 21/04/16 15:19, Serkan Çoban wrote:
> Same result. I also checked the rebalance.log file; it has no
> reference to the part files either...
>
> On Thu, Apr 21, 2016 at 3:34 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>> Can you try a 'gluster volume rebalance v0 start force' ?
>>
>> On 21/04/16 14:23, Serkan Çoban wrote:
>>>> Has the rebalance operation finished successfully ? Has it skipped
>>>> any files ?
>>>
>>> Yes, according to 'gluster v rebalance status' it completed without
>>> any errors. The rebalance status report looks like this (all other
>>> hosts report 0 values):
>>>
>>> Node        Rebalanced files   size     Scanned   failures   skipped
>>> 1.1.1.185   158                29GB     1720      0          314
>>> 1.1.1.205    93                46.5GB    761      0           95
>>> 1.1.1.225    74                37GB      779      0           94
>>>
>>> I double-checked that the files with '---------T' attributes are
>>> there; maybe some of them were deleted, but I still see them in the
>>> bricks...
>>> I am also concerned about why the part files were not distributed to
>>> all 60 nodes. Shouldn't rebalance do that?
>>>
>>> On Thu, Apr 21, 2016 at 1:55 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>>>> Hi Serkan,
>>>>
>>>> On 21/04/16 12:39, Serkan Çoban wrote:
>>>>> I started a 'gluster v rebalance v0 start' command hoping that it
>>>>> would equally redistribute the files across the 60 nodes, but it
>>>>> did not do that... Why didn't it redistribute the files? Any
>>>>> thoughts?
>>>>
>>>> Has the rebalance operation finished successfully ? Has it skipped
>>>> any files ?
>>>>
>>>> After a successful rebalance all files with attributes '---------T'
>>>> should have disappeared.
>>>>>
>>>>> On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>>>>>> Hi Serkan,
>>>>>>
>>>>>> On 21/04/16 10:07, Serkan Çoban wrote:
>>>>>>>> I think the problem is in the temporary name that distcp gives
>>>>>>>> to the file while it's being copied before renaming it to the
>>>>>>>> real name. Do you know what is the structure of this name ?
>>>>>>>
>>>>>>> The distcp temporary file name format is
>>>>>>> ".distcp.tmp.attempt_1460381790773_0248_m_000001_0", and the same
>>>>>>> temporary file name is reused by one map process. For example, I
>>>>>>> see in the logs that one map copies the files part-m-00031,
>>>>>>> part-m-00047 and part-m-00063 sequentially, and they all use the
>>>>>>> same temporary file name above. So no part of the original file
>>>>>>> name appears in the temporary file name.
>>>>>>
>>>>>> This explains the problem. With the default options, DHT sends all
>>>>>> files to the subvolume that should store a file named 'distcp.tmp'.
>>>>>>
>>>>>> With this temporary name format, little can be done.
>>>>>>
>>>>>>> I will check if we can modify distcp's behaviour; otherwise we
>>>>>>> will have to write our own mapreduce procedures instead of using
>>>>>>> distcp.
>>>>>>>
>>>>>>>> 2. define the option 'extra-hash-regex' to an expression that
>>>>>>>> matches your temporary file names and returns the same name the
>>>>>>>> file will finally have. Depending on the differences between the
>>>>>>>> original and temporary file names, this option could be useless.
>>>>>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will
>>>>>>>> prevent the name conversion, so the files will be evenly
>>>>>>>> distributed. However this will cause a lot of files to be placed
>>>>>>>> in incorrect subvolumes, creating a lot of link files until a
>>>>>>>> rebalance is executed.
>>>>>>> How can I set these options?
>>>>>>
>>>>>> You can set gluster options using:
>>>>>>
>>>>>>     gluster volume set <volname> <option> <value>
>>>>>>
>>>>>> for example:
>>>>>>
>>>>>>     gluster volume set v0 rsync-hash-regex none
>>>>>>
>>>>>> Xavi
>>>>>>
>>>>>>> On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>>>>>>>> Hi Serkan,
>>>>>>>>
>>>>>>>> I think the problem is in the temporary name that distcp gives
>>>>>>>> to the file while it's being copied before renaming it to the
>>>>>>>> real name. Do you know what is the structure of this name ?
>>>>>>>>
>>>>>>>> DHT selects the subvolume (in this case the ec set) on which the
>>>>>>>> file will be stored based on the name of the file. This is a
>>>>>>>> problem when a file is renamed, because the rename could change
>>>>>>>> the subvolume where the file should be found.
>>>>>>>>
>>>>>>>> DHT has a feature to avoid incorrect file placements when
>>>>>>>> executing renames for the rsync case. What it does is check
>>>>>>>> whether the file name matches the following regular expression:
>>>>>>>>
>>>>>>>>     ^\.(.+)\.[^.]+$
>>>>>>>>
>>>>>>>> If a match is found, it only considers the part between
>>>>>>>> parentheses to calculate the destination subvolume.
>>>>>>>>
>>>>>>>> This is useful for rsync because temporary file names are
>>>>>>>> constructed in the following way: suppose the original filename
>>>>>>>> is 'test'. The temporary filename while rsync is running is made
>>>>>>>> by prepending a dot and appending '.<random chars>': .test.712hd
>>>>>>>>
>>>>>>>> As you can see, the original name and the part of the name
>>>>>>>> between parentheses that matches the regular expression are the
>>>>>>>> same. This causes that, after renaming the temporary file to its
>>>>>>>> original filename, both files will be considered to belong to
>>>>>>>> the same subvolume by DHT.
>>>>>>>>
>>>>>>>> In your case it's very probable that distcp uses a temporary
>>>>>>>> name like '.part.<number>'. In this case the portion of the name
>>>>>>>> used to select the subvolume is always 'part'. This would
>>>>>>>> explain why all files go to the same subvolume. Once the file is
>>>>>>>> renamed to another name, DHT realizes that it should go to
>>>>>>>> another subvolume. At this point it creates a link file (those
>>>>>>>> files with access rights = '---------T') in the correct
>>>>>>>> subvolume, but it doesn't move the data. As you can see, this
>>>>>>>> kind of file is better balanced.
>>>>>>>>
>>>>>>>> To solve this problem you have three options:
>>>>>>>>
>>>>>>>> 1. change the temporary filename used by distcp to correctly
>>>>>>>> match the regular expression. I'm not sure if this can be
>>>>>>>> configured, but if it is possible, this is the best option.
>>>>>>>>
>>>>>>>> 2. define the option 'extra-hash-regex' to an expression that
>>>>>>>> matches your temporary file names and returns the same name the
>>>>>>>> file will finally have. Depending on the differences between the
>>>>>>>> original and temporary file names, this option could be useless.
>>>>>>>>
>>>>>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will
>>>>>>>> prevent the name conversion, so the files will be evenly
>>>>>>>> distributed. However this will cause a lot of files to be placed
>>>>>>>> in incorrect subvolumes, creating a lot of link files until a
>>>>>>>> rebalance is executed.
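[Editor's note: the effect of that regular expression on the names from this thread can be checked from the command line. sed stands in here for DHT's internal matching; only the captured group would be hashed.]

```shell
# DHT's default rsync-hash-regex; the part captured by (.+) is what
# actually gets hashed to pick a subvolume.
regex='^\.(.+)\.[^.]+$'

# rsync temp name: hashes as 'test', same as the final name.
echo '.test.712hd' | sed -E "s/$regex/\1/"
# distcp temp name: every map attempt hashes as 'distcp.tmp', so all
# in-flight copies land on the same subvolume.
echo '.distcp.tmp.attempt_1460381790773_0248_m_000001_0' | sed -E "s/$regex/\1/"
# a final part file does not match the regex, so it is hashed unchanged.
echo 'part-m-00031' | sed -E "s/$regex/\1/"
```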
>>>>>>>>
>>>>>>>> Xavi
>>>>>>>>
>>>>>>>> On 20/04/16 14:13, Serkan Çoban wrote:
>>>>>>>>> Here are the steps I perform, in detail, with the relevant
>>>>>>>>> output from the bricks.
>>>>>>>>>
>>>>>>>>> I am using the command below for volume creation:
>>>>>>>>>
>>>>>>>>> gluster volume create v0 disperse 20 redundancy 4 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/02 1.1.1.{205..224}:/bricks/02 1.1.1.{225..244}:/bricks/02 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/03 1.1.1.{205..224}:/bricks/03 1.1.1.{225..244}:/bricks/03 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/04 1.1.1.{205..224}:/bricks/04 1.1.1.{225..244}:/bricks/04 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/05 1.1.1.{205..224}:/bricks/05 1.1.1.{225..244}:/bricks/05 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/06 1.1.1.{205..224}:/bricks/06 1.1.1.{225..244}:/bricks/06 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/07 1.1.1.{205..224}:/bricks/07 1.1.1.{225..244}:/bricks/07 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/08 1.1.1.{205..224}:/bricks/08 1.1.1.{225..244}:/bricks/08 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/09 1.1.1.{205..224}:/bricks/09 1.1.1.{225..244}:/bricks/09 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/10 1.1.1.{205..224}:/bricks/10 1.1.1.{225..244}:/bricks/10 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/11 1.1.1.{205..224}:/bricks/11 1.1.1.{225..244}:/bricks/11 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/12 1.1.1.{205..224}:/bricks/12 1.1.1.{225..244}:/bricks/12 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/13 1.1.1.{205..224}:/bricks/13 1.1.1.{225..244}:/bricks/13 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/14 1.1.1.{205..224}:/bricks/14 1.1.1.{225..244}:/bricks/14 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/15 1.1.1.{205..224}:/bricks/15 1.1.1.{225..244}:/bricks/15 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/16 1.1.1.{205..224}:/bricks/16 1.1.1.{225..244}:/bricks/16 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/17 1.1.1.{205..224}:/bricks/17 1.1.1.{225..244}:/bricks/17 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/18 1.1.1.{205..224}:/bricks/18 1.1.1.{225..244}:/bricks/18 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/19 1.1.1.{205..224}:/bricks/19 1.1.1.{225..244}:/bricks/19 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/20 1.1.1.{205..224}:/bricks/20 1.1.1.{225..244}:/bricks/20 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/21 1.1.1.{205..224}:/bricks/21 1.1.1.{225..244}:/bricks/21 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/22 1.1.1.{205..224}:/bricks/22 1.1.1.{225..244}:/bricks/22 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/23 1.1.1.{205..224}:/bricks/23 1.1.1.{225..244}:/bricks/23 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/24 1.1.1.{205..224}:/bricks/24 1.1.1.{225..244}:/bricks/24 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/25 1.1.1.{205..224}:/bricks/25 1.1.1.{225..244}:/bricks/25 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/26 1.1.1.{205..224}:/bricks/26 1.1.1.{225..244}:/bricks/26 \
>>>>>>>>>   1.1.1.{185..204}:/bricks/27 1.1.1.{205..224}:/bricks/27 1.1.1.{225..244}:/bricks/27 force
>>>>>>>>>
>>>>>>>>> then I mount the volume on 50 clients:
>>>>>>>>> mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster
>>>>>>>>>
>>>>>>>>> then I make a directory from one of the clients and chmod it:
>>>>>>>>> mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1
>>>>>>>>>
>>>>>>>>> then I start distcp on the clients. There are 1059 x 8.8GB
>>>>>>>>> files in one folder; they will be copied to /mnt/gluster/s1
>>>>>>>>> with 100 parallel maps, which means 2 copy jobs per client at
>>>>>>>>> the same time.
>>>>>>>>> hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb file:///mnt/gluster/s1
>>>>>>>>>
>>>>>>>>> After the job finished, here is the status of the s1 directory
>>>>>>>>> from the bricks:
>>>>>>>>> The s1 directory is present in all 1560 bricks.
>>>>>>>>> The s1/teragen-10tb folder is present in all 1560 bricks.
>>>>>>>>>
>>>>>>>>> Full listing of the files in the bricks:
>>>>>>>>> https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0
>>>>>>>>>
>>>>>>>>> You can ignore the .crc files in the brick output above; they
>>>>>>>>> are checksum files...
>>>>>>>>>
>>>>>>>>> As you can see, the part-m-xxxx files were written to only some
>>>>>>>>> bricks on nodes 0205..0224.
>>>>>>>>> All bricks have some files, but they have zero size.
>>>>>>>>>
>>>>>>>>> I increased file descriptors to 65k, so that is not the issue...
>>>>>>>>>
>>>>>>>>> On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>>>>>>>>>> Hi Serkan,
>>>>>>>>>>
>>>>>>>>>> On 19/04/16 15:16, Serkan Çoban wrote:
>>>>>>>>>>>> I assume that gluster is used to store the intermediate
>>>>>>>>>>>> files before the reduce phase
>>>>>>>>>>>
>>>>>>>>>>> Nope, gluster is the destination of the distcp command:
>>>>>>>>>>> hadoop distcp -m 50 http://nn1:8020/path/to/folder file:///mnt/gluster
>>>>>>>>>>> This runs maps on the datanodes, which all have /mnt/gluster
>>>>>>>>>>> mounted.
>>>>>>>>>>
>>>>>>>>>> I don't know hadoop, so I'm of little help here. However it
>>>>>>>>>> seems that -m 50 means to execute 50 copies in parallel. This
>>>>>>>>>> means that even if the distribution worked fine, at most 50
>>>>>>>>>> (much probably less) of the 78 ec sets would be used in
>>>>>>>>>> parallel.
>>>>>>>>>>
>>>>>>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>>>>>>> mapreduce.
>>>>>>>>>>>
>>>>>>>>>>> Yes, but how can a client write 500 files to the gluster
>>>>>>>>>>> mount and have those files written to only a subset of the
>>>>>>>>>>> subvolumes? I cannot use gluster as a backup cluster if I
>>>>>>>>>>> cannot write with distcp.
>>>>>>>>>>
>>>>>>>>>> All 500 files were created on only one of the 78 ec sets and
>>>>>>>>>> the remaining 77 got empty ?
>>>>>>>>>>
>>>>>>>>>>>> You should look at which files are created in each brick,
>>>>>>>>>>>> and how many, while the process is running.
>>>>>>>>>>>
>>>>>>>>>>> Files are only created on nodes 185..204 or 205..224 or
>>>>>>>>>>> 225..244. Only on 20 nodes in each test.
>>>>>>>>>>
>>>>>>>>>> How many files were there in each brick ?
>>>>>>>>>>
>>>>>>>>>> Not sure if this can be related, but standard linux
>>>>>>>>>> distributions have a default limit of 1024 open file
>>>>>>>>>> descriptors. With such a big volume and a massive copy, maybe
>>>>>>>>>> this limit is affecting something ?
>>>>>>>>>>
>>>>>>>>>> Are there any error or warning messages in the mount or brick
>>>>>>>>>> logs ?
>>>>>>>>>>
>>>>>>>>>> Xavi
>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 19, 2016 at 1:05 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>>
>>>>>>>>>>>> moved to gluster-users since this doesn't belong to the
>>>>>>>>>>>> devel list.
>>>>>>>>>>>>
>>>>>>>>>>>> On 19/04/16 11:24, Serkan Çoban wrote:
>>>>>>>>>>>>> I am copying 10,000 files to the gluster volume using
>>>>>>>>>>>>> mapreduce on the clients. Each map process takes one file
>>>>>>>>>>>>> at a time and copies it to the gluster volume.
>>>>>>>>>>>>
>>>>>>>>>>>> I assume that gluster is used to store the intermediate
>>>>>>>>>>>> files before the reduce phase.
>>>>>>>>>>>>
>>>>>>>>>>>>> My disperse volume consists of 78 subvolumes of 16+4 disks
>>>>>>>>>>>>> each. So if I copy >78 files in parallel, I expect each
>>>>>>>>>>>>> file to go to a different subvolume, right?
>>>>>>>>>>>>
>>>>>>>>>>>> If you only copy 78 files, most probably you will get some
>>>>>>>>>>>> subvolumes empty and some others with more than one or two
>>>>>>>>>>>> files. It's not an exact distribution, it's a statistically
>>>>>>>>>>>> balanced distribution: over time and with enough files, each
>>>>>>>>>>>> brick will contain an amount of files in the same order of
>>>>>>>>>>>> magnitude, but they won't have the *same* number of files.
>>>>>>>>>>>>
>>>>>>>>>>>>> In my tests with fio I can see every file go to a different
>>>>>>>>>>>>> subvolume, but when I start the mapreduce process from the
>>>>>>>>>>>>> clients, only 78/3 = 26 subvolumes are used for writing
>>>>>>>>>>>>> files.
>>>>>>>>>>>>
>>>>>>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>>>>>>> mapreduce.
>>>>>>>>>>>>
>>>>>>>>>>>>> I see that clearly from the network traffic. Mapreduce on
>>>>>>>>>>>>> the client side can be run multi-threaded. I tested with 1,
>>>>>>>>>>>>> 5 and 10 threads on each client, but every time only 26
>>>>>>>>>>>>> subvolumes were used. How can I debug the issue further?
>>>>>>>>>>>>
>>>>>>>>>>>> You should look at which files are created in each brick,
>>>>>>>>>>>> and how many, while the process is running.
>>>>>>>>>>>>
>>>>>>>>>>>> Xavi
>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez <xhernandez at datalab.es> wrote:
>>>>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>>>>>>>>>>>>> Hi, I just reinstalled a fresh 3.7.11 and I am seeing the
>>>>>>>>>>>>>>> same behavior: 50 clients copying part-0-xxxx named files
>>>>>>>>>>>>>>> with mapreduce to gluster, using one thread per server,
>>>>>>>>>>>>>>> use only 20 servers out of 60. On the other hand, fio
>>>>>>>>>>>>>>> tests use all the servers. Anything I can do to solve the
>>>>>>>>>>>>>>> issue?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Distribution of files to ec sets is done by DHT. In
>>>>>>>>>>>>>> theory, if you create many files, each ec set will receive
>>>>>>>>>>>>>> the same amount of files. However, when the number of
>>>>>>>>>>>>>> files is small enough, statistics can fail.
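[Editor's note: the statistical nature of name-based placement can be illustrated with a toy sketch. md5sum stands in for Gluster's actual DHT hash function, which is different; bash and GNU coreutils are assumed. Hashing many distinct part-file names into 78 buckets gives every bucket a similar, but not identical, share.]

```shell
# Toy stand-in for DHT placement: derive a bucket (0..77) from a hash of
# each file name, then summarize how the 1000 names spread over 78 buckets.
for i in $(seq 0 999); do
    h=$(printf 'part-m-%05d' "$i" | md5sum | cut -c1-8)
    echo $(( 16#$h % 78 ))
done | sort -n | uniq -c |
    awk '{n++; s+=$1; if ($1>max) max=$1} END {print n " buckets, " s " files, max " max " per bucket"}'
```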
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Not sure what you are doing exactly, but a mapreduce
>>>>>>>>>>>>>> procedure generally only creates a single output. In that
>>>>>>>>>>>>>> case it makes sense that only one ec set is used. If you
>>>>>>>>>>>>>> want to use all ec sets for a single file, you should
>>>>>>>>>>>>>> enable sharding (I haven't tested that) or split the
>>>>>>>>>>>>>> result into multiple files.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Xavi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Serkan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>>>>>>> From: Serkan Çoban <cobanserkan at gmail.com>
>>>>>>>>>>>>>>> Date: Mon, Apr 18, 2016 at 2:39 PM
>>>>>>>>>>>>>>> Subject: disperse volume file to subvolume mapping
>>>>>>>>>>>>>>> To: Gluster Users <gluster-users at gluster.org>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi, I have a problem where clients are using only 1/3 of
>>>>>>>>>>>>>>> the nodes in a disperse volume for writing.
>>>>>>>>>>>>>>> I am testing from 50 clients using 1 to 10 threads, with
>>>>>>>>>>>>>>> file names like part-0-xxxx.
>>>>>>>>>>>>>>> What I see is that the clients only use 20 nodes for
>>>>>>>>>>>>>>> writing. How is the file name to subvolume hashing done?
>>>>>>>>>>>>>>> Is this related to the file names being similar?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My cluster is 3.7.10 with 60 nodes, each with 26 disks.
>>>>>>>>>>>>>>> The disperse volume is 78 x (16+4). Only 26 out of 78
>>>>>>>>>>>>>>> subvolumes are used during writes..
Serkan Çoban
2016-Apr-22 06:24 UTC
[Gluster-users] disperse volume file to subvolume mapping
Not only the skipped column but all columns are 0 in the rebalance status command; it seems rebalance does not do anything. All '---------T' files are still there. Anyway, we wrote our own custom mapreduce tool; it is copying files to gluster right now and is utilizing all 60 nodes as expected. I will delete the distcp folder and continue, unless you need any further logs/debug files to examine the issue.

Thanks for the help,
Serkan

On Fri, Apr 22, 2016 at 9:15 AM, Xavier Hernandez <xhernandez at datalab.es> wrote:
> When you execute a rebalance 'force' the skipped column should be 0 for
> all nodes and all '---------T' files must have disappeared. Otherwise
> something failed. Is this true in your case ?
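[Editor's note: the leftover link files discussed above can be counted directly on a brick. DHT link files carry no rwx permission bits and only the sticky bit (mode 1000), which `ls -l` shows as '---------T'. The brick path below is an example; the sketch demonstrates the check on a scratch directory.]

```shell
# On a real brick you would run something like:
#   find /bricks/02 -type f -perm 1000 | wc -l
# Demo on a scratch directory, simulating one leftover DHT link file:
brick=$(mktemp -d)
touch "$brick/part-m-00031" && chmod 1000 "$brick/part-m-00031"  # '---------T'
touch "$brick/data-file"    && chmod 644  "$brick/data-file"     # ordinary file
find "$brick" -type f -perm 1000 | wc -l   # counts only the link file
rm -rf "$brick"
```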
>>>> >>>> On Thu, Apr 21, 2016 at 1:55 PM, Xavier Hernandez >>>> <xhernandez at datalab.es> >>>> wrote: >>>>> >>>>> >>>>> Hi Serkan, >>>>> >>>>> On 21/04/16 12:39, Serkan ?oban wrote: >>>>>> >>>>>> >>>>>> >>>>>> I started a gluster v rebalance v0 start command hoping that it will >>>>>> equally redistribute files across 60 nodes but it did not do that... >>>>>> why it did not redistribute files? any thoughts? >>>>> >>>>> >>>>> >>>>> >>>>> Has the rebalance operation finished successfully ? has it skipped any >>>>> files >>>>> ? >>>>> >>>>> After a successful rebalance all files with attributes '---------T' >>>>> should >>>>> have disappeared. >>>>> >>>>> >>>>>> >>>>>> On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez >>>>>> <xhernandez at datalab.es> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hi Serkan, >>>>>>> >>>>>>> On 21/04/16 10:07, Serkan ?oban wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> I think the problem is in the temporary name that distcp gives to >>>>>>>>> the >>>>>>>>> file while it's being copied before renaming it to the real name. >>>>>>>>> Do >>>>>>>>> you >>>>>>>>> know what is the structure of this name ? >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Distcp temporary file name format is: >>>>>>>> ".distcp.tmp.attempt_1460381790773_0248_m_000001_0" and the same >>>>>>>> temporary file name used by one map process. For example I see in >>>>>>>> the >>>>>>>> logs that one map copies files >>>>>>>> part-m-00031,part-m-00047,part-m-00063 >>>>>>>> sequentially and they all use same temporary file name above. So no >>>>>>>> original file name appears in temporary file name. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> This explains the problem. With the default options, DHT sends all >>>>>>> files >>>>>>> to >>>>>>> the subvolume that should store a file named 'distcp.tmp'. >>>>>>> >>>>>>> With this temporary name format, little can be done. 
>>>>>>> >>>>>>>> >>>>>>>> I will check if we can modify distcp behaviour, or we have to write >>>>>>>> our mapreduce procedures instead of using distcp. >>>>>>>> >>>>>>>>> 2. define the option 'extra-hash-regex' to an expression that >>>>>>>>> matches >>>>>>>>> your temporary file names and returns the same name that will >>>>>>>>> finally >>>>>>>>> have. >>>>>>>>> Depending on the differences between original and temporary file >>>>>>>>> names, >>>>>>>>> this >>>>>>>>> option could be useless. >>>>>>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent >>>>>>>>> the >>>>>>>>> name conversion, so the files will be evenly distributed. However >>>>>>>>> this >>>>>>>>> will >>>>>>>>> cause a lot of files placed in incorrect subvolumes, creating a lot >>>>>>>>> of >>>>>>>>> link >>>>>>>>> files until a rebalance is executed. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> How can I set these options? >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> You can set gluster options using: >>>>>>> >>>>>>> gluster volume set <volname> <option> <value> >>>>>>> >>>>>>> for example: >>>>>>> >>>>>>> gluster volume set v0 rsync-hash-regex none >>>>>>> >>>>>>> Xavi >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez >>>>>>>> <xhernandez at datalab.es> wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Hi Serkan, >>>>>>>>> >>>>>>>>> I think the problem is in the temporary name that distcp gives to >>>>>>>>> the >>>>>>>>> file >>>>>>>>> while it's being copied before renaming it to the real name. Do you >>>>>>>>> know >>>>>>>>> what is the structure of this name ? >>>>>>>>> >>>>>>>>> DHT selects the subvolume (in this case the ec set) on which the >>>>>>>>> file >>>>>>>>> will >>>>>>>>> be stored based on the name of the file. This has a problem when a >>>>>>>>> file >>>>>>>>> is >>>>>>>>> being renamed, because this could change the subvolume where the >>>>>>>>> file >>>>>>>>> should >>>>>>>>> be found. 
>>>>>>>>> >>>>>>>>> DHT has a feature to avoid incorrect file placements when executing >>>>>>>>> renames >>>>>>>>> for the rsync case. What it does is to check if the file matches >>>>>>>>> the >>>>>>>>> following regular expression: >>>>>>>>> >>>>>>>>> ^\.(.+)\.[^.]+$ >>>>>>>>> >>>>>>>>> If a match is found, it only considers the part between parenthesis >>>>>>>>> to >>>>>>>>> calculate the destination subvolume. >>>>>>>>> >>>>>>>>> This is useful for rsync because temporary file names are >>>>>>>>> constructed >>>>>>>>> in >>>>>>>>> the >>>>>>>>> following way: suppose the original filename is 'test'. The >>>>>>>>> temporary >>>>>>>>> filename while rsync is being executed is made by prepending a dot >>>>>>>>> and >>>>>>>>> appending '.<random chars>': .test.712hd >>>>>>>>> >>>>>>>>> As you can see, the original name and the part of the name between >>>>>>>>> parenthesis that matches the regular expression are the same. This >>>>>>>>> causes >>>>>>>>> that, after renaming the temporary file to its original filename, >>>>>>>>> both >>>>>>>>> files >>>>>>>>> will be considered to belong to the same subvolume by DHT. >>>>>>>>> >>>>>>>>> In your case it's very probable that distcp uses a temporary name >>>>>>>>> like >>>>>>>>> '.part.<number>'. In this case the portion of the name used to >>>>>>>>> select >>>>>>>>> the >>>>>>>>> subvolume is always 'part'. This would explain why all files go to >>>>>>>>> the >>>>>>>>> same >>>>>>>>> subvolume. Once the file is renamed to another name, DHT realizes >>>>>>>>> that >>>>>>>>> it >>>>>>>>> should go to another subvolume. At this point it creates a link >>>>>>>>> file >>>>>>>>> (those >>>>>>>>> files with access rights = '---------T') in the correct subvolume >>>>>>>>> but >>>>>>>>> it >>>>>>>>> doesn't move it. As you can see, this kind of files are better >>>>>>>>> balanced. >>>>>>>>> >>>>>>>>> To solve this problem you have three options: >>>>>>>>> >>>>>>>>> 1. 
change the temporary filename used by distcp to correctly match >>>>>>>>> the >>>>>>>>> regular expression. I'm not sure if this can be configured, but if >>>>>>>>> this >>>>>>>>> is >>>>>>>>> possible, this is the best option. >>>>>>>>> >>>>>>>>> 2. define the option 'extra-hash-regex' to an expression that >>>>>>>>> matches >>>>>>>>> your >>>>>>>>> temporary file names and returns the same name that will finally >>>>>>>>> have. >>>>>>>>> Depending on the differences between original and temporary file >>>>>>>>> names, >>>>>>>>> this >>>>>>>>> option could be useless. >>>>>>>>> >>>>>>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent >>>>>>>>> the >>>>>>>>> name >>>>>>>>> conversion, so the files will be evenly distributed. However this >>>>>>>>> will >>>>>>>>> cause >>>>>>>>> a lot of files placed in incorrect subvolumes, creating a lot of >>>>>>>>> link >>>>>>>>> files >>>>>>>>> until a rebalance is executed. >>>>>>>>> >>>>>>>>> Xavi >>>>>>>>> >>>>>>>>> >>>>>>>>> On 20/04/16 14:13, Serkan ?oban wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Here is the steps that I do in detail and relevant output from >>>>>>>>>> bricks: >>>>>>>>>> >>>>>>>>>> I am using below command for volume creation: >>>>>>>>>> gluster volume create v0 disperse 20 redundancy 4 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/02 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/02 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/02 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/03 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/03 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/03 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/04 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/04 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/04 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/05 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/05 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/05 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/06 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/06 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/06 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/07 \ >>>>>>>>>> 
1.1.1.{205..224}:/bricks/07 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/07 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/08 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/08 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/08 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/09 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/09 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/09 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/10 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/10 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/10 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/11 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/11 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/11 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/12 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/12 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/12 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/13 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/13 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/13 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/14 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/14 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/14 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/15 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/15 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/15 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/16 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/16 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/16 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/17 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/17 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/17 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/18 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/18 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/18 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/19 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/19 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/19 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/20 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/20 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/20 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/21 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/21 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/21 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/22 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/22 \ >>>>>>>>>> 1.1.1.{225..244}:/bricks/22 \ >>>>>>>>>> 1.1.1.{185..204}:/bricks/23 \ >>>>>>>>>> 1.1.1.{205..224}:/bricks/23 \ 
>>>>>>>>>> 1.1.1.{225..244}:/bricks/23 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/24 \
>>>>>>>>>> 1.1.1.{205..224}:/bricks/24 \
>>>>>>>>>> 1.1.1.{225..244}:/bricks/24 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/25 \
>>>>>>>>>> 1.1.1.{205..224}:/bricks/25 \
>>>>>>>>>> 1.1.1.{225..244}:/bricks/25 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/26 \
>>>>>>>>>> 1.1.1.{205..224}:/bricks/26 \
>>>>>>>>>> 1.1.1.{225..244}:/bricks/26 \
>>>>>>>>>> 1.1.1.{185..204}:/bricks/27 \
>>>>>>>>>> 1.1.1.{205..224}:/bricks/27 \
>>>>>>>>>> 1.1.1.{225..244}:/bricks/27 force
>>>>>>>>>>
>>>>>>>>>> Then I mount the volume on 50 clients:
>>>>>>>>>> mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster
>>>>>>>>>>
>>>>>>>>>> Then I create a directory from one of the clients and chmod it:
>>>>>>>>>> mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1
>>>>>>>>>>
>>>>>>>>>> Then I start distcp on the clients. There are 1059 x 8.8GB files
>>>>>>>>>> in one folder, and they will be copied to /mnt/gluster/s1 with
>>>>>>>>>> 100 parallel maps, which means 2 copy jobs per client at the
>>>>>>>>>> same time:
>>>>>>>>>> hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb
>>>>>>>>>> file:///mnt/gluster/s1
>>>>>>>>>>
>>>>>>>>>> After the job finished, here is the status of the s1 directory
>>>>>>>>>> on the bricks:
>>>>>>>>>> The s1 directory is present on all 1560 bricks.
>>>>>>>>>> The s1/teragen-10tb folder is present on all 1560 bricks.
>>>>>>>>>>
>>>>>>>>>> Full listing of files in the bricks:
>>>>>>>>>> https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0
>>>>>>>>>>
>>>>>>>>>> You can ignore the .crc files in the brick output above, they
>>>>>>>>>> are checksum files.
>>>>>>>>>>
>>>>>>>>>> As you can see, the part-m-xxxx files were written only to some
>>>>>>>>>> bricks on nodes 0205..0224.
>>>>>>>>>> All bricks have some files, but they have zero size.
>>>>>>>>>>
>>>>>>>>>> I increased file descriptors to 65k, so that is not the issue.
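The regex options Xavi lists above are ordinary DHT volume options. A minimal sketch, assuming the volume name v0 from the thread (the pattern shown for extra-hash-regex is only an illustrative example, not a recommendation):

```shell
# Option 3 from above: disable the rsync-style temp-name rewriting, so
# each temporary name hashes on its own and writes spread over all ec
# sets, at the cost of link files until a rebalance is run.
gluster volume set v0 cluster.rsync-hash-regex none

# Option 2: hash on a name extracted from the temporary name via a
# capture group. For distcp's ".distcp.tmp.attempt_..." names no useful
# pattern exists, because the final name never appears in them.
gluster volume set v0 cluster.extra-hash-regex '^\.(.+)\.[^.]+$'
```

These commands require a running GlusterFS cluster; they are shown here only to make the three options above concrete.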
>>>>>>>>>>
>>>>>>>>>> On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez
>>>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>
>>>>>>>>>>> On 19/04/16 15:16, Serkan Çoban wrote:
>>>>>>>>>>>>>>> I assume that gluster is used to store the intermediate
>>>>>>>>>>>>>>> files before the reduce phase
>>>>>>>>>>>>
>>>>>>>>>>>> Nope, gluster is the destination of the distcp command:
>>>>>>>>>>>> hadoop distcp -m 50 http://nn1:8020/path/to/folder
>>>>>>>>>>>> file:///mnt/gluster
>>>>>>>>>>>> This runs maps on the datanodes, all of which have
>>>>>>>>>>>> /mnt/gluster mounted.
>>>>>>>>>>>
>>>>>>>>>>> I don't know hadoop, so I'm of little help here. However, it
>>>>>>>>>>> seems that '-m 50' means to execute 50 copies in parallel. This
>>>>>>>>>>> means that even if the distribution worked fine, at most 50
>>>>>>>>>>> (probably fewer) of the 78 ec sets would be used in parallel.
>>>>>>>>>>>
>>>>>>>>>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>>>>>>>>>> mapreduce.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, but how can a client write 500 files to the gluster mount
>>>>>>>>>>>> and have those files written to only a subset of the
>>>>>>>>>>>> subvolumes? I cannot use gluster as a backup cluster if I
>>>>>>>>>>>> cannot write with distcp.
>>>>>>>>>>>>
>>>>>>>>>>> All 500 files were created on only one of the 78 ec sets and
>>>>>>>>>>> the remaining 77 stayed empty?
>>>>>>>>>>>
>>>>>>>>>>>>>>> You should look at which files are created in each brick,
>>>>>>>>>>>>>>> and how many, while the process is running.
>>>>>>>>>>>>
>>>>>>>>>>>> Files were only created on nodes 185..204 or 205..224 or
>>>>>>>>>>>> 225..244. Only on 20 nodes in each test.
>>>>>>>>>>>
>>>>>>>>>>> How many files were there in each brick?
>>>>>>>>>>>
>>>>>>>>>>> Not sure if this can be related, but standard linux
>>>>>>>>>>> distributions have a default limit of 1024 open file
>>>>>>>>>>> descriptors. With such a big volume and a massive copy going
>>>>>>>>>>> on, maybe this limit is affecting something?
>>>>>>>>>>>
>>>>>>>>>>> Are there any error or warning messages in the mount or brick
>>>>>>>>>>> logs?
>>>>>>>>>>>
>>>>>>>>>>> Xavi
>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 19, 2016 at 1:05 PM, Xavier Hernandez
>>>>>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> moved to gluster-users since this doesn't belong to the devel
>>>>>>>>>>>>> list.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 19/04/16 11:24, Serkan Çoban wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am copying 10,000 files to the gluster volume using
>>>>>>>>>>>>>> mapreduce on the clients. Each map process takes one file at
>>>>>>>>>>>>>> a time and copies it to the gluster volume.
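The open-file-descriptor concern raised above is quick to verify on the clients and servers. A small sketch; the glusterfsd lookup in the comment is illustrative and assumes a running brick process on the node:

```shell
# Soft and hard open-file limits for the current shell; the common
# Linux default soft limit is 1024, far below what a volume with 1560
# bricks may need on a busy client.
ulimit -Sn
ulimit -Hn

# For a running brick process, the limit actually applied can be read
# from /proc, e.g. (requires a glusterfsd process on this node):
#   grep 'Max open files' /proc/$(pgrep -o glusterfsd)/limits
```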
>>>>>>>>>>>>>
>>>>>>>>>>>>> I assume that gluster is used to store the intermediate files
>>>>>>>>>>>>> before the reduce phase.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> My disperse volume consists of 78 subvolumes of 16+4 disks
>>>>>>>>>>>>>> each. So if I copy more than 78 files in parallel, I expect
>>>>>>>>>>>>>> each file to go to a different subvolume, right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you only copy 78 files, most probably some subvolumes will
>>>>>>>>>>>>> end up empty and some others with more than one or two files.
>>>>>>>>>>>>> It's not an exact distribution, it's a statistically balanced
>>>>>>>>>>>>> distribution: over time and with enough files, each brick
>>>>>>>>>>>>> will contain an amount of files of the same order of
>>>>>>>>>>>>> magnitude, but they won't have the *same* number of files.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> In my tests with fio I can see every file going to a
>>>>>>>>>>>>>> different subvolume, but when I start the mapreduce process
>>>>>>>>>>>>>> from the clients, only 78/3 = 26 subvolumes are used for
>>>>>>>>>>>>>> writing files.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>>>>>>>> mapreduce.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I can see that clearly from the network traffic. Mapreduce
>>>>>>>>>>>>>> on the client side can be run multi-threaded. I tested with
>>>>>>>>>>>>>> 1, 5 and 10 threads on each client, but every time only 26
>>>>>>>>>>>>>> subvolumes were used.
>>>>>>>>>>>>>> How can I debug the issue further?
>>>>>>>>>>>>>
>>>>>>>>>>>>> You should look at which files are created in each brick, and
>>>>>>>>>>>>> how many, while the process is running.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Xavi
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez
>>>>>>>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi, I just reinstalled a fresh 3.7.11 and I am seeing the
>>>>>>>>>>>>>>>> same behavior. 50 clients are copying files named
>>>>>>>>>>>>>>>> part-0-xxxx using mapreduce to gluster, with one thread
>>>>>>>>>>>>>>>> per server, and they are using only 20 servers out of 60.
>>>>>>>>>>>>>>>> On the other hand, fio tests use all the servers. Anything
>>>>>>>>>>>>>>>> I can do to solve the issue?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Distribution of files to ec sets is done by dht. In theory,
>>>>>>>>>>>>>>> if you create many files, each ec set will receive the same
>>>>>>>>>>>>>>> amount of files. However, when the number of files is small
>>>>>>>>>>>>>>> enough, statistics can fail.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Not sure what you are doing exactly, but a mapreduce
>>>>>>>>>>>>>>> procedure generally creates only a single output. In that
>>>>>>>>>>>>>>> case it makes sense that only one ec set is used. If you
>>>>>>>>>>>>>>> want to use all ec sets for a single file, you should
>>>>>>>>>>>>>>> enable sharding (I haven't tested that) or split the result
>>>>>>>>>>>>>>> into multiple files.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Xavi
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Serkan
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>>>>>>>> From: Serkan Çoban <cobanserkan at gmail.com>
>>>>>>>>>>>>>>>> Date: Mon, Apr 18, 2016 at 2:39 PM
>>>>>>>>>>>>>>>> Subject: disperse volume file to subvolume mapping
>>>>>>>>>>>>>>>> To: Gluster Users <gluster-users at gluster.org>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi, I have a problem where clients are using only 1/3 of
>>>>>>>>>>>>>>>> the nodes in a disperse volume for writing.
>>>>>>>>>>>>>>>> I am testing from 50 clients using 1 to 10 threads with
>>>>>>>>>>>>>>>> file names part-0-xxxx.
>>>>>>>>>>>>>>>> What I see is that the clients only use 20 nodes for
>>>>>>>>>>>>>>>> writing. How is the file-name-to-subvolume hashing done?
>>>>>>>>>>>>>>>> Is this related to the file names being similar?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My cluster is 3.7.10 with 60 nodes, each with 26 disks.
>>>>>>>>>>>>>>>> The disperse volume is 78 x (16+4). Only 26 out of 78
>>>>>>>>>>>>>>>> subvolumes are used during writes.
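The hashing question in the original message can be sketched with a toy model: DHT picks the subvolume from a hash of the file name alone. GlusterFS actually uses its own Davies-Meyer-based hash plus per-directory layout ranges; the md5 stand-in below is purely illustrative. The point it demonstrates is the one made earlier in the thread: every write funnelled through one fixed distcp temporary name lands on a single ec set, while distinct final names spread out.

```python
import hashlib

SUBVOLUMES = 78  # 78 ec sets, as in this volume

def subvol_for(name: str) -> int:
    # Stand-in for DHT's name hash; md5 is only for illustration.
    h = int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")
    return h % SUBVOLUMES

# All copies in one distcp map reuse the same temporary name, so every
# one of them hashes to the same subvolume:
tmp = ".distcp.tmp.attempt_1460381790773_0248_m_000001_0"
targets = {subvol_for(tmp) for _ in range(500)}

# Whereas 500 distinct final part-m-NNNNN names spread across many
# subvolumes:
finals = {subvol_for(f"part-m-{i:05d}") for i in range(500)}

print(len(targets), len(finals))
```

With the default rsync-hash-regex the real translation is slightly different (the dot-prefixed temp name is rewritten before hashing), but the effect is the same: one name, one subvolume.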