Serkan Çoban
2016-Apr-21 12:23 UTC
[Gluster-users] disperse volume file to subvolume mapping
>Has the rebalance operation finished successfully ? has it skipped any files ?
Yes according to gluster v rebalance status it is completed without any errors.
rebalance status report is like:

Node         Rebalanced files   size     Scanned   failures   skipped
1.1.1.185    158                29GB     1720      0          314
1.1.1.205    93                 46.5GB   761       0          95
1.1.1.225    74                 37GB     779       0          94

All other hosts has 0 values.

I double check that files with '---------T' attributes are there,
maybe some of them deleted but I still see them in bricks...
I am also concerned why part files not distributed to all 60 nodes?
Rebalance should do that?

On Thu, Apr 21, 2016 at 1:55 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
> Hi Serkan,
>
> On 21/04/16 12:39, Serkan Çoban wrote:
>>
>> I started a gluster v rebalance v0 start command hoping that it will
>> equally redistribute files across 60 nodes but it did not do that...
>> why it did not redistribute files? any thoughts?
>
> Has the rebalance operation finished successfully ? has it skipped any
> files ?
>
> After a successful rebalance all files with attributes '---------T' should
> have disappeared.
>
>> On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez
>> <xhernandez at datalab.es> wrote:
>>>
>>> Hi Serkan,
>>>
>>> On 21/04/16 10:07, Serkan Çoban wrote:
>>>>>
>>>>> I think the problem is in the temporary name that distcp gives to the
>>>>> file while it's being copied before renaming it to the real name. Do
>>>>> you know what is the structure of this name ?
>>>>
>>>> Distcp temporary file name format is:
>>>> ".distcp.tmp.attempt_1460381790773_0248_m_000001_0" and the same
>>>> temporary file name used by one map process. For example I see in the
>>>> logs that one map copies files part-m-00031,part-m-00047,part-m-00063
>>>> sequentially and they all use same temporary file name above. So no
>>>> original file name appears in temporary file name.
>>>
>>> This explains the problem. With the default options, DHT sends all files
>>> to the subvolume that should store a file named 'distcp.tmp'.
>>>
>>> With this temporary name format, little can be done.
>>>
>>>> I will check if we can modify distcp behaviour, or we have to write
>>>> our mapreduce procedures instead of using distcp.
>>>>
>>>>> 2. define the option 'extra-hash-regex' to an expression that matches
>>>>> your temporary file names and returns the same name that will finally
>>>>> have. Depending on the differences between original and temporary file
>>>>> names, this option could be useless.
>>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
>>>>> name conversion, so the files will be evenly distributed. However this
>>>>> will cause a lot of files placed in incorrect subvolumes, creating a
>>>>> lot of link files until a rebalance is executed.
>>>>
>>>> How can I set these options?
>>>
>>> You can set gluster options using:
>>>
>>> gluster volume set <volname> <option> <value>
>>>
>>> for example:
>>>
>>> gluster volume set v0 rsync-hash-regex none
>>>
>>> Xavi
>>>
>>>> On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez
>>>> <xhernandez at datalab.es> wrote:
>>>>>
>>>>> Hi Serkan,
>>>>>
>>>>> I think the problem is in the temporary name that distcp gives to the
>>>>> file while it's being copied before renaming it to the real name. Do
>>>>> you know what is the structure of this name ?
>>>>>
>>>>> DHT selects the subvolume (in this case the ec set) on which the file
>>>>> will be stored based on the name of the file. This has a problem when
>>>>> a file is being renamed, because this could change the subvolume where
>>>>> the file should be found.
>>>>>
>>>>> DHT has a feature to avoid incorrect file placements when executing
>>>>> renames for the rsync case. What it does is to check if the file
>>>>> matches the following regular expression:
>>>>>
>>>>> ^\.(.+)\.[^.]+$
>>>>>
>>>>> If a match is found, it only considers the part between parenthesis to
>>>>> calculate the destination subvolume.
>>>>>
>>>>> This is useful for rsync because temporary file names are constructed
>>>>> in the following way: suppose the original filename is 'test'. The
>>>>> temporary filename while rsync is being executed is made by prepending
>>>>> a dot and appending '.<random chars>': .test.712hd
>>>>>
>>>>> As you can see, the original name and the part of the name between
>>>>> parenthesis that matches the regular expression are the same. This
>>>>> causes that, after renaming the temporary file to its original
>>>>> filename, both files will be considered to belong to the same
>>>>> subvolume by DHT.
>>>>>
>>>>> In your case it's very probable that distcp uses a temporary name like
>>>>> '.part.<number>'. In this case the portion of the name used to select
>>>>> the subvolume is always 'part'. This would explain why all files go to
>>>>> the same subvolume. Once the file is renamed to another name, DHT
>>>>> realizes that it should go to another subvolume. At this point it
>>>>> creates a link file (those files with access rights = '---------T') in
>>>>> the correct subvolume but it doesn't move it. As you can see, this
>>>>> kind of files are better balanced.
>>>>>
>>>>> To solve this problem you have three options:
>>>>>
>>>>> 1. change the temporary filename used by distcp to correctly match the
>>>>> regular expression. I'm not sure if this can be configured, but if
>>>>> this is possible, this is the best option.
>>>>>
>>>>> 2. define the option 'extra-hash-regex' to an expression that matches
>>>>> your temporary file names and returns the same name that will finally
>>>>> have. Depending on the differences between original and temporary file
>>>>> names, this option could be useless.
>>>>>
>>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
>>>>> name conversion, so the files will be evenly distributed. However this
>>>>> will cause a lot of files placed in incorrect subvolumes, creating a
>>>>> lot of link files until a rebalance is executed.
>>>>>
>>>>> Xavi
>>>>>
>>>>> On 20/04/16 14:13, Serkan Çoban wrote:
>>>>>>
>>>>>> Here is the steps that I do in detail and relevant output from bricks:
>>>>>>
>>>>>> I am using below command for volume creation:
>>>>>> gluster volume create v0 disperse 20 redundancy 4 \
>>>>>> 1.1.1.{185..204}:/bricks/02 \
>>>>>> 1.1.1.{205..224}:/bricks/02 \
>>>>>> 1.1.1.{225..244}:/bricks/02 \
>>>>>> 1.1.1.{185..204}:/bricks/03 \
>>>>>> 1.1.1.{205..224}:/bricks/03 \
>>>>>> 1.1.1.{225..244}:/bricks/03 \
>>>>>> 1.1.1.{185..204}:/bricks/04 \
>>>>>> 1.1.1.{205..224}:/bricks/04 \
>>>>>> 1.1.1.{225..244}:/bricks/04 \
>>>>>> 1.1.1.{185..204}:/bricks/05 \
>>>>>> 1.1.1.{205..224}:/bricks/05 \
>>>>>> 1.1.1.{225..244}:/bricks/05 \
>>>>>> 1.1.1.{185..204}:/bricks/06 \
>>>>>> 1.1.1.{205..224}:/bricks/06 \
>>>>>> 1.1.1.{225..244}:/bricks/06 \
>>>>>> 1.1.1.{185..204}:/bricks/07 \
>>>>>> 1.1.1.{205..224}:/bricks/07 \
>>>>>> 1.1.1.{225..244}:/bricks/07 \
>>>>>> 1.1.1.{185..204}:/bricks/08 \
>>>>>> 1.1.1.{205..224}:/bricks/08 \
>>>>>> 1.1.1.{225..244}:/bricks/08 \
>>>>>> 1.1.1.{185..204}:/bricks/09 \
>>>>>> 1.1.1.{205..224}:/bricks/09 \
>>>>>> 1.1.1.{225..244}:/bricks/09 \
>>>>>> 1.1.1.{185..204}:/bricks/10 \
>>>>>> 1.1.1.{205..224}:/bricks/10 \
>>>>>> 1.1.1.{225..244}:/bricks/10 \
>>>>>> 1.1.1.{185..204}:/bricks/11 \
>>>>>> 1.1.1.{205..224}:/bricks/11 \
>>>>>> 1.1.1.{225..244}:/bricks/11 \
>>>>>> 1.1.1.{185..204}:/bricks/12 \
>>>>>> 1.1.1.{205..224}:/bricks/12 \
>>>>>> 1.1.1.{225..244}:/bricks/12 \
>>>>>> 1.1.1.{185..204}:/bricks/13 \
>>>>>> 1.1.1.{205..224}:/bricks/13 \
>>>>>> 1.1.1.{225..244}:/bricks/13 \
>>>>>> 1.1.1.{185..204}:/bricks/14 \
>>>>>> 1.1.1.{205..224}:/bricks/14 \
>>>>>> 1.1.1.{225..244}:/bricks/14 \
>>>>>> 1.1.1.{185..204}:/bricks/15 \
>>>>>> 1.1.1.{205..224}:/bricks/15 \
>>>>>> 1.1.1.{225..244}:/bricks/15 \
>>>>>> 1.1.1.{185..204}:/bricks/16 \
>>>>>> 1.1.1.{205..224}:/bricks/16 \
>>>>>> 1.1.1.{225..244}:/bricks/16 \
>>>>>> 1.1.1.{185..204}:/bricks/17 \
>>>>>> 1.1.1.{205..224}:/bricks/17 \
>>>>>> 1.1.1.{225..244}:/bricks/17 \
>>>>>> 1.1.1.{185..204}:/bricks/18 \
>>>>>> 1.1.1.{205..224}:/bricks/18 \
>>>>>> 1.1.1.{225..244}:/bricks/18 \
>>>>>> 1.1.1.{185..204}:/bricks/19 \
>>>>>> 1.1.1.{205..224}:/bricks/19 \
>>>>>> 1.1.1.{225..244}:/bricks/19 \
>>>>>> 1.1.1.{185..204}:/bricks/20 \
>>>>>> 1.1.1.{205..224}:/bricks/20 \
>>>>>> 1.1.1.{225..244}:/bricks/20 \
>>>>>> 1.1.1.{185..204}:/bricks/21 \
>>>>>> 1.1.1.{205..224}:/bricks/21 \
>>>>>> 1.1.1.{225..244}:/bricks/21 \
>>>>>> 1.1.1.{185..204}:/bricks/22 \
>>>>>> 1.1.1.{205..224}:/bricks/22 \
>>>>>> 1.1.1.{225..244}:/bricks/22 \
>>>>>> 1.1.1.{185..204}:/bricks/23 \
>>>>>> 1.1.1.{205..224}:/bricks/23 \
>>>>>> 1.1.1.{225..244}:/bricks/23 \
>>>>>> 1.1.1.{185..204}:/bricks/24 \
>>>>>> 1.1.1.{205..224}:/bricks/24 \
>>>>>> 1.1.1.{225..244}:/bricks/24 \
>>>>>> 1.1.1.{185..204}:/bricks/25 \
>>>>>> 1.1.1.{205..224}:/bricks/25 \
>>>>>> 1.1.1.{225..244}:/bricks/25 \
>>>>>> 1.1.1.{185..204}:/bricks/26 \
>>>>>> 1.1.1.{205..224}:/bricks/26 \
>>>>>> 1.1.1.{225..244}:/bricks/26 \
>>>>>> 1.1.1.{185..204}:/bricks/27 \
>>>>>> 1.1.1.{205..224}:/bricks/27 \
>>>>>> 1.1.1.{225..244}:/bricks/27 force
>>>>>>
>>>>>> then I mount volume on 50 clients:
>>>>>> mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster
>>>>>>
>>>>>> then I make a directory from one of the clients and chmod it.
>>>>>> mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1
>>>>>>
>>>>>> then I start distcp on clients, there are 1059X8.8GB files in one
>>>>>> folder and they will be copied to /mnt/gluster/s1 with 100 parallel
>>>>>> which means 2 copy jobs per client at same time.
>>>>>> hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb file:///mnt/gluster/s1
>>>>>>
>>>>>> After job finished here is the status of s1 directory from bricks:
>>>>>> s1 directory is present in all 1560 brick.
>>>>>> s1/teragen-10tb folder is present in all 1560 brick.
>>>>>>
>>>>>> full listing of files in bricks:
>>>>>> https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0
>>>>>>
>>>>>> You can ignore the .crc files in the brick output above, they are
>>>>>> checksum files...
>>>>>>
>>>>>> As you can see part-m-xxxx files written only some bricks in nodes
>>>>>> 0205..0224
>>>>>> All bricks have some files but they have zero size.
>>>>>>
>>>>>> I increase file descriptors to 65k so it is not the issue...
>>>>>>
>>>>>> On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez
>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>
>>>>>>> Hi Serkan,
>>>>>>>
>>>>>>> On 19/04/16 15:16, Serkan Çoban wrote:
>>>>>>>>>>> I assume that gluster is used to store the intermediate files
>>>>>>>>>>> before the reduce phase
>>>>>>>>
>>>>>>>> Nope, gluster is the destination for distcp command.
>>>>>>>> hadoop distcp -m 50 http://nn1:8020/path/to/folder file:///mnt/gluster
>>>>>>>> This run maps on datanodes which have /mnt/gluster mounted on all of
>>>>>>>> them.
>>>>>>>
>>>>>>> I don't know hadoop, so I'm of little help here. However it seems
>>>>>>> that -m 50 means to execute 50 copies in parallel. This means that
>>>>>>> even if the distribution worked fine, at most 50 (much probably less)
>>>>>>> of the 78 ec sets would be used in parallel.
>>>>>>>
>>>>>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>>>>>> mapreduce.
>>>>>>>>
>>>>>>>> Yes but how a client write 500 files to gluster mount and those file
>>>>>>>> just written only to subset of subvolumes? I cannot use gluster as a
>>>>>>>> backup cluster if I cannot write with distcp.
>>>>>>>
>>>>>>> All 500 files were created only on one of the 78 ec sets and the
>>>>>>> remaining 77 got empty ?
>>>>>>>
>>>>>>>>>>> You should look which files are created in each brick and how
>>>>>>>>>>> many while the process is running.
>>>>>>>>
>>>>>>>> Files only created on nodes 185..204 or 205..224 or 225..244. Only
>>>>>>>> on 20 nodes in each test.
>>>>>>>
>>>>>>> How many files there were in each brick ?
>>>>>>>
>>>>>>> Not sure if this can be related, but standard linux distributions
>>>>>>> have a default limit of 1024 open file descriptors. Having a so big
>>>>>>> volume and doing a massive copy, maybe this limit is affecting
>>>>>>> something ?
>>>>>>>
>>>>>>> Are there any error or warning messages in the mount or bricks logs ?
>>>>>>>
>>>>>>> Xavi
>>>>>>>
>>>>>>>> On Tue, Apr 19, 2016 at 1:05 PM, Xavier Hernandez
>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>
>>>>>>>>> Hi Serkan,
>>>>>>>>>
>>>>>>>>> moved to gluster-users since this doesn't belong to devel list.
>>>>>>>>>
>>>>>>>>> On 19/04/16 11:24, Serkan Çoban wrote:
>>>>>>>>>>
>>>>>>>>>> I am copying 10.000 files to gluster volume using mapreduce on
>>>>>>>>>> clients. Each map process took one file at a time and copy it to
>>>>>>>>>> gluster volume.
>>>>>>>>>
>>>>>>>>> I assume that gluster is used to store the intermediate files
>>>>>>>>> before the reduce phase.
>>>>>>>>>
>>>>>>>>>> My disperse volume consist of 78 subvolumes of 16+4 disk each. So
>>>>>>>>>> If I copy >78 files parallel I expect each file goes to different
>>>>>>>>>> subvolume right?
>>>>>>>>>
>>>>>>>>> If you only copy 78 files, most probably you will get some
>>>>>>>>> subvolume empty and some other with more than one or two files.
>>>>>>>>> It's not an exact distribution, it's a statistically balanced
>>>>>>>>> distribution: over time and with enough files, each brick will
>>>>>>>>> contain an amount of files in the same order of magnitude, but
>>>>>>>>> they won't have the *same* number of files.
>>>>>>>>>
>>>>>>>>>> In my tests during tests with fio I can see every file goes to
>>>>>>>>>> different subvolume, but when I start mapreduce process from
>>>>>>>>>> clients only 78/3=26 subvolumes used for writing files.
>>>>>>>>>
>>>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>>>> mapreduce.
>>>>>>>>>
>>>>>>>>>> I see that clearly from network traffic. Mapreduce on client side
>>>>>>>>>> can be run multi thread. I tested with 1-5-10 threads on each
>>>>>>>>>> client but every time only 26 subvolumes used.
>>>>>>>>>> How can I debug the issue further?
>>>>>>>>>
>>>>>>>>> You should look which files are created in each brick and how many
>>>>>>>>> while the process is running.
>>>>>>>>>
>>>>>>>>> Xavi
>>>>>>>>>
>>>>>>>>>> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez
>>>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>
>>>>>>>>>>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi, I just reinstalled fresh 3.7.11 and I am seeing the same
>>>>>>>>>>>> behavior.
>>>>>>>>>>>> 50 clients copying part-0-xxxx named files using mapreduce to
>>>>>>>>>>>> gluster using one thread per server and they are using only 20
>>>>>>>>>>>> servers out of 60. On the other hand fio tests use all the
>>>>>>>>>>>> servers. Anything I can do to solve the issue?
>>>>>>>>>>>
>>>>>>>>>>> Distribution of files to ec sets is done by dht. In theory if you
>>>>>>>>>>> create many files each ec set will receive the same amount of
>>>>>>>>>>> files. However when the number of files is small enough,
>>>>>>>>>>> statistics can fail.
>>>>>>>>>>>
>>>>>>>>>>> Not sure what you are doing exactly, but a mapreduce procedure
>>>>>>>>>>> generally only creates a single output. In that case it makes
>>>>>>>>>>> sense that only one ec set is used. If you want to use all ec
>>>>>>>>>>> sets for a single file, you should enable sharding (I haven't
>>>>>>>>>>> tested that) or split the result in multiple files.
>>>>>>>>>>>
>>>>>>>>>>> Xavi
>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Serkan
>>>>>>>>>>>>
>>>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>>>> From: Serkan Çoban <cobanserkan at gmail.com>
>>>>>>>>>>>> Date: Mon, Apr 18, 2016 at 2:39 PM
>>>>>>>>>>>> Subject: disperse volume file to subvolume mapping
>>>>>>>>>>>> To: Gluster Users <gluster-users at gluster.org>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi, I have a problem where clients are using only 1/3 of nodes
>>>>>>>>>>>> in disperse volume for writing.
>>>>>>>>>>>> I am testing from 50 clients using 1 to 10 threads with file
>>>>>>>>>>>> names part-0-xxxx.
>>>>>>>>>>>> What I see is clients only use 20 nodes for writing. How is the
>>>>>>>>>>>> file name to sub volume hashing is done? Is this related to file
>>>>>>>>>>>> names are similar?
>>>>>>>>>>>>
>>>>>>>>>>>> My cluster is 3.7.10 with 60 nodes each has 26 disks. Disperse
>>>>>>>>>>>> volume is 78 x (16+4). Only 26 out of 78 sub volumes used during
>>>>>>>>>>>> writes..
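
The name handling described in the thread above (the default rsync regular expression, and why every distcp temporary file ends up on one subvolume) can be tried out with a short script. This is only an illustrative sketch in Python, not GlusterFS code: the helper name hash_key is made up for this example, and it only shows which part of a file name would be fed to the hash under the regular expression quoted in the thread. It does not reproduce the DHT hash itself or the subvolume that would be chosen.

```python
import re

# The default rsync-style pattern quoted in the thread.
RSYNC_HASH_REGEX = re.compile(r"^\.(.+)\.[^.]+$")


def hash_key(name, pattern=RSYNC_HASH_REGEX):
    """Return the part of `name` that would be used for hashing.

    Hypothetical helper for illustration: if the name matches the
    pattern, only the parenthesised group is used; otherwise the
    whole name is used.
    """
    match = pattern.match(name)
    return match.group(1) if match else name


if __name__ == "__main__":
    names = (
        ".test.712hd",   # rsync temporary name for 'test'
        "test",          # final name after the rename
        ".distcp.tmp.attempt_1460381790773_0248_m_000001_0",
        "part-m-00031",  # final distcp name
    )
    for name in names:
        print(f"{name!r} -> {hash_key(name)!r}")
```

The rsync temporary name and its final name reduce to the same key ('test'), so the rename does not change the target subvolume. Every distcp temporary name reduces to 'distcp.tmp' regardless of the file being copied, which matches the explanation given in the thread for all files landing on the same ec set.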
Xavier Hernandez
2016-Apr-21 12:34 UTC
[Gluster-users] disperse volume file to subvolume mapping
Can you try a 'gluster volume rebalance v0 start force' ? On 21/04/16 14:23, Serkan ?oban wrote:>> Has the rebalance operation finished successfully ? has it skipped any files ? > Yes according to gluster v rebalance status it is completed without any errors. > rebalance status report is like: > Node Rebalanced files size Scanned > failures skipped > 1.1.1.185 158 29GB 1720 > 0 314 > 1.1.1.205 93 46.5GB 761 > 0 95 > 1.1.1.225 74 37GB 779 > 0 94 > > > All other hosts has 0 values. > > I double check that files with '---------T' attributes are there, > maybe some of them deleted but I still see them in bricks... > I am also concerned why part files not distributed to all 60 nodes? > Rebalance should do that? > > On Thu, Apr 21, 2016 at 1:55 PM, Xavier Hernandez <xhernandez at datalab.es> wrote: >> Hi Serkan, >> >> On 21/04/16 12:39, Serkan ?oban wrote: >>> >>> I started a gluster v rebalance v0 start command hoping that it will >>> equally redistribute files across 60 nodes but it did not do that... >>> why it did not redistribute files? any thoughts? >> >> >> Has the rebalance operation finished successfully ? has it skipped any files >> ? >> >> After a successful rebalance all files with attributes '---------T' should >> have disappeared. >> >> >>> >>> On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez >>> <xhernandez at datalab.es> wrote: >>>> >>>> Hi Serkan, >>>> >>>> On 21/04/16 10:07, Serkan ?oban wrote: >>>>>> >>>>>> >>>>>> I think the problem is in the temporary name that distcp gives to the >>>>>> file while it's being copied before renaming it to the real name. Do >>>>>> you >>>>>> know what is the structure of this name ? >>>>> >>>>> >>>>> Distcp temporary file name format is: >>>>> ".distcp.tmp.attempt_1460381790773_0248_m_000001_0" and the same >>>>> temporary file name used by one map process. 
For example I see in the >>>>> logs that one map copies files part-m-00031,part-m-00047,part-m-00063 >>>>> sequentially and they all use same temporary file name above. So no >>>>> original file name appears in temporary file name. >>>> >>>> >>>> >>>> This explains the problem. With the default options, DHT sends all files >>>> to >>>> the subvolume that should store a file named 'distcp.tmp'. >>>> >>>> With this temporary name format, little can be done. >>>> >>>>> >>>>> I will check if we can modify distcp behaviour, or we have to write >>>>> our mapreduce procedures instead of using distcp. >>>>> >>>>>> 2. define the option 'extra-hash-regex' to an expression that matches >>>>>> your temporary file names and returns the same name that will finally >>>>>> have. >>>>>> Depending on the differences between original and temporary file names, >>>>>> this >>>>>> option could be useless. >>>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the >>>>>> name conversion, so the files will be evenly distributed. However this >>>>>> will >>>>>> cause a lot of files placed in incorrect subvolumes, creating a lot of >>>>>> link >>>>>> files until a rebalance is executed. >>>>> >>>>> >>>>> >>>>> How can I set these options? >>>> >>>> >>>> >>>> You can set gluster options using: >>>> >>>> gluster volume set <volname> <option> <value> >>>> >>>> for example: >>>> >>>> gluster volume set v0 rsync-hash-regex none >>>> >>>> Xavi >>>> >>>> >>>>> >>>>> >>>>> >>>>> On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez >>>>> <xhernandez at datalab.es> wrote: >>>>>> >>>>>> >>>>>> Hi Serkan, >>>>>> >>>>>> I think the problem is in the temporary name that distcp gives to the >>>>>> file >>>>>> while it's being copied before renaming it to the real name. Do you >>>>>> know >>>>>> what is the structure of this name ? >>>>>> >>>>>> DHT selects the subvolume (in this case the ec set) on which the file >>>>>> will >>>>>> be stored based on the name of the file. 
This has a problem when a file >>>>>> is >>>>>> being renamed, because this could change the subvolume where the file >>>>>> should >>>>>> be found. >>>>>> >>>>>> DHT has a feature to avoid incorrect file placements when executing >>>>>> renames >>>>>> for the rsync case. What it does is to check if the file matches the >>>>>> following regular expression: >>>>>> >>>>>> ^\.(.+)\.[^.]+$ >>>>>> >>>>>> If a match is found, it only considers the part between parenthesis to >>>>>> calculate the destination subvolume. >>>>>> >>>>>> This is useful for rsync because temporary file names are constructed >>>>>> in >>>>>> the >>>>>> following way: suppose the original filename is 'test'. The temporary >>>>>> filename while rsync is being executed is made by prepending a dot and >>>>>> appending '.<random chars>': .test.712hd >>>>>> >>>>>> As you can see, the original name and the part of the name between >>>>>> parenthesis that matches the regular expression are the same. This >>>>>> causes >>>>>> that, after renaming the temporary file to its original filename, both >>>>>> files >>>>>> will be considered to belong to the same subvolume by DHT. >>>>>> >>>>>> In your case it's very probable that distcp uses a temporary name like >>>>>> '.part.<number>'. In this case the portion of the name used to select >>>>>> the >>>>>> subvolume is always 'part'. This would explain why all files go to the >>>>>> same >>>>>> subvolume. Once the file is renamed to another name, DHT realizes that >>>>>> it >>>>>> should go to another subvolume. At this point it creates a link file >>>>>> (those >>>>>> files with access rights = '---------T') in the correct subvolume but >>>>>> it >>>>>> doesn't move it. As you can see, this kind of files are better >>>>>> balanced. >>>>>> >>>>>> To solve this problem you have three options: >>>>>> >>>>>> 1. change the temporary filename used by distcp to correctly match the >>>>>> regular expression. 
I'm not sure if this can be configured, but if this >>>>>> is >>>>>> possible, this is the best option. >>>>>> >>>>>> 2. define the option 'extra-hash-regex' to an expression that matches >>>>>> your >>>>>> temporary file names and returns the same name that will finally have. >>>>>> Depending on the differences between original and temporary file names, >>>>>> this >>>>>> option could be useless. >>>>>> >>>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the >>>>>> name >>>>>> conversion, so the files will be evenly distributed. However this will >>>>>> cause >>>>>> a lot of files placed in incorrect subvolumes, creating a lot of link >>>>>> files >>>>>> until a rebalance is executed. >>>>>> >>>>>> Xavi >>>>>> >>>>>> >>>>>> On 20/04/16 14:13, Serkan ?oban wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Here is the steps that I do in detail and relevant output from bricks: >>>>>>> >>>>>>> I am using below command for volume creation: >>>>>>> gluster volume create v0 disperse 20 redundancy 4 \ >>>>>>> 1.1.1.{185..204}:/bricks/02 \ >>>>>>> 1.1.1.{205..224}:/bricks/02 \ >>>>>>> 1.1.1.{225..244}:/bricks/02 \ >>>>>>> 1.1.1.{185..204}:/bricks/03 \ >>>>>>> 1.1.1.{205..224}:/bricks/03 \ >>>>>>> 1.1.1.{225..244}:/bricks/03 \ >>>>>>> 1.1.1.{185..204}:/bricks/04 \ >>>>>>> 1.1.1.{205..224}:/bricks/04 \ >>>>>>> 1.1.1.{225..244}:/bricks/04 \ >>>>>>> 1.1.1.{185..204}:/bricks/05 \ >>>>>>> 1.1.1.{205..224}:/bricks/05 \ >>>>>>> 1.1.1.{225..244}:/bricks/05 \ >>>>>>> 1.1.1.{185..204}:/bricks/06 \ >>>>>>> 1.1.1.{205..224}:/bricks/06 \ >>>>>>> 1.1.1.{225..244}:/bricks/06 \ >>>>>>> 1.1.1.{185..204}:/bricks/07 \ >>>>>>> 1.1.1.{205..224}:/bricks/07 \ >>>>>>> 1.1.1.{225..244}:/bricks/07 \ >>>>>>> 1.1.1.{185..204}:/bricks/08 \ >>>>>>> 1.1.1.{205..224}:/bricks/08 \ >>>>>>> 1.1.1.{225..244}:/bricks/08 \ >>>>>>> 1.1.1.{185..204}:/bricks/09 \ >>>>>>> 1.1.1.{205..224}:/bricks/09 \ >>>>>>> 1.1.1.{225..244}:/bricks/09 \ >>>>>>> 1.1.1.{185..204}:/bricks/10 \ >>>>>>> 
1.1.1.{205..224}:/bricks/10 \ >>>>>>> 1.1.1.{225..244}:/bricks/10 \ >>>>>>> 1.1.1.{185..204}:/bricks/11 \ >>>>>>> 1.1.1.{205..224}:/bricks/11 \ >>>>>>> 1.1.1.{225..244}:/bricks/11 \ >>>>>>> 1.1.1.{185..204}:/bricks/12 \ >>>>>>> 1.1.1.{205..224}:/bricks/12 \ >>>>>>> 1.1.1.{225..244}:/bricks/12 \ >>>>>>> 1.1.1.{185..204}:/bricks/13 \ >>>>>>> 1.1.1.{205..224}:/bricks/13 \ >>>>>>> 1.1.1.{225..244}:/bricks/13 \ >>>>>>> 1.1.1.{185..204}:/bricks/14 \ >>>>>>> 1.1.1.{205..224}:/bricks/14 \ >>>>>>> 1.1.1.{225..244}:/bricks/14 \ >>>>>>> 1.1.1.{185..204}:/bricks/15 \ >>>>>>> 1.1.1.{205..224}:/bricks/15 \ >>>>>>> 1.1.1.{225..244}:/bricks/15 \ >>>>>>> 1.1.1.{185..204}:/bricks/16 \ >>>>>>> 1.1.1.{205..224}:/bricks/16 \ >>>>>>> 1.1.1.{225..244}:/bricks/16 \ >>>>>>> 1.1.1.{185..204}:/bricks/17 \ >>>>>>> 1.1.1.{205..224}:/bricks/17 \ >>>>>>> 1.1.1.{225..244}:/bricks/17 \ >>>>>>> 1.1.1.{185..204}:/bricks/18 \ >>>>>>> 1.1.1.{205..224}:/bricks/18 \ >>>>>>> 1.1.1.{225..244}:/bricks/18 \ >>>>>>> 1.1.1.{185..204}:/bricks/19 \ >>>>>>> 1.1.1.{205..224}:/bricks/19 \ >>>>>>> 1.1.1.{225..244}:/bricks/19 \ >>>>>>> 1.1.1.{185..204}:/bricks/20 \ >>>>>>> 1.1.1.{205..224}:/bricks/20 \ >>>>>>> 1.1.1.{225..244}:/bricks/20 \ >>>>>>> 1.1.1.{185..204}:/bricks/21 \ >>>>>>> 1.1.1.{205..224}:/bricks/21 \ >>>>>>> 1.1.1.{225..244}:/bricks/21 \ >>>>>>> 1.1.1.{185..204}:/bricks/22 \ >>>>>>> 1.1.1.{205..224}:/bricks/22 \ >>>>>>> 1.1.1.{225..244}:/bricks/22 \ >>>>>>> 1.1.1.{185..204}:/bricks/23 \ >>>>>>> 1.1.1.{205..224}:/bricks/23 \ >>>>>>> 1.1.1.{225..244}:/bricks/23 \ >>>>>>> 1.1.1.{185..204}:/bricks/24 \ >>>>>>> 1.1.1.{205..224}:/bricks/24 \ >>>>>>> 1.1.1.{225..244}:/bricks/24 \ >>>>>>> 1.1.1.{185..204}:/bricks/25 \ >>>>>>> 1.1.1.{205..224}:/bricks/25 \ >>>>>>> 1.1.1.{225..244}:/bricks/25 \ >>>>>>> 1.1.1.{185..204}:/bricks/26 \ >>>>>>> 1.1.1.{205..224}:/bricks/26 \ >>>>>>> 1.1.1.{225..244}:/bricks/26 \ >>>>>>> 1.1.1.{185..204}:/bricks/27 \ >>>>>>> 1.1.1.{205..224}:/bricks/27 \ >>>>>>> 
1.1.1.{225..244}:/bricks/27 force >>>>>>> >>>>>>> then I mount volume on 50 clients: >>>>>>> mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster >>>>>>> >>>>>>> then I make a directory from one of the clients and chmod it. >>>>>>> mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1 >>>>>>> >>>>>>> then I start distcp on clients, there are 1059X8.8GB files in one >>>>>>> folder >>>>>>> and >>>>>>> they will be copied to /mnt/gluster/s1 with 100 parallel which means 2 >>>>>>> copy jobs per client at same time. >>>>>>> hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb >>>>>>> file:///mnt/gluster/s1 >>>>>>> >>>>>>> After job finished here is the status of s1 directory from bricks: >>>>>>> s1 directory is present in all 1560 brick. >>>>>>> s1/teragen-10tb folder is present in all 1560 brick. >>>>>>> >>>>>>> full listing of files in bricks: >>>>>>> https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0 >>>>>>> >>>>>>> You can ignore the .crc files in the brick output above, they are >>>>>>> checksum files... >>>>>>> >>>>>>> As you can see part-m-xxxx files written only some bricks in nodes >>>>>>> 0205..0224 >>>>>>> All bricks have some files but they have zero size. >>>>>>> >>>>>>> I increase file descriptors to 65k so it is not the issue... >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez >>>>>>> <xhernandez at datalab.es> >>>>>>> wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Hi Serkan, >>>>>>>> >>>>>>>> On 19/04/16 15:16, Serkan ?oban wrote: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I assume that gluster is used to store the intermediate files >>>>>>>>>>>> before >>>>>>>>>>>> the reduce phase >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Nope, gluster is the destination for distcp command. hadoop distcp >>>>>>>>> -m >>>>>>>>> 50 http://nn1:8020/path/to/folder file:///mnt/gluster >>>>>>>>> This run maps on datanodes which have /mnt/gluster mounted on all of >>>>>>>>> them. 
>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I don't know hadoop, so I'm of little help here. However it seems >>>>>>>> that >>>>>>>> -m >>>>>>>> 50 >>>>>>>> means to execute 50 copies in parallel. This means that even if the >>>>>>>> distribution worked fine, at most 50 (much probably less) of the 78 >>>>>>>> ec >>>>>>>> sets >>>>>>>> would be used in parallel. >>>>>>>> >>>>>>>>> >>>>>>>>>>>> This means that this is caused by some peculiarity of the >>>>>>>>>>>> mapreduce. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Yes but how a client write 500 files to gluster mount and those file >>>>>>>>> just written only to subset of subvolumes? I cannot use gluster as a >>>>>>>>> backup cluster if I cannot write with distcp. >>>>>>>>> >>>>>>>> >>>>>>>> All 500 files were created only on one of the 78 ec sets and the >>>>>>>> remaining >>>>>>>> 77 got empty ? >>>>>>>> >>>>>>>>>>>> You should look which files are created in each brick and how >>>>>>>>>>>> many >>>>>>>>>>>> while the process is running. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Files only created on nodes 185..204 or 205..224 or 225..244. Only >>>>>>>>> on >>>>>>>>> 20 nodes in each test. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> How many files there were in each brick ? >>>>>>>> >>>>>>>> Not sure if this can be related, but standard linux distributions >>>>>>>> have >>>>>>>> a >>>>>>>> default limit of 1024 open file descriptors. Having a so big volume >>>>>>>> and >>>>>>>> doing a massive copy, maybe this limit is affecting something ? >>>>>>>> >>>>>>>> Are there any error or warning messages in the mount or bricks logs ? >>>>>>>> >>>>>>>> >>>>>>>> Xavi >>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Apr 19, 2016 at 1:05 PM, Xavier Hernandez >>>>>>>>> <xhernandez at datalab.es> >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Hi Serkan, >>>>>>>>>> >>>>>>>>>> moved to gluster-users since this doesn't belong to devel list. 
>>>>>>>>>> >>>>>>>>>> On 19/04/16 11:24, Serkan ?oban wrote: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I am copying 10.000 files to gluster volume using mapreduce on >>>>>>>>>>> clients. Each map process took one file at a time and copy it to >>>>>>>>>>> gluster volume. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I assume that gluster is used to store the intermediate files >>>>>>>>>> before >>>>>>>>>> the >>>>>>>>>> reduce phase. >>>>>>>>>> >>>>>>>>>>> My disperse volume consist of 78 subvolumes of 16+4 disk each. So >>>>>>>>>>> If >>>>>>>>>>> I >>>>>>>>>>> copy >78 files parallel I expect each file goes to different >>>>>>>>>>> subvolume >>>>>>>>>>> right? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> If you only copy 78 files, most probably you will get some >>>>>>>>>> subvolume >>>>>>>>>> empty >>>>>>>>>> and some other with more than one or two files. It's not an exact >>>>>>>>>> distribution, it's a statistially balanced distribution: over time >>>>>>>>>> and >>>>>>>>>> with >>>>>>>>>> enough files, each brick will contain an amount of files in the >>>>>>>>>> same >>>>>>>>>> order >>>>>>>>>> of magnitude, but they won't have the *same* number of files. >>>>>>>>>> >>>>>>>>>>> In my tests during tests with fio I can see every file goes to >>>>>>>>>>> different subvolume, but when I start mapreduce process from >>>>>>>>>>> clients >>>>>>>>>>> only 78/3=26 subvolumes used for writing files. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> This means that this is caused by some peculiarity of the >>>>>>>>>> mapreduce. >>>>>>>>>> >>>>>>>>>>> I see that clearly from network traffic. Mapreduce on client side >>>>>>>>>>> can >>>>>>>>>>> be run multi thread. I tested with 1-5-10 threads on each client >>>>>>>>>>> but >>>>>>>>>>> every time only 26 subvolumes used. >>>>>>>>>>> How can I debug the issue further? 
>>>>>>>>>>
>>>>>>>>>> You should look which files are created in each brick and how many
>>>>>>>>>> while the process is running.
>>>>>>>>>>
>>>>>>>>>> Xavi
>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez
>>>>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>>
>>>>>>>>>>>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, I just reinstalled fresh 3.7.11 and I am seeing the same
>>>>>>>>>>>>> behavior. 50 clients copying part-0-xxxx named files using
>>>>>>>>>>>>> mapreduce to gluster using one thread per server and they are
>>>>>>>>>>>>> using only 20 servers out of 60. On the other hand fio tests
>>>>>>>>>>>>> use all the servers. Anything I can do to solve the issue?
>>>>>>>>>>>>
>>>>>>>>>>>> Distribution of files to ec sets is done by dht. In theory if
>>>>>>>>>>>> you create many files each ec set will receive the same amount
>>>>>>>>>>>> of files. However when the number of files is small enough,
>>>>>>>>>>>> statistics can fail.
>>>>>>>>>>>>
>>>>>>>>>>>> Not sure what you are doing exactly, but a mapreduce procedure
>>>>>>>>>>>> generally only creates a single output. In that case it makes
>>>>>>>>>>>> sense that only one ec set is used. If you want to use all ec
>>>>>>>>>>>> sets for a single file, you should enable sharding (I haven't
>>>>>>>>>>>> tested that) or split the result in multiple files.
>>>>>>>>>>>>
>>>>>>>>>>>> Xavi
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Serkan
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>>>>> From: Serkan Çoban <cobanserkan at gmail.com>
>>>>>>>>>>>>> Date: Mon, Apr 18, 2016 at 2:39 PM
>>>>>>>>>>>>> Subject: disperse volume file to subvolume mapping
>>>>>>>>>>>>> To: Gluster Users <gluster-users at gluster.org>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, I have a problem where clients are using only 1/3 of nodes
>>>>>>>>>>>>> in disperse volume for writing.
>>>>>>>>>>>>> I am testing from 50 clients using 1 to 10 threads with file
>>>>>>>>>>>>> names part-0-xxxx.
>>>>>>>>>>>>> What I see is clients only use 20 nodes for writing. How is the
>>>>>>>>>>>>> file name to sub volume hashing is done? Is this related to
>>>>>>>>>>>>> file names are similar?
>>>>>>>>>>>>>
>>>>>>>>>>>>> My cluster is 3.7.10 with 60 nodes each has 26 disks. Disperse
>>>>>>>>>>>>> volume is 78 x (16+4). Only 26 out of 78 sub volumes used
>>>>>>>>>>>>> during writes..
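
A note on the "statistically balanced distribution" point made in the thread. The following is only a minimal sketch, assuming every file lands on one of the 78 ec sets uniformly at random. That assumption is a stand-in for dht's name hashing, not its actual algorithm, and the function names are made up for illustration. It estimates how many ec sets would stay empty for a given number of files.

```python
import random


def expected_empty(subvolumes: int, files: int) -> float:
    """Expected number of empty subvolumes under uniform random placement.

    Assumption (not gluster's real hashing): each file independently picks
    one of `subvolumes` targets with equal probability.
    """
    return subvolumes * (1 - 1 / subvolumes) ** files


def simulate_empty(subvolumes: int, files: int, seed: int = 0) -> int:
    """Run one random trial and count subvolumes that received no file."""
    rng = random.Random(seed)
    used = {rng.randrange(subvolumes) for _ in range(files)}
    return subvolumes - len(used)


if __name__ == "__main__":
    # 78 ec sets as in the thread; the file counts are the ones mentioned.
    for n in (78, 500, 10000):
        print(n, round(expected_empty(78, n), 2), simulate_empty(78, n))
```

Under this assumption, copying only 78 files is expected to leave about 28 of the 78 ec sets empty, while with 10,000 files essentially none should stay empty. A stable pattern of only 26 used sets is therefore unlikely to be chance alone, which fits the suspicion in the thread that something specific to the mapreduce copy is involved.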