Pranith Kumar Karampuri
2017-May-11 16:06 UTC
[Gluster-users] Slow write times to gluster disk
On Thu, May 11, 2017 at 9:32 PM, Pat Haley <phaley at mit.edu> wrote:

> Hi Pranith,
>
> The /home partition is mounted as ext4:
>     /home          ext4    defaults,usrquota,grpquota    1 2
>
> The brick partitions are mounted as xfs:
>     /mnt/brick1    xfs     defaults    0 0
>     /mnt/brick2    xfs     defaults    0 0
>
> Will this cause a problem with creating a volume under /home?

I don't think the bottleneck is the disk. You can run the same tests you did
before on the new volume to confirm?

> Pat
>
> On 05/11/2017 11:32 AM, Pranith Kumar Karampuri wrote:
>
>> On Thu, May 11, 2017 at 8:57 PM, Pat Haley <phaley at mit.edu> wrote:
>>
>>> Hi Pranith,
>>>
>>> Unfortunately, we don't have similar hardware for a small-scale test.
>>> All we have is our production hardware.
>>
>> You said something about the /home partition, which has fewer disks. We
>> can create a plain distribute volume inside one of those directories and
>> remove the setup after we are done. What do you say?
>>
>>> On 05/11/2017 07:05 AM, Pranith Kumar Karampuri wrote:
>>>
>>>> On Thu, May 11, 2017 at 2:48 AM, Pat Haley <phaley at mit.edu> wrote:
>>>>
>>>>> Hi Pranith,
>>>>>
>>>>> Since we are mounting the partitions as the bricks, I tried the dd test
>>>>> writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
>>>>> The results without oflag=sync were 1.6 Gb/s (faster than gluster, but
>>>>> not as fast as I was expecting given the 1.2 Gb/s to the no-gluster
>>>>> area with fewer disks).
>>>>
>>>> Okay, then 1.6 Gb/s is what we need to target, considering your volume
>>>> is just distribute. Is there any way you can do tests on similar
>>>> hardware but at a small scale, just so we can run the workload and learn
>>>> more about the bottlenecks in the system? We can probably try to get the
>>>> speed to 1.2 Gb/s on the /home partition you were telling me about
>>>> yesterday. Let me know if that is something you are okay to do.
>>>>
>>>>> On 05/10/2017 01:27 PM, Pranith Kumar Karampuri wrote:
>>>>>
>>>>>> On Wed, May 10, 2017 at 10:15 PM, Pat Haley <phaley at mit.edu> wrote:
>>>>>>
>>>>>>> Hi Pranith,
>>>>>>>
>>>>>>> Not entirely sure (this isn't my area of expertise); I'll run your
>>>>>>> answer by some other people who are more familiar with this.
>>>>>>>
>>>>>>> I am also uncertain about how to interpret the results when we add
>>>>>>> the dd tests writing to the /home area (no gluster, still on the same
>>>>>>> machine):
>>>>>>>
>>>>>>> - dd test without oflag=sync (rough average of multiple tests)
>>>>>>>     - gluster w/ fuse mount: 570 Mb/s
>>>>>>>     - gluster w/ nfs mount:  390 Mb/s
>>>>>>>     - nfs (no gluster):      1.2 Gb/s
>>>>>>> - dd test with oflag=sync (rough average of multiple tests)
>>>>>>>     - gluster w/ fuse mount: 5 Mb/s
>>>>>>>     - gluster w/ nfs mount:  200 Mb/s
>>>>>>>     - nfs (no gluster):      20 Mb/s
>>>>>>>
>>>>>>> Given that the non-gluster area is a RAID-6 of 4 disks while each
>>>>>>> brick of the gluster area is a RAID-6 of 32 disks, I would naively
>>>>>>> expect the writes to the gluster area to be roughly 8x faster than to
>>>>>>> the non-gluster area.
>>>>>>
>>>>>> I think a better test is to write a file over NFS, without any gluster
>>>>>> involved, to a location that is not inside the brick but is on the
>>>>>> same disk(s). If you are mounting the partition as the brick, then we
>>>>>> can write to a file inside the .glusterfs directory, something like
>>>>>> <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
>>>>>>
>>>>>>> I still think we have a speed issue; I can't tell if fuse vs nfs is
>>>>>>> part of the problem.
>>>>>>
>>>>>> I got interested in the post because I read that fuse speed is lower
>>>>>> than nfs speed, which is counter-intuitive to my understanding, so I
>>>>>> wanted clarifications. Now that I have them (fuse outperformed nfs
>>>>>> without sync), we can resume testing as described above and try to
>>>>>> find what it is. Based on your email id I am guessing you are from
>>>>>> Boston and I am from Bangalore, so if you are okay with doing this
>>>>>> debugging over multiple days because of the timezones, I will be happy
>>>>>> to help. Please be a bit patient with me; I am under a release crunch,
>>>>>> but I am very curious about the problem you posted.
>>>>>>
>>>>>>> Was there anything useful in the profiles?
>>>>>>
>>>>>> Unfortunately the profiles didn't help me much. I think we are
>>>>>> collecting them from an active volume, so they contain a lot of
>>>>>> information that does not pertain to dd, which makes it difficult to
>>>>>> isolate dd's contribution. So I went through your post again, found
>>>>>> something I hadn't paid much attention to earlier (oflag=sync), did my
>>>>>> own tests on my setup with FUSE, and sent that reply.
>>>>>>
>>>>>>> On 05/10/2017 12:15 PM, Pranith Kumar Karampuri wrote:
>>>>>>>
>>>>>>>> Okay good. At least this validates my doubts. Handling O_SYNC in
>>>>>>>> gluster NFS and fuse is a bit different. When an application opens a
>>>>>>>> file with O_SYNC on a fuse mount, each write syscall has to be
>>>>>>>> written to disk as part of the syscall, whereas in the case of NFS
>>>>>>>> there is no concept of open. NFS performs the write through a handle
>>>>>>>> saying it needs to be a synchronous write, so the write() syscall is
>>>>>>>> performed first and then an fsync() is performed; a write on an fd
>>>>>>>> with O_SYNC becomes write+fsync. I suspect that when multiple
>>>>>>>> threads do this write+fsync() operation on the same file, multiple
>>>>>>>> writes are batched together before being written to disk, which is
>>>>>>>> my guess for why the throughput on the disk increases.
>>>>>>>>
>>>>>>>> Does it answer your doubts?
>>>>>>>>
>>>>>>>> On Wed, May 10, 2017 at 9:35 PM, Pat Haley <phaley at mit.edu> wrote:
>>>>>>>>
>>>>>>>>> Without the oflag=sync, and with only a single test of each, FUSE
>>>>>>>>> is going faster than NFS:
>>>>>>>>>
>>>>>>>>> FUSE:
>>>>>>>>> mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
>>>>>>>>> 4096+0 records in
>>>>>>>>> 4096+0 records out
>>>>>>>>> 4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
>>>>>>>>>
>>>>>>>>> NFS:
>>>>>>>>> mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
>>>>>>>>> 4096+0 records in
>>>>>>>>> 4096+0 records out
>>>>>>>>> 4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
>>>>>>>>>
>>>>>>>>> On 05/10/2017 11:53 AM, Pranith Kumar Karampuri wrote:
>>>>>>>>>
>>>>>>>>>> Could you let me know the speed without oflag=sync on both
>>>>>>>>>> mounts? No need to collect profiles.
>>>>>>>>>>
>>>>>>>>>> On Wed, May 10, 2017 at 9:17 PM, Pat Haley <phaley at mit.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Here is what I see now:
>>>>>>>>>>>
>>>>>>>>>>> [root at mseas-data2 ~]# gluster volume info
>>>>>>>>>>>
>>>>>>>>>>> Volume Name: data-volume
>>>>>>>>>>> Type: Distribute
>>>>>>>>>>> Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
>>>>>>>>>>> Status: Started
>>>>>>>>>>> Number of Bricks: 2
>>>>>>>>>>> Transport-type: tcp
>>>>>>>>>>> Bricks:
>>>>>>>>>>> Brick1: mseas-data2:/mnt/brick1
>>>>>>>>>>> Brick2: mseas-data2:/mnt/brick2
>>>>>>>>>>> Options Reconfigured:
>>>>>>>>>>> diagnostics.count-fop-hits: on
>>>>>>>>>>> diagnostics.latency-measurement: on
>>>>>>>>>>> nfs.exports-auth-enable: on
>>>>>>>>>>> diagnostics.brick-sys-log-level: WARNING
>>>>>>>>>>> performance.readdir-ahead: on
>>>>>>>>>>> nfs.disable: on
>>>>>>>>>>> nfs.export-volumes: off
>>>>>>>>>>>
>>>>>>>>>>> On 05/10/2017 11:44 AM, Pranith Kumar Karampuri wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Is this the volume info you have?
>>>>>>>>>>>>
>>>>>>>>>>>>> [root at mseas-data2 ~]# gluster volume info
>>>>>>>>>>>>>
>>>>>>>>>>>>> Volume Name: data-volume
>>>>>>>>>>>>> Type: Distribute
>>>>>>>>>>>>> Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
>>>>>>>>>>>>> Status: Started
>>>>>>>>>>>>> Number of Bricks: 2
>>>>>>>>>>>>> Transport-type: tcp
>>>>>>>>>>>>> Bricks:
>>>>>>>>>>>>> Brick1: mseas-data2:/mnt/brick1
>>>>>>>>>>>>> Brick2: mseas-data2:/mnt/brick2
>>>>>>>>>>>>> Options Reconfigured:
>>>>>>>>>>>>> performance.readdir-ahead: on
>>>>>>>>>>>>> nfs.disable: on
>>>>>>>>>>>>> nfs.export-volumes: off
>>>>>>>>>>>>
>>>>>>>>>>>> I copied this from the old thread from 2016. This is a
>>>>>>>>>>>> distribute volume. Did you change any of the options in between?

--
Pranith
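A rough client-side way to see the O_SYNC-versus-write+fsync difference
described above is to compare dd's two flush modes against the FUSE mount.
This is only an approximation of what gluster NFS does internally, and the
mount point and file names here are hypothetical:

    # every write() is issued with O_SYNC, as with an O_SYNC open on the FUSE mount
    dd if=/dev/zero of=/gdata/osync_test.tmp bs=1M count=1024 oflag=sync

    # writes are issued first, then the data is flushed with a single fsync()
    # at the end, loosely resembling the write-then-fsync pattern described for gNFS
    dd if=/dev/zero of=/gdata/fsync_test.tmp bs=1M count=1024 conv=fsync

    rm -f /gdata/osync_test.tmp /gdata/fsync_test.tmp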
Pat Haley
[Gluster-users] Slow write times to gluster disk

Hi Pranith,

My question was about setting up a gluster volume on an ext4 partition. I
thought we had the bricks mounted as xfs for compatibility with gluster?

Pat

On 05/11/2017 12:06 PM, Pranith Kumar Karampuri wrote:

> On Thu, May 11, 2017 at 9:32 PM, Pat Haley <phaley at mit.edu> wrote:
>
>> Hi Pranith,
>>
>> The /home partition is mounted as ext4:
>>     /home          ext4    defaults,usrquota,grpquota    1 2
>>
>> The brick partitions are mounted as xfs:
>>     /mnt/brick1    xfs     defaults    0 0
>>     /mnt/brick2    xfs     defaults    0 0
>>
>> Will this cause a problem with creating a volume under /home?
>
> I don't think the bottleneck is the disk. You can run the same tests you
> did before on the new volume to confirm?

--

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley                          Email:  phaley at mit.edu
Center for Ocean Engineering       Phone:  (617) 253-6824
Dept. of Mechanical Engineering    Fax:    (617) 253-8125
MIT, Room 5-213                    http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
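On the ext4 question: gluster's main requirement of the brick filesystem is
extended-attribute support (it keeps its metadata in trusted.* xattrs), which
ext4 provides, so a scratch directory under /home should be usable for a
throwaway test volume. A quick sanity check, run as root with a hypothetical
test directory:

    mkdir -p /home/xattr_test
    setfattr -n trusted.glustertest -v works /home/xattr_test
    getfattr -n trusted.glustertest /home/xattr_test
    # expected to print something like: trusted.glustertest="works"
    rm -rf /home/xattr_test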
Pat Haley
[Gluster-users] Slow write times to gluster disk

Hi Pranith,

Sorry for the delay. I never received your reply (but I did receive Ben
Turner's follow-up to it). We tried to create a gluster volume under /home
using different variations of

    gluster volume create test-volume mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2 transport tcp

However, we keep getting errors of the form

    Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>

Any thoughts on what we're doing wrong?

Also, do you have a list of the tests we should run once we get this volume
created? Given the time-zone difference, it might help if we could run a
small battery of tests and post all the results at once, rather than going
through many test-post / new-test-post round trips.

Thanks

Pat

On 05/11/2017 12:06 PM, Pranith Kumar Karampuri wrote:

> On Thu, May 11, 2017 at 9:32 PM, Pat Haley <phaley at mit.edu> wrote:
>
>> Hi Pranith,
>>
>> The /home partition is mounted as ext4:
>>     /home          ext4    defaults,usrquota,grpquota    1 2
>>
>> The brick partitions are mounted as xfs:
>>     /mnt/brick1    xfs     defaults    0 0
>>     /mnt/brick2    xfs     defaults    0 0
>>
>> Will this cause a problem with creating a volume under /home?
>
> I don't think the bottleneck is the disk. You can run the same tests you
> did before on the new volume to confirm?

--

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley                          Email:  phaley at mit.edu
Center for Ocean Engineering       Phone:  (617) 253-6824
Dept. of Mechanical Engineering    Fax:    (617) 253-8125
MIT, Room 5-213                    http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
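One likely cause of the error above: the gluster CLI parses the optional
transport keyword before the brick list, so a trailing "transport tcp" is
read as another brick and rejected with "Wrong brick type". Reordering the
same command (tcp is also the default transport, so the option can simply be
omitted) should get past that error; this is untested here and reuses the
brick directories from the message above:

    # transport (optional, tcp is the default) goes before the bricks
    gluster volume create test-volume transport tcp \
        mseas-data2:/home/gbrick_test_1 \
        mseas-data2:/home/gbrick_test_2
    # depending on the setup, gluster may warn and ask for the command to be
    # re-run with 'force'
    gluster volume start test-volume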
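For the requested batch of tests, one possible self-contained sequence, along
the lines already discussed in the thread, is sketched below. The mount
point, file names, and sizes are placeholders; the volume name and brick path
come from the messages above, and the test files should be removed afterwards:

    # FUSE-mount the new test volume (hypothetical mount point)
    mkdir -p /mnt/testvol
    mount -t glusterfs mseas-data2:/test-volume /mnt/testvol

    # capture a profile window that covers only the dd runs
    gluster volume profile test-volume start

    # through gluster, without and then with synchronous writes
    dd if=/dev/zero of=/mnt/testvol/ddtest.tmp bs=1M count=4096
    dd if=/dev/zero of=/mnt/testvol/ddtest_sync.tmp bs=1M count=4096 oflag=sync

    # directly on the underlying /home disks, bypassing gluster, for comparison
    # (add conv=fsync if the final flush should be included in the timing)
    dd if=/dev/zero of=/home/gbrick_test_1/.glusterfs/ddtest.tmp bs=1M count=4096

    gluster volume profile test-volume info > test-volume-profile.txt
    gluster volume profile test-volume stop

    # clean up the test files
    rm -f /mnt/testvol/ddtest.tmp /mnt/testvol/ddtest_sync.tmp \
          /home/gbrick_test_1/.glusterfs/ddtest.tmp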