Pranith Kumar Karampuri
2017-May-10 17:27 UTC
[Gluster-users] Slow write times to gluster disk
On Wed, May 10, 2017 at 10:15 PM, Pat Haley <phaley at mit.edu> wrote:

> Hi Pranith,
>
> Not entirely sure (this isn't my area of expertise). I'll run your answer
> by some other people who are more familiar with this.
>
> I am also uncertain about how to interpret the results when we also add
> the dd tests writing to the /home area (no gluster, still on the same
> machine)
>
>   - dd test without oflag=sync (rough average of multiple tests)
>       - gluster w/ fuse mount: 570 Mb/s
>       - gluster w/ nfs mount:  390 Mb/s
>       - nfs (no gluster):      1.2 Gb/s
>   - dd test with oflag=sync (rough average of multiple tests)
>       - gluster w/ fuse mount: 5 Mb/s
>       - gluster w/ nfs mount:  200 Mb/s
>       - nfs (no gluster):      20 Mb/s
>
> Given that the non-gluster area is a RAID-6 of 4 disks while each brick
> of the gluster area is a RAID-6 of 32 disks, I would naively expect the
> writes to the gluster area to be roughly 8x faster than to the
> non-gluster area.

I think a better test is to try to write to a file using nfs, without any
gluster involved, to a location that is not inside the brick but some other
location on the same disk(s). If you are mounting the partition as the
brick, then we can write to a file inside the .glusterfs directory,
something like <brick-path>/.glusterfs/<file-to-be-removed-after-test>.

> I still think we have a speed issue; I can't tell if fuse vs nfs is part
> of the problem.

I got interested in the post because I read that fuse speed is lower than
nfs speed, which is counter-intuitive to my understanding, so I wanted
clarification. Now that I have that clarification, with fuse outperforming
nfs without sync, we can resume testing as described above and try to find
out what it is. Based on your email-id I am guessing you are in Boston and
I am in Bangalore, so if you are okay with doing this debugging over
multiple days because of the timezones, I will be happy to help. Please be
a bit patient with me; I am under a release crunch, but I am very curious
about the problem you posted.

> Was there anything useful in the profiles?

Unfortunately the profiles didn't help me much. I think we are collecting
the profiles from an active volume, so they contain a lot of information
that does not pertain to dd, and it is difficult to pick out dd's
contribution. So I went through your post again and found something I
hadn't paid much attention to earlier, i.e. oflag=sync, so I did my own
tests on my setup with FUSE and sent that reply.

> Pat
>
> On 05/10/2017 12:15 PM, Pranith Kumar Karampuri wrote:
>
> Okay good. At least this validates my doubts. Handling O_SYNC in gluster
> NFS and fuse is a bit different.
> When an application opens a file with O_SYNC on a fuse mount, each write
> syscall has to be written to disk as part of that syscall, whereas in the
> case of NFS there is no concept of open. NFS performs the write through a
> handle saying it needs to be a synchronous write, so the write() syscall
> is performed first and then an fsync() is performed; a write on an fd
> with O_SYNC becomes write+fsync. My guess is that when multiple threads
> do this write+fsync() operation on the same file, multiple writes get
> batched together before being written to disk, so the throughput on the
> disk increases.
>
> Does it answer your doubts?
>
> On Wed, May 10, 2017 at 9:35 PM, Pat Haley <phaley at mit.edu> wrote:
>
>> Without the oflag=sync and only a single test of each, the FUSE is
>> going faster than NFS:
>>
>> FUSE:
>> mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
>> 4096+0 records in
>> 4096+0 records out
>> 4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
>>
>> NFS:
>> mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
>> 4096+0 records in
>> 4096+0 records out
>> 4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
>>
>> On 05/10/2017 11:53 AM, Pranith Kumar Karampuri wrote:
>>
>> Could you let me know the speed without oflag=sync on both the mounts?
>> No need to collect profiles.
>>
>> On Wed, May 10, 2017 at 9:17 PM, Pat Haley <phaley at mit.edu> wrote:
>>
>>> Here is what I see now:
>>>
>>> [root at mseas-data2 ~]# gluster volume info
>>>
>>> Volume Name: data-volume
>>> Type: Distribute
>>> Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
>>> Status: Started
>>> Number of Bricks: 2
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: mseas-data2:/mnt/brick1
>>> Brick2: mseas-data2:/mnt/brick2
>>> Options Reconfigured:
>>> diagnostics.count-fop-hits: on
>>> diagnostics.latency-measurement: on
>>> nfs.exports-auth-enable: on
>>> diagnostics.brick-sys-log-level: WARNING
>>> performance.readdir-ahead: on
>>> nfs.disable: on
>>> nfs.export-volumes: off
>>>
>>> On 05/10/2017 11:44 AM, Pranith Kumar Karampuri wrote:
>>>
>>> Is this the volume info you have?
>>>
>>>> [root at mseas-data2 ~]# gluster volume info
>>>>
>>>> Volume Name: data-volume
>>>> Type: Distribute
>>>> Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
>>>> Status: Started
>>>> Number of Bricks: 2
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: mseas-data2:/mnt/brick1
>>>> Brick2: mseas-data2:/mnt/brick2
>>>> Options Reconfigured:
>>>> performance.readdir-ahead: on
>>>> nfs.disable: on
>>>> nfs.export-volumes: off
>>>
>>> I copied this from an old thread from 2016. This is a distribute
>>> volume. Did you change any of the options in between?

-- 
Pranith
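
One detail worth keeping in mind when comparing the numbers above: in dd,
conv=sync only pads short input blocks with zeros (effectively a no-op when
reading full blocks from /dev/zero), while oflag=sync opens the output file
with O_SYNC so every write is synchronous, and conv=fsync issues a single
fsync() at the end. Roughly:

dd if=/dev/zero of=zeros.txt bs=1048576 count=4096              # buffered writes only
dd if=/dev/zero of=zeros.txt bs=1048576 count=4096 conv=fsync   # one fsync() before dd exits
dd if=/dev/zero of=zeros.txt bs=1048576 count=4096 oflag=sync   # O_SYNC: every write waits for disk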
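
For concreteness, the test Pranith describes might look roughly like the
following, run on the server against the brick's backing filesystem (or
over a plain NFS export of that filesystem). This assumes the brick
mounted at /mnt/brick1; the file name and size are only illustrative, and
the test file should be removed afterwards:

# write straight to the brick's filesystem, bypassing gluster entirely
dd if=/dev/zero of=/mnt/brick1/.glusterfs/ddtest.tmp bs=1048576 count=4096
# repeat with O_SYNC writes to match the slow case
dd if=/dev/zero of=/mnt/brick1/.glusterfs/ddtest.tmp bs=1048576 count=4096 oflag=sync
# clean up so no stray file lingers under .glusterfs
rm -f /mnt/brick1/.glusterfs/ddtest.tmp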
Hi Pranith,

Since we are mounting the partitions as the bricks, I tried the dd test
writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>. The
results without oflag=sync were 1.6 Gb/s (faster than gluster but not as
fast as I was expecting, given the 1.2 Gb/s to the no-gluster area with
fewer disks).

Pat

On 05/10/2017 01:27 PM, Pranith Kumar Karampuri wrote:

> I think a better test is to try to write to a file using nfs, without any
> gluster involved, to a location that is not inside the brick but some
> other location on the same disk(s). If you are mounting the partition as
> the brick, then we can write to a file inside the .glusterfs directory,
> something like <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
>
> [...]
-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley                          Email:  phaley at mit.edu
Center for Ocean Engineering       Phone:  (617) 253-6824
Dept. of Mechanical Engineering    Fax:    (617) 253-8125
MIT, Room 5-213                    http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
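
On the point about the profiles carrying too much unrelated traffic: one
way to get statistics dominated by the dd run is to bracket the test with
profile commands, since (as I recall, worth verifying on the installed
gluster version) the "Interval" section of `gluster volume profile ... info`
covers only the period since the previous info call. A rough sketch,
assuming the gluster FUSE mount is at /gluster/mount (adjust to the real
mount point):

gluster volume profile data-volume start
gluster volume profile data-volume info > /dev/null        # discard stats accumulated so far
dd if=/dev/zero of=/gluster/mount/ddtest.tmp bs=1048576 count=4096 oflag=sync
gluster volume profile data-volume info > dd-profile.txt   # interval section should now be mostly dd
gluster volume profile data-volume stop
rm -f /gluster/mount/ddtest.tmp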