Hi Ben,
Sorry this took so long, but we had a real-time forecasting exercise
last week and I could only get to this now.
Backend Hardware/OS:
* Much of the information on our back end system is included at the
top of
http://lists.gluster.org/pipermail/gluster-users/2017-April/030529.html
* The specific model of the hard disks is Seagate ENTERPRISE CAPACITY
V.4 6TB (ST6000NM0024). The rated interface speed is 6 Gb/s.
* Note: there is one physical server that hosts both the NFS and the
GlusterFS areas.
Latest tests
I have had time to run one of the dd tests you requested against the
underlying XFS FS. The median rate was 170 MB/s. The dd results
and iostat record are in
http://mseas.mit.edu/download/phaley/GlusterUsers/TestXFS/
I'll add tests for the other brick and for the NFS area later.
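For reference, a rough sketch of how a run like this could be scripted: it repeats the write test with conv=fdatasync and reports the median rate. The function names and all of the arguments are placeholders for illustration, not the exact procedure behind the numbers above.

```shell
#!/bin/sh
# Sketch only: function names and all arguments are illustrative assumptions.

# dd_rate_mbs: read dd's status output on stdin and print the transfer rate
# in MB/s, computed from the byte count and the elapsed seconds.
dd_rate_mbs() {
    awk '/copied/ {
        for (i = 2; i <= NF; i++)
            if ($i == "s,") printf "%.1f\n", $1 / $(i - 1) / 1000000
    }'
}

# median: read one number per line on stdin and print the median.
median() {
    sort -n | awk '{ v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            if (NR % 2) print v[(NR + 1) / 2]
            else print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}

# run_write_trials DIR TRIALS COUNT: write COUNT 1 MiB blocks of zeros
# TRIALS times into DIR, flushing data to disk before dd reports its rate.
run_write_trials() {
    dir=$1; trials=$2; count=$3
    i=1
    while [ "$i" -le "$trials" ]; do
        LC_ALL=C dd if=/dev/zero of="$dir/dd_trial.$$" bs=1024k \
            count="$count" conv=fdatasync 2>&1 | dd_rate_mbs
        i=$((i + 1))
    done
    rm -f "$dir/dd_trial.$$"
}
```

With a hypothetical scratch directory on the brick, something like `run_write_trials /mnt/brick1/dd-test 12 10000 | median` would print a single median rate in MB/s.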
Thanks
Pat
On 06/12/2017 06:06 PM, Ben Turner wrote:
> Ok you are correct, you have a pure distributed volume. IE no replication
> overhead. So normally for pure dist I use:
>
> throughput = (slower of disks or NIC) * .6-.7
>
> In your case we have:
>
> 1200 * .6 = 720
>
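The rule of thumb above can be written out as a small shell function. This is only a sketch: the function name and argument order are made up for illustration, the replica divisor comes from the longer form of the formula quoted further down the thread, and the disk figure passed in the example is an assumption (the estimate above only uses the 1200 MB/s NIC figure).

```shell
#!/bin/sh
# estimate_throughput NIC_MBS DISK_MBS REPLICAS OVERHEAD
# Prints min(NIC, disk) / replicas * overhead, rounded to whole MB/s.
# All four inputs are supplied by the caller; none are measured here.
estimate_throughput() {
    awk -v nic="$1" -v disk="$2" -v rep="$3" -v ovh="$4" 'BEGIN {
        slow = (nic < disk) ? nic : disk
        printf "%.0f\n", slow / rep * ovh
    }'
}
```

For a pure distribute volume (one copy of each file), `estimate_throughput 1200 1200 1 0.6` reproduces the 720 MB/s figure; passing 2 as the replica count halves the result.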
> So you are seeing a little less throughput than I would expect in your
> configuration. What I like to do here is:
>
> -First tell me more about your back end storage, will it sustain 1200 MB /
> sec? What kind of HW? How many disks? What type and specs are the disks?
> What kind of RAID are you using?
>
> -Second can you refresh me on your workload? Are you doing reads / writes
> or both? If both what mix? Since we are using DD I assume you are working with
> large file sequential I/O, is this correct?
>
> -Run some DD tests on the back end XFS FS. I normally have
> /xfs-mount/gluster-brick, if you have something similar just mkdir on the XFS
> -> /xfs-mount/my-test-dir. Inside the test dir run:
>
> If you are focusing on a write workload run:
>
> # dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
>
> If you are focusing on a read workload run:
>
> # echo 3 > /proc/sys/vm/drop_caches
> # dd if=/xfs-mount/file of=/dev/null bs=1024k count=10000
>
> ** MAKE SURE TO DROP CACHE IN BETWEEN READS!! **
>
> Run this in a loop similar to how you did in:
>
> http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
>
> Run this on both servers one at a time and if you are running on a SAN then
> run again on both at the same time. While this is running gather iostat for me:
>
> # iostat -c -m -x 1 > iostat-$(hostname).txt
>
> Let's see how the back end performs on both servers while capturing iostat,
> then see how the same workload / data looks on gluster.
>
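One possible way to keep the iostat capture tied to the lifetime of a test run is sketched below. The wrapper name and its arguments are hypothetical; only the iostat flags are taken from the command quoted above.

```shell
#!/bin/sh
# capture_iostat_during OUTFILE COMMAND [ARGS...]
# Start iostat in the background, run the given command, then stop iostat.
# Returns the exit status of the command that was run.
capture_iostat_during() {
    out=$1
    shift
    iostat -c -m -x 1 > "$out" &
    monitor_pid=$!
    status=0
    "$@" || status=$?
    kill "$monitor_pid" 2>/dev/null
    wait "$monitor_pid" 2>/dev/null || true
    return "$status"
}
```

A call such as `capture_iostat_during "iostat-$(hostname).txt" sh -c 'dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync'` would then leave one iostat record per run.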
> -Last thing, when you run your kernel NFS tests are you using the same
> filesystem / storage you are using for the gluster bricks? I want to be sure we
> have an apples to apples comparison here.
>
> -b
>
>
>
> ----- Original Message -----
>> From: "Pat Haley" <phaley at mit.edu>
>> To: "Ben Turner" <bturner at redhat.com>
>> Sent: Monday, June 12, 2017 5:18:07 PM
>> Subject: Re: [Gluster-users] Slow write times to gluster disk
>>
>>
>> Hi Ben,
>>
>> Here is the output:
>>
>> [root at mseas-data2 ~]# gluster volume info
>>
>> Volume Name: data-volume
>> Type: Distribute
>> Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
>> Status: Started
>> Number of Bricks: 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: mseas-data2:/mnt/brick1
>> Brick2: mseas-data2:/mnt/brick2
>> Options Reconfigured:
>> nfs.exports-auth-enable: on
>> diagnostics.brick-sys-log-level: WARNING
>> performance.readdir-ahead: on
>> nfs.disable: on
>> nfs.export-volumes: off
>>
>>
>> On 06/12/2017 05:01 PM, Ben Turner wrote:
>>> What is the output of gluster v info? That will tell us more about your
>>> config.
>>>
>>> -b
>>>
>>> ----- Original Message -----
>>>> From: "Pat Haley" <phaley at mit.edu>
>>>> To: "Ben Turner" <bturner at redhat.com>
>>>> Sent: Monday, June 12, 2017 4:54:00 PM
>>>> Subject: Re: [Gluster-users] Slow write times to gluster disk
>>>>
>>>>
>>>> Hi Ben,
>>>>
>>>> I guess I'm confused about what you mean by replication. If I look at
>>>> the underlying bricks I only ever have a single copy of any file. It
>>>> either resides on one brick or the other (directories exist on both
>>>> bricks but not files). We are not using gluster for redundancy (or at
>>>> least that wasn't our intent). Is that what you meant by replication
>>>> or is it something else?
>>>>
>>>> Thanks
>>>>
>>>> Pat
>>>>
>>>> On 06/12/2017 04:28 PM, Ben Turner wrote:
>>>>> ----- Original Message -----
>>>>>> From: "Pat Haley" <phaley at mit.edu>
>>>>>> To: "Ben Turner" <bturner at redhat.com>, "Pranith Kumar Karampuri"
>>>>>> <pkarampu at redhat.com>
>>>>>> Cc: "Ravishankar N" <ravishankar at redhat.com>, gluster-users at gluster.org,
>>>>>> "Steve Postma" <SPostma at ztechnet.com>
>>>>>> Sent: Monday, June 12, 2017 2:35:41 PM
>>>>>> Subject: Re: [Gluster-users] Slow write times to gluster disk
>>>>>>
>>>>>>
>>>>>> Hi Guys,
>>>>>>
>>>>>> I was wondering what our next steps should be to solve the slow write
>>>>>> times.
>>>>>>
>>>>>> Recently I was debugging a large code and writing a lot of output at
>>>>>> every time step. When I tried writing to our gluster disks, it was
>>>>>> taking over a day to do a single time step whereas if I had the same
>>>>>> program (same hardware, network) write to our nfs disk the time per
>>>>>> time-step was about 45 minutes. What we are shooting for here would be
>>>>>> to have similar times to either gluster or nfs.
>>>>> I can see in your test:
>>>>>
>>>>> http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
>>>>>
>>>>> You averaged ~600 MB / sec (expected for replica 2 with 10G,
>>>>> {~1200 MB / sec} / #replicas{2} = 600). Gluster does client side
>>>>> replication so with replica 2 you will only ever see 1/2 the speed of your
>>>>> slowest part of the stack (NW, disk, RAM, CPU). This is usually NW or disk
>>>>> and 600 is normally a best case. Now in your output I do see the instances
>>>>> where you went down to 200 MB / sec. I can only explain this in three ways:
>>>>>
>>>>> 1. You are not using conv=fdatasync and writes are actually going to page
>>>>> cache and then being flushed to disk. During the fsync the memory is not
>>>>> yet available and the disks are busy flushing dirty pages.
>>>>> 2. Your storage RAID group is shared across multiple LUNs (like in a SAN)
>>>>> and when write times are slow the RAID group is busy servicing other LUNs.
>>>>> 3. Gluster bug / config issue / some other unknown unknown.
>>>>>
>>>>> So I see 2 issues here:
>>>>>
>>>>> 1. NFS does in 45 minutes what gluster can do in 24 hours.
>>>>> 2. Sometimes your throughput drops dramatically.
>>>>>
>>>>> WRT #1 - have a look at my estimates above. My formula for guesstimating
>>>>> gluster perf is: throughput = NIC throughput or storage (whatever is
>>>>> slower) / # replicas * overhead (figure .7 or .8). Also the larger the
>>>>> record size the better for glusterfs mounts, I normally like to be at
>>>>> LEAST 64k up to 1024k:
>>>>>
>>>>> # dd if=/dev/zero of=/gluster-mount/file bs=1024k count=10000 conv=fdatasync
>>>>>
>>>>> WRT #2 - Again, I question your testing and your storage config. Try using
>>>>> conv=fdatasync for your DDs, use a larger record size, and make sure that
>>>>> your back end storage is not causing your slowdowns. Also remember that
>>>>> with replica 2 you will take ~50% hit on writes because the client uses
>>>>> 50% of its bandwidth to write to one replica and 50% to the other.
>>>>>
>>>>> -b
>>>>>
>>>>>
>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Pat
>>>>>>
>>>>>>
>>>>>> On 06/02/2017 01:07 AM, Ben Turner wrote:
>>>>>>> Are you sure using conv=sync is what you want? I normally use
>>>>>>> conv=fdatasync, I'll look up the difference between the two and see if
>>>>>>> it affects your test.
>>>>>>>
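On the difference raised above: as documented for GNU dd, conv=sync pads each short input block with NULs up to the block size and does not force anything to disk, while conv=fdatasync physically writes the output file data before dd finishes. The sketch below illustrates both; the function names and the 512-byte block size are illustrative choices, not something taken from the tests in this thread.

```shell
#!/bin/sh
# pad_demo OUTFILE: feed dd a 3-byte input with a 512-byte block size.
# conv=sync pads the short block with NULs, so the output file is 512 bytes.
# Prints the resulting file size in bytes.
pad_demo() {
    printf 'abc' | dd of="$1" bs=512 conv=sync 2>/dev/null
    wc -c < "$1" | tr -d ' '
}

# flushed_write OUTFILE COUNT: write COUNT 1 MiB blocks of zeros.
# conv=fdatasync makes dd flush the file data to disk before it exits,
# so the rate dd reports includes the time spent flushing.
flushed_write() {
    dd if=/dev/zero of="$1" bs=1024k count="$2" conv=fdatasync
}
```

Since /dev/zero always returns full blocks, conv=sync has nothing to pad in a test that reads from it, so such a run behaves like a plain buffered write.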
>>>>>>>
>>>>>>> -b
>>>>>>>
>>>>>>> ----- Original Message -----
>>>>>>>> From: "Pat Haley" <phaley at mit.edu>
>>>>>>>> To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
>>>>>>>> Cc: "Ravishankar N" <ravishankar at redhat.com>,
>>>>>>>> gluster-users at gluster.org,
>>>>>>>> "Steve Postma" <SPostma at ztechnet.com>, "Ben
>>>>>>>> Turner" <bturner at redhat.com>
>>>>>>>> Sent: Tuesday, May 30, 2017 9:40:34 PM
>>>>>>>> Subject: Re: [Gluster-users] Slow write times to gluster disk
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Pranith,
>>>>>>>>
>>>>>>>> The "dd" command was:
>>>>>>>>
>>>>>>>> dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
>>>>>>>>
>>>>>>>> There were 2 instances where dd reported 22 seconds. The output from
>>>>>>>> the dd tests is in
>>>>>>>>
>>>>>>>> http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
>>>>>>>>
>>>>>>>> Pat
>>>>>>>>
>>>>>>>> On 05/30/2017 09:27 PM, Pranith Kumar Karampuri wrote:
>>>>>>>>> Pat,
>>>>>>>>> What is the command you used? As per the following output, it
>>>>>>>>> seems like at least one write operation took 16 seconds, which is
>>>>>>>>> really bad.
>>>>>>>>> 96.39 1165.10 us 89.00 us *16487014.00 us* 393212 WRITE
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, May 30, 2017 at 10:36 PM, Pat Haley <phaley at mit.edu
>>>>>>>>> <mailto:phaley at mit.edu>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Pranith,
>>>>>>>>>
>>>>>>>>> I ran the same 'dd' test both in the gluster test volume and in
>>>>>>>>> the .glusterfs directory of each brick. The median results (12 dd
>>>>>>>>> trials in each test) are similar to before
>>>>>>>>>
>>>>>>>>> * gluster test volume: 586.5 MB/s
>>>>>>>>> * bricks (in .glusterfs): 1.4 GB/s
>>>>>>>>>
>>>>>>>>> The profile for the gluster test-volume is in
>>>>>>>>>
>>>>>>>>> http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> Pat
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 05/30/2017 12:10 PM, Pranith
Kumar Karampuri wrote:
>>>>>>>>>> Let's start with the same
'dd' test we were testing with to
>>>>>>>>>> see,
>>>>>>>>>> what the numbers are. Please
provide profile numbers for the
>>>>>>>>>> same. From there on we will
start tuning the volume to see
>>>>>>>>>> what
>>>>>>>>>> we can do.
>>>>>>>>>>
>>>>>>>>>> On Tue, May 30, 2017 at 9:16
PM, Pat Haley <phaley at mit.edu
>>>>>>>>>> <mailto:phaley at
mit.edu>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Pranith,
>>>>>>>>>>
>>>>>>>>>> Thanks for the tip. We now
have the gluster volume
>>>>>>>>>> mounted
>>>>>>>>>> under /home. What tests do
you recommend we run?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> Pat
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 05/17/2017 05:01 AM,
Pranith Kumar Karampuri wrote:
>>>>>>>>>>> On Tue, May 16, 2017 at
9:20 PM, Pat Haley
>>>>>>>>>>> <phaley at mit.edu
>>>>>>>>>>> <mailto:phaley at
mit.edu>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Pranith,
>>>>>>>>>>>
>>>>>>>>>>> Sorry for the
delay. I never received your
>>>>>>>>>>> reply
>>>>>>>>>>> (but I did receive
Ben Turner's follow-up to your
>>>>>>>>>>> reply). So we
tried to create a gluster volume
>>>>>>>>>>> under
>>>>>>>>>>> /home using
different variations of
>>>>>>>>>>>
>>>>>>>>>>> gluster volume
create test-volume
>>>>>>>>>>>
mseas-data2:/home/gbrick_test_1
>>>>>>>>>>>
mseas-data2:/home/gbrick_test_2 transport tcp
>>>>>>>>>>>
>>>>>>>>>>> However we keep
getting errors of the form
>>>>>>>>>>>
>>>>>>>>>>> Wrong brick type:
transport, use
>>>>>>>>>>>
<HOSTNAME>:<export-dir-abs-path>
>>>>>>>>>>>
>>>>>>>>>>> Any thoughts on
what we're doing wrong?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> You should give
transport tcp at the beginning I think.
>>>>>>>>>>> Anyways, transport tcp
is the default, so no need to
>>>>>>>>>>> specify
>>>>>>>>>>> so remove those two
words from the CLI.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Also do you have a
list of the test we should be
>>>>>>>>>>> running
>>>>>>>>>>> once we get this
volume created? Given the
>>>>>>>>>>> time-zone
>>>>>>>>>>> difference it might
help if we can run a small
>>>>>>>>>>> battery
>>>>>>>>>>> of tests and post
the results rather than
>>>>>>>>>>> test-post-new
>>>>>>>>>>> test-post... .
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This is the first time
I am doing performance analysis
>>>>>>>>>>> on
>>>>>>>>>>> users as far as I
remember. In our team there are
>>>>>>>>>>> separate
>>>>>>>>>>> engineers who do these
tests. Ben who replied earlier is
>>>>>>>>>>> one
>>>>>>>>>>> such engineer.
>>>>>>>>>>>
>>>>>>>>>>> Ben,
>>>>>>>>>>> Have any
suggestions?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> Pat
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 05/11/2017 12:06
PM, Pranith Kumar Karampuri
>>>>>>>>>>> wrote:
>>>>>>>>>>>> On Thu, May 11,
2017 at 9:32 PM, Pat Haley
>>>>>>>>>>>> <phaley at
mit.edu <mailto:phaley at mit.edu>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Pranith,
>>>>>>>>>>>>
>>>>>>>>>>>> The /home
partition is mounted as ext4
>>>>>>>>>>>> /home ext4
defaults,usrquota,grpquota 1 2
>>>>>>>>>>>>
>>>>>>>>>>>> The brick
partitions are mounted as xfs
>>>>>>>>>>>> /mnt/brick1
xfs defaults 0 0
>>>>>>>>>>>> /mnt/brick2
xfs defaults 0 0
>>>>>>>>>>>>
>>>>>>>>>>>> Will this
cause a problem with creating a
>>>>>>>>>>>> volume
>>>>>>>>>>>> under
/home?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I don't
think the bottleneck is disk. You can do
>>>>>>>>>>>> the
>>>>>>>>>>>> same tests you
did on your new volume to confirm?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Pat
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On
05/11/2017 11:32 AM, Pranith Kumar Karampuri
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> On Thu,
May 11, 2017 at 8:57 PM, Pat Haley
>>>>>>>>>>>>>
<phaley at mit.edu <mailto:phaley at mit.edu>>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi
Pranith,
>>>>>>>>>>>>>
>>>>>>>>>>>>>
Unfortunately, we don't have similar
>>>>>>>>>>>>>
hardware
>>>>>>>>>>>>> for
a small scale test. All we have is
>>>>>>>>>>>>> our
>>>>>>>>>>>>>
production hardware.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> You
said something about /home partition which
>>>>>>>>>>>>> has
>>>>>>>>>>>>> lesser
disks, we can create plain distribute
>>>>>>>>>>>>> volume
inside one of those directories. After
>>>>>>>>>>>>> we
>>>>>>>>>>>>> are
done, we can remove the setup. What do you
>>>>>>>>>>>>> say?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Pat
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On
05/11/2017 07:05 AM, Pranith Kumar
>>>>>>>>>>>>>
Karampuri wrote:
>>>>>>>>>>>>>>
On Thu, May 11, 2017 at 2:48 AM, Pat
>>>>>>>>>>>>>>
Haley
>>>>>>>>>>>>>>
<phaley at mit.edu <mailto:phaley at mit.edu>>
>>>>>>>>>>>>>>
wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
Hi Pranith,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
Since we are mounting the partitions
>>>>>>>>>>>>>>
as
>>>>>>>>>>>>>>
the bricks, I tried the dd test
>>>>>>>>>>>>>>
writing
>>>>>>>>>>>>>>
to
>>>>>>>>>>>>>>
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
>>>>>>>>>>>>>>
The results without oflag=sync were
>>>>>>>>>>>>>>
1.6
>>>>>>>>>>>>>>
Gb/s (faster than gluster but not as
>>>>>>>>>>>>>>
fast
>>>>>>>>>>>>>>
as I was expecting given the 1.2 Gb/s
>>>>>>>>>>>>>>
to
>>>>>>>>>>>>>>
the no-gluster area w/ fewer disks).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
Okay, then 1.6Gb/s is what we need to
>>>>>>>>>>>>>>
target
>>>>>>>>>>>>>>
for, considering your volume is just
>>>>>>>>>>>>>>
distribute. Is there any way you can do
>>>>>>>>>>>>>>
tests
>>>>>>>>>>>>>>
on similar hardware but at a small scale?
>>>>>>>>>>>>>>
Just so we can run the workload to learn
>>>>>>>>>>>>>>
more
>>>>>>>>>>>>>>
about the bottlenecks in the system? We
>>>>>>>>>>>>>>
can
>>>>>>>>>>>>>>
probably try to get the speed to 1.2Gb/s
>>>>>>>>>>>>>>
on
>>>>>>>>>>>>>>
your /home partition you were telling me
>>>>>>>>>>>>>>
yesterday. Let me know if that is
>>>>>>>>>>>>>>
something
>>>>>>>>>>>>>>
you are okay to do.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
Pat
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
On 05/10/2017 01:27 PM, Pranith Kumar
>>>>>>>>>>>>>>
Karampuri wrote:
>>>>>>>>>>>>>>>
On Wed, May 10, 2017 at 10:15 PM,
>>>>>>>>>>>>>>>
Pat
>>>>>>>>>>>>>>>
Haley <phaley at mit.edu
>>>>>>>>>>>>>>>
<mailto:phaley at mit.edu>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
Hi Pranith,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
Not entirely sure (this isn't my
>>>>>>>>>>>>>>>
area of expertise). I'll run
>>>>>>>>>>>>>>>
your
>>>>>>>>>>>>>>>
answer by some other people who
>>>>>>>>>>>>>>>
are
>>>>>>>>>>>>>>>
more familiar with this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
I am also uncertain about how to
>>>>>>>>>>>>>>>
interpret the results when we
>>>>>>>>>>>>>>>
also
>>>>>>>>>>>>>>>
add the dd tests writing to the
>>>>>>>>>>>>>>>
/home area (no gluster, still on
>>>>>>>>>>>>>>>
the
>>>>>>>>>>>>>>>
same machine)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   * dd test without oflag=sync (rough average of multiple tests)
>>>>>>>>>>>>>>>       o gluster w/ fuse mount: 570 Mb/s
>>>>>>>>>>>>>>>       o gluster w/ nfs mount: 390 Mb/s
>>>>>>>>>>>>>>>       o nfs (no gluster): 1.2 Gb/s
>>>>>>>>>>>>>>>   * dd test with oflag=sync (rough average of multiple tests)
>>>>>>>>>>>>>>>       o gluster w/ fuse mount: 5 Mb/s
>>>>>>>>>>>>>>>       o gluster w/ nfs mount: 200 Mb/s
>>>>>>>>>>>>>>>       o nfs (no gluster): 20 Mb/s
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
Given that the non-gluster area
>>>>>>>>>>>>>>>
is
>>>>>>>>>>>>>>>
a
>>>>>>>>>>>>>>>
RAID-6 of 4 disks while each
>>>>>>>>>>>>>>>
brick
>>>>>>>>>>>>>>>
of the gluster area is a RAID-6
>>>>>>>>>>>>>>>
of
>>>>>>>>>>>>>>>
32 disks, I would naively expect
>>>>>>>>>>>>>>>
the
>>>>>>>>>>>>>>>
writes to the gluster area to be
>>>>>>>>>>>>>>>
roughly 8x faster than to the
>>>>>>>>>>>>>>>
non-gluster.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
I think a better test is to try and
>>>>>>>>>>>>>>>
write to a file using nfs without
>>>>>>>>>>>>>>>
any
>>>>>>>>>>>>>>>
gluster to a location that is not
>>>>>>>>>>>>>>>
inside
>>>>>>>>>>>>>>>
the brick but some other location
>>>>>>>>>>>>>>>
that
>>>>>>>>>>>>>>>
is
>>>>>>>>>>>>>>>
on same disk(s). If you are mounting
>>>>>>>>>>>>>>>
the
>>>>>>>>>>>>>>>
partition as the brick, then we can
>>>>>>>>>>>>>>>
write to a file inside .glusterfs
>>>>>>>>>>>>>>>
directory, something like
>>>>>>>>>>>>>>>
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
I still think we have a speed
>>>>>>>>>>>>>>>
issue,
>>>>>>>>>>>>>>>
I can't tell if fuse vs nfs is
>>>>>>>>>>>>>>>
part
>>>>>>>>>>>>>>>
of the problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
I got interested in the post because
>>>>>>>>>>>>>>>
I
>>>>>>>>>>>>>>>
read that fuse speed is lower than
>>>>>>>>>>>>>>>
nfs
>>>>>>>>>>>>>>>
speed which is counter-intuitive to
>>>>>>>>>>>>>>>
my
>>>>>>>>>>>>>>>
understanding. So wanted
>>>>>>>>>>>>>>>
clarifications.
>>>>>>>>>>>>>>>
Now that I got my clarifications
>>>>>>>>>>>>>>>
where
>>>>>>>>>>>>>>>
fuse outperformed nfs without sync,
>>>>>>>>>>>>>>>
we
>>>>>>>>>>>>>>>
can resume testing as described
>>>>>>>>>>>>>>>
above
>>>>>>>>>>>>>>>
and try to find what it is. Based on
>>>>>>>>>>>>>>>
your email-id I am guessing you are
>>>>>>>>>>>>>>>
from
>>>>>>>>>>>>>>>
Boston and I am from Bangalore so if
>>>>>>>>>>>>>>>
you
>>>>>>>>>>>>>>>
are okay with doing this debugging
>>>>>>>>>>>>>>>
for
>>>>>>>>>>>>>>>
multiple days because of timezones,
>>>>>>>>>>>>>>>
I
>>>>>>>>>>>>>>>
will be happy to help. Please be a
>>>>>>>>>>>>>>>
bit
>>>>>>>>>>>>>>>
patient with me, I am under a
>>>>>>>>>>>>>>>
release
>>>>>>>>>>>>>>>
crunch but I am very curious with
>>>>>>>>>>>>>>>
the
>>>>>>>>>>>>>>>
problem you posted.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
Was there anything useful in the
>>>>>>>>>>>>>>>
profiles?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
Unfortunately profiles didn't help
>>>>>>>>>>>>>>>
me
>>>>>>>>>>>>>>>
much, I think we are collecting the
>>>>>>>>>>>>>>>
profiles from an active volume, so
>>>>>>>>>>>>>>>
it
>>>>>>>>>>>>>>>
has a lot of information that is not
>>>>>>>>>>>>>>>
pertaining to dd so it is difficult
>>>>>>>>>>>>>>>
to
>>>>>>>>>>>>>>>
find the contributions of dd. So I
>>>>>>>>>>>>>>>
went
>>>>>>>>>>>>>>>
through your post again and found
>>>>>>>>>>>>>>>
something I didn't pay much
>>>>>>>>>>>>>>>
attention
>>>>>>>>>>>>>>>
to
>>>>>>>>>>>>>>>
earlier i.e. oflag=sync, so did my
>>>>>>>>>>>>>>>
own
>>>>>>>>>>>>>>>
tests on my setup with FUSE so sent
>>>>>>>>>>>>>>>
that
>>>>>>>>>>>>>>>
reply.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
Pat
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
On 05/10/2017 12:15 PM, Pranith
>>>>>>>>>>>>>>>
Kumar Karampuri wrote:
>>>>>>>>>>>>>>>> Okay good. At least this validates my doubts. Handling O_SYNC in
>>>>>>>>>>>>>>>> gluster NFS and fuse is a bit different.
>>>>>>>>>>>>>>>> When an application opens a file with O_SYNC on a fuse mount, each
>>>>>>>>>>>>>>>> write syscall has to be written to disk as part of the syscall,
>>>>>>>>>>>>>>>> whereas in the case of NFS there is no concept of open. NFS performs
>>>>>>>>>>>>>>>> the write through a handle saying it needs to be a synchronous write,
>>>>>>>>>>>>>>>> so the write() syscall is performed first and then it performs
>>>>>>>>>>>>>>>> fsync(). So a write on an fd with O_SYNC becomes write+fsync.
>>>>>>>>>>>>>>>> My guess is that when multiple threads do this write+fsync()
>>>>>>>>>>>>>>>> operation on the same file, multiple writes are batched together to
>>>>>>>>>>>>>>>> be written to disk, so the throughput on the disk increases.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Does it answer your doubts?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
On Wed, May 10, 2017 at 9:35
>>>>>>>>>>>>>>>>
PM,
>>>>>>>>>>>>>>>>
Pat Haley <phaley at mit.edu
>>>>>>>>>>>>>>>>
<mailto:phaley at mit.edu>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Without the oflag=sync and only a single test of each, the FUSE
>>>>>>>>>>>>>>>> is going faster than NFS:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> FUSE:
>>>>>>>>>>>>>>>> mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
>>>>>>>>>>>>>>>> 4096+0 records in
>>>>>>>>>>>>>>>> 4096+0 records out
>>>>>>>>>>>>>>>> 4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> NFS
>>>>>>>>>>>>>>>> mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
>>>>>>>>>>>>>>>> 4096+0 records in
>>>>>>>>>>>>>>>> 4096+0 records out
>>>>>>>>>>>>>>>> 4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
On 05/10/2017 11:53 AM,
>>>>>>>>>>>>>>>>
Pranith
>>>>>>>>>>>>>>>>
Kumar Karampuri wrote:
>>>>>>>>>>>>>>>>>
Could you let me know the
>>>>>>>>>>>>>>>>>
speed without oflag=sync
>>>>>>>>>>>>>>>>>
on
>>>>>>>>>>>>>>>>>
both the mounts? No need
>>>>>>>>>>>>>>>>>
to
>>>>>>>>>>>>>>>>>
collect profiles.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
On Wed, May 10, 2017 at
>>>>>>>>>>>>>>>>>
9:17
>>>>>>>>>>>>>>>>>
PM, Pat Haley
>>>>>>>>>>>>>>>>>
<phaley at mit.edu
>>>>>>>>>>>>>>>>>
<mailto:phaley at mit.edu>>
>>>>>>>>>>>>>>>>>
wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here is what I see now:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [root at mseas-data2 ~]# gluster volume info
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Volume Name: data-volume
>>>>>>>>>>>>>>>>> Type: Distribute
>>>>>>>>>>>>>>>>> Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
>>>>>>>>>>>>>>>>> Status: Started
>>>>>>>>>>>>>>>>> Number of Bricks: 2
>>>>>>>>>>>>>>>>> Transport-type: tcp
>>>>>>>>>>>>>>>>> Bricks:
>>>>>>>>>>>>>>>>> Brick1: mseas-data2:/mnt/brick1
>>>>>>>>>>>>>>>>> Brick2: mseas-data2:/mnt/brick2
>>>>>>>>>>>>>>>>> Options Reconfigured:
>>>>>>>>>>>>>>>>> diagnostics.count-fop-hits: on
>>>>>>>>>>>>>>>>> diagnostics.latency-measurement: on
>>>>>>>>>>>>>>>>> nfs.exports-auth-enable: on
>>>>>>>>>>>>>>>>> diagnostics.brick-sys-log-level: WARNING
>>>>>>>>>>>>>>>>> performance.readdir-ahead: on
>>>>>>>>>>>>>>>>> nfs.disable: on
>>>>>>>>>>>>>>>>> nfs.export-volumes: off
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
On 05/10/2017 11:44
>>>>>>>>>>>>>>>>>
AM,
>>>>>>>>>>>>>>>>>
Pranith Kumar
>>>>>>>>>>>>>>>>>
Karampuri
>>>>>>>>>>>>>>>>>
wrote:
>>>>>>>>>>>>>>>>>> Is this the volume info you have?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> > [root at mseas-data2 ~]# gluster volume info
>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>> > Volume Name: data-volume
>>>>>>>>>>>>>>>>>> > Type: Distribute
>>>>>>>>>>>>>>>>>> > Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
>>>>>>>>>>>>>>>>>> > Status: Started
>>>>>>>>>>>>>>>>>> > Number of Bricks: 2
>>>>>>>>>>>>>>>>>> > Transport-type: tcp
>>>>>>>>>>>>>>>>>> > Bricks:
>>>>>>>>>>>>>>>>>> > Brick1: mseas-data2:/mnt/brick1
>>>>>>>>>>>>>>>>>> > Brick2: mseas-data2:/mnt/brick2
>>>>>>>>>>>>>>>>>> > Options Reconfigured:
>>>>>>>>>>>>>>>>>> > performance.readdir-ahead: on
>>>>>>>>>>>>>>>>>> > nfs.disable: on
>>>>>>>>>>>>>>>>>> > nfs.export-volumes: off
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I copied this from old thread from 2016. This is distribute volume.
>>>>>>>>>>>>>>>>>> Did you change any of the options in between?
>>>>>>>>>>>>>>>>>
--
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
>>>>>>>>>>>>>>>>>
Pat Haley
>>>>>>>>>>>>>>>>>
Email:phaley at mit.edu
>>>>>>>>>>>>>>>>>
<mailto:phaley at mit.edu>
>>>>>>>>>>>>>>>>>
Center for Ocean
>>>>>>>>>>>>>>>>>
Engineering
>>>>>>>>>>>>>>>>>
Phone: (617) 253-6824
>>>>>>>>>>>>>>>>>
Dept. of Mechanical
>>>>>>>>>>>>>>>>>
Engineering
>>>>>>>>>>>>>>>>>
Fax: (617) 253-8125
>>>>>>>>>>>>>>>>>
MIT, Room
>>>>>>>>>>>>>>>>>
5-213http://web.mit.edu/phaley/www/
>>>>>>>>>>>>>>>>>
77 Massachusetts
>>>>>>>>>>>>>>>>>
Avenue
>>>>>>>>>>>>>>>>>
Cambridge, MA
>>>>>>>>>>>>>>>>>
02139-4301
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
--
>>>>>>>>>>>>>>>>>
Pranith
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley Email: phaley at mit.edu
Center for Ocean Engineering Phone: (617) 253-6824
Dept. of Mechanical Engineering Fax: (617) 253-8125
MIT, Room 5-213 http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.gluster.org/pipermail/gluster-users/attachments/20170620/6ddc35ef/attachment.html>
Hi,

Today we experimented with some of the FUSE options that we found in the
list.

Changing these options had no effect:

gluster volume set test-volume performance.cache-max-file-size 2MB
gluster volume set test-volume performance.cache-refresh-timeout 4
gluster volume set test-volume performance.cache-size 256MB
gluster volume set test-volume performance.write-behind-window-size 4MB
gluster volume set test-volume performance.write-behind-window-size 8MB

Changing the following option from its default value made the speed slower:

gluster volume set test-volume performance.write-behind off (on by default)

Changing the following options initially appeared to give a 10% increase
in speed, but this vanished in subsequent tests (we think the apparent
increase may have been due to a lighter workload on the computer from
other users):

gluster volume set test-volume performance.stat-prefetch on
gluster volume set test-volume client.event-threads 4
gluster volume set test-volume server.event-threads 4

Can anything be gleaned from these observations? Are there other things
we can try?

Thanks

Pat

On 06/20/2017 12:06 PM, Pat Haley wrote:
>
> Hi Ben,
>
> Sorry this took so long, but we had a real-time forecasting exercise
> last week and I could only get to this now.
>
> Backend Hardware/OS:
>
> * Much of the information on our back end system is included at the
>   top of
>   http://lists.gluster.org/pipermail/gluster-users/2017-April/030529.html
> * The specific model of the hard disks is SeaGate ENTERPRISE
>   CAPACITY V.4 6TB (ST6000NM0024). The rated speed is 6Gb/s.
> * Note: there is one physical server that hosts both the NFS and the
>   GlusterFS areas
>
> Latest tests
>
> I have had time to run the tests for one of the dd tests you requested
> to the underlying XFS FS. The median rate was 170 MB/s. The dd
> results and iostat record are in
>
> http://mseas.mit.edu/download/phaley/GlusterUsers/TestXFS/
>
> I'll add tests for the other brick and to the NFS area later.
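An aside on the medians reported above: the thread quotes median rates over repeated dd trials. The following is a minimal sketch of how such a median could be computed from a list of per-trial rates. The helper name `median` is hypothetical, and the sample numbers are illustrative only, not measurements from this thread.

```shell
#!/bin/sh
# median: hypothetical helper that reads one number per line on stdin
# and prints the median (mean of the two middle values for an even count).
median() {
    sort -n | awk '{ v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            if (NR % 2) print v[(NR + 1) / 2]
            else print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}

# Illustrative per-trial rates in MB/s (made-up values).
printf '%s\n' 150 200 100 | median
```

Feeding it the MB/s figure from each dd run gives a number that is less sensitive to an occasional slow trial than the mean would be.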
>
> Thanks
>
> Pat
>
>
> On 06/12/2017 06:06 PM, Ben Turner wrote:
>> Ok you are correct, you have a pure distributed volume. IE no
>> replication overhead. So normally for pure dist I use:
>>
>> throughput = slowest of disks / NIC * .6-.7
>>
>> In your case we have:
>>
>> 1200 * .6 = 720
>>
>> So you are seeing a little less throughput than I would expect in your
>> configuration. What I like to do here is:
>>
>> -First tell me more about your back end storage, will it sustain 1200
>> MB / sec? What kind of HW? How many disks? What type and specs are
>> the disks? What kind of RAID are you using?
>>
>> -Second can you refresh me on your workload? Are you doing reads /
>> writes or both? If both what mix? Since we are using DD I assume you
>> are working with large file sequential I/O, is this correct?
>>
>> -Run some DD tests on the back end XFS FS. I normally have
>> /xfs-mount/gluster-brick, if you have something similar just mkdir on
>> the XFS -> /xfs-mount/my-test-dir. Inside the test dir run:
>>
>> If you are focusing on a write workload run:
>>
>> # dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
>>
>> If you are focusing on a read workload run:
>>
>> # echo 3 > /proc/sys/vm/drop_caches
>> # dd if=/gluster-mount/file of=/dev/null bs=1024k count=10000
>>
>> ** MAKE SURE TO DROP CACHE IN BETWEEN READS!! **
>>
>> Run this in a loop similar to how you did in:
>>
>> http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
>>
>> Run this on both servers one at a time and if you are running on a SAN
>> then run again on both at the same time. While this is running gather
>> iostat for me:
>>
>> # iostat -c -m -x 1 > iostat-$(hostname).txt
>>
>> Let's see how the back end performs on both servers while capturing
>> iostat, then see how the same workload / data looks on gluster.
>>
>> -Last thing, when you run your kernel NFS tests are you using the same
>> filesystem / storage you are using for the gluster bricks? I want to
>> be sure we have an apples to apples comparison here.
>>
>> -b
>>
>>
>>
>> ----- Original Message -----
>>> From: "Pat Haley"<phaley at mit.edu>
>>> To: "Ben Turner"<bturner at redhat.com>
>>> Sent: Monday, June 12, 2017 5:18:07 PM
>>> Subject: Re: [Gluster-users] Slow write times to gluster disk
>>>
>>>
>>> Hi Ben,
>>>
>>> Here is the output:
>>>
>>> [root at mseas-data2 ~]# gluster volume info
>>>
>>> Volume Name: data-volume
>>> Type: Distribute
>>> Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
>>> Status: Started
>>> Number of Bricks: 2
>>> Transport-type: tcp
>>> Bricks:
>>> Brick1: mseas-data2:/mnt/brick1
>>> Brick2: mseas-data2:/mnt/brick2
>>> Options Reconfigured:
>>> nfs.exports-auth-enable: on
>>> diagnostics.brick-sys-log-level: WARNING
>>> performance.readdir-ahead: on
>>> nfs.disable: on
>>> nfs.export-volumes: off
>>>
>>>
>>> On 06/12/2017 05:01 PM, Ben Turner wrote:
>>>> What is the output of gluster v info? That will tell us more about
>>>> your config.
>>>>
>>>> -b
>>>>
>>>> ----- Original Message -----
>>>>> From: "Pat Haley"<phaley at mit.edu>
>>>>> To: "Ben Turner"<bturner at redhat.com>
>>>>> Sent: Monday, June 12, 2017 4:54:00 PM
>>>>> Subject: Re: [Gluster-users] Slow write times to gluster disk
>>>>>
>>>>>
>>>>> Hi Ben,
>>>>>
>>>>> I guess I'm confused about what you mean by replication. If I look at
>>>>> the underlying bricks I only ever have a single copy of any file. It
>>>>> either resides on one brick or the other (directories exist on both
>>>>> bricks but not files). We are not using gluster for redundancy (or at
>>>>> least that wasn't our intent). Is that what you meant by replication
>>>>> or is it something else?
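The rule of thumb quoted above (slowest component, divided by the replica count, times an overhead factor) can be checked with a few lines of shell. This is only a sketch of the arithmetic as stated in the thread: the function name `estimate_throughput` and its argument order are hypothetical, and a replica count of 1 is assumed for a pure distribute volume.

```shell
#!/bin/sh
# estimate_throughput SLOWEST_MBS REPLICAS OVERHEAD
# Hypothetical helper for the thread's rule of thumb:
#   throughput = slowest of (disks, NIC) / replicas * overhead
estimate_throughput() {
    awk -v s="$1" -v r="$2" -v o="$3" \
        'BEGIN { printf "%.0f\n", s / r * o }'
}

# Pure distribute volume (one copy), 1200 MB/s slowest component, factor 0.6.
estimate_throughput 1200 1 0.6

# Replica 2 on the same hardware, before applying any overhead factor.
estimate_throughput 1200 2 1
```

The two calls reproduce the 720 and 600 MB/s figures that appear in the thread; any other inputs are the reader's own assumptions about their hardware.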
>>>>>
>>>>> Thanks
>>>>>
>>>>> Pat
>>>>>
>>>>> On 06/12/2017 04:28 PM, Ben Turner wrote:
>>>>>> ----- Original Message -----
>>>>>>> From: "Pat Haley"<phaley at mit.edu>
>>>>>>> To: "Ben Turner"<bturner at redhat.com>, "Pranith Kumar Karampuri"
>>>>>>> <pkarampu at redhat.com>
>>>>>>> Cc: "Ravishankar N"<ravishankar at redhat.com>,
>>>>>>> gluster-users at gluster.org,
>>>>>>> "Steve Postma"<SPostma at ztechnet.com>
>>>>>>> Sent: Monday, June 12, 2017 2:35:41 PM
>>>>>>> Subject: Re: [Gluster-users] Slow write times to gluster disk
>>>>>>>
>>>>>>>
>>>>>>> Hi Guys,
>>>>>>>
>>>>>>> I was wondering what our next steps should be to solve the slow
>>>>>>> write times.
>>>>>>>
>>>>>>> Recently I was debugging a large code and writing a lot of output at
>>>>>>> every time step. When I tried writing to our gluster disks, it was
>>>>>>> taking over a day to do a single time step whereas if I had the same
>>>>>>> program (same hardware, network) write to our nfs disk the time per
>>>>>>> time-step was about 45 minutes. What we are shooting for here would
>>>>>>> be to have similar times to either gluster or nfs.
>>>>>> I can see in your test:
>>>>>>
>>>>>> http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
>>>>>>
>>>>>> You averaged ~600 MB / sec (expected for replica 2 with 10G, {~1200
>>>>>> MB / sec} / #replicas{2} = 600). Gluster does client side replication
>>>>>> so with replica 2 you will only ever see 1/2 the speed of your
>>>>>> slowest part of the stack (NW, disk, RAM, CPU). This is usually NW or
>>>>>> disk and 600 is normally a best case. Now in your output I do see the
>>>>>> instances where you went down to 200 MB / sec. I can only explain
>>>>>> this in three ways:
>>>>>>
>>>>>> 1. You are not using conv=fdatasync and writes are actually going to
>>>>>> page cache and then being flushed to disk. During the fsync the
>>>>>> memory is not yet available and the disks are busy flushing dirty
>>>>>> pages.
>>>>>> 2. Your storage RAID group is shared across multiple LUNs (like in a
>>>>>> SAN) and when write times are slow the RAID group is busy servicing
>>>>>> other LUNs.
>>>>>> 3. Gluster bug / config issue / some other unknown unknown.
>>>>>>
>>>>>> So I see 2 issues here:
>>>>>>
>>>>>> 1. NFS does in 45 minutes what gluster can do in 24 hours.
>>>>>> 2. Sometimes your throughput drops dramatically.
>>>>>>
>>>>>> WRT #1 - have a look at my estimates above. My formula for
>>>>>> guesstimating gluster perf is: throughput = NIC throughput or
>>>>>> storage (whatever is slower) / # replicas * overhead (figure .7 or
>>>>>> .8). Also the larger the record size the better for glusterfs
>>>>>> mounts, I normally like to be at LEAST 64k up to 1024k:
>>>>>>
>>>>>> # dd if=/dev/zero of=/gluster-mount/file bs=1024k count=10000 conv=fdatasync
>>>>>>
>>>>>> WRT #2 - Again, I question your testing and your storage config. Try
>>>>>> using conv=fdatasync for your DDs, use a larger record size, and make
>>>>>> sure that your back end storage is not causing your slowdowns. Also
>>>>>> remember that with replica 2 you will take ~50% hit on writes because
>>>>>> the client uses 50% of its bandwidth to write to one replica and 50%
>>>>>> to the other.
>>>>>>
>>>>>> -b
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Pat
>>>>>>>
>>>>>>>
>>>>>>> On 06/02/2017 01:07 AM, Ben Turner wrote:
>>>>>>>> Are you sure using conv=sync is what you want? I normally use
>>>>>>>> conv=fdatasync, I'll look up the difference between the two and
>>>>>>>> see if it affects your test.
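On the conv=sync versus conv=fdatasync question raised above, a small demonstration may help. This sketch assumes GNU coreutils dd (conv=fdatasync and oflag=sync are GNU extensions) and uses a throwaway temporary directory; the file names are made up. In GNU dd, conv=sync only pads short input blocks with NULs and does not force data to disk, conv=fdatasync flushes the output file once before dd exits, and oflag=sync opens the output with O_SYNC so each write waits for the disk.

```shell
#!/bin/sh
# Assumes GNU coreutils dd; file names below are illustrative only.
workdir=$(mktemp -d)

# conv=sync pads each short input block with NULs up to the block size.
# It does not flush anything: 3 input bytes become one 8-byte block.
padded_bytes=$(printf 'abc' | dd bs=8 conv=sync 2>/dev/null | wc -c)
padded_bytes=$((padded_bytes))   # normalise any whitespace from wc

# conv=fdatasync: buffered writes, then a single fdatasync() before dd
# exits, so the reported rate includes flushing the page cache.
dd if=/dev/zero of="$workdir/fdatasync.bin" bs=1024k count=4 conv=fdatasync 2>/dev/null

# oflag=sync: output opened with O_SYNC, every write is synchronous.
dd if=/dev/zero of="$workdir/osync.bin" bs=1024k count=4 oflag=sync 2>/dev/null

size_fdatasync=$(($(wc -c < "$workdir/fdatasync.bin")))
size_osync=$(($(wc -c < "$workdir/osync.bin")))

echo "conv=sync padded 3 bytes to $padded_bytes bytes"
echo "fdatasync file: $size_fdatasync bytes, O_SYNC file: $size_osync bytes"
rm -rf "$workdir"
```

Both output files end up the same size; the flags differ only in when the data is forced to stable storage, which is why they can produce very different timings in a benchmark.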
>>>>>>>>
>>>>>>>>
>>>>>>>> -b
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Pat Haley"<phaley at mit.edu>
>>>>>>>>> To: "Pranith Kumar Karampuri"<pkarampu at redhat.com>
>>>>>>>>> Cc: "Ravishankar N"<ravishankar at redhat.com>,
>>>>>>>>> gluster-users at gluster.org,
>>>>>>>>> "Steve Postma"<SPostma at ztechnet.com>, "Ben
>>>>>>>>> Turner"<bturner at redhat.com>
>>>>>>>>> Sent: Tuesday, May 30, 2017 9:40:34 PM
>>>>>>>>> Subject: Re: [Gluster-users] Slow write times to gluster disk
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Pranith,
>>>>>>>>>
>>>>>>>>> The "dd" command was:
>>>>>>>>>
>>>>>>>>> dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
>>>>>>>>>
>>>>>>>>> There were 2 instances where dd reported 22 seconds. The output
>>>>>>>>> from the dd tests are in
>>>>>>>>>
>>>>>>>>> http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
>>>>>>>>>
>>>>>>>>> Pat
>>>>>>>>>
>>>>>>>>> On 05/30/2017 09:27 PM, Pranith Kumar Karampuri wrote:
>>>>>>>>>> Pat,
>>>>>>>>>> What is the command you used? As per the following output, it
>>>>>>>>>> seems like at least one write operation took 16 seconds. Which is
>>>>>>>>>> really bad.
>>>>>>>>>> 96.39 1165.10 us 89.00 us *16487014.00 us* 393212 WRITE
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, May 30, 2017 at 10:36 PM, Pat Haley <phaley at mit.edu>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>     Hi Pranith,
>>>>>>>>>>
>>>>>>>>>>     I ran the same 'dd' test both in the gluster test volume and
>>>>>>>>>>     in the .glusterfs directory of each brick. The median results
>>>>>>>>>>     (12 dd trials in each test) are similar to before
>>>>>>>>>>
>>>>>>>>>>       * gluster test volume: 586.5 MB/s
>>>>>>>>>>       * bricks (in .glusterfs): 1.4 GB/s
>>>>>>>>>>
>>>>>>>>>>     The profile for the gluster test-volume is in
>>>>>>>>>>
>>>>>>>>>>     http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
>>>>>>>>>>
>>>>>>>>>>     Thanks
>>>>>>>>>>
>>>>>>>>>>     Pat
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 05/30/2017 12:10 PM, Pranith Kumar Karampuri wrote:
>>>>>>>>>>> Let's start with the same 'dd' test we were testing with to
>>>>>>>>>>> see what the numbers are. Please provide profile numbers for
>>>>>>>>>>> the same. From there on we will start tuning the volume to see
>>>>>>>>>>> what we can do.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 30, 2017 at 9:16 PM, Pat Haley <phaley at mit.edu>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>     Hi Pranith,
>>>>>>>>>>>
>>>>>>>>>>>     Thanks for the tip. We now have the gluster volume mounted
>>>>>>>>>>>     under /home. What tests do you recommend we run?
>>>>>>>>>>>
>>>>>>>>>>>     Thanks
>>>>>>>>>>>
>>>>>>>>>>>     Pat
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 05/17/2017 05:01 AM, Pranith Kumar Karampuri wrote:
>>>>>>>>>>>> On Tue, May 16, 2017 at 9:20 PM, Pat Haley <phaley at mit.edu>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>     Hi Pranith,
>>>>>>>>>>>>
>>>>>>>>>>>>     Sorry for the delay. I never received your reply
>>>>>>>>>>>>     (but I did receive Ben Turner's follow-up to your
>>>>>>>>>>>>     reply). So we tried to create a gluster volume under
>>>>>>>>>>>>     /home using different variations of
>>>>>>>>>>>>
>>>>>>>>>>>>     gluster volume create test-volume
>>>>>>>>>>>>     mseas-data2:/home/gbrick_test_1
>>>>>>>>>>>>     mseas-data2:/home/gbrick_test_2 transport tcp
>>>>>>>>>>>>
>>>>>>>>>>>>     However we keep getting errors of the form
>>>>>>>>>>>>
>>>>>>>>>>>>     Wrong brick type: transport, use
>>>>>>>>>>>>     <HOSTNAME>:<export-dir-abs-path>
>>>>>>>>>>>>
>>>>>>>>>>>>     Any thoughts on what we're doing wrong?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> You should give transport tcp at the beginning I think.
>>>>>>>>>>>> Anyways, transport tcp is the default, so no need to specify
>>>>>>>>>>>> so remove those two words from the CLI.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>     Also do you have a list of the test we should be running
>>>>>>>>>>>>     once we get this volume created? Given the time-zone
>>>>>>>>>>>>     difference it might help if we can run a small battery
>>>>>>>>>>>>     of tests and post the results rather than test-post-new
>>>>>>>>>>>>     test-post... .
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> This is the first time I am doing performance analysis on
>>>>>>>>>>>> users as far as I remember. In our team there are separate
>>>>>>>>>>>> engineers who do these tests. Ben who replied earlier is one
>>>>>>>>>>>> such engineer.
>>>>>>>>>>>>
>>>>>>>>>>>> Ben,
>>>>>>>>>>>> Have any suggestions?
>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Thanks >>>>>>>>>>>> >>>>>>>>>>>> Pat >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On 05/11/2017 12:06 PM, Pranith Kumar Karampuri >>>>>>>>>>>> wrote: >>>>>>>>>>>>> On Thu, May 11, 2017 at 9:32 PM, Pat Haley >>>>>>>>>>>>> <phaley at mit.edu <mailto:phaley at mit.edu>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Pranith, >>>>>>>>>>>>> >>>>>>>>>>>>> The /home partition is mounted as ext4 >>>>>>>>>>>>> /home ext4 defaults,usrquota,grpquota 1 2 >>>>>>>>>>>>> >>>>>>>>>>>>> The brick partitions are mounted ax xfs >>>>>>>>>>>>> /mnt/brick1 xfs defaults 0 0 >>>>>>>>>>>>> /mnt/brick2 xfs defaults 0 0 >>>>>>>>>>>>> >>>>>>>>>>>>> Will this cause a problem with creating a >>>>>>>>>>>>> volume >>>>>>>>>>>>> under /home? >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I don't think the bottleneck is disk. You can do >>>>>>>>>>>>> the >>>>>>>>>>>>> same tests you did on your new volume to confirm? >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Pat >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On 05/11/2017 11:32 AM, Pranith Kumar Karampuri >>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> On Thu, May 11, 2017 at 8:57 PM, Pat Haley >>>>>>>>>>>>>> <phaley at mit.edu <mailto:phaley at mit.edu>> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Pranith, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Unfortunately, we don't have similar >>>>>>>>>>>>>> hardware >>>>>>>>>>>>>> for a small scale test. All we have is >>>>>>>>>>>>>> our >>>>>>>>>>>>>> production hardware. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> You said something about /home partition which >>>>>>>>>>>>>> has >>>>>>>>>>>>>> lesser disks, we can create plain distribute >>>>>>>>>>>>>> volume inside one of those directories. After >>>>>>>>>>>>>> we >>>>>>>>>>>>>> are done, we can remove the setup. What do you >>>>>>>>>>>>>> say? 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Pat >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 05/11/2017 07:05 AM, Pranith Kumar >>>>>>>>>>>>>> Karampuri wrote: >>>>>>>>>>>>>>> On Thu, May 11, 2017 at 2:48 AM, Pat >>>>>>>>>>>>>>> Haley >>>>>>>>>>>>>>> <phaley at mit.edu <mailto:phaley at mit.edu>> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Pranith, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Since we are mounting the partitions >>>>>>>>>>>>>>> as >>>>>>>>>>>>>>> the bricks, I tried the dd test >>>>>>>>>>>>>>> writing >>>>>>>>>>>>>>> to >>>>>>>>>>>>>>> <brick-path>/.glusterfs/<file-to-be-removed-after-test>. >>>>>>>>>>>>>>> The results without oflag=sync were >>>>>>>>>>>>>>> 1.6 >>>>>>>>>>>>>>> Gb/s (faster than gluster but not as >>>>>>>>>>>>>>> fast >>>>>>>>>>>>>>> as I was expecting given the 1.2 Gb/s >>>>>>>>>>>>>>> to >>>>>>>>>>>>>>> the no-gluster area w/ fewer disks). >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Okay, then 1.6Gb/s is what we need to >>>>>>>>>>>>>>> target >>>>>>>>>>>>>>> for, considering your volume is just >>>>>>>>>>>>>>> distribute. Is there any way you can do >>>>>>>>>>>>>>> tests >>>>>>>>>>>>>>> on similar hardware but at a small scale? >>>>>>>>>>>>>>> Just so we can run the workload to learn >>>>>>>>>>>>>>> more >>>>>>>>>>>>>>> about the bottlenecks in the system? We >>>>>>>>>>>>>>> can >>>>>>>>>>>>>>> probably try to get the speed to 1.2Gb/s >>>>>>>>>>>>>>> on >>>>>>>>>>>>>>> your /home partition you were telling me >>>>>>>>>>>>>>> yesterday. Let me know if that is >>>>>>>>>>>>>>> something >>>>>>>>>>>>>>> you are okay to do. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Pat >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On 05/10/2017 01:27 PM, Pranith Kumar >>>>>>>>>>>>>>> Karampuri wrote: >>>>>>>>>>>>>>>> On Wed, May 10, 2017 at 10:15 PM, >>>>>>>>>>>>>>>> Pat >>>>>>>>>>>>>>>> Haley <phaley at mit.edu >>>>>>>>>>>>>>>> <mailto:phaley at mit.edu>> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Pranith, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Not entirely sure (this isn't my >>>>>>>>>>>>>>>> area of expertise). I'll run >>>>>>>>>>>>>>>> your >>>>>>>>>>>>>>>> answer by some other people who >>>>>>>>>>>>>>>> are >>>>>>>>>>>>>>>> more familiar with this. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I am also uncertain about how to >>>>>>>>>>>>>>>> interpret the results when we >>>>>>>>>>>>>>>> also >>>>>>>>>>>>>>>> add the dd tests writing to the >>>>>>>>>>>>>>>> /home area (no gluster, still on >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> same machine) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> * dd test without oflag=sync >>>>>>>>>>>>>>>> (rough average of multiple >>>>>>>>>>>>>>>> tests) >>>>>>>>>>>>>>>> o gluster w/ fuse mount : >>>>>>>>>>>>>>>> 570 >>>>>>>>>>>>>>>> Mb/s >>>>>>>>>>>>>>>> o gluster w/ nfs mount: >>>>>>>>>>>>>>>> 390 >>>>>>>>>>>>>>>> Mb/s >>>>>>>>>>>>>>>> o nfs (no gluster): 1.2 >>>>>>>>>>>>>>>> Gb/s >>>>>>>>>>>>>>>> * dd test with oflag=sync >>>>>>>>>>>>>>>> (rough >>>>>>>>>>>>>>>> average of multiple tests) >>>>>>>>>>>>>>>> o gluster w/ fuse mount: >>>>>>>>>>>>>>>> 5 >>>>>>>>>>>>>>>> Mb/s >>>>>>>>>>>>>>>> o gluster w/ nfs mount: >>>>>>>>>>>>>>>> 200 >>>>>>>>>>>>>>>> Mb/s >>>>>>>>>>>>>>>> o nfs (no gluster): 20 >>>>>>>>>>>>>>>> Mb/s >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Given that the non-gluster area >>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>> a >>>>>>>>>>>>>>>> RAID-6 of 4 disks while each >>>>>>>>>>>>>>>> brick >>>>>>>>>>>>>>>> of the gluster area is a RAID-6 >>>>>>>>>>>>>>>> of >>>>>>>>>>>>>>>> 32 disks, I would naively expect >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> writes to the 
gluster area to be >>>>>>>>>>>>>>>> roughly 8x faster than to the >>>>>>>>>>>>>>>> non-gluster. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I think a better test is to try and >>>>>>>>>>>>>>>> write to a file using nfs without >>>>>>>>>>>>>>>> any >>>>>>>>>>>>>>>> gluster to a location that is not >>>>>>>>>>>>>>>> inside >>>>>>>>>>>>>>>> the brick but someother location >>>>>>>>>>>>>>>> that >>>>>>>>>>>>>>>> is >>>>>>>>>>>>>>>> on same disk(s). If you are mounting >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> partition as the brick, then we can >>>>>>>>>>>>>>>> write to a file inside .glusterfs >>>>>>>>>>>>>>>> directory, something like >>>>>>>>>>>>>>>> <brick-path>/.glusterfs/<file-to-be-removed-after-test>. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I still think we have a speed >>>>>>>>>>>>>>>> issue, >>>>>>>>>>>>>>>> I can't tell if fuse vs nfs is >>>>>>>>>>>>>>>> part >>>>>>>>>>>>>>>> of the problem. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I got interested in the post because >>>>>>>>>>>>>>>> I >>>>>>>>>>>>>>>> read that fuse speed is lesser than >>>>>>>>>>>>>>>> nfs >>>>>>>>>>>>>>>> speed which is counter-intuitive to >>>>>>>>>>>>>>>> my >>>>>>>>>>>>>>>> understanding. So wanted >>>>>>>>>>>>>>>> clarifications. >>>>>>>>>>>>>>>> Now that I got my clarifications >>>>>>>>>>>>>>>> where >>>>>>>>>>>>>>>> fuse outperformed nfs without sync, >>>>>>>>>>>>>>>> we >>>>>>>>>>>>>>>> can resume testing as described >>>>>>>>>>>>>>>> above >>>>>>>>>>>>>>>> and try to find what it is. Based on >>>>>>>>>>>>>>>> your email-id I am guessing you are >>>>>>>>>>>>>>>> from >>>>>>>>>>>>>>>> Boston and I am from Bangalore so if >>>>>>>>>>>>>>>> you >>>>>>>>>>>>>>>> are okay with doing this debugging >>>>>>>>>>>>>>>> for >>>>>>>>>>>>>>>> multiple days because of timezones, >>>>>>>>>>>>>>>> I >>>>>>>>>>>>>>>> will be happy to help. 
Please be a >>>>>>>>>>>>>>>> bit >>>>>>>>>>>>>>>> patient with me, I am under a >>>>>>>>>>>>>>>> release >>>>>>>>>>>>>>>> crunch but I am very curious with >>>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> problem you posted. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Was there anything useful in the >>>>>>>>>>>>>>>> profiles? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Unfortunately profiles didn't help >>>>>>>>>>>>>>>> me >>>>>>>>>>>>>>>> much, I think we are collecting the >>>>>>>>>>>>>>>> profiles from an active volume, so >>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>> has a lot of information that is not >>>>>>>>>>>>>>>> pertaining to dd so it is difficult >>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>> find the contributions of dd. So I >>>>>>>>>>>>>>>> went >>>>>>>>>>>>>>>> through your post again and found >>>>>>>>>>>>>>>> something I didn't pay much >>>>>>>>>>>>>>>> attention >>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>> earlier i.e. oflag=sync, so did my >>>>>>>>>>>>>>>> own >>>>>>>>>>>>>>>> tests on my setup with FUSE so sent >>>>>>>>>>>>>>>> that >>>>>>>>>>>>>>>> reply. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Pat >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 05/10/2017 12:15 PM, Pranith >>>>>>>>>>>>>>>> Kumar Karampuri wrote: >>>>>>>>>>>>>>>>> Okay good. At least this >>>>>>>>>>>>>>>>> validates >>>>>>>>>>>>>>>>> my doubts. Handling O_SYNC in >>>>>>>>>>>>>>>>> gluster NFS and fuse is a bit >>>>>>>>>>>>>>>>> different. >>>>>>>>>>>>>>>>> When application opens a file >>>>>>>>>>>>>>>>> with >>>>>>>>>>>>>>>>> O_SYNC on fuse mount then each >>>>>>>>>>>>>>>>> write syscall has to be written >>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> disk as part of the syscall >>>>>>>>>>>>>>>>> where >>>>>>>>>>>>>>>>> as in case of NFS, there is no >>>>>>>>>>>>>>>>> concept of open. 
NFS performs >>>>>>>>>>>>>>>>> write >>>>>>>>>>>>>>>>> though a handle saying it needs >>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>> be a synchronous write, so >>>>>>>>>>>>>>>>> write() >>>>>>>>>>>>>>>>> syscall is performed first then >>>>>>>>>>>>>>>>> it >>>>>>>>>>>>>>>>> performs fsync(). so an write >>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>> an >>>>>>>>>>>>>>>>> fd with O_SYNC becomes >>>>>>>>>>>>>>>>> write+fsync. >>>>>>>>>>>>>>>>> I am suspecting that when >>>>>>>>>>>>>>>>> multiple >>>>>>>>>>>>>>>>> threads do this write+fsync() >>>>>>>>>>>>>>>>> operation on the same file, >>>>>>>>>>>>>>>>> multiple writes are batched >>>>>>>>>>>>>>>>> together to be written do disk >>>>>>>>>>>>>>>>> so >>>>>>>>>>>>>>>>> the throughput on the disk is >>>>>>>>>>>>>>>>> increasing is my guess. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Does it answer your doubts? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Wed, May 10, 2017 at 9:35 >>>>>>>>>>>>>>>>> PM, >>>>>>>>>>>>>>>>> Pat Haley <phaley at mit.edu >>>>>>>>>>>>>>>>> <mailto:phaley at mit.edu>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Without the oflag=sync and >>>>>>>>>>>>>>>>> only >>>>>>>>>>>>>>>>> a single test of each, the >>>>>>>>>>>>>>>>> FUSE >>>>>>>>>>>>>>>>> is going faster than NFS: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> FUSE: >>>>>>>>>>>>>>>>> mseas-data2(dri_nascar)% dd >>>>>>>>>>>>>>>>> if=/dev/zero count=4096 >>>>>>>>>>>>>>>>> bs=1048576 of=zeros.txt >>>>>>>>>>>>>>>>> conv=sync >>>>>>>>>>>>>>>>> 4096+0 records in >>>>>>>>>>>>>>>>> 4096+0 records out >>>>>>>>>>>>>>>>> 4294967296 bytes (4.3 GB) >>>>>>>>>>>>>>>>> copied, 7.46961 s, 575 MB/s >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> NFS >>>>>>>>>>>>>>>>> mseas-data2(HYCOM)% dd >>>>>>>>>>>>>>>>> if=/dev/zero count=4096 >>>>>>>>>>>>>>>>> bs=1048576 of=zeros.txt >>>>>>>>>>>>>>>>> conv=sync >>>>>>>>>>>>>>>>> 4096+0 records in >>>>>>>>>>>>>>>>> 4096+0 records out >>>>>>>>>>>>>>>>> 4294967296 bytes (4.3 GB) >>>>>>>>>>>>>>>>> copied, 11.4264 s, 376 MB/s 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On 05/10/2017 11:53 AM, >>>>>>>>>>>>>>>>> Pranith >>>>>>>>>>>>>>>>> Kumar Karampuri wrote: >>>>>>>>>>>>>>>>>> Could you let me know the >>>>>>>>>>>>>>>>>> speed without oflag=sync >>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>> both the mounts? No need >>>>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>>>>> collect profiles. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Wed, May 10, 2017 at >>>>>>>>>>>>>>>>>> 9:17 >>>>>>>>>>>>>>>>>> PM, Pat Haley >>>>>>>>>>>>>>>>>> <phaley at mit.edu >>>>>>>>>>>>>>>>>> <mailto:phaley at mit.edu>> >>>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Here is what I see >>>>>>>>>>>>>>>>>> now: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> [root at mseas-data2 ~]# >>>>>>>>>>>>>>>>>> gluster volume info >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Volume Name: >>>>>>>>>>>>>>>>>> data-volume >>>>>>>>>>>>>>>>>> Type: Distribute >>>>>>>>>>>>>>>>>> Volume ID: >>>>>>>>>>>>>>>>>> c162161e-2a2d-4dac-b015-f31fd89ceb18 >>>>>>>>>>>>>>>>>> Status: Started >>>>>>>>>>>>>>>>>> Number of Bricks: 2 >>>>>>>>>>>>>>>>>> Transport-type: tcp >>>>>>>>>>>>>>>>>> Bricks: >>>>>>>>>>>>>>>>>> Brick1: >>>>>>>>>>>>>>>>>> mseas-data2:/mnt/brick1 >>>>>>>>>>>>>>>>>> Brick2: >>>>>>>>>>>>>>>>>> mseas-data2:/mnt/brick2 >>>>>>>>>>>>>>>>>> Options Reconfigured: >>>>>>>>>>>>>>>>>> diagnostics.count-fop-hits: >>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>> diagnostics.latency-measurement: >>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>> nfs.exports-auth-enable: >>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>> diagnostics.brick-sys-log-level: >>>>>>>>>>>>>>>>>> WARNING >>>>>>>>>>>>>>>>>> performance.readdir-ahead: >>>>>>>>>>>>>>>>>> on >>>>>>>>>>>>>>>>>> nfs.disable: on >>>>>>>>>>>>>>>>>> nfs.export-volumes: >>>>>>>>>>>>>>>>>> off >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On 05/10/2017 11:44 >>>>>>>>>>>>>>>>>> AM, >>>>>>>>>>>>>>>>>> Pranith Kumar >>>>>>>>>>>>>>>>>> Karampuri >>>>>>>>>>>>>>>>>> wrote: 
>>>>>>>>>>>>>>>>>>> Is this the volume info you have?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> > [root at mseas-data2 ~]# gluster volume info
>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>> > Volume Name: data-volume
>>>>>>>>>>>>>>>>>>> > Type: Distribute
>>>>>>>>>>>>>>>>>>> > Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
>>>>>>>>>>>>>>>>>>> > Status: Started
>>>>>>>>>>>>>>>>>>> > Number of Bricks: 2
>>>>>>>>>>>>>>>>>>> > Transport-type: tcp
>>>>>>>>>>>>>>>>>>> > Bricks:
>>>>>>>>>>>>>>>>>>> > Brick1: mseas-data2:/mnt/brick1
>>>>>>>>>>>>>>>>>>> > Brick2: mseas-data2:/mnt/brick2
>>>>>>>>>>>>>>>>>>> > Options Reconfigured:
>>>>>>>>>>>>>>>>>>> > performance.readdir-ahead: on
>>>>>>>>>>>>>>>>>>> > nfs.disable: on
>>>>>>>>>>>>>>>>>>> > nfs.export-volumes: off
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I copied this from old thread from 2016. This is
>>>>>>>>>>>>>>>>>>> distribute volume. Did you change any of the
>>>>>>>>>>>>>>>>>>> options in between?
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-users

--

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley                          Email:  phaley at mit.edu
Center for Ocean Engineering       Phone:  (617) 253-6824
Dept. of Mechanical Engineering    Fax:    (617) 253-8125
MIT, Room 5-213                    http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA  02139-4301
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20170622/7bc75870/attachment.html>
Pranith Kumar Karampuri
2017-Jun-23 03:40 UTC
[Gluster-users] Slow write times to gluster disk
On Fri, Jun 23, 2017 at 2:23 AM, Pat Haley <phaley at mit.edu> wrote:
>
> Hi,
>
> Today we experimented with some of the FUSE options that we found in the
> list.
>
> Changing these options had no effect:
>
> gluster volume set test-volume performance.cache-max-file-size 2MB
> gluster volume set test-volume performance.cache-refresh-timeout 4
> gluster volume set test-volume performance.cache-size 256MB
> gluster volume set test-volume performance.write-behind-window-size 4MB
> gluster volume set test-volume performance.write-behind-window-size 8MB
>

This is a good coincidence: I am meeting with the write-behind maintainer
(+Raghavendra G) today about the same question. I think we will have
something by EOD IST. I will update you.

> Changing the following option from its default value made the speed slower:
>
> gluster volume set test-volume performance.write-behind off (on by default)
>
> Changing the following options initially appeared to give a 10% increase
> in speed, but this vanished in subsequent tests (we think the apparent
> increase may have been due to a lighter workload on the computer from
> other users):
>
> gluster volume set test-volume performance.stat-prefetch on
> gluster volume set test-volume client.event-threads 4
> gluster volume set test-volume server.event-threads 4
>
> Can anything be gleaned from these observations? Are there other things
> we can try?
>
> Thanks
>
> Pat
>
>
> On 06/20/2017 12:06 PM, Pat Haley wrote:
>
> Hi Ben,
>
> Sorry this took so long, but we had a real-time forecasting exercise last
> week and I could only get to this now.
>
> Backend Hardware/OS:
>
> - Much of the information on our back end system is included at the
>   top of
>   http://lists.gluster.org/pipermail/gluster-users/2017-April/030529.html
> - The specific model of the hard disks is SeaGate ENTERPRISE CAPACITY
>   V.4 6TB (ST6000NM0024). The rated speed is 6Gb/s.
> - Note: there is one physical server that hosts both the NFS and the
>   GlusterFS areas
>
> Latest tests
>
> I have had time to run one of the dd tests you requested on the
> underlying XFS FS. The median rate was 170 MB/s. The dd results and
> iostat record are in
>
> http://mseas.mit.edu/download/phaley/GlusterUsers/TestXFS/
>
> I'll add tests for the other brick and to the NFS area later.
>
> Thanks
>
> Pat
>
>
> On 06/12/2017 06:06 PM, Ben Turner wrote:
>
> Ok you are correct, you have a pure distributed volume, i.e. no
> replication overhead. So normally for pure dist I use:
>
> throughput = slowest of (disks, NIC) * .6-.7
>
> In your case we have:
>
> 1200 * .6 = 720
>
> So you are seeing a little less throughput than I would expect in your
> configuration. What I like to do here is:
>
> - First, tell me more about your back end storage: will it sustain
>   1200 MB / sec? What kind of HW? How many disks? What type and specs
>   are the disks? What kind of RAID are you using?
>
> - Second, can you refresh me on your workload? Are you doing reads /
>   writes or both? If both, what mix? Since we are using DD I assume you
>   are working with large file sequential I/O, is this correct?
>
> - Run some DD tests on the back end XFS FS. I normally have
>   /xfs-mount/gluster-brick; if you have something similar just mkdir on
>   the XFS -> /xfs-mount/my-test-dir. Inside the test dir run:
>
> If you are focusing on a write workload run:
>
> # dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync
>
> If you are focusing on a read workload run:
>
> # echo 3 > /proc/sys/vm/drop_caches
> # dd if=/gluster-mount/file of=/dev/null bs=1024k count=10000
>
> ** MAKE SURE TO DROP CACHE IN BETWEEN READS!! **
>
> Run this in a loop similar to how you did in:
> http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
>
> Run this on both servers one at a time and if you are running on a SAN
> then run again on both at the same time.
> While this is running gather iostat for me:
>
> # iostat -c -m -x 1 > iostat-$(hostname).txt
>
> Let's see how the back end performs on both servers while capturing
> iostat, then see how the same workload / data looks on gluster.
>
> - Last thing: when you run your kernel NFS tests, are you using the same
>   filesystem / storage you are using for the gluster bricks? I want to
>   be sure we have an apples to apples comparison here.
>
> -b
>
>
> ----- Original Message -----
>
> From: "Pat Haley" <phaley at mit.edu>
> To: "Ben Turner" <bturner at redhat.com>
> Sent: Monday, June 12, 2017 5:18:07 PM
> Subject: Re: [Gluster-users] Slow write times to gluster disk
>
>
> Hi Ben,
>
> Here is the output:
>
> [root at mseas-data2 ~]# gluster volume info
>
> Volume Name: data-volume
> Type: Distribute
> Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
> Status: Started
> Number of Bricks: 2
> Transport-type: tcp
> Bricks:
> Brick1: mseas-data2:/mnt/brick1
> Brick2: mseas-data2:/mnt/brick2
> Options Reconfigured:
> nfs.exports-auth-enable: on
> diagnostics.brick-sys-log-level: WARNING
> performance.readdir-ahead: on
> nfs.disable: on
> nfs.export-volumes: off
>
>
> On 06/12/2017 05:01 PM, Ben Turner wrote:
>
> What is the output of gluster v info? That will tell us more about your
> config.
>
> -b
>
> ----- Original Message -----
>
> From: "Pat Haley" <phaley at mit.edu>
> To: "Ben Turner" <bturner at redhat.com>
> Sent: Monday, June 12, 2017 4:54:00 PM
> Subject: Re: [Gluster-users] Slow write times to gluster disk
>
>
> Hi Ben,
>
> I guess I'm confused about what you mean by replication. If I look at
> the underlying bricks I only ever have a single copy of any file. It
> either resides on one brick or the other (directories exist on both
> bricks but not files). We are not using gluster for redundancy (or at
> least that wasn't our intent).
> Is that what you meant by replication or is it something else?
>
> Thanks
>
> Pat
>
> On 06/12/2017 04:28 PM, Ben Turner wrote:
>
> ----- Original Message -----
>
> From: "Pat Haley" <phaley at mit.edu>
> To: "Ben Turner" <bturner at redhat.com>, "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> Cc: "Ravishankar N" <ravishankar at redhat.com>, gluster-users at gluster.org,
>     "Steve Postma" <SPostma at ztechnet.com>
> Sent: Monday, June 12, 2017 2:35:41 PM
> Subject: Re: [Gluster-users] Slow write times to gluster disk
>
>
> Hi Guys,
>
> I was wondering what our next steps should be to solve the slow write
> times.
>
> Recently I was debugging a large code and writing a lot of output at
> every time step. When I tried writing to our gluster disks, it was
> taking over a day to do a single time step whereas if I had the same
> program (same hardware, network) write to our nfs disk the time per
> time-step was about 45 minutes. What we are shooting for here would be
> to have similar times to either gluster or nfs.
>
> I can see in your test:
> http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
>
> You averaged ~600 MB / sec (expected for replica 2 with 10G: {~1200 MB /
> sec} / #replicas{2} = 600). Gluster does client side replication so with
> replica 2 you will only ever see 1/2 the speed of the slowest part of
> the stack (NW, disk, RAM, CPU). This is usually NW or disk and 600 is
> normally a best case. Now in your output I do see the instances where
> you went down to 200 MB / sec. I can only explain this in three ways:
>
> 1. You are not using conv=fdatasync and writes are actually going to
>    page cache and then being flushed to disk. During the fsync the memory
>    is not yet available and the disks are busy flushing dirty pages.
> 2. Your storage RAID group is shared across multiple LUNs (like in a SAN)
>    and when write times are slow the RAID group is busy servicing other
>    LUNs.
> 3. Gluster bug / config issue / some other unknown unknown.
>
> So I see 2 issues here:
>
> 1. NFS does in 45 minutes what gluster can do in 24 hours.
> 2. Sometimes your throughput drops dramatically.
>
> WRT #1 - have a look at my estimates above. My formula for guesstimating
> gluster perf is: throughput = NIC throughput or storage (whichever is
> slower) / # replicas * overhead (figure .7 or .8). Also the larger the
> record size the better for glusterfs mounts, I normally like to be at
> LEAST 64k up to 1024k:
>
> # dd if=/dev/zero of=/gluster-mount/file bs=1024k count=10000 conv=fdatasync
>
> WRT #2 - Again, I question your testing and your storage config. Try
> using conv=fdatasync for your DDs, use a larger record size, and make
> sure that your back end storage is not causing your slowdowns. Also
> remember that with replica 2 you will take ~50% hit on writes because
> the client uses 50% of its bandwidth to write to one replica and 50% to
> the other.
>
> -b
>
>
> Thanks
>
> Pat
>
>
> On 06/02/2017 01:07 AM, Ben Turner wrote:
>
> Are you sure using conv=sync is what you want? I normally use
> conv=fdatasync, I'll look up the difference between the two and see if
> it affects your test.
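On the conv=sync question: in GNU dd, conv=sync does not request a flush. It pads every short input block with NULs up to the block size, and the "4096+0 records in" lines in the dd output quoted in this thread show that no short blocks occurred, so the option changed nothing in those runs. conv=fdatasync leaves the data unchanged and calls fdatasync() once before dd exits. A minimal sketch of the difference, assuming GNU coreutils dd and a writable temporary directory (the file names are made up for illustration):

```shell
#!/bin/sh
# Sketch only: contrasts dd's conv=sync (padding) with conv=fdatasync (flush).
tmpdir=$(mktemp -d)

# conv=sync: the 3-byte read is a short block, so dd pads it with NULs
# up to the 8-byte block size before writing it out.
printf 'abc' | dd of="$tmpdir/padded" bs=8 conv=sync 2>/dev/null

# conv=fdatasync: the 3 bytes are written unchanged, then fdatasync() is
# called once before dd exits, so the elapsed time includes the flush.
printf 'abc' | dd of="$tmpdir/flushed" bs=8 conv=fdatasync 2>/dev/null

padded_size=$(wc -c < "$tmpdir/padded" | tr -d ' ')
flushed_size=$(wc -c < "$tmpdir/flushed" | tr -d ' ')
echo "conv=sync wrote $padded_size bytes; conv=fdatasync wrote $flushed_size bytes"

rm -rf "$tmpdir"
```

With if=/dev/zero and full-size blocks, neither option changes the data, so only conv=fdatasync (or conv=fsync) makes the reported rate include the time to reach disk.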
>
> -b
>
> ----- Original Message -----
>
> From: "Pat Haley" <phaley at mit.edu>
> To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>
> Cc: "Ravishankar N" <ravishankar at redhat.com>, gluster-users at gluster.org,
>     "Steve Postma" <SPostma at ztechnet.com>, "Ben Turner" <bturner at redhat.com>
> Sent: Tuesday, May 30, 2017 9:40:34 PM
> Subject: Re: [Gluster-users] Slow write times to gluster disk
>
>
> Hi Pranith,
>
> The "dd" command was:
>
> dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
>
> There were 2 instances where dd reported 22 seconds. The output from
> the dd tests is in
> http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
>
> Pat
>
> On 05/30/2017 09:27 PM, Pranith Kumar Karampuri wrote:
>
> Pat,
> What is the command you used? As per the following output, it seems
> like at least one write operation took 16 seconds, which is really bad.
>
> 96.39   1165.10 us   89.00 us   *16487014.00 us*   393212   WRITE
>
>
> On Tue, May 30, 2017 at 10:36 PM, Pat Haley <phaley at mit.edu> wrote:
>
>
> Hi Pranith,
>
> I ran the same 'dd' test both in the gluster test volume and in the
> .glusterfs directory of each brick.
> The median results (12 dd trials in each test) are similar to before
>
> * gluster test volume: 586.5 MB/s
> * bricks (in .glusterfs): 1.4 GB/s
>
> The profile for the gluster test-volume is in
>
> http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
>
> Thanks
>
> Pat
>
>
> On 05/30/2017 12:10 PM, Pranith Kumar Karampuri wrote:
>
> Let's start with the same 'dd' test we were testing with, to see what
> the numbers are. Please provide profile numbers for the same. From
> there on we will start tuning the volume to see what we can do.
>
> On Tue, May 30, 2017 at 9:16 PM, Pat Haley <phaley at mit.edu> wrote:
>
>
> Hi Pranith,
>
> Thanks for the tip. We now have the gluster volume mounted under
> /home. What tests do you recommend we run?
>
> Thanks
>
> Pat
>
>
> On 05/17/2017 05:01 AM, Pranith Kumar Karampuri wrote:
>
> On Tue, May 16, 2017 at 9:20 PM, Pat Haley <phaley at mit.edu> wrote:
>
>
> Hi Pranith,
>
> Sorry for the delay. I never received your reply (but I did receive
> Ben Turner's follow-up to your reply). So we tried to create a gluster
> volume under /home using different variations of
>
> gluster volume create test-volume mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2 transport tcp
>
> However we keep getting errors of the form
>
> Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>
>
> Any thoughts on what we're doing wrong?
>
>
> You should give transport tcp at the beginning, I think. Anyway,
> transport tcp is the default, so there is no need to specify it; just
> remove those two words from the CLI.
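Following the suggestion above, the create command can be written in two ways. This is a dry-run sketch that only prints the commands and never calls gluster; the volume and brick names are the ones from the failed attempt quoted above, and the placement of "transport tcp" before the brick list follows the suggestion in this thread, so it should be checked against the CLI help of the installed version:

```shell
#!/bin/sh
# Dry run only: the commands are built as strings and printed, not executed.
volume="test-volume"
bricks="mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2"

# Option 1: leave the transport out, since tcp is described as the default.
cmd_default="gluster volume create $volume $bricks"

# Option 2: if a transport is named, put it before the brick list so that
# the word "transport" is not parsed as a brick.
cmd_explicit="gluster volume create $volume transport tcp $bricks"

echo "$cmd_default"
echo "$cmd_explicit"
```

The error "Wrong brick type: transport" is consistent with the CLI treating everything after the first brick as further bricks.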
>
>
> Also do you have a list of the tests we should be running once we get
> this volume created? Given the time-zone difference it might help if
> we can run a small battery of tests and post the results rather than
> test-post-new test-post... .
>
>
> This is the first time I am doing performance analysis for users, as
> far as I remember. In our team there are separate engineers who do
> these tests. Ben, who replied earlier, is one such engineer.
>
> Ben,
> Have any suggestions?
>
>
> Thanks
>
> Pat
>
>
> On 05/11/2017 12:06 PM, Pranith Kumar Karampuri wrote:
>
> On Thu, May 11, 2017 at 9:32 PM, Pat Haley <phaley at mit.edu> wrote:
>
>
> Hi Pranith,
>
> The /home partition is mounted as ext4
> /home ext4 defaults,usrquota,grpquota 1 2
>
> The brick partitions are mounted as xfs
> /mnt/brick1 xfs defaults 0 0
> /mnt/brick2 xfs defaults 0 0
>
> Will this cause a problem with creating a volume under /home?
>
>
> I don't think the bottleneck is disk. Can you do the same tests you
> did on your new volume to confirm?
>
>
> Pat
>
>
> On 05/11/2017 11:32 AM, Pranith Kumar Karampuri wrote:
>
> On Thu, May 11, 2017 at 8:57 PM, Pat Haley <phaley at mit.edu> wrote:
>
>
> Hi Pranith,
>
> Unfortunately, we don't have similar hardware for a small scale test.
> All we have is our production hardware.
>
>
> You said something about the /home partition, which has fewer disks; we
> can create a plain distribute volume inside one of those directories.
> After we are done, we can remove the setup. What do you say?
>
>
> Pat
>
>
> On 05/11/2017 07:05 AM, Pranith Kumar Karampuri wrote:
>
> On Thu, May 11, 2017 at 2:48 AM, Pat Haley <phaley at mit.edu> wrote:
>
>
> Hi Pranith,
>
> Since we are mounting the partitions as the bricks, I tried the dd
> test writing to
> <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
> The results without oflag=sync were 1.6 GB/s (faster than gluster but
> not as fast as I was expecting given the 1.2 GB/s to the no-gluster
> area w/ fewer disks).
>
>
> Okay, then 1.6 GB/s is what we need to target, considering your volume
> is just distribute. Is there any way you can do tests on similar
> hardware but at a small scale, just so we can run the workload to learn
> more about the bottlenecks in the system? We can probably try to get
> the speed to 1.2 GB/s on the /home partition you were telling me about
> yesterday. Let me know if that is something you are okay to do.
>
>
> Pat
>
>
> On 05/10/2017 01:27 PM, Pranith Kumar Karampuri wrote:
>
> On Wed, May 10, 2017 at 10:15 PM, Pat Haley <phaley at mit.edu> wrote:
>
>
> Hi Pranith,
>
> Not entirely sure (this isn't my area of expertise). I'll run your
> answer by some other people who are more familiar with this.
>
> I am also uncertain about how to interpret the results when we also
> add the dd tests writing to the /home area (no gluster, still on the
> same machine)
>
> * dd test without oflag=sync (rough average of multiple tests)
>   o gluster w/ fuse mount: 570 MB/s
>   o gluster w/ nfs mount: 390 MB/s
>   o nfs (no gluster): 1.2 GB/s
> * dd test with oflag=sync (rough average of multiple tests)
>   o gluster w/ fuse mount: 5 MB/s
>   o gluster w/ nfs mount: 200 MB/s
>   o nfs (no gluster): 20 MB/s
>
> Given that the non-gluster area is a RAID-6 of 4 disks while each
> brick of the gluster area is a RAID-6 of 32 disks, I would naively
> expect the writes to the gluster area to be roughly 8x faster than to
> the non-gluster.
>
>
> I think a better test is to try to write to a file using nfs without
> any gluster, to a location that is not inside the brick but some other
> location that is on the same disk(s). If you are mounting the partition
> as the brick, then we can write to a file inside the .glusterfs
> directory, something like
> <brick-path>/.glusterfs/<file-to-be-removed-after-test>.
>
>
> I still think we have a speed issue, I can't tell if fuse vs nfs is
> part of the problem.
>
>
> I got interested in the post because I read that fuse speed is lower
> than nfs speed, which is counter-intuitive to my understanding, so I
> wanted clarification. Now that I have it (fuse outperformed nfs without
> sync), we can resume testing as described above and try to find what it
> is. Based on your email-id I am guessing you are from Boston and I am
> from Bangalore, so if you are okay with doing this debugging over
> multiple days because of timezones, I will be happy to help. Please be
> a bit patient with me, I am under a release crunch but I am very
> curious about the problem you posted.
>
> Was there anything useful in the profiles?
>
>
> Unfortunately the profiles didn't help me much. I think we are
> collecting the profiles from an active volume, so they contain a lot of
> information that does not pertain to dd, and it is difficult to find
> the contributions of dd. So I went through your post again and found
> something I didn't pay much attention to earlier, i.e. oflag=sync, so I
> did my own tests on my setup with FUSE and sent that reply.
>
>
> Pat
>
>
> On 05/10/2017 12:15 PM, Pranith Kumar Karampuri wrote:
>
> Okay good. At least this validates my doubts. Handling O_SYNC in
> gluster NFS and fuse is a bit different.
> When an application opens a file with O_SYNC on a fuse mount, each
> write syscall has to be written to disk as part of the syscall, whereas
> in the case of NFS there is no concept of open. NFS performs the write
> through a handle saying it needs to be a synchronous write, so the
> write() syscall is performed first and then it performs fsync(). So a
> write on an fd with O_SYNC becomes write+fsync. My guess is that when
> multiple threads do this write+fsync() operation on the same file,
> multiple writes are batched together to be written to disk, so the
> throughput on the disk increases.
>
> Does it answer your doubts?
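The two write patterns described above can be imitated with dd. This is only a sketch, assuming GNU coreutils dd; the file names and sizes are illustrative, and the loop is a stand-in for the write() followed by fsync() that the NFS path is described as doing, not a reproduction of the gluster code path:

```shell
#!/bin/sh
# Sketch only: O_SYNC writes versus buffered write() followed by fsync().
tmpdir=$(mktemp -d)

# Pattern 1 (FUSE case as described): the file is opened with O_SYNC, so
# each 64 KiB write() returns only after the data has reached the disk.
dd if=/dev/zero of="$tmpdir/osync" bs=64k count=16 oflag=sync 2>/dev/null

# Pattern 2 (NFS case as described): an ordinary buffered write() of one
# block, followed by an fsync() of the file, repeated for every block.
i=0
while [ "$i" -lt 16 ]; do
    dd if=/dev/zero of="$tmpdir/write_fsync" bs=64k count=1 \
        oflag=append conv=notrunc,fsync 2>/dev/null
    i=$((i + 1))
done

osync_size=$(wc -c < "$tmpdir/osync" | tr -d ' ')
fsync_size=$(wc -c < "$tmpdir/write_fsync" | tr -d ' ')
echo "O_SYNC file: $osync_size bytes; write+fsync file: $fsync_size bytes"

rm -rf "$tmpdir"
```

Both files end up with identical contents; only the point at which the data is forced to disk differs, which is where any throughput difference between the two mounts would have to come from.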
>
> On Wed, May 10, 2017 at 9:35 PM, Pat Haley <phaley at mit.edu> wrote:
>
>
> Without the oflag=sync and only a single test of each, the FUSE is
> going faster than NFS:
>
> FUSE:
> mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s
>
>
> NFS
> mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
> 4096+0 records in
> 4096+0 records out
> 4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s
>
>
> On 05/10/2017 11:53 AM, Pranith Kumar Karampuri wrote:
>
> Could you let me know the speed without oflag=sync on both the mounts?
> No need to collect profiles.
>
> On Wed, May 10, 2017 at 9:17 PM, Pat Haley <phaley at mit.edu> wrote:
>
>
> Here is what I see now:
>
> [root at mseas-data2 ~]# gluster volume info
>
> Volume Name: data-volume
> Type: Distribute
> Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
> Status: Started
> Number of Bricks: 2
> Transport-type: tcp
> Bricks:
> Brick1: mseas-data2:/mnt/brick1
> Brick2: mseas-data2:/mnt/brick2
> Options Reconfigured:
> diagnostics.count-fop-hits: on
> diagnostics.latency-measurement: on
> nfs.exports-auth-enable: on
> diagnostics.brick-sys-log-level: WARNING
> performance.readdir-ahead: on
> nfs.disable: on
> nfs.export-volumes: off
>
>
> On 05/10/2017 11:44 AM, Pranith Kumar Karampuri wrote:
>
> Is this the volume info you have?
>
> > [root at mseas-data2 ~]# gluster volume info
> >
> > Volume Name: data-volume
> > Type: Distribute
> > Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
> > Status: Started
> > Number of Bricks: 2
> > Transport-type: tcp
> > Bricks:
> > Brick1: mseas-data2:/mnt/brick1
> > Brick2: mseas-data2:/mnt/brick2
> > Options Reconfigured:
> > performance.readdir-ahead: on
> > nfs.disable: on
> > nfs.export-volumes: off
>
> I copied this from old thread from 2016. This is distribute volume.
> Did you change any of the options in between?
>

--
Pranith
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.gluster.org/pipermail/gluster-users/attachments/20170623/6d1055ce/attachment.html>