Gang He
2019-Dec-18 10:26 UTC
[Ocfs2-devel] OCFS2 has low performance when writing the same file from different nodes concurrently
Hi Guys,

OCFS2 has low performance when writing the same file from different nodes concurrently; it is very easy to reproduce this problem with dd. For example, take 5 ocfs2 nodes with a shared disk (a RAM disk) on one physical machine, then run dd to write the same file on the shared disk (an ocfs2 partition) from all 5 ocfs2 nodes concurrently. The result is as below:

sle12sp4-nd1:/ # /dd_f.sh
+ ssh sle12sp4-nd5 'dd if=/dev/zero of=/mnt/ocfs2/dd.txt conv=notrunc bs=8k count=10000'
+ ssh sle12sp4-nd4 'dd if=/dev/zero of=/mnt/ocfs2/dd.txt conv=notrunc bs=8k count=10000'
+ ssh sle12sp4-nd3 'dd if=/dev/zero of=/mnt/ocfs2/dd.txt conv=notrunc bs=8k count=10000'
+ ssh sle12sp4-nd2 'dd if=/dev/zero of=/mnt/ocfs2/dd.txt conv=notrunc bs=8k count=10000'
+ ssh sle12sp4-nd1 'dd if=/dev/zero of=/mnt/ocfs2/dd.txt conv=notrunc bs=8k count=10000'
10000+0 records in
10000+0 records out
81920000 bytes (82 MB, 78 MiB) copied, 72.0841 s, 1.1 MB/s
10000+0 records in
10000+0 records out
81920000 bytes (82 MB, 78 MiB) copied, 72.4865 s, 1.1 MB/s
10000+0 records in
10000+0 records out
81920000 bytes (82 MB, 78 MiB) copied, 72.4921 s, 1.1 MB/s
10000+0 records in
10000+0 records out
81920000 bytes (82 MB, 78 MiB) copied, 72.5006 s, 1.1 MB/s
10000+0 records in
10000+0 records out
81920000 bytes (82 MB, 78 MiB) copied, 72.5029 s, 1.1 MB/s
/dd_f.sh done ... 73 sec.

But if you run this dd command from only one ocfs2 node, or run it from 5 gfs2 nodes concurrently, the execution time is very short:

sle12sp4-nd1:/ # /dd_f.sh 1
+ ssh sle12sp4-nd5 'dd if=/dev/zero of=/mnt/gfs2/dd.txt conv=notrunc bs=8k count=10000'
+ ssh sle12sp4-nd4 'dd if=/dev/zero of=/mnt/gfs2/dd.txt conv=notrunc bs=8k count=10000'
+ ssh sle12sp4-nd3 'dd if=/dev/zero of=/mnt/gfs2/dd.txt conv=notrunc bs=8k count=10000'
+ ssh sle12sp4-nd2 'dd if=/dev/zero of=/mnt/gfs2/dd.txt conv=notrunc bs=8k count=10000'
+ ssh sle12sp4-nd1 'dd if=/dev/zero of=/mnt/gfs2/dd.txt conv=notrunc bs=8k count=10000'
10000+0 records in
10000+0 records out
81920000 bytes (82 MB, 78 MiB) copied, 0.318876 s, 257 MB/s
10000+0 records in
10000+0 records out
81920000 bytes (82 MB, 78 MiB) copied, 0.738694 s, 111 MB/s
10000+0 records in
10000+0 records out
81920000 bytes (82 MB, 78 MiB) copied, 0.795672 s, 103 MB/s
10000+0 records in
10000+0 records out
81920000 bytes (82 MB, 78 MiB) copied, 0.850756 s, 96.3 MB/s
10000+0 records in
10000+0 records out
81920000 bytes (82 MB, 78 MiB) copied, 0.90767 s, 90.3 MB/s
/dd_f.sh done ... 2 sec.

So far, I feel this problem should be considered a design issue in OCFS2. Hopefully we can optimize write performance in this case by learning from other file systems. For more detailed data, you can refer to my gist at https://gist.github.com/ganghe/bbc4f3d94e3596715dd08e39dfacd33f .

Thanks
Gang
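The /dd_f.sh driver itself is not included in the message; judging from the trace above, it is presumably close to the following sketch. The node list, mount points, and dd arguments come from the trace, while the argument handling (a "1" selecting the gfs2 mount) and the background-and-wait structure are assumptions:

#!/bin/bash
# Hypothetical reconstruction of /dd_f.sh, inferred from the trace above.
mnt=/mnt/ocfs2
[ "$1" = "1" ] && mnt=/mnt/gfs2   # assumed: argument "1" switches to the gfs2 mount
start=$SECONDS
set -x
for node in sle12sp4-nd5 sle12sp4-nd4 sle12sp4-nd3 sle12sp4-nd2 sle12sp4-nd1; do
    # launch all five writers in the background so they run concurrently
    ssh $node "dd if=/dev/zero of=$mnt/dd.txt conv=notrunc bs=8k count=10000" &
done
set +x
wait    # block until every node's dd has finished
echo "$0 done ... $((SECONDS - start)) sec."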
Changwei Ge
2019-Dec-18 12:34 UTC
[Ocfs2-devel] OCFS2 has low performance when writing the same file from different nodes concurrently
Hi Gang,

Do we really have a customer use case where a single file is written from different nodes concurrently? If a cluster application on top of ocfs2 can coordinate its members, it might make sense. Can you share some particular use cases?

Speaking of the poor performance you reported, I admit that it's true. As ocfs2 doesn't have its own journal module but re-uses ext4's jbd2, which was not designed for a cluster file system, it doesn't have a global journal transaction number assigned to each node. So each time we have to update metadata (system files, regular files, even atime/ctime/mtime, etc.), the blocked node has to drain its journal region, checkpointing all the records to their final destination in the file system, before it can release the cluster lock; that is how file system consistency is kept. Of course, ocfs2 already does some tricks, like passing inode metadata directly in the dlm lock reply to save a round of disk reads, but the checkpoint is still time consuming. If we had a global transaction mechanism, it could alleviate the performance drop.

Another point I can think of is that even when different nodes are writing different regions of a single file, ocfs2 still drops the whole address mapping of the inode. I don't think that's necessary; this point can be optimized.

Perhaps you can tweak your test workload to use direct I/O, to see if we still have such an obvious drop? For example, see the sketch below.
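A direct-I/O variant of the original test command; just a sketch (oflag=direct is a standard GNU dd operand that makes dd bypass the page cache):

dd if=/dev/zero of=/mnt/ocfs2/dd.txt conv=notrunc oflag=direct bs=8k count=10000

And to probe the whole-address-mapping invalidation point, each node could write its own non-overlapping region instead; the per-node offset scheme here is an illustrative assumption, not part of the original test:

# run on node i (i = 1..5); seek= skips (i-1)*10000 output blocks of 8k,
# so every node writes a distinct 80 MB region of the shared file
i=1    # set per node: 1 on sle12sp4-nd1, 2 on sle12sp4-nd2, ...
dd if=/dev/zero of=/mnt/ocfs2/dd.txt conv=notrunc bs=8k count=10000 \
   seek=$(( (i - 1) * 10000 ))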
Another thing I recall is that right now, for a bigger cluster size like 1M with 'sparse-file' enabled, ocfs2 has a serious write penalty that eats up the underlying bandwidth when the workload is something like 4k random writes. It would be good if we could do something to improve that too. :-)

Thanks,
Changwei

On 12/18/19 6:26 PM, Gang He wrote:
> [snip -- original message quoted in full above]