Hi,

I'm using OCFS2 from 2.6.26 with some patches I made that allow for the
creation of a volume greater than 16TB:

http://oss.oracle.com/pipermail/ocfs2-devel/2008-July/002568.html
http://oss.oracle.com/pipermail/ocfs2-tools-devel/2008-July/000857.html

The ocfs2-tools-devel post has info regarding the block/cluster size (from
the mkfs command) used, which pertains to the following question: in
general, what sort of performance numbers are people seeing for something
like "time dd if=/dev/zero of=testFile bs=4k count=500000"? I'm getting
anywhere from 120MB/s to 165MB/s. The same command on XFS using the same
hardware/LVM setup gives me 300MB/s, and with GFS2 gives 100MB/s. Currently
there's only one node in the cluster, but if other nodes are added with
similar 4GB FC HBA hardware, will these also achieve ~120-165MB/s write
speeds as long as the RAID hardware isn't being "maxed" out?

Here are some bonnie++ benchmarks:

http://structbio.vanderbilt.edu/~pattans/bonnie-porpoise.html

Also, if any devs could look at the patches to see if I missed anything
that might cause OCFS2 to blow up if it reaches for a block offset greater
than 2^32 - 1, I would greatly appreciate it (please post in reply to the
posts on the -devel lists). As far as the write testing goes, it's only at
1.1T of 18T written, i.e. it'll take a day or two, and then I'll try some
fseek and read calls at large offsets.

Thanks,

Sabuj Pattanayek
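For reference, the kind of mkfs.ocfs2 invocation being discussed would look
roughly like the sketch below. The actual block/cluster sizes used here are
in the linked ocfs2-tools-devel post; the sizes, label, slot count, and
device below are placeholders, not the real values:

  # Placeholder sketch only -- real block/cluster sizes are in the linked
  # -tools-devel post; device, label and node-slot count are made up.
  mkfs.ocfs2 -b 4K -C 4K -N 4 -L ocfs2_0 /dev/mapper/vg-ocfs2_0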
Sabuj Pattanayek wrote:

> The ocfs2-tools-devel post has info regarding the block/cluster size
> (from the mkfs command) used which will pertain to the following
> question: in general, what sort of performance numbers are people
> seeing for something like "time dd if=/dev/zero of=testFile bs=4k
> count=500000"? I'm getting anywhere from 120MB/s to 165MB/s. The same
> command on XFS using the same hardware/LVM setup gives me 300MB/s and
> with GFS2 gives 100MB/s. Currently there's only one node in the
> cluster but if other nodes are added with similar 4GB FC HBA hardware
> will these also achieve ~120-165MB/s write speeds as long as the RAID
> hardware isn't being "maxed" out?

Try it out. If not, then we have a bottleneck somewhere.

One obvious bottleneck is the global bitmap. The fs works around this by
using a node-local bitmap cache called localalloc. By default it is 8MB.
So if you are using 4K/4K (block/cluster), you will hit the global bitmap
(and thus the cluster lock) every 2048 clusters of allocation. If that is
a bottleneck, you can mount with a larger localalloc. To mount with a 16MB
localalloc, do:

mount -o localalloc=16 ...

XFS has delayed allocation, which allows it to write data in fewer extents
and thus gives better I/O throughput for buffered access.

> Here are some bonnie++ benchmarks:
>
> http://structbio.vanderbilt.edu/~pattans/bonnie-porpoise.html
>
> Also if any devs could look at the patches to see if I missed anything
> that might cause OCFS2 to blow up if it reaches for a block offset
> greater than 2^32 - 1, would greatly appreciate it (please post in
> reply to the posts on the -devel lists). As far as the write testing
> is going, it's only at 1.1T of 18T written, i.e. it'll take a day or
> two and then I'll have to try some fseek and read calls for large
> offsets.

So JBD2 will allow one to go beyond 4 billion blocks. But to make ocfs2
access beyond 16T, you will for the time being need to use a clustersize
greater than 4K. Making ocfs2 with a 4K clustersize access beyond 16T will
need more changes. See the task titled "Support more than 32-bits worth of
clusters" at:

http://oss.oracle.com/osswiki/OCFS2/LargeTasksList

A quick way to fill up space could be using unwritten extents. That will
just allocate the space and not bother writing to it. Check out
reserve_space/reserve_space.c in the ocfs2-test project.

As far as the kernel patches go, we would like backward compatibility. As
in, not getting rid of jbd just yet. Maybe an incompat flag. But this has
not been decided. Let us know how it goes.

Sunil
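To illustrate the localalloc arithmetic above, a rough sketch of bumping
the window and checking it took effect; the 32MB value, device, and
mountpoint are placeholders (localalloc is given in MB):

  # With 4K clusters, the default 8MB local alloc window covers 8M/4K = 2048
  # clusters before the global bitmap (and its cluster lock) is hit again;
  # a 32MB window stretches that to 8192 clusters.
  mount -t ocfs2 -o localalloc=32 /dev/mapper/vg-ocfs2_0 /export/ocfs2_0
  grep ocfs2 /proc/mounts    # the options should now include localalloc=32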
On Thu, Jul 17, 2008 at 03:54:19PM -0500, Sabuj Pattanayek wrote:

> The ocfs2-tools-devel post has info regarding the block/cluster size
> (from the mkfs command) used which will pertain to the following
> question: in general, what sort of performance numbers are people
> seeing for something like "time dd if=/dev/zero of=testFile bs=4k
> count=500000"? I'm getting anywhere from 120MB/s to 165MB/s. The same
> command on XFS using the same hardware/LVM setup gives me 300MB/s and
> with GFS2 gives 100MB/s.

Try mounting Ocfs2 in writeback journaling mode:

mount -t ocfs2 -o data=writeback /dev/XXX /mountpoint

That should increase your performance for streaming writes.

By the way, are you only timing the dd, or are you doing dd; sync and
timing the entire operation? The latter is a better measurement of how
long it takes to get data written to disk.

--Mark

--
Mark Fasheh
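If the writeback mode helps, one way it could be made persistent across
remounts is via fstab; a sketch only, with the device, mountpoint, and
option values taken as placeholders matching the setup described later in
the thread:

  # /etc/fstab sketch -- device/mountpoint are placeholders; _netdev delays
  # the mount until the network (and cluster stack) is up.
  /dev/mapper/vg-ocfs2_0  /export/ocfs2_0  ocfs2  _netdev,data=writeback,localalloc=16  0 0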
> Try mounting Ocfs2 in writeback journaling mode:
>
> mount -t ocfs2 -odata=writeback /dev/XXX /mountpoint

Yup:

/dev/mapper/vg-ocfs2_0 on /export/ocfs2_0 type ocfs2
(rw,_netdev,localalloc=16,data=writeback,heartbeat=local)

> By the way, are you only timing the dd, or are you doing dd;sync and
> timing the entire operation? The latter is a better measurement of how
> long it takes to get data written to disk.
> --Mark

I have a pl script that does start timer, dd, sync, end timer using
gettimeofday(), but this seems to give slightly slower results than doing
it like this:

pattans at orca ~/san1/tmp $ time { dd if=/dev/zero of=testFile.porpoise bs=4k count=500000 ; sync; }
2048000000 bytes (2.0 GB) copied, 15.047 s, 136 MB/s

pattans at orca ~/san1/tmp $ time { dd if=/dev/zero of=testFile.porpoise bs=8k count=250000 ; sync; }
2048000000 bytes (2.0 GB) copied, 12.5018 s, 164 MB/s

pattans at orca ~/san1/tmp $ time { dd if=/dev/zero of=testFile.porpoise bs=16k count=125000 ; sync; }
2048000000 bytes (2.0 GB) copied, 13.7218 s, 149 MB/s

pattans at orca ~/san1/tmp $ time { dd if=/dev/zero of=testFile.porpoise bs=32k count=62500 ; sync; }
2048000000 bytes (2.0 GB) copied, 13.647 s, 150 MB/s

pattans at orca ~/san1/tmp $ time { dd if=/dev/zero of=testFile.porpoise bs=64k count=31250 ; sync; }
2048000000 bytes (2.0 GB) copied, 11.9441 s, 171 MB/s

pattans at orca ~/san1/tmp $ time { dd if=/dev/zero of=testFile.porpoise bs=64k count=31250 ; sync; }
2048000000 bytes (2.0 GB) copied, 12.083 s, 169 MB/s

pattans at orca ~/san1/tmp $ time { dd if=/dev/zero of=testFile.porpoise bs=128k count=15625 ; sync; }
2048000000 bytes (2.0 GB) copied, 11.9659 s, 171 MB/s

pattans at orca ~/san1/tmp $ time { dd if=/dev/zero of=testFile.porpoise bs=256k count=7812 ; sync; }
2047868928 bytes (2.0 GB) copied, 14.3243 s, 143 MB/s

Any way to change the default block write size, similar to the wsize and
rsize NFS mount options? Still working on the stripe; need to get some
other RAID hardware stabilized.
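The same block-size sweep can be scripted so each run is timed identically;
a small sketch, with the file name and the ~2GB total taken from the runs
above (adjust the path for your own mountpoint):

  # Sweep dd block sizes over the same ~2GB write, timing dd + sync together.
  for bs_kb in 4 8 16 32 64 128 256; do
      count=$(( 2000000 / bs_kb ))    # keep the total at ~2,000,000 KiB
      echo "== bs=${bs_kb}k count=${count} =="
      time { dd if=/dev/zero of=testFile.porpoise bs=${bs_kb}k count=${count}; sync; }
  done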