This may not be a ZFS issue, so please bear with me!

I have 4 internal drives that I have striped/mirrored with ZFS and have an application server which is reading/writing to hundreds of thousands of files on it, thousands of files @ a time.

If 1 client uses the app server, the transaction (reading/writing to ~80 files) takes about 200 ms. If I have about 80 clients attempting it @ once, it can sometimes take a minute or more. I'm pretty sure it's a file I/O bottleneck, so I want to make sure ZFS is tuned properly for this kind of usage.

The only thing I could think of, so far, is to turn off ZFS compression. Is there anything else I can do? Here is my "zpool iostat" output:

# zpool iostat 5
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool1       5.69G   266G     23     76  1.44M  2.24M
pool1       5.69G   266G     96    259  5.70M  7.25M
pool1       5.69G   266G     98    267  5.73M  7.32M
pool1       5.69G   266G     92    253  5.76M  7.31M
pool1       5.69G   266G     90    254  5.67M  7.43M

and here is regular iostat:

# iostat -xnz 5
                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
    0.0    0.2     0.0     0.1   0.0   0.0     0.0     0.3   0   0  c0t0d0
    0.0    0.2     0.0     0.1   0.0   0.0     0.0     0.3   0   0  c0t1d0
   20.4  145.0  1315.8  3714.5   0.0   2.8     0.0    16.8   0  21  c0t2d0
   21.4  143.2  1380.2  3711.3   0.0   4.1     0.0    25.1   0  27  c0t3d0
   23.4  138.4  1509.3  3693.0   0.0   1.6     0.0     9.8   0  17  c0t4d0
   20.8  137.8  1341.6  3693.0   0.0   2.3     0.0    14.7   0  21  c0t5d0

This message posted from opensolaris.org
Some more information about the system. NOTE: CPU utilization never goes above 10%.

Sun Fire v40z
4 x 2.4 GHz proc
8 GB memory
3 x 146 GB Seagate drives (10k RPM)
1 x 146 GB Fujitsu drive (10k RPM)
William Fretts-Saxton <william.fretts.saxton <at> sun.com> writes:
> Some more information about the system. NOTE: CPU utilization never
> goes above 10%.
>
> Sun Fire v40z
> 4 x 2.4 GHz proc
> 8 GB memory
> 3 x 146 GB Seagate drives (10k RPM)
> 1 x 146 GB Fujitsu drive (10k RPM)

And what version of Solaris or what build of OpenSolaris are you using? Do you know if your application uses synchronous I/O transactions? Have you tried disabling ZFS file-level prefetching (just as an experiment)? See:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#File-Level_Prefetching

-marc
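For readers following the thread: on Solaris 10 of this vintage, file-level prefetching is controlled by the zfs_prefetch_disable tunable described in the Evil Tuning Guide linked above. A hedged sketch of both the live and persistent forms (configuration fragments to verify against the guide for your exact release, not a blessed recipe):

```shell
# Live system: flip the tunable immediately with mdb (root required):
echo "zfs_prefetch_disable/W0t1" | mdb -kw

# Persistent form: add the following line to /etc/system and reboot:
#   set zfs:zfs_prefetch_disable = 1
```

Setting the value back to 0 (W0t0) re-enables prefetch without a reboot.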
Hi Marc,

# cat /etc/release
       Solaris 10 8/07 s10x_u4wos_12b X86

I don't know if my application uses synchronous I/O transactions... I'm using Sun's Glassfish v2u1.

I've deleted the ZFS partition and have set up an SVM stripe/mirror just to see if "ZFS" is getting in the way. I'll try out the prefetching idea when I'm done with the SVM testing. Thanks.
I disabled file prefetch and there was no effect.

Here are some performance numbers. Note that, when the application server used a ZFS file system to save its data, the transaction took TWICE as long. For some reason, though, iostat is showing 5x as much disk writing (to the physical disks) on the ZFS partition. Can anyone see a problem here?

Average application server client response time (1st run/2nd run):

SVM - 12/18 seconds
ZFS - 35/38 seconds

SVM Performance
---------------
# iostat -xnz 5
                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
  195.1  414.3  1465.9  1657.3   0.0   1.7     0.0     2.7   0  98  md/d100
   97.5  414.3   730.2  1657.3   0.0   1.0     0.0     1.9   0  74  md/d101
   97.7  414.1   735.8  1656.5   0.0   0.8     0.0     1.5   0  59  md/d102
   54.4  203.6   370.7   814.2   0.0   0.5     0.0     2.1   0  42  c0t2d0
   52.8  210.6   359.5   842.2   0.0   0.5     0.0     1.9   0  40  c0t3d0
   54.0  203.6   374.7   814.2   0.0   0.3     0.0     1.2   0  26  c0t4d0
   52.2  210.6   361.1   842.2   0.0   0.5     0.0     1.8   0  38  c0t5d0

ZFS Performance
---------------
# iostat -xnz 5
                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
   23.2  148.8  1496.7  3806.8   0.0   2.5     0.0    14.7   0  21  c0t2d0
   22.8  148.8  1470.9  3806.8   0.0   2.4     0.0    13.9   0  22  c0t3d0
   24.2  149.0  1561.1  3805.0   0.0   1.5     0.0     8.6   0  18  c0t4d0
   23.4  149.4  1509.6  3805.0   0.0   2.5     0.0    14.7   0  25  c0t5d0

# zpool iostat 5
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool1       5.69G   266G     12    243   775K  7.20M
pool1       5.69G   266G     88    232  5.53M  7.12M
pool1       5.69G   266G     78    216  4.87M  6.81M
On Feb 6, 2008 6:36 PM, William Fretts-Saxton <william.fretts.saxton at sun.com> wrote:
> Here are some performance numbers. Note that, when the
> application server used a ZFS file system to save its data, the
> transaction took TWICE as long. For some reason, though, iostat is
> showing 5x as much disk writing (to the physical disks) on the ZFS
> partition. Can anyone see a problem here?

What is the disk layout of the zpool in question? Striped? Mirrored? Raidz? I would suggest either a simple stripe or striping+mirroring as the best-performing layout.
It is a striped/mirror:

# zpool status
        NAME        STATE     READ WRITE CKSUM
        pool1       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
Solaris 10u4, eh? Sounds a lot like fsync issues we ran into trying to run Cyrus mail-server spools on ZFS. This was highlighted for us by the filebench varmail test. OpenSolaris nv78, however, worked very well.
William Fretts-Saxton <william.fretts.saxton <at> sun.com> writes:
> I disabled file prefetch and there was no effect.
>
> Here are some performance numbers. Note that, when the application server
> used a ZFS file system to save its data, the transaction took TWICE as long.
> For some reason, though, iostat is showing 5x as much disk
> writing (to the physical disks) on the ZFS partition. Can anyone see a
> problem here?

Possible explanation: the Glassfish applications are using synchronous writes, causing the ZIL (ZFS Intent Log) to be intensively used, which leads to a lot of extra I/O. Try to disable it:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29

Since disabling it is not recommended, if you find out it is the cause of your perf problems, you should instead try to use a SLOG (separate intent log, see above link). Unfortunately your OS version (Solaris 10 8/07) doesn't support SLOGs; they have only been added to OpenSolaris build snv_68:

http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

-marc
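As the linked "Evil Tuning Guide" section stresses, disabling the ZIL is for diagnosis only (a crash can then lose data that applications believe was committed). A sketch of the period-appropriate forms, which should be checked against the guide before use:

```shell
# Live system (affects filesystems mounted after the change):
echo "zil_disable/W0t1" | mdb -kw

# Persistent form: add the following line to /etc/system and reboot:
#   set zfs:zil_disable = 1
```

Remount the filesystem under test after flipping the tunable, and set it back to 0 once the experiment is over.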
Marc Bevand wrote:
> William Fretts-Saxton <william.fretts.saxton <at> sun.com> writes:
>
>> I disabled file prefetch and there was no effect.
>>
>> Here are some performance numbers. Note that, when the application server
>> used a ZFS file system to save its data, the transaction took TWICE as long.
>> For some reason, though, iostat is showing 5x as much disk
>> writing (to the physical disks) on the ZFS partition. Can anyone see a
>> problem here?
>
> Possible explanation: the Glassfish applications are using synchronous
> writes, causing the ZIL (ZFS Intent Log) to be intensively used, which
> leads to a lot of extra I/O.

The ZIL doesn't do a lot of extra IO. It usually just does one write per synchronous request and will batch up multiple writes into the same log block if possible. However, it does need to wait for the writes to be on stable storage before returning to the application, which is what the application has requested. It does this by waiting for the write to complete and then flushing the disk write cache. If the write cache is battery backed for all zpool devices, then the global zfs_nocacheflush can be set to give dramatically better performance.

> Try to disable it:
>
> http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29
>
> Since disabling it is not recommended, if you find out it is the cause of your
> perf problems, you should instead try to use a SLOG (separate intent log, see
> above link). Unfortunately your OS version (Solaris 10 8/07) doesn't support
> SLOGs; they have only been added to OpenSolaris build snv_68:
>
> http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
>
> -marc
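For reference, zfs_nocacheflush is a system-wide setting; a minimal /etc/system fragment, assuming (as Neil says) every pool device sits behind a battery-backed cache:

```shell
# /etc/system fragment -- only safe when ALL zpool devices have
# non-volatile (battery-backed) write caches; takes effect at boot.
set zfs:zfs_nocacheflush = 1
```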
william.fretts.saxton at sun.com said:
> Here are some performance numbers. Note that, when the application server
> used a ZFS file system to save its data, the transaction took TWICE as long.
> For some reason, though, iostat is showing 5x as much disk writing (to the
> physical disks) on the ZFS partition. Can anyone see a problem here?

I'm not familiar with the application in use here, but your iostat numbers remind me of something I saw during "small overwrite" tests on ZFS. Even though the test was doing only writing, because it was writing over only a small part of existing blocks, ZFS had to read (the unchanged part of) each old block in before writing out the changed block to a new location (COW).

This is a case where you want to set the ZFS recordsize to match your application's typical write size, in order to avoid the read overhead inherent in partial-block updates. UFS by default has a smaller max blocksize than ZFS' default 128k, so in addition to the ZIL/fsync issue, UFS will also suffer less overhead from such partial-block updates.

Again, this may not be what's going on, but it's worth checking if you haven't already done so.

Regards,
Marion
Neil Perrin <Neil.Perrin <at> Sun.COM> writes:
> The ZIL doesn't do a lot of extra IO. It usually just does one write per
> synchronous request and will batch up multiple writes into the same log
> block if possible.

Ok, I was wrong then. Well, William, I think Marion Hakanson has the most plausible explanation. As he suggests, experiment with "zfs set recordsize=XXX" to force the filesystem to use small records. See the zfs(1M) manpage.

-marc
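A sketch of the experiment Marc suggests ("pool1/data" is a placeholder filesystem name here; note that recordsize only affects blocks written after the change):

```shell
# Set an 8K recordsize on the filesystem holding the RRD files,
# then confirm the property took effect:
zfs set recordsize=8K pool1/data
zfs get recordsize pool1/data
```

Re-copy existing files afterward so they are rewritten with the new record size.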
Unfortunately, I don't know the record size of the writes. Is it as simple as looking @ the size of a file, before and after a client request, and noting the difference in size? This is binary data, so I don't know if that makes a difference, but the average write size is a lot smaller than the file size.

Should the recordsize be in place BEFORE data is written to the file system, or can it be changed after the fact? I might try a bunch of different settings for trial and error.

The I/O is actually done by RRD4J, which is a round-robin database library. It is a Java version of 'rrdtool' which saves data into a binary format, but also "cleans up" the data according to its age, saving less of the older data as time goes on.
I just installed nv82, so we'll see how that goes. I'm going to try the recordsize idea above as well.

A note about UFS: I was told by our local admin guru that ZFS turns on write caching for disks, which is something that a UFS file system should not have turned on, so if I convert the ZFS f/s to a UFS one, I could be giving UFS an unrealistic "boost" in performance because it would still have the caching on.
One thing I just observed is that the initial file size is 65796 bytes. When it gets an update, the file size remains @ 65796. Is there a minimum file size?
William,

It should be fairly easy to find the record size using DTrace. Take an aggregation of the writes happening (aggregate on size for all the write(2) system calls). This would give a fair idea of the IO size pattern.

Does RRD4J have a record size mentioned? Usually if it is a database application, they have a record-size option when the DB is created (based on my limited knowledge about DBs).

Thanks and regards,
Sanjeev.

PS: Here is a simple script which just aggregates on the write size and executable name:

-- snip --
#!/usr/sbin/dtrace -s

syscall::write:entry
{
        wsize = (size_t) arg2;
        @write[wsize, execname] = count();
}
-- snip --

William Fretts-Saxton wrote:
> Unfortunately, I don't know the record size of the writes. Is it as simple as
> looking @ the size of a file, before and after a client request, and noting the
> difference in size? This is binary data, so I don't know if that makes a
> difference, but the average write size is a lot smaller than the file size.
>
> Should the recordsize be in place BEFORE data is written to the file system, or
> can it be changed after the fact? I might try a bunch of different settings for
> trial and error.
>
> The I/O is actually done by RRD4J, which is a round-robin database library. It
> is a Java version of 'rrdtool' which saves data into a binary format, but also
> "cleans up" the data according to its age, saving less of the older data as
> time goes on.

-- 
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
RRD4J isn't a DB, per se, so it doesn't really have a "record" size. In fact, I don't even know whether, when data is written to the binary, it is contiguous or not, so the amount written may not directly correlate to a proper record size.

I did run your command and found the size patterns you were talking about:

      462  java        409
     3320  java        409
     6819  java        409
        5  java       1227
        1  java       1692
       16  java       3243

"409" is the number of clients I tested, so I assume it means the largest write it makes is "6819". Is that bits or bytes? Does that mean I should try setting my recordsize equal to the lowest multiple of 512 GREATER than 6819? (14 x 512 = 7168)
Slight correction. 'recsize' must be a power of 2, so it would be 8192.
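The rounding in the two posts above can be sketched mechanically: start at the 512-byte minimum and keep doubling until the largest observed write (6819 bytes) fits.

```shell
# Smallest power-of-two recordsize >= the largest observed write (6819 bytes):
size=6819
rec=512
while [ "$rec" -lt "$size" ]; do
  rec=$((rec * 2))
done
echo "$rec"    # prints 8192
```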
To avoid making multiple posts, I'll just write everything here:

- Moving to nv_82 did not seem to do anything, so it doesn't look like fsync was the issue.

- Disabling the ZIL didn't do anything either.

- Still playing with 'recsize' values, but it doesn't seem to be doing much... I don't think I have a good understanding of what exactly is being written... I think the whole file might be overwritten each time because it's in binary format.

- Setting zfs_nocacheflush, though, got me drastically increased throughput--client requests took, on average, less than 2 seconds each!

So, in order to use this, I should have a storage array, w/ battery backup, instead of using the internal drives, correct? I have the option of using a 6120 or 6140 array on this system, so I might just try that out.
> -Still playing with 'recsize' values but it doesn't seem to be doing
> much... I don't think I have a good understanding of what exactly is being
> written... I think the whole file might be overwritten each time
> because it's in binary format.

The other thing to keep in mind is that tunables like compression and recsize only affect newly written blocks. If you have a bunch of data that was already laid down on disk and then you change the tunable, this will only cause new blocks to have the new size. If you experiment with this, make sure all of your data has the same blocksize by copying it over to the new pool once you've changed the properties.

> -Setting zfs_nocacheflush, though, got me drastically increased
> throughput--client requests took, on average, less than 2 seconds
> each!
>
> So, in order to use this, I should have a storage array, w/ battery
> backup, instead of using the internal drives, correct?

zfs_nocacheflush should only be used on arrays with a battery-backed cache. If you use this option on a disk, and you lose power, there's no guarantee that your write successfully made it out of the cache.

A performance problem when flushing the cache of an individual disk implies that there's something wrong with the disk or its firmware. You can disable the write cache of an individual disk using format(1M). When you do this, ZFS won't lose any data, whereas enabling zfs_nocacheflush can lead to problems.

I'm attaching a DTrace script that will show the cache-flush times per vdev. Remove the zfs_nocacheflush tuneable and re-run your test while using this DTrace script. If one particular disk takes longer than the rest to flush, this should show us. In that case, we can disable the write cache on that particular disk. Otherwise, we'll need to disable the write cache on all of the disks. The script is attached as zfs_flushtime.d

Use format(1M) with the -e option to adjust the write_cache settings for SCSI disks.
-j

[attachment: zfs_flushtime.d]

#!/usr/sbin/dtrace -Cs
/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */

/*
 * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

#define DKIOC                   (0x04 << 8)
#define DKIOCFLUSHWRITECACHE    (DKIOC|34)

fbt:zfs:vdev_disk_io_start:entry
/(args[0]->io_cmd == DKIOCFLUSHWRITECACHE) && (self->traced == 0)/
{
        self->traced = args[0];
        self->start = timestamp;
}

fbt:zfs:vdev_disk_ioctl_done:entry
/args[0] == self->traced/
{
        @a[stringof(self->traced->io_vd->vdev_path)] =
            quantize(timestamp - self->start);
        self->start = 0;
        self->traced = 0;
}
> -Setting zfs_nocacheflush, though, got me drastically
> increased throughput--client requests took, on
> average, less than 2 seconds each!
>
> So, in order to use this, I should have a storage
> array, w/ battery backup, instead of using the
> internal drives, correct? I have the option of using
> a 6120 or 6140 array on this system so I might just
> try that out.

We use 3510 and 2540 arrays for Cyrus mail-stores which hold about 10K accounts each. Recommend going with dual controllers, though, for safety.

Our setups are really simple. Put 2 array units on the SAN, make a pair of RAID-5 LUNs. Then RAID-10 these LUNs together in ZFS.
William Fretts-Saxton wrote:
> Unfortunately, I don't know the record size of the writes. Is it as simple as
> looking @ the size of a file, before and after a client request, and noting the
> difference in size? This is binary data, so I don't know if that makes a
> difference, but the average write size is a lot smaller than the file size.
>
> Should the recordsize be in place BEFORE data is written to the file system, or
> can it be changed after the fact? I might try a bunch of different settings for
> trial and error.
>
> The I/O is actually done by RRD4J, which is a round-robin database library. It
> is a Java version of 'rrdtool' which saves data into a binary format, but also
> "cleans up" the data according to its age, saving less of the older data as
> time goes on.

You should tune that at the application level, see https://rrd4j.dev.java.net/ down in the "performance issue" section. Try the "NIO" backend and use a smaller (2048?) record size...

-- 
This space was intended to be left blank.
> The other thing to keep in mind is that the tunables like compression
> and recsize only affect newly written blocks. If you have a bunch of
> data that was already laid down on disk and then you change the tunable,
> this will only cause new blocks to have the new size. If you experiment
> with this, make sure all of your data has the same blocksize by copying
> it over to the new pool once you've changed the properties.

Is deleting the old files/directories in the ZFS file system sufficient, or do I need to destroy/recreate the pool and/or file system itself? I've been doing the former.

I will use your dtrace script today and get back to you. Thanks for that.
We are going to get a 6120 for this temporarily. If all goes well, we are going to move to a 6140 SAN solution.
Hi Daniel.

I take it you are an RRD4J user? I didn't see anything in the "performance issues" area that would help. Please let me know if I'm missing something:

- The default of RRD4J is to use the NIO backend, so that is already in place.
- Pooling won't help because there is almost never a time when an RRD file will be accessed simultaneously.
- I'm using trial and error when it comes to the recsize right now, so I'll post back with my results. Right now, it looks like a higher recsize is better (16k better performance than 8k, etc.), which is strange, but I'm not done yet.
William Fretts-Saxton wrote:
> Unfortunately, I don't know the record size of the writes. Is it as
> simple as looking @ the size of a file, before and after a client
> request, and noting the difference in size?

and

> The I/O is actually done by RRD4J, [...] a Java version of 'rrdtool'

If it behaves like rrdtool, it will limit the size of the file by consolidating older data. After every n samples, older data will be replaced by an aggregate, freeing space for new samples. To me that implies random I/O. You really need a tool like dtrace (or old-fashioned truss) to see the sample rate and size.

Cheers,
Henk
Hello William,

Thursday, February 7, 2008, 7:46:51 PM, you wrote:

WFS> -Setting zfs_nocacheflush, though, got me drastically increased
WFS> throughput--client requests took, on average, less than 2 seconds each!

That's interesting - a bug in the scsi driver for the v40z?

-- 
Best regards,
Robert                       mailto:milek at task.gda.pl
                             http://milek.blogspot.com
On Feb 5, 2008 9:52 PM, William Fretts-Saxton <william.fretts.saxton at sun.com> wrote:
> This may not be a ZFS issue, so please bear with me!
>
> I have 4 internal drives that I have striped/mirrored with ZFS and have an
> application server which is reading/writing to hundreds of thousands of
> files on it, thousands of files @ a time.
>
> If 1 client uses the app server, the transaction (reading/writing to ~80
> files) takes about 200 ms. If I have about 80 clients attempting it @ once,
> it can sometimes take a minute or more. I'm pretty sure it's a file I/O
> bottleneck so I want to make sure ZFS is tuned properly for this kind of
> usage.
>
> The only thing I could think of, so far, is to turn off ZFS compression.
> Is there anything else I can do?

Hi William

To improve performance, consider turning off atime, assuming you don't need it...

# zfs set atime=off POOL/filesystem

_J
It does. The file size is limited to the original creation size, which is 65k for files with 1 data sample.

Unfortunately, I have zero experience with dtrace and only a little with truss. I'm relying on the dtrace scripts from people on this thread to get by for now!
I ran this dtrace script and got no output. Any ideas?
> Is deleting the old files/directories in the ZFS file system
> sufficient or do I need to destroy/recreate the pool and/or file
> system itself? I've been doing the former.

The former should be sufficient; it's not necessary to destroy the pool.

-j
After working with Sanjeev, and putting in a bunch of timing statements throughout the code, it turns out that file writes ARE NOT the bottleneck, as would be assumed. It is actually reading the file into a byte buffer that is the culprit. Specifically, this Java call:

byteBuffer = file.getChannel().map(mapMode, 0, length);

I'm going to try to apply some of the same things I tried here with troubleshooting the writes to the reads now. If anyone has any different advice, please let me know. Thanks for all the help so far.