I'm running OpenIndiana 151a5 on two identical Supermicro boxes (hardware
details at the bottom of this message). Pretty clean installation, no extra
services running; their only purpose is to be storage servers. No dedup.

Today I started migrating file systems from some old OpenSolaris servers
to these Supermicro boxes and noticed that the transfer to one of them
was going about 10x slower than to the other (roughly 10GB/hour). I was
using zfs send/receive over ssh. This server is doing pretty much nothing:
it exports 3 file systems via NFS and only one of them is in any real use.

Running 'zpool iostat -v' (attachment zpool-IOStat.png) shows 1.22K write
operations on the drives and 661 on the ZIL. Compared to the other server
(which is under much heavier use than this one), these numbers are
extremely high.

Any idea how to debug this further?

Hardware:
  Supermicro SC847
  MB X9DRH-iF
  Xeon E5-2620 2GHz 6-core
  LSI SAS9211-8i HBA
  32GB RAM
  Storage pool drives: 3TB Hitachi SAS 7.2K
  ZIL mirror and L2ARC: 80GB Intel SSD SATA 3Gb/s

Thank you,

--
Peter
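P.S. For completeness, the migration is just a plain send/receive pipeline
over ssh, along these lines (host, filesystem and snapshot names here are
placeholders, not the real ones):

  # zfs snapshot pool01/somefs@migrate
  # zfs send pool01/somefs@migrate | ssh newbox zfs receive -F pool01/somefs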
On Wed, 16 Jan 2013, Peter Wood wrote:

> Running 'zpool iostat -v' (attachment zpool-IOStat.png) shows 1.22K write
> operations on the drives and 661 on the ZIL. Compared to the other server
> (which is under much heavier use than this one), these numbers are
> extremely high.
>
> Any idea how to debug this further?

Do some filesystems contain many snapshots? Do some filesystems use small
zfs block sizes? Have the servers been used in the same way?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
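P.S. A quick sketch of how to check both (substitute your pool name for
"pool01"; the grep just drops the snapshot entries from the property list):

  # zfs list -H -t snapshot -r pool01 | wc -l
  # zfs get -r recordsize pool01 | grep -v @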
On 01/16/2013 10:25 PM, Peter Wood wrote:

> Today I started migrating file systems from some old OpenSolaris servers
> to these Supermicro boxes and noticed that the transfer to one of them
> was going about 10x slower than to the other (roughly 10GB/hour).

What does "dladm show-link" show? I'm guessing one of your links is at
100Mbps or at half duplex.
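E.g., run on both boxes and compare:

  # dladm show-link
  # dladm show-phys

If I remember right, show-phys includes the negotiated SPEED and DUPLEX
columns, which is where a 100Mbps or half-duplex link would show up.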
I have a script that rotates hourly, daily and monthly snapshots. Each
filesystem has about 40 snapshots (zfsList.png - output of 'zfs list |
grep -v home/'; the home directory datasets are snipped from the output,
4 users in total).

I noticed that the hourly snapshots on the most heavily used filesystem
are about 1.2GB in size, whereas on the other system the regular
NFS-exported filesystem has snapshots of about 60MB (gallerySnapshots.png
- output of 'zfs list -t snapshot -r pool01/utils/gallery').

I know that the gallery FS is in heavier use than normal, but I was told
it would be mostly reads, and based on the iostat there seems to be heavy
writing too. I guess I'll schedule some downtime, disable the gallery
export, and see whether that affects the number of write operations and
performance in general. Unless there is some other way to find out
what/where these write operations are being applied.

The 'zpool iostat -v' output is uncomfortably static. The values for
read/write operations and bandwidth are the same for hours and even days.
I'd expect at least some variation between morning and night. The load on
the servers is different for sure. Any input?

Thanks,

-- Peter

On Wed, Jan 16, 2013 at 7:49 PM, Bob Friesenhahn
<bfriesen at simple.dallas.tx.us> wrote:

> On Wed, 16 Jan 2013, Peter Wood wrote:
>
>> Running 'zpool iostat -v' (attachment zpool-IOStat.png) shows 1.22K write
>> operations on the drives and 661 on the ZIL. Compared to the other server
>> (which is under much heavier use than this one), these numbers are
>> extremely high.
>>
>> Any idea how to debug this further?
>
> Do some filesystems contain many snapshots? Do some filesystems use small
> zfs block sizes? Have the servers been used in the same way?
>
> Bob
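P.S. In case it helps to reproduce the numbers, a variant of the same
listing with explicit columns, sorted by creation time, would be roughly:

  # zfs list -t snapshot -r pool01/utils/gallery -o name,used,refer -s creation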
On Thu, 17 Jan 2013, Peter Wood wrote:

> Unless there is some other way to find out what/where these write
> operations are being applied.

You can install Brendan Gregg's DTraceToolkit and use it to find out who
and what is doing all the writing. 1.2GB in an hour is quite a lot of
writing. If this is going on continuously, then it may be causing more
fragmentation in conjunction with your snapshots.

See "http://www.brendangregg.com/dtrace.html".

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
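P.S. Even without the full toolkit, a quick DTrace one-liner can break
local write syscalls down by process (just a sketch; run as root and press
Ctrl-C to stop):

  # dtrace -n 'syscall::write:entry { @bytes[execname] = sum(arg2); }'

The toolkit's rwsnoop and iosnoop scripts give a more detailed per-file
and per-device view.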
Thanks Bob. Will do.

I forgot the attachments. Here they are, with the hope of bringing more
clarity to the problem.

On Thu, Jan 17, 2013 at 3:42 PM, Bob Friesenhahn
<bfriesen at simple.dallas.tx.us> wrote:

> On Thu, 17 Jan 2013, Peter Wood wrote:
>
>> Unless there is some other way to find out what/where these write
>> operations are being applied.
>
> You can install Brendan Gregg's DTraceToolkit and use it to find out who
> and what is doing all the writing. 1.2GB in an hour is quite a lot of
> writing. If this is going on continuously, then it may be causing more
> fragmentation in conjunction with your snapshots.
>
> See "http://www.brendangregg.com/dtrace.html".
>
> Bob
On Thu, Jan 17, 2013 at 5:33 PM, Peter Wood <peterwood.sd at gmail.com> wrote:

> The 'zpool iostat -v' output is uncomfortably static. The values for
> read/write operations and bandwidth are the same for hours and even days.
> I'd expect at least some variation between morning and night. The load on
> the servers is different for sure. Any input?

Without a repetition interval parameter, zpool iostat prints exactly once
and exits, and the output is an average from kernel boot until "now", just
like iostat. That is why it seems so static. If you want to know the
activity over 5-second intervals, use something like "zpool iostat -v 5"
(repeat every 5 seconds) and wait for the second and later blocks; those
are averages from the previous output until "now". I generally use
5-second intervals to match the 5-second commit interval on my pools.

Tim
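P.S. Concretely, something like this, where "pool01" stands in for
whichever pool you are watching; ignore the very first block, since that
one is still the boot-to-now average:

  # zpool iostat -v pool01 5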
On 2013-01-18 00:42, Bob Friesenhahn wrote:

> You can install Brendan Gregg's DTraceToolkit and use it to find out who
> and what is doing all the writing. 1.2GB in an hour is quite a lot of
> writing. If this is going on continuously, then it may be causing more
> fragmentation in conjunction with your snapshots.

As a moderately wild guess, since you're speaking of galleries: are these
problematic filesystems often read? By default ZFS updates the last access
time of files it reads, as do many other filesystems, and this causes
avalanches of metadata updates - sync writes (likely) as well as
fragmentation. This may also account for a poorly traceable but
considerable amount of "used" space in frequent snapshots. You can verify
(and unset) this behaviour with the ZFS dataset property "atime", i.e.:

# zfs get atime pond/export/home
NAME              PROPERTY  VALUE  SOURCE
pond/export/home  atime     off    inherited from pond

On another note, verify where your software keeps its temporary files
(i.e. during uploads, as may be the case with galleries). Again, if this
is a frequently snapshotted dataset (though 1 hour is not really that
frequent), needless temp files can be held by those older snapshots.
Moving such temporary work to a different dataset with a different
snapshot schedule and/or to a different pool (to keep the related
fragmentation constrained) may prove useful.

HTH,
//Jim Klimov
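P.S. As a rough sketch only (the "gallery-tmp" dataset name is just an
illustration, not something you have):

  # zfs set atime=off pool01/utils/gallery
  # zfs create -o atime=off pool01/utils/gallery-tmp

and then point the application's upload/temp directory at the new dataset
and leave it out of the hourly snapshot rotation.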
Right on, Tim. Thanks. I didn't know that. I'm sure it's documented
somewhere and I should have read it, so double thanks for explaining it.

--
Peter Blajev
IT Manager, TAAZ Inc.
Office: 858-597-0512 x125

On Thu, Jan 17, 2013 at 4:18 PM, Timothy Coalson <tsc5yc at mst.edu> wrote:

> On Thu, Jan 17, 2013 at 5:33 PM, Peter Wood <peterwood.sd at gmail.com> wrote:
>
>> The 'zpool iostat -v' output is uncomfortably static. The values for
>> read/write operations and bandwidth are the same for hours and even days.
>> I'd expect at least some variation between morning and night. The load on
>> the servers is different for sure. Any input?
>
> Without a repetition interval parameter, zpool iostat prints exactly once
> and exits, and the output is an average from kernel boot until "now", just
> like iostat. That is why it seems so static. If you want to know the
> activity over 5-second intervals, use something like "zpool iostat -v 5"
> (repeat every 5 seconds) and wait for the second and later blocks; those
> are averages from the previous output until "now". I generally use
> 5-second intervals to match the 5-second commit interval on my pools.
>
> Tim
Great points, Jim. I have requested more information on how the gallery
share is being used, and any temporary data will be moved out of there.

About atime: it is set to "on" right now, and I've considered turning it
off, but I wasn't sure whether this would affect incremental zfs
send/receive.

'zfs send -i snapshot0 snapshot1' doesn't rely on atime, right?

On Thu, Jan 17, 2013 at 4:34 PM, Jim Klimov <jimklimov at cos.ru> wrote:

> On 2013-01-18 00:42, Bob Friesenhahn wrote:
>
>> You can install Brendan Gregg's DTraceToolkit and use it to find out who
>> and what is doing all the writing. 1.2GB in an hour is quite a lot of
>> writing. If this is going on continuously, then it may be causing more
>> fragmentation in conjunction with your snapshots.
>
> As a moderately wild guess, since you're speaking of galleries: are these
> problematic filesystems often read? By default ZFS updates the last access
> time of files it reads, as do many other filesystems, and this causes
> avalanches of metadata updates - sync writes (likely) as well as
> fragmentation. This may also account for a poorly traceable but
> considerable amount of "used" space in frequent snapshots. You can verify
> (and unset) this behaviour with the ZFS dataset property "atime", i.e.:
>
> # zfs get atime pond/export/home
> NAME              PROPERTY  VALUE  SOURCE
> pond/export/home  atime     off    inherited from pond
>
> On another note, verify where your software keeps its temporary files
> (i.e. during uploads, as may be the case with galleries). Again, if this
> is a frequently snapshotted dataset (though 1 hour is not really that
> frequent), needless temp files can be held by those older snapshots.
> Moving such temporary work to a different dataset with a different
> snapshot schedule and/or to a different pool (to keep the related
> fragmentation constrained) may prove useful.
>
> HTH,
> //Jim Klimov
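P.S. To be concrete, I mean incremental transfers of the form (snapshot
and host names here are placeholders):

  # zfs send -i pool01/utils/gallery@snap0 pool01/utils/gallery@snap1 | \
        ssh newbox zfs receive pool01/utils/gallery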
On Thu, 17 Jan 2013, Peter Wood wrote:

> Great points, Jim. I have requested more information on how the gallery
> share is being used, and any temporary data will be moved out of there.
>
> About atime: it is set to "on" right now, and I've considered turning it
> off, but I wasn't sure whether this would affect incremental zfs
> send/receive.
>
> 'zfs send -i snapshot0 snapshot1' doesn't rely on atime, right?

Zfs send does not care about atime. The access time is useless other than
as a way to see how long it has been since a file was accessed.

For local access (not true for NFS), zfs is lazy about updating atime on
disk, so it may not be updated on disk until the next transaction group is
written (e.g. up to 5 seconds later) and therefore does not represent much
actual load. Without this behavior, the system could become unusable.

For NFS you should disable atime on the NFS client mounts.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Thu, 17 Jan 2013, Bob Friesenhahn wrote:

> For NFS you should disable atime on the NFS client mounts.

This advice was wrong. It needs to be done on the server side.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
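P.S. I.e., on the server, something like this (using the pool and dataset
names from your earlier mail):

  # zfs set atime=off pool01
  # zfs get atime pool01/utils/gallery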
On Thu, Jan 17, 2013 at 4:48 PM, Peter Blajev <pblajev at taaz.com> wrote:

> Right on, Tim. Thanks. I didn't know that. I'm sure it's documented
> somewhere and I should have read it, so double thanks for explaining it.

When in doubt, always check the man page first:
  man zpool

It's listed in the section on the "iostat" sub-command:

  zpool iostat [-T d|u] [-v] [pool] ... [interval [count]]

    Displays I/O statistics for the given pools. When given an interval,
    the statistics are printed every interval seconds until Ctrl-C is
    pressed. If no pools are specified, statistics for every pool in the
    system is shown. If count is specified, the command exits after count
    reports are printed.

:D

--
Freddie Cash
fjwcash at gmail.com
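P.S. Putting that synopsis to work ("pool01" is whatever your pool is
called), this prints a date-stamped per-device block every 5 seconds,
12 times, and then exits:

  # zpool iostat -T d -v pool01 5 12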
On Fri, Jan 18, 2013 at 4:55 PM, Freddie Cash <fjwcash at gmail.com> wrote:

> On Thu, Jan 17, 2013 at 4:48 PM, Peter Blajev <pblajev at taaz.com> wrote:
>
>> Right on, Tim. Thanks. I didn't know that. I'm sure it's documented
>> somewhere and I should have read it, so double thanks for explaining it.
>
> When in doubt, always check the man page first:
>   man zpool
>
> It's listed in the section on the "iostat" sub-command:
>
>   zpool iostat [-T d|u] [-v] [pool] ... [interval [count]]
>
>     Displays I/O statistics for the given pools. When given an interval,
>     the statistics are printed every interval seconds until Ctrl-C is
>     pressed. If no pools are specified, statistics for every pool in the
>     system is shown. If count is specified, the command exits after count
>     reports are printed.

To my eye, that doesn't actually explain what the output is, only how to
get it to repeat. It seems to assume that one is familiar with iostat and
expects this to work the same way. So I can't really fault someone for
being confused by the output in this case (perhaps the man page could use
some clarification).

Tim