Posted from the wrong address the first time, sorry.

Is the speed of a 'zfs send' dependent on file size / number of files?

        We have a system with some large datasets (3.3 TB and about 35
million files) and conventional backups take a long time: using
NetBackup 6.5, a FULL takes between two and three days, and differential
incrementals, even with very few files changing, take between 15 and
20 hours. We already use snapshots for day-to-day restores, but we
need the 'real' backups for DR.

        I have been testing zfs send throughput and have not been
getting promising results. Note that this is NOT OpenSolaris, but
Solaris 10U6 (10/08) with the IDR for the "snapshot interrupts
resilver" bug.

Server: V480, 4 CPU, 16 GB RAM (test server; production is an M4000)
Storage: two SE-3511, each with one 512 GB LUN presented
Simple mirror layout:

pkraus@nyc-sted1:/IDR-test/ppk> zpool status
  pool: IDR-test
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Wed Jul  1 16:54:58 2009
config:

        NAME                                       STATE     READ WRITE CKSUM
        IDR-test                                   ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c6t600C0FF0000000000927852FB91AD308d0  ONLINE       0     0     0
            c6t600C0FF0000000000922614781B19008d0  ONLINE       0     0     0

errors: No known data errors

pkraus@nyc-sted1:/IDR-test/ppk> zfs list
NAME                          USED  AVAIL  REFER  MOUNTPOINT
IDR-test                      101G   399G  24.3M  /IDR-test
IDR-test@1250597527          96.8M      -   101M  -
IDR-test@1250604834          20.1M      -  24.3M  -
IDR-test@1250605236            16K      -  24.3M  -
IDR-test@1250605400            20K      -  24.3M  -
IDR-test@1250606582            20K      -  24.3M  -
IDR-test@1250612553            20K      -  24.3M  -
IDR-test@1250616026            20K      -  24.3M  -
IDR-test/dataset              101G   399G   100G  /IDR-test/dataset
IDR-test/dataset@1250597527   313K      -  87.1G  -
IDR-test/dataset@1250604834   266K      -  87.1G  -
IDR-test/dataset@1250605236   187M      -  88.2G  -
IDR-test/dataset@1250605400   192M      -  89.3G  -
IDR-test/dataset@1250606582   246K      -  95.4G  -
IDR-test/dataset@1250612553   233K      -  95.4G  -
IDR-test/dataset@1250616026   230K      -   100G  -

There are about 3.3 million files / directories in the 'dataset';
files range in size from 1 KB to 100 KB.

pkraus@nyc-sted1:/IDR-test/ppk> time sudo zfs send IDR-test/dataset@1250616026 >/dev/null

real    91m19.024s
user    0m0.022s
sys     11m51.422s

That works out to a little over 18 MB/sec (100 GB in 5,479 seconds)
and about 600 files/sec, which would mean almost 16 hours per TB --
better than NBU, but not by much.

I do not think the SE-3511 is limiting us, as I have seen much higher
throughput on them when resilvering one or more mirrors.

Any thoughts as to why I am not getting better throughput? Thanks.

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Designer, "The Pajama Game" @ Schenectady Light Opera Company
   ( http://www.sloctheater.org/ )
-> Technical Advisor, Lunacon 2010 ( http://www.lunacon.org/ )
-> Technical Advisor, RPI Players
> Is the speed of a 'zfs send' dependent on file size / number of files?

I am going to say no. I am running a backup rig on *far* inferior
iron, doing a send/recv over ssh through gigE, and last night's
replication reported:

"received 40.2GB stream in 3498 seconds (11.8MB/sec)"

I have seen it go as high as your figures, but it is usually somewhere
between that number and yours. I assumed this was a result of the ssh
overhead (arcfour yielded the best results of the ciphers I tried).

> There are about 3.3 million files / directories in the 'dataset';
> files range in size from 1 KB to 100 KB.

The number of files I am replicating would be ~100, so at least at
this scale the file count does not seem to matter.
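For reference, the basic pipeline I use looks like this (host and
dataset names here are stand-ins for my real ones):

    # incremental replication over ssh; arcfour is the cheapest cipher
    zfs send -i tank/data@yesterday tank/data@today | \
        ssh -c arcfour backuphost zfs receive -F backup/data

jlc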
On Tue, Aug 18, 2009 at 04:22:19PM -0400, Paul Kraus wrote:
> We have a system with some large datasets (3.3 TB and about 35
> million files) and conventional backups take a long time: using
> NetBackup 6.5, a FULL takes between two and three days, and differential
> incrementals, even with very few files changing, take between 15 and
> 20 hours. We already use snapshots for day-to-day restores, but we
> need the 'real' backups for DR.

zfs send will be very fast for "differential incrementals ... with very
few files changing", since zfs send is a block-level diff based on the
differences between the selected snapshots. Where a traditional backup
tool would have to traverse the entire filesystem (modulo pruning based
on ctime/mtime), zfs send simply traverses a list of changed blocks
that ZFS keeps up to date as you make the changes in the first place.
A minimal example of the two forms is sketched below.

For a *full* backup, zfs send and traditional backup tools will have
similar results, as both will be I/O bound and both will have more or
less the same number of I/Os to do.

Caveat: zfs send stream formats are not guaranteed to be backwards
compatible, so zfs send is not suitable for long-term backups.
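Concretely, assuming a hypothetical pool/fs with daily snapshots, the
two forms look like this:

    # full send: walks every block referenced by the snapshot
    zfs send pool/fs@monday > /backup/fs-full-monday.zfs

    # incremental send: walks only the blocks changed between snapshots
    zfs send -i pool/fs@monday pool/fs@tuesday > /backup/fs-incr-tuesday.zfs

Nico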
On Tue, Aug 18, 2009 at 22:22, Paul Kraus<pk1048@gmail.com> wrote:
> Posted from the wrong address the first time, sorry.
>
> Is the speed of a 'zfs send' dependent on file size / number of files?
>
> We have a system with some large datasets (3.3 TB and about 35
> million files) and conventional backups take a long time: using
> NetBackup 6.5, a FULL takes between two and three days, and differential
> incrementals, even with very few files changing, take between 15 and
> 20 hours. We already use snapshots for day-to-day restores, but we
> need the 'real' backups for DR.

Conventional backups can be faster than that! I have not used
NetBackup, but you should be able to configure it to run several
backup streams in parallel. You may have to point NetBackup at subdirs
instead of the file system root; a guess at what that would look like
follows.
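I have not verified this, but from what I have read the selection list
can be split into parallel streams with the NEW_STREAM directive once
"Allow multiple data streams" is enabled on the policy (the paths
below are only placeholders; check the NetBackup documentation):

    /pool/dataset/projects-a
    NEW_STREAM
    /pool/dataset/projects-b
    NEW_STREAM
    /pool/dataset/projects-c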
On Tue, Aug 18, 2009 at 7:54 PM, Mattias Pantzare<pantzer@ludd.ltu.se> wrote:
> On Tue, Aug 18, 2009 at 22:22, Paul Kraus<pk1048@gmail.com> wrote:
>> Posted from the wrong address the first time, sorry.
>>
>> Is the speed of a 'zfs send' dependent on file size / number of files?
>>
>> We have a system with some large datasets (3.3 TB and about 35
>> million files) and conventional backups take a long time: using
>> NetBackup 6.5, a FULL takes between two and three days, and differential
>> incrementals, even with very few files changing, take between 15 and
>> 20 hours. We already use snapshots for day-to-day restores, but we
>> need the 'real' backups for DR.
>
> Conventional backups can be faster than that! I have not used
> NetBackup, but you should be able to configure it to run several
> backup streams in parallel. You may have to point NetBackup at subdirs
> instead of the file system root.

This was discussed in another thread as well.

http://opensolaris.org/jive/thread.jspa?threadID=109751&tstart=0

In particular...

http://opensolaris.org/jive/thread.jspa?threadID=109751&tstart=0#405121
http://opensolaris.org/jive/thread.jspa?threadID=109751&tstart=0#404589
http://opensolaris.org/jive/thread.jspa?threadID=109751&tstart=0#405835
http://opensolaris.org/jive/thread.jspa?threadID=109751&tstart=0#405308

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Thank you for all your replies; I'm collecting my responses in one
message below.

On Tue, Aug 18, 2009 at 7:43 PM, Nicolas Williams
<Nicolas.Williams@sun.com> wrote:
> zfs send will be very fast for "differential incrementals ... with very
> few files changing", since zfs send is a block-level diff based on the
> differences between the selected snapshots. Where a traditional backup
> tool would have to traverse the entire filesystem (modulo pruning based
> on ctime/mtime), zfs send simply traverses a list of changed blocks
> that ZFS keeps up to date as you make the changes in the first place.

Our testing indicates that for an incremental zfs send the speed is
very good, and it seems to be bandwidth limited rather than limited by
file count. For example, while testing incremental sends I got the
following results:

~450,000 files sent, ~8.3 GB sent @ 690 files/sec. and 13 MB/sec.
~900,000 files sent,  ~13 GB sent @ 890 files/sec. and 13 MB/sec.
~450,000 files sent, ~4.6 GB sent @ 1,800 files/sec. and 19 MB/sec.

Full zfs sends produced:

~2.5 million files,  ~87 GB @ 500 files/sec. and 18 MB/sec.
~3.4 million files, ~100 GB @ 600 files/sec. and 19 MB/sec.

> For a *full* backup, zfs send and traditional backup tools will have
> similar results, as both will be I/O bound and both will have more or
> less the same number of I/Os to do.

The zfs send FULLs are in close agreement with what we are seeing with
a FULL NBU backup.

> Caveat: zfs send stream formats are not guaranteed to be backwards
> compatible, so zfs send is not suitable for long-term backups.

Yup, we only need them for 5 weeks, and when we upgrade the server
(and the ZFS version) we would need to do a new set of fulls.

On Tue, Aug 18, 2009 at 8:54 PM, Mattias Pantzare
<pantzer@ludd.ltu.se> wrote:
> Conventional backups can be faster than that! I have not used
> NetBackup, but you should be able to configure it to run several
> backup streams in parallel. You may have to point NetBackup at subdirs
> instead of the file system root.

We have over 180 filesystems on the production server right now, and
we are really trying to avoid any manual customization of the backup
policy. In a previous incarnation this data lived on a Mac OS X server
in one FS (only about 4 TB total at that point); full backups took so
long that we manually configured three NBU policies with many
individual directories ... it was a nightmare as new data (and
directories) were added.

On Tue, Aug 18, 2009 at 10:33 PM, Mike Gerdts <mgerdts@gmail.com> wrote:
> This was discussed in another thread as well.
>
> http://opensolaris.org/jive/thread.jspa?threadID=109751&tstart=0

Thanks for that pointer. I had missed that thread in my search; I just
hadn't hit the right keywords.

This thread got me thinking about our data layout. Currently the data
is broken up by both department and project: each department gets a
zpool, and each project within the department gets a dataset/zfs.
Departments range in size from one mirrored pair of LUNs (512 GB) to
11 mirrored pairs of LUNs (5.5 TB). Projects range from a few KB to
3.3 TB (and 33 million files).
The data is all relatively small (images of documents), but there are
many, many files.

Is there any throughput penalty for a dataset being part of a bigger
zpool? In other words, am I more likely to get better FULL throughput
if I move the data to a dedicated zpool instead of a child dataset? We
*can* change our model to assign each project a separate zpool, but
that would be wasteful of space. Perhaps we could move a given project
to its own zpool when it grows to a certain size (>1 TB maybe). But if
there would not be any performance advantage, it's not worth the
effort.

I had assumed that a full zfs send would just stream the underlying
zfs structure and not really deal with individual files, but if the
dataset is part of a shared zpool then I guess it has to look at the
files' metadata to determine whether a given file is part of that
dataset.

P.S. We are planning to move the back-end storage to JBODs (probably
J4400), but that is not where we are today, and we can't count on that
happening soon.

-- 
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Designer, "The Pajama Game" @ Schenectady Light Opera Company
   ( http://www.sloctheater.org/ )
-> Technical Advisor, Lunacon 2010 ( http://www.lunacon.org/ )
-> Technical Advisor, RPI Players
On Aug 18, 2009, at 1:16 PM, Paul Kraus wrote:

> Is the speed of a 'zfs send' dependent on file size / number of files?

Not directly. It is dependent on the amount of change per unit time.

> We have a system with some large datasets (3.3 TB and about 35
> million files) and conventional backups take a long time: using
> NetBackup 6.5, a FULL takes between two and three days, and differential
> incrementals, even with very few files changing, take between 15 and
> 20 hours. We already use snapshots for day-to-day restores, but we
> need the 'real' backups for DR.

This is quite common.

> I have been testing zfs send throughput and have not been
> getting promising results. Note that this is NOT OpenSolaris, but
> Solaris 10U6 (10/08) with the IDR for the "snapshot interrupts
> resilver" bug.

You will need to do this in parallel; the general idea is sketched
below. We have had some discussions about a possible white paper on
this topic, but as yet there is no funding, so it will remain in the
world of professional services for the time being.
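The skeleton is simple: one full send per dataset, several running
concurrently. Illustration only; the dataset names, snapshot date, and
backup paths are made up:

    #!/bin/ksh
    # one zfs send per dataset, all running in the background
    for ds in dept-a/proj1 dept-a/proj2 dept-b/proj3; do
        zfs send "$ds@20090818" > "/backup/$(echo $ds | tr / _).zfs" &
    done
    wait    # block until every send completes

-- richard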
Paul Kraus wrote:
> There are about 3.3 million files / directories in the 'dataset';
> files range in size from 1 KB to 100 KB.
>
> pkraus@nyc-sted1:/IDR-test/ppk> time sudo zfs send IDR-test/dataset@1250616026 >/dev/null
>
> real    91m19.024s
> user    0m0.022s
> sys     11m51.422s
>
> That works out to a little over 18 MB/sec (100 GB in 5,479 seconds)
> and about 600 files/sec, which would mean almost 16 hours per TB --
> better than NBU, but not by much.
>
> I do not think the SE-3511 is limiting us, as I have seen much higher
> throughput on them when resilvering one or more mirrors.
>
> Any thoughts as to why I am not getting better throughput?

With Solaris 10U7 I see about 35 MB/sec between Thumpers using a direct
socket connection rather than ssh for full sends, and 7-12 MB/sec for
incrementals, depending on the data set.

-- 
Ian.
> With Solaris 10U7 I see about 35 MB/sec between Thumpers using a direct
> socket connection rather than ssh for full sends, and 7-12 MB/sec for
> incrementals, depending on the data set.

Ian,
What's the syntax you use for this procedure?
Joseph L. Casale wrote:
>> With Solaris 10U7 I see about 35 MB/sec between Thumpers using a direct
>> socket connection rather than ssh for full sends, and 7-12 MB/sec for
>> incrementals, depending on the data set.
>
> Ian,
> What's the syntax you use for this procedure?

I have my own application that uses large circular buffers and a
socket connection between hosts. The buffers keep data flowing during
ZFS writes, and the direct connection cuts out ssh. A rough
approximation with off-the-shelf tools is sketched below.
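This is not my application, just the general shape of the plumbing,
using mbuffer (host, port, and dataset names are invented):

    # receiving host: listen on a TCP port, buffer up to 1 GB in RAM
    mbuffer -I 9090 -s 128k -m 1G | zfs receive -F backup/data

    # sending host: stream the snapshot into the buffered socket
    zfs send tank/data@today | mbuffer -O recvhost:9090 -s 128k -m 1G

-- 
Ian.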
> I have my own application that uses large circular buffers and a
> socket connection between hosts. The buffers keep data flowing during
> ZFS writes, and the direct connection cuts out ssh.

Application, as in not script (something you can share)? :)

jlc
Joseph L. Casale wrote:
>> I have my own application that uses large circular buffers and a
>> socket connection between hosts. The buffers keep data flowing during
>> ZFS writes, and the direct connection cuts out ssh.
>
> Application, as in not script (something you can share)?

Not yet!

-- 
Ian.