ZFS send is very slow.

The dmu_sendbackup function traverses the dataset in a single thread, and in
the traverse callback (backup_cb) we wait for data in arc_read called with
the ARC_WAIT flag.

I want to parallelize zfs send to make it faster. dmu_sendbackup could
allocate a buffer to be used for buffering output. A few threads could
traverse the dataset, and a few more could handle async read operations.

I think this could speed up the zfs send operation 10x.

What do you think about it?
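A rough user-space sketch of the idea (toy code with made-up names and sizes,
not the actual dmu_sendbackup changes): a few reader threads fill a small
bounded buffer in parallel, and a single writer drains it, so the writer never
sits in a synchronous read the way backup_cb does today.

/*
 * Toy illustration only -- NOT actual ZFS code.  Several reader threads
 * fetch blocks in parallel and hand them to one writer thread through a
 * small bounded buffer.  All names and sizes are made up.
 */
#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define	BLKSZ		(128 * 1024)	/* pretend record payload size */
#define	NREADERS	4		/* parallel "traverse/read" threads */
#define	NBLOCKS		64		/* blocks in this toy dataset */
#define	QDEPTH		8		/* bounded output buffer depth */

struct item { int id; char data[BLKSZ]; };

static struct item queue[QDEPTH];
static int qhead, qcount;		/* circular buffer state */
static int next_id;			/* next block to "read" */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t notfull = PTHREAD_COND_INITIALIZER;
static pthread_cond_t notempty = PTHREAD_COND_INITIALIZER;

static void *
reader(void *arg)
{
	struct item it;

	(void) arg;
	for (;;) {
		pthread_mutex_lock(&lock);
		if (next_id >= NBLOCKS) {	/* dataset fully traversed */
			pthread_mutex_unlock(&lock);
			return (NULL);
		}
		it.id = next_id++;
		pthread_mutex_unlock(&lock);

		/* Stand-in for the block read; done outside the lock. */
		memset(it.data, it.id & 0xff, BLKSZ);

		pthread_mutex_lock(&lock);
		while (qcount == QDEPTH)	/* output buffer full */
			pthread_cond_wait(&notfull, &lock);
		queue[(qhead + qcount) % QDEPTH] = it;
		qcount++;
		pthread_cond_signal(&notempty);
		pthread_mutex_unlock(&lock);
	}
}

int
main(void)
{
	pthread_t tid[NREADERS];
	struct item it;
	int i, written;

	for (i = 0; i < NREADERS; i++)
		pthread_create(&tid[i], NULL, reader, NULL);

	/* Single writer drains the buffer in order of arrival. */
	for (written = 0; written < NBLOCKS; written++) {
		pthread_mutex_lock(&lock);
		while (qcount == 0)		/* wait for a buffered block */
			pthread_cond_wait(&notempty, &lock);
		it = queue[qhead];
		qhead = (qhead + 1) % QDEPTH;
		qcount--;
		pthread_cond_signal(&notfull);
		pthread_mutex_unlock(&lock);

		(void) write(STDOUT_FILENO, it.data, BLKSZ);
	}

	for (i = 0; i < NREADERS; i++)
		pthread_join(tid[i], NULL);
	return (0);
}

The same decoupling inside dmu_sendbackup would let the reads overlap instead
of being issued one at a time with ARC_WAIT.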
Hello Łukasz,

Monday, July 23, 2007, 1:19:16 PM, you wrote:

> ZFS send is very slow.
>
> The dmu_sendbackup function traverses the dataset in a single thread, and
> in the traverse callback (backup_cb) we wait for data in arc_read called
> with the ARC_WAIT flag.
>
> I want to parallelize zfs send to make it faster. dmu_sendbackup could
> allocate a buffer to be used for buffering output. A few threads could
> traverse the dataset, and a few more could handle async read operations.
>
> I think this could speed up the zfs send operation 10x.
>
> What do you think about it?

I guess you should check with Matthew Ahrens as IIRC he's working on
'zfs send -r' and possibly some other improvements to zfs send. The
question is what code changes Matthew has done so far (it hasn't been
integrated AFAIK) and possibly work from there. Or perhaps Matthew is
already working on it also...

Now, if the pool resides on lots of disks then I guess this should speed up
zfs send considerably, at least in some cases (lots of small files,
written/deleted/created randomly). It would be great if you could implement
something and share some results with us, to see if there is actually a
performance gain.

Also, I guess you'll have to write all transactions to the other end
(zfs recv) in the same order they were created on disk, or not?

ps. Lukasz - nice to see you here more and more :)

--
Best regards,
Robert                       mailto:rmilkowski at task.gda.pl
                             http://milek.blogspot.com
Robert Milkowski wrote:
> Hello Łukasz,
>
> Monday, July 23, 2007, 1:19:16 PM, you wrote:
>
>> ZFS send is very slow.
>>
>> The dmu_sendbackup function traverses the dataset in a single thread, and
>> in the traverse callback (backup_cb) we wait for data in arc_read called
>> with the ARC_WAIT flag.

That's correct.

>> I want to parallelize zfs send to make it faster. dmu_sendbackup could
>> allocate a buffer to be used for buffering output. A few threads could
>> traverse the dataset, and a few more could handle async read operations.
>>
>> I think this could speed up the zfs send operation 10x.
>>
>> What do you think about it?

You're right that we need to issue more I/Os in parallel -- see 6333409
"traversal code should be able to issue multiple reads in parallel".

However, it may be much more straightforward to just issue prefetches
appropriately, rather than attempt to coordinate multiple threads. That
said, feel free to experiment.

> I guess you should check with Matthew Ahrens as IIRC he's working on
> 'zfs send -r' and possibly some other improvements to zfs send. The
> question is what code changes Matthew has done so far (it hasn't been
> integrated AFAIK) and possibly work from there. Or perhaps Matthew is
> already working on it also...

Unfortunately I am not working on this bug as part of my "zfs send -r"
changes. But I plan to work on it (unless you get to it first!) later this
year as part of the pool space reduction changes.

> Also, I guess you'll have to write all transactions to the other end
> (zfs recv) in the same order they were created on disk, or not?

Nope, that's (one of) the beauties of zfs send.

--matt
>>> I want to parallelize zfs send to make it faster. dmu_sendbackup could
>>> allocate a buffer to be used for buffering output. A few threads could
>>> traverse the dataset, and a few more could handle async read operations.
>>>
>>> I think this could speed up the zfs send operation 10x.
>>>
>>> What do you think about it?
>
> You're right that we need to issue more I/Os in parallel -- see 6333409
> "traversal code should be able to issue multiple reads in parallel".

When do you think it will be available?

> However, it may be much more straightforward to just issue prefetches
> appropriately, rather than attempt to coordinate multiple threads. That
> said, feel free to experiment.

How can I prefetch data? Traverse the dataset in a second thread?

Correct me if I'm wrong: adding simple buffering could speed up the send
operation. Right now we call the vn_rdwr() function for every record.

What do you think about a smaller dmu_replay_record_t struct? Remove

    char drr_toname[MAXNAMELEN];

from the drr_begin struct, and for the DRR_BEGIN command read/write the
MAXNAMELEN bytes separately.
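Something along these lines -- a user-space analogue only, with made-up names;
in the kernel the flush would go through the existing vn_rdwr() call:

/*
 * Minimal sketch of the buffering idea, NOT actual dmu_sendbackup() code:
 * instead of one write call per replay record, records are copied into a
 * staging buffer and flushed in large chunks.
 */
#include <string.h>
#include <unistd.h>

#define	SENDBUF_SIZE	(1024 * 1024)	/* made-up staging buffer size */

static char sendbuf[SENDBUF_SIZE];
static size_t sendbuf_used;

static void
send_flush(int fd)
{
	if (sendbuf_used > 0) {
		/* In the kernel this would be the single vn_rdwr() call. */
		(void) write(fd, sendbuf, sendbuf_used);
		sendbuf_used = 0;
	}
}

/* Queue one record; only issue a real write when the buffer fills up. */
static void
send_record(int fd, const void *rec, size_t len)
{
	if (len >= SENDBUF_SIZE) {	/* oversized payload: write through */
		send_flush(fd);
		(void) write(fd, rec, len);
		return;
	}
	if (sendbuf_used + len > SENDBUF_SIZE)
		send_flush(fd);
	memcpy(sendbuf + sendbuf_used, rec, len);
	sendbuf_used += len;
}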
Łukasz wrote:
>> You're right that we need to issue more I/Os in parallel -- see 6333409
>> "traversal code should be able to issue multiple reads in parallel".
>
> When do you think it will be available?

Perhaps by the end of the calendar year, but perhaps longer. Maybe sooner if
you work on it :-)

>> However, it may be much more straightforward to just issue prefetches
>> appropriately, rather than attempt to coordinate multiple threads. That
>> said, feel free to experiment.
>
> How can I prefetch data? Traverse the dataset in a second thread?

No; see dmu_prefetch().

> Correct me if I'm wrong: adding simple buffering could speed up the send
> operation. Right now we call the vn_rdwr() function for every record.

Perhaps; try timing with "zfs send ... > /dev/null". However much faster that
is than sending it to your preferred location is the maximum amount of
performance to be gained.

> What do you think about a smaller dmu_replay_record_t struct? Remove
>
>     char drr_toname[MAXNAMELEN];
>
> from the drr_begin struct, and for the DRR_BEGIN command read/write the
> MAXNAMELEN bytes separately.

Yeah, that would be nice. But it would sure be nice to be able to still read
the old-style records too.

--matt
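A hypothetical fragment of what issuing a prefetch from the traverse callback
could look like -- the helper below is made up for illustration, and the
four-argument dmu_prefetch() prototype should be checked against dmu.h in
your tree:

/*
 * Hypothetical fragment, not the actual backup_cb() code: before blocking
 * on the current block, ask the DMU to start reading the next few blocks
 * of the object asynchronously.
 */
#include <sys/dmu.h>

static void
send_prefetch_ahead(objset_t *os, uint64_t object, uint64_t blkid,
    uint64_t blksz, uint64_t nblks)
{
	/* Kick off async reads for the blocks we will need next. */
	dmu_prefetch(os, object, (blkid + 1) * blksz, nblks * blksz);
}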
Łukasz K wrote:
> Hello Matthew,
>
> I have problems with pool fragmentation.
> http://www.opensolaris.org/jive/thread.jspa?threadID=34810
>
> Now I want to speed up zfs send, because our pool space maps are
> huge - after sending, the space maps will be smaller (from 1GB -> 50MB).
>
> As I understand it, there will not be anything like defragmentation,

We will be implementing defragmentation with the device removal feature,
perhaps by the end of the calendar year.

> so I need to live with this. But some changes could help:
> 1. Auto-tune the recordsize: when the pool is out of 128kB blocks, it
>    should use smaller ones.
> 2. We should be more careful with unloading space maps. I have enough
>    RAM to keep the metaslabs in memory.
>
> Can you help me with changing these algorithms?

See where metaslab_sync_done() calls space_map_unload().

--matt
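A hypothetical sketch of that kind of change -- not the actual metaslab.c
code, and the tunable name is made up -- would be to gate the unload behind
a switch:

/*
 * Hypothetical sketch only: keep space maps resident by skipping the
 * space_map_unload() that metaslab_sync_done() otherwise performs,
 * controlled by a made-up tunable.
 */
#include <sys/space_map.h>

int metaslab_keep_maps_loaded = 0;	/* hypothetical tunable */

static void
metaslab_maybe_unload(space_map_t *sm)
{
	/* Leave the map in core if the operator asked us to. */
	if (!metaslab_keep_maps_loaded)
		space_map_unload(sm);
}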