I had a spare piece of hardware sitting around, so I thought I'd test btrfs performance with the Cyrus IMAPd server by setting up an extra replica target on the spare machine.

Some background on Cyrus replication: when copying a folder, the replication system first "reserves" all the messages it's going to need. It tries to maintain "single instance store", as it's called in Cyrus terminology - hard links between identical messages on disk. In the latest version of Cyrus this is done by storing the sha1 of each file in an index, and scanning the currently active mailboxes on the replica to see if they already have a copy of the file. If so, a hard link is made in the data/sync./$pid/ directory back to the original file in the mailbox directory.

Cyrus stores one file per email, which pushes filesystems pretty hard. We used reiser3 until recently, and are part way through converting to ext4.

If the file is not already available on the replica, a new copy is uploaded directly into the sync./$pid directory. Either way, when the mailbox is then created or updated, the files get hard-linked from the sync./$pid directory to their final location. They get kept around for a little while, until the sync_server decides it's time for a reset because it's using too much memory keeping all the tracking data. Then it unlinks all the files in sync./$pid and starts searching for necessary files again. Most of the time this means single instance store works - the source and destination mailboxes always get heated up by adding both of them to the sync log, so the duplication will be found.

-----------------

Anyway, that's the background - a daemon that creates a pile of files in one directory, hard-links them out all over the file system, then unlinks all the original files later.

We're finding that as the filesystem grows (currently about 30% full on a 300Gb filesystem) the unlink performance becomes horrible. Watching iostat, there's a lot of reading going on as well.
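The staging-and-hard-link pattern described above can be sketched like this (illustrative Python under assumed directory names, not actual Cyrus code):

```python
import hashlib
import os
import tempfile

# A message is staged once under a content-derived name (like the
# sha1-indexed files in data/sync./$pid/), hard-linked into each
# destination mailbox, and the staging copy is later unlinked.
root = tempfile.mkdtemp()
stage = os.path.join(root, "sync.1234")      # stands in for data/sync./$pid/
inbox = os.path.join(root, "user", "inbox")  # a destination mailbox
os.makedirs(stage)
os.makedirs(inbox)

body = b"Subject: hi\r\n\r\nhello\r\n"
staged = os.path.join(stage, hashlib.sha1(body).hexdigest())
with open(staged, "wb") as f:
    f.write(body)
    os.fsync(f.fileno())

delivered = os.path.join(inbox, "1.")
os.link(staged, delivered)  # single instance store: same inode, no copy
assert os.stat(delivered).st_nlink == 2

os.unlink(staged)           # the later pile of unlinks, all in one directory
assert os.stat(delivered).st_nlink == 1  # the mailbox copy survives
```

The unlink load described in the thread is this last step repeated for every staged file in the sync./$pid directory at reset time.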
It really looks like the unlinks are performing pretty badly in this one case. Ideally there would be a nice filesystem API Cyrus could call that said "delete all the files in this directory"! Failing that, is there anything we can do to improve this use case?

Real-time production use isn't QUITE so bad as an initial sync, but lmtp delivery uses the same method - spool to a staging file, parse it there, then hard-link to all the delivery targets before unlinking the original.

Thanks,

Bron.

--
Bron Gondwana
brong@fastmail.fm

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Just posting this again more neatly formatted and just the 'meat':

a) program creates piles of small temporary files, hard
   links them out to different directories, unlinks the
   originals.

b) filesystem size: ~ 300Gb (backed by hardware RAID5)

c) as the filesystem grows (currently about 30% full)
   the unlink performance becomes horrible. Watching
   iostat, there's a lot of reading going on as well.

Is this expected? Is there anything we can do about it?
(short of rewriting Cyrus replication)

Thanks,

Bron.
Chris Mason
2010-Nov-16 13:38 UTC
Re: Poor performance unlinking hard-linked files (repost)
Excerpts from Bron Gondwana's message of 2010-11-16 07:54:45 -0500:
> Just posting this again more neatly formatted and just the
> 'meat':
>
> a) program creates piles of small temporary files, hard
>    links them out to different directories, unlinks the
>    originals.
>
> b) filesystem size: ~ 300Gb (backed by hardware RAID5)
>
> c) as the filesystem grows (currently about 30% full)
>    the unlink performance becomes horrible. Watching
>    iostat, there's a lot of reading going on as well.
>
> Is this expected? Is there anything we can do about it?
> (short of rewriting Cyrus replication)

Hi,

It sounds like the unlink speed is limited by the reading, and the reads are coming from one of two places. We're either reading to cache cold block groups or we're reading to find the directory entries.

Could you sysrq-w while the performance is bad? That would narrow it down.

Josef has the reads for caching block groups fixed, but we'll have to look hard at the reads for the rest of unlink.

-chris
Bron Gondwana
2010-Nov-17 04:11 UTC
Re: Poor performance unlinking hard-linked files (repost)
On Tue, Nov 16, 2010 at 08:38:13AM -0500, Chris Mason wrote:
> Excerpts from Bron Gondwana's message of 2010-11-16 07:54:45 -0500:
> > Just posting this again more neatly formatted and just the
> > 'meat':
> >
> > a) program creates piles of small temporary files, hard
> >    links them out to different directories, unlinks the
> >    originals.
> >
> > b) filesystem size: ~ 300Gb (backed by hardware RAID5)
> >
> > c) as the filesystem grows (currently about 30% full)
> >    the unlink performance becomes horrible. Watching
> >    iostat, there's a lot of reading going on as well.
> >
> > Is this expected? Is there anything we can do about it?
> > (short of rewriting Cyrus replication)
>
> Hi,
>
> It sounds like the unlink speed is limited by the reading, and the reads
> are coming from one of two places. We're either reading to cache cold
> block groups or we're reading to find the directory entries.

All the unlinks for a single process will be happening in the same directory (though the hard linked copies will be all over).

> Could you sysrq-w while the performance is bad? That would narrow it
> down.

Here's one:

http://pastebin.com/Tg7agv42

> Josef has the reads for caching block groups fixed, but we'll have to
> look hard at the reads for the rest of unlink.

I suspect you may want a couple more before you have enough data. I could set up a job to run one every 10 minutes for a couple of hours or something.

There will be at least two, possibly three threads of "sync_server" running on this particular server instance. It has two btrfs partitions - a 15Gb partition on RAID1 and a 300Gb partition on RAID5. All the unlinks will be happening to the RAID5 one.
Bron.

( our usual fully loaded server might have up to 40 of these pairs over 12 separate RAID sets, so anything that doesn't scale out to lots of filesystems would make us pretty sad too )
Bron Gondwana
2010-Nov-17 09:56 UTC
Re: Poor performance unlinking hard-linked files (repost)
On Wed, Nov 17, 2010 at 03:11:48PM +1100, Bron Gondwana wrote:
> > Could you sysrq-w while the performance is bad? That would narrow it
> > down.
>
> Here's one:
>
> http://pastebin.com/Tg7agv42

And here's another one, inline this time. The iostat for the 10 seconds just before said: (iostat -x 10 10)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          32.43    0.00   31.63   21.84    0.00   14.09

Device:  rrqm/s  wrqm/s     r/s     w/s    rsec/s    wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda        0.00    0.70    1.30    0.20     25.60      7.20    21.87     0.15  348.27  33.07   4.96
sda1       0.00    0.70    1.30    0.20     25.60      7.20    21.87     0.15  348.27  33.07   4.96
sda2       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sda3       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sda5       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sda6       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sda7       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sda8       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdb        0.00    6.00    0.10   75.20      0.80    860.80    11.44     0.01    0.15   0.06   0.48
sdb1       0.00    5.20    0.10    3.80      0.80     72.00    18.67     0.00    0.41   0.31   0.12
sdb2       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdb5       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdb6       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdb7       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdb8       0.00    0.80    0.00   71.40      0.00    788.80    11.05     0.01    0.13   0.05   0.36
sdc        0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdc1       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdc2       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdc3       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdc4       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdd        0.00    2.40  121.80  252.40  43012.00  10223.20   142.26     2.61    6.76   1.24  46.56
sdd1       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdd2       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdd3       0.00    0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00   0.00   0.00
sdd4       0.00    2.40  121.80  252.40  43012.00  10223.20   142.26     2.61    6.76   1.24  46.56

(sdb8 and sdd4 are the meta and data partitions respectively - sdd4 is where all the interesting stuff is happening)

By the way - we're running with the deadline scheduler, I'm pretty sure. Let me know if that's silly...

[533206.344314] SysRq : Show Blocked State
[533206.344376]   task                        PC stack   pid father
[533206.344500] sync_server   D 0000000107f0e028     0 17027  10416 0x00020000
[533206.344564]  ffff88016c6898a8 0000000000200046 ffff88016c688010 ffff88022a153d00
[533206.344671]  ffff88016c689fd8 ffff88016c689fd8 0000000000013300 0000000000013300
[533206.344779]  0000000000013300 0000000000013300 0000000000013300 0000000000013300
[533206.344886] Call Trace:
[533206.344948]  [<ffffffff817c17dd>] io_schedule+0x4d/0x70
[533206.345005]  [<ffffffff81093e4d>] sync_page+0x3d/0x70
[533206.345061]  [<ffffffff817c1cfa>] __wait_on_bit+0x5a/0x90
[533206.345116]  [<ffffffff81093e10>] ? sync_page+0x0/0x70
[533206.345170]  [<ffffffff810940af>] wait_on_page_bit+0x6f/0x80
[533206.345227]  [<ffffffff8105dbd0>] ? wake_bit_function+0x0/0x40
[533206.345287]  [<ffffffff81278878>] ? submit_one_bio+0x88/0xa0
[533206.345341]  [<ffffffff8127cd2d>] read_extent_buffer_pages+0x4ed/0x530
[533206.345401]  [<ffffffff81254a30>] ? btree_get_extent+0x0/0x1a0
[533206.345456]  [<ffffffff8125490e>] btree_read_extent_buffer_pages+0x5e/0xc0
[533206.345512]  [<ffffffff81255406>] read_tree_block+0x56/0x80
[533206.345569]  [<ffffffff8123a235>] read_block_for_search+0x105/0x3d0
[533206.345626]  [<ffffffff81289869>] ? btrfs_tree_unlock+0x59/0x60
[533206.345680]  [<ffffffff81239ec5>] ? unlock_up+0x145/0x160
[533206.345735]  [<ffffffff81242602>] btrfs_search_slot+0x412/0x880
[533206.345792]  [<ffffffff8124351a>] btrfs_insert_empty_items+0x6a/0xd0
[533206.345850]  [<ffffffff810cc462>] ? kmem_cache_alloc+0x92/0xf0
[533206.345905]  [<ffffffff81254039>] btrfs_insert_inode_ref+0x79/0x190
[533206.345962]  [<ffffffff812627e1>] btrfs_add_link+0x121/0x1a0
[533206.346017]  [<ffffffff817c1f39>] ? mutex_unlock+0x9/0x10
[533206.346071]  [<ffffffff8126289e>] btrfs_add_nondir+0x3e/0x70
[533206.346126]  [<ffffffff81262fe2>] btrfs_link+0xe2/0x180
[533206.346182]  [<ffffffff810dead1>] vfs_link+0x101/0x160
[533206.346237]  [<ffffffff810e1f51>] sys_linkat+0x131/0x150
[533206.346293]  [<ffffffff810e1f89>] sys_link+0x19/0x20
[533206.346349]  [<ffffffff8102cc83>] ia32_sysret+0x0/0x5
[533206.346408] sync_server   D 0000000107f0e03c     0  5431  10416 0x00020000
[533206.346470]  ffff8800ca13bc58 0000000000200046 ffff8800ca13a010 ffff8801ea888000
[533206.346577]  ffff8800ca13bfd8 ffff8800ca13bfd8 0000000000013300 0000000000013300
[533206.347724]  0000000000013300 0000000000013300 0000000000013300 0000000000013300
[533206.347830] Call Trace:
[533206.347883]  [<ffffffff817c17dd>] io_schedule+0x4d/0x70
[533206.347937]  [<ffffffff810fdbb5>] sync_buffer+0x45/0x50
[533206.347992]  [<ffffffff817c1cfa>] __wait_on_bit+0x5a/0x90
[533206.348004]  [<ffffffff810fdb70>] ? sync_buffer+0x0/0x50
[533206.348004]  [<ffffffff810fdb70>] ? sync_buffer+0x0/0x50
[533206.348004]  [<ffffffff817c1da4>] out_of_line_wait_on_bit+0x74/0x90
[533206.348004]  [<ffffffff8105dbd0>] ? wake_bit_function+0x0/0x40
[533206.348004]  [<ffffffff810fdae6>] __wait_on_buffer+0x26/0x30
[533206.348004]  [<ffffffff81256e78>] write_dev_supers+0x238/0x310
[533206.348004]  [<ffffffff81257152>] write_all_supers+0x202/0x280
[533206.348004]  [<ffffffff812571de>] write_ctree_super+0xe/0x10
[533206.348004]  [<ffffffff8128f687>] btrfs_sync_log+0x3a7/0x5c0
[533206.348004]  [<ffffffff81267e27>] btrfs_sync_file+0x187/0x1b0
[533206.348004]  [<ffffffff810fa6e1>] vfs_fsync_range+0x81/0xa0
[533206.348004]  [<ffffffff810fa767>] vfs_fsync+0x17/0x20
[533206.348004]  [<ffffffff810fa7a5>] do_fsync+0x35/0x60
[533206.348004]  [<ffffffff810fa7fb>] sys_fsync+0xb/0x10
[533206.348004]  [<ffffffff8102cc83>] ia32_sysret+0x0/0x5
[533206.348004] Sched Debug Version: v0.09, 2.6.36-dev64 #1
[533206.348004] now at 533206348.885250 msecs
[533206.348004]   .jiffies                         : 4428193883
[533206.348004]   .sysctl_sched_latency            : 12.000000
[533206.348004]   .sysctl_sched_min_granularity    : 1.500000
[533206.348004]   .sysctl_sched_wakeup_granularity : 2.000000
[533206.348004]   .sysctl_sched_child_runs_first   : 0
[533206.348004]   .sysctl_sched_features           : 15471
[533206.348004]   .sysctl_sched_tunable_scaling    : 1 (logaritmic)
[533206.348004]
[533206.348004] cpu#0, 3000.402 MHz
[533206.348004]   .nr_running          : 0
[533206.348004]   .load                : 0
[533206.348004]   .nr_switches         : 22546403
[533206.348004]   .nr_load_updates     : 133301585
[533206.348004]   .nr_uninterruptible  : 2
[533206.348004]   .next_balance        : 4428.193884
[533206.348004]   .curr->pid           : 0
[533206.348004]   .clock               : 533206348.006654
[533206.348004]   .cpu_load[0]         : 0
[533206.348004]   .cpu_load[1]         : 0
[533206.348004]   .cpu_load[2]         : 32
[533206.348004]   .cpu_load[3]         : 147
[533206.348004]   .cpu_load[4]         : 225
[533206.348004]   .yld_count           : 0
[533206.348004]   .sched_switch        : 0
[533206.348004]   .sched_count         : 25635109
[533206.348004]   .sched_goidle        : 8442206
[533206.348004]   .avg_idle            : 891600
[533206.348004]   .ttwu_count          : 11929488
[533206.348004]   .ttwu_local          : 7108567
[533206.348004]   .bkl_count           : 2862
[533206.348004]
[533206.348004] cfs_rq[0]:
[533206.348004]   .exec_clock          : 4785380.650156
[533206.348004]   .MIN_vruntime        : 0.000001
[533206.348004]   .min_vruntime        : 4266012.055723
[533206.348004]   .max_vruntime        : 0.000001
[533206.348004]   .spread              : 0.000000
[533206.348004]   .spread0             : 0.000000
[533206.348004]   .nr_running          : 0
[533206.348004]   .load                : 0
[533206.348004]   .nr_spread_over      : 525
[533206.348004]
[533206.348004] rt_rq[0]:
[533206.348004]   .rt_nr_running       : 0
[533206.348004]   .rt_throttled        : 0
[533206.348004]   .rt_time             : 0.000000
[533206.348004]   .rt_runtime          : 950.000000
[533206.348004]
[533206.348004] runnable tasks:
[533206.348004]             task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
[533206.348004] ----------------------------------------------------------------------------------------------------------
[533206.348004]
[533206.348004] cpu#1, 3000.402 MHz
[533206.348004]   .nr_running          : 1
[533206.348004]   .load                : 1024
[533206.348004]   .nr_switches         : 20052917
[533206.348004]   .nr_load_updates     : 133301525
[533206.348004]   .nr_uninterruptible  : 0
[533206.348004]   .next_balance        : 4428.193883
[533206.348004]   .curr->pid           : 6175
[533206.348004]   .clock               : 533206344.023423
[533206.348004]   .cpu_load[0]         : 0
[533206.348004]   .cpu_load[1]         : 0
[533206.348004]   .cpu_load[2]         : 15
[533206.348004]   .cpu_load[3]         : 133
[533206.348004]   .cpu_load[4]         : 330
[533206.348004]   .yld_count           : 0
[533206.348004]   .sched_switch        : 0
[533206.348004]   .sched_count         : 24068035
[533206.348004]   .sched_goidle        : 6629197
[533206.348004]   .avg_idle            : 881626
[533206.348004]   .ttwu_count          : 10794852
[533206.348004]   .ttwu_local          : 8391194
[533206.348004]   .bkl_count           : 2823
[533206.348004]
[533206.348004] cfs_rq[1]:
[533206.348004]   .exec_clock          : 4041404.226026
[533206.348004]   .MIN_vruntime        : 0.000001
[533206.348004]   .min_vruntime        : 4070860.187615
[533206.348004]   .max_vruntime        : 0.000001
[533206.348004]   .spread              : 0.000000
[533206.348004]   .spread0             : -195151.868108
[533206.348004]   .nr_running          : 1
[533206.348004]   .load                : 1024
[533206.348004]   .nr_spread_over      : 615
[533206.348004]
[533206.348004] rt_rq[1]:
[533206.348004]   .rt_nr_running       : 0
[533206.348004]   .rt_throttled        : 0
[533206.348004]   .rt_time             : 0.000000
[533206.348004]   .rt_runtime          : 950.000000
[533206.348004]
[533206.348004] runnable tasks:
[533206.348004]             task   PID         tree-key  switches  prio     exec-runtime         sum-exec        sum-sleep
[533206.348004] ----------------------------------------------------------------------------------------------------------
[533206.348004] R           bash  6175   4070854.187615        73   120   4070854.187615        32.744445     49348.100455
Chris Mason
2010-Nov-18 15:30 UTC
Re: Poor performance unlinking hard-linked files (repost)
Excerpts from Bron Gondwana's message of 2010-11-16 23:11:48 -0500:
> On Tue, Nov 16, 2010 at 08:38:13AM -0500, Chris Mason wrote:
> > Excerpts from Bron Gondwana's message of 2010-11-16 07:54:45 -0500:
> > > Just posting this again more neatly formatted and just the
> > > 'meat':
> > >
> > > a) program creates piles of small temporary files, hard
> > >    links them out to different directories, unlinks the
> > >    originals.
> > >
> > > b) filesystem size: ~ 300Gb (backed by hardware RAID5)
> > >
> > > c) as the filesystem grows (currently about 30% full)
> > >    the unlink performance becomes horrible. Watching
> > >    iostat, there's a lot of reading going on as well.
> > >
> > > Is this expected? Is there anything we can do about it?
> > > (short of rewriting Cyrus replication)
> >
> > Hi,
> >
> > It sounds like the unlink speed is limited by the reading, and the reads
> > are coming from one of two places. We're either reading to cache cold
> > block groups or we're reading to find the directory entries.
>
> All the unlinks for a single process will be happening in the same
> directory (though the hard linked copies will be all over)
>
> > Could you sysrq-w while the performance is bad? That would narrow it
> > down.
>
> Here's one:
>
> http://pastebin.com/Tg7agv42

Ok, we're mixing unlinks and fsyncs. Is it fsyncing directories too?

-chris
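For reference, fsyncing a directory (the extra, expensive step Chris is asking about) looks like this from userspace - a minimal POSIX sketch, not anything Cyrus actually does according to this thread:

```python
import os
import tempfile

def fsync_dir(path):
    # Directories are fsynced by opening them read-only and calling
    # fsync on the descriptor; this persists the directory entries
    # themselves (new links, renames, unlinks) to stable storage.
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

d = tempfile.mkdtemp()
path = os.path.join(d, "msg")
with open(path, "wb") as f:
    f.write(b"data")
    os.fsync(f.fileno())  # what Cyrus does: fsync the file only
fsync_dir(d)              # the extra step in question
```

Whether the directory fsync is needed for durability, or just costly, is exactly what the thread is narrowing down.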
Bron Gondwana
2010-Nov-18 21:46 UTC
Re: Poor performance unlinking hard-linked files (repost)
On Thu, Nov 18, 2010 at 10:30:47AM -0500, Chris Mason wrote:
> Excerpts from Bron Gondwana's message of 2010-11-16 23:11:48 -0500:
> > > > a) program creates piles of small temporary files, hard
> > > >    links them out to different directories, unlinks the
> > > >    originals.
> > > >
> > > > b) filesystem size: ~ 300Gb (backed by hardware RAID5)
> > > >
> > > > c) as the filesystem grows (currently about 30% full)
> > > >    the unlink performance becomes horrible. Watching
> > > >    iostat, there's a lot of reading going on as well.
> > >
> > > It sounds like the unlink speed is limited by the reading, and the reads
> > > are coming from one of two places. We're either reading to cache cold
> > > block groups or we're reading to find the directory entries.
> >
> > All the unlinks for a single process will be happening in the same
> > directory (though the hard linked copies will be all over)
> >
> > > Could you sysrq-w while the performance is bad? That would narrow it
> > > down.
> >
> > Here's one:
> >
> > http://pastebin.com/Tg7agv42
>
> Ok, we're mixing unlinks and fsyncs. Is it fsyncing directories too?

Nup. I'm pretty sure it doesn't, just files. Yes - there will certainly be fsyncs going on as well - Cyrus is very careful to fsync everything it cares about at the file level, but all it does with directories is mkdir them if they don't exist.

This is just a single "sync_server" process on an experimental server. A real server under full load is going to have multiple processes doing fsyncs and unlinks.

A significant portion of unlinks are of files that have another link on the filesystem. Every mailbox "move" is implemented as a copy (hardlink) plus expunge (delayed unlink). The "delay" works by marking the message to be deleted in the cyrus.index metadata file, and then deleting it later (tunable: 7 to 14 days in our case, depending when the next weekend is).

Bron.
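The "copy = hard link, expunge = delayed unlink" scheme described above can be sketched like this (illustrative Python with hypothetical layout and names; real Cyrus records the expunge mark in cyrus.index rather than an in-memory dict):

```python
import os
import time
import tempfile

root = tempfile.mkdtemp()
src = os.path.join(root, "src")
dst = os.path.join(root, "dst")
os.makedirs(src)
os.makedirs(dst)

msg = os.path.join(src, "1.")
with open(msg, "wb") as f:
    f.write(b"message body")

os.link(msg, os.path.join(dst, "1."))  # the "move" is really a copy...
expunged = {msg: time.time()}          # ...plus a mark for later deletion

DELAY = 0.0  # tunable; 7 to 14 days in the setup described above
for path, marked_at in list(expunged.items()):
    if time.time() - marked_at >= DELAY:
        os.unlink(path)                # the delayed unlink
        del expunged[path]

assert not os.path.exists(msg)
assert os.path.exists(os.path.join(dst, "1."))
```

This is why so many of the unlinks hit inodes that still have another link elsewhere on the filesystem: the unlink drops a name, not the data.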
Chris Mason
2010-Nov-19 14:10 UTC
Re: Poor performance unlinking hard-linked files (repost)
Excerpts from Bron Gondwana's message of 2010-11-18 16:46:31 -0500:
> On Thu, Nov 18, 2010 at 10:30:47AM -0500, Chris Mason wrote:
> > Excerpts from Bron Gondwana's message of 2010-11-16 23:11:48 -0500:
> > > > > a) program creates piles of small temporary files, hard
> > > > >    links them out to different directories, unlinks the
> > > > >    originals.
> > > > >
> > > > > b) filesystem size: ~ 300Gb (backed by hardware RAID5)
> > > > >
> > > > > c) as the filesystem grows (currently about 30% full)
> > > > >    the unlink performance becomes horrible. Watching
> > > > >    iostat, there's a lot of reading going on as well.
> > > >
> > > > It sounds like the unlink speed is limited by the reading, and the reads
> > > > are coming from one of two places. We're either reading to cache cold
> > > > block groups or we're reading to find the directory entries.
> > >
> > > All the unlinks for a single process will be happening in the same
> > > directory (though the hard linked copies will be all over)
> > >
> > > > Could you sysrq-w while the performance is bad? That would narrow it
> > > > down.
> > >
> > > Here's one:
> > >
> > > http://pastebin.com/Tg7agv42
> >
> > Ok, we're mixing unlinks and fsyncs. Is it fsyncing directories too?
>
> Nup. I'm pretty sure it doesn't, just files. Yes - there will certainly
> be fsyncs going on as well - Cyrus is very careful to fsync everything it
> cares about at the file level, but all it does with directories is mkdir
> them if they don't exist.

Could you double check this one please? fsyncing the directory is a ton more expensive, I just want to make sure it isn't part of the workload.

Otherwise it looks like we're seeking to read in the inode and unlink it. One possibility is that we're not giving the elevator enough clues about the IO being synchronous.

Are you using cfq or deadline? I bet we can improve the latencies using READ_SYNC.

-chris

> This is just a single "sync_server" process on an experimental server. A
> real server under full load is going to have multiple processes doing
> fsyncs and unlinks.
>
> A significant portion of unlinks are of files that have another link on
> the filesystem. Every mailbox "move" is implemented as a copy (hardlink)
> plus expunge (delayed unlink). The "delay" works by marking the message
> to be deleted in the cyrus.index metadata file, and then deleting later
> (tunable: 7 to 14 days in our case depending when the next weekend is)
>
> Bron.
Bron Gondwana
2010-Nov-19 21:58 UTC
Re: Poor performance unlinking hard-linked files (repost)
On Fri, Nov 19, 2010 at 09:10:08AM -0500, Chris Mason wrote:
> Excerpts from Bron Gondwana's message of 2010-11-18 16:46:31 -0500:
> > On Thu, Nov 18, 2010 at 10:30:47AM -0500, Chris Mason wrote:
> > > > http://pastebin.com/Tg7agv42
> > >
> > > Ok, we're mixing unlinks and fsyncs. Is it fsyncing directories too?
> >
> > Nup. I'm pretty sure it doesn't, just files. Yes - there will certainly
> > be fsyncs going on as well - Cyrus is very careful to fsync everything it
> > cares about at the file level, but all it does with directories is mkdir
> > them if they don't exist.
>
> Could you double check this one please? fsyncing the directory is a ton
> more expensive, I just want to make sure it isn't part of the workload.
>
> Otherwise it looks like we're seeking to read in the inode and unlink
> it. One possibility is that we're not giving the elevator enough clues
> about the IO being synchronous.
>
> Are you using cfq or deadline? I bet we can improve the latencies using
> READ_SYNC.

I'm using deadline.

Here's a redacted strace of a single message upload: (those gettimeofday calls are actually caused by "trickle" being used to bandwidth-limit these things from nuking our internal network if they all go crazy at once)

All I'm seeing is the fsyncs on the files. And some unnecessary mkdir calls that I can probably remove, and an unnecessary truncate on the quota file.

Bron.
gettimeofday({1290202884, 848919}, NULL) = 0 gettimeofday({1290202884, 849006}, NULL) = 0 mkdir("/mnt", 0755) = -1 EEXIST (File exists) mkdir("/mnt/sata96b1d4", 0755) = -1 EEXIST (File exists) mkdir("/mnt/sata96b1d4/slots96b1p4", 0755) = -1 EEXIST (File exists) mkdir("/mnt/sata96b1d4/slots96b1p4/store23", 0755) = -1 EEXIST (File exists) mkdir("/mnt/sata96b1d4/slots96b1p4/store23/data", 0755) = -1 EEXIST (File exists) mkdir("/mnt/sata96b1d4/slots96b1p4/store23/data/sync.", 0755) = -1 EEXIST (File exists) mkdir("/mnt/sata96b1d4/slots96b1p4/store23/data/sync./11284", 0755) = -1 EEXIST (File exists) open("/mnt/sata96b1d4/slots96b1p4/store23/data/sync./11284/9be294a24866fc162e5a2d48925d57642ff20a71", O_RDWR|O_CREAT|O_TRUNC, 0666) = 11 fstat64(11, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf4cb2000 time(NULL) = 1290202884 read(0, "<CENSORED>"..., 4096) = 4096 fcntl64(0, F_GETFL) = 0x2 (flags O_RDWR) gettimeofday({1290202884, 851014}, NULL) = 0 gettimeofday({1290202884, 851101}, NULL) = 0 write(11, "MIME-Version: 1.0\r\nContent-Transf"..., 4096) = 4096 time(NULL) = 1290202884 read(0, "<CENSORED>"..., 4096) = 4096 fcntl64(0, F_GETFL) = 0x2 (flags O_RDWR) gettimeofday({1290202884, 851952}, NULL) = 0 gettimeofday({1290202884, 852038}, NULL) = 0 write(11, "<CENSORED>"..., 4096) = 4096 time(NULL) = 1290202884 read(0, "<CENSORED>"..., 4096) = 4096 fcntl64(0, F_GETFL) = 0x2 (flags O_RDWR) gettimeofday({1290202884, 852644}, NULL) = 0 gettimeofday({1290202884, 852729}, NULL) = 0 write(11, "family: Arial; font-size: medium="..., 4096) = 4096 time(NULL) = 1290202884 read(0, "<CENSORED>"..., 4096) = 4096 fcntl64(0, F_GETFL) = 0x2 (flags O_RDWR) gettimeofday({1290202884, 853303}, NULL) = 0 gettimeofday({1290202884, 853389}, NULL) = 0 write(11, "<CENSORED>"..., 4096) = 4096 time(NULL) = 1290202884 read(0, "<CENSORED>"..., 4096) = 4096 fcntl64(0, F_GETFL) = 0x2 (flags O_RDWR) gettimeofday({1290202884, 
853960}, NULL) = 0 gettimeofday({1290202884, 854045}, NULL) = 0 write(11, "<CENSORED>"..., 4096) = 4096 time(NULL) = 1290202884 read(0, "<CENSORED>"..., 4096) = 4096 fcntl64(0, F_GETFL) = 0x2 (flags O_RDWR) gettimeofday({1290202884, 854617}, NULL) = 0 gettimeofday({1290202884, 854703}, NULL) = 0 write(11, "<CENSORED>"..., 4096) = 4096 time(NULL) = 1290202884 read(0, "<CENSORED>"..., 4096) = 910 fcntl64(0, F_GETFL) = 0x2 (flags O_RDWR) gettimeofday({1290202884, 855431}, NULL) = 0 gettimeofday({1290202884, 855552}, NULL) = 0 write(11, "<CENSORED>"..., 4096) = 4096 write(11, "<CENSORED>"..., 668) = 668 fsync(11) = 0 close(11) = 0 munmap(0xf4cb2000, 4096) = 0 write(1, "<CENSORED>"..., 32) = 32 time(NULL) = 1290202884 read(0, "<CENSORED>"..., 4096) = 731 fcntl64(0, F_GETFL) = 0x2 (flags O_RDWR) gettimeofday({1290202884, 858721}, NULL) = 0 gettimeofday({1290202884, 858809}, NULL) = 0 open("/mnt/sata96b1m4/slots96b1p4/store23/conf/lock/domain/a/airpost.net/p/user/<CENSORED>/Drafts.lock", O_RDWR|O_CREAT|O_TRUNC, 0666) = 11 fcntl64(11, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0}) = 0 fcntl64(6, F_SETLKW, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0}) = 0 fstat64(6, {st_mode=S_IFREG|0600, st_size=809668, ...}) = 0 stat64("/mnt/sata96b1m4/slots96b1p4/store23/conf/mailboxes.db", {st_mode=S_IFREG|0600, st_size=809668, ...}) = 0 fcntl64(6, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0 open("/mnt/sata96b1m4/slots96b1p4/store23/meta/domain/a/airpost.net/p/user/<CENSORED>/Drafts/cyrus.index", O_RDWR) = 13 fstat64(13, {st_mode=S_IFREG|0600, st_size=9536, ...}) = 0 mmap2(NULL, 24576, PROT_READ, MAP_SHARED, 13, 0) = 0xf4cad000 fcntl64(13, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0 stat64("/mnt/sata96b1m4/slots96b1p4/store23/meta/domain/a/airpost.net/p/user/<CENSORED>/Drafts/cyrus.header", {st_mode=S_IFREG|0600, st_size=241, ...}) = 0 
open("/mnt/sata96b1m4/slots96b1p4/store23/meta/domain/a/airpost.net/p/user/<CENSORED>/Drafts/cyrus.header", O_RDONLY) = 14
fstat64(14, {st_mode=S_IFREG|0600, st_size=241, ...}) = 0
mmap2(NULL, 241, PROT_READ, MAP_SHARED, 14, 0) = 0xf4cac000
munmap(0xf4cac000, 241) = 0
lseek(13, 9440, SEEK_SET) = 9440
write(13, "<REWRITE INDEX RECORD (unrelated)>"..., 96) = 96
time(NULL) = 1290202884
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
send(5, "<181>Nov 19 16:41:24 slots96b1p4/"..., 238, MSG_NOSIGNAL) = 238
mkdir("/mnt", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4/slots96b1p4", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4/slots96b1p4/store23", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4/slots96b1p4/store23/data", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4/slots96b1p4/store23/data/sync.", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4/slots96b1p4/store23/data/sync./11284", 0755) = -1 EEXIST (File exists)
open("/mnt/sata96b1d4/slots96b1p4/store23/data/sync./11284/9be294a24866fc162e5a2d48925d57642ff20a71", O_RDONLY) = 15
fstat64(15, {st_mode=S_IFREG|0600, st_size=29340, ...}) = 0
mmap2(NULL, 29340, PROT_READ, MAP_SHARED, 15, 0) = 0xf4ca5000
munmap(0xf4ca5000, 29340) = 0
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
close(15) = 0
mkdir("/mnt", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4/slots96b1p4", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4/slots96b1p4/store23", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4/slots96b1p4/store23/data", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4/slots96b1p4/store23/data/domain", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4/slots96b1p4/store23/data/domain/a", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4/slots96b1p4/store23/data/domain/a/airpost.net", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4/slots96b1p4/store23/data/domain/a/airpost.net/p", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4/slots96b1p4/store23/data/domain/a/airpost.net/p/user", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4/slots96b1p4/store23/data/domain/a/airpost.net/p/user/<CENSORED>", 0755) = -1 EEXIST (File exists)
mkdir("/mnt/sata96b1d4/slots96b1p4/store23/data/domain/a/airpost.net/p/user/<CENSORED>/Drafts", 0755) = -1 EEXIST (File exists)
link("/mnt/sata96b1d4/slots96b1p4/store23/data/sync./11284/9be294a24866fc162e5a2d48925d57642ff20a71", "/mnt/sata96b1d4/slots96b1p4/store23/data/domain/a/airpost.net/p/user/<CENSORED>/Drafts/4907.") = 0
utime("/mnt/sata96b1d4/slots96b1p4/store23/data/domain/a/airpost.net/p/user/<CENSORED>/Drafts/4907.", [2010/11/19-16:41:24, 2010/11/19-16:41:24]) = 0
open("/mnt/sata96b1m4/slots96b1p4/store23/meta/domain/a/airpost.net/p/user/<CENSORED>/Drafts/cyrus.cache", O_RDWR) = 15
fstat64(15, {st_mode=S_IFREG|0600, st_size=105488, ...}) = 0
mmap2(NULL, 114688, PROT_READ, MAP_SHARED, 15, 0) = 0xf4c91000
lseek(15, 0, SEEK_END) = 105488
write(15, "<CACHE ENTRY>"..., 1200) = 1200
lseek(13, 9536, SEEK_SET) = 9536
write(13, "<INDEX RECORD FOR THIS UPLOAD>"..., 96) = 96
time(NULL) = 1290202884
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=3519, ...}) = 0
send(5, "<181>Nov 19 16:41:24 slots96b1p4/"..., 237, MSG_NOSIGNAL) = 237
time(NULL) = 1290202884
fsync(15) = 0
open("/mnt/sata96b1m4/slots96b1p4/store23/conf/domain/a/airpost.net/quota/p/user.<CENSORED>", O_RDWR) = 16
fcntl64(16, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
fstat64(16, {st_mode=S_IFREG|0600, st_size=18, ...}) = 0
stat64("/mnt/sata96b1m4/slots96b1p4/store23/conf/domain/a/airpost.net/quota/p/user.<CENSORED>", {st_mode=S_IFREG|0600, st_size=18, ...}) = 0
fstat64(16, {st_mode=S_IFREG|0600, st_size=18, ...}) = 0
mmap2(NULL, 18, PROT_READ, MAP_SHARED, 16, 0) = 0xf4c90000
munmap(0xf4c90000, 18) = 0
unlink("/mnt/sata96b1m4/slots96b1p4/store23/conf/domain/a/airpost.net/quota/p/user.<CENSORED>.NEW") = -1 ENOENT (No such file or directory)
open("/mnt/sata96b1m4/slots96b1p4/store23/conf/domain/a/airpost.net/quota/p/user.<CENSORED>.NEW", O_RDWR|O_CREAT|O_TRUNC, 0666) = 17
fcntl64(17, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
lseek(17, 0, SEEK_SET) = 0
write(17, "<CENSORED>"..., 18) = 18
ftruncate(17, 18) = 0
fsync(17) = 0
fstat64(17, {st_mode=S_IFREG|0600, st_size=18, ...}) = 0
rename("/mnt/sata96b1m4/slots96b1p4/store23/conf/domain/a/airpost.net/quota/p/user.<CENSORED>.NEW", "/mnt/sata96b1m4/slots96b1p4/store23/conf/domain/a/airpost.net/quota/p/user.<CENSORED>") = 0
fcntl64(17, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
close(17) = 0
fcntl64(16, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
close(16) = 0
lseek(13, 0, SEEK_SET) = 0
write(13, "<UPDATED INDEX HEADER>"..., 128) = 128
fsync(13) = 0
fcntl64(8, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
fstat64(8, {st_mode=S_IFREG|0600, st_size=144, ...}) = 0
stat64("/mnt/sata96b1m4/slots96b1p4/store23/conf/statuscache.db", {st_mode=S_IFREG|0600, st_size=144, ...}) = 0
fcntl64(8, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
fcntl64(13, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=0}) = 0
munmap(0xf4cad000, 24576) = 0
munmap(0xf4c91000, 114688) = 0
close(13) = 0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
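FWIW, the staging pattern the trace shows - write a pile of files into one directory, fsync them, hard-link each one out into a mailbox directory, then later unlink all the originals - can be reproduced outside Cyrus. A minimal sketch (file names, counts, and directory layout are invented for illustration, not Cyrus's real ones):

```python
import os
import tempfile

base = tempfile.mkdtemp()
stage = os.path.join(base, "sync.", "11284")   # staging dir, one per sync_server pid
mailbox = os.path.join(base, "user", "Drafts")
os.makedirs(stage)
os.makedirs(mailbox)

# 1. "reserve": write each message into the staging directory and fsync it
staged = []
for i in range(100):
    path = os.path.join(stage, "msg-%04d" % i)  # Cyrus names these by sha1
    with open(path, "wb") as f:
        f.write(b"message %d\n" % i)
        f.flush()
        os.fsync(f.fileno())
    staged.append(path)

# 2. hard-link each staged file into its final mailbox location
for i, src in enumerate(staged):
    os.link(src, os.path.join(mailbox, "%d." % (i + 1)))

# 3. later, on sync_server reset: unlink every staged original
#    (the step that is performing badly here)
for src in staged:
    os.unlink(src)

# the mailbox copies survive; each is back to link count 1
assert len(os.listdir(mailbox)) == 100
assert os.stat(os.path.join(mailbox, "1.")).st_nlink == 1
```

Pointed at a btrfs mount and scaled up, timing step 3 on its own should isolate the unlink behaviour from the rest of the workload.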
Bron Gondwana
2010-Nov-30 09:35 UTC
Re: Poor performance unlinking hard-linked files (repost)
On Sat, Nov 20, 2010 at 08:58:10AM +1100, Bron Gondwana wrote:
> On Fri, Nov 19, 2010 at 09:10:08AM -0500, Chris Mason wrote:
> > Excerpts from Bron Gondwana's message of 2010-11-18 16:46:31 -0500:
> > > On Thu, Nov 18, 2010 at 10:30:47AM -0500, Chris Mason wrote:
> > > > Ok, we're mixing unlinks and fsyncs. Is it fsyncing directories too?
> > >
> > > Nup. I'm pretty sure it doesn't, just files. Yes - there will certainly
> > > be fsyncs going on as well - Cyrus is very careful to fsync everything it
> > > cares about at the file level, but all it does with directories is mkdir
> > > them if they don't exist.
> >
> > Could you double check this one please? fsyncing the directory is a ton
> > more expensive, I just want to make sure it isn't part of the workload.
> >
> > Otherwise it looks like we're seeking to read in the inode and unlink
> > it. One possibility is that we're not giving the elevator enough clues
> > about the IO being synchronous.
> >
> > Are you using cfq or deadline? I bet we can improve the latencies using
> > READ_SYNC.
>
> I'm using deadline.
>
> All I'm seeing is the fsyncs on the files. And some unnecessary mkdir
> calls that I can probably remove, and an unnecessary truncate on the
> quota file.

Do you have any suggestions for what I could try? You mentioned READ_SYNC
above. We now have one working partition on this machine, but it took longer
to set up than most, and I'm not sure how it will cope with 7 more of them
(which is my next project - compare to the historical performance of this
box first with reiserfs and then with ext4!)

Bron.
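For reference, the active elevator can be checked (and switched) per device through sysfs; the bracketed name is the one in use. Device names below are examples:

```shell
# Show the elevator for every block device, e.g. "noop [deadline] cfq"
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    printf '%s: %s\n' "${f%/queue/scheduler}" "$(cat "$f")"
done

# To switch a device to deadline (as root):
#   echo deadline > /sys/block/sda/queue/scheduler
```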
Chris Mason
2010-Nov-30 12:49 UTC
Re: Poor performance unlinking hard-linked files (repost)
Excerpts from Bron Gondwana's message of 2010-11-30 04:35:10 -0500:
> On Sat, Nov 20, 2010 at 08:58:10AM +1100, Bron Gondwana wrote:
> > On Fri, Nov 19, 2010 at 09:10:08AM -0500, Chris Mason wrote:
> > > Excerpts from Bron Gondwana's message of 2010-11-18 16:46:31 -0500:
> > > > On Thu, Nov 18, 2010 at 10:30:47AM -0500, Chris Mason wrote:
> > > > > Ok, we're mixing unlinks and fsyncs. Is it fsyncing directories too?
> > > >
> > > > Nup. I'm pretty sure it doesn't, just files. Yes - there will certainly
> > > > be fsyncs going on as well - Cyrus is very careful to fsync everything it
> > > > cares about at the file level, but all it does with directories is mkdir
> > > > them if they don't exist.
> > >
> > > Could you double check this one please? fsyncing the directory is a ton
> > > more expensive, I just want to make sure it isn't part of the workload.
> > >
> > > Otherwise it looks like we're seeking to read in the inode and unlink
> > > it. One possibility is that we're not giving the elevator enough clues
> > > about the IO being synchronous.
> > >
> > > Are you using cfq or deadline? I bet we can improve the latencies using
> > > READ_SYNC.
> >
> > I'm using deadline.
> >
> > All I'm seeing is the fsyncs on the files. And some unnecessary mkdir
> > calls that I can probably remove, and an unnecessary truncate on the
> > quota file.
>
> Do you have any suggestions for what I could try? You mentioned READ_SYNC
> above. We now have one working partition on this machine, but it took longer
> to set up than most, and I'm not sure how it will cope with 7 more of them
> (which is my next project - compare to the historical performance of this
> box first with reiserfs and then with ext4!)

Let me work up a patch that does READ_SYNC calls for the metadata reads,
and I'll try to model this here a little. We should be able to improve
things.
-chris
Bron Gondwana
2010-Nov-30 23:24 UTC
Re: Poor performance unlinking hard-linked files (repost)
On Tue, Nov 30, 2010 at 07:49:26AM -0500, Chris Mason wrote:
> Excerpts from Bron Gondwana's message of 2010-11-30 04:35:10 -0500:
> > Do you have any suggestions for what I could try? You mentioned READ_SYNC
> > above. We now have one working partition on this machine, but it took longer
> > to set up than most, and I'm not sure how it will cope with 7 more of them
> > (which is my next project - compare to the historical performance of this
> > box first with reiserfs and then with ext4!)
>
> Let me work up a patch that does READ_SYNC calls for the metadata reads,
> and I'll try to model this here a little. We should be able to improve
> things.

Is there any reason why the read is going back down to the disk? The machine
has 8GB of RAM, and should easily have been able to cache all the metadata
under the workload it had.

I look forward to trying the patch :)

Thanks,

Bron.
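For the record, the reason unlink has to touch the inode at all, even when other names remain, is the link count - plain POSIX semantics, nothing btrfs-specific. A tiny sketch:

```python
import os
import tempfile

d = tempfile.mkdtemp()
orig = os.path.join(d, "staged")        # the sync./$pid copy
copy = os.path.join(d, "mailbox-copy")  # the copy linked into the mailbox

with open(orig, "w") as f:
    f.write("message\n")

os.link(orig, copy)                 # second name, same inode
assert os.stat(orig).st_nlink == 2

os.unlink(orig)                     # removes the name AND rewrites the inode
assert os.stat(copy).st_nlink == 1  # inode survives via the other link
assert not os.path.exists(orig)
```

So every unlink in the reset sweep needs the inode in memory to decrement its link count, which is presumably where the reads come from if it has fallen out of cache.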