Hi

During high disk loads, like backups combined with lots of writers, rsync at high speed locally, or btrfs defrag, I always get these messages, and everything grinds to a halt on the btrfs filesystem:

[ 3123.062229] INFO: task rtorrent:8431 blocked for more than 120 seconds.
[ 3123.062251] Not tainted 3.12.4 #1
[ 3123.062263] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3123.062284] rtorrent D ffff88043fc12e80 0 8431 8429 0x00000000
[ 3123.062287] ffff8804289d07b0 0000000000000082 ffffffff81610440 ffffffff810408ff
[ 3123.062290] 0000000000012e80 ffff88035f433fd8 ffff88035f433fd8 ffff8804289d07b0
[ 3123.062293] 0000000000000246 ffff88034bda8068 ffff8800ba5a39e8 ffff88035f433740
[ 3123.062295] Call Trace:
[ 3123.062302] [<ffffffff810408ff>] ? detach_if_pending+0x18/0x6c
[ 3123.062331] [<ffffffffa0193545>] ? wait_current_trans.isra.30+0xbc/0x117 [btrfs]
[ 3123.062334] [<ffffffff810515a1>] ? wake_up_atomic_t+0x22/0x22
[ 3123.062346] [<ffffffffa0194ef4>] ? start_transaction+0x1d1/0x46b [btrfs]
[ 3123.062359] [<ffffffffa0199537>] ? btrfs_dirty_inode+0x25/0xa6 [btrfs]
[ 3123.062362] [<ffffffff8111afe2>] ? file_update_time+0x95/0xb5
[ 3123.062374] [<ffffffffa01a08a0>] ? btrfs_page_mkwrite+0x68/0x2bc [btrfs]
[ 3123.062377] [<ffffffff810c3e06>] ? filemap_fault+0x1fa/0x36e
[ 3123.062379] [<ffffffff810dec6f>] ? __do_fault+0x15b/0x360
[ 3123.062382] [<ffffffff810e0ffe>] ? handle_mm_fault+0x22c/0x8aa
[ 3123.062385] [<ffffffff812c6445>] ? dev_hard_start_xmit+0x271/0x3ec
[ 3123.062388] [<ffffffff81380c2a>] ? __do_page_fault+0x38d/0x3d7
[ 3123.062393] [<ffffffffa04eeb2e>] ? br_dev_queue_push_xmit+0x9d/0xa1 [bridge]
[ 3123.062397] [<ffffffffa04ed4b3>] ? br_dev_xmit+0x1c3/0x1e0 [bridge]
[ 3123.062400] [<ffffffff81060eaa>] ? update_group_power+0xb7/0x1b9
[ 3123.062403] [<ffffffff811c3456>] ? cpumask_next_and+0x2a/0x3a
[ 3123.062405] [<ffffffff8106114f>] ? update_sd_lb_stats+0x1a3/0x35a
[ 3123.062407] [<ffffffff8137e172>] ? page_fault+0x22/0x30
[ 3123.062410] [<ffffffff811ccc80>] ? copy_user_generic_string+0x30/0x40
[ 3123.062413] [<ffffffff811d101b>] ? memcpy_toiovec+0x2f/0x5c
[ 3123.062417] [<ffffffff812bcc5a>] ? skb_copy_datagram_iovec+0x76/0x20d
[ 3123.062419] [<ffffffff8137dc08>] ? _raw_spin_lock_bh+0xe/0x1c
[ 3123.062422] [<ffffffff81059ad3>] ? should_resched+0x5/0x23
[ 3123.062426] [<ffffffff812fa113>] ? tcp_recvmsg+0x72e/0xaa3
[ 3123.062428] [<ffffffff810615dc>] ? load_balance+0x12c/0x6b5
[ 3123.062431] [<ffffffff813170b1>] ? inet_recvmsg+0x5a/0x6e
[ 3123.062434] [<ffffffff810015dc>] ? __switch_to+0x1b1/0x3c4
[ 3123.062437] [<ffffffff812b32d9>] ? sock_recvmsg+0x54/0x71
[ 3123.062440] [<ffffffff81139d43>] ? ep_item_poll+0x16/0x1b
[ 3123.062442] [<ffffffff81139e6f>] ? ep_pm_stay_awake+0xf/0xf
[ 3123.062445] [<ffffffff8111c81a>] ? fget_light+0x6b/0x7c
[ 3123.062447] [<ffffffff812b33c0>] ? SYSC_recvfrom+0xca/0x12e
[ 3123.062449] [<ffffffff8105c309>] ? try_to_wake_up+0x190/0x190
[ 3123.062452] [<ffffffff81109189>] ? fput+0xf/0x9d
[ 3123.062454] [<ffffffff8113b4b8>] ? SyS_epoll_wait+0x9c/0xc7
[ 3123.062457] [<ffffffff81382d62>] ? system_call_fastpath+0x16/0x1b
[ 3123.062462] INFO: task kworker/u16:0:21158 blocked for more than 120 seconds.
[ 3123.062491] Not tainted 3.12.4 #1
[ 3123.062513] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3123.062557] kworker/u16:0 D ffff88043fcd2e80 0 21158 2 0x00000000
[ 3123.062561] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-1)
[ 3123.062562] ffff88026a163830 0000000000000046 ffff88042f0c67b0 0000001000011210
[ 3123.062565] 0000000000012e80 ffff88027067bfd8 ffff88027067bfd8 ffff88026a163830
[ 3123.062567] 0000000000000246 ffff88034bda8068 ffff8800ba5a39e8 ffff88027067b750
[ 3123.062569] Call Trace:
[ 3123.062581] [<ffffffffa0193545>] ? wait_current_trans.isra.30+0xbc/0x117 [btrfs]
[ 3123.062584] [<ffffffff810515a1>] ? wake_up_atomic_t+0x22/0x22
[ 3123.062596] [<ffffffffa0194ef4>] ? start_transaction+0x1d1/0x46b [btrfs]
[ 3123.062608] [<ffffffffa019a114>] ? run_delalloc_nocow+0x9c/0x752 [btrfs]
[ 3123.062621] [<ffffffffa019a82e>] ? run_delalloc_range+0x64/0x333 [btrfs]
[ 3123.062635] [<ffffffffa01a936c>] ? free_extent_state+0x12/0x21 [btrfs]
[ 3123.062648] [<ffffffffa01ac32f>] ? __extent_writepage+0x1e5/0x62a [btrfs]
[ 3123.062659] [<ffffffffa018f5c8>] ? btree_submit_bio_hook+0x7e/0xd7 [btrfs]
[ 3123.062662] [<ffffffff810c20d1>] ? find_get_pages_tag+0x66/0x121
[ 3123.062675] [<ffffffffa01ac8be>] ? extent_write_cache_pages.isra.23.constprop.47+0x14a/0x255 [btrfs]
[ 3123.062688] [<ffffffffa01acc5c>] ? extent_writepages+0x49/0x60 [btrfs]
[ 3123.062700] [<ffffffffa0198017>] ? btrfs_submit_direct+0x412/0x412 [btrfs]
[ 3123.062703] [<ffffffff8112660b>] ? __writeback_single_inode+0x6d/0x1e8
[ 3123.062705] [<ffffffff8112746a>] ? writeback_sb_inodes+0x1f0/0x322
[ 3123.062707] [<ffffffff81127605>] ? __writeback_inodes_wb+0x69/0xab
[ 3123.062709] [<ffffffff8112777d>] ? wb_writeback+0x136/0x292
[ 3123.062712] [<ffffffff810fbffb>] ? __cache_free.isra.46+0x178/0x187
[ 3123.062714] [<ffffffff81127a6d>] ? bdi_writeback_workfn+0xc1/0x2fe
[ 3123.062716] [<ffffffff8105a469>] ? resched_task+0x35/0x5d
[ 3123.062718] [<ffffffff8105a83d>] ? ttwu_do_wakeup+0xf/0xc1
[ 3123.062721] [<ffffffff8105c2f7>] ? try_to_wake_up+0x17e/0x190
[ 3123.062723] [<ffffffff8104bca7>] ? process_one_work+0x191/0x294
[ 3123.062725] [<ffffffff8104c159>] ? worker_thread+0x121/0x1e7
[ 3123.062726] [<ffffffff8104c038>] ? rescuer_thread+0x269/0x269
[ 3123.062729] [<ffffffff81050c01>] ? kthread+0x81/0x89
[ 3123.062731] [<ffffffff81050b80>] ? __kthread_parkme+0x5d/0x5d
[ 3123.062733] [<ffffffff81382cbc>] ? ret_from_fork+0x7c/0xb0
[ 3123.062736] [<ffffffff81050b80>] ? __kthread_parkme+0x5d/0x5d

These messages are repeated for several processes trying to do something. I have had no data losses, only availability issues during high load.

The surest way for me to trigger these messages is to start a copy from my other local array while doing something like heavy torrenting at the same time. Smartd has not reported any disk issues, and iostat -d only indicates a lot of disk activity at the limits of the drives, with no drive seemingly behaving any differently than the others (until the errors hit, at which point activity drops to zero).

Mount options are the defaults plus compress=lzo, on kernel 3.12.4. I have 16 GB ECC RAM and a quad-core Xeon CPU. I am running this on an 8-disk WD SE 4TB btrfs RAID10 system with a single snapshot.

I have no expectations of btrfs delivering stellar performance under heavy IOPS load on ordinary 7200rpm drives, but I do expect it to just be slow until the load is removed, not to more or less completely stall the entire server.
The filesystem has used about 26TB of the available 29TB (real available data), and some of the files on it are heavily fragmented (around 100 000 extents for a file of about 25GB).

Regards,
Hans-Kristian Bakke
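For reference, one way to watch per-device behaviour while reproducing the stall is the sysstat iostat in extended mode; a minimal sketch (the interval, sample count and output file name are just examples):

  iostat -dx 5                           # extended per-device stats every 5 s; watch await and %util per member drive
  iostat -dxt 5 120 > iostat-stall.txt   # or capture a 10-minute timestamped run to a file while triggering the load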
On Dec 14, 2013, at 1:30 PM, Hans-Kristian Bakke <hkbakke@gmail.com> wrote:
>
> During high disk loads, like backups combined with lots of writers,
> rsync at high speed locally or btrfs defrag I always get these
> messages, and everything grinds to a halt on the btrfs filesystem:
>
> [ 3123.062229] INFO: task rtorrent:8431 blocked for more than 120 seconds
> [ 3123.062251] Not tainted 3.12.4 #1

On blocks, if this is an unknown problem, often devs will want to see dmesg after you've issued dmesg -n7 and sysrq+w. More on sysrq triggering is here:
https://www.kernel.org/doc/Documentation/sysrq.txt

> The filesystem has used about 26TB of the available 29TB (real
> available data), and some of the files on it are heavily fragmented
> (around 100 000 extents at about 25GB)

Please include results from btrfs fi show, and btrfs fi df <mp>.

Chris Murphy
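A minimal sketch of capturing that information when the next stall hits (standard kernel interfaces, run as root; the output file name is just an example):

  dmesg -n 7                        # raise the console log level
  echo 1 > /proc/sys/kernel/sysrq   # make sure sysrq is enabled
  echo w > /proc/sysrq-trigger      # dump stack traces of all blocked (D-state) tasks
  dmesg > sysrq-w.txt               # save the resulting traces to attach to the report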
Looking into triggering the error again and dmesg and sysrq, but here are the other two:

# btrfs fi show
Label: none uuid: 9302fc8f-15c6-46e9-9217-951d7423927c
Total devices 8 FS bytes used 13.00TB
devid 4 size 3.64TB used 3.48TB path /dev/sdt
devid 3 size 3.64TB used 3.48TB path /dev/sds
devid 8 size 3.64TB used 3.48TB path /dev/sdr
devid 6 size 3.64TB used 3.48TB path /dev/sdp
devid 7 size 3.64TB used 3.48TB path /dev/sdq
devid 5 size 3.64TB used 3.48TB path /dev/sdo
devid 1 size 3.64TB used 3.48TB path /dev/sdl
devid 2 size 3.64TB used 3.48TB path /dev/sdm

Btrfs v0.20-rc1

# btrfs fi df /storage/storage-vol0/
Data, RAID10: total=13.89TB, used=12.99TB
System, RAID10: total=64.00MB, used=1.19MB
System: total=4.00MB, used=0.00
Metadata, RAID10: total=21.00GB, used=17.59GB

Regards,
Hans-Kristian Bakke

On 14 December 2013 22:35, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Dec 14, 2013, at 1:30 PM, Hans-Kristian Bakke <hkbakke@gmail.com> wrote:
>>
>> During high disk loads, like backups combined with lots of writers,
>> rsync at high speed locally or btrfs defrag I always get these
>> messages, and everything grinds to a halt on the btrfs filesystem:
>>
>> [ 3123.062229] INFO: task rtorrent:8431 blocked for more than 120 seconds
>> [ 3123.062251] Not tainted 3.12.4 #1
>
> On blocks, if this is an unknown problem, often devs will want to see dmesg after you've issued dmesg -n7 and sysrq+w. More on sysrq triggering is here:
> https://www.kernel.org/doc/Documentation/sysrq.txt
>
>> The filesystem has used about 26TB of the available 29TB (real
>> available data), and some of the files on it are heavily fragmented
>> (around 100 000 extents at about 25GB)
>
> Please include results from btrfs fi show, and btrfs fi df <mp>.
>
> Chris Murphy
On Dec 14, 2013, at 4:19 PM, Hans-Kristian Bakke <hkbakke@gmail.com> wrote:
> Looking into triggering the error again and dmesg and sysrq, but here
> are the other two:
>
> # btrfs fi show
> Label: none uuid: 9302fc8f-15c6-46e9-9217-951d7423927c
> Total devices 8 FS bytes used 13.00TB
> devid 4 size 3.64TB used 3.48TB path /dev/sdt
> devid 3 size 3.64TB used 3.48TB path /dev/sds
> devid 8 size 3.64TB used 3.48TB path /dev/sdr
> devid 6 size 3.64TB used 3.48TB path /dev/sdp
> devid 7 size 3.64TB used 3.48TB path /dev/sdq
> devid 5 size 3.64TB used 3.48TB path /dev/sdo
> devid 1 size 3.64TB used 3.48TB path /dev/sdl
> devid 2 size 3.64TB used 3.48TB path /dev/sdm
>
> Btrfs v0.20-rc1
>
> # btrfs fi df /storage/storage-vol0/
> Data, RAID10: total=13.89TB, used=12.99TB
> System, RAID10: total=64.00MB, used=1.19MB
> System: total=4.00MB, used=0.00
> Metadata, RAID10: total=21.00GB, used=17.59GB

By my count this is ~ 95.6% full. My past experience with other file systems, including btree file systems, is they get unpredictably fussy when they're this full. I start migration planning once 80% full is reached, and make it a policy to avoid going over 90% full.

I don't know what behavior Btrfs developers anticipate for this scenario. On the one hand it seems reasonable to expect it to only be slow, rather than block the whole server for 2 minutes. But on the other hand, it's reasonable to expect server storage won't get this full.

Chris Murphy
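The ~95.6% figure appears to come from the per-device chunk allocation in the fi show output above (3.48TB of each 3.64TB device already allocated to chunks), rather than from the df-style data usage; a quick check of that assumption:

  echo "scale=3; 3.48 / 3.64" | bc   # => .956, i.e. ~95.6% of raw device space is allocated to chunks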
When I look at the entire FS with df-like tools it is reported as 89.4% used (26638.65 of 29808.2 GB). But this is shared amongst both data and metadata I guess?

I do know that ~90%+ seems full, but it is still around 3TB in my case! Are the "percentage rules" of old times still valid with modern disk sizes? It seems extremely inconvenient that a filesystem like btrfs is starting to misbehave at "only" 3TB available space for RAID10 mirroring and metadata, which is probably a little bit over 1TB of actual file storage counting everything in.

I would normally expect that there is no difference between 1TB free space on a FS that is 2TB in total and 1TB free space on a filesystem that is 30TB in total, other than my sense of urgency, and that you would probably expect data growth to be more rapid on the 30TB FS as there is obviously a need to store a lot of stuff.
Is "free space needed" really a different concept depending on the size of your FS?

Mvh
Hans-Kristian Bakke

On 15 December 2013 00:50, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Dec 14, 2013, at 4:19 PM, Hans-Kristian Bakke <hkbakke@gmail.com> wrote:
>
>> Looking into triggering the error again and dmesg and sysrq, but here
>> are the other two:
>>
>> # btrfs fi show
>> Label: none uuid: 9302fc8f-15c6-46e9-9217-951d7423927c
>> Total devices 8 FS bytes used 13.00TB
>> devid 4 size 3.64TB used 3.48TB path /dev/sdt
>> devid 3 size 3.64TB used 3.48TB path /dev/sds
>> devid 8 size 3.64TB used 3.48TB path /dev/sdr
>> devid 6 size 3.64TB used 3.48TB path /dev/sdp
>> devid 7 size 3.64TB used 3.48TB path /dev/sdq
>> devid 5 size 3.64TB used 3.48TB path /dev/sdo
>> devid 1 size 3.64TB used 3.48TB path /dev/sdl
>> devid 2 size 3.64TB used 3.48TB path /dev/sdm
>>
>> Btrfs v0.20-rc1
>>
>> # btrfs fi df /storage/storage-vol0/
>> Data, RAID10: total=13.89TB, used=12.99TB
>> System, RAID10: total=64.00MB, used=1.19MB
>> System: total=4.00MB, used=0.00
>> Metadata, RAID10: total=21.00GB, used=17.59GB
>
> By my count this is ~ 95.6% full. My past experience with other file systems, including btree file systems, is they get unpredictably fussy when they're this full. I start migration planning once 80% full is reached, and make it a policy to avoid going over 90% full.
>
> I don't know what behavior Btrfs developers anticipate for this scenario. On the one hand it seems reasonable to expect it to only be slow, rather than block the whole server for 2 minutes. But on the other hand, it's reasonable to expect server storage won't get this full.
>
> Chris Murphy
On Dec 14, 2013, at 5:28 PM, Hans-Kristian Bakke <hkbakke@gmail.com> wrote:

> When I look at the entire FS with df-like tools it is reported as
> 89.4% used (26638.65 of 29808.2 GB). But this is shared amongst both
> data and metadata I guess?

Yes.

> I do know that ~90%+ seems full, but it is still around 3TB in my
> case! Are the "percentage rules" of old times still valid with modern
> disk sizes?

Probably not. But you also reported rather significant fragmentation. And it's also still an experimental file system even when not ~ 90% full. I think it's fair to say that this level of fullness is a less tested use case.

> It seems extremely inconvenient that a filesystem like
> btrfs is starting to misbehave at "only" 3TB available space for
> RAID10 mirroring and metadata, which is probably a little bit over 1TB
> actual filestorage counting everything in.

I'm not suggesting the behavior is either desired or expected, but certainly blocking is better than an oops or a broken file system, and in the not too distant past such things have happened on full volumes. Given the level of fragmentation this behavior might be expected at the current state of development, for all I know.

But if you care about this data, I'd take the blocking as a warning to back off on this usage pattern, unless of course you're intentionally trying to see at what point it breaks and why.

> I would normally expect that there is no difference in 1TB free space
> on a FS that is 2TB in total, and 1TB free space on a filesystem that
> is 30TB in total, other than my sense of urgency and that you would
> probably expect data growth to be more rapid on the 30TB FS as there
> is obviously a need to store a lot of stuff.

Seems reasonable.

> Is "free space needed" really a different concept depending on the
> size of your FS?

Maybe it depends more on the size and fragmentation of the files being accessed, and of remaining free space.

Can you do an lsattr on these 25GB files that you say have ~ 100,000 extents? And what are these files?

Chris Murphy
I have done some more testing. I turned off everything using the disk and only did defrag. I have created a script that gives me a list of the files with the most extents. I started from the top to improve the fragmentation of the worst files. The most fragmented file was a file of about 32GB with over 250 000 extents!

It seems that I can defrag two to three largish (15-30GB) ~100 000 extent files just fine, but after a while the system locks up (not a complete hard lock, but everything hangs and a restart is necessary to get a fully working system again).

It seems like defrag operations are triggering the issue, probably in combination with the large and heavily fragmented files.

I have slowly managed to defragment the most fragmented files, rebooting 4 times, so one of the worst files now is this one:

# filefrag vide01.mkv
vide01.mkv: 77810 extents found
# lsattr vide01.mkv
---------------- vide01.mkv

All the large fragmented files are ordinary mkv files (video). The reason for the heavy fragmentation was that perhaps 50 to 100 files were written at the same time over a period of several days, with lots of other activity going on as well. No problem for the system, as it was network limited most of the time.

Although defrag alone can trigger blocking, so can a straight rsync from another internal array capable of 1000 MB/s continuous reads, combined with some random activity. It seems that the cause is just heavy IO. Is it possible that even though I seemingly have lots of free space measured in MBytes, it is all so fragmented that btrfs can't allocate space efficiently enough? Or would that give other errors?

I actually downgraded from kernel 3.13-rc2 because of not being able to do anything else while copying between the internal arrays without btrfs hanging, although seemingly just temporarily and not as bad as the defrag blocking.

I will try to free up some space before running more defrag too, just to check if that is the issue.

Mvh
Hans-Kristian Bakke

On 15 December 2013 02:59, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Dec 14, 2013, at 5:28 PM, Hans-Kristian Bakke <hkbakke@gmail.com> wrote:
>
>> When I look at the entire FS with df-like tools it is reported as
>> 89.4% used (26638.65 of 29808.2 GB). But this is shared amongst both
>> data and metadata I guess?
>
> Yes.
>
>> I do know that ~90%+ seems full, but it is still around 3TB in my
>> case! Are the "percentage rules" of old times still valid with modern
>> disk sizes?
>
> Probably not. But you also reported rather significant fragmentation. And it's also still an experimental file system even when not ~ 90% full. I think it's fair to say that this level of fullness is a less tested use case.
>
>> It seems extremely inconvenient that a filesystem like
>> btrfs is starting to misbehave at "only" 3TB available space for
>> RAID10 mirroring and metadata, which is probably a little bit over 1TB
>> actual filestorage counting everything in.
>
> I'm not suggesting the behavior is either desired or expected, but certainly blocking is better than an oops or a broken file system, and in the not too distant past such things have happened on full volumes. Given the level of fragmentation this behavior might be expected at the current state of development, for all I know.
>
> But if you care about this data, I'd take the blocking as a warning to back off on this usage pattern, unless of course you're intentionally trying to see at what point it breaks and why.
>
>> I would normally expect that there is no difference in 1TB free space
>> on a FS that is 2TB in total, and 1TB free space on a filesystem that
>> is 30TB in total, other than my sense of urgency and that you would
>> probably expect data growth to be more rapid on the 30TB FS as there
>> is obviously a need to store a lot of stuff.
>
> Seems reasonable.
>
>> Is "free space needed" really a different concept depending on the
>> size of your FS?
>
> Maybe it depends more on the size and fragmentation of the files being accessed, and of remaining free space.
>
> Can you do an lsattr on these 25GB files that you say have ~ 100,000 extents? And what are these files?
>
> Chris Murphy
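The extent-counting script mentioned above was not posted; a minimal sketch of one way to do it (my own illustration, not the poster's script; it assumes the util-linux filefrag and file names that do not contain ': ', and note, as discussed below, that filefrag over-counts extents for files btrfs has compressed):

  #!/bin/bash
  # Usage: ./worst-extents.sh /storage/storage-vol0 20
  # Prints the <top> largest extent counts under <path>, worst first.
  path=${1:-.}
  top=${2:-20}
  find "$path" -xdev -type f -size +100M -print0 \
    | xargs -0 -r filefrag 2>/dev/null \
    | awk -F': ' '{ split($2, f, " "); print f[1], $1 }' \
    | sort -rn \
    | head -n "$top"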
On 12/14/2013 04:28 PM, Hans-Kristian Bakke wrote:> > I would normally expect that there is no difference in 1TB free space > on a FS that is 2TB in total, and 1TB free space on a filesystem that > is 30TB in total, other than my sense of urge and that you would > probably expect data growth to be more rapid on the 30TB FS as there > is obviously a need to store a lot of stuff. > Is "free space needed" really a different concept dependning on the > size of your FS?I would suggest there just might be a very significant difference. In the case of a 30TB array as opposed to a 3TB array, you are dealing with a much higher ratio of used space to free space. I believe this creates a higher likelihood that the free space is occurring as a larger number of very small pieces of drive space as opposed to a 3TB drive where 1/3rd of the drive space free would imply actual USABLE space on the drives. My concern would be that with only 1/30th of the space on the drives left free, that remaining space likely involves a lot of very small segments that create a situation where the filesystem is struggling to compute how to lay out new files. And, on top of that, defragmentation could become a nightmare of complexity as well, since the filesystem first has to clear contiguous space to somewhere in order to defragment each file. And then throw in the striping and mirroring requirements. I know those algorithms are likely pretty sophisticated, but something tells me that the higher the RATIO of used space to free space, the more difficult things might get for the filesystem. Just about everybody here knows a whole lot more about this than I do, but something really concerns me about this ratio issue. Ideally of course it probably should work, but its just got to be significantly more complex than a 3TB situation. These are just my thoughts as a comparative novice when it comes to btrfs or filesystems in general. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hans-Kristian Bakke posted on Sun, 15 Dec 2013 03:35:53 +0100 as excerpted:

> I have done some more testing. I turned off everything using the disk
> and only did defrag. I have created a script that gives me a list of the
> files with the most extents. I started from the top to improve the
> fragmentation of the worst files. The most fragmented file was a file of
> about 32GB with over 250 000 extents!
> It seems that I can defrag two to three largish (15-30GB) ~100 000
> extent files just fine, but after a while the system locks up (not a
> complete hard lock, but everything hangs and a restart is necessary to
> get a fully working system again)
>
> It seems like defrag operations are triggering the issue. Probably in
> combination with the large and heavily fragmented files.
>
> I have slowly managed to defragment the most fragmented files,
> rebooting 4 times, so one of the worst files now is this one:
>
> # filefrag vide01.mkv
> vide01.mkv: 77810 extents found
> # lsattr vide01.mkv
> ---------------- vide01.mkv
>
> All the large fragmented files are ordinary mkv files (video). The
> reason for the heavy fragmentation was that perhaps 50 to 100 files were
> written at the same time over a period of several days, with lots of
> other activity going on as well. No problem for the system as it was
> network limited most of the time.
> Although defrag alone can trigger blocking, so can a straight rsync
> from another internal array capable of 1000 MB/s continuous reads,
> combined with some random activity. It seems that the cause is just heavy
> IO. Is it possible that even though I seemingly have lots of free space
> measured in MBytes, it is all so fragmented that btrfs can't allocate
> space efficiently enough? Or would that give other errors?
>
> I actually downgraded from kernel 3.13-rc2 because of not being able to
> do anything else while copying between the internal arrays without btrfs
> hanging, although seemingly just temporarily and not as bad as the
> defrag blocking.
>
> I will try to free up some space before running more defrag too, just to
> check if that is the issue.

Three points based on bits you've mentioned, the third likely being the most critical for this thread, plus a fourth point, not something you've mentioned but just in case...:

1) You mentioned compress=lzo. It's worth noting that at present, filefrag interprets the file segments btrfs breaks compressed files up into as part of the compression process as fragments (of IIRC 128 KiB each, altho I'm not absolutely sure on that number), so anything that's compressed and over that size will be reported by filefrag as fragmented, even if it's not.

They're working on teaching filefrag about this sort of thing, and in fact I saw some proposed patches for the kernel side of things just yesterday, IIRC, but it'll be a few months before all the various pieces are in the kernel and filefrag upstreams, and it'll probably be a few months to a year or more beyond that before those fixes filter out to what the distros are shipping.

However, btrfs won't ordinarily attempt to compress known video files (unless the compress-force mount option is used) since they're normally already compressed, so that's unlikely to be the issue with your mkvs. Additionally, if defragging them is reducing the fragmentation dramatically, that's not the problem, as if it was defragging wouldn't make a difference.

But it might make a difference on some other files you have...

2) You mentioned backups.
Your backups aren't of the type that use lots and lots of hardlinks, are they? Because btrfs isn't the most efficient at processing large numbers of hardlinks. For hardlink-type backups, etc, a filesystem other than btrfs will be preferred. (Additionally, since btrfs is still experimental, it's probably a good idea to avoid having both your working system and backups on btrfs anyway. Better to have the backups on something else, in case btrfs lives up to the risk level implied by its development status.)

3) Critically, the blocked task in your first post was rtorrent. Given that and the filetypes (mkv video files) involved, one can guess that you do quite a bit of torrenting.

I'm not sure about rtorrent, but a lot of torrent clients (possibly optionally) pre-allocate the space required for a file, then fill in random individual chunks as they are downloaded and written.

*** THIS IS ONE OF THE WORST USE-CASES POSSIBLE FOR ANY COPY-ON-WRITE FILESYSTEM, BTRFS INCLUDED!! ***

What happens is that each of those random chunk-writes creates a new extent, a new fragment of the file, since COW means it isn't rewritten in-place and thus must be mapped to a new location on the disk. If that's what you're doing, then no WONDER those files have so many extents -- a 32-gig file with a quarter million extents in the worst case you mentioned. And especially on spinning rust, YES, something that heavily fragmented WILL trigger I/O blockages for minutes at a time!

(The other very common bad case, tho I don't believe quite as bad as the torrent case as I don't think they commonly re-write the /entire/ thing, only large parts of it, is virtual machine images, where writes to the virtual disk in the VM end up being "internal file writes" in the file containing that image on the host filesystem. The below recommendations apply there as well.)

There are several possible workarounds, including turning off the pre-allocate option in your torrent client, if possible, and several variants on the theme of telling btrfs not to COW those particular files so they get rewritten in place instead.

3a) Create a separate filesystem for your torrent files and either use something other than a COW filesystem (ext4 or xfs might be usable options), or if you use btrfs, mount that filesystem with the nodatacow mount option.

3b) Configure your torrent client to use a particular directory (which it probably already does by default, but make sure all the files are going there -- you're not telling it to directly write some torrent downloads elsewhere instead), then set the NOCOW attribute on that directory. Newly created files in it should inherit that NOCOW.

3c) Arrange to set NOCOW on individual files before you start writing into them. Often this is done by touching the file to create it, then setting the NOCOW attribute, then writing into the existing zero-length file. The attribute needs to be set before there's data in the file -- setting it after the fact doesn't really help, and this is one way to do it (with inheriting it from the directory as in 3b another). However, this could be impossible or at minimum rather complicated to handle with the torrent client, so 3a or 3b are likely to be more practical choices.

3d) As mentioned, in some clients it's possible to turn off the pre-allocation option. However, this can have other effects as well, or pre-allocation wouldn't be a common torrent client practice in the first place, so it may not be what you want in any case.
Pre-allocation is fine, as long as the file is set NOCOW using one of the methods above.

Of course once you have that setup, you'll still have to deal with the existing heavily fragmented files, but at least you won't have a continually regenerating problem to deal with. =:^)

4) This one you didn't mention but just in case... There have been some issues with btrfs qgroups that I'm not sure are fully ironed out yet. In general, I'd recommend staying away from quotas and their btrfs qgroups implementation for now. As with hardlink-heavy use cases, use a different filesystem if you are dependent on quotas, at least for the time being.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
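As a concrete illustration of 3b/3c above, a minimal sketch (the directory and file names are just examples; chattr +C only has the intended effect on empty files, or on files created after the attribute is set on the directory):

  # 3b: a NOCOW download directory; files created inside inherit the attribute
  mkdir -p /storage/torrent-incoming
  chattr +C /storage/torrent-incoming
  lsattr -d /storage/torrent-incoming    # shows 'C' among the attributes

  # 3c: the per-file variant; set +C while the file is still empty, then write into it
  touch /storage/some-file.bin
  chattr +C /storage/some-file.bin
  dd if=/dev/zero of=/storage/some-file.bin bs=1M count=10 conv=notrunc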
Thank you for your very thorough answer, Duncan. Just to clear up a couple of questions.

# Backups
The backups I am speaking of are backups of data on the btrfs filesystem to another place. The btrfs filesystem sees this as large reads at about 100 mbit/s, at the time continuously for about a week. In other words the backups are not storing any data on the btrfs array. The backup is not running when I am testing this, just to have said that.

# Regarding torrents and preallocation
I have actually turned preallocation on specifically in rtorrent, thinking that it did btrfs a favour like with ext4 (system.file_allocate.set = yes). It is easy to turn it off.
Is the "ideal" solution for btrfs and torrenting (or any other random writes to large files) to use preallocation and NOCOW, or to use no preallocation and NOCOW? I am thinking the first, although I still do not quite understand why preallocation is worse than no preallocation for btrfs with COW enabled (or are both just as bad?)

# qgroups
I am not using qgroups.

Regards
Hans-Kristian

On 15 December 2013 14:24, Duncan <1i5t5.duncan@cox.net> wrote:
> Hans-Kristian Bakke posted on Sun, 15 Dec 2013 03:35:53 +0100 as
> excerpted:
>
>> I have done some more testing. I turned off everything using the disk
>> and only did defrag. I have created a script that gives me a list of the
>> files with the most extents. I started from the top to improve the
>> fragmentation of the worst files. The most fragmented file was a file of
>> about 32GB with over 250 000 extents!
>> It seems that I can defrag two to three largish (15-30GB) ~100 000
>> extent files just fine, but after a while the system locks up (not a
>> complete hard lock, but everything hangs and a restart is necessary to
>> get a fully working system again)
>>
>> It seems like defrag operations are triggering the issue. Probably in
>> combination with the large and heavily fragmented files.
>>
>> I have slowly managed to defragment the most fragmented files,
>> rebooting 4 times, so one of the worst files now is this one:
>>
>> # filefrag vide01.mkv
>> vide01.mkv: 77810 extents found
>> # lsattr vide01.mkv
>> ---------------- vide01.mkv
>>
>> All the large fragmented files are ordinary mkv files (video). The
>> reason for the heavy fragmentation was that perhaps 50 to 100 files were
>> written at the same time over a period of several days, with lots of
>> other activity going on as well. No problem for the system as it was
>> network limited most of the time.
>> Although defrag alone can trigger blocking, so can a straight rsync
>> from another internal array capable of 1000 MB/s continuous reads,
>> combined with some random activity. It seems that the cause is just heavy
>> IO. Is it possible that even though I seemingly have lots of free space
>> measured in MBytes, it is all so fragmented that btrfs can't allocate
>> space efficiently enough? Or would that give other errors?
>>
>> I actually downgraded from kernel 3.13-rc2 because of not being able to
>> do anything else while copying between the internal arrays without btrfs
>> hanging, although seemingly just temporarily and not as bad as the
>> defrag blocking.
>>
>> I will try to free up some space before running more defrag too, just to
>> check if that is the issue.
>
> Three points based on bits you've mentioned, the third likely being the
> most critical for this thread, plus a fourth point, not something you've
> mentioned but just in case...:
>
> 1) You mentioned compress=lzo.
It''s worth noting that at present, > filefrag interprets the file segments btrfs breaks compressed files up > into as part of the compression process as fragments (of IIRC 128 KiB > each, altho I''m not absolutely sure on that number), so anything that''s > compressed and over that size will be reported by filefrag as fragmented, > even if it''s not. > > They''re working on teaching filefrag about this sort of thing, and in > fact I saw some proposed patches for the kernel side of things just > yesterday, IIRC, but it''ll be a few months before all the various pieces > are in the kernel and filefrag upstreams, and it''ll probably be a few > months to a year or more beyond that before those fixes filter out to > what the distros are shipping. > > However, btrfs won''t ordinarily attempt to compress known video files > (unless the compress-force mount option is used) since they''re normally > already compressed, so that''s unlikely to be the issue with your mkvs. > Additionally, if defragging them is reducing the fragmentation > dramatically, that''s not the problem, as if it was defragging wouldn''t > make a difference. > > But it might make a difference on some other files you have... > > 2) You mentioned backups. Your backups aren''t of the type that use lots > and lots of hardlinks are they? Because btrfs isn''t the most efficient > at processing large numbers of hardlinks. For hardlink-type backups, > etc, a filesystem other than btrfs will be preferred. (Additionally, > since btrfs is still experimental, it''s probably a good idea to avoid > having both your working system and backups on btrfs anyway. Better to > have the backups on something else, in case btrfs lives up to the risk > level implied by its development status.) > > 3) Critically, the blocked task in your first post was rtorrent. Given > that and the filetypes (mkv video files) involved, one can guess that you > do quite a bit of torrenting. > > I''m not sure about rtorrent, but a lot of torrent clients (possibly > optionally) pre-allocate the space required for a file, then fill in > random individual chunks they are downloaded and written. > > *** THIS IS ONE OF THE WORST USE-CASES POSSIBLE FOR ANY COPY-ON-WRITE > FILESYSTEM, BTRFS INCLUDED!! *** > > What happens is that each of those random chunk-writes creates a new > extent, a new fragment of the file, since COW means it isn''t rewritten in- > place and thus must be mapped to a new location on the disk. If that''s > what you''re doing, then no WONDER those files have so many extents -- a > 32-gig file with a quarter million extents in the worst-case you > mentioned. And especially on spinning rust, YES, something that heavily > fragmented WILL trigger I/O blockages for minutes at a time! > > (The other very common bad-case, tho I don''t believe quite as bad as the > torrent case as I don''t think they commonly re-write the /entire/ thing, > only large parts of it, is virtual machine images, where writes to the > virtual disk in the VM end up being "internal file writes" in the file > containing that image on the host filesystem. The below recommendations > apply there as well.) > > There''s several possible workarounds including turning off the pre- > allocate option in your torrent client, if possible, and several variants > on the theme of telling btrfs not to COW those particular files so they > get rewritten in-place instead. 
> > 3a) Create a separate filesystem for your torrent files and either use > something other than a COW filesystem (ext4 or xfs might be usable > options), or if you use btrfs, mount that filesystem with the nodatacow > mount-option. > > 3b) Configure your btrfs client to use a particular directory (which it > probably already does by default, but make sure all the files are going > there -- you''re not telling it to directly write some torrent downloads > elsewhere instead), then set the NOCOW attribute on that directory. > Newly created files in it should inherit that NOCOW. > > 3c) Arrange to set NOCOW on individual files before you start writing > into them. Often this is done by touching the file to create it, then > setting the NOCOW attribute, then writing into the existing zero-length > file. The attribute needs to be set before there''s data in the file -- > setting it after the fact doesn''t really help, and this is one way to do > it (with inherit from the directory as in 3b another). However, this > could be impossible or at minimum rather complicated to handle with the > torrent client, so 3a or 3b are likely to be more practical choices. > > 3d) As mentioned, in some clients it''s possible to turn off the pre- > allocation option. However, this can have other effects as well or pre- > allocation wouldn''t be a common torrent client practice in the first > place, so it may not be what you want in any case. Pre-allocation is > fine, as long as the file is set NOCOW using one of the methods above. > > > Of course once you have that setup, you''ll still have to deal with the > existing heavily fragmented files, but at least you won''t have a > continuing regenerating problem you have to deal with. =:^) > > 4) This one you didn''t mention but just in case... There have been some > issues with btrfs qgroups that I''m not sure are fully ironed out yet. In > general, I''d recommend staying away from quotas and their btrfs qgroups > implementation for now. As with hardlink-heavy use-cases, use a > different filesystem if you are dependent on quotes, at least for the > time being. > > -- > Duncan - List replies preferred. No HTML msgs. > "Every nonfree program has a lord, a master -- > and if you use the program, he is your master." Richard Stallman > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
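For the rtorrent side, a sketch of how the pieces above could fit together in the configuration (the download path is an example and should point at a NOCOW directory as illustrated earlier; system.file_allocate.set is the option already named in this thread, and option spellings vary between rtorrent versions):

  # ~/.rtorrent.rc fragment
  directory = /storage/torrent-incoming   # download into the NOCOW directory
  system.file_allocate.set = yes          # keep preallocation; fine once the directory is NOCOW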
Hans-Kristian Bakke posted on Sun, 15 Dec 2013 15:51:37 +0100 as excerpted:

> # Regarding torrents and preallocation I have actually turned
> preallocation on specifically in rtorrent thinking that it did btrfs a
> favour like with ext4 (system.file_allocate.set = yes). It is easy to
> turn it off.
> Is the "ideal" solution for btrfs and torrenting (or any other random
> writes to large files) to use preallocation and NOCOW, or use no
> preallocation and NOCOW? I am thinking the first, although I still do
> not understand quite why preallocation is worse than no preallocation
> for btrfs with COW enabled (or is both just as bad?)

I'm not a dev, only an admin who follows this list as I run btrfs too, and thus don't claim to be an expert on the above -- it's mostly echoing what I've seen here previously.

That said, preallocation with nocow is the choice I'd make here.

Meanwhile, a subpoint I didn't make explicit previously, tho it's a logical conclusion from the explanation, is that once the writing is finished and the file becomes, like most media files, effectively read-only with no further writes, NOCOW is no longer important. That is, you can (sequentially) copy the file somewhere else and not have to worry about it. In fact, that's a reasonably good idea, since NOCOW turns off btrfs checksumming too, and presumably you're still interested in maintaining file integrity on the thing.

So what I'd do is set up a torrent download dir (or as I mentioned, a dedicated partition, since I like that sort of thing because it enforces size discipline on the stuff I've downloaded but not fully sorted thru... that's what I do with binary newsgroup downloading, which I've been doing on and off since well before bittorrent was around), set/mount it NOCOW/nodatacow, and use it as a temporary download "cache". Then after a file is fully downloaded to "cache", I'd copy it off to a final destination in my normal media partition, ultimately removing my NOCOW copy.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
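A sketch of that "download cache" idea, assuming a NOCOW download directory and a normal COW archive directory that does not itself carry the C attribute (paths are examples):

  # a full copy (not a reflink, not a rename) rewrites the data through the normal
  # COW path, so the archived copy should get ordinary checksummed extents again
  cp --reflink=never /storage/torrent-incoming/file.mkv /storage/archive/
  rm /storage/torrent-incoming/file.mkv   # drop the NOCOW copy once it is no longer seeded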
Chris Murphy <lists@colorremedies.com> wrote:> On Dec 14, 2013, at 4:19 PM, Hans-Kristian Bakke <hkbakke@gmail.com> wrote: > > > # btrfs fi df /storage/storage-vol0/ > > Data, RAID10: total=13.89TB, used=12.99TB > > System, RAID10: total=64.00MB, used=1.19MB > > System: total=4.00MB, used=0.00 > > Metadata, RAID10: total=21.00GB, used=17.59GB >> By my count this is ~ 95.6% full. My past experience with other file > systems, including btree file systems, is they get unpredictably fussy when > they''re this full. I start migration planning once 80% full is reached, and > make it a policy to avoid going over 90% full.For what it''s worth, I see exactly the same behaviour on a system where the filesystem is only ~60% full, with more than 5TB of free space. All I have to do is copy a single file of several gigabytes to the filesystem (over the network, so it''s only coming in at ~30MB/s) and I get similar task-blocked messages: INFO: task btrfs-transacti:4118 blocked for more than 120 seconds. Not tainted 3.12.5-custom+ #10 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. btrfs-transacti D ffff88082fd14140 0 4118 2 0x00000000 ffff880805a06040 0000000000000002 ffff8807f7665d40 ffff8808078f2040 0000000000014140 ffff8807f7665fd8 ffff8807f7665fd8 ffff880805a06040 0000000000000001 ffff88082fd14140 ffff880805a06040 ffff8807f7665c70 Call Trace: [<ffffffff810d1a19>] ? __lock_page+0x66/0x66 [<ffffffff813b26dd>] ? io_schedule+0x56/0x6c [<ffffffff810d1a20>] ? sleep_on_page+0x7/0xc [<ffffffff813b0ad6>] ? __wait_on_bit+0x40/0x79 [<ffffffff810d1df1>] ? find_get_pages_tag+0x66/0x121 [<ffffffff810d1ad8>] ? wait_on_page_bit+0x72/0x77 [<ffffffff8105f540>] ? wake_atomic_t_function+0x21/0x21 [<ffffffff810d218f>] ? filemap_fdatawait_range+0x66/0xfe [<ffffffffa0545bb5>] ? clear_extent_bit+0x25d/0x29d [btrfs] [<ffffffffa052ff9a>] ? btrfs_wait_marked_extents+0x79/0xca [btrfs] [<ffffffffa0530059>] ? btrfs_write_and_wait_transaction+0x6e/0x7e [btrfs] [<ffffffffa05307ad>] ? btrfs_commit_transaction+0x651/0x843 [btrfs] [<ffffffffa05297e8>] ? transaction_kthread+0xf4/0x191 [btrfs] [<ffffffffa05296f4>] ? try_to_freeze_unsafe+0x30/0x30 [btrfs] [<ffffffffa05296f4>] ? try_to_freeze_unsafe+0x30/0x30 [btrfs] [<ffffffff8105eb45>] ? kthread+0x81/0x89 [<ffffffff81013291>] ? paravirt_sched_clock+0x5/0x8 [<ffffffff8105eac4>] ? __kthread_parkme+0x5d/0x5d [<ffffffff813b880c>] ? ret_from_fork+0x7c/0xb0 [<ffffffff8105eac4>] ? __kthread_parkme+0x5d/0x5d So it''s not, at least in my case, due to the filesystem approaching full. I''ve seen this behaviour over many kernel versions; the above is with 3.12.5. Charles -- ----------------------------------------------------------------------- Charles Cazabon GPL''ed software available at: http://pyropus.ca/software/ ----------------------------------------------------------------------- -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Torrents are really only one thing my storage server gets hammered with. It also does a lot of other IO-intensive stuff. I actually run enterprise storage drives in a Supermicro server for a reason; even if it is my home setup, consumer stuff just doesn't cut it with my storage abuse :)

It runs KVM virtualisation (not on btrfs though) with several VMs, including Windows machines, does lots of manipulation of large files, offsite backups at 100 mbit/s for days on end, reencoding of large amounts of audio files, runs lots of web sites, constantly streams blu-rays to at least one computer, and chews through enormous amounts of internet bandwidth constantly. Last week it consumed ~10TB of internet bandwidth alone. I was at about 140 mbit/s average throughput on a 100/100 link over a full 7 day week, peaking at 177 mbit/s average over 24 hours, and that is not counting the local gigabit traffic for all the video remuxing and stuff.

In other words, all 19 storage drives in that server are driven really hard, and it is no wonder that this triggers some subtleties that normal users just don't hit.

But since torrenting is clearly the worst offender when it comes to fragmentation, I can comment on that. Using btrfs with partitioning stops me from using the btrfs multidisk handling that I ideally need, so that is really not an option. I also think that if I were to use partitions (no multidisk), no COW and hence no checksumming, I might as well use ext4, which is more optimized for that usage scenario. Ideally I could use just a subvol with nodatacow and quota for this purpose, but per-subvolume nodatacow is not available yet as far as I have understood (correct me if I'm wrong).

What I will do now, as a way of removing the worst offender from messing up the general storage pool, is to shrink the btrfs array from 8x4TB drives in btrfs RAID10 to a 7-disk array, and dedicate a drive to rtorrent, running ext4 with preallocation.

I have, until btrfs, normally just made one large array of all storage drives matching in performance characteristics, thinking that all the data can benefit from the extra IO performance of the array. This has been a good compromise for a limited-budget home setup where ideal storage tiering with SSD hybrid SANs and such is not an option. But as I am now experiencing with btrfs, COW kind of changes the rules in a profound, noticeable, all-the-time way. With COW's inherent random-write-to-large-file fragmentation penalty I think there is no other way than to separate the different workloads into separate storage pools going to different hardware. In my case it would probably mean having one storage pool for general storage, one for VMs and one for torrenting, as all of those react in their own way to COW and will get heavily affected by the other workloads in the worst case if run from the same drives with COW.

Your system of a "cache" is actually already implemented logically in my setup, in the form of a post-processing script that rtorrent runs on completion. It moves completed files into dedicated per-tracker seeding folders, and then makes a copy (using cp --reflink=auto on btrfs) of the file, processes it if needed (tag clean-up, reencoding, decompressing or what not), and then moves it to another "finished" folder. This makes it easy to know what the new stuff is, and I can manipulate, rename and clean up all the data without messing up the seeds.
I think that the "finished" folder could still be located on the RAID10 btrfs volume with COW, as I can use an internal move into the organized archive when I am actually sitting at the computer instead of a drive to drive copy via the network. Regards, H-K On 16 December 2013 00:08, Duncan <1i5t5.duncan@cox.net> wrote:> Hans-Kristian Bakke posted on Sun, 15 Dec 2013 15:51:37 +0100 as > excerpted: > >> # Regarding torrents and preallocation I have actually turned >> preallocation on specifically in rtorrent thinking that it did btrfs a >> favour like with ext4 (system.file_allocate.set = yes). It is easy to >> turn it off. >> Is the "ideal" solution for btrfs and torrenting (or any other random >> writes to large files) to use preallocation and NOCOW, or use no >> preallocation and NOCOW? I am thinking the first, although I still do >> not understand quite why preallocation is worse than no preallocation >> for btrfs with COW enabled (or is both just as bad?) > > I''m not a dev only an admin who follows this list as I run btrfs too, and > thus don''t claim to be an expert on the above -- it''s mostly echoing what > I''ve seen here previously. > > That said, preallocation with nocow is the choice I''d make here. > > Meanwhile, a subpoint I didn''t make explicit previously, tho it''s a > logical conclusion from the explanation, is that once the writing is > finished and the file becomes like most media files effectively read- > only, no further writes, NOCOW is no longer important. That is, you can > (sequentially) copy the file somewhere else and not have to worry about > it. In fact, that''s a reasonably good idea, since NOCOW turns off btrfs > checksumming too, and presumably you''re still interested in maintaining > file integrity on the thing. > > So what I''d do is setup a torrent download dir (or as I mentioned, a > dedicated partition, since I like that sort of thing because it enforces > size discipline on the stuff I''ve downloaded but not fully sorted thru... > that''s what I do with binary newsgroup downloading, which I''ve been doing > on and off since well before bittorrent was around), set/mount it NOCOW/ > nowdatacow, and use it as a temporary download "cache". Then after a > file is fully downloaded to "cache", I''d copy it off to a final > destination in my normal media partition, ultimately removing my NOCOW > copy. > > -- > Duncan - List replies preferred. No HTML msgs. > "Every nonfree program has a lord, a master -- > and if you use the program, he is your master." Richard Stallman > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
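The completion script described above was not posted; a rough sketch of that flow (my own illustration; the tracker name, paths and the processing step are placeholders, and how rtorrent hands the completed path to the script depends on the hook configuration):

  #!/bin/bash
  # Post-processing sketch: seed from a per-tracker folder, work on a reflinked copy,
  # then hand the processed copy to the "finished" folder.
  src=$1
  name=$(basename "$src")
  seed_dir=/storage/seeding/some-tracker
  work_dir=/storage/work
  finished_dir=/storage/finished

  mv "$src" "$seed_dir/$name"                             # keep seeding from the per-tracker folder
  cp --reflink=auto "$seed_dir/$name" "$work_dir/$name"   # nearly free copy while on the same btrfs
  # ... tag clean-up / remuxing / decompression on the working copy goes here ...
  mv "$work_dir/$name" "$finished_dir/$name"              # the cleaned-up copy lands in "finished"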
There are actually more. Like this one: http://iohq.net/index.php?title=Btrfs:RAID_5_Rsync_Freeze It seems to be the exact same issue as I have, as I too can''t do high speed rsyncs writing to the btrfs array without blocking (reading is fine). Mvh Hans-Kristian Bakke On 16 December 2013 00:39, Charles Cazabon <charlesc-lists-btrfs@pyropus.ca> wrote:> Chris Murphy <lists@colorremedies.com> wrote: >> On Dec 14, 2013, at 4:19 PM, Hans-Kristian Bakke <hkbakke@gmail.com> wrote: >> >> > # btrfs fi df /storage/storage-vol0/ >> > Data, RAID10: total=13.89TB, used=12.99TB >> > System, RAID10: total=64.00MB, used=1.19MB >> > System: total=4.00MB, used=0.00 >> > Metadata, RAID10: total=21.00GB, used=17.59GB >> > >> By my count this is ~ 95.6% full. My past experience with other file >> systems, including btree file systems, is they get unpredictably fussy when >> they''re this full. I start migration planning once 80% full is reached, and >> make it a policy to avoid going over 90% full. > > For what it''s worth, I see exactly the same behaviour on a system where the > filesystem is only ~60% full, with more than 5TB of free space. All I have to > do is copy a single file of several gigabytes to the filesystem (over the > network, so it''s only coming in at ~30MB/s) and I get similar task-blocked > messages: > > INFO: task btrfs-transacti:4118 blocked for more than 120 seconds. > Not tainted 3.12.5-custom+ #10 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > btrfs-transacti D ffff88082fd14140 0 4118 2 0x00000000 > ffff880805a06040 0000000000000002 ffff8807f7665d40 ffff8808078f2040 > 0000000000014140 ffff8807f7665fd8 ffff8807f7665fd8 ffff880805a06040 > 0000000000000001 ffff88082fd14140 ffff880805a06040 ffff8807f7665c70 > Call Trace: > [<ffffffff810d1a19>] ? __lock_page+0x66/0x66 > [<ffffffff813b26dd>] ? io_schedule+0x56/0x6c > [<ffffffff810d1a20>] ? sleep_on_page+0x7/0xc > [<ffffffff813b0ad6>] ? __wait_on_bit+0x40/0x79 > [<ffffffff810d1df1>] ? find_get_pages_tag+0x66/0x121 > [<ffffffff810d1ad8>] ? wait_on_page_bit+0x72/0x77 > [<ffffffff8105f540>] ? wake_atomic_t_function+0x21/0x21 > [<ffffffff810d218f>] ? filemap_fdatawait_range+0x66/0xfe > [<ffffffffa0545bb5>] ? clear_extent_bit+0x25d/0x29d [btrfs] > [<ffffffffa052ff9a>] ? btrfs_wait_marked_extents+0x79/0xca [btrfs] > [<ffffffffa0530059>] ? btrfs_write_and_wait_transaction+0x6e/0x7e [btrfs] > [<ffffffffa05307ad>] ? btrfs_commit_transaction+0x651/0x843 [btrfs] > [<ffffffffa05297e8>] ? transaction_kthread+0xf4/0x191 [btrfs] > [<ffffffffa05296f4>] ? try_to_freeze_unsafe+0x30/0x30 [btrfs] > [<ffffffffa05296f4>] ? try_to_freeze_unsafe+0x30/0x30 [btrfs] > [<ffffffff8105eb45>] ? kthread+0x81/0x89 > [<ffffffff81013291>] ? paravirt_sched_clock+0x5/0x8 > [<ffffffff8105eac4>] ? __kthread_parkme+0x5d/0x5d > [<ffffffff813b880c>] ? ret_from_fork+0x7c/0xb0 > [<ffffffff8105eac4>] ? __kthread_parkme+0x5d/0x5d > > > So it''s not, at least in my case, due to the filesystem approaching full. > > I''ve seen this behaviour over many kernel versions; the above is with 3.12.5. 
> > Charles > -- > ----------------------------------------------------------------------- > Charles Cazabon > GPL''ed software available at: http://pyropus.ca/software/ > ----------------------------------------------------------------------- > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hans-Kristian Bakke posted on Mon, 16 Dec 2013 01:06:36 +0100 as excerpted:

> Torrents are really only one thing my storage server gets hammered with.
> It also does a lot of other IO-intensive stuff. I actually run enterprise
> storage drives in a Supermicro server for a reason; even if it is my
> home setup, consumer stuff just doesn't cut it with my storage abuse :)
> It runs KVM virtualisation (not on btrfs though) with several VMs,
> including Windows machines, does lots of manipulation of large files,
> offsite backups at 100 mbit/s for days on end, reencoding of large amounts
> of audio files, runs lots of web sites, constantly streams blu-rays to
> at least one computer, and chews through enormous amounts of internet
> bandwidth constantly. Last week it consumed ~10TB of internet bandwidth
> alone. I was at about 140 mbit/s average throughput on a 100/100 link
> over a full 7 day week, peaking at 177 mbit/s average over 24 hours, and
> that is not counting the local gigabit traffic for all the video
> remuxing and stuff.
> In other words, all 19 storage drives in that server are driven really
> hard, and it is no wonder that this triggers some subtleties that normal
> users just don't hit.

Wow! Indeed!

> But since torrenting is clearly the worst offender when it comes to
> fragmentation, I can comment on that.
> Using btrfs with partitioning stops me from using the btrfs multidisk
> handling that I ideally need, so that is really not an option.

?? I'm not running near what you're running, but I *AM* running multiple independent multi-device btrfs filesystems (raid1 mode) on a single pair of partitioned 256 GB (238 GiB) SSDs, just as pre-btrfs and pre-SSD, I ran multiple 4-way md/raid1 volumes on individual partitions on 4-physical-spindle spinning rust.

Like md/raid, btrfs' multi-device support takes generic block devices. It doesn't care whether they're physical devices, partitions on physical devices, LVM2 volumes on physical devices, md/raid volumes on physical devices, partitions on md-raid on lvm2 on physical devices... you get the idea. As long as you can mkfs.btrfs it, you can run multiple-device btrfs on it.

In fact, I have that pair of SSDs GPT partitioned up, with 11 independent btrfs, 9 of which are btrfs raid1 mode across similar partitions (one /var/log, plus working and primary backup for each of root, /home, the gentoo distro packages tree with sources and binpkgs as well, and a 32-bit chroot that's an install image for my netbook) on each device, with the other two being /boot and its backup on the other device, my only two non-raid1-mode btrfs.

So yes, you can definitely run btrfs multi-device on partition block devices instead of directly on the physical-device block devices, as I know quite well since my setup depends on that! =:^)

> I also
> think that if I were to use partitions (no multidisk), no COW and hence
> no checksumming, I might as well use ext4 which is more optimized for
> that usage scenario. Ideally I could use just a subvol with nodatacow
> and quota for this purpose, but per-subvolume nodatacow is not available
> yet as far as I have understood (correct me if I'm wrong).

Well, if your base assumption, that you couldn't use btrfs multi-device on partitions, only on physical devices, was correct... But it's not.
Which means you /can/ partition if you like, and then use whatever filesystem on those partitions you want, combining multi-device btrfs on some of them with ext4 on md/raid if you want multi-device support for it, since unlike btrfs, ext4 doesn't support multi-device natively.

You could even throw lvm2 in there, if you like, giving you additional sizing and deployment flexibility. Before btrfs here, I actually used reiserfs on lvm2 on mdraid on physical devices, and it worked, but that was complex enough that I wasn't confident of my ability to manage it in a disaster recovery scenario, and lvm2 requires userspace and thus an initr* to handle root on lvm2, while root on mdraid can be handled directly from the kernel command line so no initr* is required, so I kept the mdraid and dropped lvm2.

[snipped further discussion along that invalid assumption line]

> I have, until btrfs, normally just made one large array of all storage
> drives matching in performance characteristics, thinking that all the
> data can benefit from the extra IO performance of the array. This has
> been a good compromise for a limited-budget home setup where ideal
> storage tiering with SSD hybrid SANs and such is not an option. But as I
> am now experiencing with btrfs, COW kind of changes the rules in a
> profound, noticeable, all-the-time way. With COW's inherent
> random-write-to-large-file fragmentation penalty I think there is no
> other way than to separate the different workloads into separate storage
> pools going to different hardware. In my case it would probably mean
> having one storage pool for general storage, one for VMs and one for
> torrenting, as all of those react in their own way to COW and will get
> heavily affected by the other workloads in the worst case if run from
> the same drives with COW.

Luckily, the partitioning thing does work. Additionally, as mentioned, you can set NOCOW on directories and have new files in them inherit that. So you have quite a bit more flexibility than you might have thought. Tho of course it's your system and you may well prefer administering whole physical devices to dealing with partitions, just as I decided lvm2 wasn't appropriate for me, altho many people use it for everything.

> Your system of a "cache" is actually already implemented logically in my
> setup, in the form of a post-processing script that rtorrent runs on
> completion. It moves completed files into dedicated per-tracker seeding
> folders, and then makes a copy (using cp --reflink=auto on btrfs) of the
> file, processes it if needed (tag clean-up, reencoding, decompressing or
> what not), and then moves it to another "finished" folder. This makes it
> easy to know what the new stuff is, and I can manipulate, rename and
> clean up all the data without messing up the seeds.
>
> I think that the "finished" folder could still be located on the RAID10
> btrfs volume with COW, as I can use an internal move into the organized
> archive when I am actually sitting at the computer instead of a
> drive-to-drive copy via the network.

That makes sense.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
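A minimal sketch of the partition-based multi-device layout and the per-directory NOCOW attribute discussed above; the device names, label and mount point are illustrative only, not taken from either poster's actual setup:

    # Build a multi-device btrfs raid1 filesystem across partitions
    # rather than across whole disks.
    mkfs.btrfs -L storage-x0 -d raid1 -m raid1 /dev/sda1 /dev/sdb1

    # Make sure the kernel knows about all member devices, then mount
    # the filesystem by naming any one of them.
    btrfs device scan
    mkdir -p /mnt/storage
    mount /dev/sda1 /mnt/storage

    # Mark a directory NOCOW so that files created in it afterwards
    # inherit the attribute (existing files are not converted).
    mkdir /mnt/storage/torrents
    chattr +C /mnt/storage/torrents
    lsattr -d /mnt/storage/torrents   # should show the 'C' attribute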
Stupid me, I completely forgot that you can run multidisk arrays on plain block-level partitions, just like with md raid! It will introduce a rather significant management overhead in my case though, as managing several individual partitions per drive is quite annoying with so many drives.

What happens if I do cp --reflink=auto from a NOCOW file in a NOCOW folder to a folder with COW set on the same btrfs volume? Do I still get "free" copying, and is the resulting file COW or NOCOW?

Mvh
Hans-Kristian Bakke
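One way to answer the reflink question empirically on a scratch filesystem; this is only a sketch, the paths are made up, and --reflink=always is used instead of auto so that cp fails loudly rather than silently falling back to a full copy:

    # A NOCOW directory and a normal (COW) directory on the same btrfs mount.
    mkdir /mnt/btrfs/nocow /mnt/btrfs/cow
    chattr +C /mnt/btrfs/nocow

    # Create a test file in the NOCOW directory; it inherits the 'C' attribute.
    dd if=/dev/zero of=/mnt/btrfs/nocow/test bs=1M count=64
    lsattr /mnt/btrfs/nocow/test

    # Attempt a reflink copy into the COW directory; cp exits non-zero if
    # the kernel refuses to share the extents, and lsattr on the result
    # shows whether the copy kept the NOCOW attribute.
    cp --reflink=always /mnt/btrfs/nocow/test /mnt/btrfs/cow/test \
        && lsattr /mnt/btrfs/cow/test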
Hans-Kristian Bakke posted on Mon, 16 Dec 2013 11:55:40 +0100 as excerpted:

> Stupid me, I completely forgot that you can run multidisk arrays on
> plain block-level partitions, just like with md raid! It will introduce
> a rather significant management overhead in my case though, as managing
> several individual partitions per drive is quite annoying with so many
> drives.

What I did here, both with mdraid and now with btrfs raid1, is use a parallel partition setup on all target drives. In a couple of special cases it results in some wasted space[1], but for most cases it's possible to simply plan partition sizes so the end result after raid combination is the desired size.

And some years ago I switched to GPT partitions, for checksummed/redundant partition table reliability, partition naming (similar to filesystem labels but in the GPT partition table itself), and to not have to mess with primary/secondary/logical partitions.

That lets me set partition names, which, given the scheme I use for both partition names and filesystem labeling, means I have unique name/label IDs for everything, across multiple machines and with thumb-drives too! The scheme is 15 chars long, reiserfs' max label length since I was using it at the time I designed the scheme. Here's the content of the text file I keep, documenting it:

>>>>>
* 15 characters long

  123456789012345
  ff bbB ymd ssssS t n

Example: rt0005gmd3+9bc0

Function:
  ff: 2-char function abbreviation (bt/rt/lg, etc)

Device ID (size, brand, 0-based number/letter):
  ssssS: 4-digit size, 1-char multiplier (m=meg, g=gig, etc)
         This is the size of the underlying media, NOT the partition!
  bbB:   2-char media brand ID, 1-digit sequence number.
         pa=patriot, io=iomega, md=md/raid, etc.
         Note that /dev/md10=mda...

Target/separator:
  t: 1-char target ID and separator.
     .=aa1, +=workstation, %=both (bt/swap on portable disk)

Date code:
  ymd: 1-char-each year/month/day prepared
       y=last digit of year
       m=month (1-9abc)
       d=day (1-9a-v)

Number (working, backup-n):
  n: copy number (zero-based)

So together, we have 2 chars of function, 8 of size/mfr/n as device-id, 1 of target/separator, 3 of date prepared, 1 of copy number.

So our example: rt0005gmd3+9bc0
  rt=root
  0005gmd3=5-gig /dev/md3
  +=targeted at the workstation
  9bc0=2009.1112 (Nov. 12), first/main version.
<<<<<

For a multi-device btrfs, I set the "hardware" sequence number appropriately for the partitions, with the filesystem label identical, except its "hardware" sequence number is "x", indicating it's across multiple hardware devices. The "filesystem" sequence number, meanwhile, is 0 for the working copy, 1 for the primary backup, etc.

With that scheme, I have both the partitions and the filesystems on top of them uniquely labeled with function, hardware/media ID (brand, size, sequence number), target machine, and partition/filesystem ID (date of layout, working/backup sequence number). If it ever got /too/ complex I could keep a list of them somewhere, but so far it hasn't gotten beyond the context-manageable scope level, so between seeing the name/label and knowing the context of what I'm accessing, I've been able to track it without resorting to a written tracking list.

But that's not to say you gotta do what I do. If you still find the administrative overhead of all those partitions too high, well so be it. This is just the solution that I've come up with after a couple of decades of incremental modification, to where it works pretty well for me now.
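For reference, GPT partition names and btrfs labels of the kind this scheme produces can be set with standard tools; the device, partition number, mount point and label below are illustrative, not Duncan's actual values:

    # Store the name in the GPT itself (partition 3 on /dev/sdb here).
    sgdisk --change-name=3:"rt0005gmd3+9bc0" /dev/sdb

    # Give the filesystem on top a matching label, either at mkfs time
    # or later on an existing filesystem.
    mkfs.btrfs -L "rt0005gmd3+9bc0" /dev/sdb3
    btrfs filesystem label /mnt/root "rt0005gmd3+9bc0"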
If some of the stuff I've come up with the hard way makes useful hints for someone else, great. Otherwise, just ignore it and do what works for you. It's your system and you're the one dealing with it, after all, not mine/me. =:^)

> What happens if I do cp --reflink=auto from a NOCOW file in a NOCOW
> folder to a folder with COW set on the same btrfs volume? Do I still get
> "free" copying, and is the resulting file COW or NOCOW?

I don't actually know, as that doesn't fit my use case so well, tho a comment I read awhile back hinted it may spit out an error. FWIW, I tend to either use symlinks one direction or the other, or I'm trying to keep deliberately redundant backups where I don't want potential damage to kill the common-due-to-COW parts of both files, so I don't actually tend to find reflinks particularly useful here, even if I appreciate the flexibility that option allows.

---
[1] For instance, swap with a hibernate image, back before I switched to SSD (hibernate was broken on my new machine, last I checked about a year ago, and I've enough memory on this machine that I usually don't fill it even with cache, so I wasn't using hibernate or swap anyway when I switched to SSD). The hibernate image must ordinarily be on a single device and should be half the size of RAM or so to avoid dumping cache to fit, but making all the parallel swap partitions that size made for a prohibitively large swap that, even if I WERE to need it, would take well longer to transfer all those gigs to/from spinning rust than I'd want to take, so one way or another, I'd never actually use it all.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
On Sun, 2013-12-15 at 03:35 +0100, Hans-Kristian Bakke wrote:

> I have done some more testing. I turned off everything using the disk
> and only did defrag. I have created a script that gives me a list of
> the files with the most extents. I started from the top to improve the
> fragmentation of the worst files. The most fragmented file was a file
> of about 32GB with over 250 000 extents!
>
> It seems that I can defrag two to three largish (15-30GB), ~100 000
> extent files just fine, but after a while the system locks up (not a
> complete hard lock, but everything hangs and a restart is necessary to
> get a fully working system again).
>
> It seems like defrag operations are triggering the issue. Probably in
> combination with the large and heavily fragmented files.

I'm trying to understand how defrag factors into your backup workload? Do you have autodefrag on, or are you running a defrag as part of the backup when you see these stalls?

If not, we're seeing a different problem.

-chris
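The extent-listing script mentioned in the quote was not posted to the list; a rough equivalent built on filefrag might look like this (the mount point is the one used elsewhere in the thread, and filenames containing ": " would confuse the simple parsing):

    #!/bin/sh
    # Print the 25 files with the most extents, worst first.
    find /storage/storage-vol0 -xdev -type f -print0 \
      | xargs -0 filefrag 2>/dev/null \
      | awk -F': ' '{ n = $NF; sub(/ extents? found/, "", n); print n, $1 }' \
      | sort -rn \
      | head -n 25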
Ok, I guess the essence has been lost in the meta discussion.

Basically I get blocking for more than 120 seconds during these workloads:

- defragmenting several large fragmented files in succession (leaving time for btrfs to finish writing each file). This has *always* happened on my array, even when it just consisted of 4x4TB drives.

or

- rsyncing *to* the btrfs array from another internal array (rsync -a <source_on_ext4_mdadm_array> <dest_on_btrfs_raid10_array>)

rsyncing *from* the btrfs array is not a problem, so my issue seems to be contained to heavy writing. This is happening even if the server is doing nothing else, no backups, no torrenting, no copying. The only "external" thing that is happening is a regular poll from smartd to the drives and regular filesystem size checks from check_mk (Icinga monitoring).

The FS has a little over 3 TB free (of 29 TB available for RAID10 data and metadata) and contains mainly largish files like FLAC files, photos and large mkv files, ranging from 250 MB to around 70 GB, with one subvolume and one snapshot of that subvolume. "find /storage/storage-vol0/ -xdev -type f | wc -l" gives a result of 131 820 files. No hard linking is used.

I am currently removing a drive from the array, reducing the number of drives from 8 to 7. The rebalance has not blocked for more than 120 seconds yet, but it is clearly blocking for quite a few seconds once in a while, as all other software using the drives can't get anything through and hangs for a period.

I do expect slowdowns during heavy load, but not blocking. The ext4 mdadm RAID6 array in the same server has only been slow during heavy load, but never blocked noticeably.

Mvh
Hans-Kristian Bakke
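A sketch of the defragment-in-succession workload described above, assuming a plain-text list of the worst files (one path per line), such as the filefrag listing sketched earlier; the filename and the pause between files are arbitrary:

    #!/bin/sh
    # Defragment the listed files one at a time, syncing in between so
    # each file is fully written out before the next defrag starts.
    while IFS= read -r f; do
        btrfs filesystem defragment "$f"
        sync
        sleep 30
    done < worst-files.txt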
Ok, and do you have autodefrag enabled on the btrfs FS you are copying to? Also, how much RAM do you have?

-chris
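Both questions can be checked from a shell on the affected machine with standard tools; a brief sketch:

    # autodefrag shows up in the mount option list only if it is enabled.
    grep btrfs /proc/mounts

    # Installed and available memory.
    free -m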
I have explicitly set compress=lzo, and later noatime just to test now; otherwise it's just the default 3.12.4 options (or 3.13-rc2 when I tested that).

To make sure, here are my btrfs mounts from /proc/mounts:

/dev/sdl /btrfs btrfs rw,noatime,compress=lzo,space_cache 0 0
/dev/sdl /storage/storage-vol0 btrfs rw,noatime,compress=lzo,space_cache 0 0

/etc/fstab:

UUID=9302fc8f-15c6-46e9-9217-951d7423927c /btrfs btrfs defaults,compress=lzo,noatime 0 2
UUID=9302fc8f-15c6-46e9-9217-951d7423927c /storage/storage-vol0 btrfs defaults,subvol=storage-vol0,noatime 0 2

Hardware:
CPU: Intel Xeon X3430 (Quad Core)
MB: Supermicro X8SI6-F
RAM: 16GB (4x4GB) Samsung ECC/Unbuffered DDR3 1333MHz CL9 (MEM-DR340L-SL01-EU13)
HDDs in btrfs RAID10: 8 x Western Digital Se 4TB 64MB 7200RPM SATA 6Gb/s (WD4000F9YZ)
HBAs: LSI SAS 9211-8i, LSI SAS 9201-16i

Mvh
Hans-Kristian Bakke
Ok, could you please capture the dmesg output after a sysrq-w during one of the stalls during rsync writing? We want to see all the stack traces of all the waiting procs.

Defrag is a slightly different use case, so I want to address that separately.

-chris
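For anyone unfamiliar with the procedure, capturing what is asked for here looks roughly like this (standard paths, run as root during a stall):

    # Writing to /proc/sysrq-trigger works for root even if the sysrq
    # sysctl is 0; 'w' dumps the stacks of all blocked (D state) tasks.
    echo w > /proc/sysrq-trigger

    # Save the traces from the kernel log to attach to a reply.
    dmesg > sysrq-w-$(date +%s).txt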
No problem. You have to wait a bit though, as the volume is currently going through a reduction in the number of drives from 8 to 7, and I do not feel comfortable stalling the volume while that is happening. I will report back with the logs later on.

Mvh
Hans-Kristian Bakke