Hello,

More than 12 hours ago, I tried to umount a btrfs filesystem. Something
involving btrfs-cleaner and btrfs-transacti is still running, but I
don't know what.

I have noticed excessively long umount times before, and it is a
significant concern for me.

A bit of background:

The filesystem in question involves two 2TB USB hard drives. It is 49%
full. Data is RAID0, metadata is RAID1. The files stored on it are for
BackupPC, meaning there are many, many directories and hardlinks. I
would estimate 30 million inodes in use, and many of them have dozens of
hardlinks to them.

These disks used to be formatted with ext4. I used dump(8) to back them
up, created a fresh btrfs filesystem, and used restore(8) to load the
data onto it.

Now then: btrfs seemed to be extremely slow creating hard links. Slow to
the tune of taking hours longer than ext4 to do the same task, and often
triggering kernel "task blocked for more than 120 seconds" warnings.

I thought perhaps converting metadata to raid0 would help, so I started
a btrfs balance start -mconvert=raid0 on it. According to btrfs fi df,
it churned through the first 900MB out of 26GB of metadata in quick
order, but then the amount of RAID0 metadata bounced up and down between
about 950MB and 1019MB -- always just shy of 1GB. There was an active
rsync job to the disk during this time.

With no apparent progress even after hours, I tried to cancel the
balance. My cancel command did not return even after waiting hours.
Finally I rebooted and mounted the FS with the skip_balance option so
the balance would not restart; it then canceled in a few minutes. dstat
showed all was quiet on the disk. So I thought I would unmount it,
remount it normally, and start the convert again.

It is that unmount that has been sitting ever since. According to dstat,
it reads about 360K per second, every so often writing out about 25MB
per second. And it's been doing this for 12 hours.

It seems I have encountered numerous problems here:

 * I/O starvation on link(2) and perhaps also unlink(2)
 * btrfs balance convert making no progress after many hours
 * btrfs balance cancel not stopping anything
 * umount taking hours

The umount is still pending, so if there is any debugging I can do,
please let me know.

Kernel 3.10 from Debian wheezy backports on i386.

Thanks,

John
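For reference, a minimal sketch of the command sequence described in the
message above; the mount point /mnt/backuppc and the device placeholder
are hypothetical, and exact behaviour varies with the btrfs-progs and
kernel versions in use:

  # Show chunk allocation per profile (here: data RAID0, metadata RAID1).
  btrfs filesystem df /mnt/backuppc

  # Convert only the metadata chunks to raid0; data chunks are untouched.
  btrfs balance start -mconvert=raid0 /mnt/backuppc

  # Ask a running balance to stop; this returns only after the block
  # group currently being relocated has been finished, which can take
  # a long time on slow disks.
  btrfs balance cancel /mnt/backuppc

  # After a reboot, mount without resuming the interrupted balance.
  mount -o skip_balance <device> /mnt/backuppc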
John Goerzen posted on Tue, 05 Nov 2013 07:42:02 -0600 as excerpted:

> Hello,
>
> More than 12 hours ago, I tried to umount a btrfs filesystem. Something
> involving btrfs-cleaner and btrfs-transacti is still running, but I
> don't know what.
>
> I have noticed excessively long umount times before, and it is a
> significant concern for me.
>
> A bit of background:
>
> The filesystem in question involves two 2TB USB hard drives. It is 49%
> full. Data is RAID0, metadata is RAID1. The files stored on it are for
> BackupPC, meaning there are many, many directories and hardlinks. I
> would estimate 30 million inodes in use and many of them have dozens of
> hardlinks to them.

That's a bit of a problem for btrfs at this point, as you rightly mention.

> I thought perhaps converting metadata to raid0 would help. So I
> started a btrfs balance start -mconvert=raid0 on it.

> Kernel 3.10 from Debian wheezy backports on i386.

There's a known bug with balance on current kernels related to
pre-allocated space (as with the systemd journal or torrent files with
some clients). A patch is available and queued for 3.13 and then for
stable (which doesn't take patches unless they're in current mainline
already), but while 3.12 is out and the 3.13 commit window would normally
be open, Linus is taking a week off for travel without a good internet
connection, so the 3.13 kernel commit window is delayed a week. Which
means this patch is likely to be delayed a couple of weeks before it
reaches stable. =:^(

Here's a link to the post with the patch:

[PATCH] Btrfs: relocate csums properly with prealloc extents
http://permalink.gmane.org/gmane.comp.file-systems.btrfs/28733

I'd suggest applying that to the latest 3.12 kernel and trying the
balance again. Unfortunately that means an unsafe reboot without a
remount read-only or unmount, but...

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
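A rough sketch of one way to follow that suggestion, assuming a vanilla
3.12 source tree and that the patch from the gmane post has been saved
locally as btrfs-prealloc-csums.patch (a hypothetical filename);
Debian-specific kernel packaging steps are omitted:

  cd linux-3.12

  # Verify the patch applies cleanly before touching the tree.
  patch -p1 --dry-run < ../btrfs-prealloc-csums.patch
  patch -p1 < ../btrfs-prealloc-csums.patch

  # Reuse the running kernel's configuration, then build and install.
  cp /boot/config-"$(uname -r)" .config
  make olddefconfig
  make -j"$(nproc)" bzImage modules
  make modules_install install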
Duncan <1i5t5.duncan <at> cox.net> writes:

> John Goerzen posted on Tue, 05 Nov 2013 07:42:02 -0600 as excerpted:
>
> > The filesystem in question involves two 2TB USB hard drives. It is 49%
> > full. Data is RAID0, metadata is RAID1. The files stored on it are for
> > BackupPC, meaning there are many, many directories and hardlinks. I
> > would estimate 30 million inodes in use and many of them have dozens of
> > hardlinks to them.
>
> That's a bit of a problem for btrfs at this point, as you rightly mention.

Hi Duncan,

Thank you very much for taking the time to reply. Can you clarify a bit
about what sort of problems I might expect to encounter with this sort
of setup on btrfs?

> > I thought perhaps converting metadata to raid0 would help. So I
> > started a btrfs balance start -mconvert=raid0 on it.
>
> > Kernel 3.10 from Debian wheezy backports on i386.
>
> There's a known bug with balance on current kernels related to
> pre-allocated space (as with the systemd journal or torrent files with
> some clients).

[snip]

> http://permalink.gmane.org/gmane.comp.file-systems.btrfs/28733

I'm almost completely sure that this bug wasn't being hit. The files
were streamed back by restore(8), and a few written by BackupPC. I
checked the source to both just to make sure, and neither has a call to
fallocate. I do not believe there were sparse files on the disk either.
I also haven't experienced the csum errors mentioned in the post.

Thanks again,

-- John
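As a cross-check that does not require reading application source, one
could look for preallocated (unwritten) extents directly on the
filesystem; a minimal sketch, assuming the filesystem is mounted at
/mnt/backuppc (a hypothetical path) and that scanning a sample of files
rather than all 30 million inodes is acceptable:

  # filefrag -v lists each file's extents; preallocated extents carry
  # the "unwritten" flag. Scan a sample of files and report any hits.
  find /mnt/backuppc -type f | head -n 10000 | while read -r f; do
      if filefrag -v "$f" 2>/dev/null | grep -q unwritten; then
          echo "preallocated extent: $f"
      fi
  done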
John Goerzen posted on Tue, 05 Nov 2013 16:11:56 +0000 as excerpted:

> Duncan <1i5t5.duncan <at> cox.net> writes:
>
>> John Goerzen posted on Tue, 05 Nov 2013 07:42:02 -0600 as excerpted:
>>
>> > The filesystem in question involves two 2TB USB hard drives. It is
>> > 49% full. Data is RAID0, metadata is RAID1. The files stored on it
>> > are for BackupPC, meaning there are many, many directories and
>> > hardlinks. I would estimate 30 million inodes in use and many of
>> > them have dozens of hardlinks to them.
>>
>> That's a bit of a problem for btrfs at this point, as you rightly
>> mention.

> Can you clarify a bit about what sort of problems I might expect to
> encounter with this sort of setup on btrfs?

I'm not a dev, nor do I run that sort of setup, so I won't attempt a lot
of detail. This is admittedly a bit handwavy, but if you need more, just
use it as a place to start for your own research.

That out of the way, having followed the list for a while, I've seen
several reports of complications related to high hardlink counts, mostly
exactly like yours, involving "unresponsive for N seconds" warnings and
inordinately long processing times for unmounts, etc.

Additionally, it's worth noting that until relatively recently (the wiki
changelog page says 3.7), btrfs had a rather low limit on hardlinks in a
single directory, which people using btrfs for hardlink-intensive
purposes kept hitting. A developer could give you more details, but
IIRC, the solution that worked around that, while it /did/ give btrfs
the ability to handle them, effectively created a setup where the first
few hardlinks are handled inline and thus are reasonably fast, but
beyond that limit an indirect referencing scheme is used that is rather
less efficient.

I'd guess btrfs' current problems in that regard are thus two-fold:
first, above a certain level the implementation /does/ get less
efficient, and second, given the relatively recent kernel 3.7
implementation, btrfs' large-numbers-of-hardlinks code hasn't had nearly
the time to shake out the bugs and pick up incremental optimizations
that the more basic code has had.

I doubt btrfs will ever be a speed demon in this area, but I expect that
given another year or so, the high-numbers hardlink code will be
somewhat better optimized and tested, simply due to the incremental
effect of bug shakeout and small code changes over time as btrfs
continues maturing.

Meanwhile, my own interest in btrfs is as a filesystem for SSDs (I still
use reiserfs on my spinning rust, and I've had very good luck with it
even through various shoddy hardware experiences since the
ordered-by-default code went in around 2.6.16, IIRC, but its journaling
isn't well suited to SSDs), and in being able to actually use btrfs'
data checksumming and integrity features, which means raid1 or raid10
mode (raid1 in my case). The speed of SSDs mitigates to a large degree
the slowness I see others reporting for this and other cases.
Additionally, I run several independent smaller partitions, so if there
/is/ a problem the damage is contained, which means I'm typically
dealing with double-digit gigs per partition at most. That reduces
full-partition scrub and rebalance times from the hours to days I see
people reporting on-list for multi-terabyte spinning rust down to
typically seconds, perhaps a couple of minutes, here. The time is short
enough that I typically use the don't-background option and run the
scrub/balance in real time, waiting for the result.
Needless to say, if a full balance is going to take days, you don't run
it very often; but since it's only a couple of minutes here, I scrub and
balance reasonably frequently -- say, if I have a bad shutdown (I use
suspend-to-RAM, and sometimes on resume the SSDs don't stabilize fast
enough for the kernel, so a device drops from the btrfs raid1 and the
whole system goes unstable after that, often leading to a bad shutdown
and reboot). Since a full balance involves rewriting everything to new
chunks, that tends to limit bitrot and the chance for any errors to
build up over time.

My point being that my particular use-case is pretty much diametrically
opposite yours! For your backups use-case, I'd probably use something
less experimental than btrfs, like xfs or ext4 with ordered
journaling... or the reiserfs I still use on spinning rust, tho people's
experience with it seems to be either really good or really bad, and
while mine is definitely good, that doesn't mean yours will be.

-- 
Duncan - List replies preferred. No HTML msgs.
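A minimal sketch of the foreground scrub-and-balance routine described
above, assuming a small btrfs raid1 filesystem mounted at /mnt/ssd (a
hypothetical mount point) and btrfs-progs of roughly the vintage
discussed in this thread:

  # -B keeps the scrub in the foreground and prints the statistics when
  # it completes instead of returning immediately.
  btrfs scrub start -B /mnt/ssd

  # A full balance rewrites every chunk; on a filesystem of a few tens
  # of gigabytes this finishes in seconds to minutes, and balance start
  # runs in the foreground here.
  btrfs balance start /mnt/ssd

  # The scrub result can also be queried afterwards.
  btrfs scrub status /mnt/ssd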
Tomasz Chmielewski
2013-Nov-05 18:46 UTC
Re: umount waiting for 12 hours and still running
> More than 12 hours ago, I tried to umount a btrfs filesystem.
> Something involving btrfs-cleaner and btrfs-transacti is still
> running, but I don't know what.

Does "iostat -x 1" or "iostat -k 1" show any disk activity?

Anything interesting in dmesg?

-- 
Tomasz Chmielewski
http://wpkg.org
On 11/05/2013 12:46 PM, Tomasz Chmielewski wrote:
>> More than 12 hours ago, I tried to umount a btrfs filesystem.
>> Something involving btrfs-cleaner and btrfs-transacti is still
>> running, but I don't know what.
>
> Does "iostat -x 1" or "iostat -k 1" show any disk activity?

Yes. For instance, from iostat -x 1:

Device:  rrqm/s  wrqm/s     r/s     w/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sdb        0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
sda        0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
dm-0       0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00
sdc        0.00    0.00  104.00    0.00  416.00    0.00     8.00     1.03  10.08   10.08    0.00   9.31  96.80
sdd        0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00   0.00    0.00    0.00   0.00   0.00

sdc and sdd are the drives in this btrfs FS, and they are used for
nothing but that. iostat -k 1 shows similar levels of activity.

> Anything interesting in dmesg?

There was this when it was first mounted:

Nov  4 11:28:43 erwin kernel: [  200.669110] btrfs: use lzo compression
Nov  4 11:28:43 erwin kernel: [  200.669114] btrfs: disk space caching is enabled
Nov  4 11:28:58 erwin kernel: [  215.660695] BTRFS debug (device dm-15): unlinked 1 orphans
Nov  4 11:28:58 erwin kernel: [  215.673535] btrfs: force skipping balance

(I later canceled the balance.)

Also, several like this:

Nov  4 11:51:23 erwin kernel: [ 1560.552129] INFO: task btrfs-transacti:6775 blocked for more than 120 seconds.
Nov  4 11:51:23 erwin kernel: [ 1560.552200] btrfs-transacti D 90d3d4a2     0  6775      2 0x00000000
Nov  4 11:51:23 erwin kernel: [ 1560.553136]  [<f85852b6>] ? wait_current_trans.isra.20+0x8b/0xb5 [btrfs]
Nov  4 11:51:23 erwin kernel: [ 1560.553217]  [<f858740d>] ? start_transaction+0x1db/0x46f [btrfs]
Nov  4 11:51:23 erwin kernel: [ 1560.553267]  [<f85876e0>] ? btrfs_attach_transaction+0xd/0x10 [btrfs]
Nov  4 11:51:23 erwin kernel: [ 1560.553316]  [<f8580963>] ? transaction_kthread+0xa3/0x158 [btrfs]
Nov  4 11:51:23 erwin kernel: [ 1560.553366]  [<f85808c0>] ? try_to_freeze+0x28/0x28 [btrfs]

But that was before the umount. Nothing else.

-- John