Thomas Kuther
2014-Jan-13 10:29 UTC
Re: Issues with "no space left on device" maybe related to 3.13
On 13.01.2014 08:25, Duncan wrote:

> [This mail was also posted to gmane.comp.file-systems.btrfs.]
>
> Thomas Kuther posted on Mon, 13 Jan 2014 00:05:25 +0100 as excerpted:
>
> [Rearranged to standard quote/reply order so replies are in context.
> Top-posting is irritating to try to reply to.]

Oops, sorry. It was already too late at night for that second mail
yesterday.

>> On 12.01.2014 21:24, Thomas Kuther wrote:
>>>
>>> I'm experiencing an interesting issue with the BTRFS filesystem on
>>> my SSD drive. It first occurred some time after the upgrade to
>>> kernel 3.13-rc (-rc3 was my first 3.13-rc), but I'm not sure if it
>>> is related.
>>>
>>> The obvious symptoms are that services on my system started
>>> crashing with "no space left on device" errors.
>>>
>>> └» mount | grep "/mnt/ssd"
>>> /dev/sda2 on /mnt/ssd type btrfs
>>> (rw,noatime,compress=lzo,ssd,noacl,space_cache)
>>>
>>> └» btrfs fi df /mnt/ssd
>>> Data, single: total=113.11GiB, used=90.02GiB
>>> System, DUP: total=64.00MiB, used=24.00KiB
>>> System, single: total=4.00MiB, used=0.00
>>> Metadata, DUP: total=3.00GiB, used=2.46GiB
>
> This shows only half the story, though. You also need the output of
> "btrfs fi show /mnt/ssd". Btrfs fi show displays how much of the
> total available space is chunk-allocated; btrfs fi df displays how
> much of the chunk allocation for each type is actually used. Only
> with both of them is the picture complete enough to actually see
> what's going on.

└» sudo btrfs fi show /mnt/ssd
Label: none  uuid: 52bc94ba-b21a-400f-a80d-e75c4cd8a936
        Total devices 1 FS bytes used 93.22GiB
        devid    1 size 119.24GiB used 119.24GiB path /dev/sda2

Btrfs v3.12

└» sudo btrfs fi df /mnt/ssd
Data, single: total=113.11GiB, used=90.79GiB
System, DUP: total=64.00MiB, used=24.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=3.00GiB, used=2.43GiB

So, this looks like it's really full.
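Tallying those numbers (DUP chunks occupy twice their nominal size):
113.11 GiB data + 2 x 3.00 GiB metadata + 2 x 64 MiB + 4 MiB system
comes to roughly 119.24 GiB, which is exactly the device size shown
above. In other words, every byte of the device is chunk-allocated,
even though only about 93 GiB of it holds actual data.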
>>> I use snapper on two subvolumes of that BTRFS volume (/ and
>>> /home), each keeping 7 daily snapshots and up to 10 hourly ones.
>>>
>>> When I saw those errors I started to delete most of the older
>>> snapshots, and the issue went away instantly, but that can hardly
>>> count as a solution or even a workaround.
>>>
>>> I do though have a "usual suspect" on that BTRFS volume: a KVM
>>> disk image of a Win8 VM (I _need_ Adobe Lightroom).
>>>
>>> » lsattr /mnt/ssd/kvm-images/
>>> ---------------C /mnt/ssd/kvm-images/Windows_8_Pro.qcow2
>>>
>>> So the image has CoW disabled. Now comes the interesting part:
>>> I'm trying to copy the image off to my raid5 array (BTRFS on top
>>> of an mdraid 5 - absolutely no issues with that one), but the cp
>>> process seems like it's stalled.
>>>
>>> After one hour the size of the destination copy is still 0 bytes.
>>> iotop almost constantly shows values like
>>>
>>>  TID PRIO USER  DISK READ  DISK WRITE  SWAPIN     IO  COMMAND
>>> 4636 be/4 tom   14.40 K/s   0.00 B/s   0.00 %  0.71 % cp
>>> /mnt/ssd/kvm-images/Windows_8_Pro.qcow2 .
>>>
>>> It tries to read the file at some 14 K/s and writes absolutely
>>> nothing.
>>>
>>> Any idea what's going wrong here, or suggestions how to get that
>>> qcow file copied off? I do have a backup, but honestly that one
>>> is quite aged, so simply rm'ing it would be the very last thing
>>> I'd like to try.
>
> OK. There's a familiar known-troublesome pattern here that your
> situation fits... with one difference that I had previously
> /thought/ would ameliorate the problem. Either you didn't catch the
> problem soon enough, or the root issue is more complex than I at
> first understood (quite possible, since while I'm a regular on the
> list and thus see the common issues posted, I'm just a btrfs
> user/admin, not a dev, btrfs or otherwise).
>
> The base problem is that btrfs is normally a copy-on-write
> filesystem, and frequently internally-rewritten files (as opposed to
> sequential-write append-only or write-once read-many files) are in
> general a COW filesystem's worst case. The larger the file and the
> more frequently it is partially rewritten, the worse it gets, since
> every small internal write COWs the area being written to somewhere
> else, quickly fragmenting large routinely internally-rewritten files
> such as VM images into hundreds of thousands of extents! =:^(
>
> In general, btrfs has two methods to help deal with that. For
> smaller files the autodefrag mount option can help. For larger
> files autodefrag can be a performance issue in itself due to write
> magnification (each small internal write triggering a rewrite of the
> entire multi-gig file), but there's the NOCOW extended attribute,
> which is what /has/ been recommended for these things, as it's
> supposed to tell the filesystem to do in-place rewrites instead of
> COW. That doesn't seem to have worked for you, which is the
> interesting bit, but it's possible that's an artifact of how it was
> handled. Additionally, there's the snapshot aspect throwing further
> complexity into the works, as described below.
>
> OK, so the file has NOCOW (the +C xattribute) set, which is good.
> *BUT*, when/how did you set it? On btrfs that can make all the
> difference!
>
> The caveat with NOCOW on btrfs is that in order to be properly
> effective, NOCOW must be set on the file when it's first created,
> before there's actually any data in it. If the attribute is not set
> until later, when the file is not zero-size, behavior isn't what one
> might expect or desire -- simply stated, it doesn't work.
>
> The simplest way to ensure that a file gets the NOCOW attribute set
> while it's still empty is to set the attribute on the parent
> directory before the file is created in the first place. Any newly
> created files will then automatically inherit the directory's
> attribute, and thus will be set NOCOW from the beginning.

I created the subvolume /mnt/ssd/kvm-images and set +C on it. Then I
moved the VM image in there. So the attribute for the file was
inherited from the parent directory at creation time, yes.

> A second method is to do it manually by first creating the
> zero-length file using touch, then setting the NOCOW attribute using
> chattr +C, and only /then/ copying the content into it. However,
> this is rather difficult for files created by other processes, so
> the directory inheritance method is generally recommended as the
> simplest one.
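Spelled out in shell terms, the two approaches look roughly like this
(paths are illustrative only; a subvolume behaves the same as a plain
directory here):

  # Method 1: mark the directory first; new files inherit +C
  mkdir /mnt/ssd/kvm-images
  chattr +C /mnt/ssd/kvm-images
  qemu-img create -f qcow2 /mnt/ssd/kvm-images/vm.qcow2 40G

  # Method 2: create the file empty, mark it, then fill it in place
  touch vm.img
  chattr +C vm.img
  cat /path/to/existing-image.qcow2 > vm.img  # writes into the +C inode

  lsattr vm.img   # should show ---------------C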
> So now the question is: the file has NOCOW set as recommended, but
> was it set before the file had content in it, as required, or was
> NOCOW only set later, on the existing file with its existing
> content, thus in practice nullifying the effect of setting it at
> all?
>
> Meanwhile, the other significant factor here is the snapshotting.
> In VM-image cases *WITHOUT* the NOCOW xattr properly set, heavy
> snapshotting of a filesystem with VM images is a known extreme worst
> case of the worst cases, with *EXTREMELY* bad behavior
> characteristics that don't scale well at all. Attempting to work
> with such a file ties the filesystem up in huge knots, such that
> very little forward progress can be made, period. We're talking
> days or even weeks to do what /should/ have taken a few minutes, due
> to the *SEVERE* scaling issues. They're working on the problem, but
> it's a tough one to solve, and its scale only recently became
> apparent.

I do not have any snapshots of that specific kvm-images subvolume,
for exactly those reasons. There are some snapshots of other
subvolumes (/ and /home), but only a handful, dating back a few days.

> Actually, the current theory is that the recent changes to make
> defrag snapshot-aware may have triggered the severe scaling issues
> we're seeing now. Before that, the situation was bad, but
> apparently not so horribly broken as to not work at all, as it is
> now.
>
> But as I said, the previous recommendation has been to NOCOW the
> file to prevent the problem from ever appearing in the first place.
>
> Which you have apparently done, and the problem is still there,
> except that we don't know yet whether you set NOCOW effectively,
> probably using the inheritance method, or not. If you set it
> effectively, then the problem is worse, MUCH worse, than thought,
> since the recommended workaround doesn't actually work around
> anything. But if you set it too late to be effective, then the
> problem is simply another instance of the already known issue.

So it seems I hit the worst case.

> As for how to manage the existing file, you seem to have figured
> that out already, below...
>
>>> PS: please reply-to-all, I'm not subscribed. Thanks.
>
> OK. I'm doing so here, but please remind me in every reply.
>
> FWIW, I read and respond to the list as a newsgroup using
> gmane.org's list2news service and normally reply to the
> "newsgroup", which gets forwarded to the list. So I'm not actually
> using a mail client but a news client, and replying to both author
> and newsgroup/list isn't particularly easy, nor do I do it often, so
> a reminder with every reply does help me remember.

Hmm, using NNTP is a good idea, actually.

>> I did some more digging, and I think I have two maybe unrelated
>> issues here.
>>
>> The "no space left on device" could be caused by the amount of
>> metadata used. I defragmented the KVM image and other parts, ran a
>> "balance start -dusage=5", and now it looks like
>>
>> └» btrfs fi df /
>> Data, single: total=113.11GiB, used=88.83GiB
>> System, DUP: total=64.00MiB, used=24.00KiB
>> System, single: total=4.00MiB, used=0.00
>> Metadata, DUP: total=3.00GiB, used=2.40GiB
>
> Just as a hint, you can get rid of that extra system chunk (the
> empty single one) by doing a balance -sf (system, force; force is
> necessary when balancing system chunks on their own rather than as
> part of metadata). Since that's only a few KiB of actual system
> data, it should go fast, and you won't have that second system chunk
> displayed any more. =:^)

OK, will do. Thanks!
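For the archives, if I read the manpage right, that would be
something along the lines of (exact syntax may vary between
btrfs-progs versions):

  └» sudo btrfs balance start -f -s /mnt/ssd

where -s selects the system chunks only, and -f is the force flag a
system-only balance insists on.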
>> The issue with copying/moving off the KVM image still remains:
>> using "cp" or "mv" hangs. Interestingly, what did work was using
>> "qemu-img convert -O raw ...", so now I have a fresh backup at
>> least. The VM works just fine with the original image file. I
>> really wonder what goes wrong with cp and mv.
>
> They're apparently getting caught up in that 100k-extents snapshot
> scaling morass...

Even when the subvolume in question has no snapshots, and never had
any?

> But *THANKS* for the qemu-img convert idea. I haven't set up any
> VMs here, so I didn't know about that at all. At least now I can
> pass on something that should actually let people get a backup to
> work with. =:^)
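For anyone finding this thread later, the command pattern is roughly
(source path as above, destination path just an example):

  └» qemu-img convert -O raw /mnt/ssd/kvm-images/Windows_8_Pro.qcow2 \
       /mnt/btrfs/backup/Windows_8_Pro.raw

Adding -p makes qemu-img print conversion progress, which helps given
how long a 90 GiB image can take.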
> Meanwhile...
>
>> And I stumbled over a third issue, with my raid5 array:
>>
>> └» df -h | grep /mnt/btrfs
>> /dev/md0        5,5T  3,4T  2,1T  63% /mnt/btrfs
>>
>> └» sudo btrfs fi df /mnt/btrfs/
>> Data, single: total=3.33TiB, used=3.33TiB
>> System, DUP: total=8.00MiB, used=388.00KiB
>> System, single: total=4.00MiB, used=0.00
>> Metadata, DUP: total=56.12GiB, used=5.14GiB
>> Metadata, single: total=8.00MiB, used=0.00
>
> Again, you can use balance to get rid of those unused single chunks.
> They're currently an artifact of the creation of the filesystem, due
> to how mkfs.btrfs works at present, so I've started doing a balance
> immediately after first mount to deal with them, before there's
> anything on the filesystem, so the balance goes real fast. =:^)
> 3+ TiB of data is a little late for that, but you can balance
> metadata (and system) only, at least.
>
>> The array was grown quite a while ago using "btrfs filesystem
>> resize max", but "btrfs fi df" still shows the old data size. How
>> could that happen?
>
> As hinted at above, btrfs fi df <mntpnt> is only half the story,
> displaying how much of the currently allocated chunks is used, and
> for what (data/metadata/system/shared/etc). What it does *NOT*
> display is how much of the total filesystem size is actually
> allocated in the first place. That's where btrfs fi show <mntpnt>
> comes in. (Just btrfs fi show, without the <mntpnt> parameter,
> works fine if you've only a single btrfs or maybe a couple, but once
> you get a half dozen or so, adding the <mntpnt>, just as you do for
> df, is useful to display just the one.)
>
> Consider: on a single-device btrfs, data is single mode by default,
> with data chunks normally 1 GiB each; metadata is dup mode by
> default, with metadata chunks normally 1/4 GiB (256 MiB), but due to
> dup mode two of them are allocated at a time, so half a GiB.
>
> Given that, how do you represent unallocated space that could be
> allocated as either data (single, which takes space equal to the
> size of the data, or a bit less when compression is on) or metadata
> (dup, which takes twice as much space as the actual metadata, as
> there are two copies of it), depending on what is needed?
>
> Of course btrfs can be used on multiple devices in various raid
> modes as well, complicating the picture further, particularly in the
> future when each subvolume can have its own single/dup/raid policy
> applied, so they're not all the same.
>
> The way btrfs deals with this question is that btrfs fi show
> displays allocated vs. total space (with the space that doesn't show
> up as allocated obviously being... unallocated! =:^), while btrfs fi
> df displays the usage detail of only /allocated/ space.

OK, now I got it.

└» sudo btrfs fi show /mnt/btrfs
Label: none  uuid: 939f2547-176a-4942-b8d6-8883fed68973
        Total devices 1 FS bytes used 3.34TiB
        devid    1 size 5.46TiB used 3.44TiB path /dev/md0

Only 3.44TiB of the 5.46TiB is chunk-allocated (3.33TiB data plus
2 x 56.12GiB metadata accounts for nearly all of that), leaving
roughly 2TiB unallocated, which matches what plain df reports as
available. No issues on that array, just PEBKAC.

> Meanwhile, plain df (not btrfs df, just df) currently doesn't work
> particularly well for btrfs, because the rules it uses to display
> used vs. available space, which work on most filesystems, don't
> really apply to btrfs in the same way, and it doesn't know to apply
> different rules to btrfs, or what they might be if it did. (There's
> an effort to teach df about btrfs and similar filesystems, but it's
> early stage ATM, as there are some very real questions to settle on
> exactly what a sensible kernel API might look like for that, first,
> with the assumption being that if the interface is designed
> correctly, other filesystems will be able to make use of it in the
> future as well.)

>> This is becoming a "collection of maybe unrelated BTRFS funny
>> tales" thread... still, I'd be happy about suggestions regarding
>> any of the issues.
>
> Some of this stuff, including discussion of the issues surrounding
> space used and space left, is covered on the btrfs wiki, here
> (bookmark it! =:^) :
>
> https://btrfs.wiki.kernel.org
>
> In particular, see FAQ items 4.4-4.10 (under Documentation, FAQ)
> covering space questions, but it's worth reading pretty much all the
> user-level (as opposed to developer) documentation.

Will do. The last time I went through the wiki must have been at
least 2 or 3 years ago, I guess. And obviously I wasn't really aware
of the difference between btrfs fi show and btrfs fi df. Thanks for
your detailed input, and for the little slap on the back of the head
regarding df vs. show :-)

Regards,
Tom