Hi,

I need some help in designing a storage structure for 1 billion small files (<512 bytes), and I was wondering how btrfs would fit in this scenario. Keep in mind that I have never worked with btrfs - I just read some documentation and browsed this mailing list - so forgive me if my questions are silly! :X

On with the main questions, then:

- What's the advice to maximize disk capacity with such small files, even sacrificing some speed?

- Would you store all the files "flat", or would you build a hierarchical tree of directories to speed up file lookups? (basically duplicating the filesystem's btree indexes)

I tried to answer those questions myself, and here is what I found:

It seems that the smallest block size is 4K. So, in this scenario, if every file uses a full block I will end up with lots of space wasted. It wouldn't change much if the block size were 2K, anyhow.

I thought about compression, but it is not clear to me whether compression is handled at the file level or at the block level.

I also read that there is a mode that uses blocks for shared storage of metadata and data, designed for small filesystems. I haven't found any other info about it.

It is still not clear to me whether btrfs can fit my situation - would you recommend it over XFS?

XFS has a minimum block size of 512, but btrfs is more modern and, given that it is able to handle indexes on its own, it could help us speed up file operations (could it?)

Thank you for any advice!

Alessio Focardi
------------------
On Monday 07 of May 2012 11:28:13 Alessio Focardi wrote:
> I need some help in designing a storage structure for 1 billion small
> files (<512 bytes), and I was wondering how btrfs would fit in this
> scenario.
>
> - What's the advice to maximize disk capacity with such small files,
>   even sacrificing some speed?
>
> - Would you store all the files "flat", or would you build a
>   hierarchical tree of directories to speed up file lookups?

btrfs will inline such small files in metadata blocks.

I'm not sure about the limits on directory size, but I'd guess that going over a few tens of thousands of files in a single flat directory will carry speed penalties.

Regards,
-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
Use a directory hierarchy. Even if the filesystem handles a flat structure effectively, userspace programs will choke on tens of thousands of files in a single directory. For example, 'ls' will try to lexically sort its output (very slowly) unless given the command-line option not to do so.

Sent from my iPad

On May 7, 2012, at 3:58 AM, Hubert Kario <hka@qbs.com.pl> wrote:

> I'm not sure about the limits on directory size, but I'd guess that
> going over a few tens of thousands of files in a single flat directory
> will carry speed penalties
On Mon, May 07, 2012 at 11:28:13AM +0200, Alessio Focardi wrote:
> Hi,
>
> I need some help in designing a storage structure for 1 billion small
> files (<512 bytes), and I was wondering how btrfs would fit in this
> scenario.
>
> - What's the advice to maximize disk capacity with such small
>   files, even sacrificing some speed?

See my comments below about inlining files.

> - Would you store all the files "flat", or would you build a
>   hierarchical tree of directories to speed up file lookups?
>   (basically duplicating the filesystem's btree indexes)

Hierarchically, for the reasons Hubert and Boyd gave. (And it's not duplicating the btree indexes -- the structure of the btree does not reflect the structure of the directory hierarchy.)

> I tried to answer those questions myself, and here is what I found:
>
> It seems that the smallest block size is 4K. So, in this scenario,
> if every file uses a full block I will end up with lots of space
> wasted. It wouldn't change much if the block size were 2K, anyhow.

With small files, they will typically be inlined into the metadata. This is a lot more compact (as you can have several files' data in a single block), but by default it will write two copies of each file, even on a single disk.

So, if you want to use some form of redundancy (e.g. RAID-1), that's great, and you need to do nothing unusual. However, if you want to maximise space usage at the expense of robustness against a device failure, then you need to ensure that you only keep one copy of your data. This means that you should format the filesystem with the -m single option.

> I thought about compression, but it is not clear to me whether
> compression is handled at the file level or at the block level.
>
> I also read that there is a mode that uses blocks for shared storage
> of metadata and data, designed for small filesystems. I haven't found
> any other info about it.

Don't use that unless your filesystem is <16GB or so in size. It won't help here (i.e. file data stored in data chunks will still be allocated on a block-by-block basis).

> It is still not clear to me whether btrfs can fit my situation -
> would you recommend it over XFS?

The relatively small metadata overhead (e.g. compared to ext4) and the inline capability of btrfs would seem to be a good match for your use-case.

> XFS has a minimum block size of 512, but btrfs is more modern and,
> given that it is able to handle indexes on its own, it could
> help us speed up file operations (could it?)

Not sure what you mean by "handle indexes on its own". XFS will have its own set of indexes and file metadata -- it wouldn't be much of a filesystem if it didn't.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
            --- argc, argv, argh! ---
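As a concrete sketch of the format step being suggested here (assuming a single scratch disk; /dev/sdX is a placeholder, and on one device the data profile already defaults to single, so only the metadata profile needs changing):

    # keep a single copy of metadata (and hence of inlined file data);
    # only worth it if losing the disk means losing the data anyway
    mkfs.btrfs -m single /dev/sdX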
On 07/05/2012 11:28, Alessio Focardi wrote:
> I need some help in designing a storage structure for 1 billion small
> files (<512 bytes), and I was wondering how btrfs would fit in this
> scenario.

Are you *really* sure a database is *not* what you are looking for?
> This is a lot more compact (as you can have several files' data in a
> single block), but by default it will write two copies of each file,
> even on a single disk.

Great, no (or less) space wasted, then!

I will have a filesystem that is composed mostly of metadata blocks, if I understand correctly. Will this create any problem?

> So, if you want to use some form of redundancy (e.g. RAID-1), that's
> great, and you need to do nothing unusual. However, if you want
> to maximise space usage at the expense of robustness against a device
> failure, then you need to ensure that you only keep one copy of your
> data. This means that you should format the filesystem with the
> -m single option.

That's a very clever suggestion; I'm preparing a test server right now and am going to use the -m single option. Any other suggestions regarding format options?

pagesize? leafsize?

> > XFS has a minimum block size of 512, but btrfs is more modern and,
> > given that it is able to handle indexes on its own, it could
> > help us speed up file operations (could it?)
>
> Not sure what you mean by "handle indexes on its own". XFS will
> have its own set of indexes and file metadata -- it wouldn't be much
> of a filesystem if it didn't.

Yes, you are perfectly right; I thought that recreating a tree like /d/u/m/m/y/ to store "dummy" would have been redundant, since the whole filesystem is based on trees. I don't have to "ls" directories - we are using PHP to write and read the files - so I will have to find a compromise between the number of directory levels and the number of files in each one of them.

May I ask you about compression? Would you use it in the scenario I described?

Thank you for your help!
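One common way to strike that compromise is to shard on a hash of the file name. A minimal sketch in shell (the /data path, the use of md5 and the two-level 256x256 layout are illustrative assumptions, not anything prescribed in this thread; the equivalent logic would live in the PHP code doing the writes):

    # derive two directory levels from the first four hex digits of the
    # file name's md5, e.g. "dummy" lands under /data/<xy>/<zw>/dummy
    name="dummy"
    prefix=$(printf '%s' "$name" | md5sum | cut -c1-4)
    dir="/data/${prefix:0:2}/${prefix:2:2}"
    mkdir -p "$dir"
    printf '%s' "$payload" > "$dir/$name"

With 256 x 256 = 65536 leaf directories, a billion files works out to roughly 15,000 per directory, which stays on the comfortable side of the "few tens of thousands" figure Hubert mentioned.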
On Mon, May 07, 2012 at 01:15:26PM +0200, Alessio Focardi wrote:
> > This is a lot more compact (as you can have several files' data in a
> > single block), but by default it will write two copies of each file,
> > even on a single disk.
>
> Great, no (or less) space wasted, then!

Less space wasted -- you will still have empty bytes left at the end(*) of most metadata blocks, but you will definitely be packing in storage far more densely than otherwise.

(*) Actually, the middle, but let's ignore that here.

> I will have a filesystem that is composed mostly of metadata blocks,
> if I understand correctly. Will this create any problem?

Not that I'm aware of -- but you probably need to run proper tests of your likely behaviour just to see what it'll be like.

> That's a very clever suggestion; I'm preparing a test server right
> now and am going to use the -m single option. Any other suggestions
> regarding format options?
>
> pagesize? leafsize?

I'm not sure about these -- some values of them definitely break things. I think they are required to be the same, and that you could take them up to 64K with no major problems, but do check that first with someone who actually knows. Having a larger pagesize/leafsize will reduce the depth of the trees, and will allow you to store more items in each tree block, which gives you less wastage again. I don't know what the drawbacks are, though.

> Yes, you are perfectly right; I thought that recreating a tree like
> /d/u/m/m/y/ to store "dummy" would have been redundant, since the
> whole filesystem is based on trees. I don't have to "ls" directories
> - we are using PHP to write and read the files - so I will have to
> find a compromise between the number of directory levels and the
> number of files in each one of them.

The FS tree (which is the bit that stores the directory hierarchy and file metadata) is (broadly) a tree-structured index of inodes, ordered by inode number. Don't confuse the inode index structure with the directory structure -- they're totally different arrangements of the data. You may want to try looking at [1], which attempts to describe how the FS tree holds file data.

> May I ask you about compression? Would you use it in the scenario I
> described?

I'm not sure whether compression will apply to inline file data. Again, someone else may be able to answer; and you should probably test it with your own use-cases anyway.

Hugo.

[1] http://btrfs.ipv5.de/index.php?title=Trees

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
        --- Welcome to Rivendell, Mr Anderson... ---
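One way to check on the test server whether a given small file really ended up inlined is filefrag, which reports the extent flags via FIEMAP (assuming a FIEMAP-aware filefrag build; the path is a placeholder):

    # an inlined small file should show a single extent flagged "inline"
    filefrag -v /mnt/test/smallfile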
On Mon, 7 May 2012 12:39:28 +0100, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Mon, May 07, 2012 at 01:15:26PM +0200, Alessio Focardi wrote:
...
> > That's a very clever suggestion; I'm preparing a test server right
> > now and am going to use the -m single option. Any other suggestions
> > regarding format options?
> >
> > pagesize? leafsize?
>
> I'm not sure about these -- some values of them definitely break
> things. I think they are required to be the same, and that you could
> take them up to 64K with no major problems, but do check that first
> with someone who actually knows.

First, if you have this filesystem as rootfs, a separate /boot partition is needed: GRUB is unable to boot from btrfs with a non-default node-/leafsize. Second, a very recent kernel is needed (linux-3.4-rc1 at least).

regards,
Johannes
On Mon, May 07, 2012 at 11:28:13AM +0200, Alessio Focardi wrote:
> I thought about compression, but it is not clear to me whether
> compression is handled at the file level or at the block level.

I don't recommend using compression for your expected file size range. Unless the files are highly compressible (50-75%, which I don't expect), the extra CPU cost of compression will only make things worse.

david
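A quick way to sanity-check that assumption against the real data is to compress a sample of the files and compare byte counts; gzip is only a stand-in for btrfs's zlib, and the sample path and size are placeholders:

    # rough compressibility estimate over ~10000 sample files
    # (plain word-splitting: assumes file names without spaces)
    files=$(find /path/to/sample -type f | head -n 10000)
    raw=$(cat $files | wc -c)
    comp=$(cat $files | gzip -c | wc -c)
    echo "raw: $raw bytes, gzip: $comp bytes"

If the two numbers come out close, compression is just CPU cost for no gain, as David suggests.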
On 07/05/12 20:06, Boyd Waters wrote:
> Use a directory hierarchy. Even if the filesystem handles a
> flat structure effectively, userspace programs will choke on
> tens of thousands of files in a single directory. For example,
> 'ls' will try to lexically sort its output (very slowly) unless
> given the command-line option not to do so.

In my experience it's not so much the lexical sorting that kills you but the default -F option which gets set for users these days; that results in ls doing an lstat() on every file to work out if it's an executable, directory, symlink, etc., to modify how it displays it to you.

For instance, on one of our HPC systems here we have a user with over 200,000 files in one directory. It takes about 4 seconds for \ls, whereas \ls -F takes, well, I can't tell you, because it was still running after 53 minutes (strace confirmed it was still lstat()ing) when I killed it.

cheers,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
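For completeness, both costs described above can be avoided when a huge directory does need to be listed; a small sketch (the path is a placeholder):

    # \ls bypasses the usual alias that adds -F/--color (so no lstat()
    # per entry), and -U skips the lexical sort
    \ls -U /path/to/huge/dir | head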
On Mon, May 07, 2012 at 11:28:13AM +0200, Alessio Focardi wrote:
> Hi,
>
> I need some help in designing a storage structure for 1 billion small
> files (<512 bytes), and I was wondering how btrfs would fit in this
> scenario.

A few people have already mentioned how btrfs will pack these small files into metadata blocks. If you're running btrfs on a single disk, the mkfs default will duplicate metadata blocks, which will decrease the number of files per disk you're able to store. If you use mkfs.btrfs -m single, you'll store each file only once.

I recommend some kind of raid for data you care about though, either hardware raid or putting the files across two drives (mkfs.btrfs -m raid1 -d raid1).

I suggest you experiment with compression. Both lzo and zlib will make the files smaller, but exactly how much depends quite a lot on your workload. We compress at a per-extent level, which varies from a single block up to much larger sizes.

Newer kernels (3.4 and higher) can support larger metadata block sizes. This increases storage efficiency because we need fewer extent records to describe all your metadata blocks. It also allows us to pack many more files into a single block, reducing internal btree block fragmentation. But the cost is increased CPU usage: btrfs hits memmove and memcpy pretty hard when you're using larger blocks.

I suggest using a 16K or 32K block size. You can go up to 64K, which may work well if you have beefy CPUs. Example for 16K:

    mkfs.btrfs -l 16K -n 16K /dev/xxx

Others have already recommended deeper directory trees. You can experiment with a few variations here, but a few subdirs will improve performance. Too many subdirs will waste kernel ram and resources on the dentries.

Another thing to keep in mind is that btrfs uses a btree for each subvolume. Using multiple subvolumes does allow you to break up the btree locks and improve concurrency. You can safely use a subvolume in most places you would use a top-level directory, but remember that snapshots don't recurse into subvolumes.

-chris
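Putting these suggestions together with the earlier -m single advice, a test-server setup might look like the sketch below (device, mount point and subvolume names are placeholders, and the larger leaf size needs the 3.4+ kernel mentioned above):

    # single metadata copy, 16K leaves/nodes, default 4K sectors
    mkfs.btrfs -m single -l 16K -n 16K /dev/sdX
    mount /dev/sdX /mnt/store
    # optional: a handful of subvolumes to spread the per-tree locks
    btrfs subvolume create /mnt/store/shard0
    btrfs subvolume create /mnt/store/shard1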
On 07/05/12 12:05, vivo75@gmail.com wrote:
> On 07/05/2012 11:28, Alessio Focardi wrote:
> > I need some help in designing a storage structure for 1 billion
> > small files (<512 bytes) ...
> Are you *really* sure a database is *not* what you are looking for?

My thought also.

Or: 1 billion 512-byte files... Is that not a 512 GByte HDD?

With that, use a database to index your data by sector number and read/write your data directly to the disk. For that example, your database just holds filename, size, and sector.

If your 512-byte files are written and accessed sequentially, then just use a HDD and address them by sector number from a database index. That then becomes your 'filesystem'. If you need fast random access, then use SSDs.

Plausible?

Regards,
Martin
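A minimal sketch of that raw-device scheme, just to show the mechanics (the device name and sector number are placeholders; writing to a raw device destroys whatever is on it, so this is strictly for a dedicated scratch disk):

    SECTOR=123456          # looked up in the database index
    # write one 512-byte record at that sector
    dd if=record.bin of=/dev/sdX bs=512 count=1 seek=$SECTOR conv=notrunc
    # read it back
    dd if=/dev/sdX of=record.out bs=512 count=1 skip=$SECTOR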
On 08/05/12 13:31, Chris Mason wrote:
[...]
> A few people have already mentioned how btrfs will pack these small
> files into metadata blocks. If you're running btrfs on a single disk,
[...]
> But the cost is increased CPU usage: btrfs hits memmove and memcpy
> pretty hard when you're using larger blocks.
>
> I suggest using a 16K or 32K block size. You can go up to 64K, which
> may work well if you have beefy CPUs. Example for 16K:
>
>     mkfs.btrfs -l 16K -n 16K /dev/xxx

Is that still with "-s 4K"?

Might that help SSDs that work in 16 kByte chunks?

And why are memmove and memcpy more heavily used? Does that suggest better optimisation of the (meta)data, or just a greater housekeeping overhead to shuffle data to new offsets?

Regards,
Martin
On Tue, May 08, 2012 at 05:51:05PM +0100, Martin wrote:
> On 08/05/12 13:31, Chris Mason wrote:
> > I suggest using a 16K or 32K block size. You can go up to 64K, which
> > may work well if you have beefy CPUs. Example for 16K:
> >
> >     mkfs.btrfs -l 16K -n 16K /dev/xxx
>
> Is that still with "-s 4K"?

Yes, the data sector size should still be the same as the page size.

> Might that help SSDs that work in 16 kByte chunks?

Most ssds today work in much larger chunks, so the bulk of the benefit comes from better packing, and fewer extent records required to hold the same amount of metadata.

> And why are memmove and memcpy more heavily used?
>
> Does that suggest better optimisation of the (meta)data, or just a
> greater housekeeping overhead to shuffle data to new offsets?

Inserting something into the middle of a block is more expensive because we have to shift left and right first. The bigger the block, the more we have to shift.

-chris