Hi,

I need some help in designing a storage structure for 1 billion small files (<512 bytes), and I was wondering how btrfs would fit in this scenario. Keep in mind that I have never worked with btrfs - I just read some documentation and browsed this mailing list - so forgive me if my questions are silly! :X

On with the main questions, then:

- What's the advice to maximize disk capacity with such small files, even sacrificing some speed?

- Would you store all the files "flat", or would you build a hierarchical tree of directories to speed up file lookups? (basically duplicating the filesystem's btree indexes)

I tried to answer those questions myself, and here is what I found:

It seems that the smallest block size is 4K. So, in this scenario, if every file uses a full block I will end up with lots of space wasted. It wouldn't change much if the block size were 2K, anyhow.

I thought about compression, but it is not clear to me whether compression is handled at the file level or at the block level.

I also read that there is a mode that uses blocks for shared storage of metadata and data, designed for small filesystems. I haven't found any other info about it.

It is still not clear to me whether btrfs can fit my situation - would you recommend it over XFS?

XFS has a minimum block size of 512, but btrfs is more modern and, given that it is able to handle indexes on its own, it could help us speed up file operations (could it?)

Thank you for any advice!

Alessio Focardi
------------------
On Monday 07 of May 2012 11:28:13 Alessio Focardi wrote:
> I need some help in designing a storage structure for 1 billion small
> files (<512 bytes), and I was wondering how btrfs would fit in this
> scenario.
>
> - What's the advice to maximize disk capacity with such small files,
>   even sacrificing some speed?
>
> - Would you store all the files "flat", or would you build a
>   hierarchical tree of directories to speed up file lookups?

btrfs will inline such small files in metadata blocks.

I'm not sure about the limits on directory size, but I'd guess that going over a few tens of thousands of files in a single flat directory will carry speed penalties.

Regards,
-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
Use a directory hierarchy. Even if the filesystem handles a flat structure effectively, userspace programs will choke on tens of thousands of files in a single directory. For example, 'ls' will try to lexically sort its output (very slowly) unless given the command-line option not to do so.

Sent from my iPad

On May 7, 2012, at 3:58 AM, Hubert Kario <hka@qbs.com.pl> wrote:

> I'm not sure about the limits on directory size, but I'd guess that
> going over a few tens of thousands of files in a single flat directory
> will carry speed penalties
On Mon, May 07, 2012 at 11:28:13AM +0200, Alessio Focardi wrote:
> Hi,
>
> I need some help in designing a storage structure for 1 billion small
> files (<512 bytes), and I was wondering how btrfs would fit in this
> scenario.
>
> - What's the advice to maximize disk capacity with such small
>   files, even sacrificing some speed?

See my comments below about inlining files.

> - Would you store all the files "flat", or would you build a
>   hierarchical tree of directories to speed up file lookups?
>   (basically duplicating the filesystem's btree indexes)

Hierarchically, for the reasons Hubert and Boyd gave. (And it's not duplicating the btree indexes -- the structure of the btree does not reflect the structure of the directory hierarchy.)

> I tried to answer those questions myself, and here is what I found:
>
> It seems that the smallest block size is 4K. So, in this scenario,
> if every file uses a full block I will end up with lots of space
> wasted. It wouldn't change much if the block size were 2K, anyhow.

With small files, they will typically be inlined into the metadata. This is a lot more compact (as you can have several files' data in a single block), but by default it will write two copies of each file, even on a single disk.

So, if you want to use some form of redundancy (e.g. RAID-1), that's great, and you need to do nothing unusual. However, if you want to maximise space usage at the expense of robustness against a device failure, then you need to ensure that you only keep one copy of your data. This means that you should format the filesystem with the -m single option.

> I thought about compression, but it is not clear to me whether
> compression is handled at the file level or at the block level.
>
> I also read that there is a mode that uses blocks for shared storage
> of metadata and data, designed for small filesystems. I haven't found
> any other info about it.

Don't use that unless your filesystem is <16GB or so in size. It won't help here (i.e. file data stored in data chunks will still be allocated on a block-by-block basis).

> It is still not clear to me whether btrfs can fit my situation -
> would you recommend it over XFS?

The relatively small metadata overhead (e.g. compared to ext4) and the inline capability of btrfs would seem to be a good match for your use-case.

> XFS has a minimum block size of 512, but btrfs is more modern and,
> given that it is able to handle indexes on its own, it could
> help us speed up file operations (could it?)

Not sure what you mean by "handle indexes on its own". XFS will have its own set of indexes and file metadata -- it wouldn't be much of a filesystem if it didn't.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
            --- argc, argv, argh! ---
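As a concrete sketch of the format step being suggested here (assuming a single scratch disk; /dev/sdX is a placeholder, and on one device the data profile already defaults to single, so only the metadata profile needs changing):

    # keep a single copy of metadata (and hence of inlined file data);
    # only worth it if losing the disk means losing the data anyway
    mkfs.btrfs -m single /dev/sdX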
On 07/05/2012 11:28, Alessio Focardi wrote:
> I need some help in designing a storage structure for 1 billion small
> files (<512 bytes), and I was wondering how btrfs would fit in this
> scenario.

Are you *really* sure a database is *not* what you are looking for?
> This is a lot more compact (as you can have several files' data in a
> single block), but by default it will write two copies of each file,
> even on a single disk.

Great, no (or less) space wasted, then!

I will have a filesystem that is composed mostly of metadata blocks, if I understand correctly. Will this create any problem?

> So, if you want to use some form of redundancy (e.g. RAID-1), that's
> great, and you need to do nothing unusual. However, if you want
> to maximise space usage at the expense of robustness against a device
> failure, then you need to ensure that you only keep one copy of your
> data. This means that you should format the filesystem with the
> -m single option.

That's a very clever suggestion; I'm preparing a test server right now and am going to use the -m single option. Any other suggestions regarding format options?

pagesize? leafsize?

> > XFS has a minimum block size of 512, but btrfs is more modern and,
> > given that it is able to handle indexes on its own, it could
> > help us speed up file operations (could it?)
>
> Not sure what you mean by "handle indexes on its own". XFS will
> have its own set of indexes and file metadata -- it wouldn't be much
> of a filesystem if it didn't.

Yes, you are perfectly right; I thought that recreating a tree like /d/u/m/m/y/ to store "dummy" would have been redundant, since the whole filesystem is based on trees. I don't have to "ls" directories - we are using PHP to write and read the files - so I will have to find a compromise between the number of directory levels and the number of files in each one of them.

May I ask you about compression? Would you use it in the scenario I described?

Thank you for your help!
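One common way to strike that compromise is to shard on a hash of the file name. A minimal sketch in shell (the /data path, the use of md5 and the two-level 256x256 layout are illustrative assumptions, not anything prescribed in this thread; the equivalent logic would live in the PHP code doing the writes):

    # derive two directory levels from the first four hex digits of the
    # file name's md5, e.g. "dummy" lands under /data/<xy>/<zw>/dummy
    name="dummy"
    prefix=$(printf '%s' "$name" | md5sum | cut -c1-4)
    dir="/data/${prefix:0:2}/${prefix:2:2}"
    mkdir -p "$dir"
    printf '%s' "$payload" > "$dir/$name"

With 256 x 256 = 65536 leaf directories, a billion files works out to roughly 15,000 per directory, which stays on the comfortable side of the "few tens of thousands" figure Hubert mentioned.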
On Mon, May 07, 2012 at 01:15:26PM +0200, Alessio Focardi wrote:
> > This is a lot more compact (as you can have several files' data in a
> > single block), but by default it will write two copies of each file,
> > even on a single disk.
>
> Great, no (or less) space wasted, then!

Less space wasted -- you will still have empty bytes left at the end(*) of most metadata blocks, but you will definitely be packing in storage far more densely than otherwise.

(*) Actually, the middle, but let's ignore that here.

> I will have a filesystem that is composed mostly of metadata blocks,
> if I understand correctly. Will this create any problem?

Not that I'm aware of -- but you probably need to run proper tests of your likely behaviour just to see what it'll be like.

> That's a very clever suggestion; I'm preparing a test server right
> now and am going to use the -m single option. Any other suggestions
> regarding format options?
>
> pagesize? leafsize?

I'm not sure about these -- some values of them definitely break things. I think they are required to be the same, and that you could take them up to 64K with no major problems, but do check that first with someone who actually knows. Having a larger pagesize/leafsize will reduce the depth of the trees, and will allow you to store more items in each tree block, which gives you less wastage again. I don't know what the drawbacks are, though.

> Yes, you are perfectly right; I thought that recreating a tree like
> /d/u/m/m/y/ to store "dummy" would have been redundant, since the
> whole filesystem is based on trees. I don't have to "ls" directories
> - we are using PHP to write and read the files - so I will have to
> find a compromise between the number of directory levels and the
> number of files in each one of them.

The FS tree (which is the bit that stores the directory hierarchy and file metadata) is (broadly) a tree-structured index of inodes, ordered by inode number. Don't confuse the inode index structure with the directory structure -- they're totally different arrangements of the data. You may want to try looking at [1], which attempts to describe how the FS tree holds file data.

> May I ask you about compression? Would you use it in the scenario I
> described?

I'm not sure whether compression will apply to inline file data. Again, someone else may be able to answer; and you should probably test it with your own use-cases anyway.

Hugo.

[1] http://btrfs.ipv5.de/index.php?title=Trees

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
        --- Welcome to Rivendell, Mr Anderson... ---
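One way to check on the test server whether a given small file really ended up inlined is filefrag, which reports the extent flags via FIEMAP (assuming a FIEMAP-aware filefrag build; the path is a placeholder):

    # an inlined small file should show a single extent flagged "inline"
    filefrag -v /mnt/test/smallfile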
On Mon, 7 May 2012 12:39:28 +0100, Hugo Mills <hugo@carfax.org.uk> wrote:
> On Mon, May 07, 2012 at 01:15:26PM +0200, Alessio Focardi wrote:
...
> > That's a very clever suggestion; I'm preparing a test server right
> > now and am going to use the -m single option. Any other suggestions
> > regarding format options?
> >
> > pagesize? leafsize?
>
> I'm not sure about these -- some values of them definitely break
> things. I think they are required to be the same, and that you could
> take them up to 64K with no major problems, but do check that first
> with someone who actually knows.

First, if you have this filesystem as rootfs, a separate /boot partition is needed: GRUB is unable to boot from btrfs with a non-default node-/leafsize. Second, a very recent kernel is needed (linux-3.4-rc1 at least).

regards,
Johannes
On Mon, May 07, 2012 at 11:28:13AM +0200, Alessio Focardi wrote:
> I thought about compression, but it is not clear to me whether
> compression is handled at the file level or at the block level.

I don't recommend using compression for your expected file size range. Unless the files are highly compressible (50-75%, which I don't expect), the extra CPU cost of compression will only make things worse.

david
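A quick way to sanity-check that assumption against the real data is to compress a sample of the files and compare byte counts; gzip is only a stand-in for btrfs's zlib, and the sample path and size are placeholders:

    # rough compressibility estimate over ~10000 sample files
    # (plain word-splitting: assumes file names without spaces)
    files=$(find /path/to/sample -type f | head -n 10000)
    raw=$(cat $files | wc -c)
    comp=$(cat $files | gzip -c | wc -c)
    echo "raw: $raw bytes, gzip: $comp bytes"

If the two numbers come out close, compression is just CPU cost for no gain, as David suggests.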
On 07/05/12 20:06, Boyd Waters wrote:
> Use a directory hierarchy. Even if the filesystem handles a
> flat structure effectively, userspace programs will choke on
> tens of thousands of files in a single directory. For example,
> 'ls' will try to lexically sort its output (very slowly) unless
> given the command-line option not to do so.

In my experience it's not so much the lexical sorting that kills you but the default -F option which gets set for users these days; that results in ls doing an lstat() on every file to work out if it's an executable, directory, symlink, etc., to modify how it displays it to you.

For instance, on one of our HPC systems here we have a user with over 200,000 files in one directory. It takes about 4 seconds for \ls, whereas \ls -F takes, well, I can't tell you, because it was still running after 53 minutes (strace confirmed it was still lstat()ing) when I killed it.

cheers,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
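For completeness, both costs described above can be avoided when a huge directory does need to be listed; a small sketch (the path is a placeholder):

    # \ls bypasses the usual alias that adds -F/--color (so no lstat()
    # per entry), and -U skips the lexical sort
    \ls -U /path/to/huge/dir | head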
On Mon, May 07, 2012 at 11:28:13AM +0200, Alessio Focardi wrote:
> Hi,
>
> I need some help in designing a storage structure for 1 billion small
> files (<512 bytes), and I was wondering how btrfs would fit in this
> scenario.

A few people have already mentioned how btrfs will pack these small files into metadata blocks. If you're running btrfs on a single disk, the mkfs default will duplicate metadata blocks, which will decrease the number of files per disk you're able to store. If you use mkfs.btrfs -m single, you'll store each file only once.

I recommend some kind of raid for data you care about though, either hardware raid or putting the files across two drives (mkfs.btrfs -m raid1 -d raid1).

I suggest you experiment with compression. Both lzo and zlib will make the files smaller, but exactly how much depends quite a lot on your workload. We compress at a per-extent level, which varies from a single block up to much larger sizes.

Newer kernels (3.4 and higher) can support larger metadata block sizes. This increases storage efficiency because we need fewer extent records to describe all your metadata blocks. It also allows us to pack many more files into a single block, reducing internal btree block fragmentation. But the cost is increased CPU usage: btrfs hits memmove and memcpy pretty hard when you're using larger blocks.

I suggest using a 16K or 32K block size. You can go up to 64K, which may work well if you have beefy CPUs. Example for 16K:

    mkfs.btrfs -l 16K -n 16K /dev/xxx

Others have already recommended deeper directory trees. You can experiment with a few variations here, but a few subdirs will improve performance. Too many subdirs will waste kernel ram and resources on the dentries.

Another thing to keep in mind is that btrfs uses a btree for each subvolume. Using multiple subvolumes does allow you to break up the btree locks and improve concurrency. You can safely use a subvolume in most places you would use a top-level directory, but remember that snapshots don't recurse into subvolumes.

-chris
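Putting these suggestions together with the earlier -m single advice, a test-server setup might look like the sketch below (device, mount point and subvolume names are placeholders, and the larger leaf size needs the 3.4+ kernel mentioned above):

    # single metadata copy, 16K leaves/nodes, default 4K sectors
    mkfs.btrfs -m single -l 16K -n 16K /dev/sdX
    mount /dev/sdX /mnt/store
    # optional: a handful of subvolumes to spread the per-tree locks
    btrfs subvolume create /mnt/store/shard0
    btrfs subvolume create /mnt/store/shard1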
On 07/05/12 12:05, vivo75@gmail.com wrote:
> On 07/05/2012 11:28, Alessio Focardi wrote:
> > I need some help in designing a storage structure for 1 billion
> > small files (<512 bytes) ...
> Are you *really* sure a database is *not* what you are looking for?

My thought also.

Or: 1 billion 512-byte files... Is that not a 512 GByte HDD?

With that, use a database to index your data by sector number and read/write your data directly to the disk. For that example, your database just holds filename, size, and sector.

If your 512-byte files are written and accessed sequentially, then just use a HDD and address them by sector number from a database index. That then becomes your 'filesystem'. If you need fast random access, then use SSDs.

Plausible?

Regards,
Martin
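A minimal sketch of that raw-device scheme, just to show the mechanics (the device name and sector number are placeholders; writing to a raw device destroys whatever is on it, so this is strictly for a dedicated scratch disk):

    SECTOR=123456          # looked up in the database index
    # write one 512-byte record at that sector
    dd if=record.bin of=/dev/sdX bs=512 count=1 seek=$SECTOR conv=notrunc
    # read it back
    dd if=/dev/sdX of=record.out bs=512 count=1 skip=$SECTOR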
On 08/05/12 13:31, Chris Mason wrote:
[...]
> A few people have already mentioned how btrfs will pack these small
> files into metadata blocks. If you're running btrfs on a single disk,
[...]
> But the cost is increased CPU usage: btrfs hits memmove and memcpy
> pretty hard when you're using larger blocks.
>
> I suggest using a 16K or 32K block size. You can go up to 64K, which
> may work well if you have beefy CPUs. Example for 16K:
>
>     mkfs.btrfs -l 16K -n 16K /dev/xxx

Is that still with "-s 4K"?

Might that help SSDs that work in 16 kByte chunks?

And why are memmove and memcpy more heavily used? Does that suggest better optimisation of the (meta)data, or just a greater housekeeping overhead to shuffle data to new offsets?

Regards,
Martin
On Tue, May 08, 2012 at 05:51:05PM +0100, Martin wrote:
> On 08/05/12 13:31, Chris Mason wrote:
> > I suggest using a 16K or 32K block size. You can go up to 64K, which
> > may work well if you have beefy CPUs. Example for 16K:
> >
> >     mkfs.btrfs -l 16K -n 16K /dev/xxx
>
> Is that still with "-s 4K"?

Yes, the data sector size should still be the same as the page size.

> Might that help SSDs that work in 16 kByte chunks?

Most ssds today work in much larger chunks, so the bulk of the benefit comes from better packing, and fewer extent records required to hold the same amount of metadata.

> And why are memmove and memcpy more heavily used?
>
> Does that suggest better optimisation of the (meta)data, or just a
> greater housekeeping overhead to shuffle data to new offsets?

Inserting something into the middle of a block is more expensive because we have to shift left and right first. The bigger the block, the more we have to shift.

-chris