Hi all,

I'm going to be deploying ZFS in the near future but I have a couple of
quick questions:

1. With UFS, for performance reasons it is (was?) desirable to limit the
   number of files in a directory to no more than a few thousand. If MANY
   files were anticipated being in a directory, a hierarchy was usually
   recommended to keep the number of files in a single directory to a
   manageable number. Does this still apply to ZFS?

2. If one has an application mix that makes use of two or more distinct
   sets of data (say, sets A and B), and one has a set of 4 disks, space
   considerations aside, is it better to create two ZFS storage pools
   (one for each set)? In other words, if just one big pool is used, is
   ZFS smart enough to balance the I/O across the spindles? Assume each
   data set has a different mount point.

To illustrate what I mean, suppose I have /var/data/A and /var/data/B,
where A and B are unrelated sets of data on different file systems. For
best performance, should A and B reside in their own ZFS pool, or is it
OK to have one big pool containing both?

My guess is that two pools would be the way to go, but it would be nice
to get that confirmed by those in the know.

Cheers,

--
Rich Teer, SCNA, SCSA, OpenSolaris CAB member

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL:   http://www.rite-group.com/rich
On Wed, Feb 08, 2006 at 02:29:32PM -0800, Rich Teer wrote:
> I'm going to be deploying ZFS in the near future but I have a
> couple of quick questions:
>
> 1. With UFS, for performance reasons it is (was?) desirable to
> limit the number of files in a directory to no more than a few
> thousand. If MANY files were anticipated being in a directory,
> a hierarchy was usually recommended to keep the number of files
> in a single directory to a manageable number. Does this still
> apply to ZFS?

No. With ZFS, we use an extensible on-disk hash for directories.
Trying to create or look up a given file is a constant-time operation.

> 2. If one has an application mix that makes use of two or more
> distinct sets of data (say, sets A and B), and one has a set of
> 4 disks, space considerations aside, is it better to create two
> ZFS storage pools (one for each set)? In other words, if just
> one big pool is used, is ZFS smart enough to balance the I/O
> across the spindles? Assume each data set has a different mount
> point.
>
> To illustrate what I mean, suppose I have /var/data/A and /var/data/B,
> where A and B are unrelated sets of data on different file systems.
> For best performance, should A and B reside in their own ZFS pool,
> or is it OK to have one big pool containing both?
>
> My guess is that two pools would be the way to go, but it would be
> nice to get that confirmed by those in the know.

Most definitely, you should only use ONE pool for both sets of data.
ZFS will stripe the data across all spindles in the pool. This will be
true even if the two data sets are simply two large files in a single
filesystem.


--Bill
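To make the single-pool suggestion concrete, here is a minimal sketch;
the disk names (c0t0d0 through c0t3d0) and the pool/filesystem names are
invented for illustration, not taken from anyone's actual setup:

# One pool built from all four disks (no redundancy here; use the
# "mirror" or "raidz" keywords if that matters). ZFS dynamically
# stripes writes across all the top-level devices in the pool.
zpool create datapool c0t0d0 c0t1d0 c0t2d0 c0t3d0

# One filesystem per data set: separate mount points, shared spindles.
zfs create datapool/A
zfs create datapool/B
zfs set mountpoint=/var/data/A datapool/A
zfs set mountpoint=/var/data/B datapool/B

With this layout either data set can use the full bandwidth of all four
disks when the other one is idle.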
On Wed, 2006-02-08 at 14:29 -0800, Rich Teer wrote:
> My guess is that two pools would be the way to go, but it would be
> nice to get that confirmed by those in the know.

I thought the ZFS conventional wisdom would have you create one pool,
using quotas & reservations if necessary to ensure that the two (or
more) competing workloads don't eat so much of the pool that they
interfere with each other.

The main benefit is that you get 4 disks' worth of I/O bandwidth even
when just one of the workloads is active.
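A minimal sketch of the quota/reservation idea, assuming the single-pool
layout shown above; the dataset names and sizes are invented:

# Give each workload a guaranteed floor (reservation) and a ceiling
# (quota) so neither can starve the other for space, while both still
# spread their I/O across every disk in the pool.
zfs set reservation=100g datapool/A
zfs set quota=400g datapool/A

zfs set reservation=100g datapool/B
zfs set quota=400g datapool/B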
On Wed, 8 Feb 2006, Bill Moore wrote:

> On Wed, Feb 08, 2006 at 02:29:32PM -0800, Rich Teer wrote:
> > I'm going to be deploying ZFS in the near future but I have a
> > couple of quick questions:
> >
> > 1. With UFS, for performance reasons it is (was?) desirable to
> > limit the number of files in a directory to no more than a few
> > thousand. If MANY files were anticipated being in a directory,
> > a hierarchy was usually recommended to keep the number of files
> > in a single directory to a manageable number. Does this still
> > apply to ZFS?
>
> No. With ZFS, we use an extensible on-disk hash for directories.
> Trying to create or look up a given file is a constant-time operation.

ZFS solves a bug which I have encountered on Solaris 8, 9 and 10. The
basic bug signature is that you create and remove lots of small files
in a given directory. It does not matter if it's a UFS or a TMPFS
filesystem. The files have a common signature. One example that comes
immediately to mind, based on (nasty) first-hand experience, are files
like:

aosems.124312431234.mail0
aosems.124312431234.mail1
aosems.124312431234.mail2
aosems.124312431234.mail3
aosems.124312431234.mail4
aosems.124312431234.mail5
aosems.124312431234.mail6
aosems.124312431234.rawmail
aosems.124312431234.lock

The files are scanned and, once a '.lock' file is found, the related
files (those sharing the common center section, 124312431234 in the
above example, bounded by the '.' characters) are consumed by a process
that scans the directory periodically. When processed, the files are
moved to an intermediate directory, which is a subdirectory of where
these files were written, and archived when the number of files in the
subdirectory reaches a threshold (as "seen" by a cron job).

The bug: after several days of this type of activity, the directory
structure "deteriorates", with the following observed behavior:

1) If the number of 'aosems.*' files exceeds approx 600 or 700 files,
   the C code (which is bone-head simple) scanning the directory slows
   to a crawl. This severely impacts the ability of the consuming
   process to actually consume the files ... which leads to more files
   accumulating ... and the entire scheme falls over. A classic case of
   a failure mode that resembles the trajectory of an object falling
   off a cliff (the EE term is a "knee" curve failure mode).

2) The interesting part: if the failure mode is recognized and the
   number of files is still reasonable (say, between 2,000 and 3,000
   files have accumulated), then when a human attempts to manually
   remove the files, the rm command becomes single-threaded and the
   best rate of file removal approaches 2 or 3 files per second (on a
   lightly loaded 6-processor SPARC box equipped with 900MHz
   UltraSPARC IIIs with 8MB of cache).

3) No variation of the rm command (find blah,blah | xargs rm, or
   rm -rf blah, etc.) can improve the situation.

4) An ls command exhibits the same behavior: it takes minutes of
   elapsed time to complete.

5) The same behavior is observed on x86/AMD64 based systems.

6) The observed behavior is independent of the number of CPUs installed
   in the "problem" system. One CPU will be observed to be 100% busy;
   all other CPUs will be observed to be idle.

7) The same behavior is observed even if the files are written in a
   well-planned directory structure.
   For example:

   /var/project/something/2006/02/08/1600       (i.e. .../yyyy/mm/dd/hhhh)
   /var/project/somethingelse/2006/02/08/1600
   /var/project/xml/2006/02/08/1600

   and the number of files that accumulate in each "leaf" node is
   reasonable: for example, < approx. 600 files worst case and less
   than 100 on average. [In the above example the max # of files that
   can accumulate in a directory is limited, by the activity on the
   system, to the number of files that are generated by the system
   lusers in a one-hour interval.]

   The key to exercising this bug is to write & rm the files several
   times. After several generations of files are written and removed
   below the top-level directory (in the above case, /var/project), the
   observed behavior will be very obvious, even from behind the command
   prompt!

8) Using tempfile (or was it tmpfile) to generate the files results in
   the same observed pattern of behavior. So it's not specific to the
   signature of the actual file name.

9) The only real operational solution to the issue is to stop the
   producing/consuming processes, remove the top-level directory
   periodically, then re-create it. Unfortunately, if the top-level
   directory is /tmp, you are SOL!

Warning: Don't try this in your home directory! Create a subdirectory
and then try it.

Followup: I've seen people refer to this bug in public forums, without
an understanding of how to 'tickle' it, with comments like 'why does it
take so long for Solaris to remove files' .... I had fully intended to
root-cause this issue since the launch of the OpenSolaris project. At
this point in time, it's obvious that it'll continue to be "tomorrow"
before I get a chance to work on it, given my current workload and CAB
involvement etc.

> > 2. If one has an application mix that makes use of two or more
> > distinct sets of data (say, sets A and B), and one has a set of
> > 4 disks, space considerations aside, is it better to create two
> > ZFS storage pools (one for each set)? In other words, if just
> > one big pool is used, is ZFS smart enough to balance the I/O
> > across the spindles? Assume each data set has a different mount
> > point.
> >
> > To illustrate what I mean, suppose I have /var/data/A and /var/data/B,
> > where A and B are unrelated sets of data on different file systems.
> > For best performance, should A and B reside in their own ZFS pool,
> > or is it OK to have one big pool containing both?
> >
> > My guess is that two pools would be the way to go, but it would be
> > nice to get that confirmed by those in the know.
>
> Most definitely, you should only use ONE pool for both sets of data.
> ZFS will stripe the data across all spindles in the pool. This will be
> true even if the two data sets are simply two large files in a single
> filesystem.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
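For anyone who wants to try to tickle the behavior Al describes, a rough
reproduction sketch along these lines may help; the directory path, file
names, counts and generation limit are all invented, and it should be
run in a scratch directory on the filesystem under test (not /tmp or
your home directory):

#!/bin/sh
#
# Rough sketch: repeatedly create and remove batches of small files in
# one directory, and time a plain scan and the removal as the directory
# "ages" across generations. The real problem reportedly builds up over
# days, so treat this only as the shape of a test, not a guaranteed repro.

DIR=/var/tmp/dirstress        # scratch dir on the filesystem under test
mkdir -p $DIR

gen=1
while [ $gen -le 20 ]; do
    i=0
    while [ $i -lt 2000 ]; do
        touch $DIR/aosems.$gen.$i
        i=`expr $i + 1`
    done
    echo "generation $gen:"
    time ls $DIR > /dev/null        # does scanning slow down over time?
    time rm $DIR/aosems.$gen.*      # does removal slow down over time?
    gen=`expr $gen + 1`
done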
Hey, Bill -

Having played a little with zillions of files in a single directory, I
can say that there are still cases where it is NOT desirable to have
said zillions of files (or directories) in one spot... The worst case
for me is when you have something performing an action on '*' or a
regular ls [-l] sort of action...

Of course, there is much that could be said about the design of
whatever it is *using* the directory, and its own efficiency, but I'm
looking at it from the 'already existing' methodologies that tend to
work sequentially, or on all entries in the directory...

For example - having a million directories within a directory, and
typing 'ls'... You better hope that you have *lots* of free memory...
:) (the ls wants in excess of 500MB...)

Another question some of my playing raised, however, was the amount of
space directory entries use on disk. I created a million directories in
a 2GB zfs filesystem, and noted the following:

Filesystem            kbytes    used   avail capacity  Mounted on
pool0/fs0            2064384 1135474  928168    56%    /pool0/fs0

Nearly a MB per directory entry?

Hm!

Nathan.


On Thu, 2006-02-09 at 09:46, Bill Moore wrote:
> On Wed, Feb 08, 2006 at 02:29:32PM -0800, Rich Teer wrote:
> > 1. With UFS, for performance reasons it is (was?) desirable to
> > limit the number of files in a directory to no more than a few
> > thousand. If MANY files were anticipated being in a directory,
> > a hierarchy was usually recommended to keep the number of files
> > in a single directory to a manageable number. Does this still
> > apply to ZFS?
>
> No. With ZFS, we use an extensible on-disk hash for directories.
> Trying to create or look up a given file is a constant-time operation.
>
> [...]

--
//////////////////////////////////////////////////////////////////
// Nathan Kroenert              nathan.kroenert at sun.com        //
// PTS Engineer                 Phone:      +61 2 9844-5235      //
// Sun Services                 Direct Ext: x57235               //
// Level 2, 828 Pacific Hwy     Fax:        +61 2 9844-5311      //
// Gordon 2072 New South Wales  Australia                        //
//////////////////////////////////////////////////////////////////
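Nathan's experiment is easy to repeat on a throwaway pool; a rough
sketch, with an invented file-backed vdev path and pool/filesystem
names:

# Build a small pool on a 2GB file-backed vdev, then fill one
# filesystem with a million empty directories and see what df reports.
mkfile 2g /var/tmp/zfs-backing-file
zpool create pool0 /var/tmp/zfs-backing-file
zfs create pool0/fs0

cd /pool0/fs0
i=0
while [ $i -lt 1000000 ]; do      # slow with expr; just shows the shape
    mkdir d$i
    i=`expr $i + 1`
done

df -k /pool0/fs0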
Sorry to follow up my own mail...

> Another question some of my playing raised, however, was the amount of
> space directory entries use on disk. I created a million directories in
> a 2GB zfs filesystem, and noted the following:
>
> Filesystem            kbytes    used   avail capacity  Mounted on
> pool0/fs0            2064384 1135474  928168    56%    /pool0/fs0
>
> Nearly a MB per directory entry?

I'm on drugs. Or perhaps lacking caffeine... I'm out by an order of
magnitude here... of course it's not an MB per directory... oops.

That will teach me to play before my morning Cola... :)

Nathan.


On Thu, 2006-02-09 at 11:23, Nathan Kroenert wrote:
> Hey, Bill -
>
> Having played a little with zillions of files in a single directory, I
> can say that there are still cases where it is NOT desirable to have
> said zillions of files (or directories) in one spot...
>
> [...]
--
//////////////////////////////////////////////////////////////////
// Nathan Kroenert              nathan.kroenert at sun.com        //
// PTS Engineer                 Phone:      +61 2 9844-5235      //
// Sun Services                 Direct Ext: x57235               //
// Level 2, 828 Pacific Hwy     Fax:        +61 2 9844-5311      //
// Gordon 2072 New South Wales  Australia                        //
//////////////////////////////////////////////////////////////////
> Having played a little with zillions of files in a single directory, I
> can say that there are still cases where it is NOT desirable to have
> said zillions of files (or directories) in one spot... The worst case
> for me is when you have something performing an action on '*' or a
> regular ls [-l] sort of action...

Very good point. If you have millions of files in a single directory,
you really only want to access them programmatically. Even 'echo *'
takes a while with that many files.

> Another question some of my playing raised, however, was the amount of
> space directory entries use on disk. I created a million directories in
> a 2GB zfs filesystem, and noted the following:
>
> Filesystem            kbytes    used   avail capacity  Mounted on
> pool0/fs0            2064384 1135474  928168    56%    /pool0/fs0
>
> Nearly a MB per directory entry?

I believe it's 1k, not 1m:

    space per dir = (1135474 * 1024 bytes used) / (2^20 dirs)
                  = about 1109 bytes

Jeff
On Thu, Feb 09, 2006 at 11:23:40AM +1100, Nathan Kroenert wrote:
> For example - having a million directories within a directory, and
> typing 'ls'... You better hope that you have *lots* of free memory...
> :) (the ls wants in excess of 500MB...)

Yeah, 'ls' could be a lot more efficient. See a couple of bugs that I
filed a while back:

  6299767 'ls -f' should not buffer output
  6299769 'ls' memory usage is excessive

These would be great candidates for someone from the community to work
on...

> Another question some of my playing raised, however, was the amount of
> space directory entries use on disk. I created a million directories in
> a 2GB zfs filesystem, and noted the following:
>
> Filesystem            kbytes    used   avail capacity  Mounted on
> pool0/fs0            2064384 1135474  928168    56%    /pool0/fs0

It's interesting to see the breakdown of where that ~1100 bytes per
directory comes from:

(1) The directory itself is using about 89 bytes per entry (as reported
    by stat64() or 'ls -s' on the directory).

(2) Each (empty) subdirectory is using up 512 bytes (each in a single
    512-byte sector) to store its (zero) entries.

(3) Each subdirectory has an entry in the "dnode file", which stores
    the metadata for each file, similar to inodes in UFS. Each dnode is
    512 bytes.

Until recently, ZFS compressed all metadata (it was turned off to help
diagnose some nasty bugs). With compression on, the directory and the
dnode file will get smaller, but unfortunately each subdirectory's data
is already stored on the smallest possible (512-byte) block. With
compression, the total is about 650 bytes per subdirectory, broken down
as follows:

(1) directory: 38 bytes per entry
(2) subdirectory entries: 512 bytes per entry
(3) dnode: 100 bytes per entry

If we stored empty files rather than empty directories, we eliminate
the 512-byte block per entry, for a total of around 93 bytes per file
(the dnode file can compress almost twice as well because it doesn't
need to store the block pointers for the subdirectory's blocks):

(1) directory: 38 bytes per entry
(2) empty files don't use any blocks for their data
(3) dnode: 56 bytes per entry

There you have it, more than you ever wanted to know about space usage
for lots of empty files or directories.

--matt
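If you want to see numbers like (1) and (2) on your own pool, something
along these lines should show them; the pool/filesystem/directory names
are hypothetical, and the dnode-level figures in (3) aren't visible from
the standard tools, so they aren't shown here:

# (1) Size of the parent directory object itself:
ls -ld /pool0/fs0         # st_size grows with the number of entries
ls -sd /pool0/fs0         # allocated size, reported in 512-byte blocks

# (2) Each empty subdirectory still occupies one 512-byte sector:
ls -sd /pool0/fs0/d0

# The overall per-entry cost is what df reflects:
df -k /pool0/fs0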
On Wed, 2006-02-08 at 17:25 -0800, Matthew Ahrens wrote:
> (2) Each (empty) subdirectory is using up 512 bytes (each in a single
>     512-byte sector) to store its (zero) entries.

Hmm. Are empty directories common enough to make it worthwhile to store
an empty directory as a zero-length object instead of a 512-byte
object? (I may not have the terminology correct...)

One case which this could help: Subversion goes nuts creating a lot of
empty subdirs in its working copies (each source directory in a working
copy contains a .svn subdir with between 5 and 8 initially-empty
subdirs), but I'm not sure if anything else would benefit.

Given that everything's COW already, it doesn't seem like it would cost
that much to implement, but it's unclear if the payback would be worth
it.

- Bill
Bill Moore wrote:
>
>> 2. If one has an application mix that makes use of two or more
>> distinct sets of data (say, sets A and B), and one has a set of
>> 4 disks, space considerations aside, is it better to create two
>> ZFS storage pools (one for each set)? In other words, if just
>> one big pool is used, is ZFS smart enough to balance the I/O
>> across the spindles? Assume each data set has a different mount
>> point.
>>
>> To illustrate what I mean, suppose I have /var/data/A and /var/data/B,
>> where A and B are unrelated sets of data on different file systems.
>> For best performance, should A and B reside in their own ZFS pool,
>> or is it OK to have one big pool containing both?
>>
>> My guess is that two pools would be the way to go, but it would be
>> nice to get that confirmed by those in the know.
>
> Most definitely, you should only use ONE pool for both sets of data.
> ZFS will stripe the data across all spindles in the pool. This will be
> true even if the two data sets are simply two large files in a single
> filesystem.

This goes against the current provisioning guidelines we've used for
data sets that require different performance or access patterns. You
might want to expand on how ZFS isn't impacted.

For example: today we place Oracle logs and tables on different
filesystems, usually on different physical LUNs, to avoid things like
LUN skew. How does ZFS fix that?
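For reference, the layout the earlier replies imply for the Oracle
example is separate filesystems, not separate pools, within one pool. A
rough sketch, with invented pool/dataset names and mount points, and a
record size that assumes a hypothetical 8k db_block_size:

# One pool across all available spindles; separate filesystems let the
# logs and tables keep their own mount points and per-dataset tuning.
zpool create orapool c0t0d0 c0t1d0 c0t2d0 c0t3d0

zfs create orapool/oradata
zfs create orapool/oralogs
zfs set mountpoint=/u01/oradata orapool/oradata
zfs set mountpoint=/u02/oralogs orapool/oralogs

# Matching the filesystem record size to the database block size is the
# usual suggestion for the table files (assumes an 8k db_block_size).
zfs set recordsize=8k orapool/oradata

Whether this actually addresses the LUN-skew concern is exactly the
question being asked above.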