Hi all,

I'm going to be deploying ZFS in the near future but I have a couple of
quick questions:

1. With UFS, for performance reasons it is (was?) desirable to limit the
   number of files in a directory to no more than a few thousand. If MANY
   files were anticipated being in a directory, a hierarchy was usually
   recommended to keep the number of files in a single directory to a
   manageable number. Does this still apply to ZFS?

2. If one has an application mix that makes use of two or more distinct
   sets of data (say, sets A and B), and one has a set of 4 disks, space
   considerations aside, is it better to create two ZFS storage pools
   (one for each set)? In other words, if just one big pool is used, is
   ZFS smart enough to balance the I/O across the spindles? Assume each
   data set has a different mount point.

To illustrate what I mean, suppose I have /var/data/A and /var/data/B,
where A and B are unrelated sets of data on different file systems. For
best performance, should A and B reside in their own ZFS pool, or is it
OK to have one big pool containing both?

My guess is that two pools would be the way to go, but it would be nice
to get that confirmed by those in the know.

Cheers,

--
Rich Teer, SCNA, SCSA, OpenSolaris CAB member

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL:   http://www.rite-group.com/rich
On Wed, Feb 08, 2006 at 02:29:32PM -0800, Rich Teer wrote:
> I'm going to be deploying ZFS in the near future but I have a
> couple of quick questions:
>
> 1. With UFS, for performance reasons it is (was?) desirable to
> limit the number of files in a directory to no more than a few
> thousand. If MANY files were anticipated being in a directory,
> a hierarchy was usually recommended to keep the number of files
> in a single directory to a manageable number. Does this still
> apply to ZFS?

No. With ZFS, we use an extensible on-disk hash for directories.
Trying to create or look up a given file is a constant-time operation.

> 2. If one has an application mix that makes use of two or more
> distinct sets of data (say, sets A and B), and one has a set of
> 4 disks, space considerations aside, is it better to create two
> ZFS storage pools (one for each set)? In other words, if just
> one big pool is used, is ZFS smart enough to balance the I/O
> across the spindles? Assume each data set has a different mount
> point.
>
> To illustrate what I mean, suppose I have /var/data/A and /var/data/B,
> where A and B are unrelated sets of data on different file systems.
> For best performance, should A and B reside in their own ZFS pool,
> or is it OK to have one big pool containing both?
>
> My guess is that two pools would be the way to go, but it would be
> nice to get that confirmed by those in the know.

Most definitely, you should only use ONE pool for both sets of data.
ZFS will stripe the data across all spindles in the pool. This will be
true even if the two data sets are simply two large files in a single
filesystem.


--Bill
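To make the single-pool suggestion concrete, here is a minimal sketch;
the disk names (c0t0d0 through c0t3d0) and the pool/filesystem names are
invented for illustration, not taken from anyone's actual setup:

# One pool built from all four disks (no redundancy here; use the
# "mirror" or "raidz" keywords if that matters). ZFS dynamically
# stripes writes across all the top-level devices in the pool.
zpool create datapool c0t0d0 c0t1d0 c0t2d0 c0t3d0

# One filesystem per data set: separate mount points, shared spindles.
zfs create datapool/A
zfs create datapool/B
zfs set mountpoint=/var/data/A datapool/A
zfs set mountpoint=/var/data/B datapool/B

With this layout either data set can use the full bandwidth of all four
disks when the other one is idle.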
On Wed, 2006-02-08 at 14:29 -0800, Rich Teer wrote:
> My guess is that two pools would be the way to go, but it would be
> nice to get that confirmed by those in the know.

I thought the ZFS conventional wisdom would have you create one pool,
using quotas & reservations if necessary to ensure that the two (or
more) competing workloads don't eat so much of the pool that they
interfere with each other.

The main benefit is that you get 4 disks' worth of I/O bandwidth even
when just one of the workloads is active.
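A minimal sketch of the quota/reservation idea, assuming the single-pool
layout shown above; the dataset names and sizes are invented:

# Give each workload a guaranteed floor (reservation) and a ceiling
# (quota) so neither can starve the other for space, while both still
# spread their I/O across every disk in the pool.
zfs set reservation=100g datapool/A
zfs set quota=400g datapool/A

zfs set reservation=100g datapool/B
zfs set quota=400g datapool/B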
On Wed, 8 Feb 2006, Bill Moore wrote:

> On Wed, Feb 08, 2006 at 02:29:32PM -0800, Rich Teer wrote:
> > I'm going to be deploying ZFS in the near future but I have a
> > couple of quick questions:
> >
> > 1. With UFS, for performance reasons it is (was?) desirable to
> > limit the number of files in a directory to no more than a few
> > thousand. If MANY files were anticipated being in a directory,
> > a hierarchy was usually recommended to keep the number of files
> > in a single directory to a manageable number. Does this still
> > apply to ZFS?
>
> No. With ZFS, we use an extensible on-disk hash for directories.
> Trying to create or look up a given file is a constant-time operation.

ZFS solves a bug which I have encountered on Solaris 8, 9 and 10. The
basic bug signature is that you create and remove lots of small files
in a given directory. It does not matter if it's a UFS or a TMPFS
filesystem. The files have a common signature. One example that comes
immediately to mind, based on (nasty) first-hand experience, are files
like:

aosems.124312431234.mail0
aosems.124312431234.mail1
aosems.124312431234.mail2
aosems.124312431234.mail3
aosems.124312431234.mail4
aosems.124312431234.mail5
aosems.124312431234.mail6
aosems.124312431234.rawmail
aosems.124312431234.lock

The files are scanned and, once a '.lock' file is found, the related
files (those sharing the common center section, 124312431234 in the
above example, bounded by the '.' characters) are consumed by a process
that scans the directory periodically. When processed, the files are
moved to an intermediate directory, which is a subdirectory of where
these files were written, and archived when the number of files in the
subdirectory reaches a threshold (as "seen" by a cron job).

The bug: after several days of this type of activity, the directory
structure "deteriorates", with the following observed behavior:

1) If the number of 'aosems.*' files exceeds approx 600 or 700 files,
   the C code (which is bone-head simple) scanning the directory slows
   to a crawl. This severely impacts the ability of the consuming
   process to actually consume the files ... which leads to more files
   accumulating ... and the entire scheme falls over. A classic case of
   a failure mode that resembles the trajectory of an object falling
   off a cliff (the EE term is a "knee" curve failure mode).

2) The interesting part: if the failure mode is recognized and the
   number of files is still reasonable (say, between 2,000 and 3,000
   files have accumulated), then when a human attempts to manually
   remove the files, the rm command becomes single-threaded and the
   best rate of file removal approaches 2 or 3 files per second (on a
   lightly loaded 6-processor SPARC box equipped with 900MHz
   UltraSPARC IIIs with 8MB of cache).

3) No variation of the rm command (find blah,blah | xargs rm, or
   rm -rf blah, etc.) can improve the situation.

4) An ls command exhibits the same behavior: it takes minutes of
   elapsed time to complete.

5) The same behavior is observed on x86/AMD64 based systems.

6) The observed behavior is independent of the number of CPUs installed
   in the "problem" system. One CPU will be observed to be 100% busy;
   all other CPUs will be observed to be idle.

7) The same behavior is observed even if the files are written in a
   well-planned directory structure.
   For example:

   /var/project/something/2006/02/08/1600       (i.e. .../yyyy/mm/dd/hhhh)
   /var/project/somethingelse/2006/02/08/1600
   /var/project/xml/2006/02/08/1600

   and the number of files that accumulate in each "leaf" node is
   reasonable: for example, < approx. 600 files worst case and less
   than 100 on average. [In the above example the max # of files that
   can accumulate in a directory is limited, by the activity on the
   system, to the number of files that are generated by the system
   lusers in a one-hour interval.]

   The key to exercising this bug is to write & rm the files several
   times. After several generations of files are written and removed
   below the top-level directory (in the above case, /var/project), the
   observed behavior will be very obvious, even from behind the command
   prompt!

8) Using tempfile (or was it tmpfile) to generate the files results in
   the same observed pattern of behavior. So it's not specific to the
   signature of the actual file name.

9) The only real operational solution to the issue is to stop the
   producing/consuming processes, remove the top-level directory
   periodically, then re-create it. Unfortunately, if the top-level
   directory is /tmp, you are SOL!

Warning: Don't try this in your home directory! Create a subdirectory
and then try it.

Followup: I've seen people refer to this bug in public forums, without
an understanding of how to 'tickle' it, with comments like 'why does it
take so long for Solaris to remove files' .... I had fully intended to
root-cause this issue since the launch of the OpenSolaris project. At
this point in time, it's obvious that it'll continue to be "tomorrow"
before I get a chance to work on it, given my current workload and CAB
involvement etc.

> > 2. If one has an application mix that makes use of two or more
> > distinct sets of data (say, sets A and B), and one has a set of
> > 4 disks, space considerations aside, is it better to create two
> > ZFS storage pools (one for each set)? In other words, if just
> > one big pool is used, is ZFS smart enough to balance the I/O
> > across the spindles? Assume each data set has a different mount
> > point.
> >
> > To illustrate what I mean, suppose I have /var/data/A and /var/data/B,
> > where A and B are unrelated sets of data on different file systems.
> > For best performance, should A and B reside in their own ZFS pool,
> > or is it OK to have one big pool containing both?
> >
> > My guess is that two pools would be the way to go, but it would be
> > nice to get that confirmed by those in the know.
>
> Most definitely, you should only use ONE pool for both sets of data.
> ZFS will stripe the data across all spindles in the pool. This will be
> true even if the two data sets are simply two large files in a single
> filesystem.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
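For anyone who wants to try to tickle the behavior Al describes, a rough
reproduction sketch along these lines may help; the directory path, file
names, counts and generation limit are all invented, and it should be
run in a scratch directory on the filesystem under test (not /tmp or
your home directory):

#!/bin/sh
#
# Rough sketch: repeatedly create and remove batches of small files in
# one directory, and time a plain scan and the removal as the directory
# "ages" across generations. The real problem reportedly builds up over
# days, so treat this only as the shape of a test, not a guaranteed repro.

DIR=/var/tmp/dirstress        # scratch dir on the filesystem under test
mkdir -p $DIR

gen=1
while [ $gen -le 20 ]; do
    i=0
    while [ $i -lt 2000 ]; do
        touch $DIR/aosems.$gen.$i
        i=`expr $i + 1`
    done
    echo "generation $gen:"
    time ls $DIR > /dev/null        # does scanning slow down over time?
    time rm $DIR/aosems.$gen.*      # does removal slow down over time?
    gen=`expr $gen + 1`
done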
Hey, Bill -

Having played a little with zillions of files in a single directory, I
can say that there are still cases where it is NOT desirable to have
said zillions of files (or directories) in one spot... The worst case
for me is when you have something performing an action on '*' or a
regular ls [-l] sort of action...

Of course, there is much that could be said about the design of
whatever it is *using* the directory, and its own efficiency, but I'm
looking at it from the 'already existing' methodologies that tend to
work sequentially, or on all entries in the directory...

For example - having a million directories within a directory, and
typing 'ls'... You better hope that you have *lots* of free memory...
:) (the ls wants in excess of 500MB...)

Another question some of my playing raised, however, was the amount of
space directory entries use on disk. I created a million directories in
a 2GB zfs filesystem, and noted the following:

Filesystem            kbytes    used   avail capacity  Mounted on
pool0/fs0            2064384 1135474  928168    56%    /pool0/fs0

Nearly a MB per directory entry?

Hm!

Nathan.


On Thu, 2006-02-09 at 09:46, Bill Moore wrote:
> On Wed, Feb 08, 2006 at 02:29:32PM -0800, Rich Teer wrote:
> > 1. With UFS, for performance reasons it is (was?) desirable to
> > limit the number of files in a directory to no more than a few
> > thousand. If MANY files were anticipated being in a directory,
> > a hierarchy was usually recommended to keep the number of files
> > in a single directory to a manageable number. Does this still
> > apply to ZFS?
>
> No. With ZFS, we use an extensible on-disk hash for directories.
> Trying to create or look up a given file is a constant-time operation.
>
> [...]

--
//////////////////////////////////////////////////////////////////
// Nathan Kroenert              nathan.kroenert at sun.com        //
// PTS Engineer                 Phone:      +61 2 9844-5235      //
// Sun Services                 Direct Ext: x57235               //
// Level 2, 828 Pacific Hwy     Fax:        +61 2 9844-5311      //
// Gordon 2072 New South Wales  Australia                        //
//////////////////////////////////////////////////////////////////
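Nathan's experiment is easy to repeat on a throwaway pool; a rough
sketch, with an invented file-backed vdev path and pool/filesystem
names:

# Build a small pool on a 2GB file-backed vdev, then fill one
# filesystem with a million empty directories and see what df reports.
mkfile 2g /var/tmp/zfs-backing-file
zpool create pool0 /var/tmp/zfs-backing-file
zfs create pool0/fs0

cd /pool0/fs0
i=0
while [ $i -lt 1000000 ]; do      # slow with expr; just shows the shape
    mkdir d$i
    i=`expr $i + 1`
done

df -k /pool0/fs0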
Sorry to follow up my own mail...

> Another question some of my playing raised, however, was the amount of
> space directory entries use on disk. I created a million directories in
> a 2GB zfs filesystem, and noted the following:
>
> Filesystem            kbytes    used   avail capacity  Mounted on
> pool0/fs0            2064384 1135474  928168    56%    /pool0/fs0
>
> Nearly a MB per directory entry?

I'm on drugs. Or perhaps lacking caffeine... I'm out by an order of
magnitude here... of course it's not an MB per directory... oops.

That will teach me to play before my morning Cola... :)

Nathan.


On Thu, 2006-02-09 at 11:23, Nathan Kroenert wrote:
> Hey, Bill -
>
> Having played a little with zillions of files in a single directory, I
> can say that there are still cases where it is NOT desirable to have
> said zillions of files (or directories) in one spot...
>
> [...]
--
//////////////////////////////////////////////////////////////////
// Nathan Kroenert              nathan.kroenert at sun.com        //
// PTS Engineer                 Phone:      +61 2 9844-5235      //
// Sun Services                 Direct Ext: x57235               //
// Level 2, 828 Pacific Hwy     Fax:        +61 2 9844-5311      //
// Gordon 2072 New South Wales  Australia                        //
//////////////////////////////////////////////////////////////////
> Having played a little with zillions of files in a single directory, I
> can say that there are still cases where it is NOT desirable to have
> said zillions of files (or directories) in one spot... The worst case
> for me is when you have something performing an action on '*' or a
> regular ls [-l] sort of action...

Very good point. If you have millions of files in a single directory,
you really only want to access them programmatically. Even 'echo *'
takes a while with that many files.

> Another question some of my playing raised, however, was the amount of
> space directory entries use on disk. I created a million directories in
> a 2GB zfs filesystem, and noted the following:
>
> Filesystem            kbytes    used   avail capacity  Mounted on
> pool0/fs0            2064384 1135474  928168    56%    /pool0/fs0
>
> Nearly a MB per directory entry?

I believe it's 1k, not 1m:

    space per dir = (1135474 * 1024 bytes used) / (2^20 dirs)
                  = about 1109 bytes

Jeff
On Thu, Feb 09, 2006 at 11:23:40AM +1100, Nathan Kroenert wrote:
> For example - having a million directories within a directory, and
> typing 'ls'... You better hope that you have *lots* of free memory...
> :) (the ls wants in excess of 500MB...)

Yeah, 'ls' could be a lot more efficient. See a couple of bugs that I
filed a while back:

  6299767 'ls -f' should not buffer output
  6299769 'ls' memory usage is excessive

These would be great candidates for someone from the community to work
on...

> Another question some of my playing raised, however, was the amount of
> space directory entries use on disk. I created a million directories in
> a 2GB zfs filesystem, and noted the following:
>
> Filesystem            kbytes    used   avail capacity  Mounted on
> pool0/fs0            2064384 1135474  928168    56%    /pool0/fs0

It's interesting to see the breakdown of where that ~1100 bytes per
directory comes from:

(1) The directory itself is using about 89 bytes per entry (as reported
    by stat64() or 'ls -s' on the directory).

(2) Each (empty) subdirectory is using up 512 bytes (each in a single
    512-byte sector) to store its (zero) entries.

(3) Each subdirectory has an entry in the "dnode file", which stores
    the metadata for each file, similar to inodes in UFS. Each dnode is
    512 bytes.

Until recently, ZFS compressed all metadata (it was turned off to help
diagnose some nasty bugs). With compression on, the directory and the
dnode file will get smaller, but unfortunately each subdirectory's data
is already stored on the smallest possible (512-byte) block. With
compression, the total is about 650 bytes per subdirectory, broken down
as follows:

(1) directory: 38 bytes per entry
(2) subdirectory entries: 512 bytes per entry
(3) dnode: 100 bytes per entry

If we stored empty files rather than empty directories, we eliminate
the 512-byte block per entry, for a total of around 93 bytes per file
(the dnode file can compress almost twice as well because it doesn't
need to store the block pointers for the subdirectory's blocks):

(1) directory: 38 bytes per entry
(2) empty files don't use any blocks for their data
(3) dnode: 56 bytes per entry

There you have it, more than you ever wanted to know about space usage
for lots of empty files or directories.

--matt
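If you want to see numbers like (1) and (2) on your own pool, something
along these lines should show them; the pool/filesystem/directory names
are hypothetical, and the dnode-level figures in (3) aren't visible from
the standard tools, so they aren't shown here:

# (1) Size of the parent directory object itself:
ls -ld /pool0/fs0         # st_size grows with the number of entries
ls -sd /pool0/fs0         # allocated size, reported in 512-byte blocks

# (2) Each empty subdirectory still occupies one 512-byte sector:
ls -sd /pool0/fs0/d0

# The overall per-entry cost is what df reflects:
df -k /pool0/fs0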
On Wed, 2006-02-08 at 17:25 -0800, Matthew Ahrens wrote:
> (2) Each (empty) subdirectory is using up 512 bytes (each in a single
>     512-byte sector) to store its (zero) entries.

Hmm. Are empty directories common enough to make it worthwhile to store
an empty directory as a zero-length object instead of a 512-byte
object? (I may not have the terminology correct...)

One case which this could help: Subversion goes nuts creating a lot of
empty subdirs in its working copies (each source directory in a working
copy contains a .svn subdir with between 5 and 8 initially-empty
subdirs), but I'm not sure if anything else would benefit.

Given that everything's COW already, it doesn't seem like it would cost
that much to implement, but it's unclear if the payback would be worth
it.

- Bill
Bill Moore wrote:
>
>> 2. If one has an application mix that makes use of two or more
>> distinct sets of data (say, sets A and B), and one has a set of
>> 4 disks, space considerations aside, is it better to create two
>> ZFS storage pools (one for each set)? In other words, if just
>> one big pool is used, is ZFS smart enough to balance the I/O
>> across the spindles? Assume each data set has a different mount
>> point.
>>
>> To illustrate what I mean, suppose I have /var/data/A and /var/data/B,
>> where A and B are unrelated sets of data on different file systems.
>> For best performance, should A and B reside in their own ZFS pool,
>> or is it OK to have one big pool containing both?
>>
>> My guess is that two pools would be the way to go, but it would be
>> nice to get that confirmed by those in the know.
>
> Most definitely, you should only use ONE pool for both sets of data.
> ZFS will stripe the data across all spindles in the pool. This will be
> true even if the two data sets are simply two large files in a single
> filesystem.

This goes against the current provisioning guidelines we've used for
data sets that require different performance or access patterns. You
might want to expand on how ZFS isn't impacted.

For example: today we place Oracle logs and tables on different
filesystems, usually on different physical LUNs, to avoid things like
LUN skew. How does ZFS fix that?
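For reference, the layout the earlier replies imply for the Oracle
example is separate filesystems, not separate pools, within one pool. A
rough sketch, with invented pool/dataset names and mount points, and a
record size that assumes a hypothetical 8k db_block_size:

# One pool across all available spindles; separate filesystems let the
# logs and tables keep their own mount points and per-dataset tuning.
zpool create orapool c0t0d0 c0t1d0 c0t2d0 c0t3d0

zfs create orapool/oradata
zfs create orapool/oralogs
zfs set mountpoint=/u01/oradata orapool/oradata
zfs set mountpoint=/u02/oralogs orapool/oralogs

# Matching the filesystem record size to the database block size is the
# usual suggestion for the table files (assumes an 8k db_block_size).
zfs set recordsize=8k orapool/oradata

Whether this actually addresses the LUN-skew concern is exactly the
question being asked above.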