Hello,

I'd like to plan a storage solution for a system currently in production.

The system's storage is based on code which writes many files to the file system, with overall storage needs currently around 40TB and expected to reach hundreds of TBs. The average file size of the system is ~100K, which translates to ~500 million files today, and billions of files in the future. This storage is accessed over NFS by a rack of 40 Linux blades, and is mostly read-only (99% of the activity is reads). While I realize that calling this sub-optimal system design is probably an understatement, the design of the system is beyond my control and isn't likely to change in the near future.

The system's current storage is based on 4 VxFS filesystems, created on SVM meta-devices, each ~10TB in size. A 2-node Sun Cluster serves the filesystems, 2 filesystems per node. Each of the filesystems undergoes growfs as more storage is made available.

We're looking for an alternative solution, in an attempt to improve performance and the ability to recover from disasters (fsck on 2^42 files isn't practical, and I'm getting pretty worried about this - even the smallest filesystem inconsistency will leave me with lots of useless bits).

The question is: does anyone here have experience with large ZFS filesystems holding many small files? Is it practical to base such a solution on a few (8) zpools, each with a single large filesystem in it?

Many thanks in advance for any advice,
 - Yaniv
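For reference, the layout being asked about - a handful of zpools, each carrying one large filesystem exported over NFS - might look roughly like the sketch below. The pool names, device names and property choices are made up for illustration and are not from the original post.

    # Hypothetical sketch: 8 zpools, one filesystem each, exported over NFS.
    # Device names (c1t0d0 ...) and pool names are placeholders only.
    for i in 1 2 3 4 5 6 7 8
    do
        zpool create tank$i raidz2 c${i}t0d0 c${i}t1d0 c${i}t2d0 c${i}t3d0 c${i}t4d0 c${i}t5d0
        zfs create tank$i/data
        zfs set atime=off tank$i/data      # skip access-time updates on a ~99%-read workload
        zfs set sharenfs=on tank$i/data    # export over NFS to the Linux blades
    done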
Hello Yaniv,

Wednesday, April 18, 2007, 3:44:57 PM, you wrote:

YA> [original question snipped - 40TB today growing to hundreds of TBs, ~100K average
YA> file size, ~500 million files going on billions, served over NFS to 40 Linux blades]

I have "some" experience with a similar but bigger environment, with a lot of data already on ZFS (for years now), although I can't talk about many details...

One of the problems is: how are you going to back up all this data? With so many small files the classical approach probably won't work, and if it does now, it won't in the (near) future. I would strongly suggest disk-to-disk backup plus snapshots for point-in-time backups.

With lots of small files I observed ZFS consuming about the same disk space as UFS. It seems there's a problem with filesystem fragmentation after some time with lots of files (zfs send|recv helps for some time).

While I see no problem going with one file system (the pool itself?) in each zpool, with TBs of data I would consider splitting it into more file systems, mostly for "management" reasons like backup and snapshotting. Splitting into more file systems also helps when you have to migrate one of the file systems to other storage - it's easier to find 1TB of storage than 20TB. I try to keep each production file system below 1TB, not that there are any problems with larger file systems.

When doing Sun Cluster, consider creating at least as many zpools as you have nodes in the cluster, so that if you have to, you can spread your workload out across the nodes (put each zpool in a different SC resource group with its own IP).

We did some tests with Linux (2.4 and 2.6) and it seems there's a problem if you have thousands of NFS file systems - they won't all be mounted automatically, and even doing it manually (or in a script with a sleep between each mount) there seems to be a limit below 1000. We did not investigate further, as in that environment all NFS clients are Solaris servers (x86, sparc) and we see no problems with thousands of file systems.
If you switch a resource group from one node to another node that is already serving another NFS resource group, keep in mind that nfsd will actually restart, which means a service disruption for that other group as well. With ZFS, stopping nfsd can sometimes take minutes...

There are also more things to consider (storage layout, network config, etc.).

-- 
Best regards,
Robert Milkowski
mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
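The disk-to-disk backup plus snapshots approach recommended above can be sketched roughly as follows. The pool, filesystem and snapshot names are hypothetical; the zfs snapshot / send / receive commands themselves are the standard ones.

    # Point-in-time copies are cheap and instant:
    zfs snapshot tank1/data@2007-04-19

    # First, a full replication to a backup pool (the stream could also be piped
    # over ssh to another host):
    zfs send tank1/data@2007-04-19 | zfs receive backup/data

    # Later runs only ship blocks changed since the previous snapshot, so the cost
    # scales with changed data rather than with the number of files:
    zfs snapshot tank1/data@2007-04-20
    zfs send -i tank1/data@2007-04-19 tank1/data@2007-04-20 | zfs receive backup/data

    # Splitting the data into many smaller filesystems (staying below ~1TB each,
    # as suggested above) keeps snapshots, backups and migrations manageable:
    zfs create tank1/data/fs001
    zfs set quota=1T tank1/data/fs001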
Robert Milkowski wrote:
> We did some tests with Linux (2.4 and 2.6) and it seems there's a
> problem if you have thousands of NFS file systems - they won't all be
> mounted automatically, and even doing it manually (or in a script with
> a sleep between each mount) there seems to be a limit below 1000. [...]

The Linux limitation is possibly due to privileged port exhaustion with TCP mounts, FYI.

-- Carson
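If anyone wants to check whether a Linux client really is running out of reserved ports, a rough count of TCP connections bound to privileged local ports (each TCP NFS mount normally holds one) can be had with standard tools. This is only a diagnostic sketch:

    # Count established TCP connections whose local port is below 1024.
    # A figure approaching ~1000 on an NFS client points at reserved-port
    # exhaustion rather than a server-side problem.
    netstat -tn | awk 'NR > 2 { n = split($4, a, ":"); p = a[n] + 0; if (p > 0 && p < 1024) c++ } END { print c + 0 }'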
Hello Carson,

Thursday, April 19, 2007, 1:22:17 AM, you wrote:

CG> The Linux limitation is possibly due to privileged port exhaustion with
CG> TCP mounts, FYI.

We've been thinking along the same lines (1024 minus the ports used by services already running). But still, with a few hundred NFS entries Linux times out and you end up with some file systems not mounted, etc.

-- 
Best regards,
Robert
mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
On Apr 18, 2007, at 6:44 PM, Robert Milkowski wrote:
> We've been thinking along the same lines (1024 minus the ports used by
> services already running).
>
> But still, with a few hundred NFS entries Linux times out and you end up
> with some file systems not mounted, etc.

See the Linux NFS FAQ at http://nfs.sourceforge.net/ Question/Answer B3. There is a limit of a few hundred NFS mounts.

Spencer
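For completeness, the kind of client-side mount loop described earlier in the thread (mounting the exports one by one with a pause between them) might look like the sketch below. The server name, export paths and mount options are invented for the example, and per the FAQ above it will still hit a wall at a few hundred TCP mounts.

    #!/bin/sh
    # Hypothetical client-side mount loop; names and paths are placeholders.
    SERVER=nfsserver
    i=1
    while [ $i -le 500 ]
    do
        mkdir -p /mnt/fs$i
        mount -o ro,hard,intr,tcp $SERVER:/tank$(( (i % 8) + 1 ))/fs$i /mnt/fs$i || echo "mount $i failed"
        sleep 1        # spacing the mounts out, as described earlier in the thread
        i=$(( i + 1 ))
    done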
Hello Spencer,

Thursday, April 19, 2007, 2:28:30 AM, you wrote:

SS> See the Linux NFS FAQ at http://nfs.sourceforge.net/
SS> Question/Answer B3. There is a limit of a few hundred
SS> NFS mounts.

Thanks.

-- 
Best regards,
Robert
mailto:rmilkowski at task.gda.pl
http://milek.blogspot.com
Hi Robert, thanks for the information.

I understand from your words that you're more worried about overall filesystem size rather than the number of files, yes? Is the number of files something I should or should not worry about? I.e., what are the differences (in stability, recoverability, performance, manageability, etc.) between a 25TB filesystem with 2^35 files and a 25TB filesystem with 1,000 files of 25GB each?

Also, if it's possible to ask without stepping outside any of your customers' NDAs, can you at least say what the average file size is on some of your multi-terabyte volumes (is 10K a small file? is 100K? 1K?) Is anyone else on the forum able to quote their numbers?

Regarding your Sun Cluster recommendations - thanks, I'll do just that.

Thanks again and regards,
 - Yaniv
Hello Aknin,

Thursday, April 19, 2007, 7:20:26 AM, you wrote:

> I understand from your words that you're more worried about overall filesystem size rather than the number of files, yes?
> Is the number of files something I should or should not worry about? I.e., what are the differences (in stability,
> recoverability, performance, manageability, etc.) between a 25TB filesystem with 2^35 files and a 25TB filesystem with
> 1,000 files of 25GB each?

If you are OK with your application having to access lots of small files, then it's not an issue - except for backup. It really depends on how you want to do your backups. Lots of small files are bad, very bad, for classical backup solutions. In terms of many small files I see no problem with stability, recoverability or performance (depends on the app and workload).

Now, the difference in the scenario you asked about is that if you want to back up 1,000 files, then depending on what file system you use and how you created those files, you're probably going to read them mostly sequentially at the physical layer. Also, it's very cheap in most cases to check whether 1,000 files have changed, instead of millions.

As I wrote - if your app/workload is happy with many small files, then fine. But you'll definitely have a problem with backup.

> Also, if it's possible to ask without stepping outside any of your customers' NDAs, can you at least say what the average
> file size is on some of your multi-terabyte volumes (is 10K a small file? is 100K? 1K?)

I'm afraid I can't :( But I can say that to me anything below 512KB is a small file (starting from a few bytes). Also, the file size distribution is such that I have mostly small files and large files; the remaining 10% is somewhere in between.

-- 
Best regards,
Robert Milkowski
mailto:rmilkowski@task.gda.pl
http://milek.blogspot.com
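To make the backup contrast concrete: a classical file-level incremental has to examine every file just to find the changed ones, while a snapshot-based incremental never looks at individual files at all. A rough illustration with hypothetical paths and timestamp file (and assuming find/cpio on the backup host):

    # Classical file-level incremental: walks the tree and stat()s every file to
    # find what changed since the last run, so the cost scales with the number of
    # files (hundreds of millions here) even if almost nothing changed.
    cd /tank1/data &&
        find . -type f -newer /var/run/last-backup | cpio -oc > /backup/incr-2007-04-19.cpio &&
        touch /var/run/last-backup
    # A snapshot plus incremental "zfs send" (as sketched earlier in the thread)
    # skips the per-file walk entirely; the work scales with changed blocks instead.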
Yaniv Aknin wrote:
> I understand from your words that you're more worried about overall filesystem size rather than the number of files, yes?
> Is the number of files something I should or should not worry about? I.e., what are the differences (in stability,
> recoverability, performance, manageability, etc.) between a 25TB filesystem with 2^35 files and a 25TB filesystem with
> 1,000 files of 25GB each?

I don't think we anticipate problems, but I'm not sure there are a lot of people who have done this yet. We do know of such limitations in UFS and other file systems, which do not exist in ZFS, by design.

 -- richard
You should definitely worry about the number of files when it comes to backup and management. It will also make a big difference in space overhead: a ZFS filesystem with 2^35 files will have a minimum of 2^44 bytes of overhead just for the file nodes, which is about 16 TB.

If it takes about 20 ms of overhead to back up a file (2 seeks), then 2^35 files will take 21 years to back up. ;-)

I'm guessing you didn't really mean 2^35, though. (If you did, you're likely to need a system along the lines of DARPA's HPCS program....)

Anton
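Anton's figures work out if you assume roughly 512 bytes (2^9) of on-disk metadata per file node, which is where 2^35 files x 2^9 bytes = 2^44 bytes comes from. A quick bc check of both numbers:

    # 2^35 file nodes at ~512 bytes each:
    echo '2^35 * 2^9' | bc                            # 17592186044416 bytes
    echo '(2^35 * 2^9) / 1024^4' | bc                 # = 16 TiB

    # 2^35 files at ~20 ms (two seeks) of per-file backup overhead:
    echo '2^35 * 0.020 / (3600 * 24 * 365)' | bc -l   # ~21.8 years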