A user has 5 directories, each has tens of thousands of files; the
largest directory has over a million files.  The files themselves are
not very large.  Here is an "ls -lh" on the directories [these are all
ZFS-based]:

[root at cluster]# ls -lh
total 341M
drwxr-xr-x+ 2 someone cluster  13K Sep 14 19:09 0/
drwxr-xr-x+ 2 someone cluster  50K Sep 14 19:09 1/
drwxr-xr-x+ 2 someone cluster 197K Sep 14 19:09 2/
drwxr-xr-x+ 2 someone cluster 785K Sep 14 19:09 3/
drwxr-xr-x+ 2 someone cluster 3.1M Sep 14 19:09 4/

When I go into directory "0", it takes about a minute for an "ls -1 |
grep wc" to return (it has about 12,000 files).  Directory "1" takes
between 5-10 minutes for the same command to return (it has about
50,000 files).

I did an rsync of this directory structure to another filesystem
[lustre-based, FWIW] and it took about 24 hours to complete.  We have
done rsyncs on other directories that are much larger in terms of
file sizes, but have thousands of files rather than tens of thousands,
hundreds of thousands, and millions of files.

Is there some way to speed up "simple" things like determining the
contents of these directories?  And why does an rsync take so much
longer on these directories when directories that contain hundreds of
gigabytes transfer much faster?

Jeff
Jeff Haferman wrote:
> A user has 5 directories, each has tens of thousands of files, the
> largest directory has over a million files.  The files themselves are
> not very large, here is an "ls -lh" on the directories:
> [these are all ZFS-based]
>
> [root at cluster]# ls -lh
> total 341M
> drwxr-xr-x+ 2 someone cluster  13K Sep 14 19:09 0/
> drwxr-xr-x+ 2 someone cluster  50K Sep 14 19:09 1/
> drwxr-xr-x+ 2 someone cluster 197K Sep 14 19:09 2/
> drwxr-xr-x+ 2 someone cluster 785K Sep 14 19:09 3/
> drwxr-xr-x+ 2 someone cluster 3.1M Sep 14 19:09 4/
>
> When I go into directory "0", it takes about a minute for an "ls -1 |
> grep wc" to return (it has about 12,000 files).  Directory "1" takes
> between 5-10 minutes for the same command to return (it has about
> 50,000 files).

"ls" sorts its output before printing, unless you use the option to
turn this off (-f, IIRC, but check the man page).

"echo * | wc" is also a way to find out what's in a directory, but
you'll miss "." files, and the shell you're using may have an
influence.

HTH
Michael
--
Michael Schuster        http://blogs.sun.com/recursion
Recursion, n.: see 'Recursion'
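Michael's point is easy to check for yourself. A rough sketch (the
directory and file names are made up, and this assumes a POSIX/GNU
"ls" where -f disables sorting; to actually compare speed, prefix the
two ls invocations with time(1)):

```shell
# By default "ls" reads and sorts *every* entry before printing
# anything; "ls -f" streams entries in directory order instead.
tmp=$(mktemp -d)
i=0
while [ "$i" -lt 500 ]; do
    : > "$tmp/file$i"
    i=$((i + 1))
done

ls -1 "$tmp" > /dev/null    # sorted: nothing prints until the sort is done
ls -f "$tmp" > /dev/null    # unsorted: entries stream out as read
                            # (also lists "." and "..", since -f implies -a)

# A sort-free entry count; subtract 2 for "." and "..":
count=$(ls -f "$tmp" | wc -l | tr -d ' ')
echo "$((count - 2)) files"
```

On a directory with tens of thousands of entries the difference
between the two forms is what the sort (plus, for "ls -l", a stat per
file) costs.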
+------------------------------------------------------------------------------
| On 2009-10-03 18:50:58, Jeff Haferman wrote:
|
| I did an rsync of this directory structure to another filesystem
| [lustre-based, FWIW] and it took about 24 hours to complete.  We have
| done rsyncs on other directories that are much larger in terms of
| file sizes, but have thousands of files rather than tens of thousands,
| hundreds of thousands, and millions of files.
|
| Is there some way to speed up "simple" things like determining the
| contents of these directories?

Use zfs snapshots.  See zfs(1M) and review the incremental send syntax.

| And why does an rsync take so much
| longer on these directories when directories that contain hundreds of

rsync has to build its file list (stat is slow) on both sides of the
sync, then compare them, and then send each one.  (d)truss it sometime.
It's a lot of syscalls.

The initial zfs send may be slow, depending on the total size, but the
incrementals will be pretty fast.  Certainly faster than rsync (by
orders of magnitude), as ZFS already knows which blocks it needs to
send, and is only sending blocks.

If the target host doesn't support ZFS in some form, you could dump the
snapshots to disk and use those for backups.  Or restructure your
storage hierarchy (which, uh, you might want to do anyway).

--
bda
cyberpunk is dead.  long live cyberpunk.
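The snapshot-plus-incremental workflow bda is suggesting looks roughly
like the sketch below. The dataset "tank/data" and host "backuphost"
are made-up names, and the commands are only echoed rather than run,
since the real ones need root and an actual pool:

```shell
# Sketch of zfs snapshot + incremental send (hypothetical names).
DATASET=tank/data
TARGET=backuphost

run() { echo "$@"; }   # swap 'echo' for real execution when ready

# 1. Baseline: snapshot and send the full stream once.
run zfs snapshot "$DATASET@monday"
run "zfs send $DATASET@monday | ssh $TARGET zfs receive tank/backup"

# 2. Thereafter, send only the blocks changed between two snapshots.
run zfs snapshot "$DATASET@tuesday"
run "zfs send -i $DATASET@monday $DATASET@tuesday | ssh $TARGET zfs receive tank/backup"
```

The incremental stream never walks the million-entry directories at
all, which is why it sidesteps the per-file cost that makes rsync slow
here.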
On Sat, Oct 3, 2009 at 6:50 PM, Jeff Haferman <jeff at haferman.com> wrote:
>
> A user has 5 directories, each has tens of thousands of files, the
> largest directory has over a million files.  The files themselves are
> not very large, here is an "ls -lh" on the directories:
> [these are all ZFS-based]
>
> [root at cluster]# ls -lh
> total 341M
> drwxr-xr-x+ 2 someone cluster  13K Sep 14 19:09 0/
> drwxr-xr-x+ 2 someone cluster  50K Sep 14 19:09 1/
> drwxr-xr-x+ 2 someone cluster 197K Sep 14 19:09 2/
> drwxr-xr-x+ 2 someone cluster 785K Sep 14 19:09 3/
> drwxr-xr-x+ 2 someone cluster 3.1M Sep 14 19:09 4/
>
> When I go into directory "0", it takes about a minute for an "ls -1 |
> grep wc" to return (it has about 12,000 files).  Directory "1" takes
> between 5-10 minutes for the same command to return (it has about
> 50,000 files).
>
> I did an rsync of this directory structure to another filesystem
> [lustre-based, FWIW] and it took about 24 hours to complete.  We have
> done rsyncs on other directories that are much larger in terms of
> file sizes, but have thousands of files rather than tens of thousands,
> hundreds of thousands, and millions of files.
>
> Is there some way to speed up "simple" things like determining the
> contents of these directories?  And why does an rsync take so much
> longer on these directories when directories that contain hundreds of
> gigabytes transfer much faster?
>
> Jeff
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Be happy you don't have Windows + NTFS with hundreds of thousands, or
millions, of files.  Explorer will crash, run your system out of memory
and slow it down, or plain out hard-lock Windows for hours on end.
This is on brand-new hardware: 64-bit, 32GB RAM, and 15k SAS disks.

Regardless of filesystem, I'd suggest splitting your directory
structure into a hierarchy.  It makes sense even just for cleanliness.
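One common way to do the splitting Brent suggests (this is an
illustrative sketch, not something from the thread) is to bucket files
into subdirectories keyed on a prefix of a hash of each filename:

```shell
# Sketch: spread a flat directory into up to 256 subdirectories named
# after the first two hex digits of an md5 of each filename.  Assumes
# md5sum (Linux coreutils); on Solaris you might use "digest -a md5"
# instead.
spread() {
    src=$1
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        bucket=$(printf '%s' "$name" | md5sum | cut -c1-2)
        mkdir -p "$src/$bucket"
        mv "$f" "$src/$bucket/$name"
    done
}
```

After e.g. "spread /data/1", each listing or stat walk touches a
directory of a few hundred entries instead of tens of thousands; the
lookup cost of recomputing the bucket is trivial by comparison.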
--
Brent Jones
brent at servuhome.net
On Sat, 3 Oct 2009, Jeff Haferman wrote:
> When I go into directory "0", it takes about a minute for an "ls -1 |
> grep wc" to return (it has about 12,000 files).  Directory "1" takes
> between 5-10 minutes for the same command to return (it has about
> 50,000 files).

This seems kind of slow.  In the directory with a million files that I
keep around for testing, this is the time for the first access:

% time \ls -1 | grep wc
\ls -1  4.70s user 1.20s system 32% cpu 17.994 total
grep wc  0.11s user 0.02s system  0% cpu 17.862 total

and for the second access:

% time \ls -1 | grep wc
\ls -1  4.66s user 1.17s system 69% cpu 8.366 total
grep wc  0.11s user 0.02s system  1% cpu 8.234 total

However, my directory was created as quickly as possible rather than
incrementally over a long period of time, so it lacks the
longer/increased disk seeks caused by fragmentation and block
allocations.  That said, directories with 50K files list quite quickly
here.

> I did an rsync of this directory structure to another filesystem
> [lustre-based, FWIW] and it took about 24 hours to complete.

Rsync is very slow in such situations.

What version of Solaris are you using?  The Solaris version (including
patch version if using Solaris 10) can make a big difference.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
>> Directory "1" takes between 5-10 minutes for the same command to
>> return (it has about 50,000 files).

> That said, directories with 50K files list quite quickly here.

a directory with 52,705 files lists in half a second here:

36 % time \ls -1 > /dev/null
0.41u 0.07s 0:00.50 96.0%

perhaps your ARC is too small?

Rob
Rob Logan wrote:
>>> Directory "1" takes between 5-10 minutes for the same command to
>>> return (it has about 50,000 files).
>
>> That said, directories with 50K files list quite quickly here.
>
> a directory with 52,705 files lists in half a second here
>
> 36 % time \ls -1 > /dev/null
> 0.41u 0.07s 0:00.50 96.0%
>
> perhaps your ARC is too small?

I set it according to Section 1.1 of the ZFS Evil Tuning Guide:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
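For reference, the mechanism that section of the guide describes is
the zfs_arc_max tunable in /etc/system; something along these lines,
where the 4 GB cap is only an example value, not the one Jeff used
(and if the ARC is too *small* for a metadata-heavy workload like
this, the relevant direction is to raise or remove the cap):

```
* /etc/system fragment: cap the ZFS ARC at 4 GB (example value only)
set zfs:zfs_arc_max = 0x100000000
```

A reboot is needed for /etc/system changes to take effect, which is
why the exact value chosen matters for diagnosing this.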
That section doesn't actually prescribe one size, so what size did you
choose and how exactly did you set it?  You haven't told us, and no one
has asked you about the basic system config.

For starters, what CPU, memory and storage?  What other stuff is this
machine doing?  Also, we really do need to know what version of Solaris
(inc. relevant patches) you are using.  What other changes have you
made, if any?

Thanks,
Phil

Sent from my iPhone

On 5 Oct 2009, at 00:24, Jeff Haferman <jeff at haferman.com> wrote:

> Rob Logan wrote:
>>>> Directory "1" takes between 5-10 minutes for the same command to
>>>> return (it has about 50,000 files).
>>
>>> That said, directories with 50K files list quite quickly here.
>>
>> a directory with 52,705 files lists in half a second here
>>
>> 36 % time \ls -1 > /dev/null
>> 0.41u 0.07s 0:00.50 96.0%
>>
>> perhaps your ARC is too small?
>
> I set it according to Section 1.1 of the ZFS Evil Tuning Guide:
> http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
On Sat, October 3, 2009 20:50, Jeff Haferman wrote:
> And why does an rsync take so much longer on these directories when
> directories that contain hundreds of gigabytes transfer much faster?

The rsync protocol has to exchange information about each file between
client and server, as part of the process of deciding whether to send
that file.  Clearly there will be many more such exchanges in the
directories containing many more files.

It therefore appears quite natural to me that, given two directories
with the same amount of actual data, the one with more files will take
longer to rsync.  (The time will ALSO depend on the amount of actual
data, of course; given two directories with the same number of files
but one having 10x the data, the one with more data will at least
sometimes take longer, particularly if the files differ and the data
actually has to be transmitted.)

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
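The per-file cost David describes is easy to see even without running
rsync: the metadata work scales with the file count, not with the data
volume. A small illustrative sketch (all paths are throwaway temp
dirs):

```shell
# The same ~100 KB of data laid out two ways: one file vs. 100 files.
# Any per-file operation (a stat, an rsync file-list entry, a checksum
# exchange) happens 100x as often in the second layout, even though
# the bytes transferred would be identical.
tmp=$(mktemp -d)
mkdir "$tmp/one" "$tmp/many"

dd if=/dev/zero of="$tmp/one/big" bs=1024 count=100 2>/dev/null

i=0
while [ "$i" -lt 100 ]; do
    dd if=/dev/zero of="$tmp/many/f$i" bs=1024 count=1 2>/dev/null
    i=$((i + 1))
done

# Same data volume, very different metadata cost:
echo "one:  $(find "$tmp/one"  -type f | wc -l | tr -d ' ') file(s)"
echo "many: $(find "$tmp/many" -type f | wc -l | tr -d ' ') file(s)"
```

Timing "rsync -a" (or even just "ls -l") against each tree makes the
asymmetry Jeff saw reproducible in miniature.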