alain meunier
2013-May-07 15:15 UTC
[Gluster-users] How many subfolders in parent folders ?
Hello,

I will have a lot (many hundreds of millions) of subdirectories inside a directory, named with md5 hashes.
I thought of a subdivision such as abcd/efgh/... or maybe a/b/c/...
But that will create a bunch of subfolders.
Will gluster handle those files without problems ?

One last thing : hardlinks will be used to point to the files, and they will often be deleted.
Does anyone have experience with the problems I will have to troubleshoot ?

Many thanks,
regards
alain
Hans Lambermont
2013-May-08 08:10 UTC
[Gluster-users] How many subfolders in parent folders ?
Hi Alain,

alain meunier wrote on 20130507:
> I will have a lot (many hundreds of millions) of subdirectories inside a directory, named with md5 hashes.
> I thought of a subdivision abcd/efgh/... or maybe a/b/c/...
> But it will create a bunch of subfolders.
> Will gluster handle those files without problems ?

Yes, but slowly. Very slowly, IMO :-/
To give you an idea : a recursive walk through my Distributed-Replicate 14 x 2 filesystem (2M directories, max 100 per subdirectory) takes several days.

> One last thing : hardlinks will be involved to point the files and often deleted.
> Does someone have any experience of the troubleshooting I will have to face ?

Should be fine. Try to avoid renaming files, as that causes a DHT miss penalty.

My advice : first build a test setup.

-- Hans
--
Hans Lambermont | Senior Architect
(t) +31407370104 (w) www.shapeways.com
Laurent Chouinard
2013-May-08 21:56 UTC
[Gluster-users] How many subfolders in parent folders ?
Hi Alain,

The performance of this is going to depend heavily on the hardware your cluster uses. For example, I have a cluster here of 4 nodes with 4 SSDs each, and a replication factor of 2. Bricks are using XFS. Nodes are interconnected with 10 GigE networking.

The folder structure we use is based on the hexadecimal representation, and we take the first two bytes of the file name to decide where a file goes. This way, we end up with:

\00\00\...
\00\01\...
\00\0f\...
etc.

That makes a total of only 65,792 folders (256 x 256 + 256), and if you spread 100 million different files across them, the result is a bit over 1500 files per folder, which is very reasonable.

As for how fast such a layout can be crawled: my cluster here has 3 million files at the moment. If I run a "find ." command from inside one of the 256 top folders, it takes 14 seconds. I can extrapolate that (14 seconds x 256) to just over 59 minutes if I were to crawl through all of them.

I would advise against going too deep with folder-in-folder-in-folder nesting, because as you multiply the possibilities, you'll end up with a file system with millions of entries just for the folders themselves.

Laurent Chouinard
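
For anyone who wants to check the numbers, here is a rough Python sketch of the two-level layout and the arithmetic above. It is only an illustration, not the code used on the cluster: the function names (shard_path, layout_stats, extrapolate_crawl) and the "data" root are made up for this example, and it assumes names are hex strings such as md5 digests.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Tuple


def shard_path(hex_name: str, root: str = "data") -> PurePosixPath:
    """Place a hex-named entry under root/<byte 1>/<byte 2>/<name>.

    Each byte is two hex characters, so each level has 256 possible folders.
    (Hypothetical helper; "data" is an assumed root, not from the thread.)
    """
    if len(hex_name) < 4:
        raise ValueError("need at least two bytes (four hex characters)")
    name = hex_name.lower()
    return PurePosixPath(root) / name[0:2] / name[2:4] / name


def layout_stats(total_files: int, fanout: int = 256) -> Tuple[int, int, float]:
    """Return (leaf folders, total folders, average files per leaf folder)."""
    leaf_dirs = fanout * fanout        # 256 x 256 second-level folders
    total_dirs = leaf_dirs + fanout    # plus the 256 top-level folders
    return leaf_dirs, total_dirs, total_files / leaf_dirs


def extrapolate_crawl(seconds_per_top_dir: float, fanout: int = 256) -> float:
    """Scale the crawl time of one top-level folder to all of them, in minutes."""
    return seconds_per_top_dir * fanout / 60


if __name__ == "__main__":
    digest = hashlib.md5(b"hello").hexdigest()
    print(shard_path(digest))
    print(layout_stats(100_000_000))
    print(extrapolate_crawl(14))
```

Running layout_stats with a fanout of 16 and three levels instead would be a quick way to compare the a/b/c/... idea, but any real decision should still come from timing a test setup.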