Mark Hampton
2012-Feb-01 13:25 UTC
[Ocfs2-users] A Billion Files on OCFS2 -- Best Practices?
We have an application with many processing threads writing more than a billion files ranging from 2KB to 50KB, with 50% under 8KB (currently there are 700 million files). The files are never deleted or modified: they are written once, and read infrequently. The files are hashed so that they are evenly distributed across ~1,000,000 subdirectories up to 3 levels deep, with up to 1000 files per directory. The directories are structured like this:

0/00/00
0/00/01
...
F/FF/FE
F/FF/FF

The files need to be readable and writable across a number of servers. The NetApp filer we purchased for this project has both NFS and iSCSI capabilities.

We first tried doing this via NFS. After writing 700 million files (12 TB) into a single NetApp volume, file-write performance became abysmally slow. We can't create more than 200 files per second on the NetApp volume, which is about 20% of our required performance target of 1000 files per second. It appears that most of the file-write time is going towards stat and inode-create operations.

So now I'm trying the same thing with OCFS2 over iSCSI. I created 16 LUNs on the NetApp, which became 16 OCFS2 filesystems with 16 different mount points on our servers.

With this configuration I was initially able to write ~1800 files per second. Now that I have completed 100 million files, performance has dropped to ~1500 files per second.

I'm using OEL 6.1 (2.6.32-100 kernel) with OCFS2 version 1.6. The application servers have 128GB of memory. I created my OCFS2 filesystems as follows:

mkfs.ocfs2 -T mail -b 4k -C 4k -L <my label> --fs-features=indexed-dirs --fs-feature-level=max-features /dev/mapper/<my device>

And I mount them with these options:

_netdev,commit=30,noatime,localflocks,localalloc=32

So my questions are these:

1) Given a billion files sized 2KB to 50KB, with 50% under 8KB, do I have the optimal OCFS2 filesystem and mount-point configurations?

2) Should I split the files across even more filesystems? Currently I have them split across 16 OCFS2 filesystems.

Thanks a billion!
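For reference, one way such a hashed layout can be computed (a sketch only: the post does not say which hash the application uses, so the MD5, the mount point, and the file name below are assumptions for illustration):

    # Sketch: map a file name to a 3-level path like F/FF/FE.
    # MD5 and the /mnt/ocfs2-00 mount point are assumed, not from the post.
    name="document-000123.bin"
    h=$(printf '%s' "$name" | md5sum | cut -c1-5 | tr 'a-f' 'A-F')
    # 1 hex digit / 2 digits / 2 digits: 16 * 256 * 256 = 1,048,576 leaf dirs
    dir="/mnt/ocfs2-00/${h:0:1}/${h:1:2}/${h:3:2}"
    mkdir -p "$dir" && cp "$name" "$dir/"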
One more thing. When I straced one of the application processes (these are the processes that create the files) I saw this:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 68.94    3.002017         111     27154           open
 18.93    0.929679           2    418108           read
 12.40    0.543714           2    257548           write

So it seems that inode creation is the biggest time consumer by far.
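A summary table like that comes from strace -c (add -f to follow the writer threads). For a crude create-rate check outside the application, something like the following works; the bench directory is hypothetical:

    # Rough files-per-second check; the mount point is an assumption.
    mkdir -p /mnt/ocfs2-00/bench
    time for i in $(seq -w 0 9999); do : > /mnt/ocfs2-00/bench/f$i; done
    # 10,000 creates; divide by the "real" time to get files per second.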
Sunil Mushran
2012-Feb-01 17:53 UTC
[Ocfs2-users] A Billion Files on OCFS2 -- Best Practices?
On 02/01/2012 07:02 AM, Mark wrote:
> One more thing. When I straced one of the application processes (these are
> the processes that create the files) I saw this:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  68.94    3.002017         111     27154           open
>  18.93    0.929679           2    418108           read
>  12.40    0.543714           2    257548           write
>
> So it seems that inode creation is the biggest time consumer by far.

Yes. open() triggers cluster lock creation, which cannot be skipped. Reads and writes could skip cluster activity if the node already has the appropriate lock level.
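A rough way to see that caching on a single node (the path below is hypothetical, and the second timing also benefits from the page cache, so this is only an illustration):

    f=/mnt/ocfs2-00/0/00/00/somefile   # hypothetical path
    time cat "$f" > /dev/null   # cold: cluster read lock must be acquired
    time cat "$f" > /dev/null   # warm: lock level (and pages) already held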
Sunil Mushran
2012-Feb-01 18:04 UTC
[Ocfs2-users] A Billion Files on OCFS2 -- Best Practices?
debugfs.ocfs2 -R "stats" /dev/mapper/...

I want to see the features enabled. The main issue with large metadata is the fsck timing. The recently tagged 1.8 release of the tools has much better fsck performance.

On 02/01/2012 05:25 AM, Mark Hampton wrote:
> [original message quoted in full above; snipped]
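To put the fsck point in concrete terms, a full forced check of one of these volumes would look roughly like this (run it only with the filesystem unmounted on every node; the device placeholder follows the mkfs line above):

    # At a billion files, each of the 16 filesystems holds roughly
    # 60 million inodes; a forced full check has to walk all of them.
    fsck.ocfs2 -f /dev/mapper/<my device>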