Infantino, Joe (Contractor)
2012-Jul-09 20:31 UTC
[Ocfs2-users] Deleting a large dataset or a large number of files
We are having an intermittent issue with our ocfs2 file system.

Oracle Linux Server release 5.6
2.6.32-200.13.1.el5uek
ocfs2 1.6.3

After deleting a large amount of data (~100G) we noticed the ocfs2 file system go "offline" for a few minutes at different times throughout the day. By "offline" I mean it is not accessible to the server, through samba, or through nfs mounts. After about 5 minutes it is back online. Some time later (the interval is different each day) it happens again. There is no pattern to the "offline" status and it is not predictable.

If we run fsck.ocfs2 -f on the file system it seems to clear the issue (at least, we have run for 1 full week without the issue returning). We noticed it again after deleting ~125G of data (1.5 million files). After running fsck.ocfs2 the system stabilized.

Has anyone else seen this or have experience with it?

Joe Infantino
MCP, MCSA, Security+, ITILv3
Senior Systems Analyst
Security and Server Operations
Intersil Corporation
email: jinfantino at intersil.com
office: 321-724-7119
fax: 321-729-1186
www.intersil.com
Peter Grandi
2012-Jul-09 22:13 UTC
[Ocfs2-users] Deleting a large dataset or a large number of files
[ ... ]

> After deleting a large amount of data (~100G) we noticed the ocfs2
> file system go "offline" for a few minutes at different times
> throughout the day. By "offline" I mean it is not accessible to the
> server, through samba, or through nfs mounts.

Seems rather unsurprising to me...

> After about 5 minutes it is back online. After a time, it is different
> each day, it will happen again. There is no pattern to the "offline"
> status and it is not predictable.

Seems rather unsurprising to me...

> If we run an fsck.ocfs2 -f on the file system it seems to clear the
> issue (at least we have run for 1 full week without the issue
> returning).

Even if you delete a large amount of data or metadata during that week? That would be a bit surprising, and seems to indicate that your ordinary SMB/NFS workload involves a lot of filesystem updates.

> We noticed it again after deleting ~125G of data (1.5Million files).
> After running fsck.ocfs2 the system stabilized.

Seems especially unsurprising to me...

Interesting that the description of the underlying hw is omitted, in particular how many servers share the filetree, and the latency and structure of the storage subsystem.

Perhaps asking some questions may help, for example:

* How many random IOPS can the underlying storage system deliver?
* How many metadata operations per second do you expect a heavily shared and interlocked filetree to deliver under load?
* Specifically, how many days do you expect the deletion of 1.5 million files to take (taking into account directory and free list update operations) on a heavily shared interlocked filetree?
* How "smooth" do you expect massive metadata updating workloads to be in terms of completion rates, given disk arm competition with file serving workloads, CPU and disk scheduling policies, local buffering, and the sizes of the queues before the interlocks?
* How strong is your imagination as to the ability of shared filesystems to have extremely low latency and very high throughput for doing mass metadata updates concurrently with ordinary file-serving workloads?
* Are you also surprised when even non-shared/interlocked filesystems seem to become "stuck" during heavy write-based workloads, especially when large cache flushes happen on HBAs with unwise elevator policies?

(extra points if you are running any of the OCFS2 systems on a VM! :->)

While some filesystem users would be content with 'O_PONIES' some even wish for 'O_UNICORNS'...

Note: there have been OCFS2 "performance bugs" in the past, and some real bugs with Samba, but the situation reported here does not need to be related to bugs.
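
The question above about how long deleting 1.5 million files should take can be roughed out with simple arithmetic. The sketch below is only illustrative: the number of metadata operations per deleted file and the sustained operation rates are assumptions chosen for the example, not measured OCFS2 figures, and the function names are hypothetical.

```python
def estimate_deletion_seconds(num_files, metadata_ops_per_file, ops_per_second):
    """Rough wall-clock estimate: total metadata operations / sustained rate."""
    if ops_per_second <= 0:
        raise ValueError("ops_per_second must be positive")
    return num_files * metadata_ops_per_file / ops_per_second


def format_duration(seconds):
    """Render a number of seconds as 'Hh MMm SSs'."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    num_files = 1_500_000  # file count from the original report
    # ASSUMPTION: each unlink costs about 4 metadata updates
    # (directory entry, inode, allocator/free list, journal).
    ops_per_file = 4
    # ASSUMPTION: sustained metadata ops/s the shared filetree can deliver.
    for rate in (100, 500, 2000):
        seconds = estimate_deletion_seconds(num_files, ops_per_file, rate)
        print(f"{rate:>5} ops/s -> {format_duration(seconds)}")
```

Substituting measured random IOPS for the storage subsystem and a realistic per-file operation count gives a first idea of whether the cleanup work is likely to spill into normal serving hours.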