thr3ads.net - Ocfs2 users - [Ocfs2-users] Deleting a large dataset or a large number of files [Jul 2012]

Infantino, Joe (Contractor)

2012-Jul-09 20:31 UTC

[Ocfs2-users] Deleting a large dataset or a large number of files

We are having an intermittent issue with our ocfs2 file system.

Oracle Linux Server release 5.6
2.6.32-200.13.1.el5uek
ocfs2 1.6.3

After deleting a large amount of data (~100G) we noticed the ocfs2 file
system go "offline" for a few minutes at different times throughout the
day.  By "offline" I mean it is not accessible to the server, through
samba, or through nfs mounts.  After about 5 minutes it is back online.
After a time, it is different each day, it will happen again.  There is
no pattern to the "offline" status and it is not predictable.
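
One way to check whether the stalls really have no pattern is to log them. The following is a minimal Python sketch, not something from the original report: it times a metadata call against a path on the file system and records any call slower than a threshold. The mount point `/mnt/ocfs2`, the 5-second interval, and the 1-second threshold are all hypothetical values, and `os.stat` may sometimes be answered from cache, so treat the numbers as approximate.

```python
import os
import time
from datetime import datetime, timezone


def probe_once(path):
    """Return the seconds taken by a single os.stat() call on path."""
    start = time.monotonic()
    os.stat(path)
    return time.monotonic() - start


def watch(path, interval=5.0, slow_threshold=1.0, iterations=None):
    """Probe path repeatedly and collect (UTC timestamp, latency) for slow calls.

    If the file system hangs, os.stat() blocks until it recovers, so the
    recorded latency approximates the length of the stall.
    iterations=None means run until interrupted.
    """
    slow = []
    count = 0
    while iterations is None or count < iterations:
        latency = probe_once(path)
        if latency >= slow_threshold:
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            print(f"{stamp} stat({path}) took {latency:.2f}s")
            slow.append((stamp, latency))
        count += 1
        time.sleep(interval)
    return slow


# Example (hypothetical mount point), stop with Ctrl-C:
#   watch("/mnt/ocfs2")
```

Comparing the logged timestamps with the times of large deletions and with the SMB/NFS load would show whether the "offline" periods line up with either.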

If we run an fsck.ocfs2 -f on the file system it seems to clear the
issue (at least we have run for 1 full week without the issue
returning).

We noticed it again after deleting ~125G of data (1.5 million files).
After running fsck.ocfs2 the system stabilized.

Has anyone else seen this or have experience with it?


Joe Infantino
MCP, MCSA, Security+, ITILv3
Senior Systems Analyst
Security and Server Operations
Intersil Corporation

email: jinfantino at intersil.com
office: 321-724-7119
fax: 321-729-1186

www.intersil.com



Peter Grandi

2012-Jul-09 22:13 UTC


[ ... ]
> After deleting a large amount of data (~100G) we noticed the ocfs2
> file system go "offline" for a few minutes at different times
> throughout the day. By "offline" I mean it is not accessible to the
> server, through samba, or through nfs mounts.
Seems rather unsurprising to me...
> After about 5 minutes it is back online. After a time, it is different
> each day, it will happen again. There is no pattern to the "offline"
> status and it is not predictable.
Seems rather unsurprising to me...
> If we run an fsck.ocfs2 -f on the file system it seems to clear the
> issue (at least we have run for 1 full week without the issue
> returning).
Even if you delete a large amount of data or metadata during that week?

That would be a bit surprising, and seems to indicate that your ordinary
SMB/NFS workload involves a lot of filesystem updates.
> We noticed it again after deleting ~125G of data (1.5 million files).
> After running fsck.ocfs2 the system stabilized.
Seems especially unsurprising to me...

Interesting that the description of the underlying hw is omitted, in
particular how many servers share the filetree, and the latency and
structure of the storage subsystem.

Perhaps asking some questions may help, for example:

* How many random IOPS can the underlying storage system deliver?

* How many metadata operations per second do you expect a heavily shared
  and interlocked filetree to deliver under load?

* Specifically, how many days do you expect the deletion of 1.5 million
  files to take (taking into account directory and free list update
  operations) on a heavily shared interlocked filetree?

* How "smooth" do you expect massive metadata updating workloads to be
  in terms of completion rates, given arm competition with file serving
  workloads and CPU and disk scheduling policies and local buffering and
  sizes of the queues before the interlocks?

* How strong is your imagination as to the ability of shared filesystems
  to have extremely low latency and very high throughput for doing mass
  metadata updates concurrently with ordinary file-serving workloads?

* Are you also surprised when even non-shared/interlocked filesystems
  seem to become "stuck" during heavy write-based workloads especially
  when large cache flushes happen on HBAs with unwise elevator policies?

(extra points if you are running any of the OCFS2 systems on a VM! :->)
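
To put rough numbers on the questions above, here is a back-of-the-envelope Python sketch that estimates how long deleting 1.5 million files might take at a given sustained metadata rate. The figure of three metadata operations per file (directory entry, inode, free list) and the sample rates are illustrative assumptions, not measurements of OCFS2 or of the system in this thread.

```python
def estimate_delete_seconds(n_files, ops_per_file, metadata_ops_per_sec):
    """Estimate wall-clock seconds needed to delete n_files.

    ops_per_file: assumed metadata updates per deletion.
    metadata_ops_per_sec: assumed sustained rate the shared filetree delivers.
    """
    if metadata_ops_per_sec <= 0:
        raise ValueError("metadata_ops_per_sec must be positive")
    return n_files * ops_per_file / metadata_ops_per_sec


def format_duration(seconds):
    """Render a duration in hours and days."""
    hours = seconds / 3600
    return f"{hours:.1f} h ({hours / 24:.2f} days)"


if __name__ == "__main__":
    n_files = 1_500_000  # file count from the original report
    for rate in (50, 200, 1000):  # assumed metadata ops/s, for illustration only
        secs = estimate_delete_seconds(n_files, ops_per_file=3,
                                       metadata_ops_per_sec=rate)
        print(f"{rate:>5} ops/s -> {format_duration(secs)}")
```

Even under these simplified assumptions, a low sustained metadata rate stretches the deletion over many hours, during which it competes with the ordinary file-serving workload.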

While some filesystem users would be content with 'O_PONIES' some even
wish for 'O_UNICORNS'...

Note: there have been OCFS2 "performance bugs" in the past, and some
real bugs with Samba, but the situation reported here does not need to
be related to bugs.
