Hi ocfs2 guys,
Actually, ocfs2 is not doomed - far from it! I just wanted to grab a few
more of those eyeballs that normally glaze over each time they see a
cluster mail run by on lkml[1]. Today's news: you are still not finished
with global lock systime badness, and I already have more performance bugs
for you.
For interested observers, the story so far:
A couple of days ago I reported a bug to the ocfs2 team via irc[2]. If I
untar a kernel tree several times, each into its own directory, ocfs2
system time (already four times higher than ext3 to start with) increases
linearly on each successive iteration, up to a very silly amount where it
finally stabilizes.
This is with ocfs2 running uniprocessor on a single node with a local disk.
The same performance bug also bites on a cluster of course.
In a dazzling display of developer studlyness, the ocfs2 team found the hash
table bottleneck while I was still trying to figure out how to make oprofile
behave.
http://sosdg.org/~coywolf/lxr/source/fs/ocfs2/dlm/dlmdomain.c?v=2.6.16-rc3#L102
So the ocfs2 guys switched from double word to single word hash table heads
and quadrupled the size of the hash table. Problem solved, right? No!
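For anybody not fluent in list.h, here is a minimal userspace sketch (mine,
not theirs, and PAGE_SIZE is assumed to be 4K) of why the head type matters:

/*
 * Minimal sketch, not from the ocfs2 patch: a doubly linked list_head
 * costs two pointers per bucket while an hlist_head costs one, so the
 * same page of memory holds twice as many hash buckets.
 */
#include <stdio.h>

struct list_head  { struct list_head *next, *prev; };	/* double word */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };	/* single word */

int main(void)
{
	unsigned long page = 4096;	/* assumed PAGE_SIZE */

	printf("list_head buckets per page:  %lu\n", page / sizeof(struct list_head));
	printf("hlist_head buckets per page: %lu\n", page / sizeof(struct hlist_head));
	return 0;
}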
I got the patch from Git (note to Gitweb devs: could "plain" please be
_really plain_ just like marc's "raw" instead of colorless html?).
While system time is indeed reduced, it still sucks chamber pot juice:
CPU: P4 / Xeon, speed 2793.41 MHz (estimated)
samples   %        image name       app name         symbol name
-------------------------------------------------------------------------------
4399778   64.0568  libbz2.so.1.0.2  libbz2.so.1.0.2  (no symbols)
  4399778  100.000  libbz2.so.1.0.2  libbz2.so.1.0.2  (no symbols) [self]
-------------------------------------------------------------------------------
1410173   20.5308  vmlinux          vmlinux          __dlm_lookup_lockres
  1410173  100.000  vmlinux          vmlinux          __dlm_lookup_lockres [self]
-------------------------------------------------------------------------------
82840      1.2061  oprofiled        oprofiled        (no symbols)
  82840    100.000  oprofiled        oprofiled        (no symbols) [self]
-------------------------------------------------------------------------------
68368      0.9954  vmlinux          vmlinux          ocfs2_local_alloc_count_bits
  68368    100.000  vmlinux          vmlinux          ocfs2_local_alloc_count_bits [self]
-------------------------------------------------------------------------------
47639      0.6936  tar              tar              (no symbols)
  47639    100.000  tar              tar              (no symbols) [self]
-------------------------------------------------------------------------------
  1          0.0488  vmlinux        vmlinux          ide_build_sglist
  6          0.2925  vmlinux        vmlinux          ide_do_request
  70         3.4130  vmlinux        vmlinux          __ide_dma_test_irq
  128        6.2409  vmlinux        vmlinux          ata_vlb_sync
  134        6.5334  vmlinux        vmlinux          proc_ide_read_imodel
  147        7.1672  vmlinux        vmlinux          ide_dma_verbose
  1565      76.3042  vmlinux        vmlinux          ide_do_rw_disk
20390      0.2969  vmlinux          vmlinux          ide_dma_speed
  20390    100.000  vmlinux          vmlinux          ide_dma_speed [self]
-------------------------------------------------------------------------------
Hash probe cleanup and hand optimization
----------------------------------------
I included two patches below. The first vmallocs a megabyte hash table
instead of a single page. This definitively squishes away the global
lock hash table overhead so that systime stays about twice as high as
ext3 instead of starting at three times and increasing to four times
with the (new improved) 4K hash table.
Re vmalloc... even with the extra TLB load, this is better than
kmallocing a smaller vector and running the very high risk that the
allocation will fail or be forced to fall back because of our lame
fragmentation handling. Of course the Right Thing To Do is kmalloc a
vector of pages and mask the hash into it, but as a lazy person I
would rather sink the time into writing this deathless prose.
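For the terminally curious, here is roughly what I have in mind, as a
sketch only -- the names (dlm_hash_pages, dlm_lockres_bucket,
dlm_alloc_hash) are made up and none of this is in the patches below:

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/list.h>

#define DLM_HASH_PAGES		(DLM_HASH_SIZE / PAGE_SIZE)
#define DLM_BUCKETS_PER_PAGE	(PAGE_SIZE / sizeof(struct hlist_head))

/* one separately allocated page of buckets per entry */
static struct hlist_head *dlm_hash_pages[DLM_HASH_PAGES];

static struct hlist_head *dlm_lockres_bucket(unsigned int hash)
{
	unsigned int bucket = hash % (DLM_HASH_PAGES * DLM_BUCKETS_PER_PAGE);

	/* high bits pick the page, low bits pick the bucket within it */
	return dlm_hash_pages[bucket / DLM_BUCKETS_PER_PAGE] +
	       bucket % DLM_BUCKETS_PER_PAGE;
}

static int dlm_alloc_hash(void)
{
	int i, j;

	/* single page allocations: no vmalloc, no high-order requests */
	for (i = 0; i < DLM_HASH_PAGES; i++) {
		dlm_hash_pages[i] = (struct hlist_head *)__get_free_page(GFP_KERNEL);
		if (!dlm_hash_pages[i])
			goto oom;
		for (j = 0; j < DLM_BUCKETS_PER_PAGE; j++)
			INIT_HLIST_HEAD(dlm_hash_pages[i] + j);
	}
	return 0;
oom:
	while (i--)
		free_page((unsigned long)dlm_hash_pages[i]);
	return -ENOMEM;
}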
The second patch is a rewrite of __dlm_lookup_lockres to get rid of the
bogus clearing of tmpres on every loop iteration. I also threw in a
little hand optimization, which gcc really ought to be doing but isn't.
Comparing the first character of the string before doing the full
memcmp seems to help measurably. I also tried to provide some more
optimization guidance for the compiler, though I do fear it will fall
on deaf ears. Finally, I cleaned up some stylistic oddities[3]. I
also left one screaming lindent violation intact. Executive summary:
could you please install some editors that show whitespace? And what
is it with this one line per function parameter? After a few edits
of the function name you end up with a mess anyway, so why not do as
the Romans do and stick to lindent style?
The __dlm_lookup_lockres rewrite shows a measurable improvement in
hash chain traversal overhead. Every little bit helps. I failed to
quantify the improvement precisely because there are other glitchy
things going on that interfere with accurate measurement. So now we
get to...
How fast is it now?
-------------------
Here is ocfs2 from yesterday's git:
                    Untar                        Sync
   Trial    Real    User    Sys         Real    User    Sys
     1.    31.42   24.90   3.81
     2.    31.97   25.15   5.03         (similar to below)
     3.    94.72   25.29   5.79
     4.    26.27    0.28   4.43
     5.    34.24   25.21   6.41
     6.    50.71   25.26   7.23
     7.    33.73   25.44   6.83
Here is what happens with the shiny new patches:
                    Untar                        Sync
   Trial    Real    User    Sys         Real    User    Sys
     1.    34.12   24.92   3.17        46.70    0.00   0.17
     2.    29.63   24.96   3.36        47.25    0.00   0.16
     3.   141.18   25.06   3.11        28.43    0.00   0.12
     4.    45.20   25.38   3.07        49.04    0.00   0.18
     5.   160.70   25.11   3.24        21.74    0.00   0.07
     6.    31.65   24.99   3.34        46.30    0.00   0.19
     7.   145.45   25.28   3.16        27.26    0.00   0.12
     8.    30.87   25.16   3.29        44.60    0.00   0.20
     9.    58.31   25.32   3.20        22.82    0.00   0.09
    10.    31.01   25.29   3.09        43.93    0.00   0.22
Things to notice: with the still-teensy hash table, systime for global
lock lookups is still way too high, eventually stabilizing at about four
times ext3's systime once lock purging kicks in. Not good (but indeed,
not as excruciatingly painful as two days ago).
With the attached patches we kill the hash table overhead dead, dead,
dead. But look at real time for the untar and sync!
A) We have huge swings in idle time during the untar
B) We have great whacks of dirty memory sitting around after the
untar
So that brings us to...
More performance bugs
---------------------
Ocfs2 starts little or no writeout during the untar itself. Ext3 does.
Solution: check ext3 to see what secret incantations Andrew conjured up
to fiddle it into doing something reasonable.
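If memory serves, the main incantation is not even a secret: ext3 simply
rides generic_file_write, which calls balance_dirty_pages_ratelimited() on
every page it dirties, so background writeout kicks in early. Something of
this flavor in the ocfs2 write path (the helper name below is made up, this
is a sketch and not a patch) ought to get the disk moving during the untar:

#include <linux/fs.h>
#include <linux/writeback.h>

/*
 * Sketch only: after dirtying each page, throttle the writer the same
 * way generic_file_write does, so dirty memory gets pushed out during
 * the untar instead of piling up for sync to deal with.
 */
static void ocfs2_throttle_dirty_sketch(struct address_space *mapping)
{
	balance_dirty_pages_ratelimited(mapping);
}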
Ocfs2 sometimes sits and gazes at its navel for minutes at a time, doing
nothing at all. Timer bug? A quick glance at SysRq-t shows ocfs2 waiting
in io_schedule. Waiting for io that never happens? This needs more
investigation.
You also might want to look at finer grained lock purging instead of
the current burn-the-inbox approach. One can closely approximate the
performance of an LRU without paying the penalty of a separate per-lock
LRU field by treating each hash chain as an LRU list. Here is the
equation: either the hash chains are short, in which case you don't care
about poor LRU behavior because you are not purging locks, or they are
long and you do care. Luckily, the longer the hash chains get, the more
closely they approximate the behavior of a dedicated LRU list. Amazing.
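In concrete terms, all that takes is move-to-front on each successful
lookup; the tail end of each chain then collects the coldest locks for the
purge pass to eat. A sketch (the helper is hypothetical and unbenchmarked):

#include <linux/list.h>

/*
 * Sketch: approximate LRU by keeping each hash chain in recency order.
 * Touch a resource on every successful lookup; the purge pass can then
 * prefer entries toward the tail of each chain.  No extra per-lock LRU
 * field required.
 */
static void dlm_lockres_touch(struct dlm_lock_resource *res,
			      struct hlist_head *bucket)
{
	if (bucket->first != &res->hash_node) {		/* not already hottest */
		hlist_del(&res->hash_node);
		hlist_add_head(&res->hash_node, bucket);
	}
}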
Delete performance remains horrible, even with a 256 meg journal[4] which
is unconscionably big anyway. Compare to ext3, which deletes kernel trees
at a steady 2 seconds per, with a much smaller journal. Ocfs2 takes more
like a minute.
I would chase down these glitches too, except that I really need to get
back to my cluster block devices, without which we cannot fully bake
our cluster cake. After all, there are six of you and only one of me, so I
had better go work on the thing nobody else is working on.
[1] People would be much more interested in ocfs2 if they could get it
running in less than a day, right? Well, you can, or at least you could
if Oracle's startup scripts were ever so slightly less user hostile. For
example, how about scripts/ocfs2 instead of vendor/common/o2cb.init? And
it would be nice to expunge the evil of using lsmod to see if a subsystem
is available. Also, an actual shared disk is not recommended for your
first run. That way you avoid the pain of iscsi, nbd, fibre channel
et al. (At least for a little while.) Once you get the init script
insanity under control, ocfs2 comes up just like a normal filesystem on
each node: mount -t ocfs2.
[2] It is high time we pried loose the ocfs2 design process from secret
irc channels and dark tunnels running deep beneath Oracle headquarters,
and started sharing the squishy goodness of filesystem clustering
development with some more of the smart people lurking here.
[3] Aha! I hear you cry, he wrote an 83 character line! Actually, I am
going to blame the person who invented the name DLM_HASH_BUCKETS. Can
you spot the superfluous redundancy? (Hint: name three other kinds of
kernel objects that have buckets.)
[4] Could you please have fsck.ocfs2 report the size of the journal?
Patches apply to current linux-git.
Regards,
Daniel
[PATCH] Vmalloc ocfs2 global lock hash table and make it way bigger

Signed-off-by: Daniel "the wayward" Phillips <phillips at google.com>
diff -urp 2.6.16.clean/fs/ocfs2/dlm/dlmcommon.h 2.6.16/fs/ocfs2/dlm/dlmcommon.h
--- 2.6.16.clean/fs/ocfs2/dlm/dlmcommon.h	2006-03-02 19:05:13.000000000 -0800
+++ 2.6.16/fs/ocfs2/dlm/dlmcommon.h	2006-03-03 09:52:54.000000000 -0800
@@ -36,8 +36,8 @@
 #define DLM_LOCK_RES_OWNER_UNKNOWN   O2NM_MAX_NODES
 #define DLM_THREAD_SHUFFLE_INTERVAL  5     // flush everything every 5 passes
 #define DLM_THREAD_MS                200   // flush at least every 200 ms
-
-#define DLM_HASH_BUCKETS     (PAGE_SIZE / sizeof(struct hlist_head))
+#define DLM_HASH_SIZE (1 << 20)
+#define DLM_HASH_BUCKETS (DLM_HASH_SIZE / sizeof(struct hlist_head))
 
 enum dlm_ast_type {
 	DLM_AST = 0,
diff -urp 2.6.16.clean/fs/ocfs2/dlm/dlmdomain.c 2.6.16/fs/ocfs2/dlm/dlmdomain.c
--- 2.6.16.clean/fs/ocfs2/dlm/dlmdomain.c	2006-03-02 16:47:47.000000000 -0800
+++ 2.6.16/fs/ocfs2/dlm/dlmdomain.c	2006-03-03 09:53:21.000000000 -0800
@@ -194,7 +191,7 @@ static int dlm_wait_on_domain_helper(con
 static void dlm_free_ctxt_mem(struct dlm_ctxt *dlm)
 {
 	if (dlm->lockres_hash)
-		free_page((unsigned long) dlm->lockres_hash);
+		vfree(dlm->lockres_hash);
 
 	if (dlm->name)
 		kfree(dlm->name);
@@ -1191,7 +1188,7 @@ static struct dlm_ctxt *dlm_alloc_ctxt(c
 		goto leave;
 	}
 
-	dlm->lockres_hash = (struct hlist_head *) __get_free_page(GFP_KERNEL);
+	dlm->lockres_hash = (struct hlist_head *)vmalloc(DLM_HASH_SIZE);
 	if (!dlm->lockres_hash) {
 		mlog_errno(-ENOMEM);
 		kfree(dlm->name);
@@ -1200,7 +1197,7 @@ static struct dlm_ctxt *dlm_alloc_ctxt(c
 		goto leave;
 	}
 
-	for (i=0; i<DLM_HASH_BUCKETS; i++)
+	for (i = 0; i < DLM_HASH_BUCKETS; i++)
 		INIT_HLIST_HEAD(&dlm->lockres_hash[i]);
 
 	strcpy(dlm->name, domain);
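For scale, a back-of-envelope worked out in comment form (my arithmetic,
not part of the patch; 8-byte pointers and a 4K page assumed):

/*
 * Assuming sizeof(struct hlist_head) == 8 and PAGE_SIZE == 4096:
 *
 *   old table:  PAGE_SIZE / 8  =    512 buckets (one page)
 *   new table:  (1 << 20) / 8  = 131072 buckets (one megabyte)
 *
 * That is 256 times as many buckets, so chains that used to hold
 * hundreds of lock resources now hold at most a handful.
 */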
[PATCH] Clean up ocfs2 hash probe and make it faster

Signed-off-by: Daniel "He's Baaaaack" Phillips <phillips at google.com>
diff -urp 2.6.16.clean/fs/ocfs2/dlm/dlmdomain.c 2.6.16/fs/ocfs2/dlm/dlmdomain.c
--- 2.6.16.clean/fs/ocfs2/dlm/dlmdomain.c	2006-03-02 16:47:47.000000000 -0800
+++ 2.6.16/fs/ocfs2/dlm/dlmdomain.c	2006-03-03 09:53:21.000000000 -0800
@@ -103,31 +103,28 @@ struct dlm_lock_resource * __dlm_lookup_
 					  const char *name,
 					  unsigned int len)
 {
-	unsigned int hash;
-	struct hlist_node *iter;
-	struct dlm_lock_resource *tmpres=NULL;
 	struct hlist_head *bucket;
+	struct hlist_node *list;
 
 	mlog_entry("%.*s\n", len, name);
 
 	assert_spin_locked(&dlm->spinlock);
+	bucket = dlm->lockres_hash + full_name_hash(name, len) % DLM_HASH_BUCKETS;
 
-	hash = full_name_hash(name, len);
+	hlist_for_each(list, bucket) {
+		struct dlm_lock_resource *res = hlist_entry(list,
+			struct dlm_lock_resource, hash_node);
 
-	bucket = &(dlm->lockres_hash[hash % DLM_HASH_BUCKETS]);
-
-	/* check for pre-existing lock */
-	hlist_for_each(iter, bucket) {
-		tmpres = hlist_entry(iter, struct dlm_lock_resource, hash_node);
-		if (tmpres->lockname.len == len &&
-		    memcmp(tmpres->lockname.name, name, len) == 0) {
-			dlm_lockres_get(tmpres);
-			break;
-		}
-
-		tmpres = NULL;
+		if (likely(res->lockname.name[0] != name[0]))
+			continue;
+		if (likely(res->lockname.len != len))
+			continue;
+		if (likely(memcmp(res->lockname.name + 1, name + 1, len - 1)))
+			continue;
+		dlm_lockres_get(res);
+		return res;
 	}
-	return tmpres;
+	return NULL;
 }
 
 struct dlm_lock_resource * dlm_lookup_lockres(struct dlm_ctxt *dlm,
Hi Daniel,

On Fri, Mar 03, 2006 at 02:27:52PM -0800, Daniel Phillips wrote:
> Hi ocfs2 guys,
>
> Actually, ocfs2 is not doomed - far from it! I just wanted to grab a few
> more of those eyeballs that normally glaze over each time they see a
> cluster mail run by on lkml[1]. Today's news: you are still not finished
> with global lock systime badness, and I already have more performance bugs
> for you.

Thanks for continuing to profile this stuff. We've definitely got lots of
profiling to do, so any help is welcome.

> So the ocfs2 guys switched from double word to single word hash table heads
> and quadrupled the size of the hash table. Problem solved, right? No!

Heh, well if you remember from irc I told you that there was more to do -
specifically that I was looking at increasing the memory dedicated to the
hash.

> I included two patches below. The first vmallocs a megabyte hash table
> instead of a single page. This definitively squishes away the global
> lock hash table overhead so that systime stays about twice as high as
> ext3 instead of starting at three times and increasing to four times
> with the (new improved) 4K hash table.

I completely agree that we need to increase the memory dedicated to the
hash, but I fear that vmallocing a 1 megabyte table per domain (effectively
per mount) is going overboard. I will assume that this was a straw man
patch :)

> Of course the Right Thing To Do is kmalloc a vector of pages and mask the
> hash into it, but as a lazy person I would rather sink the time into
> writing this deathless prose.

Right. That's one possible approach - I was originally thinking of
allocating a large global lockres hash table at module init time. This way
we don't have to make large allocs for each domain. Before anything else
though, I'd really like to get an idea of how large we want things. This
might very well dictate the severity of the solution.

> The __dlm_lookup_lockres rewrite shows a measurable improvement in
> hash chain traversal overhead. Every little bit helps.

Yes, this patch seems much easier to swallow ;) Thanks for doing that work.
I have some comments about it below.

> I failed to quantify the improvement precisely because there are other
> glitchy things going on that interfere with accurate measurement. So now
> we get to...

Definitely if you get a chance to show how much just the lookup
optimization helps, I'd like to know. I'll also try to gather some numbers.

> Things to notice: with the still-teensy hash table, systime for global
> lock lookups is still way too high, eventually stabilizing when lock
> purging kicks in at about 4 times ext3's systime. Not good (but indeed,
> not as excruciatingly painful as two days ago).
>
> With the attached patches we kill the hash table overhead dead, dead,
> dead. But look at real time for the untar and sync!

Indeed. The real time numbers are certainly confusing. I actually saw real
time decrease on most of my tests (usually I hack things to increase the
hash allocation to something like a few pages). I want to do some more
consecutive untars though.

Btw, for what it's worth, here's some single untar numbers I have lying
around. Sync numbers are basically the same.

Before the hlist patch:
real    0m21.279s
user    0m3.940s
sys     0m18.884s

With the hlist patch:
real    0m15.717s
user    0m3.924s
sys     0m13.324s

> Ocfs2 starts little or no writeout during the untar itself. Ext3 does.
> Solution: check ext3 to see what secret incantations Andrew conjured up
> to fiddle it into doing something reasonable.

Good idea. As you probably already know, we've never been above checking
to see how ext3 handles a given problem :)

> Ocfs2 sometimes sits and gazes at its navel for minutes at a time, doing
> nothing at all. Timer bug? A quick glance at SysRq-t shows ocfs2 waiting
> in io_schedule. Waiting for io that never happens? This needs more
> investigation.

Did you notice this during the untar? If not, do you have any reproducible
test case?

> Delete performance remains horrible, even with a 256 meg journal[4] which
> is unconscionably big anyway. Compare to ext3, which deletes kernel trees
> at a steady 2 seconds per, with a much smaller journal. Ocfs2 takes more
> like a minute.

This doesn't surprise me - our unlink performance leaves much to be desired
at the moment. How many nodes did you have mounted when you ran that test?

Off the top of my head, the two things which I would guess are hurting
delete the most right now are node messaging and lack of directory read
ahead. The first will be solved by more intelligent use of the DLM so that
the network will only be hit for those nodes that actually care about a
given inode -- unlink, rename and mount/unmount are the only things left
that still use the slower 'vote' mechanism. Directory readahead is much
easier to solve, it's just that nobody has gotten around to fixing that
yet :/ I bet there's more to figure out with respect to unlink performance.

> I would chase down these glitches too, except that I really need to get
> back to my cluster block devices, without which we cannot fully bake
> our cluster cake. After all, there are six of you and only one of me, I
> had better go work on the thing nobody else is working on.

Hey, we appreciate the help so far - thanks.

> [2] It is high time we pried loose the ocfs2 design process from secret
> irc channels and dark tunnels running deep beneath Oracle headquarters,
> and started sharing the squishy goodness of filesystem clustering
> development with some more of the smart people lurking here.

We're not trying to hide anything from anyone :) I'm always happy to talk
about design. We've been in bugfix (and more recently, performance fix)
mode for a while now, so there hasn't been much new design yet.

> -	bucket = &(dlm->lockres_hash[hash % DLM_HASH_BUCKETS]);
> -
> -	/* check for pre-existing lock */
> -	hlist_for_each(iter, bucket) {
> -		tmpres = hlist_entry(iter, struct dlm_lock_resource, hash_node);
> -		if (tmpres->lockname.len == len &&
> -		    memcmp(tmpres->lockname.name, name, len) == 0) {
> -			dlm_lockres_get(tmpres);
> -			break;
> -		}
> -
> -		tmpres = NULL;
> +		if (likely(res->lockname.name[0] != name[0]))

Is the likely() here necessary? It actually surprises me that the check
even helps so much - if you check the way OCFS2 lock names are built, the
first few bytes are very regular - the first char is going to almost always
be one of M, D, or W. Hmm, I guess if you're hitting it 1/3 fewer times...

> +			continue;
> +		if (likely(res->lockname.len != len))

Also here, lengths of OCFS2 locks are also actually fairly regular, so
perhaps the likely() isn't right?
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com
Daniel Phillips <phillips at google.com> wrote:
>
> @@ -103,31 +103,28 @@ struct dlm_lock_resource * __dlm_lookup_
>  					  const char *name,
>  					  unsigned int len)
>  {
> -	unsigned int hash;
> -	struct hlist_node *iter;
> -	struct dlm_lock_resource *tmpres=NULL;
>  	struct hlist_head *bucket;
> +	struct hlist_node *list;
>
>  	mlog_entry("%.*s\n", len, name);
>
>  	assert_spin_locked(&dlm->spinlock);
> +	bucket = dlm->lockres_hash + full_name_hash(name, len) % DLM_HASH_BUCKETS;
>
> -	hash = full_name_hash(name, len);

err, you might want to calculate that hash outside the spinlock. Maybe have
a lock per bucket, too.

A 1MB hashtable is verging on comical. How much data is there in total?
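For illustration, the two suggestions above might combine into something
like the following sketch -- dlm_hash_bucket, dlm_buckets and
dlm_lookup_sketch are made-up names, not ocfs2 code, and the per-bucket
locks would of course need spin_lock_init() at setup time:

#include <linux/dcache.h>	/* full_name_hash() */
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/string.h>

/* Hypothetical: one spinlock per bucket instead of dlm->spinlock. */
struct dlm_hash_bucket {
	spinlock_t lock;
	struct hlist_head chain;
};

static struct dlm_hash_bucket dlm_buckets[DLM_HASH_BUCKETS];

static struct dlm_lock_resource *dlm_lookup_sketch(const char *name,
						   unsigned int len)
{
	/* hash computed with no lock held */
	unsigned int n = full_name_hash(name, len) % DLM_HASH_BUCKETS;
	struct dlm_hash_bucket *b = dlm_buckets + n;
	struct dlm_lock_resource *res = NULL;
	struct hlist_node *p;

	spin_lock(&b->lock);		/* contention confined to one bucket */
	hlist_for_each(p, &b->chain) {
		struct dlm_lock_resource *tmp =
			hlist_entry(p, struct dlm_lock_resource, hash_node);

		if (tmp->lockname.len == len &&
		    !memcmp(tmp->lockname.name, name, len)) {
			dlm_lockres_get(tmp);
			res = tmp;
			break;
		}
	}
	spin_unlock(&b->lock);
	return res;
}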
Please don't drop the ocfs2-devel Cc: I'm bouncing this there.

On Fri, Mar 10, 2006 at 02:14:08AM +0100, Bernd Eckenfels wrote:
> Mark Fasheh <mark.fasheh at oracle.com> wrote:
> > Your hash sizes are still ridiculously large.
>
> How long are those entries in the buckets kept? I mean if I untar a tree
> the files are only locked while extracted, afterwards they are owner-less...
> (I must admit I don't understand ocfs2 very deeply, but maybe explaining
> why so many active locks need to be cached might help to find an optimized
> way.)
>
> > By the way, an interesting thing happened when I recently switched disk
> > arrays - the fluctuations in untar times disappeared. The new array is
> > much nicer, while the old one was basically Just A Bunch Of Disks. Also,
> > sync times dropped dramatically.
>
> Writeback Cache?
>
> Gruss
> Bernd

--
"Depend on the rabbit's foot if you will, but remember, it didn't help
 the rabbit."
	- R. E. Shay

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127
On Fri, Mar 10, 2006 at 02:14:08AM +0100, Bernd Eckenfels wrote:
> Mark Fasheh <mark.fasheh at oracle.com> wrote:
> > Your hash sizes are still ridiculously large.
>
> How long are those entries in the buckets kept?

Shortly after the inode is destroyed. Basically the entries are "lock
resources" which the DLM tracks. OCFS2 only ever gets lock objects attached
to those resources. In OCFS2 the lock objects obey inode lifetimes, so the
DLM can't purge the lock resources until OCFS2 destroys the inode, etc. If
other nodes take locks on that resource and your local node is the "master"
(as in, arbitrates access to the resource) it may stay around longer.

> I mean if I untar a tree the files are only locked while extracted,
> afterwards they are owner-less... (I must admit I don't understand ocfs2
> very deeply, but maybe explaining why so many active locks need to be
> cached might help to find an optimized way.)

Well, OCFS2 caches locks. That is, once you've gone to the DLM to acquire a
lock at a given level, OCFS2 will just hold onto it and manage access to it
until the lock needs to be upgraded or downgraded. This provides a very
large performance increase over always asking the DLM for a new lock.
Anyway, at the point that we've acquired the lock the first time, the file
system isn't really forcing the dlm to hit the hash much.

> > By the way, an interesting thing happened when I recently switched disk
> > arrays - the fluctuations in untar times disappeared. The new array is
> > much nicer, while the old one was basically Just A Bunch Of Disks. Also,
> > sync times dropped dramatically.
>
> Writeback Cache?

Yep, and a whole slew of other nice features :)
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com
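To make the caching point above concrete, the pattern is roughly the
following sketch, with invented names -- cluster_lock_sketch, cached_lock
and dlm_upconvert are not the real ocfs2/dlm API:

enum lock_level { LOCK_NONE, LOCK_READ, LOCK_EX };	/* none < read < exclusive */

struct cached_lock {
	enum lock_level	held;		/* strongest level already granted by the DLM */
	int		holders;	/* local users piggybacking on that grant */
};

/* hypothetical DLM call to raise the granted level */
int dlm_upconvert(struct cached_lock *l, enum lock_level want);

static int cluster_lock_sketch(struct cached_lock *l, enum lock_level want)
{
	if (l->held >= want) {		/* cache hit: no DLM round trip, no hash probe */
		l->holders++;
		return 0;
	}
	/* cache miss: go to the DLM only when the cached grant is too weak */
	return dlm_upconvert(l, want);
}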
On Fri, Mar 10, 2006 at 05:09:13PM -0800, Mark Fasheh wrote:
> until the lock needs to be upgraded or downgraded. This provides a very
> large performance increase over always asking the DLM for a new lock.

Yes, it is basically the same problem as the buffer cache. Excessive
single-use patterns dirty the small cache or require a too big cache to be
useful. Maybe a user specific limit of percentage of hash (and locks) used?

I mean the untar test case is a bit synthetic, but think of concurrent read
access in a cluster of nntp servers (news article pool).

Gruss
Bernd