Erblichs
2006-Oct-04 02:45 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
Group,

In at least one location, when adding a new dva node into the tree, a kmem_alloc is done with a KM_SLEEP argument. Thus, this process thread could block waiting for memory.

I would suggest adding a pre-allocated pool of dva nodes. When a new dva node is needed, first check this pre-allocated pool and allocate from there.

Why? This would eliminate a possible sleep condition if memory is not immediately available. The pool would add a working set of dva nodes that could be monitored, and per-alloc latencies could be amortized over a chunk allocation. Lastly, if memory is scarce, a long time may pass before the node can be allocated and added to the tree. If the number of nodes is monitored, restricted operations could be performed until the intent log decreases in size.

I can supply untested code within 24 hours if wanted.

Mitchell Erblich
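For context, the pattern being described is roughly the following; the type and variable names here are hypothetical, not the actual ZIL source:

    dva_node_t *dn;     /* hypothetical node type */

    /* KM_SLEEP: the caller may block here until memory becomes available */
    dn = kmem_alloc(sizeof (dva_node_t), KM_SLEEP);
    /* ... fill in dn and insert it into the tree ... */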
Casper.Dik at Sun.COM
2006-Oct-04 06:57 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
> In at least one location, when adding a new dva node into the tree,
> a kmem_alloc is done with a KM_SLEEP argument.
>
> Thus, this process thread could block waiting for memory.
>
> I would suggest adding a pre-allocated pool of dva nodes.

This is how the Solaris memory allocator works. It keeps pools of "pre-allocated" nodes about until memory conditions are low.

> When a new dva node is needed, first check this pre-allocated
> pool and allocate from there.

There are two reasons why this is a really bad idea:

 - the system will run out of memory even sooner if people
   start building their own free-lists

 - a single freelist does not scale; at two CPUs it becomes
   the allocation bottleneck (I've measured and removed two
   such bottlenecks from Solaris 9)

You might want to learn about how the Solaris memory allocator works; it pretty much works like you want, except that it is all part of the framework. And, just as in your case, it does run out some times but a private freelist does not help against that.

> Why? This would eliminate a possible sleep condition if memory
> is not immediately available. The pool would add a working
> set of dva nodes that could be monitored. Per-alloc latencies
> could be amortized over a chunk allocation.

That's how the Solaris memory allocator already works.

Casper
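Concretely, "it is all part of the framework" means the consumer just allocates and frees through a kmem cache, and the allocator maintains the cached pool (including the per-CPU magazines) behind the scenes. A sketch, using a hypothetical cache for the dva nodes discussed above:

    /* dva_node_cache is hypothetical; it would be created once with
     * kmem_cache_create() at initialization. */
    dva_node_t *dn;

    dn = kmem_cache_alloc(dva_node_cache, KM_SLEEP);
    /* ... use dn ... */
    kmem_cache_free(dva_node_cache, dn);    /* returns to the cached pool */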
Erblichs
2006-Oct-04 08:19 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
Casper Dik,

Yes, I am familiar with Bonwick's slab allocators and tried it for a wirespeed test of 64-byte pieces for a 1Gb and then 100Mb Ethernet and lastly 10Mb Ethernet. My results were not encouraging. I assume it has improved over time.

First, let me ask what happens to the FS if the allocs in the intent log code are sleeping waiting for memory?

IMO, the general problem with memory allocators is:

 - the cost of getting memory from a "cache" of one's own size/type
   is orders of magnitude higher than just getting some off one's
   own freelist,

 - there is a built-in latency to recuperate/steal memory from
   other processes,

 - this stealing forces a sleep and context switches,

 - the amount of time to sleep is indeterminate with a single call
   per struct. How long can you sleep for? 100ms or 250ms or more...

 - no process can guarantee a working set.

In the time when memory was expensive, maybe a global sharing mechanism would make sense, but when the amount of memory is somewhat plentiful and cheap,

*** it then makes sense for a 2-stage implementation: preallocation of a working set, and then normal allocation with the added latency.

So, it makes sense to pre-allocate a working set of allocs by a single alloc call, break up the alloc into the needed sizes, and then alloc from your own free list.

-----> If that freelist then empties, maybe then take the extra overhead of the kmem call. Consider this an expected cost of exceeding a certain watermark.

But otherwise, I bet if I give you some code for the pre-alloc, 10 allocs from the freelist can be done versus the kmem_alloc call, and at least 100 to 10k allocs if a sleep occurs on your side. Actually, I think it is so bad that why don't you time 1 kmem_free versus grabbing elements off the freelist.

However, don't trust me; I will drop a snapshot of the code to you tomorrow if you want and you can make a single-CPU benchmark comparison.

Your multiple-CPU issue forces me to ask: is it a common occurrence that 2 or more CPUs are simultaneously requesting memory for the intent log? If it is, then there should be a freelist of a low-watermark set of elements per CPU. However, one thing at a time...

So, do you want that code? It will be a single alloc of X units which are then placed on a freelist. You then time how long it takes to remove Y elements from the freelist versus 1 kmem_alloc with a NO_SLEEP arg and report the numbers. Then I would suggest the call with the smallest sleep possible. How many allocs can then be done? 25k, 35k, more...

Oh, the reason why we aren't timing the initial kmem_alloc call for the freelist is because I expect that to occur during init, with no further progress until the memory is alloc'ed.

Mitchell Erblich
------------------------

Casper.Dik at Sun.COM wrote:
>
> > In at least one location, when adding a new dva node into the tree,
> > a kmem_alloc is done with a KM_SLEEP argument.
> >
> > Thus, this process thread could block waiting for memory.
> >
> > I would suggest adding a pre-allocated pool of dva nodes.
>
> This is how the Solaris memory allocator works. It keeps pools of
> "pre-allocated" nodes about until memory conditions are low.
>
> > When a new dva node is needed, first check this pre-allocated
> > pool and allocate from there.
>
> There are two reasons why this is a really bad idea:
>
> - the system will run out of memory even sooner if people
>   start building their own free-lists
>
> - a single freelist does not scale; at two CPUs it becomes
>   the allocation bottleneck (I've measured and removed two
>   such bottlenecks from Solaris 9)
>
> You might want to learn about how the Solaris memory allocator works;
> it pretty much works like you want, except that it is all part of the
> framework. And, just as in your case, it does run out some times but
> a private freelist does not help against that.
>
> > Why? This would eliminate a possible sleep condition if memory
> > is not immediately available. The pool would add a working
> > set of dva nodes that could be monitored. Per-alloc latencies
> > could be amortized over a chunk allocation.
>
> That's how the Solaris memory allocator already works.
>
> Casper
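A minimal sketch of the two-stage scheme described above: serve from a private freelist, and fall back to a non-sleeping kernel allocation when it empties. The names are hypothetical and, as written, this has exactly the single unlocked freelist that the rest of the thread argues against:

    typedef struct zil_node {
        struct zil_node *zn_next;
        uint64_t        zn_data;
    } zil_node_t;

    static zil_node_t *zn_freelist;     /* pre-allocated in chunks at init */

    static zil_node_t *
    zn_get(void)
    {
        zil_node_t *zn;

        if ((zn = zn_freelist) != NULL) {   /* stage 1: private freelist */
            zn_freelist = zn->zn_next;
            zn->zn_next = NULL;
            return (zn);
        }
        /* stage 2: freelist empty; pay for a real allocation without
         * sleeping (may return NULL, which the caller must handle) */
        return (kmem_alloc(sizeof (zil_node_t), KM_NOSLEEP));
    }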
Casper.Dik at Sun.COM
2006-Oct-04 10:14 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
> Casper Dik,
>
> Yes, I am familiar with Bonwick's slab allocators and tried
> it for a wirespeed test of 64-byte pieces for a 1Gb and then
> 100Mb Ethernet and lastly 10Mb Ethernet. My results were not
> encouraging. I assume it has improved over time.

Nothing which tries to send 64-byte pieces over 1Gb ethernet or 100Mb ethernet will give encouraging results.

> First, let me ask what happens to the FS if the allocs
> in the intent log code are sleeping waiting for memory?

How are you going to guarantee that there is *always* memory available?

I think that's barking up the wrong tree. I think that a proper solution is not trying to find a way which prevents memory from running out but rather a way of dealing with the case of it running out.

If KM_SLEEP is used in a path where it is causing problems, then no amount of freelists is going to solve that. There needs to be a solution which does not sleep.

> - the cost of getting memory from a "cache" of one's own size/type
>   is orders of magnitude higher than just getting some off one's
>   own freelist,

Actually, that's not true; Bonwick's allocator is *FASTER* by a *wide* margin than your own freelist.

Believe me, I've measured this, I've seen "my own freelist" collapse on the floor when confronted with as few as two CPUs.

As a minimum, you will need *per CPU* free lists.

And that's precisely what the kernel memory allocator gives you.

> In the time when memory was expensive, maybe a global sharing
> mechanism would make sense, but when the amount of memory is
> somewhat plentiful and cheap,

Not if all bits of the system are going to keep their own freelists * #CPUs.

Then you are suddenly faced with a *MUCH* higher memory demand. The Bonwick allocator does keep quite a bit cached and keeps more memory unavailable already.

> *** it then makes sense for a 2-stage implementation:
> preallocation of a working set, and then normal allocation
> with the added latency.

But the normal Bonwick allocation *is* two-stage; you are proposing to add a 3rd stage.

> So, it makes sense to pre-allocate a working set of allocs by a
> single alloc call, break up the alloc into the needed sizes, and
> then alloc from your own free list.

That's what the Bonwick allocator does; so why are you duplicating this?

Apart from the questionable performance gain (I believe there to be none), the loss of the kernel memory allocator debugging functionality is severe:

 - you can no longer track where the individual blocks are allocated
 - you can no longer track buffer overruns
 - buffers run into one another, so one overrun buffer corrupts another
   without trace

> -----> If that freelist then empties, maybe then take the extra
> overhead of the kmem call. Consider this an expected cost of
> exceeding a certain watermark.

This is exactly how the magazine layer works.

> But otherwise, I bet if I give you some code for the pre-alloc,
> 10 allocs from the freelist can be done versus the kmem_alloc call,
> and at least 100 to 10k allocs if a sleep occurs on your side.

I hope you're not designing this with a single lock per queue.

I have eradicated code in Solaris 9 which looked like this:

    struct au_buff *
    au_get_buff(void)
    {
        au_buff_t *buffer = NULL;

        mutex_enter(&au_free_queue_lock);

        if (au_free_queue == NULL) {
            if (au_get_chunk(1)) {
                mutex_exit(&au_free_queue_lock);
                return (NULL);
            }
        }

        buffer = au_free_queue;
        au_free_queue = au_free_queue->next_buf;
        mutex_exit(&au_free_queue_lock);
        buffer->next_buf = NULL;
        return (buffer);
    }

(with a corresponding free routine which never returned memory to the system but kept it in the freelist)

This was replaced with essentially:

    buffer = kmem_cache_alloc(au_buf_cache, KM_SLEEP);

The first bit of code stopped scaling at 1 CPU (the performance with two CPUs was slightly worse than with one CPU).

The second bit of code was both FASTER in the single-CPU case and scaled to the twelve CPUs I had for testing.

> Actually, I think it is so bad that why don't you time 1 kmem_free
> versus grabbing elements off the freelist.

I did, it's horrendous.

Don't forget that in the typical case, when the magazine layer is properly sized after the system has been running for a while, no locks need to be grabbed to get memory as the magazines are per-CPU.

But with your single freelist, you must grab a lock. Somewhere in the grab/release lock cycle there's at least one atomic operation and memory barrier.

Those are perhaps cheap on single-CPU systems but run in the hundreds of cycles on larger systems.

Here's something else from Sun's "collective" memory which I think illustrates why we think private freelists are bad, /on principle/.

When Jeff Bonwick redid the memory allocator he did this because he noticed that the performance was bad, so bad even that people generally avoided the allocator if they could (hence the use of private freelists in particular parts of the system such as the above example from auditing). This is particularly poor software engineering for two important reasons: if a core bit of software does not work or perform properly, it needs to be rewritten and not worked around; and if a core bit of software is worked around, any future improvements will go unnoticed.

The auditing example and the resulting poor performance of the auditing code amply demonstrate both points; there's even an additional similarity between it and the intent log: both allocate buffers to be written to disk.

The object lesson of the Bonwick memory allocator, and of the later addition of the magazine layer, is this: if this tool is not up to the job, it must be improved, not worked around.

So *if* the memory allocator is not able to deliver the QoS required for a particular part of the system, it stands to reason that it should be improved with QoS features to serve the needs of those consumers.

(But I'm not sure that that is even the case; but then, I'm not sufficiently familiar with how ZIL or ZFS work and what exact issues there are.)

Sorry for the lecture; my experience with single freelists of buffers to be written to disk is just something I thought I needed to share in this discussion.

Casper
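For completeness, the one-line replacement presumes the cache was created once at initialization; that would have looked roughly like this (a sketch of the standard kmem_cache interface, not the actual audit code):

    static kmem_cache_t *au_buf_cache;

    /* once, at audit subsystem initialization */
    au_buf_cache = kmem_cache_create("au_buf_cache", sizeof (au_buff_t),
        0, NULL, NULL, NULL, NULL, NULL, 0);

    /* ... and the matching free path: */
    kmem_cache_free(au_buf_cache, buffer);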
Frank Hofmann
2006-Oct-04 10:15 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
On Wed, 4 Oct 2006, Erblichs wrote:

> Casper Dik,
>
> Yes, I am familiar with Bonwick's slab allocators and tried
> it for a wirespeed test of 64-byte pieces for a 1Gb and then
> 100Mb Ethernet and lastly 10Mb Ethernet. My results were not
> encouraging. I assume it has improved over time.
>
> First, let me ask what happens to the FS if the allocs
> in the intent log code are sleeping waiting for memory?

The same as would happen to the FS with your proposed additional allocator layer, if that "freelist" of yours runs out - it'll wait, you'll see a latency bubble.

You seem to think it's likely that a kmem_alloc(..., KM_SLEEP) will sleep. It's not. Anything but. See below.

> IMO, the general problem with memory allocators is:
>
> - the cost of getting memory from a "cache" of one's own size/type
>   is orders of magnitude higher than just getting some off one's
>   own freelist,

This is why the kernel memory allocator in Solaris has two such freelists:

 - the per-CPU kmem magazines (you say below 'one step at a time', but
   that step is already done in Solaris kmem)

 - the slab cache

> - there is a built-in latency to recuperate/steal memory from
>   other processes,

Stealing ("reclaim" in Solaris kmem terms) happens if the following three conditions are true:

 - nothing in the per-CPU magazines
 - nothing in the slab cache
 - nothing in the quantum caches - on the attempt to grow the quantum
   cache, the request to the vmem backend finds no readily-available
   heap to satisfy the growth demand immediately

> - this stealing forces a sleep and context switches,
>
> - the amount of time to sleep is indeterminate with a single call
>   per struct. How long can you sleep for? 100ms or 250ms or more...
>
> - no process can guarantee a working set.

Yes and no. If your working set is small, use the stack.

> In the time when memory was expensive, maybe a global sharing
> mechanism would make sense, but when the amount of memory is
> somewhat plentiful and cheap,
>
> *** it then makes sense for a 2-stage implementation:
> preallocation of a working set, and then normal allocation
> with the added latency.
>
> So, it makes sense to pre-allocate a working set of allocs by a
> single alloc call, break up the alloc into the needed sizes, and
> then alloc from your own free list.

See above - all of that _IS_ already done in Solaris kmem/vmem, with more parallelism and more intermediate caching layers designed to bring down allocation latency than your simple freelist approach would achieve.

> -----> If that freelist then empties, maybe then take the extra
> overhead of the kmem call. Consider this an expected cost of
> exceeding a certain watermark.
>
> But otherwise, I bet if I give you some code for the pre-alloc,
> 10 allocs from the freelist can be done versus the kmem_alloc call,
> and at least 100 to 10k allocs if a sleep occurs on your side.

The same statistics can be made for Solaris kmem - you satisfy the request from the per-CPU magazine, you satisfy the request from the slab cache, you satisfy the request via immediate vmem backend allocation and a growth of the slab cache. All of these with increased latency but without sleeping.

Sleeping only comes in if you're so tight on memory that you need to perform coalescing in the backend, and purge least-recently-used things from other kmem caches in favour of new backend requests.

Just because you chose to say kmem_alloc(..., KM_SLEEP) doesn't mean you _will_ sleep. Normally you won't.

> Actually, I think it is so bad that why don't you time 1 kmem_free
> versus grabbing elements off the freelist.
>
> However, don't trust me; I will drop a snapshot of the code to you
> tomorrow if you want and you can make a single-CPU benchmark
> comparison.
>
> Your multiple-CPU issue forces me to ask: is it a common occurrence
> that 2 or more CPUs are simultaneously requesting memory for the
> intent log? If it is, then there should be a freelist of a
> low-watermark set of elements per CPU. However, one thing at a
> time...

Of course it's common - have two or more threads do filesystem I/O at the same time and you're already there.

Which is why, one thing at a time, Solaris kmem has had the magazine layer for, I think (predates my time at Sun), around 12 years now, to get SMP scalability. Been there done that ...

> So, do you want that code? It will be a single alloc of X units
> which are then placed on a freelist. You then time how long it
> takes to remove Y elements from the freelist versus 1 kmem_alloc
> with a NO_SLEEP arg and report the numbers. Then I would suggest
> the call with the smallest sleep possible. How many allocs can
> then be done? 25k, 35k, more...
>
> Oh, the reason why we aren't timing the initial kmem_alloc call
> for the freelist is because I expect that to occur during init,
> with no further progress until the memory is alloc'ed.

Can you provide timing measurements under various loads that show the benefit of your change to ZFS vs. using Solaris kmem as-is? On single-CPU machines as well as on 100-CPU machines? On single disks as well as on 100-disk pools?

We're very interested there. Performance characterization of ZFS has, in a way, just started, as people begin using it for their own purposes, coming up with their own numbers; changes that improve "speed" will obviously be welcome!

Best wishes,
FrankH.

>
> Mitchell Erblich
> ------------------------
>
> Casper.Dik at Sun.COM wrote:
>>
>>> In at least one location, when adding a new dva node into the tree,
>>> a kmem_alloc is done with a KM_SLEEP argument.
>>>
>>> Thus, this process thread could block waiting for memory.
>>>
>>> I would suggest adding a pre-allocated pool of dva nodes.
>>
>> This is how the Solaris memory allocator works. It keeps pools of
>> "pre-allocated" nodes about until memory conditions are low.
>>
>>> When a new dva node is needed, first check this pre-allocated
>>> pool and allocate from there.
>>
>> There are two reasons why this is a really bad idea:
>>
>> - the system will run out of memory even sooner if people
>>   start building their own free-lists
>>
>> - a single freelist does not scale; at two CPUs it becomes
>>   the allocation bottleneck (I've measured and removed two
>>   such bottlenecks from Solaris 9)
>>
>> You might want to learn about how the Solaris memory allocator works;
>> it pretty much works like you want, except that it is all part of the
>> framework. And, just as in your case, it does run out some times but
>> a private freelist does not help against that.
>>
>>> Why? This would eliminate a possible sleep condition if memory
>>> is not immediately available. The pool would add a working
>>> set of dva nodes that could be monitored. Per-alloc latencies
>>> could be amortized over a chunk allocation.
>>
>> That's how the Solaris memory allocator already works.
>>
>> Casper
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>

==========================================================================
No good can come from selling your freedom, not for all gold of the world,
for the value of this heavenly gift exceeds that of any fortune on earth.
==========================================================================
Erblichs
2006-Oct-06 01:20 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
Casper Dik,

After my posting, I assumed that a code question should be directed to the ZFS code alias, so I apologize to the people who don't read code. However, since the discussion is here, I will post a code proof here. Just use "time program" to get a generic time frame. It is under 0.1 secs for 500k loops (each loop removes an obj and puts it back).

It is just to be used as a proof of concept that a simple pre-alloc'ed set of objects can be accessed so much faster than allocating and assigning them.

To support the change to the intent log objects, I would suggest first identifying the number of objects normally allocated and using that as a working set of objects. A time element is also needed to identify when objects should be released from the free list to the memory cache.

Yes, the initial thoughts of having per-CPU based allocs are coded in, which would allow multiple simultaneous accesses to a freelist per CPU. This should remove most of the mutex code necessary for scalability.

Objective
-----------

What this app code is proving is that pre-alloc'ed items can be removed from and stored on a simple free list. This is just an initial proof that shows "fast access" to a working set of objects. The time to make one chunk alloc, place all of them on a free list, and then perform 500k ops of removal and insertion is probably somewhere between 50 and 1000x faster than even the best memory allocators allocating/retrieving 500k items. If a dynamic list of nodes is required, the chunk alloc should be changed.

This quick piece of app code runs in less than 0.1 sec with 500k retrieve and store ops. This is fast enough to grab 64-byte chunks dealing with even a 1Gb Eth link. Even though this code is simplified, it indicates that kmem_allocs will have the same benefits even without sleeping.

-------------

The code does a single pre-alloc and then breaks up the alloc into N node pieces. It takes each piece and places them on a free list in the init section. The assumption here is that we have a fixed, reasonable number of items. If the number of items is dynamic, the init could easily alloc a number of nodes, then use watermarks to alloc into and free from the free list as the nodes are used. If the logic is used to deal with kmem ops, then any free nodes could be returned to memory when excess nodes are on the free list.

The base level of this type of logic is normally used when a special program project requires non-standard interfaces to guarantee a HIGH level of performance.

The main has a hard-coded 500k loops, each of which allocs one node and then frees it. Thus, 500k equivalent allocs would need to be done. This ran in 0.02 to 0.35 secs on a 1.3GHz laptop Linux box.

-----

It is my understanding that Bonwick's new allocator was created to remove fragmentation. And yes, it also allows the OS to reduce the overhead of dealing with mem objects of processes that are freed and alloc'ed frequently. When the system gets low on memory, it steals freed objects that are being cached. However, even with no SLEEPING, I have yet to see it perform as fast as simple retrieves and stores.

Years ago, the amount of memory on a system was limited due to its expense. This is no longer the case. Some/most processes/threads could have a decent increase in performance if the amount of workload done on a working set of objects is minimized. Up to this workset, I propose that an almost guaranteed level of performance could be achieved.

With the comment that any type of functionality that has merits gets an API so multiple processes / threads can use that functionality: years ago I was in the process of doing that when I left a company with an ARC group. It was to add a layer of working set mem objects that would have "fast access" properties.

I will ONLY GUARANTEE that X working set objects, once freed to the FREE LIST, can be re-allocated without waiting for the objects. Any count beyond that working set has the same underlying properties. Except that if I KNOW that the number on my freelist goes down to a small value, I could pre-request more objects. The latency of retrieving these objects could thus be minimized. This logic then removes on-demand memory allocations, so any WAIT time MAY not affect the other parts of the process that need more objects.

-----------------
Mitchell Erblich

Casper.Dik at Sun.COM wrote:
>
> > Casper Dik,
> >
> > Yes, I am familiar with Bonwick's slab allocators and tried
> > it for a wirespeed test of 64-byte pieces for a 1Gb and then
> > 100Mb Ethernet and lastly 10Mb Ethernet. My results were not
> > encouraging. I assume it has improved over time.
>
> Nothing which tries to send 64-byte pieces over 1Gb ethernet or 100Mb
> ethernet will give encouraging results.
>
> > First, let me ask what happens to the FS if the allocs
> > in the intent log code are sleeping waiting for memory?
>
> How are you going to guarantee that there is *always* memory available?
>
> I think that's barking up the wrong tree. I think that a proper
> solution is not trying to find a way which prevents memory from
> running out but rather a way of dealing with the case of it running
> out.
>
> If KM_SLEEP is used in a path where it is causing problems, then no
> amount of freelists is going to solve that. There needs to be a
> solution which does not sleep.
>
> > - the cost of getting memory from a "cache" of one's own size/type
> >   is orders of magnitude higher than just getting some off one's
> >   own freelist,
>
> Actually, that's not true; Bonwick's allocator is *FASTER* by a *wide*
> margin than your own freelist.
>
> Believe me, I've measured this, I've seen "my own freelist" collapse
> on the floor when confronted with as few as two CPUs.
>
> As a minimum, you will need *per CPU* free lists.
>
> And that's precisely what the kernel memory allocator gives you.
>
> > In the time when memory was expensive, maybe a global sharing
> > mechanism would make sense, but when the amount of memory is
> > somewhat plentiful and cheap,
>
> Not if all bits of the system are going to keep their own
> freelists * #CPUs.
>
> Then you are suddenly faced with a *MUCH* higher memory demand. The
> Bonwick allocator does keep quite a bit cached and keeps more memory
> unavailable already.
>
> > *** it then makes sense for a 2-stage implementation:
> > preallocation of a working set, and then normal allocation
> > with the added latency.
>
> But the normal Bonwick allocation *is* two-stage; you are proposing to
> add a 3rd stage.
>
> > So, it makes sense to pre-allocate a working set of allocs by a
> > single alloc call, break up the alloc into the needed sizes, and
> > then alloc from your own free list.
>
> That's what the Bonwick allocator does; so why are you duplicating
> this?
>
> Apart from the questionable performance gain (I believe there to be
> none), the loss of the kernel memory allocator debugging functionality
> is severe:
>
> - you can no longer track where the individual blocks are allocated
> - you can no longer track buffer overruns
> - buffers run into one another, so one overrun buffer corrupts another
>   without trace
>
> > -----> If that freelist then empties, maybe then take the extra
> > overhead of the kmem call. Consider this an expected cost of
> > exceeding a certain watermark.
>
> This is exactly how the magazine layer works.
>
> > But otherwise, I bet if I give you some code for the pre-alloc,
> > 10 allocs from the freelist can be done versus the kmem_alloc call,
> > and at least 100 to 10k allocs if a sleep occurs on your side.
>
> I hope you're not designing this with a single lock per queue.
>
> I have eradicated code in Solaris 9 which looked like this:
>
>     struct au_buff *
>     au_get_buff(void)
>     {
>         au_buff_t *buffer = NULL;
>
>         mutex_enter(&au_free_queue_lock);
>
>         if (au_free_queue == NULL) {
>             if (au_get_chunk(1)) {
>                 mutex_exit(&au_free_queue_lock);
>                 return (NULL);
>             }
>         }
>
>         buffer = au_free_queue;
>         au_free_queue = au_free_queue->next_buf;
>         mutex_exit(&au_free_queue_lock);
>         buffer->next_buf = NULL;
>         return (buffer);
>     }
>
> (with a corresponding free routine which never returned memory to the
> system but kept it in the freelist)
>
> This was replaced with essentially:
>
>     buffer = kmem_cache_alloc(au_buf_cache, KM_SLEEP);
>
> The first bit of code stopped scaling at 1 CPU (the performance
> with two CPUs was slightly worse than with one CPU).
>
> The second bit of code was both FASTER in the single-CPU case and
> scaled to the twelve CPUs I had for testing.
>
> > Actually, I think it is so bad that why don't you time 1 kmem_free
> > versus grabbing elements off the freelist.
>
> I did, it's horrendous.
>
> Don't forget that in the typical case, when the magazine layer is
> properly sized after the system has been running for a while, no locks
> need to be grabbed to get memory as the magazines are per-CPU.
>
> But with your single freelist, you must grab a lock. Somewhere in
> the grab/release lock cycle there's at least one atomic operation
> and memory barrier.
>
> Those are perhaps cheap on single-CPU systems but run in the hundreds
> of cycles on larger systems.
>
> Here's something else from Sun's "collective" memory which I think
> illustrates why we think private freelists are bad, /on principle/.
>
> When Jeff Bonwick redid the memory allocator he did this because he
> noticed that the performance was bad, so bad even that people
> generally avoided the allocator if they could (hence the use of
> private freelists in particular parts of the system such as the above
> example from auditing). This is particularly poor software engineering
> for two important reasons: if a core bit of software does not work or
> perform properly, it needs to be rewritten and not worked around; and
> if a core bit of software is worked around, any future improvements
> will go unnoticed.
>
> The auditing example and the resulting poor performance of the
> auditing code amply demonstrate both points; there's even an
> additional similarity between it and the intent log: both allocate
> buffers to be written to disk.
>
> The object lesson of the Bonwick memory allocator, and of the later
> addition of the magazine layer, is this: if this tool is not up to the
> job, it must be improved, not worked around.
>
> So *if* the memory allocator is not able to deliver the QoS required
> for a particular part of the system, it stands to reason that it
> should be improved with QoS features to serve the needs of those
> consumers.
>
> (But I'm not sure that that is even the case; but then, I'm not
> sufficiently familiar with how ZIL or ZFS work and what exact issues
> there are.)
>
> Sorry for the lecture; my experience with single freelists of
> buffers to be written to disk is just something I thought I needed to
> share in this discussion.
>
> Casper

-------------- next part --------------
A non-text attachment was scrubbed...
Name: node_alloc.c
Type: application/x-unknown-content-type-c_auto_file
Size: 2502 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20061005/581b3e77/attachment.bin>
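The node_alloc.c attachment itself was scrubbed by the list software; as a reconstruction of the kind of proof of concept described above (hypothetical code, not the original attachment), a single-threaded version might look like this:

    /* node_alloc.c-style proof of concept (reconstruction, not the original):
     * one chunk allocation carved into nodes, a simple freelist, and a loop
     * of 500k get/put operations.  Single-threaded by design. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NNODES  1024
    #define NLOOPS  500000

    struct node {
        struct node *node_next;
        char node_payload[56];          /* pad toward ~64 bytes */
    };

    static struct node *nodefree;       /* head of the freelist */

    static void
    node_init(void)
    {
        /* one chunk alloc, broken up into NNODES pieces */
        struct node *chunk = malloc(NNODES * sizeof (struct node));
        int i;

        for (i = 0; i < NNODES; i++) {
            chunk[i].node_next = nodefree;
            nodefree = &chunk[i];
        }
    }

    static struct node *
    node_getnode(void)
    {
        struct node *node = nodefree;

        if (node == NULL)               /* "shouldn't happen" */
            return (NULL);
        nodefree = node->node_next;
        node->node_next = NULL;
        return (node);
    }

    static void
    node_putnode(struct node *node)
    {
        node->node_next = nodefree;
        nodefree = node;
    }

    int
    main(void)
    {
        int i;

        node_init();
        for (i = 0; i < NLOOPS; i++)    /* time this with "time ./node_alloc" */
            node_putnode(node_getnode());
        return (0);
    }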
Anton B. Rang
2006-Oct-06 04:48 UTC
[zfs-discuss] Re: single memory allocation in the ZFS intent log
Hi Mitchell,

I do work for Sun, but I don't consider myself biased towards the slab allocator or any other Solaris or Sun code. I know we've got plenty of improvements to make!

That said, your example is not multi-threaded. There are two major performance issues which come up with a list structure in a multi-threaded environment. The slab allocator addresses both of these.

The first is locking. In a multi-threaded environment, your node_alloc/node_free would have to lock the free list head. This leads to contention between threads. (Yes, I know about lockfree algorithms, but they don't eliminate contention; and they are even more susceptible to the second problem.)

The second is cache line movement. Each time a CPU reads the memory location containing the head of the free list, it needs to get this data from memory or from another CPU's cache. Each time the CPU updates memory (which happens for each allocation or free), the cache line must be invalidated in all other CPUs' caches. Hence ownership of the cache line moves between CPUs. This is incredibly expensive (it's a major reason that locking is slow, actually). It can take tens to hundreds of cycles.

How does the slab allocator avoid these? By taking advantage of the operating system's providing a "current CPU" index to the thread. The allocator uses this to index into a set of cached objects which have been recently freed by the current CPU, and allocates/frees into that set. This has the advantages that (a) the lock protecting the set is very rarely contended, since a thread would have to be pre-empted during the run of the allocator to have its current CPU change; (b) the data structures describing the set are not shared between CPUs and hence can stay in one cache; (c) the objects themselves tend to have been freed from the current CPU and hence are somewhat more likely to be in the current CPU's cache. (I'm not sure if that last has a measurable effect.)

If you're interested in learning more, I'd suggest reading the original paper, which you can find at:

http://citeseer.ist.psu.edu/bonwick94slab.html

The slab allocator does somewhat more, such as allowing objects of a type which require initialization to be reused more cheaply than an allocation+initialization, but the above are the key scalability insights as I understand them.

Your approach will certainly perform better on a single processor, and might on a dual processor. Beyond that, its scalability is likely poor. And, as Casper pointed out, keeping private memory pools for each subsystem tends to increase kernel memory usage.

This message posted from opensolaris.org
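A small userland model of the idea Anton describes may help: a per-thread cache serves most requests without touching shared state, and only the refill path takes the shared lock. This is an illustrative sketch (thread-local storage stands in for the per-CPU magazines), not the Solaris implementation:

    #include <stdlib.h>
    #include <pthread.h>

    struct obj { struct obj *next; };

    /* per-thread "magazine": accessed without any lock or shared cache line */
    static __thread struct obj *magazine;

    /* shared "depot", only touched when the magazine is empty */
    static struct obj *depot;
    static pthread_mutex_t depot_lock = PTHREAD_MUTEX_INITIALIZER;

    static struct obj *
    model_alloc(void)
    {
        struct obj *o;

        if ((o = magazine) != NULL) {       /* common case: no lock, no sharing */
            magazine = o->next;
            return (o);
        }
        pthread_mutex_lock(&depot_lock);    /* uncommon case: refill from depot */
        if ((o = depot) != NULL)
            depot = o->next;
        pthread_mutex_unlock(&depot_lock);
        if (o != NULL)
            return (o);
        return (malloc(sizeof (struct obj)));   /* backend allocation */
    }

    static void
    model_free(struct obj *o)
    {
        o->next = magazine;                 /* free back to this thread's cache */
        magazine = o;
    }

A real magazine layer bounds the per-CPU cache and exchanges full and empty magazines with a shared depot; the point of the sketch is only that the hot path touches no shared cache line.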
Erblichs
2006-Oct-06 09:23 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
Group,

This example is done with a single-threaded app. It is NOT NOT NOT intended to show any level of thread-safe coding.

It is ONLY used to show that it is significantly lower cost to grab pre-allocated objects than to allocate the objects on demand. Thus, grabbing 64-byte chunks off a free list and placing them back on can be done with this simple base code even when dealing with 1Gb/sec intfs.

Under extreme circumstances, the normal on-demand allocator can sleep if it needs to coalesce memory or steal it from another's cache.

Mitchell Erblich
-----------------

Frank Hofmann wrote:
>
> On Thu, 5 Oct 2006, Erblichs wrote:
>
> > Casper Dik,
> >
> > After my posting, I assumed that a code question should be
> > directed to the ZFS code alias, so I apologize to the people
> > who don't read code. However, since the discussion is here,
> > I will post a code proof here. Just use "time program" to get
> > a generic time frame. It is under 0.1 secs for 500k loops
> > (each loop removes an obj and puts it back).
> >
> > It is just to be used as a proof of concept that a simple
> > pre-alloc'ed set of objects can be accessed so much faster
> > than allocating and assigning them.
>
> Ok, could you please explain how this piece (and all else, for that
> matter):
>
>     /*
>      * Get a node structure from the freelist
>      */
>     struct node *
>     node_getnode()
>     {
>         struct node *node;
>
>         if ((node = nodefree) == NULL)  /* "shouldn't happen" */
>             printf("out of nodes");
>
>         nodefree = node->node_next;
>         node->node_next = NULL;
>
>         return (node);
>     }
>
> is multithread-safe?
>
> Best wishes,
> FrankH.
Casper.Dik at Sun.COM
2006-Oct-06 09:46 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
> This example is done with a single-threaded app.
> It is NOT NOT NOT intended to show any level of
> thread-safe coding.
>
> It is ONLY used to show that it is significantly lower cost
> to grab pre-allocated objects than to allocate the
> objects on demand.

Your example does not demonstrate that at all. The kernel memory allocator's magazine layer implements almost exactly the same algorithm.

I was wrong about it not grabbing a lock (it grabs the cache's per-CPU lock) but that lock is uncontended and only used on the one CPU.

So the only thing that makes your code faster is stripping out all the debugging functionality.

Casper
Frank Hofmann
2006-Oct-06 10:06 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
On Thu, 5 Oct 2006, Erblichs wrote:

> Casper Dik,
>
> After my posting, I assumed that a code question should be
> directed to the ZFS code alias, so I apologize to the people
> who don't read code. However, since the discussion is here,
> I will post a code proof here. Just use "time program" to get
> a generic time frame. It is under 0.1 secs for 500k loops
> (each loop removes an obj and puts it back).
>
> It is just to be used as a proof of concept that a simple
> pre-alloc'ed set of objects can be accessed so much faster
> than allocating and assigning them.

Ok, could you please explain how this piece (and all else, for that matter):

    /*
     * Get a node structure from the freelist
     */
    struct node *
    node_getnode()
    {
        struct node *node;

        if ((node = nodefree) == NULL)  /* "shouldn't happen" */
            printf("out of nodes");

        nodefree = node->node_next;
        node->node_next = NULL;

        return (node);
    }

is multithread-safe?

Best wishes,
FrankH.
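For comparison, the minimal way to make that routine thread-safe is a single global lock, as in the hypothetical kernel-style sketch below; this is the same single-freelist-plus-mutex shape as Casper's au_get_buff example, the shape that stops scaling beyond one CPU:

    static kmutex_t nodefree_lock;      /* would be initialized at module init */

    struct node *
    node_getnode(void)
    {
        struct node *node;

        mutex_enter(&nodefree_lock);
        if ((node = nodefree) == NULL) {    /* "shouldn't happen" */
            mutex_exit(&nodefree_lock);
            return (NULL);
        }
        nodefree = node->node_next;
        mutex_exit(&nodefree_lock);

        node->node_next = NULL;
        return (node);
    }

Every caller on every CPU now serializes on nodefree_lock and bounces its cache line between CPUs, which is the contention and cache-line movement Anton describes above.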