Erblichs
2006-Oct-04 02:45 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
Group,

In at least one location, when adding a new dva node into the tree, a kmem_alloc is done with a KM_SLEEP argument. Thus, this process thread could block waiting for memory.

I would suggest adding a pre-allocated pool of dva nodes. When a new dva node is needed, first check this pre-allocated pool and allocate from there.

Why? This would eliminate a possible sleep condition if memory is not immediately available. The pool would add a working set of dva nodes that could be monitored, and per-alloc latencies could be amortized over a chunk allocation. Lastly, if memory is scarce, a long time may pass before the node can be allocated and added to the tree. If the number of nodes is monitored, restricted operations could be performed until the intent log decreases in size.

I can supply untested code within 24 hours if wanted.

Mitchell Erblich
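For context, the pattern being described is roughly the following; the type and variable names here are hypothetical, not the actual ZIL source:

    dva_node_t *dn;     /* hypothetical node type */

    /* KM_SLEEP: the caller may block here until memory becomes available */
    dn = kmem_alloc(sizeof (dva_node_t), KM_SLEEP);
    /* ... fill in dn and insert it into the tree ... */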
Casper.Dik at Sun.COM
2006-Oct-04 06:57 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
> In at least one location, when adding a new dva node into the tree,
> a kmem_alloc is done with a KM_SLEEP argument.
>
> Thus, this process thread could block waiting for memory.
>
> I would suggest adding a pre-allocated pool of dva nodes.

This is how the Solaris memory allocator works. It keeps pools of "pre-allocated" nodes about until memory conditions are low.

> When a new dva node is needed, first check this pre-allocated
> pool and allocate from there.

There are two reasons why this is a really bad idea:

 - the system will run out of memory even sooner if people
   start building their own free-lists

 - a single freelist does not scale; at two CPUs it becomes
   the allocation bottleneck (I've measured and removed two
   such bottlenecks from Solaris 9)

You might want to learn about how the Solaris memory allocator works; it pretty much works like you want, except that it is all part of the framework. And, just as in your case, it does run out some times but a private freelist does not help against that.

> Why? This would eliminate a possible sleep condition if memory
> is not immediately available. The pool would add a working
> set of dva nodes that could be monitored. Per-alloc latencies
> could be amortized over a chunk allocation.

That's how the Solaris memory allocator already works.

Casper
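Concretely, "it is all part of the framework" means the consumer just allocates and frees through a kmem cache, and the allocator maintains the cached pool (including the per-CPU magazines) behind the scenes. A sketch, using a hypothetical cache for the dva nodes discussed above:

    /* dva_node_cache is hypothetical; it would be created once with
     * kmem_cache_create() at initialization. */
    dva_node_t *dn;

    dn = kmem_cache_alloc(dva_node_cache, KM_SLEEP);
    /* ... use dn ... */
    kmem_cache_free(dva_node_cache, dn);    /* returns to the cached pool */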
Erblichs
2006-Oct-04 08:19 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
Casper Dik,

Yes, I am familiar with Bonwick's slab allocators and tried it for a wirespeed test of 64-byte pieces for a 1Gb and then 100Mb Ethernet and lastly 10Mb Ethernet. My results were not encouraging. I assume it has improved over time.

First, let me ask what happens to the FS if the allocs in the intent log code are sleeping waiting for memory?

IMO, the general problem with memory allocators is:

 - the cost of getting memory from a "cache" of one's own size/type
   is orders of magnitude higher than just getting some off one's
   own freelist,

 - there is a built-in latency to recuperate/steal memory from
   other processes,

 - this stealing forces a sleep and context switches,

 - the amount of time to sleep is indeterminate with a single call
   per struct. How long can you sleep for? 100ms or 250ms or more...

 - no process can guarantee a working set.

In the time when memory was expensive, maybe a global sharing mechanism would make sense, but when the amount of memory is somewhat plentiful and cheap,

*** it then makes sense for a 2-stage implementation: preallocation of a working set, and then normal allocation with the added latency.

So, it makes sense to pre-allocate a working set of allocs by a single alloc call, break up the alloc into the needed sizes, and then alloc from your own free list.

-----> If that freelist then empties, maybe then take the extra overhead of the kmem call. Consider this an expected cost of exceeding a certain watermark.

But otherwise, I bet if I give you some code for the pre-alloc, 10 allocs from the freelist can be done versus the kmem_alloc call, and at least 100 to 10k allocs if a sleep occurs on your side. Actually, I think it is so bad that why don't you time 1 kmem_free versus grabbing elements off the freelist.

However, don't trust me; I will drop a snapshot of the code to you tomorrow if you want and you can make a single-CPU benchmark comparison.

Your multiple-CPU issue forces me to ask: is it a common occurrence that 2 or more CPUs are simultaneously requesting memory for the intent log? If it is, then there should be a freelist of a low-watermark set of elements per CPU. However, one thing at a time...

So, do you want that code? It will be a single alloc of X units which are then placed on a freelist. You then time how long it takes to remove Y elements from the freelist versus 1 kmem_alloc with a NO_SLEEP arg and report the numbers. Then I would suggest the call with the smallest sleep possible. How many allocs can then be done? 25k, 35k, more...

Oh, the reason why we aren't timing the initial kmem_alloc call for the freelist is because I expect that to occur during init, with no further progress until the memory is alloc'ed.

Mitchell Erblich
------------------------

Casper.Dik at Sun.COM wrote:
>
> > In at least one location, when adding a new dva node into the tree,
> > a kmem_alloc is done with a KM_SLEEP argument.
> >
> > Thus, this process thread could block waiting for memory.
> >
> > I would suggest adding a pre-allocated pool of dva nodes.
>
> This is how the Solaris memory allocator works. It keeps pools of
> "pre-allocated" nodes about until memory conditions are low.
>
> > When a new dva node is needed, first check this pre-allocated
> > pool and allocate from there.
>
> There are two reasons why this is a really bad idea:
>
> - the system will run out of memory even sooner if people
>   start building their own free-lists
>
> - a single freelist does not scale; at two CPUs it becomes
>   the allocation bottleneck (I've measured and removed two
>   such bottlenecks from Solaris 9)
>
> You might want to learn about how the Solaris memory allocator works;
> it pretty much works like you want, except that it is all part of the
> framework. And, just as in your case, it does run out some times but
> a private freelist does not help against that.
>
> > Why? This would eliminate a possible sleep condition if memory
> > is not immediately available. The pool would add a working
> > set of dva nodes that could be monitored. Per-alloc latencies
> > could be amortized over a chunk allocation.
>
> That's how the Solaris memory allocator already works.
>
> Casper
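A minimal sketch of the two-stage scheme described above: serve from a private freelist, and fall back to a non-sleeping kernel allocation when it empties. The names are hypothetical and, as written, this has exactly the single unlocked freelist that the rest of the thread argues against:

    typedef struct zil_node {
        struct zil_node *zn_next;
        uint64_t        zn_data;
    } zil_node_t;

    static zil_node_t *zn_freelist;     /* pre-allocated in chunks at init */

    static zil_node_t *
    zn_get(void)
    {
        zil_node_t *zn;

        if ((zn = zn_freelist) != NULL) {   /* stage 1: private freelist */
            zn_freelist = zn->zn_next;
            zn->zn_next = NULL;
            return (zn);
        }
        /* stage 2: freelist empty; pay for a real allocation without
         * sleeping (may return NULL, which the caller must handle) */
        return (kmem_alloc(sizeof (zil_node_t), KM_NOSLEEP));
    }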
Casper.Dik at Sun.COM
2006-Oct-04 10:14 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
> Casper Dik,
>
> Yes, I am familiar with Bonwick's slab allocators and tried
> it for a wirespeed test of 64-byte pieces for a 1Gb and then
> 100Mb Ethernet and lastly 10Mb Ethernet. My results were not
> encouraging. I assume it has improved over time.

Nothing which tries to send 64-byte pieces over 1Gb ethernet or 100Mb ethernet will give encouraging results.

> First, let me ask what happens to the FS if the allocs
> in the intent log code are sleeping waiting for memory?

How are you going to guarantee that there is *always* memory available?

I think that's barking up the wrong tree. I think that a proper solution is not trying to find a way which prevents memory from running out but rather a way of dealing with the case of it running out.

If KM_SLEEP is used in a path where it is causing problems, then no amount of freelists is going to solve that. There needs to be a solution which does not sleep.

> - the cost of getting memory from a "cache" of one's own size/type
>   is orders of magnitude higher than just getting some off one's
>   own freelist,

Actually, that's not true; Bonwick's allocator is *FASTER* by a *wide* margin than your own freelist.

Believe me, I've measured this, I've seen "my own freelist" collapse on the floor when confronted with as few as two CPUs.

As a minimum, you will need *per CPU* free lists.

And that's precisely what the kernel memory allocator gives you.

> In the time when memory was expensive, maybe a global sharing
> mechanism would make sense, but when the amount of memory is
> somewhat plentiful and cheap,

Not if all bits of the system are going to keep their own freelists * #CPUs.

Then you are suddenly faced with a *MUCH* higher memory demand. The Bonwick allocator does keep quite a bit cached and keeps more memory unavailable already.

> *** it then makes sense for a 2-stage implementation:
> preallocation of a working set, and then normal allocation
> with the added latency.

But the normal Bonwick allocation *is* two-stage; you are proposing to add a 3rd stage.

> So, it makes sense to pre-allocate a working set of allocs by a
> single alloc call, break up the alloc into the needed sizes, and
> then alloc from your own free list.

That's what the Bonwick allocator does; so why are you duplicating this?

Apart from the questionable performance gain (I believe there to be none), the loss of the kernel memory allocator debugging functionality is severe:

 - you can no longer track where the individual blocks are allocated
 - you can no longer track buffer overruns
 - buffers run into one another, so one overrun buffer corrupts another
   without trace

> -----> If that freelist then empties, maybe then take the extra
> overhead of the kmem call. Consider this an expected cost of
> exceeding a certain watermark.

This is exactly how the magazine layer works.

> But otherwise, I bet if I give you some code for the pre-alloc,
> 10 allocs from the freelist can be done versus the kmem_alloc call,
> and at least 100 to 10k allocs if a sleep occurs on your side.

I hope you're not designing this with a single lock per queue.

I have eradicated code in Solaris 9 which looked like this:

    struct au_buff *
    au_get_buff(void)
    {
        au_buff_t *buffer = NULL;

        mutex_enter(&au_free_queue_lock);

        if (au_free_queue == NULL) {
            if (au_get_chunk(1)) {
                mutex_exit(&au_free_queue_lock);
                return (NULL);
            }
        }

        buffer = au_free_queue;
        au_free_queue = au_free_queue->next_buf;
        mutex_exit(&au_free_queue_lock);
        buffer->next_buf = NULL;
        return (buffer);
    }

(with a corresponding free routine which never returned memory to the system but kept it in the freelist)

This was replaced with essentially:

    buffer = kmem_cache_alloc(au_buf_cache, KM_SLEEP);

The first bit of code stopped scaling at 1 CPU (the performance with two CPUs was slightly worse than with one CPU).

The second bit of code was both FASTER in the single-CPU case and scaled to the twelve CPUs I had for testing.

> Actually, I think it is so bad that why don't you time 1 kmem_free
> versus grabbing elements off the freelist.

I did, it's horrendous.

Don't forget that in the typical case, when the magazine layer is properly sized after the system has been running for a while, no locks need to be grabbed to get memory as the magazines are per-CPU.

But with your single freelist, you must grab a lock. Somewhere in the grab/release lock cycle there's at least one atomic operation and memory barrier.

Those are perhaps cheap on single-CPU systems but run in the hundreds of cycles on larger systems.

Here's something else from Sun's "collective" memory which I think illustrates why we think private freelists are bad, /on principle/.

When Jeff Bonwick redid the memory allocator he did this because he noticed that the performance was bad, so bad even that people generally avoided the allocator if they could (hence the use of private freelists in particular parts of the system such as the above example from auditing). This is particularly poor software engineering for two important reasons: if a core bit of software does not work or perform properly, it needs to be rewritten and not worked around; and if a core bit of software is worked around, any future improvements will go unnoticed.

The auditing example and the resulting poor performance of the auditing code amply demonstrate both points; there's even an additional similarity between it and the intent log: both allocate buffers to be written to disk.

The object lesson of the Bonwick memory allocator, and of the later addition of the magazine layer, is this: if this tool is not up to the job, it must be improved, not worked around.

So *if* the memory allocator is not able to deliver the QoS required for a particular part of the system, it stands to reason that it should be improved with QoS features to serve the needs of those consumers.

(But I'm not sure that that is even the case; but then, I'm not sufficiently familiar with how ZIL or ZFS work and what exact issues there are.)

Sorry for the lecture; my experience with single freelists of buffers to be written to disk is just something I thought I needed to share in this discussion.

Casper
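For completeness, the one-line replacement presumes the cache was created once at initialization; that would have looked roughly like this (a sketch of the standard kmem_cache interface, not the actual audit code):

    static kmem_cache_t *au_buf_cache;

    /* once, at audit subsystem initialization */
    au_buf_cache = kmem_cache_create("au_buf_cache", sizeof (au_buff_t),
        0, NULL, NULL, NULL, NULL, NULL, 0);

    /* ... and the matching free path: */
    kmem_cache_free(au_buf_cache, buffer);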
Frank Hofmann
2006-Oct-04 10:15 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
On Wed, 4 Oct 2006, Erblichs wrote:

> Casper Dik,
>
> Yes, I am familiar with Bonwick's slab allocators and tried
> it for a wirespeed test of 64-byte pieces for a 1Gb and then
> 100Mb Ethernet and lastly 10Mb Ethernet. My results were not
> encouraging. I assume it has improved over time.
>
> First, let me ask what happens to the FS if the allocs
> in the intent log code are sleeping waiting for memory?

The same as would happen to the FS with your proposed additional allocator layer, if that "freelist" of yours runs out - it'll wait, you'll see a latency bubble.

You seem to think it's likely that a kmem_alloc(..., KM_SLEEP) will sleep. It's not. Anything but. See below.

> IMO, the general problem with memory allocators is:
>
> - the cost of getting memory from a "cache" of one's own size/type
>   is orders of magnitude higher than just getting some off one's
>   own freelist,

This is why the kernel memory allocator in Solaris has two such freelists:

 - the per-CPU kmem magazines (you say below 'one step at a time', but
   that step is already done in Solaris kmem)

 - the slab cache

> - there is a built-in latency to recuperate/steal memory from
>   other processes,

Stealing ("reclaim" in Solaris kmem terms) happens if the following three conditions are true:

 - nothing in the per-CPU magazines
 - nothing in the slab cache
 - nothing in the quantum caches - on the attempt to grow the quantum
   cache, the request to the vmem backend finds no readily-available
   heap to satisfy the growth demand immediately

> - this stealing forces a sleep and context switches,
>
> - the amount of time to sleep is indeterminate with a single call
>   per struct. How long can you sleep for? 100ms or 250ms or more...
>
> - no process can guarantee a working set.

Yes and no. If your working set is small, use the stack.

> In the time when memory was expensive, maybe a global sharing
> mechanism would make sense, but when the amount of memory is
> somewhat plentiful and cheap,
>
> *** it then makes sense for a 2-stage implementation:
> preallocation of a working set, and then normal allocation
> with the added latency.
>
> So, it makes sense to pre-allocate a working set of allocs by a
> single alloc call, break up the alloc into the needed sizes, and
> then alloc from your own free list.

See above - all of that _IS_ already done in Solaris kmem/vmem, with more parallelism and more intermediate caching layers designed to bring down allocation latency than your simple freelist approach would achieve.

> -----> If that freelist then empties, maybe then take the extra
> overhead of the kmem call. Consider this an expected cost of
> exceeding a certain watermark.
>
> But otherwise, I bet if I give you some code for the pre-alloc,
> 10 allocs from the freelist can be done versus the kmem_alloc call,
> and at least 100 to 10k allocs if a sleep occurs on your side.

The same statistics can be made for Solaris kmem - you satisfy the request from the per-CPU magazine, you satisfy the request from the slab cache, you satisfy the request via immediate vmem backend allocation and a growth of the slab cache. All of these with increased latency but without sleeping.

Sleeping only comes in if you're so tight on memory that you need to perform coalescing in the backend, and purge least-recently-used things from other kmem caches in favour of new backend requests.

Just because you chose to say kmem_alloc(..., KM_SLEEP) doesn't mean you _will_ sleep. Normally you won't.

> Actually, I think it is so bad that why don't you time 1 kmem_free
> versus grabbing elements off the freelist.
>
> However, don't trust me; I will drop a snapshot of the code to you
> tomorrow if you want and you can make a single-CPU benchmark
> comparison.
>
> Your multiple-CPU issue forces me to ask: is it a common occurrence
> that 2 or more CPUs are simultaneously requesting memory for the
> intent log? If it is, then there should be a freelist of a
> low-watermark set of elements per CPU. However, one thing at a
> time...

Of course it's common - have two or more threads do filesystem I/O at the same time and you're already there.

Which is why, one thing at a time, Solaris kmem has had the magazine layer for, I think (predates my time at Sun), around 12 years now, to get SMP scalability. Been there done that ...

> So, do you want that code? It will be a single alloc of X units
> which are then placed on a freelist. You then time how long it
> takes to remove Y elements from the freelist versus 1 kmem_alloc
> with a NO_SLEEP arg and report the numbers. Then I would suggest
> the call with the smallest sleep possible. How many allocs can
> then be done? 25k, 35k, more...
>
> Oh, the reason why we aren't timing the initial kmem_alloc call
> for the freelist is because I expect that to occur during init,
> with no further progress until the memory is alloc'ed.

Can you provide timing measurements under various loads that show the benefit of your change to ZFS vs. using Solaris kmem as-is? On single-CPU machines as well as on 100-CPU machines? On single disks as well as on 100-disk pools?

We're very interested there. Performance characterization of ZFS has, in a way, just started, as people begin using it for their own purposes, coming up with their own numbers; changes that improve "speed" will obviously be welcome!

Best wishes,
FrankH.

>
> Mitchell Erblich
> ------------------------
>
> Casper.Dik at Sun.COM wrote:
>>
>>> In at least one location, when adding a new dva node into the tree,
>>> a kmem_alloc is done with a KM_SLEEP argument.
>>>
>>> Thus, this process thread could block waiting for memory.
>>>
>>> I would suggest adding a pre-allocated pool of dva nodes.
>>
>> This is how the Solaris memory allocator works. It keeps pools of
>> "pre-allocated" nodes about until memory conditions are low.
>>
>>> When a new dva node is needed, first check this pre-allocated
>>> pool and allocate from there.
>>
>> There are two reasons why this is a really bad idea:
>>
>> - the system will run out of memory even sooner if people
>>   start building their own free-lists
>>
>> - a single freelist does not scale; at two CPUs it becomes
>>   the allocation bottleneck (I've measured and removed two
>>   such bottlenecks from Solaris 9)
>>
>> You might want to learn about how the Solaris memory allocator works;
>> it pretty much works like you want, except that it is all part of the
>> framework. And, just as in your case, it does run out some times but
>> a private freelist does not help against that.
>>
>>> Why? This would eliminate a possible sleep condition if memory
>>> is not immediately available. The pool would add a working
>>> set of dva nodes that could be monitored. Per-alloc latencies
>>> could be amortized over a chunk allocation.
>>
>> That's how the Solaris memory allocator already works.
>>
>> Casper
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>

==========================================================================
No good can come from selling your freedom, not for all gold of the world,
for the value of this heavenly gift exceeds that of any fortune on earth.
==========================================================================
Erblichs
2006-Oct-06 01:20 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
Casper Dik,

After my posting, I assumed that a code question should be directed to the ZFS code alias, so I apologize to the people who don't read code. However, since the discussion is here, I will post a code proof here. Just use "time program" to get a generic time frame. It is under 0.1 secs for 500k loops (each loop removes an obj and puts it back).

It is just to be used as a proof of concept that a simple pre-alloc'ed set of objects can be accessed so much faster than allocating and assigning them.

To support the change to the intent log objects, I would suggest first identifying the number of objects normally allocated and using that as a working set of objects. A time element is also needed to identify when objects should be released from the free list to the memory cache.

Yes, the initial thoughts of having per-CPU based allocs are coded in, which would allow multiple simultaneous accesses to a freelist per CPU. This should remove most of the mutex code necessary for scalability.

Objective
-----------

What this app code is proving is that pre-alloc'ed items can be removed from and stored on a simple free list. This is just an initial proof that shows "fast access" to a working set of objects. The time to make one chunk alloc, place all of them on a free list, and then perform 500k ops of removal and insertion is probably somewhere between 50 and 1000x faster than even the best memory allocators allocating/retrieving 500k items. If a dynamic list of nodes is required, the chunk alloc should be changed.

This quick piece of app code runs in less than 0.1 sec with 500k retrieve and store ops. This is fast enough to grab 64-byte chunks dealing with even a 1Gb Eth link. Even though this code is simplified, it indicates that kmem_allocs will have the same benefits even without sleeping.

-------------

The code does a single pre-alloc and then breaks up the alloc into N node pieces. It takes each piece and places them on a free list in the init section. The assumption here is that we have a fixed, reasonable number of items. If the number of items is dynamic, the init could easily alloc a number of nodes, then use watermarks to alloc into and free from the free list as the nodes are used. If the logic is used to deal with kmem ops, then any free nodes could be returned to memory when excess nodes are on the free list.

The base level of this type of logic is normally used when a special program project requires non-standard interfaces to guarantee a HIGH level of performance.

The main has a hard-coded 500k loops, each of which allocs one node and then frees it. Thus, 500k equivalent allocs would need to be done. This ran in 0.02 to 0.35 secs on a 1.3GHz laptop Linux box.

-----

It is my understanding that Bonwick's new allocator was created to remove fragmentation. And yes, it also allows the OS to reduce the overhead of dealing with mem objects of processes that are freed and alloc'ed frequently. When the system gets low on memory, it steals freed objects that are being cached. However, even with no SLEEPING, I have yet to see it perform as fast as simple retrieves and stores.

Years ago, the amount of memory on a system was limited due to its expense. This is no longer the case. Some/most processes/threads could have a decent increase in performance if the amount of workload done on a working set of objects is minimized. Up to this workset, I propose that an almost guaranteed level of performance could be achieved.

With the comment that any type of functionality that has merits gets an API so multiple processes / threads can use that functionality: years ago I was in the process of doing that when I left a company with an ARC group. It was to add a layer of working set mem objects that would have "fast access" properties.

I will ONLY GUARANTEE that X working set objects, once freed to the FREE LIST, can be re-allocated without waiting for the objects. Any count beyond that working set has the same underlying properties. Except that if I KNOW that the number on my freelist goes down to a small value, I could pre-request more objects. The latency of retrieving these objects could thus be minimized. This logic then removes on-demand memory allocations, so any WAIT time MAY not affect the other parts of the process that need more objects.

-----------------
Mitchell Erblich

Casper.Dik at Sun.COM wrote:
>
> > Casper Dik,
> >
> > Yes, I am familiar with Bonwick's slab allocators and tried
> > it for a wirespeed test of 64-byte pieces for a 1Gb and then
> > 100Mb Ethernet and lastly 10Mb Ethernet. My results were not
> > encouraging. I assume it has improved over time.
>
> Nothing which tries to send 64-byte pieces over 1Gb ethernet or 100Mb
> ethernet will give encouraging results.
>
> > First, let me ask what happens to the FS if the allocs
> > in the intent log code are sleeping waiting for memory?
>
> How are you going to guarantee that there is *always* memory available?
>
> I think that's barking up the wrong tree. I think that a proper
> solution is not trying to find a way which prevents memory from
> running out but rather a way of dealing with the case of it running
> out.
>
> If KM_SLEEP is used in a path where it is causing problems, then no
> amount of freelists is going to solve that. There needs to be a
> solution which does not sleep.
>
> > - the cost of getting memory from a "cache" of one's own size/type
> >   is orders of magnitude higher than just getting some off one's
> >   own freelist,
>
> Actually, that's not true; Bonwick's allocator is *FASTER* by a *wide*
> margin than your own freelist.
>
> Believe me, I've measured this, I've seen "my own freelist" collapse
> on the floor when confronted with as few as two CPUs.
>
> As a minimum, you will need *per CPU* free lists.
>
> And that's precisely what the kernel memory allocator gives you.
>
> > In the time when memory was expensive, maybe a global sharing
> > mechanism would make sense, but when the amount of memory is
> > somewhat plentiful and cheap,
>
> Not if all bits of the system are going to keep their own
> freelists * #CPUs.
>
> Then you are suddenly faced with a *MUCH* higher memory demand. The
> Bonwick allocator does keep quite a bit cached and keeps more memory
> unavailable already.
>
> > *** it then makes sense for a 2-stage implementation:
> > preallocation of a working set, and then normal allocation
> > with the added latency.
>
> But the normal Bonwick allocation *is* two-stage; you are proposing to
> add a 3rd stage.
>
> > So, it makes sense to pre-allocate a working set of allocs by a
> > single alloc call, break up the alloc into the needed sizes, and
> > then alloc from your own free list.
>
> That's what the Bonwick allocator does; so why are you duplicating
> this?
>
> Apart from the questionable performance gain (I believe there to be
> none), the loss of the kernel memory allocator debugging functionality
> is severe:
>
> - you can no longer track where the individual blocks are allocated
> - you can no longer track buffer overruns
> - buffers run into one another, so one overrun buffer corrupts another
>   without trace
>
> > -----> If that freelist then empties, maybe then take the extra
> > overhead of the kmem call. Consider this an expected cost of
> > exceeding a certain watermark.
>
> This is exactly how the magazine layer works.
>
> > But otherwise, I bet if I give you some code for the pre-alloc,
> > 10 allocs from the freelist can be done versus the kmem_alloc call,
> > and at least 100 to 10k allocs if a sleep occurs on your side.
>
> I hope you're not designing this with a single lock per queue.
>
> I have eradicated code in Solaris 9 which looked like this:
>
>     struct au_buff *
>     au_get_buff(void)
>     {
>         au_buff_t *buffer = NULL;
>
>         mutex_enter(&au_free_queue_lock);
>
>         if (au_free_queue == NULL) {
>             if (au_get_chunk(1)) {
>                 mutex_exit(&au_free_queue_lock);
>                 return (NULL);
>             }
>         }
>
>         buffer = au_free_queue;
>         au_free_queue = au_free_queue->next_buf;
>         mutex_exit(&au_free_queue_lock);
>         buffer->next_buf = NULL;
>         return (buffer);
>     }
>
> (with a corresponding free routine which never returned memory to the
> system but kept it in the freelist)
>
> This was replaced with essentially:
>
>     buffer = kmem_cache_alloc(au_buf_cache, KM_SLEEP);
>
> The first bit of code stopped scaling at 1 CPU (the performance
> with two CPUs was slightly worse than with one CPU).
>
> The second bit of code was both FASTER in the single-CPU case and
> scaled to the twelve CPUs I had for testing.
>
> > Actually, I think it is so bad that why don't you time 1 kmem_free
> > versus grabbing elements off the freelist.
>
> I did, it's horrendous.
>
> Don't forget that in the typical case, when the magazine layer is
> properly sized after the system has been running for a while, no locks
> need to be grabbed to get memory as the magazines are per-CPU.
>
> But with your single freelist, you must grab a lock. Somewhere in
> the grab/release lock cycle there's at least one atomic operation
> and memory barrier.
>
> Those are perhaps cheap on single-CPU systems but run in the hundreds
> of cycles on larger systems.
>
> Here's something else from Sun's "collective" memory which I think
> illustrates why we think private freelists are bad, /on principle/.
>
> When Jeff Bonwick redid the memory allocator he did this because he
> noticed that the performance was bad, so bad even that people
> generally avoided the allocator if they could (hence the use of
> private freelists in particular parts of the system such as the above
> example from auditing). This is particularly poor software engineering
> for two important reasons: if a core bit of software does not work or
> perform properly, it needs to be rewritten and not worked around; and
> if a core bit of software is worked around, any future improvements
> will go unnoticed.
>
> The auditing example and the resulting poor performance of the
> auditing code amply demonstrate both points; there's even an
> additional similarity between it and the intent log: both allocate
> buffers to be written to disk.
>
> The object lesson of the Bonwick memory allocator, and of the later
> addition of the magazine layer, is this: if this tool is not up to the
> job, it must be improved, not worked around.
>
> So *if* the memory allocator is not able to deliver the QoS required
> for a particular part of the system, it stands to reason that it
> should be improved with QoS features to serve the needs of those
> consumers.
>
> (But I'm not sure that that is even the case; but then, I'm not
> sufficiently familiar with how ZIL or ZFS work and what exact issues
> there are.)
>
> Sorry for the lecture; my experience with single freelists of
> buffers to be written to disk is just something I thought I needed to
> share in this discussion.
>
> Casper

-------------- next part --------------
A non-text attachment was scrubbed...
Name: node_alloc.c
Type: application/x-unknown-content-type-c_auto_file
Size: 2502 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20061005/581b3e77/attachment.bin>
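The node_alloc.c attachment itself was scrubbed by the list software; as a reconstruction of the kind of proof of concept described above (hypothetical code, not the original attachment), a single-threaded version might look like this:

    /* node_alloc.c-style proof of concept (reconstruction, not the original):
     * one chunk allocation carved into nodes, a simple freelist, and a loop
     * of 500k get/put operations.  Single-threaded by design. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NNODES  1024
    #define NLOOPS  500000

    struct node {
        struct node *node_next;
        char node_payload[56];          /* pad toward ~64 bytes */
    };

    static struct node *nodefree;       /* head of the freelist */

    static void
    node_init(void)
    {
        /* one chunk alloc, broken up into NNODES pieces */
        struct node *chunk = malloc(NNODES * sizeof (struct node));
        int i;

        for (i = 0; i < NNODES; i++) {
            chunk[i].node_next = nodefree;
            nodefree = &chunk[i];
        }
    }

    static struct node *
    node_getnode(void)
    {
        struct node *node = nodefree;

        if (node == NULL)               /* "shouldn't happen" */
            return (NULL);
        nodefree = node->node_next;
        node->node_next = NULL;
        return (node);
    }

    static void
    node_putnode(struct node *node)
    {
        node->node_next = nodefree;
        nodefree = node;
    }

    int
    main(void)
    {
        int i;

        node_init();
        for (i = 0; i < NLOOPS; i++)    /* time this with "time ./node_alloc" */
            node_putnode(node_getnode());
        return (0);
    }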
Anton B. Rang
2006-Oct-06 04:48 UTC
[zfs-discuss] Re: single memory allocation in the ZFS intent log
Hi Mitchell,

I do work for Sun, but I don't consider myself biased towards the slab allocator or any other Solaris or Sun code. I know we've got plenty of improvements to make!

That said, your example is not multi-threaded. There are two major performance issues which come up with a list structure in a multi-threaded environment. The slab allocator addresses both of these.

The first is locking. In a multi-threaded environment, your node_alloc/node_free would have to lock the free list head. This leads to contention between threads. (Yes, I know about lockfree algorithms, but they don't eliminate contention; and they are even more susceptible to the second problem.)

The second is cache line movement. Each time a CPU reads the memory location containing the head of the free list, it needs to get this data from memory or from another CPU's cache. Each time the CPU updates memory (which happens for each allocation or free), the cache line must be invalidated in all other CPUs' caches. Hence ownership of the cache line moves between CPUs. This is incredibly expensive (it's a major reason that locking is slow, actually). It can take tens to hundreds of cycles.

How does the slab allocator avoid these? By taking advantage of the operating system's providing a "current CPU" index to the thread. The allocator uses this to index into a set of cached objects which have been recently freed by the current CPU, and allocates/frees into that set. This has the advantages that (a) the lock protecting the set is very rarely contended, since a thread would have to be pre-empted during the run of the allocator to have its current CPU change; (b) the data structures describing the set are not shared between CPUs and hence can stay in one cache; (c) the objects themselves tend to have been freed from the current CPU and hence are somewhat more likely to be in the current CPU's cache. (I'm not sure if that last has a measurable effect.)

If you're interested in learning more, I'd suggest reading the original paper, which you can find at:

http://citeseer.ist.psu.edu/bonwick94slab.html

The slab allocator does somewhat more, such as allowing objects of a type which require initialization to be reused more cheaply than an allocation+initialization, but the above are the key scalability insights as I understand them.

Your approach will certainly perform better on a single processor, and might on a dual processor. Beyond that, its scalability is likely poor. And, as Casper pointed out, keeping private memory pools for each subsystem tends to increase kernel memory usage.

This message posted from opensolaris.org
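A small userland model of the idea Anton describes may help: a per-thread cache serves most requests without touching shared state, and only the refill path takes the shared lock. This is an illustrative sketch (thread-local storage stands in for the per-CPU magazines), not the Solaris implementation:

    #include <stdlib.h>
    #include <pthread.h>

    struct obj { struct obj *next; };

    /* per-thread "magazine": accessed without any lock or shared cache line */
    static __thread struct obj *magazine;

    /* shared "depot", only touched when the magazine is empty */
    static struct obj *depot;
    static pthread_mutex_t depot_lock = PTHREAD_MUTEX_INITIALIZER;

    static struct obj *
    model_alloc(void)
    {
        struct obj *o;

        if ((o = magazine) != NULL) {       /* common case: no lock, no sharing */
            magazine = o->next;
            return (o);
        }
        pthread_mutex_lock(&depot_lock);    /* uncommon case: refill from depot */
        if ((o = depot) != NULL)
            depot = o->next;
        pthread_mutex_unlock(&depot_lock);
        if (o != NULL)
            return (o);
        return (malloc(sizeof (struct obj)));   /* backend allocation */
    }

    static void
    model_free(struct obj *o)
    {
        o->next = magazine;                 /* free back to this thread's cache */
        magazine = o;
    }

A real magazine layer bounds the per-CPU cache and exchanges full and empty magazines with a shared depot; the point of the sketch is only that the hot path touches no shared cache line.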
Erblichs
2006-Oct-06 09:23 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
Group,

This example is done with a single-threaded app. It is NOT NOT NOT intended to show any level of thread-safe coding.

It is ONLY used to show that it is significantly lower cost to grab pre-allocated objects than to allocate the objects on demand. Thus, grabbing 64-byte chunks off a free list and placing them back on can be done with this simple base code even when dealing with 1Gb/sec intfs.

Under extreme circumstances, the normal on-demand allocator can sleep if it needs to coalesce memory or steal it from another's cache.

Mitchell Erblich
-----------------

Frank Hofmann wrote:
>
> On Thu, 5 Oct 2006, Erblichs wrote:
>
> > Casper Dik,
> >
> > After my posting, I assumed that a code question should be
> > directed to the ZFS code alias, so I apologize to the people
> > who don't read code. However, since the discussion is here,
> > I will post a code proof here. Just use "time program" to get
> > a generic time frame. It is under 0.1 secs for 500k loops
> > (each loop removes an obj and puts it back).
> >
> > It is just to be used as a proof of concept that a simple
> > pre-alloc'ed set of objects can be accessed so much faster
> > than allocating and assigning them.
>
> Ok, could you please explain how this piece (and all else, for that
> matter):
>
>     /*
>      * Get a node structure from the freelist
>      */
>     struct node *
>     node_getnode()
>     {
>         struct node *node;
>
>         if ((node = nodefree) == NULL)  /* "shouldn't happen" */
>             printf("out of nodes");
>
>         nodefree = node->node_next;
>         node->node_next = NULL;
>
>         return (node);
>     }
>
> is multithread-safe?
>
> Best wishes,
> FrankH.
Casper.Dik at Sun.COM
2006-Oct-06 09:46 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
> This example is done with a single-threaded app.
> It is NOT NOT NOT intended to show any level of
> thread-safe coding.
>
> It is ONLY used to show that it is significantly lower cost
> to grab pre-allocated objects than to allocate the
> objects on demand.

Your example does not demonstrate that at all. The kernel memory allocator's magazine layer implements almost exactly the same algorithm.

I was wrong about it not grabbing a lock (it grabs the cache's per-CPU lock) but that lock is uncontended and only used on the one CPU.

So the only thing that makes your code faster is stripping out all the debugging functionality.

Casper
Frank Hofmann
2006-Oct-06 10:06 UTC
[zfs-discuss] single memory allocation in the ZFS intent log
On Thu, 5 Oct 2006, Erblichs wrote:

> Casper Dik,
>
> After my posting, I assumed that a code question should be
> directed to the ZFS code alias, so I apologize to the people
> who don't read code. However, since the discussion is here,
> I will post a code proof here. Just use "time program" to get
> a generic time frame. It is under 0.1 secs for 500k loops
> (each loop removes an obj and puts it back).
>
> It is just to be used as a proof of concept that a simple
> pre-alloc'ed set of objects can be accessed so much faster
> than allocating and assigning them.

Ok, could you please explain how this piece (and all else, for that matter):

    /*
     * Get a node structure from the freelist
     */
    struct node *
    node_getnode()
    {
        struct node *node;

        if ((node = nodefree) == NULL)  /* "shouldn't happen" */
            printf("out of nodes");

        nodefree = node->node_next;
        node->node_next = NULL;

        return (node);
    }

is multithread-safe?

Best wishes,
FrankH.
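For comparison, the minimal way to make that routine thread-safe is a single global lock, as in the hypothetical kernel-style sketch below; this is the same single-freelist-plus-mutex shape as Casper's au_get_buff example, the shape that stops scaling beyond one CPU:

    static kmutex_t nodefree_lock;      /* would be initialized at module init */

    struct node *
    node_getnode(void)
    {
        struct node *node;

        mutex_enter(&nodefree_lock);
        if ((node = nodefree) == NULL) {    /* "shouldn't happen" */
            mutex_exit(&nodefree_lock);
            return (NULL);
        }
        nodefree = node->node_next;
        mutex_exit(&nodefree_lock);

        node->node_next = NULL;
        return (node);
    }

Every caller on every CPU now serializes on nodefree_lock and bounces its cache line between CPUs, which is the contention and cache-line movement Anton describes above.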