On Fri, 16 Aug 2013 23:09:06 +0000 "Nicholas A. Bellinger" <nab at linux-iscsi.org> wrote:

> From: Kent Overstreet <kmo at daterainc.com>
>
> Percpu frontend for allocating ids. With percpu allocation (that works),
> it's impossible to guarantee it will always be possible to allocate all
> nr_tags - typically, some will be stuck on a remote percpu freelist
> where the current job can't get to them.
>
> We do guarantee that it will always be possible to allocate at least
> (nr_tags / 2) tags - this is done by keeping track of which and how many
> cpus have tags on their percpu freelists. On allocation failure if
> enough cpus have tags that there could potentially be (nr_tags / 2) tags
> stuck on remote percpu freelists, we then pick a remote cpu at random to
> steal from.
>
> Note that there's no cpu hotplug notifier - we don't care, because
> steal_tags() will eventually get the down cpu's tags. We _could_ satisfy
> more allocations if we had a notifier - but we'll still meet our
> guarantees and it's absolutely not a correctness issue, so I don't think
> it's worth the extra code.
>
> ...
>
>  include/linux/idr.h |  53 +++++++++
>  lib/idr.c           | 316 +++++++++++++++++++++++++++++++++++++++++++++++++--

I don't think this should be in idr.[ch] at all. It has no
relationship with the existing code. Apart from duplicating its
functionality :(

> ...
>
> @@ -243,4 +245,55 @@ static inline int ida_get_new(struct ida *ida, int *p_id)
>
>  void __init idr_init_cache(void);
>
> +/* Percpu IDA/tag allocator */
> +
> +struct percpu_ida_cpu;
> +
> +struct percpu_ida {
> +	/*
> +	 * number of tags available to be allocated, as passed to
> +	 * percpu_ida_init()
> +	 */
> +	unsigned nr_tags;
> +
> +	struct percpu_ida_cpu __percpu *tag_cpu;
> +
> +	/*
> +	 * Bitmap of cpus that (may) have tags on their percpu freelists:
> +	 * steal_tags() uses this to decide when to steal tags, and which cpus
> +	 * to try stealing from.
> +	 *
> +	 * It's ok for a freelist to be empty when its bit is set - steal_tags()
> +	 * will just keep looking - but the bitmap _must_ be set whenever a
> +	 * percpu freelist does have tags.
> +	 */
> +	unsigned long *cpus_have_tags;

Why not cpumask_t?

> +	struct {
> +		spinlock_t lock;
> +		/*
> +		 * When we go to steal tags from another cpu (see steal_tags()),
> +		 * we want to pick a cpu at random. Cycling through them every
> +		 * time we steal is a bit easier and more or less equivalent:
> +		 */
> +		unsigned cpu_last_stolen;
> +
> +		/* For sleeping on allocation failure */
> +		wait_queue_head_t wait;
> +
> +		/*
> +		 * Global freelist - it's a stack where nr_free points to the
> +		 * top
> +		 */
> +		unsigned nr_free;
> +		unsigned *freelist;
> +	} ____cacheline_aligned_in_smp;

Why the ____cacheline_aligned_in_smp?

> +};
>
> ...
>
> +
> +/* Percpu IDA */
> +
> +/*
> + * Number of tags we move between the percpu freelist and the global freelist at
> + * a time

"between a percpu freelist" would be more accurate?

> + */
> +#define IDA_PCPU_BATCH_MOVE 32U
> +
> +/* Max size of percpu freelist, */
> +#define IDA_PCPU_SIZE ((IDA_PCPU_BATCH_MOVE * 3) / 2)
> +
> +struct percpu_ida_cpu {
> +	spinlock_t lock;
> +	unsigned nr_free;
> +	unsigned freelist[];
> +};

Data structure needs documentation. There's one of these per cpu. I
guess nr_free and freelist are clear enough. The presence of a lock
in a percpu data structure is a surprise. It's for cross-cpu stealing,
I assume?

> +static inline void move_tags(unsigned *dst, unsigned *dst_nr,
> +			     unsigned *src, unsigned *src_nr,
> +			     unsigned nr)
> +{
> +	*src_nr -= nr;
> +	memcpy(dst + *dst_nr, src + *src_nr, sizeof(unsigned) * nr);
> +	*dst_nr += nr;
> +}
> +
>
> ...
>
> +static inline void alloc_global_tags(struct percpu_ida *pool,
> +				     struct percpu_ida_cpu *tags)
> +{
> +	move_tags(tags->freelist, &tags->nr_free,
> +		  pool->freelist, &pool->nr_free,
> +		  min(pool->nr_free, IDA_PCPU_BATCH_MOVE));
> +}

Document this function?

> +static inline unsigned alloc_local_tag(struct percpu_ida *pool,
> +				       struct percpu_ida_cpu *tags)
> +{
> +	int tag = -ENOSPC;
> +
> +	spin_lock(&tags->lock);
> +	if (tags->nr_free)
> +		tag = tags->freelist[--tags->nr_free];
> +	spin_unlock(&tags->lock);
> +
> +	return tag;
> +}

I guess this one's clear enough, if the data structure relationships
are understood.

> +/**
> + * percpu_ida_alloc - allocate a tag
> + * @pool: pool to allocate from
> + * @gfp: gfp flags
> + *
> + * Returns a tag - an integer in the range [0..nr_tags) (passed to
> + * tag_pool_init()), or otherwise -ENOSPC on allocation failure.
> + *
> + * Safe to be called from interrupt context (assuming it isn't passed
> + * __GFP_WAIT, of course).
> + *
> + * Will not fail if passed __GFP_WAIT.
> + */
> +int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
> +{
> +	DEFINE_WAIT(wait);
> +	struct percpu_ida_cpu *tags;
> +	unsigned long flags;
> +	int tag;
> +
> +	local_irq_save(flags);
> +	tags = this_cpu_ptr(pool->tag_cpu);
> +
> +	/* Fastpath */
> +	tag = alloc_local_tag(pool, tags);
> +	if (likely(tag >= 0)) {
> +		local_irq_restore(flags);
> +		return tag;
> +	}
> +
> +	while (1) {
> +		spin_lock(&pool->lock);
> +
> +		/*
> +		 * prepare_to_wait() must come before steal_tags(), in case
> +		 * percpu_ida_free() on another cpu flips a bit in
> +		 * cpus_have_tags
> +		 *
> +		 * global lock held and irqs disabled, don't need percpu lock
> +		 */
> +		prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
> +
> +		if (!tags->nr_free)
> +			alloc_global_tags(pool, tags);
> +		if (!tags->nr_free)
> +			steal_tags(pool, tags);
> +
> +		if (tags->nr_free) {
> +			tag = tags->freelist[--tags->nr_free];
> +			if (tags->nr_free)
> +				set_bit(smp_processor_id(),
> +					pool->cpus_have_tags);
> +		}
> +
> +		spin_unlock(&pool->lock);
> +		local_irq_restore(flags);
> +
> +		if (tag >= 0 || !(gfp & __GFP_WAIT))
> +			break;
> +
> +		schedule();
> +
> +		local_irq_save(flags);
> +		tags = this_cpu_ptr(pool->tag_cpu);
> +	}

What guarantees that this wait will terminate?

> +	finish_wait(&pool->wait, &wait);
> +	return tag;
> +}
> +EXPORT_SYMBOL_GPL(percpu_ida_alloc);
> +
> +/**
> + * percpu_ida_free - free a tag
> + * @pool: pool @tag was allocated from
> + * @tag: a tag previously allocated with percpu_ida_alloc()
> + *
> + * Safe to be called from interrupt context.
> + */
> +void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
> +{
> +	struct percpu_ida_cpu *tags;
> +	unsigned long flags;
> +	unsigned nr_free;
> +
> +	BUG_ON(tag >= pool->nr_tags);
> +
> +	local_irq_save(flags);
> +	tags = this_cpu_ptr(pool->tag_cpu);
> +
> +	spin_lock(&tags->lock);

Why do we need this lock, btw? It's a cpu-local structure and local
irqs are disabled...

> +	tags->freelist[tags->nr_free++] = tag;
> +
> +	nr_free = tags->nr_free;
> +	spin_unlock(&tags->lock);
> +
> +	if (nr_free == 1) {
> +		set_bit(smp_processor_id(),
> +			pool->cpus_have_tags);
> +		wake_up(&pool->wait);
> +	}
> +
> +	if (nr_free == IDA_PCPU_SIZE) {
> +		spin_lock(&pool->lock);
> +
> +		/*
> +		 * Global lock held and irqs disabled, don't need percpu
> +		 * lock
> +		 */
> +		if (tags->nr_free == IDA_PCPU_SIZE) {
> +			move_tags(pool->freelist, &pool->nr_free,
> +				  tags->freelist, &tags->nr_free,
> +				  IDA_PCPU_BATCH_MOVE);
> +
> +			wake_up(&pool->wait);
> +		}
> +		spin_unlock(&pool->lock);
> +	}
> +
> +	local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL_GPL(percpu_ida_free);
>
> ...
>
> +int percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags)
> +{
> +	unsigned i, cpu, order;
> +
> +	memset(pool, 0, sizeof(*pool));
> +
> +	init_waitqueue_head(&pool->wait);
> +	spin_lock_init(&pool->lock);
> +	pool->nr_tags = nr_tags;
> +
> +	/* Guard against overflow */
> +	if (nr_tags > (unsigned) INT_MAX + 1) {
> +		pr_err("tags.c: nr_tags too large\n");

"tags.c"?

> +		return -EINVAL;
> +	}
> +
> +	order = get_order(nr_tags * sizeof(unsigned));
> +	pool->freelist = (void *) __get_free_pages(GFP_KERNEL, order);
> +	if (!pool->freelist)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < nr_tags; i++)
> +		pool->freelist[i] = i;
> +
> +	pool->nr_free = nr_tags;
> +
> +	pool->cpus_have_tags = kzalloc(BITS_TO_LONGS(nr_cpu_ids) *
> +				       sizeof(unsigned long), GFP_KERNEL);
> +	if (!pool->cpus_have_tags)
> +		goto err;
> +
> +	pool->tag_cpu = __alloc_percpu(sizeof(struct percpu_ida_cpu) +
> +				       IDA_PCPU_SIZE * sizeof(unsigned),
> +				       sizeof(unsigned));
> +	if (!pool->tag_cpu)
> +		goto err;
> +
> +	for_each_possible_cpu(cpu)
> +		spin_lock_init(&per_cpu_ptr(pool->tag_cpu, cpu)->lock);
> +
> +	return 0;
> +err:
> +	percpu_ida_destroy(pool);
> +	return -ENOMEM;
> +}
> +EXPORT_SYMBOL_GPL(percpu_ida_init);
On Tue, Aug 20, 2013 at 02:31:57PM -0700, Andrew Morton wrote:
> On Fri, 16 Aug 2013 23:09:06 +0000 "Nicholas A. Bellinger" <nab at linux-iscsi.org> wrote:
>
> > From: Kent Overstreet <kmo at daterainc.com>
> >
> > Percpu frontend for allocating ids. With percpu allocation (that works),
> > it's impossible to guarantee it will always be possible to allocate all
> > nr_tags - typically, some will be stuck on a remote percpu freelist
> > where the current job can't get to them.
> >
> > We do guarantee that it will always be possible to allocate at least
> > (nr_tags / 2) tags - this is done by keeping track of which and how many
> > cpus have tags on their percpu freelists. On allocation failure if
> > enough cpus have tags that there could potentially be (nr_tags / 2) tags
> > stuck on remote percpu freelists, we then pick a remote cpu at random to
> > steal from.
> >
> > Note that there's no cpu hotplug notifier - we don't care, because
> > steal_tags() will eventually get the down cpu's tags. We _could_ satisfy
> > more allocations if we had a notifier - but we'll still meet our
> > guarantees and it's absolutely not a correctness issue, so I don't think
> > it's worth the extra code.
> >
> > ...
> >
> >  include/linux/idr.h |  53 +++++++++
> >  lib/idr.c           | 316 +++++++++++++++++++++++++++++++++++++++++++++++++--
>
> I don't think this should be in idr.[ch] at all. It has no
> relationship with the existing code. Apart from duplicating its
> functionality :(

Well, in the full patch series it does make use of the non-percpu ida.
I'm still hoping to get the ida/idr rewrites in.
On Tue, Aug 20, 2013 at 02:31:57PM -0700, Andrew Morton wrote:
> On Fri, 16 Aug 2013 23:09:06 +0000 "Nicholas A. Bellinger" <nab at linux-iscsi.org> wrote:
> > +	/*
> > +	 * Bitmap of cpus that (may) have tags on their percpu freelists:
> > +	 * steal_tags() uses this to decide when to steal tags, and which cpus
> > +	 * to try stealing from.
> > +	 *
> > +	 * It's ok for a freelist to be empty when its bit is set - steal_tags()
> > +	 * will just keep looking - but the bitmap _must_ be set whenever a
> > +	 * percpu freelist does have tags.
> > +	 */
> > +	unsigned long *cpus_have_tags;
>
> Why not cpumask_t?

I hadn't encountered it before - looks like it's probably what I want.

I don't see any explanation for the parallel set of operations for
working on cpumasks - e.g. next_cpu()/cpumask_next(). For now I'm going
with the cpumask_* versions, is that what I want?

If you can have a look at the fixup patch that'll be most appreciated.

> > +	struct {
> > +		spinlock_t lock;
> > +		/*
> > +		 * When we go to steal tags from another cpu (see steal_tags()),
> > +		 * we want to pick a cpu at random. Cycling through them every
> > +		 * time we steal is a bit easier and more or less equivalent:
> > +		 */
> > +		unsigned cpu_last_stolen;
> > +
> > +		/* For sleeping on allocation failure */
> > +		wait_queue_head_t wait;
> > +
> > +		/*
> > +		 * Global freelist - it's a stack where nr_free points to the
> > +		 * top
> > +		 */
> > +		unsigned nr_free;
> > +		unsigned *freelist;
> > +	} ____cacheline_aligned_in_smp;
>
> Why the ____cacheline_aligned_in_smp?

It's separating the RW stuff that isn't always touched from the RO stuff
that's used on every allocation.

> > +};
> >
> > ...
> >
> > +
> > +/* Percpu IDA */
> > +
> > +/*
> > + * Number of tags we move between the percpu freelist and the global freelist at
> > + * a time
>
> "between a percpu freelist" would be more accurate?

No, because when we're stealing tags we always grab all of the remote
percpu freelist's tags - IDA_PCPU_BATCH_MOVE is only used when moving
to/from the global freelist.

> > + */
> > +#define IDA_PCPU_BATCH_MOVE 32U
> > +
> > +/* Max size of percpu freelist, */
> > +#define IDA_PCPU_SIZE ((IDA_PCPU_BATCH_MOVE * 3) / 2)
> > +
> > +struct percpu_ida_cpu {
> > +	spinlock_t lock;
> > +	unsigned nr_free;
> > +	unsigned freelist[];
> > +};
>
> Data structure needs documentation. There's one of these per cpu. I
> guess nr_free and freelist are clear enough. The presence of a lock
> in a percpu data structure is a surprise. It's for cross-cpu stealing,
> I assume?

Yeah, I'll add some comments.

> > +static inline void alloc_global_tags(struct percpu_ida *pool,
> > +				     struct percpu_ida_cpu *tags)
> > +{
> > +	move_tags(tags->freelist, &tags->nr_free,
> > +		  pool->freelist, &pool->nr_free,
> > +		  min(pool->nr_free, IDA_PCPU_BATCH_MOVE));
> > +}
>
> Document this function?

Will do

> > +	while (1) {
> > +		spin_lock(&pool->lock);
> > +
> > +		/*
> > +		 * prepare_to_wait() must come before steal_tags(), in case
> > +		 * percpu_ida_free() on another cpu flips a bit in
> > +		 * cpus_have_tags
> > +		 *
> > +		 * global lock held and irqs disabled, don't need percpu lock
> > +		 */
> > +		prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
> > +
> > +		if (!tags->nr_free)
> > +			alloc_global_tags(pool, tags);
> > +		if (!tags->nr_free)
> > +			steal_tags(pool, tags);
> > +
> > +		if (tags->nr_free) {
> > +			tag = tags->freelist[--tags->nr_free];
> > +			if (tags->nr_free)
> > +				set_bit(smp_processor_id(),
> > +					pool->cpus_have_tags);
> > +		}
> > +
> > +		spin_unlock(&pool->lock);
> > +		local_irq_restore(flags);
> > +
> > +		if (tag >= 0 || !(gfp & __GFP_WAIT))
> > +			break;
> > +
> > +		schedule();
> > +
> > +		local_irq_save(flags);
> > +		tags = this_cpu_ptr(pool->tag_cpu);
> > +	}
>
> What guarantees that this wait will terminate?

It seems fairly clear to me from the break statement a couple lines up;
if we were passed __GFP_WAIT we terminate iff we successfully allocated a
tag. If we weren't passed __GFP_WAIT we never actually sleep.

I can add a comment if you think it needs one.

> > +	finish_wait(&pool->wait, &wait);
> > +	return tag;
> > +}
> > +EXPORT_SYMBOL_GPL(percpu_ida_alloc);
> > +
> > +/**
> > + * percpu_ida_free - free a tag
> > + * @pool: pool @tag was allocated from
> > + * @tag: a tag previously allocated with percpu_ida_alloc()
> > + *
> > + * Safe to be called from interrupt context.
> > + */
> > +void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
> > +{
> > +	struct percpu_ida_cpu *tags;
> > +	unsigned long flags;
> > +	unsigned nr_free;
> > +
> > +	BUG_ON(tag >= pool->nr_tags);
> > +
> > +	local_irq_save(flags);
> > +	tags = this_cpu_ptr(pool->tag_cpu);
> > +
> > +	spin_lock(&tags->lock);
>
> Why do we need this lock, btw? It's a cpu-local structure and local
> irqs are disabled...

Tag stealing. I added a comment for the data structure explaining the
lock, do you think that suffices?

> > +	/* Guard against overflow */
> > +	if (nr_tags > (unsigned) INT_MAX + 1) {
> > +		pr_err("tags.c: nr_tags too large\n");
>
> "tags.c"?

Whoops, out of date.
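
For readers following the batch-move versus steal distinction above, here is a minimal userspace sketch (not the kernel code). It assumes a hypothetical `struct tag_stack` standing in for both the global and the percpu freelists, and hypothetical helpers `refill_from_global()` and `steal_all()`; only `move_tags()` follows the shape of the function in the patch. Locking and irq handling are omitted on purpose.

```c
#include <assert.h>
#include <string.h>

#define BATCH_MOVE 32U                     /* mirrors IDA_PCPU_BATCH_MOVE */
#define PCPU_SIZE  ((BATCH_MOVE * 3) / 2)  /* mirrors IDA_PCPU_SIZE (48) */

/* Hypothetical stand-in for a freelist: a stack, nr_free is the top. */
struct tag_stack {
	unsigned nr_free;
	unsigned *freelist;
};

/* Pop nr tags off the top of src and push them onto dst. */
static void move_tags(struct tag_stack *dst, struct tag_stack *src,
		      unsigned nr)
{
	assert(nr <= src->nr_free);
	src->nr_free -= nr;
	memcpy(dst->freelist + dst->nr_free, src->freelist + src->nr_free,
	       sizeof(unsigned) * nr);
	dst->nr_free += nr;
}

/* Global -> percpu: at most one batch, as in alloc_global_tags(). */
static void refill_from_global(struct tag_stack *global,
			       struct tag_stack *local)
{
	unsigned nr = global->nr_free < BATCH_MOVE ? global->nr_free
						   : BATCH_MOVE;

	move_tags(local, global, nr);
}

/* Remote percpu -> local percpu: stealing takes the whole remote list. */
static void steal_all(struct tag_stack *remote, struct tag_stack *local)
{
	assert(local->nr_free == 0);	/* we only steal when we are empty */
	move_tags(local, remote, remote->nr_free);
}
```

The point of the sketch is that the batch size only bounds traffic with the global freelist, while a steal empties the victim regardless of that batch size.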
Kent Overstreet
2013-Aug-28 19:55 UTC
[PATCH] percpu ida: Switch to cpumask_t, add some comments
Fixup patch, addressing Andrew's review feedback:

Signed-off-by: Kent Overstreet <kmo at daterainc.com>
---
 include/linux/idr.h |  2 +-
 lib/idr.c           | 38 +++++++++++++++++++++-----------------
 2 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index f0db12b..cdf39be 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -267,7 +267,7 @@ struct percpu_ida {
 	 * will just keep looking - but the bitmap _must_ be set whenever a
 	 * percpu freelist does have tags.
 	 */
-	unsigned long *cpus_have_tags;
+	cpumask_t cpus_have_tags;
 
 	struct {
 		spinlock_t lock;
diff --git a/lib/idr.c b/lib/idr.c
index 26495e1..15c021c 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -1178,7 +1178,13 @@ EXPORT_SYMBOL(ida_init);
 #define IDA_PCPU_SIZE ((IDA_PCPU_BATCH_MOVE * 3) / 2)
 
 struct percpu_ida_cpu {
+	/*
+	 * Even though this is percpu, we need a lock for tag stealing by remote
+	 * CPUs:
+	 */
 	spinlock_t lock;
+
+	/* nr_free/freelist form a stack of free IDs */
 	unsigned nr_free;
 	unsigned freelist[];
 };
@@ -1209,21 +1215,21 @@ static inline void steal_tags(struct percpu_ida *pool,
 	unsigned cpus_have_tags, cpu = pool->cpu_last_stolen;
 	struct percpu_ida_cpu *remote;
 
-	for (cpus_have_tags = bitmap_weight(pool->cpus_have_tags, nr_cpu_ids);
+	for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
 	     cpus_have_tags * IDA_PCPU_SIZE > pool->nr_tags / 2;
 	     cpus_have_tags--) {
-		cpu = find_next_bit(pool->cpus_have_tags, nr_cpu_ids, cpu);
+		cpu = cpumask_next(cpu, &pool->cpus_have_tags);
 
-		if (cpu == nr_cpu_ids)
-			cpu = find_first_bit(pool->cpus_have_tags, nr_cpu_ids);
+		if (cpu >= nr_cpu_ids)
+			cpu = cpumask_first(&pool->cpus_have_tags);
 
-		if (cpu == nr_cpu_ids)
+		if (cpu >= nr_cpu_ids)
 			BUG();
 
 		pool->cpu_last_stolen = cpu;
 		remote = per_cpu_ptr(pool->tag_cpu, cpu);
 
-		clear_bit(cpu, pool->cpus_have_tags);
+		cpumask_clear_cpu(cpu, &pool->cpus_have_tags);
 
 		if (remote == tags)
 			continue;
@@ -1246,6 +1252,10 @@ static inline void steal_tags(struct percpu_ida *pool,
 	}
 }
 
+/*
+ * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
+ * our percpu freelist:
+ */
 static inline void alloc_global_tags(struct percpu_ida *pool,
 				     struct percpu_ida_cpu *tags)
 {
@@ -1317,8 +1327,8 @@ int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
 		if (tags->nr_free) {
 			tag = tags->freelist[--tags->nr_free];
 			if (tags->nr_free)
-				set_bit(smp_processor_id(),
-					pool->cpus_have_tags);
+				cpumask_set_cpu(smp_processor_id(),
+						&pool->cpus_have_tags);
 		}
 
 		spin_unlock(&pool->lock);
@@ -1363,8 +1373,8 @@ void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
 	spin_unlock(&tags->lock);
 
 	if (nr_free == 1) {
-		set_bit(smp_processor_id(),
-			pool->cpus_have_tags);
+		cpumask_set_cpu(smp_processor_id(),
+				&pool->cpus_have_tags);
 		wake_up(&pool->wait);
 	}
 
@@ -1398,7 +1408,6 @@ EXPORT_SYMBOL_GPL(percpu_ida_free);
 void percpu_ida_destroy(struct percpu_ida *pool)
 {
 	free_percpu(pool->tag_cpu);
-	kfree(pool->cpus_have_tags);
 	free_pages((unsigned long) pool->freelist,
 		   get_order(pool->nr_tags * sizeof(unsigned)));
 }
@@ -1428,7 +1437,7 @@ int percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags)
 
 	/* Guard against overflow */
 	if (nr_tags > (unsigned) INT_MAX + 1) {
-		pr_err("tags.c: nr_tags too large\n");
+		pr_err("percpu_ida_init(): nr_tags too large\n");
 		return -EINVAL;
 	}
 
@@ -1442,11 +1451,6 @@ int percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags)
 
 	pool->nr_free = nr_tags;
 
-	pool->cpus_have_tags = kzalloc(BITS_TO_LONGS(nr_cpu_ids) *
-				       sizeof(unsigned long), GFP_KERNEL);
-	if (!pool->cpus_have_tags)
-		goto err;
-
 	pool->tag_cpu = __alloc_percpu(sizeof(struct percpu_ida_cpu) +
 				       IDA_PCPU_SIZE * sizeof(unsigned),
 				       sizeof(unsigned));
-- 
1.8.4.rc3
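
As an illustration of the victim selection in the steal_tags() hunk, the following is a simplified userspace sketch, not the kernel implementation. It models the cpumask as a 64-bit word and performs a single selection step rather than the full loop; `NR_CPUS`, `mask_next()`, `mask_weight()` and `pick_victim()` are hypothetical names invented for the sketch. `mask_next()` is meant to behave like cpumask_next(): it returns the next set bit strictly after its argument, or the CPU count if there is none.

```c
#include <stdint.h>

#define NR_CPUS   8U	/* hypothetical CPU count for this sketch */
#define PCPU_SIZE 48U	/* mirrors IDA_PCPU_SIZE */

/* Next set bit strictly after 'cpu', or NR_CPUS if there is none. */
static unsigned mask_next(int cpu, uint64_t mask)
{
	for (unsigned i = (unsigned)(cpu + 1); i < NR_CPUS; i++)
		if (mask & (UINT64_C(1) << i))
			return i;
	return NR_CPUS;
}

/* Number of set bits among the first NR_CPUS bits. */
static unsigned mask_weight(uint64_t mask)
{
	unsigned n = 0;

	for (unsigned i = 0; i < NR_CPUS; i++)
		n += (mask >> i) & 1;
	return n;
}

/*
 * One step of round-robin victim selection. Returns the cpu to steal
 * from, or NR_CPUS when too few cpus hold tags for nr_tags / 2 of them
 * to be stranded on percpu freelists.
 */
static unsigned pick_victim(uint64_t *mask, unsigned *last_stolen,
			    unsigned nr_tags)
{
	unsigned cpu;

	if (mask_weight(*mask) * PCPU_SIZE <= nr_tags / 2)
		return NR_CPUS;

	cpu = mask_next((int)*last_stolen, *mask);
	if (cpu >= NR_CPUS)
		cpu = mask_next(-1, *mask);	/* wrap, like cpumask_first() */

	*last_stolen = cpu;
	*mask &= ~(UINT64_C(1) << cpu);		/* like cpumask_clear_cpu() */
	return cpu;
}
```

Because the threshold check guarantees at least one bit is set before the scan starts, the wrap-around lookup always finds a cpu, which is why the patch can treat the second `cpu >= nr_cpu_ids` case as a BUG().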
On Wed, 28 Aug 2013 12:53:17 -0700 Kent Overstreet <kmo at daterainc.com> wrote:

> > > +	while (1) {
> > > +		spin_lock(&pool->lock);
> > > +
> > > +		/*
> > > +		 * prepare_to_wait() must come before steal_tags(), in case
> > > +		 * percpu_ida_free() on another cpu flips a bit in
> > > +		 * cpus_have_tags
> > > +		 *
> > > +		 * global lock held and irqs disabled, don't need percpu lock
> > > +		 */
> > > +		prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
> > > +
> > > +		if (!tags->nr_free)
> > > +			alloc_global_tags(pool, tags);
> > > +		if (!tags->nr_free)
> > > +			steal_tags(pool, tags);
> > > +
> > > +		if (tags->nr_free) {
> > > +			tag = tags->freelist[--tags->nr_free];
> > > +			if (tags->nr_free)
> > > +				set_bit(smp_processor_id(),
> > > +					pool->cpus_have_tags);
> > > +		}
> > > +
> > > +		spin_unlock(&pool->lock);
> > > +		local_irq_restore(flags);
> > > +
> > > +		if (tag >= 0 || !(gfp & __GFP_WAIT))
> > > +			break;
> > > +
> > > +		schedule();
> > > +
> > > +		local_irq_save(flags);
> > > +		tags = this_cpu_ptr(pool->tag_cpu);
> > > +	}
> >
> > What guarantees that this wait will terminate?
>
> It seems fairly clear to me from the break statement a couple lines up;
> if we were passed __GFP_WAIT we terminate iff we successfully allocated a
> tag. If we weren't passed __GFP_WAIT we never actually sleep.

OK ;) Let me rephrase. What guarantees that a tag will become
available?

If what we have here is an open-coded __GFP_NOFAIL then that is
potentially problematic.
Andrew Morton
2013-Aug-28 20:25 UTC
[PATCH] percpu ida: Switch to cpumask_t, add some comments
On Wed, 28 Aug 2013 12:55:17 -0700 Kent Overstreet <kmo at daterainc.com> wrote:

> Fixup patch, addressing Andrew's review feedback:

Looks reasonable.

>  lib/idr.c           | 38 +++++++++++++++++++++-----------------

I still don't think it should be in this file.

You say that some as-yet-unmerged patches will tie the new code into
the old ida code. But will it do it in a manner which requires that
the two reside in the same file?