On Wed, Apr 01, 2015 at 03:58:58PM -0400, Waiman Long wrote:
> On 04/01/2015 02:48 PM, Peter Zijlstra wrote:
> I am sorry that I don't quite get what you mean here. My point is that in
> the hashing step, a cpu will need to scan for an empty bucket to put the
> lock in. In the interim, a previously used bucket before the empty one may
> get freed. In the lookup step for that lock, the scanning will stop because
> of an empty bucket in front of the target one.

Right, that's broken. So we need to do something else to limit the
lookup, because without that break a lookup needs to iterate the
entire array in order to determine -ENOENT, which is expensive.

So my alternative proposal is that IFF we can guarantee that every
lookup will succeed -- that is, the entry we're looking for is always
there -- we don't need the break on empty but can probe until we find
the entry. This will be bounded in cost by the same number of probes we
required for insertion, and it avoids the full array scan.

Now I think we can indeed do this if, as said earlier, we do not clear
the bucket on insert when the cmpxchg succeeds. In that case the unlock
will observe _Q_SLOW_VAL and do the lookup, and the lookup will then
find the entry. We then need the unlock to clear the entry.

Does that explain this? Or should I try again with code?
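The "probe until found, no break on empty" idea above can be sketched in
a few lines of C. This is only a single-threaded illustration of the
probing rule, not the patch under discussion: the table size, the
names sketch_hash() and sketch_find_and_clear(), the hash function and
the `cpu` payload are all assumptions made for the example, and a real
version would claim buckets with cmpxchg.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 8u	/* hypothetical table size, must be a power of two */

struct bucket {
	void *lock;	/* NULL means the bucket is free */
	int cpu;	/* hypothetical payload: the CPU waiting on this lock */
};

static struct bucket table[NBUCKETS];

/* Hypothetical hash: drop the low address bits, mask to the table size. */
static unsigned int hash_ptr(const void *p)
{
	return (unsigned int)(((uintptr_t)p >> 4) & (NBUCKETS - 1));
}

/* Insert: probe forward from the home bucket to the first free slot. */
static struct bucket *sketch_hash(void *lock, int cpu)
{
	unsigned int i, h = hash_ptr(lock);

	for (i = 0; i < NBUCKETS; i++) {
		struct bucket *b = &table[(h + i) & (NBUCKETS - 1)];

		if (!b->lock) {
			b->lock = lock;
			b->cpu = cpu;
			return b;
		}
	}
	return NULL;	/* table full */
}

/*
 * Lookup: deliberately does NOT stop at an empty bucket. The caller
 * guarantees the entry exists, so the loop ends after at most as many
 * probes as the insertion needed. The entry is cleared once found.
 */
static int sketch_find_and_clear(void *lock)
{
	unsigned int i, h = hash_ptr(lock);

	for (i = 0; i < NBUCKETS; i++) {
		struct bucket *b = &table[(h + i) & (NBUCKETS - 1)];

		if (b->lock == lock) {
			int cpu = b->cpu;

			b->lock = NULL;
			return cpu;
		}
	}
	assert(!"lookup invariant violated: entry must be present");
	return -1;
}
```

With two locks that share a home bucket, freeing the first one leaves
an empty bucket in front of the second; the lookup above still finds
the second because it keeps probing past the hole.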
On 04/01/2015 05:03 PM, Peter Zijlstra wrote:
> On Wed, Apr 01, 2015 at 03:58:58PM -0400, Waiman Long wrote:
>> On 04/01/2015 02:48 PM, Peter Zijlstra wrote:
>> I am sorry that I don't quite get what you mean here. My point is that in
>> the hashing step, a cpu will need to scan for an empty bucket to put the
>> lock in. In the interim, a previously used bucket before the empty one may
>> get freed. In the lookup step for that lock, the scanning will stop because
>> of an empty bucket in front of the target one.
> Right, that's broken. So we need to do something else to limit the
> lookup, because without that break a lookup needs to iterate the
> entire array in order to determine -ENOENT, which is expensive.
>
> So my alternative proposal is that IFF we can guarantee that every
> lookup will succeed -- that is, the entry we're looking for is always
> there -- we don't need the break on empty but can probe until we find
> the entry. This will be bounded in cost by the same number of probes we
> required for insertion, and it avoids the full array scan.
>
> Now I think we can indeed do this if, as said earlier, we do not clear
> the bucket on insert when the cmpxchg succeeds. In that case the unlock
> will observe _Q_SLOW_VAL and do the lookup, and the lookup will then
> find the entry. We then need the unlock to clear the entry.
>
> Does that explain this? Or should I try again with code?

OK, I got your proposal now. However, there is still the issue that setting
the _Q_SLOW_VAL flag and the hash bucket are not atomic wrt each other. It
is possible that a CPU has set the _Q_SLOW_VAL flag but has not yet filled
in the hash bucket while another one is trying to look for it. So we need
to have some kind of synchronization mechanism to let the lookup CPU know
when it is a good time to do the lookup.

One possibility is to delay setting _Q_SLOW_VAL until the hash bucket is set
up. Maybe we can make that work.

Cheers,
Longman
On Thu, Apr 02, 2015 at 12:28:30PM -0400, Waiman Long wrote:
> On 04/01/2015 05:03 PM, Peter Zijlstra wrote:
> >On Wed, Apr 01, 2015 at 03:58:58PM -0400, Waiman Long wrote:
> >>On 04/01/2015 02:48 PM, Peter Zijlstra wrote:
> >>I am sorry that I don't quite get what you mean here. My point is that in
> >>the hashing step, a cpu will need to scan for an empty bucket to put the
> >>lock in. In the interim, a previously used bucket before the empty one may
> >>get freed. In the lookup step for that lock, the scanning will stop because
> >>of an empty bucket in front of the target one.
> >Right, that's broken. So we need to do something else to limit the
> >lookup, because without that break a lookup needs to iterate the
> >entire array in order to determine -ENOENT, which is expensive.
> >
> >So my alternative proposal is that IFF we can guarantee that every
> >lookup will succeed -- that is, the entry we're looking for is always
> >there -- we don't need the break on empty but can probe until we find
> >the entry. This will be bounded in cost by the same number of probes we
> >required for insertion, and it avoids the full array scan.
> >
> >Now I think we can indeed do this if, as said earlier, we do not clear
> >the bucket on insert when the cmpxchg succeeds. In that case the unlock
> >will observe _Q_SLOW_VAL and do the lookup, and the lookup will then
> >find the entry. We then need the unlock to clear the entry.
> >
> >Does that explain this? Or should I try again with code?
>
> OK, I got your proposal now. However, there is still the issue that setting
> the _Q_SLOW_VAL flag and the hash bucket are not atomic wrt each other.

So? They're strictly ordered, that's sufficient. We first hash the lock,
then we set _Q_SLOW_VAL. There's a full memory barrier between them.

> It
> is possible that a CPU has set the _Q_SLOW_VAL flag but has not yet filled
> in the hash bucket while another one is trying to look for it.

Nope. The unlock side does an xchg() on the locked value first. xchg also
implies a full barrier, so that guarantees that if we observe _Q_SLOW_VAL
we must also observe the hash bucket with the lock value.

> So we need to have
> some kind of synchronization mechanism to let the lookup CPU know when it
> is a good time to do the lookup.

No, it's all already ordered and working.

pv_wait_head():

	pv_hash()
	/* MB as per cmpxchg */
	cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);

VS

__pv_queue_spin_unlock():

	if (xchg(&l->locked, 0) != _Q_SLOW_VAL)
		return;

	/* MB as per xchg */
	pv_hash_find(lock);
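
For readers who want to experiment with the ordering argument outside
the kernel, here is a rough userspace analogue using C11 atomics. It is
a sketch under stated assumptions, not the kernel code: a single global
slot stands in for the hash table, the names sketch_wait_head() and
sketch_unlock() and the numeric values of Q_LOCKED_VAL and Q_SLOW_VAL
are made up for the example, and the default seq_cst read-modify-write
operations are used as stand-ins for the full barriers implied by the
kernel's cmpxchg() and xchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Values assumed for illustration only. */
enum { Q_LOCKED_VAL = 1, Q_SLOW_VAL = 3 };

struct sketch_lock {
	atomic_int locked;
};

/* One-slot stand-in for the hash table (hypothetical). */
static _Atomic(struct sketch_lock *) hashed;
static atomic_int hashed_cpu;

/*
 * Waiter side: publish the hash entry first, then flip the lock byte.
 * The seq_cst compare-exchange orders the two stores above it before
 * the _Q_SLOW_VAL store becomes visible.
 * Returns 1 if the slow flag was set, 0 if the lock was no longer held.
 */
static int sketch_wait_head(struct sketch_lock *l, int cpu)
{
	int old = Q_LOCKED_VAL;

	atomic_store_explicit(&hashed_cpu, cpu, memory_order_relaxed);
	atomic_store_explicit(&hashed, l, memory_order_relaxed);

	if (!atomic_compare_exchange_strong(&l->locked, &old, Q_SLOW_VAL)) {
		/* cmpxchg failed: nobody will look us up, undo the entry. */
		atomic_store_explicit(&hashed, NULL, memory_order_relaxed);
		return 0;
	}
	return 1;
}

/*
 * Unlock side: exchange first, look up second. If the exchange returns
 * Q_SLOW_VAL it read the value written by the compare-exchange above,
 * so the hash entry published before it must be visible here.
 * Returns the CPU to wake, or -1 when there was no slow-path waiter.
 */
static int sketch_unlock(struct sketch_lock *l)
{
	if (atomic_exchange(&l->locked, 0) != Q_SLOW_VAL)
		return -1;

	assert(atomic_load_explicit(&hashed, memory_order_relaxed) == l);
	atomic_store_explicit(&hashed, NULL, memory_order_relaxed);
	return atomic_load_explicit(&hashed_cpu, memory_order_relaxed);
}
```

The point of the sketch is the order of operations on each side: the
entry is written before the flag on the waiter side, and the flag is
read before the entry on the unlock side, with a fully ordered atomic
operation in between in both cases.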