On Sun, 12 May 2019, Halil Pasic wrote:
> I've also got code that deals with AIRQ_IV_CACHELINE by turning the
> kmem_cache into a dma_pool.
>
> Cornelia, Sebastian which approach do you prefer:
> 1) get rid of cio_dma_pool and AIRQ_IV_CACHELINE, and waste a page per
> vector, or
> 2) go with the approach taken by the patch below?

We only have a couple of users for airq_iv:

virtio_ccw.c:               2K bits

pci with floating IRQs:     <= 2K (for the per-function bit vectors)
                            1..4K (for the summary bit vector)

pci with CPU directed IRQs: 2K (for the per-CPU bit vectors)
                            1..nr_cpu (for the summary bit vector)

The options are:
* page allocations for everything
* dma_pool for AIRQ_IV_CACHELINE, gen_pool for others
* dma_pool for everything

I think we should do option 3 and use a dma_pool with cachesize
alignment for everything (as a prerequisite we have to limit
config PCI_NR_FUNCTIONS to 2K - but that is not a real constraint).

Sebastian
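For concreteness, here is a minimal sketch of what option 3 (one dma_pool with cacheline alignment for everything) could look like. A cache line on s390 is 256 bytes, i.e. 2K bits, so with PCI_NR_FUNCTIONS capped at 2K every bit vector listed above fits into a single block. The pool name, the init helper, and the device argument are illustrative assumptions, not existing code.

#include <linux/dmapool.h>
#include <linux/device.h>
#include <linux/cache.h>
#include <linux/errno.h>

/* Hypothetical: a single pool backing every airq_iv bit vector. */
static struct dma_pool *airq_iv_pool;

static int airq_iv_pool_init(struct device *cio_dev)
{
	/*
	 * Block size == alignment == one cache line (256 bytes on s390,
	 * i.e. 2K bits).  With PCI_NR_FUNCTIONS limited to 2K, the
	 * per-function, per-CPU and summary vectors each fit in one block.
	 */
	airq_iv_pool = dma_pool_create("airq_iv", cio_dev,
				       L1_CACHE_BYTES, L1_CACHE_BYTES, 0);
	return airq_iv_pool ? 0 : -ENOMEM;
}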
On Thu, 16 May 2019 15:59:22 +0200 (CEST)
Sebastian Ott <sebott at linux.ibm.com> wrote:

> On Sun, 12 May 2019, Halil Pasic wrote:
> > I've also got code that deals with AIRQ_IV_CACHELINE by turning the
> > kmem_cache into a dma_pool.
> >
> > Cornelia, Sebastian which approach do you prefer:
> > 1) get rid of cio_dma_pool and AIRQ_IV_CACHELINE, and waste a page per
> > vector, or
> > 2) go with the approach taken by the patch below?
>
> We only have a couple of users for airq_iv:
>
> virtio_ccw.c: 2K bits

You mean a single allocation is 2k bits (VIRTIO_IV_BITS = 256 * 8)? My
understanding is that the upper bound is more like:
MAX_AIRQ_AREAS * VIRTIO_IV_BITS = 20 * 256 * 8 = 40960 bits.

In practice it is most likely just 2k.

>
> pci with floating IRQs: <= 2K (for the per-function bit vectors)
>                         1..4K (for the summary bit vector)
>

As far as I can tell with virtio_pci arch_setup_msi_irqs() gets called
once per device and allocates a small number of bits (2 and 3 in my
test, it may depend on #virtqueues, but I did not check).

So for an upper bound we would have to multiply with the upper bound of
pci devices/functions. What is the upper bound on the number of
functions?

> pci with CPU directed IRQs: 2K (for the per-CPU bit vectors)
>                             1..nr_cpu (for the summary bit vector)
>

I guess this is the same.

>
> The options are:
> * page allocations for everything

Worst case we need 20 + #max_pci_dev pages. At the moment we allocate
from ZONE_DMA (!) and waste a lot.

> * dma_pool for AIRQ_IV_CACHELINE, gen_pool for others

I prefer this. Explanation follows.

> * dma_pool for everything
>

Less waste by a factor of 16.

> I think we should do option 3 and use a dma_pool with cachesize
> alignment for everything (as a prerequisite we have to limit
> config PCI_NR_FUNCTIONS to 2K - but that is not a real constraint).
>

I prefer option 3 because it is conceptually the smallest change, and
provides the behavior which is closest to the current one.

Commit 414cbd1e3d14 ("s390/airq: provide cacheline aligned ivs",
Sebastian Ott, 2019-02-27) could have been smaller had you implemented
'kmem_cache for everything' (and I would have had just to replace
kmem_cache with a dma_pool to achieve option 3). For some reason you
decided to keep the
	iv->vector = kzalloc(size, GFP_KERNEL)
code path and make the client code request
	iv->vector = kmem_cache_zalloc(airq_iv_cache, GFP_KERNEL)
explicitly, using a flag which you only decided to use for directed PCI
IRQs AFAICT.

My understanding of these decisions, and especially of the rationale
behind commit 414cbd1e3d14, is limited. Thus if option 3 is the way to
go, and the choices made by 414cbd1e3d14 were sub-optimal, I would feel
much more comfortable if you provided a patch that revises and switches
everything to kmem_cache. I would then just swap kmem_cache out for a
dma_pool and my change would end up a straightforward and relatively
clean one.

So Sebastian, what shall we do?

Regards,
Halil

> Sebastian
>
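Applied to the allocation path quoted above, the swap Halil describes would amount to something like the following sketch. airq_iv_pool is the hypothetical pool from the previous sketch and the helper names are made up for illustration; the rest of airq_iv_create() (flags, lock, and bookkeeping setup) is left out.

#include <linux/dmapool.h>
#include <linux/gfp.h>

/* Hypothetical pool, created as in the earlier sketch. */
extern struct dma_pool *airq_iv_pool;

/*
 * One allocation path replacing both the kzalloc() and the
 * kmem_cache_zalloc() branches: DMA-capable, zeroed, and cache-line
 * aligned.  The returned dma handle would be stored next to the
 * vector so it can be handed to the hardware registration code.
 */
static unsigned long *airq_iv_alloc_vector(dma_addr_t *dma)
{
	return dma_pool_zalloc(airq_iv_pool, GFP_KERNEL, dma);
}

static void airq_iv_free_vector(unsigned long *vector, dma_addr_t dma)
{
	dma_pool_free(airq_iv_pool, vector, dma);
}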
On 20.05.19 14:13, Halil Pasic wrote:
> On Thu, 16 May 2019 15:59:22 +0200 (CEST)
> Sebastian Ott <sebott at linux.ibm.com> wrote:
>
>> On Sun, 12 May 2019, Halil Pasic wrote:
>>> I've also got code that deals with AIRQ_IV_CACHELINE by turning the
>>> kmem_cache into a dma_pool.
>>>
>>> Cornelia, Sebastian which approach do you prefer:
>>> 1) get rid of cio_dma_pool and AIRQ_IV_CACHELINE, and waste a page per
>>> vector, or
>>> 2) go with the approach taken by the patch below?
>>
>> We only have a couple of users for airq_iv:
>>
>> virtio_ccw.c: 2K bits
>
> You mean a single allocation is 2k bits (VIRTIO_IV_BITS = 256 * 8)? My
> understanding is that the upper bound is more like:
> MAX_AIRQ_AREAS * VIRTIO_IV_BITS = 20 * 256 * 8 = 40960 bits.
>
> In practice it is most likely just 2k.
>
>>
>> pci with floating IRQs: <= 2K (for the per-function bit vectors)
>>                         1..4K (for the summary bit vector)
>>
>
> As far as I can tell with virtio_pci arch_setup_msi_irqs() gets called
> once per device and allocates a small number of bits (2 and 3 in my
> test, it may depend on #virtqueues, but I did not check).
>
> So for an upper bound we would have to multiply with the upper bound of
> pci devices/functions. What is the upper bound on the number of
> functions?
>
>> pci with CPU directed IRQs: 2K (for the per-CPU bit vectors)
>>                             1..nr_cpu (for the summary bit vector)
>>
>
> I guess this is the same.
>
>>
>> The options are:
>> * page allocations for everything
>
> Worst case we need 20 + #max_pci_dev pages. At the moment we allocate
> from ZONE_DMA (!) and waste a lot.
>
>> * dma_pool for AIRQ_IV_CACHELINE, gen_pool for others
>
> I prefer this. Explanation follows.
>
>> * dma_pool for everything
>>
>
> Less waste by a factor of 16.
>
>> I think we should do option 3 and use a dma_pool with cachesize
>> alignment for everything (as a prerequisite we have to limit
>> config PCI_NR_FUNCTIONS to 2K - but that is not a real constraint).
>>
>
> I prefer option 3 because it is conceptually the smallest change, and
> provides the behavior which is closest to the current one.
>
> Commit 414cbd1e3d14 ("s390/airq: provide cacheline aligned ivs",
> Sebastian Ott, 2019-02-27) could have been smaller had you implemented
> 'kmem_cache for everything' (and I would have had just to replace
> kmem_cache with a dma_pool to achieve option 3). For some reason you
> decided to keep the
> 	iv->vector = kzalloc(size, GFP_KERNEL)
> code path and make the client code request
> 	iv->vector = kmem_cache_zalloc(airq_iv_cache, GFP_KERNEL)
> explicitly, using a flag which you only decided to use for directed PCI
> IRQs AFAICT.
>
> My understanding of these decisions, and especially of the rationale
> behind commit 414cbd1e3d14, is limited. Thus if option 3 is the way to
> go, and the choices made by 414cbd1e3d14 were sub-optimal, I would feel
> much more comfortable if you provided a patch that revises and switches
> everything to kmem_cache. I would then just swap kmem_cache out for a
> dma_pool and my change would end up a straightforward and relatively
> clean one.
>
> So Sebastian, what shall we do?
>
> Regards,
> Halil
>
>> Sebastian
>>

Folks, I had a version running with slight changes to the initial v1
patch set together with a revert of 414cbd1e3d14 ("s390/airq: provide
cacheline aligned ivs"). That of course has the drawback of the wasteful
memory usage pattern. Now you are discussing some substantial changes.

The exercise was to get initial working code through the door. We
really need a decision!

Michael
On Mon, 20 May 2019, Halil Pasic wrote:
> On Thu, 16 May 2019 15:59:22 +0200 (CEST)
> Sebastian Ott <sebott at linux.ibm.com> wrote:
> > We only have a couple of users for airq_iv:
> >
> > virtio_ccw.c: 2K bits
>
> You mean a single allocation is 2k bits (VIRTIO_IV_BITS = 256 * 8)? My
> understanding is that the upper bound is more like:
> MAX_AIRQ_AREAS * VIRTIO_IV_BITS = 20 * 256 * 8 = 40960 bits.
>
> In practice it is most likely just 2k.
>
> >
> > pci with floating IRQs: <= 2K (for the per-function bit vectors)
> >                         1..4K (for the summary bit vector)
> >
>
> As far as I can tell with virtio_pci arch_setup_msi_irqs() gets called
> once per device and allocates a small number of bits (2 and 3 in my
> test, it may depend on #virtqueues, but I did not check).
>
> So for an upper bound we would have to multiply with the upper bound of
> pci devices/functions. What is the upper bound on the number of
> functions?
>
> > pci with CPU directed IRQs: 2K (for the per-CPU bit vectors)
> >                             1..nr_cpu (for the summary bit vector)
> >
>
> I guess this is the same.
>
> >
> > The options are:
> > * page allocations for everything
>
> Worst case we need 20 + #max_pci_dev pages. At the moment we allocate
> from ZONE_DMA (!) and waste a lot.
>
> > * dma_pool for AIRQ_IV_CACHELINE, gen_pool for others
>
> I prefer this. Explanation follows.
>
> > * dma_pool for everything
> >
>
> Less waste by a factor of 16.
>
> > I think we should do option 3 and use a dma_pool with cachesize
> > alignment for everything (as a prerequisite we have to limit
> > config PCI_NR_FUNCTIONS to 2K - but that is not a real constraint).
>
> I prefer option 3 because it is conceptually the smallest change, and
                  ^ 2
> provides the behavior which is closest to the current one.

I can see that this is the smallest change on top of the current
implementation. I'm good with doing that and looking for further
simplification/unification later.

> Commit 414cbd1e3d14 ("s390/airq: provide cacheline aligned ivs",
> Sebastian Ott, 2019-02-27) could have been smaller had you implemented
> 'kmem_cache for everything' (and I would have had just to replace
> kmem_cache with a dma_pool to achieve option 3). For some reason you
> decided to keep the
> 	iv->vector = kzalloc(size, GFP_KERNEL)
> code path and make the client code request
> 	iv->vector = kmem_cache_zalloc(airq_iv_cache, GFP_KERNEL)
> explicitly, using a flag which you only decided to use for directed PCI
> IRQs AFAICT.
>
> My understanding of these decisions, and especially of the rationale
> behind commit 414cbd1e3d14, is limited.

I introduced per-CPU interrupt vectors and wanted to prevent two CPUs
from sharing data from the same cacheline. No other user of the airq
stuff had this need. If I had been aware of the additional complexity
we would add on top of that, maybe I would have made a different
decision.
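For reference, the rationale Sebastian gives boils down to a slab set up roughly like the sketch below (not the literal contents of 414cbd1e3d14; the init helper is an assumption). cache_line_size() is 256 bytes on s390, so each per-CPU vector gets its own cache line and two CPUs never write to the same line.

#include <linux/slab.h>
#include <linux/cache.h>
#include <linux/init.h>
#include <linux/errno.h>

static struct kmem_cache *airq_iv_cache;

/*
 * Cache-line sized and aligned objects: per-CPU bit vectors allocated
 * from this cache cannot share a cache line, which avoids false sharing
 * between CPUs polling their own vectors.
 */
static int __init airq_iv_cache_init(void)
{
	airq_iv_cache = kmem_cache_create("airq_iv_cache",
					  cache_line_size(),
					  cache_line_size(), 0, NULL);
	return airq_iv_cache ? 0 : -ENOMEM;
}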