I didn't file an Issue as I understand the cause and have a successful
patch that I've been running here for several days.
gitlab.freedesktop.org is RO (i.e., I can't fork/stage a PR there), so I
committed the absurdity of creating a PR on the public gitlab mirror
in order to have a URL to point to:
https://gitlab.com/freedesktop-mirror/drm/-/merge_requests/1
Copy-pasting the commit message:
nv50 IB-mode DMA crash fixes
Address the following nv50 IB-mode dma code errors:
1) nouveau_drm.h labels NV50_DMA_PUSH_MAX_LENGTH "Maximum push buffer
size". Perhaps originally true(?), but nothing suggests, guards, or
enforces the assumption, and it's not true in practice. 0x7fffff is
not the maximum possible push buffer size submitted for IB-mode DMA
transfers, nor is it the maximum legal DMA transfer size (it is three
bytes too large according to our reverse-engineering docs).
2) nouveau_gem_ioctl_pushbuf() still implicitly treats bit 23 of the
pushbuffer buffer[].length field as NOUVEAU_GEM_PUSHBUF_NO_PREFETCH,
despite previous commits having moved the flag out of the length field
at the lower levels, since it is an nv50-hardware-specific arrangement.
It also conflicts with point 1 above, as pushbuffers can be equal to or
greater than 1<<23 in practice. (Does the ioctl code in question have
reason to set or expose no_prefetch?)
3) Bits 0 and 1 of the length field are passed through and placed into
the IB entry, but they don't signify length in the DMA request. Bit 0
is labeled an unknown flag, bit 1 specifies use of the non_main
transfer counter. DMA transfers can only be whole words, despite
pushbuffers not being restricted to whole-word lengths. So there are
two ways this can bite: Setting flags we don't mean to set and not
transferring trailing bytes in incomplete words.
Changes:
This PR makes the following changes toward the single logical change
of 'Don't lock up Quadros when using CAD':
1) NV50_DMA_PUSH_MAX_LENGTH is reduced from 0x7fffff to 0x7ffffc, the
maximum value that can actually be encoded in a DMA request.
2) nouveau_gem_ioctl_pushbuf() no longer looks for
NOUVEAU_GEM_PUSHBUF_NO_PREFETCH in bit 23 of the length field, and
always sets no_prefetch to 0 in the call to nv50_dma_push() when using
IB-mode.
3) nouveau_gem_ioctl_pushbuf() specifies ((push[].length + 0x3) >> 2)
<< 2 for the DMA transfer length to enforce complete-word IB-mode
transfers that also send trailing bytes.
4) nouveau_gem_ioctl_pushbuf() splits pushbuffers longer than
NV50_DMA_PUSH_MAX_LENGTH (as calculated in 3 above) into multiple
calls to nv50_dma_push() when using nv50 IB-mode.
Although I understand the cause, the patch raises a few other
unknowns-to-me I'd like some commentary on:
1) The nv50 DMA code clearly believed push buffers could not (or
should not) exceed NV50_DMA_PUSH_MAX_LENGTH, when my analysis logging
shows that they regularly exceed this maximum by an order of magnitude
or more. Was the assumption originally true? If so, when/why did it
change? Is the more correct fix to enforce individual pushbuf size
limits?
2) Are any of the documented bit flags that 'hug' the length field
(NON_MAIN, NO_PREFETCH, etc) actually used/useful at the level of the
GEM API? Right now, two are exposed by accident (because the code
treats the length as byte-granular when the lower two bits are
actually flags), and the third is broken out explicitly, but in
practice it is only ever set by push[i].length exceeding 0x7fffff and
clobbering bit 23.
Monty