It seems that Nouveau is assuming that once the FIFO pointer is past a command, that command has finished executing, and all the buffers it used are no longer needed.

However, this seems to be false, at least on G71. In particular, the card may not even have finished reading the input vertex buffers when the pushbuffer "fence" triggers. While Mesa does not reuse the buffer object itself, the current allocator tends to return memory that has just been freed, resulting in the buffer actually being reused. Thus Mesa will overwrite the vertices before the GPU has used them.

This results in all kinds of artifacts, such as vertices going to infinity and random polygons appearing. This can be seen in progs/demos/engine, progs/demos/dinoshade, Blender, Extreme Tux Racer and probably any non-trivial OpenGL software.

The problem can be significantly reduced by just adding a waiting loop at the end of draw_arrays and draw_elements, or by synchronizing drawing by adding the following function and calling it instead of pipe->flush in nv40_vbo.c. I think the remaining artifacts may be due to missing 2D engine synchronization, but I'm not sure how that works. Note that this causes the CPU to wait for rendering, which is not the correct solution.

    static void nv40_sync(struct nv40_context *nv40)
    {
        nouveau_notifier_reset(nv40->screen->sync, 0);

        /* Disabled: setting these like the nVidia driver does locks up my GPU. */
        // BEGIN_RING(curie, 0x1d6c, 1);
        // OUT_RING(0x5c0);

        // static int value = 0x23;
        // BEGIN_RING(curie, 0x1d70, 1);
        // OUT_RING(value++);

        /* Request a notification; it seems to need a NOP right after it. */
        BEGIN_RING(curie, NV40TCL_NOTIFY, 1);
        OUT_RING(0);

        BEGIN_RING(curie, NV40TCL_NOP, 1);
        OUT_RING(0);

        FIRE_RING(NULL);

        /* Block the CPU until the notifier is signalled. */
        nouveau_notifier_wait_status(nv40->screen->sync, 0, 0, 0);
    }

It seems that NV40TCL_NOTIFY (which must be followed by a NOP for some reason) triggers a notification of rendering completion. Furthermore, the card will probably put the value set with 0x1d70 somewhere; the use of 0x1d6c is unknown. The nVidia driver uses 0x1d70/0x1d6c frequently, with 0x1d70 being a sequence number and 0x1d6c always set to 0x5c0, while NV40TCL_NOTIFY seems to be inserted on demand. On my machine, setting 0x1d6c/0x1d70 like the nVidia driver does causes a GPU lockup. That is probably because the location where the GPU is supposed to put the value has not been set up correctly.

So it seems that the current model is wrong, and the current fence should only be used to determine whether the pushbuffer itself can be reused. It seems that, after figuring out where the GPU writes the value and how to use the mechanism properly, this should be used by the kernel driver as the bo->sync_obj implementation. This will delay destruction of the buffers, and thus prevent their reallocation and the resulting artifacts, without synchronizing rendering.

I'm not sure why this hasn't been noticed before, though. Is everyone getting randomly misrendered OpenGL, or is my machine somehow more prone to reusing buffers?

What do you think? Is the analysis correct?
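To make the proposal above a bit more concrete (delay the destruction of a buffer until the GPU is known to be done with it, rather than making the CPU wait), here is a minimal sketch in plain C. It is only a simulation under stated assumptions, not Nouveau or kernel code: `struct sim_bo`, `sim_bo_release`, `sim_reap` and `seq_reached` are hypothetical names invented for this example, and it assumes that completion is reported as a monotonically increasing 32-bit sequence number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a buffer object; not the real nouveau_bo. */
struct sim_bo {
    uint32_t last_use_seq;    /* sequence number of the last command using it */
    bool     release_pending; /* dropped by the user, memory not yet reusable */
    bool     memory_free;     /* memory may be handed out again */
};

/* Wraparound-safe "a has reached b" for 32-bit sequence numbers.
 * Relies on the usual two's-complement conversion to int32_t. */
bool seq_reached(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

/* The user frees the buffer: free at once only if the GPU has passed it,
 * otherwise remember that a release is pending. */
void sim_bo_release(struct sim_bo *bo, uint32_t completed_seq)
{
    assert(bo != NULL);
    if (seq_reached(completed_seq, bo->last_use_seq))
        bo->memory_free = true;
    else
        bo->release_pending = true;
}

/* Called whenever a newer completed sequence number is observed.
 * Returns how many pending buffers became free. */
size_t sim_reap(struct sim_bo *bos, size_t n, uint32_t completed_seq)
{
    size_t freed = 0;

    assert(bos != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (bos[i].release_pending &&
            seq_reached(completed_seq, bos[i].last_use_seq)) {
            bos[i].release_pending = false;
            bos[i].memory_free = true;
            freed++;
        }
    }
    return freed;
}
```

In this model a buffer last used by command 5 and released while the GPU has only completed command 3 stays unavailable to the allocator until `sim_reap` sees a completed sequence of 5 or later, so nothing ever blocks on the CPU side.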
Hi,

Luca Barbieri <luca at luca-barbieri.com> writes:

> It seems that Nouveau is assuming that once the FIFO pointer is past a
> command, that command has finished executing, and all the buffers it
> used are no longer needed.
>
> However, this seems to be false at least on G71.
> In particular, the card may not have even finished reading the input
> vertex buffers when the pushbuffer "fence" triggers.
> While Mesa does not reuse the buffer object itself, the current
> allocator tends to return memory that has just been freed, resulting
> in the buffer actually being reused.
> Thus Mesa will overwrite the vertices before the GPU has used them.
>
> This results in all kinds of artifacts, such as vertices going to
> infinity, and random polygons appearing.
> This can be seen in progs/demos/engine, progs/demos/dinoshade,
> Blender, Extreme Tux Racer and probably any non-trivial OpenGL
> software.
>

Can you reproduce this with your vertex buffers in VRAM instead of GART? (to rule out that it's a fencing issue).

> The problem can be significantly reduced by just adding a waiting loop
> at the end of draw_arrays and draw_elements, or by synchronizing
> drawing by adding and calling the following function instead of
> pipe->flush in nv40_vbo.c:
> I think the remaining artifacts may be due to missing 2D engine
> synchronization, but I'm not sure how that works.
I figured out the registers.

There is a fence/sync mechanism which apparently triggers after rendering is finished. There are two ways to use it, but they trigger at the same time (when spinning in a loop on the CPU checking them, they trigger in the same iteration or in two successive iterations).

The first is the "sync" notifier, which involves a notifier object set at NV40TCL_DMA_NOTIFY. When NV40TCL_NOTIFY with argument 0, followed by NV40TCL_NOP with argument 0, is inserted in the ring, the notifier object will be notified when rendering is finished. fbcon uses this to sync rendering. Currently the Mesa driver sets an object but does not use it. The renouveau traces use this mechanism only in the EXT_framebuffer_object tests. It's not clear what the purpose of the NOP is, but it seems necessary.

The second is the fence mechanism, which involves an object set at NV40TCL_DMA_FENCE. When register 0x1d70 is set, the value set there will be written to the object at the offset programmed in 0x1d6c. The offset in 0x1d6c must be 16-byte aligned, but the GPU seems to write only 4 bytes with the sequence number. Nouveau does not use this currently, and sets NV40TCL_DMA_FENCE to 0. The nVidia driver uses this often. It allocates a 4KB object and asks the GPU to always put the sequence number at offset 0x5c0. Why it does this rather than allocating a 16-byte object and using offset 0 is unknown.

IMHO the fence mechanism should be implemented in the kernel along with the current FIFO fencing, and should protect the relocated buffer object.
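The fence mechanism described above can be modelled on the CPU side roughly as follows. This is a hedged sketch in plain C that only simulates the behaviour as observed in this thread: `struct sim_fence` and the `sim_*` functions are hypothetical names, the "GPU write" is an ordinary memcpy, and none of this is actual Nouveau or hardware programming code. The 4KB size and the 0x5c0 offset are the values the nVidia driver was seen to use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SIM_FENCE_OBJ_SIZE 4096u  /* the nVidia driver allocates a 4KB object */
#define SIM_FENCE_OFFSET   0x5c0u /* offset the nVidia driver puts in 0x1d6c */

/* Hypothetical model of the fence state; not real hardware state. */
struct sim_fence {
    uint8_t  obj[SIM_FENCE_OBJ_SIZE]; /* memory behind NV40TCL_DMA_FENCE */
    uint32_t offset;                  /* value programmed in 0x1d6c */
};

/* Models programming 0x1d6c: the offset must be 16-byte aligned and the
 * 4-byte sequence number must fit inside the object. */
bool sim_fence_set_offset(struct sim_fence *f, uint32_t offset)
{
    if ((offset & 0xfu) != 0 || offset > SIM_FENCE_OBJ_SIZE - 4u)
        return false;
    f->offset = offset;
    return true;
}

/* Models what the GPU appears to do once rendering has finished for a
 * value written to 0x1d70: store that 4-byte value at the offset. */
void sim_gpu_signal(struct sim_fence *f, uint32_t seq)
{
    assert(f->offset <= SIM_FENCE_OBJ_SIZE - 4u);
    memcpy(&f->obj[f->offset], &seq, sizeof(seq));
}

/* CPU side: read back the last sequence number that was written. */
uint32_t sim_fence_read(const struct sim_fence *f)
{
    uint32_t seq;

    memcpy(&seq, &f->obj[f->offset], sizeof(seq));
    return seq;
}

/* CPU side: has the work tagged with `seq` completed? Wraparound-safe,
 * relying on the usual two's-complement conversion to int32_t. */
bool sim_fence_passed(const struct sim_fence *f, uint32_t seq)
{
    return (int32_t)(sim_fence_read(f) - seq) >= 0;
}
```

A driver built on this idea would tag each buffer with the sequence number of the last command that references it and consult something like `sim_fence_passed` before letting the memory be reused, instead of relying on the FIFO pointer alone.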
Luca Barbieri writes:

> I'm not sure why this hasn't been noticed before though.
> Is everyone getting randomly misrendered OpenGL or is my machine
> somehow more prone to reusing buffers?

I reported a similar problem about 2 weeks ago. It first became apparent with NV40, but I also confirmed it with NV30; in both cases it was visible in the morph3d demo. As long as nothing changes in memory allocation, everything is fine. If I so much as move a window (which causes some allocations in the system), the vertices become damaged.

Some information from that previous email:

""
I see this problem in the morph3d demo. What it does is: for each frame, create a call list and then call it 4 times.

ADDR  VRAM OFFSET
A     X
B     Y
C     X

A, B, C are the memory offsets of the 32 KB buffers created for the vertex buffer when call lists are compiled.
X, Y are the VRAM offsets (bo.mem.mm_node.start).

The first buffer is created (X,A). When it gets full (after around 3 frames), the second buffer is created (Y,B). Then the first one is freed. When the second buffer is full, the third is created (X,C). Here the problem starts: according to my observations, the card seems to read vertices not from address C but from address A, as if it somehow remembered the initial address binding.

Other observations:
- the data written during execution of GL commands actually seems to be put into location C
- when I switch to the software path, I can track down that it reads data from location C
- rendering is done correctly in the software path
- when I comment out the freeing of the memory manager node (bo.mem.mm_node), so that the third buffer is Z,C (paired with a not yet used VRAM offset), hardware rendering behaves correctly, but this will make the card "run out" of memory as no memory manager nodes will be deallocated
- when I replace the glCallList calls with the actual rendering code and disable the invocation of glNewList/glEndList, hardware rendering also behaves correctly
""

Best regards,
Krzysztof
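The buffer sequence reported above (X, then Y, then X again once the first buffer has been freed) is what one would expect from an allocator that hands out the lowest free range first. The following is a minimal sketch in plain C of that behaviour only. It assumes a first-fit policy over equally sized slots; `struct sim_vram`, `sim_vram_alloc` and `sim_vram_free` are hypothetical names, and the real memory manager may well behave differently.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_SLOTS 4 /* hypothetical number of equally sized VRAM slots */

/* Hypothetical first-fit slot allocator; not the real memory manager. */
struct sim_vram {
    bool used[SIM_SLOTS];
};

/* Returns the lowest free slot index, or -1 if every slot is in use. */
int sim_vram_alloc(struct sim_vram *v)
{
    for (int i = 0; i < SIM_SLOTS; i++) {
        if (!v->used[i]) {
            v->used[i] = true;
            return i;
        }
    }
    return -1;
}

/* Marks a slot as free again; it becomes the next candidate for reuse
 * if it is the lowest free slot. */
void sim_vram_free(struct sim_vram *v, int slot)
{
    assert(slot >= 0 && slot < SIM_SLOTS && v->used[slot]);
    v->used[slot] = false;
}
```

With this model, allocating twice, freeing the first slot and allocating again returns the first slot once more, which matches the X, Y, X pattern in the table. Skipping the free, as in the workaround described above, would instead yield a fresh slot (the Z case) until the slots run out.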