Luca Barbieri
2010-Jan-18 16:27 UTC
[Nouveau] [Mesa3d-dev] [PATCH] glsl: put varyings in texcoord slots
So, basically, you allocate the rasterizer units according to the vertex shader, and when the fragment shader comes up, you say "write rasterizer output 4 to fragment input 1000000"?

The current nouveau drivers can't do this. There are "routing" registers in hardware, but I think the nVidia proprietary driver (at least without GLSL) leaves them unaltered after initialization, and I don't think we really know how they would work. They are also very likely limited to at most 256 values (maybe even fewer, such as 16), even if they can actually be made to work.

The way the current pre-nv50 driver works is that there are 8 slots, each of which has an interpolator, a fixed associated vertex shader output and a fixed fragment input. This seems a rather obvious way to design hardware, and so shouldn't be uncommon.

Thus, the inputs/outputs can't be packed, because that will break if the fragment shader doesn't use a vertex output. And there is no way to correct that when the fragment program comes up, other than recompiling the vertex shader, which we would very much like to avoid.

Non-GLSL programs can only use the 8 texcoords, so there is no problem there since the hardware supports 8 slots.

Thus, I think my proposed solution is the simplest and most efficient approach. Any other solution would require much more, and slower, code in the Gallium drivers for nv30, nv40, and maybe Intel too.
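The packing problem described in this message can be illustrated with a small sketch. It is not driver code: `fixed_slot`, `packed_slot` and `NUM_SLOTS` are hypothetical names, and the only fact taken from the message is that the pre-nv50 hardware has 8 slots with fixed vertex-output/fragment-input pairing.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_SLOTS 8  /* 8 interpolator slots, as stated in the message */

/* Fixed assignment: varying i always lives in slot i, so a vertex shader
 * and a fragment shader compiled separately agree without any remapping. */
static int fixed_slot(int varying)
{
    assert(varying >= 0 && varying < NUM_SLOTS);
    return varying;
}

/* Packed assignment (hypothetical): slot = number of used varyings with a
 * lower index. The result depends on the usage set of the shader that is
 * being compiled. Returns -1 if the shader does not use the varying. */
static int packed_slot(const bool used[NUM_SLOTS], int varying)
{
    int slot = 0;

    assert(varying >= 0 && varying < NUM_SLOTS);
    if (!used[varying])
        return -1;
    for (int i = 0; i < varying; i++)
        if (used[i])
            slot++;
    return slot;
}
```

If a vertex shader writes varyings 0, 1 and 2 while the fragment shader reads only 0 and 2, packing puts varying 2 in slot 2 on the vertex side but in slot 1 on the fragment side, so the two no longer line up without routing or a recompile. With the fixed assignment both sides use slot 2.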
Corbin Simpson
2010-Jan-18 19:41 UTC
[Nouveau] [Mesa3d-dev] [PATCH] glsl: put varyings in texcoord slots
Actually, we don't even bother worrying about the rasterizer's routing table until we've bound a pair of shaders and start drawing. Right before the draw call, we re-generate, among other things, routing tables for the vert shader and the rasterizer. This is *incredibly* powerful, because it means we only have to compile the shaders once, and load the rasterizer tables based on those shaders. I even baked up a CSO to cache the tables, but it turned out to be an overall slowdown.

If you get this patch in, then you'll still have to fight with every other state tracker that doesn't prettify their TGSI. It would be a much better approach to attempt to RE the routing tables.

Also FYI the r300-r500 rasterizer can only handle, off the top of my head, 16 sets of vectors total (8 colors, 8 texcoords) so you're not the only ones with this kind of limitation. The situation gets better for r600 and nv50.

~ C.

--
Only fools are easily impressed by what is only barely beyond their reach. ~ Unknown

Corbin Simpson <MostAwesomeDude at gmail.com>
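A rough sketch of the draw-time approach described in this message, under stated assumptions: `struct shader_iface`, `build_routing`, `MAX_VARYINGS` and `UNROUTED` are all hypothetical and do not correspond to the r300 driver's real data structures. The point is only that the table is derived from the two already-compiled shaders, so neither needs recompiling.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_VARYINGS 16    /* hypothetical table size */
#define UNROUTED     0xff  /* marks "no register assigned" */

/* Hypothetical description of a compiled shader's interface: for each
 * varying semantic index, the register the shader was compiled to use,
 * or UNROUTED if the shader does not touch that varying. */
struct shader_iface {
    uint8_t reg[MAX_VARYINGS];
};

/* Build a table mapping each vertex-shader output register to the
 * fragment-shader input register that consumes it. Meant to run right
 * before a draw, once both shaders are bound. Returns the number of
 * varyings that were routed. */
static int build_routing(const struct shader_iface *vs,
                         const struct shader_iface *fs,
                         uint8_t route[MAX_VARYINGS])
{
    int routed = 0;

    for (int r = 0; r < MAX_VARYINGS; r++)
        route[r] = UNROUTED;

    for (int sem = 0; sem < MAX_VARYINGS; sem++) {
        uint8_t out = vs->reg[sem];
        uint8_t in = fs->reg[sem];

        if (out == UNROUTED || in == UNROUTED)
            continue;  /* written but never read, or read but never written */
        assert(out < MAX_VARYINGS);
        route[out] = in;
        routed++;
    }
    return routed;
}
```

With a vertex shader that writes semantics 0, 1 and 2 into registers 0, 1 and 2, and a fragment shader that reads semantics 0 and 2 from registers 0 and 1, the table routes output register 2 to input register 1 and leaves output register 1 unrouted.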
Luca Barbieri
2010-Jan-18 20:06 UTC
[Nouveau] [Mesa3d-dev] [PATCH] glsl: put varyings in texcoord slots
> If you get this patch in, then you'll still have to fight with every
> other state tracker that doesn't prettify their TGSI. It would be a
> much better approach to attempt to RE the routing tables.

I don't think there are any users of the Gallium interface that need more than 8 vertex outputs/fragment inputs and don't use sequential values starting at 0, except the GLSL linker without this patch.

ARB_fragment_program and ARB_vertex_program are limited to texcoord slots, and Mesa should advertise only 8 of them. Also, users of this interface will likely only use as many as they need, sequentially. Vega and xorg seem to only use up to 2 slots; g3dvl uses up to 8 (starting from 0, of course).

Cards with fewer than 8 slots may sometimes still have problems, but such cards will probably be DX8 cards that don't work anyway.

Furthermore, even if you can route things, using vertex outputs and fragment inputs with lower indices may be more efficient anyway.

As for REing the tables, it may not be possible. This is the code that apparently sets them up right now:

    /* vtxprog output routing */
    so_method(so, screen->curie, 0x1fc4, 1);
    so_data  (so, 0x06144321);
    so_method(so, screen->curie, 0x1fc8, 2);
    so_data  (so, 0xedcba987);
    so_data  (so, 0x00000021);
    so_method(so, screen->curie, 0x1fd0, 1);
    so_data  (so, 0x00171615);
    so_method(so, screen->curie, 0x1fd4, 1);
    so_data  (so, 0x001b1a19);

This makes me think that only 4 bits might be used for the values (look at the arithmetic progressions of 4-bit values), so that there is a limit of 16 vertex outputs/fragment inputs. If GLSL starts at index 10, we are still in trouble because fewer than 8 varyings will be available.

Also, leaving vertex outputs/fragment inputs unused by starting at high values may be bad for performance even if supported, as it may lead to a bigger register file and thus fewer simultaneous GPU threads running.
In other words, having GLSL start at index 10 is easily avoided, and causes problems nothing else causes, so why not just stop doing that?
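The "arithmetic progressions of 4-bit values" in the register data quoted above can be checked mechanically. The sketch below only splits the words into nibbles; `nibble` and `nibbles_ascend` are hypothetical helper names, and nothing here shows how the hardware actually interprets these fields.

```c
#include <assert.h>
#include <stdint.h>

/* Return the 4-bit field `index` (0 = least significant) of a 32-bit word. */
static unsigned nibble(uint32_t word, unsigned index)
{
    assert(index < 8);
    return (word >> (4 * index)) & 0xfu;
}

/* Return 1 if nibbles first..last of `word` each increase by exactly 1,
 * 0 otherwise. */
static int nibbles_ascend(uint32_t word, unsigned first, unsigned last)
{
    assert(first <= last && last < 8);
    for (unsigned i = first; i < last; i++)
        if (nibble(word, i + 1) != nibble(word, i) + 1)
            return 0;
    return 1;
}
```

Read from the least significant nibble, 0x06144321 starts 1, 2, 3, 4 and 0xedcba987 runs 7 through 0xe without a gap, which is the pattern the 4-bit guess rests on. The last two words (0x00171615 and 0x001b1a19) instead step by one per byte, so the field width may not be the same for every register.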