Leonardo de Paula Rosa Piga
2008-Feb-28 04:42 UTC
[theora-dev] Multi-thread Theora Decoder
Hi all, Does Theora Community have an interest in a multi-thread decoder implementation? I'm starting to work with multi-thread and I thought that Theora Decoder is a good choice for me, because I had been working with it in a FPGA implementation and I have experience with the library. I'm thinking in working with LoopFilter at first. Do you think I could start with it or there is a better choice? Thanks -- Leonardo de Paula Rosa Piga Undergraduate Computer Engineering Student LSC - IC - UNICAMP http://www.students.ic.unicamp.br/~ra033956
Leonardo de Paula Rosa Piga wrote:> Hi all, > > Does Theora Community have an interest in a multi-thread decoder implementation? > > I'm starting to work with multi-thread and I thought that Theora > Decoder is a good choice for me, because I had been working with it in > a FPGA implementation and I have experience with the library. > > I'm thinking in working with LoopFilter at first. Do you think I could > start with it or there is a better choice?Keep in mind that the current decoder is a completely different code base than the one you based your FPGA implementation on. This one re-organizes the decode process to be much more cache-friendly, which makes it scale much better to larger resolutions (in addition to a number of other algorithmic optimizations). Part of this means the decode process is pipelined, so that, for example, the loop filter is only applied to a few rows of blocks at a time, while the data for those blocks is in cache, instead of being applied to the entire frame at once. There were some early experiments with a multi-threaded decoder in svn: http://svn.xiph.org/experimental/giles/theora-exp-mt/ The basic idea was that after coefficient unpacking, each color plane could be decoded completely independently (inverse DC prediction, dequantization, iDCT, MC, and loop filter). This obtained a moderate speed-up (I forget the exact numbers, search the mailing list archives), but the difference in the workload between luma and the two chroma planes limited the potential gains. gcc 4.2 includes support for OpenMP (http://en.wikipedia.org/wiki/OpenMP). I have been meaning to take a look at this and see if it is lightweight enough for finer-grain parallelism, e.g., of the individual loops during the decode process. Some of these are easy to parallelize further (e.g., copying uncoded blocks), but most of the steps within a single color plane have a dependency on the previous rows. There may be some opportunity to break these dependencies with additional overhead, but it's unclear if the trade-off is worth it (this requires experimentation). The loop filter definitely falls into this category, as the precise output is extremely dependent on the order in which it is applied to each edge, since all of the edges overlap. The ability to specify parallelism with a few #pragma statements in OpenMP is, however, appealing, as this kind of code is hard to get right, and this would be much easier than maintaining a separate, multi-threaded version of the code. It also provides a portable means of adapting to the number of cores, etc., available.