Hi people,

As I said before, I implemented the IDCT to run on the FPGA. My friends from university implemented the Reconstruction routines on the FPGA. I'm helping with the LoopFilter, and it is almost there. (All in VHDL.)

I did a small profiling run of libTheora on an Altera Stratix II device. The processor was a NIOS II with 8 KB of data and instruction cache, branch prediction and a hardware divider (this is the most robust NIOS II version).

I decoded some frames of a 320x240 Theora stream. Decoding all frames purely in software (without the hardware modules), I got 44 ms per frame.

Of these 44 ms:
  The IDCT takes 7 ms
  The Reconstruction routines take 6 ms

As you know, the ReconRefFrames routine is the caller of the IDCT, Reconstruction and LoopFilter.

ReconRefFrames takes 31 ms of the total 44 ms.
This is more than 66% of the decoding time.

If I run libtheora without the software IDCT, using the hardware IDCT module instead, I get 46 ms of decoding time per frame. You could say this makes no sense: why would the time increase with the help of a hardware module?

The increase can be explained by two factors:

1) The overhead of data transfer on the bus is too expensive; this bus is also shared with the processor's normal memory accesses.

2) I did a sequential test: the software sends data to the IDCT, waits for the data to be ready, and reads the data back from the IDCT. That is bad, because it cuts out the hardware parallelism. But this is just a small test; the final version will have a buffer for sending to and receiving from the IDCT, without having to stop the software.

So you must account for about 2 ms of data-transfer overhead and 7 ms of IDCT processing time. We could get 5 ms less if the IDCT hardware module ran in parallel.

But the important thing to see from these numbers is this:

Even if the hardware IDCT had no data-transfer overhead, we could get only 7 ms (15%) less decoding time per frame. But if we have the whole ReconRefFrames routine in hardware, we can get 31 ms (more than 66%) less.
That would be very good: less than a third of the algorithm's CPU time would still run in software.

And better: if we have ReconRefFrames in hardware, we can send the output of the ReconRefFrames hardware module directly to the screen (without passing through software). This way, the libTheora software will just copy the data to the hardware module, and the hardware output will be sent directly to a screen buffer (another hardware module, like a video board, will present the frame on a video monitor).

So this approach needs the overhead of only 1 transfer (just send), not 2 as in my IDCT test (send and receive).

To put the whole ReconRefFrames routine in hardware I will need at least 3 big buffers:

  Current Frame
  Last Frame
  Golden Frame

For a 320x240 stream, each buffer is about 150 Kbytes, so I will need about 500 Kbytes of memory.

That is too much for the FPGA's internal memory, so I'm planning to use an external SRAM of 500 Kbytes.
SRAM data sheet: http://www.olimex.com/dev/pdf/71V416_DS_74666.pdf

Another alternative is to use a PC100 SDRAM of 16 MB:
http://download.micron.com/pdf/datasheets/dram/sdram/128MbSDRAMx32.pdf
http://www.altera.com/literature/ds/ds_sdram_ctrl.pdf

My Altera Stratix Dev. Kit has both this SRAM and this SDRAM. See:
http://www.altera.com/literature/manual/mnl_nios2_board_stratixII_2s60.pdf

Comments and suggestions are welcome.

Best Regards,
felipe

--
________________________________________
Felipe Portavales <portavales@gmail.com>
Undergraduate Student - IC-UNICAMP
Computer Systems Laboratory
http://www.lsc.ic.unicamp.br
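The timing argument in the message above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope model, not part of libtheora: the function `frame_time` and the scenario names are made up for illustration, the 2 ms transfer figure is the estimate given in the message, and the last scenario ignores the cost of the single remaining transfer to the hardware.

```python
# Per-frame timings quoted in the message above, in milliseconds.
SW_TOTAL_MS = 44.0             # full software decode
SW_IDCT_MS = 7.0               # software IDCT share
SW_RECON_REF_FRAMES_MS = 31.0  # ReconRefFrames (IDCT + Reconstruction + LoopFilter)
HW_IDCT_MEASURED_MS = 46.0     # measured with the blocking hardware IDCT test
TRANSFER_MS = 2.0              # bus-transfer overhead estimated in the message


def frame_time(offloaded_ms, transfer_ms=0.0, blocked_ms=0.0):
    """Hypothetical model: software time left after offloading, plus overheads."""
    return SW_TOTAL_MS - offloaded_ms + transfer_ms + blocked_ms


# Time the CPU sits idle waiting for the hardware IDCT in the sequential test,
# implied by the 46 ms measurement.
blocked_ms = HW_IDCT_MEASURED_MS - frame_time(SW_IDCT_MS, TRANSFER_MS)

scenarios = {
    "software only": frame_time(0.0),
    "hardware IDCT, blocking (measured)": frame_time(SW_IDCT_MS, TRANSFER_MS, blocked_ms),
    "hardware IDCT, overlapped": frame_time(SW_IDCT_MS, TRANSFER_MS),
    "hardware IDCT, free transfers": frame_time(SW_IDCT_MS),
    # Assumption: transfer cost of the single "send" is ignored here.
    "all of ReconRefFrames in hardware": frame_time(SW_RECON_REF_FRAMES_MS),
}

for name, ms in scenarios.items():
    saved = 100.0 * (SW_TOTAL_MS - ms) / SW_TOTAL_MS
    print(f"{name:36s} {ms:5.1f} ms  ({saved:+5.1f}% vs software)")
```

Under these assumptions the overlapped IDCT lands 5 ms below the software-only figure, matching the estimate in the message, while offloading all of ReconRefFrames is the only scenario that removes most of the frame time.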
On Sun, Jul 02, 2006 at 07:10:00PM -0300, Felipe Portavales Goldstein wrote:

> Even if the hardware IDCT had no data-transfer overhead,
> we could get only 7 ms (15%) less decoding time per frame.

This is what profiles of the software implementation show too: no individual component dominates, and particularly not the IDCT. I'm not surprised you see something similar. It looks like getting an efficient implementation is going to be all about carefully measuring parallelism and bus latency.

> To put the whole ReconRefFrames routine in hardware I will need at least 3
> big buffers:
>
> Current Frame
> Last Frame
> Golden Frame
>
> For a 320x240 stream, each buffer is about 150 Kbytes, so I
> will need about 500 Kbytes of memory.

Yes. Note that Andrey's encoder implementation was also bound by memory bandwidth; he said writing the memory controller was where he had to focus most of his effort.

> That is too much for the FPGA's internal memory,
> so I'm planning to use an external SRAM of 500 Kbytes.
> SRAM data sheet: http://www.olimex.com/dev/pdf/71V416_DS_74666.pdf
>
> Another alternative is to use a PC100 SDRAM of 16 MB:
> http://download.micron.com/pdf/datasheets/dram/sdram/128MbSDRAMx32.pdf
> http://www.altera.com/literature/ds/ds_sdram_ctrl.pdf

If you use the SDRAM, do you think it will be possible to get to 720x480 (or even 640x480) by the end of the summer? If there's a chance of that, I'd vote for it. A hardware decoder is most interesting at higher resolutions, and it's unlikely anyone will want to spring for enough SRAM to do that.

Thanks for the benchmark; it's good to see this. Hope your exams went well!

-r
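To put rough numbers on the SRAM versus SDRAM question, here is a small sizing sketch. It rests on assumptions not stated in the thread: 8-bit samples, 4:2:0 chroma subsampling, and an optional padding border around each plane (the 16-pixel value is picked only because it lands near the "about 150 Kbytes" quoted above). The helper names are hypothetical, and the "500 Kbytes" SRAM is taken as 512 KiB.

```python
SRAM_BYTES = 512 * 1024          # assumption: "500 Kbytes" SRAM taken as 512 KiB
SDRAM_BYTES = 16 * 1024 * 1024   # 16 MB SDRAM mentioned in the thread
NUM_BUFFERS = 3                  # Current, Last and Golden frame


def plane_bytes(width, height, border):
    """Bytes in one 8-bit plane with `border` padding pixels on every side."""
    return (width + 2 * border) * (height + 2 * border)


def frame_buffer_bytes(width, height, border=0):
    """Bytes for one frame: a luma plane plus two half-size chroma planes (4:2:0)."""
    luma = plane_bytes(width, height, border)
    chroma = plane_bytes(width // 2, height // 2, border // 2)
    return luma + 2 * chroma


for width, height in [(320, 240), (640, 480), (720, 480)]:
    one = frame_buffer_bytes(width, height, border=16)
    total = NUM_BUFFERS * one
    print(
        f"{width}x{height}: {one / 1024:7.1f} KiB per buffer, "
        f"{total / 1024:7.1f} KiB for {NUM_BUFFERS} buffers, "
        f"fits SRAM: {total <= SRAM_BYTES}, fits SDRAM: {total <= SDRAM_BYTES}"
    )
```

With these assumptions the three 320x240 buffers fit in the SRAM, while 640x480 and 720x480 need well over 1 MiB and only fit in the SDRAM. Capacity is only half the picture; memory bandwidth, as noted above, may matter more.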
On Sunday, 2 July 2006 at 19:10 -0300, Felipe Portavales Goldstein wrote:
...
> As you know, the ReconRefFrames routine is the caller of the IDCT,
> Reconstruction and LoopFilter.
>
> ReconRefFrames takes 31 ms of the total 44 ms.
> This is more than 66% of the decoding time.
...

Where is the other 34% spent (various routines?)?

Personally, I also think "enough memory for 720x480" is the way to go. (Are there other options than SDRAM in this case?)

Rodolphe
On Mon, 3 Jul 2006, Rodolphe Ortalo wrote:

> On Sunday, 2 July 2006 at 19:10 -0300, Felipe Portavales Goldstein
> wrote:
> ...
> > As you know, the ReconRefFrames routine is the caller of the IDCT,
> > Reconstruction and LoopFilter.
> >
> > ReconRefFrames takes 31 ms of the total 44 ms.
> > This is more than 66% of the decoding time.
> ...
>
> Where is the other 34% spent (various routines?)?
>
> Personally, I also think "enough memory for 720x480" is the way to
> go. (Are there other options than SDRAM in this case?)

Without knowing the algorithm... would it be possible to push the decoded frame in parts to an overlay region? Or does VP3/Theora require computations on the complete frame space?

Stefan de Konink
On 7/3/06, Rodolphe Ortalo <rodolphe.ortalo@free.fr> wrote:

> On Sunday, 2 July 2006 at 19:10 -0300, Felipe Portavales Goldstein
> wrote:
> ...
> > As you know, the ReconRefFrames routine is the caller of the IDCT,
> > Reconstruction and LoopFilter.
> >
> > ReconRefFrames takes 31 ms of the total 44 ms.
> > This is more than 66% of the decoding time.
> ...
>
> Where is the other 34% spent (various routines?)?

Yes, in all the other decoding routines (from header decoding up to just before the IDCT).

> Personally, I also think "enough memory for 720x480" is the way to
> go. (Are there other options than SDRAM in this case?)

We will work on this alternative.

> Rodolphe

--
________________________________________
Felipe Portavales <portavales@gmail.com>
Undergraduate Student - IC-UNICAMP
Computer Systems Laboratory
http://www.lsc.ic.unicamp.br
We need to store the entire Last Frame and the entire Golden Frame, because of the Reconstruction routines.

On 7/3/06, Stefan de Konink <skinkie@xs4all.nl> wrote:

> On Mon, 3 Jul 2006, Rodolphe Ortalo wrote:
>
> > On Sunday, 2 July 2006 at 19:10 -0300, Felipe Portavales Goldstein
> > wrote:
> > ...
> > > As you know, the ReconRefFrames routine is the caller of the IDCT,
> > > Reconstruction and LoopFilter.
> > >
> > > ReconRefFrames takes 31 ms of the total 44 ms.
> > > This is more than 66% of the decoding time.
> > ...
> >
> > Where is the other 34% spent (various routines?)?
> >
> > Personally, I also think "enough memory for 720x480" is the way to
> > go. (Are there other options than SDRAM in this case?)
>
> Without knowing the algorithm... would it be possible to push the decoded
> frame in parts to an overlay region? Or does VP3/Theora require
> computations on the complete frame space?
>
> Stefan de Konink

--
________________________________________
Felipe Portavales <portavales@gmail.com>
Undergraduate Student - IC-UNICAMP
Computer Systems Laboratory
http://www.lsc.ic.unicamp.br
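To illustrate why reconstruction needs whole reference frames, here is a deliberately simplified sketch of motion-compensated prediction. It is not the libtheora code: `predict_block` is a hypothetical helper, it handles whole-pixel motion vectors only, and clamping at the frame edge is a simplification chosen for this example. The point is that a predicted block may read pixels from an area of the Last or Golden frame that straddles several blocks, so the reference cannot be discarded block by block.

```python
def predict_block(reference, x, y, mv_x, mv_y, size=8):
    """Copy a size x size block from `reference`, displaced by (mv_x, mv_y).

    `reference` is a list of rows of pixel values. Coordinates outside the
    frame are clamped to the nearest edge pixel (a simplification).
    """
    height = len(reference)
    width = len(reference[0])
    block = []
    for row in range(size):
        src_y = min(max(y + mv_y + row, 0), height - 1)
        line = []
        for col in range(size):
            src_x = min(max(x + mv_x + col, 0), width - 1)
            line.append(reference[src_y][src_x])
        block.append(line)
    return block


# Toy 32x32 reference frame whose pixel value encodes its own position.
reference = [[y * 100 + x for x in range(32)] for y in range(32)]

# Predict the block at (8, 8) with a motion vector of (-5, +3).
block = predict_block(reference, x=8, y=8, mv_x=-5, mv_y=3)

# The source area covers columns 3..10 and rows 11..18, so it overlaps four
# different 8x8 blocks of the reference frame.
print(block[0][0], block[7][7])
```

Because the motion vector is only known once the current block is decoded, any part of the reference frame may be needed, which is consistent with keeping the full Last and Golden frames in memory.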