Hello people,

My name is Felipe, and I sent a proposal to the Google Summer of Code whose goal is to get an FPGA embedded system decoding Theora streams in real time. It was accepted, and the mentor is Ralph Giles.

The proposal can be viewed here:
http://atlas.lsc.ic.unicamp.br/~portavales/wp-content/uploads/2006/05/soc_proposal.txt

There is also a presentation with a better division of the hardware modules:
http://svn.xiph.org/trunk/theora-fpga/doc/hard_theora.pdf

I'm working on it now, and today I did a simple implementation of the IDctSlow procedure as a VHDL module. This module runs and decodes samples correctly, but it consumes a lot of FPGA resources (logic cells, multipliers, etc.). I will optimize this module for area to get better results.

The testbench uses the GHDL tool to simulate and can be downloaded from the svn:
http://svn.xiph.org/trunk/theora-fpga/idctslow/

Just run:
$ make
$ make run
$ make compare
to see the testbench working and validating the module's output data.

This IDctSlow implementation was synthesized for the Altera Stratix II FPGA. The report is below:

------------------------------------
Analysis & Synthesis Status : Successful - Thu Jun 1 02:15:09 2006
Quartus II Version : 5.1 Build 176 10/26/2005 SJ
Revision Name : idctslow
Top-level Entity Name : IDctSlow
Family : Stratix II
Total combinational functions : 13782
Total registers : 3451
Total pins : 54
Total virtual pins : 0
Total memory bits : 2,048
DSP block 9-bit elements : 230
Total PLLs : 0
Total DLLs : 0
------------------------------------

These numbers are not good. In this first version I'm using a RAM like an array, accessing it at any time without restriction. But this infers flip-flops for each memory position, plus big muxes to control them. So, to solve this problem, I will use a synchronous memory model, which will infer block RAMs (specialized FPGA blocks). These are like small SRAMs inside the FPGA chip. I think that by using them, the area can drop to 3% to 5% of the Stratix FPGA slices.
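
A minimal sketch of what such a synchronous memory model might look like in VHDL. The entity name, port names, and generic values are hypothetical and are not taken from the theora-fpga sources. The idea is that every access goes through one address port on a clock edge, a pattern that synthesis tools typically map to block RAM rather than flip-flops and muxes.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical single-port synchronous RAM (illustrative names and sizes).
entity sync_ram is
  generic (
    DATA_WIDTH : natural := 16;  -- assumed word width
    ADDR_WIDTH : natural := 6    -- assumed depth: 2**6 = 64 words
  );
  port (
    clk  : in  std_logic;
    we   : in  std_logic;
    addr : in  unsigned(ADDR_WIDTH - 1 downto 0);
    din  : in  std_logic_vector(DATA_WIDTH - 1 downto 0);
    dout : out std_logic_vector(DATA_WIDTH - 1 downto 0)
  );
end entity sync_ram;

architecture rtl of sync_ram is
  type ram_t is array (0 to 2**ADDR_WIDTH - 1)
    of std_logic_vector(DATA_WIDTH - 1 downto 0);
  signal ram : ram_t;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- One write at most per clock, through the single address port.
      if we = '1' then
        ram(to_integer(addr)) <= din;
      end if;
      -- Registered read: the data appears one clock after the address.
      dout <= ram(to_integer(addr));
    end if;
  end process;
end architecture rtl;
```

The cost of this style is that only one word can be read or written per clock, so logic that currently touches several array positions in the same cycle would have to be sequenced over several cycles.
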
The 3% to 5% figure is an estimate from looking at other detailed synthesis reports.

I'm also using a lot of multipliers to do all the calculations in just one clock cycle (this is easier), but to save multipliers I can break the operations into several clock cycles and reuse the same multiplier across them.

Now I'm working on these optimizations.

Bye
--felipe

-- 
________________________________________
Felipe Portavales <portavales@gmail.com>
Undergraduate Student - IC-UNICAMP
Computer Systems Laboratory
http://www.lsc.ic.unicamp.br