thr3ads.net - theora dev - [theora-dev] What goes to Hardware ? [Jul 2006]

Felipe Portavales Goldstein

2006-Jul-02 15:10 UTC

[theora-dev] What goes to Hardware ?

Hi people,

As I said before, I implemented the IDCT on the FPGA.

My friends from university implemented the Reconstruction routines on the FPGA.
I'm helping with the LoopFilter, and it is almost done.

(All in VHDL.)



I did some brief profiling of libtheora running on an Altera Stratix II device.

The processor was a NIOS II with 8Kb of data and instruction
cache, branch prediction and a hardware divider (this is the most
robust NIOS II version).

I decoded some frames of a 320x240 Theora stream.

Decoding all frames in software only (without the hardware modules),
I got 44 ms per frame.
Of these 44 ms:
    The IDCT takes 7 ms
    The Reconstruction routines take 6 ms


As you know, the ReconRefFrames routine is the caller of the IDCT,
Reconstruction and LoopFilter.

ReconRefFrames takes 31 ms of the total 44 ms.
This is more than 66% of the decoding time.



If I run libtheora without the software IDCT, using the hardware IDCT
module instead,
I get 46 ms of decoding time per frame.

You could say that this makes no sense: why would the time increase
with the help of a hardware module?

The increase of time can be explained by two factors:

1) The overhead of data transfer on the bus is too expensive. This bus
is also shared with the processor's normal memory accesses.

2) I did a sequential test: the software sends data to the IDCT, waits
for the result to be ready, and reads the data back from the IDCT.
This is bad, because it removes the hardware parallelism.
But this is just a small test. The final version will have a buffer to
send data to and receive data from the IDCT without stopping the software.

So you must allow for about 2 ms of data-transfer overhead and 7 ms
of IDCT processing time.
We could get 5 ms less than the software-only time if the IDCT hardware
module ran in parallel with the software.
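
The arithmetic behind these figures can be sketched as follows. This is only a rough model built from the numbers above, under two assumptions that are not measurements: the hardware IDCT keeps the software waiting for about the same 7 ms in the sequential test, and a buffered version would hide that wait completely. The names are made up for illustration.

```python
# Figures from the profile above, in milliseconds per frame.
SOFTWARE_TOTAL_MS = 44.0  # full decode in software only
IDCT_MS = 7.0             # IDCT share; assumed equal to the hardware wait time
TRANSFER_MS = 2.0         # estimated bus-transfer overhead


def frame_time_ms(overlapped: bool) -> float:
    """Estimated per-frame decode time when the IDCT runs in hardware.

    overlapped=False models the sequential test (send, wait, read back).
    overlapped=True assumes the software never stalls on the IDCT.
    """
    other_software_ms = SOFTWARE_TOTAL_MS - IDCT_MS
    idct_wait_ms = 0.0 if overlapped else IDCT_MS
    return other_software_ms + TRANSFER_MS + idct_wait_ms


print(frame_time_ms(overlapped=False))  # 46.0, the measured sequential result
print(frame_time_ms(overlapped=True))   # 39.0, i.e. 5 ms below software only
```

With full overlap the transfer overhead is still paid, which is why the estimated gain is 5 ms and not 7 ms.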


But the important thing to see from these numbers is this:

Even if the hardware IDCT had no data-transfer overhead,
we could save only 7 ms (15%) of decoding time per frame.

But if we have the whole ReconRefFrames routine in hardware, we can
save 31 ms (66%).
That would be very good: only 33% of the algorithm's CPU time would be
running in software.
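
As a quick check of this comparison, here is a minimal sketch that computes the best-case gain from moving a part of the decoder to hardware. It assumes the offloaded part costs nothing at all (no transfer overhead, no waiting), so the results are upper bounds. The function names are illustrative only.

```python
def remaining_ms(total_ms: float, offloaded_ms: float) -> float:
    """Software time left per frame if the offloaded part becomes free."""
    return total_ms - offloaded_ms


def max_speedup(total_ms: float, offloaded_ms: float) -> float:
    """Upper bound on the speedup from offloading that part."""
    return total_ms / remaining_ms(total_ms, offloaded_ms)


TOTAL_MS = 44.0  # software-only decode time per frame, from the profile

for name, part_ms in (("IDCT only", 7.0), ("ReconRefFrames", 31.0)):
    saved_pct = 100.0 * part_ms / TOTAL_MS
    print(f"{name}: saves {saved_pct:.1f}%, "
          f"at most {max_speedup(TOTAL_MS, part_ms):.2f}x faster")
# IDCT only: saves 15.9%, at most 1.19x faster
# ReconRefFrames: saves 70.5%, at most 3.38x faster
```

The exact ratios are a little more favourable than the rounded 66% and 33% quoted in this message: 31 of 44 ms is about 70%, which leaves about 30% in software.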

Even better:
if we have ReconRefFrames in hardware, we can send the output of
the ReconRefFrames hardware module directly to the screen (without
passing through software).

This way, the libtheora software will just copy the data to the
hardware module, and the hardware output will be sent directly to a
screen buffer (another hardware module, like a video board, will present
the frame on a video monitor).

So this approach needs the overhead of only 1 transfer (just send),
not 2 as in my IDCT test (send and receive).



To put the whole ReconRefFrames routine in hardware I will need at least 3
big buffers:

Current Frame
Last Frame
Golden Frame

On a 320x240 stream, each buffer takes about 150 Kbytes, so I
will need about 500 Kbytes of memory.
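
A rough way to estimate these buffer sizes is sketched below. It assumes 8-bit samples in 4:2:0 format (chroma planes at half width and half height) and, as a further assumption, a 16-pixel border around the luma plane with half that around each chroma plane. The border value is a guess used to show how padding moves the estimate toward the figure above. It is not taken from this message.

```python
def frame_bytes(width: int, height: int, luma_border: int = 0) -> int:
    """Bytes for one 8-bit 4:2:0 frame, with an optional border per plane."""
    chroma_border = luma_border // 2
    luma = (width + 2 * luma_border) * (height + 2 * luma_border)
    chroma = ((width // 2 + 2 * chroma_border)
              * (height // 2 + 2 * chroma_border))
    return luma + 2 * chroma  # one luma plane plus two chroma planes


N_BUFFERS = 3  # Current Frame, Last Frame, Golden Frame

for width, height in ((320, 240), (640, 480), (720, 480)):
    per_buffer = frame_bytes(width, height, luma_border=16)
    total = N_BUFFERS * per_buffer
    print(f"{width}x{height}: {per_buffer / 1024:.0f} KiB per buffer, "
          f"{total / 1024:.0f} KiB for {N_BUFFERS} buffers")
```

Under these assumptions a 320x240 frame takes 115,200 bytes unpadded and 143,616 bytes with the border, in line with the rough 150 Kbyte estimate. The two larger sizes are included only to show how the requirement grows with resolution.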

That is too much for the FPGA's internal memory.
So I'm planning to use an external SRAM of 500 Kbytes.
SRAM data sheet: http://www.olimex.com/dev/pdf/71V416_DS_74666.pdf

Another alternative is to use a PC100 SDRAM of 16 MB (128 Mbit):
http://download.micron.com/pdf/datasheets/dram/sdram/128MbSDRAMx32.pdf
http://www.altera.com/literature/ds/ds_sdram_ctrl.pdf


My Altera Stratix Dev. Kit has this SRAM and SDRAM.
See:
http://www.altera.com/literature/manual/mnl_nios2_board_stratixII_2s60.pdf



Comments and suggestions are welcome.


Best Regards,
felipe




-- 
________________________________________
Felipe Portavales <portavales@gmail.com>
Undergraduate Student - IC-UNICAMP
Computer Systems Laboratory
http://www.lsc.ic.unicamp.br

Ralph Giles

2006-Jul-02 18:39 UTC

[theora-dev] What goes to Hardware ?

On Sun, Jul 02, 2006 at 07:10:00PM -0300, Felipe Portavales Goldstein wrote:
> Even if the hardware IDCT had no data transfer overhead,
> we could get only 7 ms (15%) less of decoding time per frame.
This is what profiles of the software implementation show too: no
individual component dominates, and particularly not the IDCT. I'm
not surprised you see something similar. It looks like getting an
efficient implementation is going to be all about carefully measuring
parallelism and bus latency.
> To put all ReconRefFrames routine in hardware I will need at least 3
> big buffers:
> 
> Current Frame
> Last Frame
> Golden Frame
> 
> On a 320x240 stream, it represent about 150 Kbyte of each buffer, so I
> will need about 500 Kbytes of memory.
Yes. Note that Andrey's encoder implementation was also bound by memory 
bandwidth; he said writing the memory controller was where he had to 
focus most of his effort.
> It is too much to use FPGA internal memory.
> So I'm planning use a external SRAM of 500Kbytes.
> SRAM data sheet: http://www.olimex.com/dev/pdf/71V416_DS_74666.pdf
> 
> Another alternative is to use a PC100 SDRAM of 16 Mb:
> http://download.micron.com/pdf/datasheets/dram/sdram/128MbSDRAMx32.pdf
> http://www.altera.com/literature/ds/ds_sdram_ctrl.pdf
If you use the SDRAM do you think it will be possible to get to 720x480 
(or even 640x480) by the end of the summer? If there's a chance
for that I'd vote for that. A hardware decoder is most interesting at 
higher resolution, and it's unlikely anyone will want to spring for 
enough SRAM to do that.

Thanks for the benchmark; it's good to see this. Hope your exams went 
well!

 -r

Rodolphe Ortalo

2006-Jul-03 10:36 UTC

[theora-dev] What goes to Hardware ?

On Sunday, 2 July 2006 at 19:10 -0300, Felipe Portavales Goldstein
wrote:
> ...
> As you know, the ReconRefFrames routine is the caller of the IDCT,
> Reconstruction and LoopFilter.
> 
> The ReconRefFrames wastes 31 ms from the total 44 ms.
> This is more than 66% of the decoding time.
> ...

Where is the other 34% spent (various routines?)?

Personally, I also think "enough memory for 720x480" is the way to
go. (Are there other options than SDRAM in this case?)

Rodolphe

Stefan de Konink

2006-Jul-03 10:45 UTC

[theora-dev] What goes to Hardware ?

On Mon, 3 Jul 2006, Rodolphe Ortalo wrote:
> On Sunday, 2 July 2006 at 19:10 -0300, Felipe Portavales Goldstein
> wrote:
> ...
> > As you know, the ReconRefFrames routine is the caller of the IDCT,
> > Reconstruction and LoopFilter.
> >
> > The ReconRefFrames wastes 31 ms from the total 44 ms.
> > This is more than 66% of the decoding time.
> ...
>
> Where are the other 34% spent (various routines?)?
>
> Personnally, I think too the "enough memory for 720x480" is the way to
> go. (Are there other options than SDRAM in this case?)
Without knowing the algorithm: would it be possible to push the decoded
frame in parts to an overlay region? Or does VP3/Theora require
computations on the complete frame space?

Stefan de Konink

Felipe Portavales Goldstein

2006-Jul-03 12:51 UTC

[theora-dev] What goes to Hardware ?

On 7/3/06, Rodolphe Ortalo <rodolphe.ortalo@free.fr> wrote:
> On Sunday, 2 July 2006 at 19:10 -0300, Felipe Portavales Goldstein
> wrote:
> ...
> > As you know, the ReconRefFrames routine is the caller of the IDCT,
> > Reconstruction and LoopFilter.
> >
> > The ReconRefFrames wastes 31 ms from the total 44 ms.
> > This is more than 66% of the decoding time.
> ...
>
> Where are the other 34% spent (various routines?)?
Yes: in all the other decoding routines (from header decoding up to
just before the IDCT).
>
> Personnally, I think too the "enough memory for 720x480" is the way to
> go. (Are there other options than SDRAM in this case?)
We will work on this alternative.
>
> Rodolphe

Felipe Portavales Goldstein

2006-Jul-03 12:52 UTC

[theora-dev] What goes to Hardware ?

We need to store the entire Last Frame and the entire Golden Frame,
because of the Reconstruction routines.


On 7/3/06, Stefan de Konink <skinkie@xs4all.nl> wrote:
> On Mon, 3 Jul 2006, Rodolphe Ortalo wrote:
>
> > On Sunday, 2 July 2006 at 19:10 -0300, Felipe Portavales Goldstein
> > wrote:
> > ...
> > > As you know, the ReconRefFrames routine is the caller of the IDCT,
> > > Reconstruction and LoopFilter.
> > >
> > > The ReconRefFrames wastes 31 ms from the total 44 ms.
> > > This is more than 66% of the decoding time.
> > ...
> >
> > Where are the other 34% spent (various routines?)?
> >
> > Personnally, I think too the "enough memory for 720x480" is the way to
> > go. (Are there other options than SDRAM in this case?)
>
> Without knowing the algorithm... would it be possible to push the decoded
> frame in parts to an overlay region. Or does VP3/Theora require
> computations on the complete frame space?
>
> Stefan de Konink
