[NCLUG] Re: parallel processing users?

John L. Bass jbass at dmsd.com
Tue Oct 18 16:38:29 MDT 2005


Matt <rosing at peakfive.com> writes:
>John wrote:
>
> > Actually, it's quite rational to do so, but it takes developing a compiler
> > that will reduce C/Fortran into several different types of netlists to fit
> > the problems into FPGAs. The key is that it takes a pretty stiff relearning
> > of the basic principles of architecture and machine design, as what
> > "everybody knows" is the right or only way to build machines quickly leads
> > you down the wrong path for this class of machines.
>
>My brief experience with using FPGAs and C is that C doesn't seem like
>the right tool.  In order to get anything to work well I had to think
>in terms of data flow and placing/connecting functional units.  That's
>not C.  But verilog is too low a level if you're talking about 1000s
>of lines of code and scientists that are more interested in science
>than circuit design.

Yep, it can be that way, and the choice of C compiler makes a huge difference
in what you can do, AND how you can debug. Handel-C (Celoxica) is just
Verilog with C syntax for the most part, so even when writing near-C you
are writing specifically to create gates, data paths, and state machines.
Use any of the extensions geared to optimize for hardware fit, and you
can no longer compile and test with a traditional compiler on traditional
cpus.
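
For illustration, here is a rough sketch of the kind of "near C" you end up
writing (the FIR example is hypothetical, not from any real Celoxica project):
plain C that maps cleanly onto a datapath, with comments marking where the
Handel-C extensions would creep in and cut off the path back to gcc.

    /* Hypothetical 4-tap FIR stage written as "near C".  A gate-oriented
     * compiler turns the loop body into a datapath and the loop control
     * into a small state machine. */
    static const int taps[4] = { 1, 3, 3, 1 };

    int fir4(const int window[4])
    {
        int acc = 0;
        /* In Handel-C you would wrap the four multiplies in a par { }
         * block so they happen in one clock cycle, and give acc an
         * explicit bit width.  Either change means gcc can no longer
         * build this for test runs on a conventional cpu. */
        for (int i = 0; i < 4; i++)
            acc += taps[i] * window[i];
        return acc;
    }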

Streams-C (Impulse C) starts you down a slightly better path, especially for
anyone used to MPI (or PVM, or similar) application communications libraries,
as you are already thinking in terms of moving data and building natural
pipelines or processing clusters with work queues and communications.
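
A minimal sketch of that mindset, using standard MPI calls (the MPI functions
are real; the two-stage pipeline itself is made up): each rank is one stage,
and all data motion is explicit, which is exactly the structure a streams
oriented C compiler wants to map onto FIFOs between regions of the fabric.

    #include <mpi.h>

    /* Hypothetical two-stage pipeline: rank 0 produces work items,
     * rank 1 consumes them.  The explicit send/recv pairs are the
     * "natural pipeline" a streams compiler can turn into queues. */
    int main(int argc, char **argv)
    {
        int rank, v, i, sum = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (v = 0; v < 100; v++)        /* produce */
                MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            for (i = 0; i < 100; i++) {      /* consume and reduce */
                MPI_Recv(&v, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                sum += v;
            }
        }
        MPI_Finalize();
        return 0;
    }

Compile with mpicc, run with mpirun -np 2, and each rank becomes one stage of
the pipeline.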

My personal choice was to hack TMCC to be more C-like, so that working in a
slightly restricted C language I can test on a traditional cpu with gcc,
and recompile with TMCC to fpga.  Using MPI-style problem partitioning
and coding for communications means the problem gets broken down into
many functional units/processors to exploit parallelism.
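
Here is roughly what that workflow buys you (the popcount unit is a made-up
example, and the real TMCC subset is stricter than shown): one source file,
debugged natively first, then handed unchanged to the C-to-netlist compiler.

    /* A functional unit in restricted C: fixed trip counts, no
     * pointers, no recursion -- the sort of subset a C-to-netlist
     * compiler can unroll directly into logic. */
    unsigned popcount16(unsigned x)
    {
        unsigned i, n = 0;
        for (i = 0; i < 16; i++) {   /* bounded loop -> unrollable */
            n += x & 1;
            x >>= 1;
        }
        return n;
    }

    #ifdef HOST_TEST                 /* gcc -DHOST_TEST for native test */
    #include <stdio.h>
    #include <assert.h>
    int main(void)
    {
        assert(popcount16(0xFFFF) == 16);
        assert(popcount16(0x0001) == 1);
        printf("popcount16 ok\n");
        return 0;
    }
    #endif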

There is a third approach for C to netlist, which is based on partial
evaluation (have the compiler evaluate EVERYTHING) so that only the minimum
real work makes it out of the compiler. Harpe, for example, when compiling
a soft core CPU with code in a soft ROM image, actually unrolls the entire
execution of the soft core CPU, optimizes away the cruft, and ends up
with the minimum logic to complete the task at hand, rather than being cycle
accurate in executing the equivalent C algorithm.
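
To make the partial-evaluation idea concrete, here is a toy version of the
soft-core-plus-ROM case (illustrative only, nothing to do with Harpe's actual
internals): an interpreter whose program is a compile-time constant. A
partially evaluating compiler can run the fetch/decode loop entirely at
compile time and emit only the surviving arithmetic.

    /* Toy interpreter over a constant "ROM".  Because prog[] is known
     * at compile time, partial evaluation reduces run() to the
     * straight-line computation ((x + 3) * 2) - 1; none of the
     * fetch/decode machinery survives into the netlist. */
    enum { ADD3, DBL, DEC, HALT };

    static const int prog[] = { ADD3, DBL, DEC, HALT };

    int run(int x)
    {
        for (int pc = 0; ; pc++) {
            switch (prog[pc]) {
            case ADD3: x += 3; break;
            case DBL:  x *= 2; break;
            case DEC:  x -= 1; break;
            case HALT: return x;
            }
        }
    }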

>
> > I actually proposed doing so as a $30-50M project to Sandia earlier this year
> > with the intent to produce a 1-10 petaflop FPGA/Memory machine specifically
> > targeting their large simulation applications. Both the nature of the compiler,
> > and the architecture of the machine, are critical aspects to realizing usable
> > solutions, along with a critically strong dose of keep it simple.
>
>Well, I suppose if you made a big homogeneous sea of
>memory/FPGA/floating point units it could be much simpler to figure
>out and work with than the complexity of a typical cpu.
>
> > Language/tool support, cosmic radiation, and "nobody has done it before" are
> > the three primary problems. The first and last are really the primary problems,
>
>Language and tool support seems to be a big problem with all
>performance-related software.  C and the like are too generic and far
>away from the problem and require superhuman optimizers or low-level
>mucking around by the programmer.  My approach is to work on specific
>extensions for specific problems.  I say this because I build my own
>tools and that's what works for me.  If you have a specific
>application from Sandia and specific hardware, you're halfway there.

We all end up doing that to some extent; it's just a question of which
tools we are willing to build.

John


