Hi Mike,
I can't answer all your questions, but I can start with the basic
ones, based on my short experience of building an LLVM compiler from
scratch.
> 1. How hard will it be to navigate the LLVM/clang codebase having very
> little compiler domain knowledge?
I knew very little about compilers when I started, and after a month or
so of hacking on LLVM I knew more about compilers than I would have
from reading a book on it. LLVM is very understanding and caring: it has
asserts all over the place and extensive documentation (tutorials,
examples, doxygen). I felt it was very easy to work with LLVM even when
I knew nothing about compilers or LLVM itself.
> 2. What stages of the compilation are worth parallelizing(at least for a
> first step)?
I'm not an expert, but I think most compilers use "off-line"
parallelization (each step can run separately and independently), like
compiling all C files in parallel, then linking them. Within one build
the stages depend on each other: you can't link without compiling all
sources, you can't compile a file without parsing the whole of it, and
you can't parse without running the pre-processor over everything. So
it's very unlikely that you'll be able to parallelize inside a single
compilation and end up with something really different from running
"make -j4" with compilation and linking as separate steps.
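To make the dependency concrete, here is a minimal Python sketch of that
"off-line" scheme. Everything in it is a stand-in I made up for
illustration: compile_unit() and link() are hypothetical and only track
symbol names, they do not call LLVM or any real compiler. The point is
just that the per-file step can fan out, while the link step has to wait
for all of it.

```python
from concurrent.futures import ThreadPoolExecutor


def compile_unit(name, source):
    """Hypothetical stand-in for compiling one file into a fake 'object'."""
    lines = source.splitlines()
    defined = {line.split()[1] for line in lines if line.startswith("def ")}
    used = {line.split()[1] for line in lines if line.startswith("use ")}
    return {"name": name, "defined": defined, "used": used}


def link(objects):
    """Hypothetical stand-in for linking: it needs every object up front."""
    defined = set().union(*(obj["defined"] for obj in objects))
    used = set().union(*(obj["used"] for obj in objects))
    unresolved = used - defined
    if unresolved:
        raise ValueError(f"unresolved symbols: {sorted(unresolved)}")
    return sorted(defined)


def build(sources, jobs=4):
    """Compile every file independently, then link once all are done."""
    # Threads keep the sketch short; a real build would spawn one
    # compiler process per file, which is what "make -j4" does.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        objects = list(pool.map(lambda item: compile_unit(*item), sources.items()))
    # The pool has finished here, so the link sees all the objects.
    return link(objects)
```

For example, build({"a.c": "def main\nuse helper", "b.c": "def helper"})
resolves helper across the two files, and it could not do so before
both of them had been "compiled".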
Some compilers (like MS, AFAIK) have a "database" of symbols, so they
can incrementally compile and link without doing the whole run. But
that's a big undertaking. Also, MS controls the development environment
(Visual Studio), so they can do whatever they want. With open source
compilers, it's very hard to force any build system in particular.
You could try to create a system to hold all external symbols (a big,
indexed object file) that can be partially updated (instead of
re-created) every time a file is compiled. That would avoid long
link steps and would be simple to fit into most legacy build systems.
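As a rough sketch of what "partially updated" could mean, assuming the
index is nothing more than a map from symbol name to the file that
defines it (the SymbolIndex class and its methods are hypothetical, and
a real implementation would live in an on-disk object format rather
than in Python dictionaries):

```python
class SymbolIndex:
    """Hypothetical in-memory index of external symbols, updated per file."""

    def __init__(self):
        self.by_symbol = {}  # symbol name -> file that defines it
        self.by_file = {}    # file -> symbols it defined at its last compile

    def update(self, filename, symbols):
        """Replace one file's entries without touching any other file's."""
        symbols = set(symbols)
        # Check for clashes first so a failed update changes nothing.
        for sym in symbols:
            owner = self.by_symbol.get(sym)
            if owner is not None and owner != filename:
                raise ValueError(f"{sym} is already defined in {owner}")
        # Drop what this file exported last time, then add the new set.
        for old in self.by_file.get(filename, set()):
            del self.by_symbol[old]
        for sym in symbols:
            self.by_symbol[sym] = filename
        self.by_file[filename] = symbols

    def lookup(self, symbol):
        """Return the defining file, or None if the symbol is unknown."""
        return self.by_symbol.get(symbol)
```

Recompiling a.c then costs one update("a.c", ...) call, and the entries
that came from every other file stay as they were.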
> 3. Will it be feasible to implement a basic distcc implementation in 1-2
> months? There should be 4 or so people working on the project, but none of
> us have significant compiler domain knowledge. If not, is there a subset of
> the problem that's worth working on?
If no one has compiler expertise, I'd expect them to learn the basics
in around a month. That wouldn't leave much time for the rest of the
project, and having more people won't make each of them learn any
faster. You'd better pick a subset.
> 4. Are there any examples of code(preferably in real-world projects) which
> would lend themselves to parallel compilation which come to mind? At the end
> of the project, we'll need to document the performance of our work, so I'd
> like to be thinking about how we'd create (good) presentable results along
> the way.
I would compare your results against the same, non-parallel LLVM run
as multiple processes with "make -j"; otherwise there is no point in
comparing...
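A minimal way to set up that comparison, sketched in Python. The names
time_runs() and speedup() are my own, and the two build callables are
placeholders for whatever launches the "make -j" baseline and your
parallel version; using the median of a few runs is just one reasonable
choice, not the only one.

```python
import statistics
import time


def time_runs(build, repeats=5):
    """Return the wall-clock seconds taken by each of `repeats` calls."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        build()  # placeholder: run one full build of the same project
        times.append(time.perf_counter() - start)
    return times


def speedup(baseline_times, candidate_times):
    """Median baseline time over median candidate time (>1 means faster)."""
    return statistics.median(baseline_times) / statistics.median(candidate_times)
```

Both sides should build the same sources on the same machine, with the
baseline already using "make -j", so that the ratio reflects your work
and not the difference between a serial and a parallel build.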
> 5. Where should I start? :). Obviously this is a pretty large undertaking,
> but is there any documentation that I should look at? Any particular source
> files that would be relevant?
No idea. llvm/lib/Linker maybe?
cheers,
--renato