Suppose there is a table "invites" with columns
foo int
bar string
A Hive SQL query
SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
will be compiled to the physical query plan below. Each operator is
actually a Java class; the operators are chained together, so the whole
plan can be executed in an interpreted fashion (a small C++ sketch of
this chaining follows the plan).
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a
          TableScan
            alias: a
            Filter Operator
              predicate:
                  expr: (foo > 0)
                  type: boolean
              Select Operator
                expressions:
                      expr: bar
                      type: string
                outputColumnNames: bar
                Group By Operator
                  aggregations:
                        expr: count()
                  bucketGroup: false
                  keys:
                        expr: bar
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
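To make the "interpreted" execution concrete, here is a minimal C++
sketch of the operator-chaining idea (Hive's real operators are Java
classes; the names Row, Operator and FilterFooPositive below are my own
invention, just for illustration):

  #include <memory>
  #include <string>

  // One input record of the "invites" table.
  struct Row { int foo; std::string bar; };

  // Base operator: each node forwards rows to its child, so a plan
  // is a chain of virtual process() calls -- the "interpreted" part.
  struct Operator {
      std::unique_ptr<Operator> child;
      virtual ~Operator() = default;
      virtual void process(const Row& r) {
          if (child) child->process(r);
      }
  };

  // Filter Operator: evaluates the predicate for every row.
  struct FilterFooPositive : Operator {
      void process(const Row& r) override {
          if (r.foo > 0)           // expr: (foo > 0)
              Operator::process(r);
      }
  };

Every row pays for the virtual calls; that per-row dispatch overhead is
exactly what compiling the plan would remove.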
What I am thinking of is translating this physical query plan to LLVM
IR. The IR should inline all the operators, because they are all static,
and the input record type is known and static too. The LLVM IR would
then be compiled to native code as functions (one for the mapper, and
maybe one for the reducer). Finally I can integrate those functions with
the native MapReduce runtime and run them on Hadoop.
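For illustration, here is a rough sketch of how the filter predicate
alone could be emitted with LLVM's C++ IRBuilder API (the function name
foo_gt_zero and its calling convention are assumptions of mine, not
anything Hive defines):

  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/LLVMContext.h"
  #include "llvm/IR/Module.h"

  using namespace llvm;

  // Emit: define i1 @foo_gt_zero(i32 %foo) { ret i1 (%foo > 0) }
  // In a full plan compiler this body would be inlined into the
  // generated mapper function together with the other operators.
  Function* emitFilter(Module& m) {
      LLVMContext& ctx = m.getContext();
      IRBuilder<> b(ctx);

      FunctionType* fty = FunctionType::get(
          b.getInt1Ty(), {b.getInt32Ty()}, /*isVarArg=*/false);
      Function* f = Function::Create(
          fty, Function::ExternalLinkage, "foo_gt_zero", &m);

      BasicBlock* entry = BasicBlock::Create(ctx, "entry", f);
      b.SetInsertPoint(entry);

      Value* foo = &*f->arg_begin();   // the foo column, passed as i32
      Value* pred = b.CreateICmpSGT(foo, b.getInt32(0), "cmp");
      b.CreateRet(pred);
      return f;
  }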
The input data types would probably be described by some sort of schema,
or just a memory buffer with a layout like a C struct.
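For example, one possible fixed layout for an "invites" record (my own
strawman, not an existing format) could be:

  #include <cstdint>

  // One "invites" record in a flat buffer: the int column stored
  // inline, the string column as offset/length into a separate
  // variable-length data area, so field access is pointer arithmetic.
  struct InviteRecord {
      int32_t  foo;         // column: foo int
      uint32_t bar_offset;  // column: bar string (offset into blob)
      uint32_t bar_len;     // length of bar in bytes
  };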
I don't know if I have described this clearly; here are some papers that
mention this technique:
[Google Tenzing] http://research.google.com/pubs/pub37200.html
[Efficiently Compiling Efficient Query Plans for Modern Hardware] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
Thanks,
Binglin Chang
On Fri, Feb 3, 2012 at 5:21 PM, 陳韋任 <chenwj at iis.sinica.edu.tw> wrote:
>
> Hi Chang,
>
> > I am developing a Hadoop native runtime. It has C++ APIs and libraries;
> > what I want to do is to compile Hive's logical query plan directly to
> > LLVM IR, or translate Hive's physical query plan to LLVM IR, and then
> > run it on the Hadoop native runtime. As far as I know, Google's Tenzing
> > does similar things, and a few research papers mention this technique,
> > but they don't give details.
> > Is translating the physical query plan directly to LLVM IR reasonable,
> > or is it better to use some part of the Clang library?
> > I need some advice on how to proceed, e.g. where I can find similar
> > projects or examples, or which part of the code I should start reading.
>
> I don't know what those query languages look like. If the query language
> is turned into some kind of intermediate representation during execution
> (like a compiler does), then you might need to find which representation
> is easiest to transform into LLVM IR. Clang is for C-like languages; I
> am not sure whether Clang's libraries can help you or not.
>
> HTH,
> chenwj
>
> --
> Wei-Ren Chen (陳韋任)
> Computer Systems Lab, Institute of Information Science,
> Academia Sinica, Taiwan (R.O.C.)
> Tel:886-2-2788-3799 #1667
> Homepage: http://people.cs.nctu.edu.tw/~chenwj