
I recently came across an optimization in Hive 0.12 called the Hive Correlation Optimizer, and it left an impression on me.

About the author: YIN HUAI

PhD at The Ohio State University. Personal homepage: http://www.cse.ohio-state.edu/~huai/

About the Hive Correlation Optimizer

It comes from a paper written by YIN HUAI with his advisors and fellow students at the university: YSmart: Yet another SQL-to-MapReduce Translator (see his homepage).

The core idea of the paper: there are now many engines, such as Hive and Pig, that translate SQL into MapReduce jobs. They greatly improve the productivity of writing MR programs, but for many queries the MR programs they generate run less efficiently than hand-written ones.

The paper proposes an optimization: in the MR job DAG generated from a SQL query, find tables that are read more than once to cut redundant reads, and find consecutive jobs that shuffle on the same key to cut unnecessary shuffles. Merging such MR jobs reduces both computation and I/O.

One typical type of complex queries in MapReduce is queries on multiple occurrences of the same table, including self-joins.

The following SQL is the example used in the paper:

SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
FROM (SELECT l_partkey, 0.2 * avg(l_quantity) AS t1
      FROM lineitem
      GROUP BY l_partkey) AS inner,
     (SELECT l_partkey, l_quantity, l_extendedprice
      FROM lineitem, part
      WHERE p_partkey = l_partkey) AS outer
WHERE outer.l_partkey = inner.l_partkey
  AND outer.l_quantity < inner.t1;

Ignoring map joins, Hive translates this into 3 MR jobs: job 1 computes the GROUP BY on lineitem in the first subquery, job 2 computes the join between lineitem and part in the second subquery, and job 3 joins the two intermediate results in the outer query.

Job1: generate inner by group/agg on lineitem
  Map:
    lineitem -> (k: l_partkey, v: l_quantity)
  Reduce:
    calculate 0.2 * avg(l_quantity) for each l_partkey
Job2: generate outer by joining lineitem and part
  Map:
    lineitem -> (k: l_partkey, v: (l_quantity, l_extendedprice))
    part     -> (k: p_partkey, v: null)
  Reduce:
    join rows in the same partition (l_partkey = p_partkey)
Job3: join outer and inner
  Map:
    outer -> (k: l_partkey, v: (l_quantity, l_extendedprice))
    inner -> (k: l_partkey, v: 0.2 * avg(l_quantity))
  Reduce:
    join rows in the same partition on l_partkey


The optimization is straightforward: Job1 and Job2 both shuffle on the same column of the same table, lineitem.l_partkey, so they can be done in a single job, and lineitem no longer needs to be read twice.

Job3 also shuffles on lineitem.l_partkey, and the job produced by merging Job1 and Job2 already shuffles on that column, so Job3 is unnecessary as well. It is therefore natural to merge all three jobs into one.

Job1: generate both inner and outer, and then join them
  Map:
    lineitem -> (k: l_partkey, v: (l_quantity, l_extendedprice))
    part     -> (k: p_partkey, v: null)
  Reduce:
    get inner: aggregate l_quantity for each l_partkey
    get outer: join rows with l_partkey = p_partkey
    join inner and outer

From this example, three kinds of correlations can be summarized:

– Input correlation (IC): independent operators share the same input tables.
– Transit correlation (TC): independent operators have input correlation and also shuffle the data in the same way (e.g. using the same keys).
– Job flow correlation (JFC): two dependent operators shuffle the data in the same way.

Having gone through the example above, the core idea is really very simple; it comes down to just two points (a small HiveQL illustration follows them):

Eliminate unnecessary data loading

Eliminate unnecessary data shuffling
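
To make the three correlations concrete, here is a smaller HiveQL query over a hypothetical web_logs(user_id, url, bytes) table; the table and its columns are invented for illustration and are not from the paper or from Hive.

-- Both subqueries scan web_logs            -> input correlation (IC),
-- and both shuffle on user_id              -> transit correlation (TC);
-- the outer join shuffles on user_id too   -> job flow correlation (JFC).
SELECT c.user_id, c.request_cnt, b.total_bytes
FROM (SELECT user_id, count(1) AS request_cnt
      FROM web_logs
      GROUP BY user_id) c
JOIN (SELECT user_id, sum(bytes) AS total_bytes
      FROM web_logs
      GROUP BY user_id) b
ON (c.user_id = b.user_id);

A correlation-aware translator can evaluate the two GROUP BYs and the JOIN in a single MapReduce job, reading web_logs once and shuffling it only once on user_id.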

In 2013, YIN HUAI joined Hortonworks and, together with the Hive team, implemented the paper. Although the core idea is simple, actually building this optimization is not so easy.

The work involved: discovering the correlations, transforming the query tree, supporting both common joins and map joins, and introducing new operators so that as much work as possible can be done in the reduce phase.
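
For reference, the implementation in Hive 0.12 is guarded by a configuration property, hive.optimize.correlation, which is off by default (see HIVE-2206). A quick way to watch it work is to compare EXPLAIN output with the flag off and on; the snippet below is only a sketch, reusing the hypothetical web_logs query from above.

set hive.optimize.correlation=true;   -- default is false
EXPLAIN
SELECT c.user_id, c.request_cnt, b.total_bytes
FROM (SELECT user_id, count(1) AS request_cnt FROM web_logs GROUP BY user_id) c
JOIN (SELECT user_id, sum(bytes) AS total_bytes FROM web_logs GROUP BY user_id) b
ON (c.user_id = b.user_id);
-- With the flag off, the plan has separate MapReduce stages for the two
-- GROUP BYs and the JOIN; with it on, the stages that shuffle on user_id
-- should be merged into a single MapReduce job.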

The details:

I really admire and envy someone like this: he can write the paper, and he can also put in the time to implement it himself and make a difference.

TODO: study the code

References:

HIVE-2206 add a new optimizer for query correlation discovery and optimization

HIVE-3667 Umbrella jira for Correlation Optimizer

http://www.slideshare.net/YinHuai/hive-correlation-optimizer

http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf