Introducing Cost Based Optimizer to
Apache Hive
John Pullokkaran
Hortonworks
10/08/2013
Abstract Apache Hadoop is a framework for the distributed processing of large data sets on clusters of computers typically composed of commodity hardware. Over the last few years, Apache Hadoop has become the de facto platform for distributed data processing. Apache Hive is a popular SQL interface for data processing on Apache Hadoop.
A user-submitted SQL query is converted by Hive to a physical operator tree, which is optimized, converted to Tez jobs, and then executed on the Hadoop cluster. Distributed SQL query processing in Hadoop differs from that of a conventional relational query engine when it comes to handling intermediate result sets. Hive query processing often requires sorting and reassembling of intermediate result sets; this is called shuffling in Hadoop parlance.
Most of the existing query optimizations in Hive are about minimizing shuffling cost. Currently, the user has to submit a query to Hive with the right join order for the query to be executed efficiently. Logical optimizations in Hive are limited to filter push down, projection pruning, and partition pruning. Cost based logical optimizations can significantly improve Apache Hive's query latency and ease of use.
Join reordering and join algorithm selection are a few of the optimizations that can benefit from a cost based optimizer. A cost based optimizer would free the user from having to rearrange joins into the right order, or from having to specify a join algorithm through query hints and configuration options. This can potentially free users to model their reporting and ETL needs close to the business process without having to worry about query optimizations.
Optiq is an open source cost based query optimizer and query execution framework. Optiq currently has more than fifty query optimization rules that can rewrite the query tree, and an efficient plan pruner that can select the cheapest query plan in an optimal manner. In this paper we discuss how Optiq can be used to introduce a Cost Based Logical Optimizer (CBO) into Apache Hive.
CBO will be introduced into Hive in a phased manner. In the first phase, Optiq will be used to reorder joins and to pick the right join algorithm so as to reduce query latency. Table cardinality and boundary statistics will be used for these cost based optimizations.
1. INTRODUCTION
Hive is a data-warehousing infrastructure on top of Apache Hadoop. Hive takes advantage of Hadoop's massive scale out and fault tolerance capabilities for data storage and processing on commodity hardware. Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. Hive SQL is the declarative query language, which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily.
In the past, Hadoop jobs tended to have high latencies and incurred substantial overheads in job submission and scheduling. As a result, latency for Hive queries was generally very high even when the data sets involved were very small, so Hive was typically used for ETL and not much for interactive queries. With Hadoop 2 and Tez, the overheads for job submission and job scheduling have gone down significantly. In Hadoop version 1, the only jobs that could be executed were Map-Reduce jobs; with Hadoop 2 and Tez, that limitation no longer applies.
In Hadoop, the output of a mapper is sorted and sometimes persisted on the mapper's local disk. This sorted mapper output is then sent to the appropriate reducer, which merges the sorted results from different mappers. When executing multiple map-reduce jobs where the output of one job is consumed by the next, the output of the preceding job needs to be persisted into HDFS; this persistence is costly, as the data must be copied to other nodes according to the HDFS replication factor.
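The shuffle described above can be sketched in miniature. The sketch below is an illustrative simplification and not Hadoop's actual implementation: the partitioning function, data layout, and merge logic are all assumptions made for brevity.

```python
import heapq

def map_phase(records, num_reducers):
    """Partition one mapper's output by key, then sort each partition
    (mimicking the sort-and-spill to the mapper's local disk)."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in records:
        # Deterministic toy partitioner (real Hadoop hashes the key).
        partitions[sum(map(ord, key)) % num_reducers].append((key, value))
    return [sorted(p) for p in partitions]

def reduce_phase(sorted_runs):
    """Merge the sorted runs received from different mappers and group
    values by key, as a reducer would."""
    grouped = {}
    for key, value in heapq.merge(*sorted_runs):
        grouped.setdefault(key, []).append(value)
    return grouped

# Two mappers, two reducers: each reducer merges its own partition
# from every mapper's sorted output.
m1 = map_phase([("a", 1), ("b", 2)], 2)
m2 = map_phase([("a", 3), ("c", 4)], 2)
reducer1 = reduce_phase([m1[1], m2[1]])
```

The expensive part in a real cluster is not the merge itself but moving the runs between nodes, which is why so many Hive optimizations try to minimize shuffling.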
Hive on top of Hadoop version 1 often had to submit multiple map-reduce jobs to complete query processing. This map-reduce job pipeline degraded performance, as the intermediate result sets needed to be persisted to fault tolerant HDFS. Submitting jobs and scheduling them were also relatively costly operations. With Hadoop 2 and Tez, the cost of job submission and scheduling is minimized. Tez also does not restrict a job to a Map followed by a Reduce; this implies that all of the query execution can be done in a single job without having to cross job boundaries. This results in significant cost savings, as the intermediate result sets need not be persisted to HDFS or even to local disk.
Query optimizations in a relational query engine can be broadly classified as logical query optimizations and physical query optimizations. Logical query optimizations generally refer to optimizations that can be derived from relational algebra, independent of the physical layer in which the query is executed. Physical query optimizations are optimizations that are cognizant of physical layer primitives. For Hive, the physical layer implies Map-Reduce and Tez primitives.
Currently logical query optimizations in Hive can be broadly categorized as follows:
• Projection Pruning
• Deducing Transitive Predicates
• Predicate Push down
• Merging of Select-Select and Filter-Filter into a single operator
• Multi-way Join
• Query rewrite to accommodate join skew on some column values

Physical optimizations in Hive can be broadly classified as follows:
• Partition Pruning
• Scan pruning based on partitions and bucketing
• Scan pruning if query is based on sampling
• Apply Group By on the map side in some cases
• Perform Join on the Mapper
• Optimize Union so that union can be performed on map side only
• Decide which table to stream last, based on user hint, in a multi way join
• Remove unnecessary reduce sink operators
• For queries with a Limit clause, reduce the number of files that need to be scanned for the table.
• For queries with a Limit clause, restrict the output coming from the mapper by restricting what the Reduce Sink operator generates.
• Reduce the number of Tez jobs needed for answering user submitted SQL query
• Avoid Map-Reduce jobs in case of simple fetch query
• For simple fetch queries with aggregation, perform aggregation without MapReduce tasks
• Rewrite Group By Query to use index table instead of original table
• Use index scans when the predicates above a table scan are equality predicates and the columns in the predicates have indexes on them.
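One of the logical optimizations listed above, deducing transitive predicates, can be sketched as follows. The representation of predicates as simple column/constant pairs is an assumption made for illustration; Hive's actual predicate representation is richer.

```python
def deduce_transitive(equi_joins, filters):
    """Given equi-join conditions (colA = colB) and constant filters
    (col = const), derive the filters implied by transitivity, so they
    can be pushed down to the other side of each join."""
    derived = dict(filters)
    changed = True
    while changed:
        changed = False
        for a, b in equi_joins:
            for x, y in ((a, b), (b, a)):
                if x in derived and y not in derived:
                    derived[y] = derived[x]
                    changed = True
    return derived

# t1.x = t2.y combined with the filter t1.x = 5 implies t2.y = 5,
# which lets t2 be filtered before the join as well.
```

Pushing the derived predicate below the join shrinks the input that must be shuffled, which is exactly the cost these logical rewrites target.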
In Hive, most of the optimizations are not based on the cost of query execution. Apart from filter push down and operator merging, most of the optimizations do not rearrange the operator tree; most of the operator tree mutation is for removing reduce sink and reducer operators. Listed below are some of the optimization decisions that can benefit from a CBO:
• How to order Join
• What algorithm to use for a given Join
• Should the intermediate result be persisted, or should it be recomputed on operator failure?
• The degree of parallelism at any operator (specifically the number of reducers to use)
• Semi Join selection
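A cost based optimizer chooses among such alternatives by comparing estimated costs. The greedy join-ordering sketch below is illustrative only: the cardinalities, the uniform selectivity model, and the cost function (total intermediate rows) are assumptions, not Hive's or Optiq's actual formulas.

```python
def greedy_join_order(tables, selectivity):
    """Pick a join order greedily: start from the smallest table and, at
    each step, join in the table that yields the smallest estimated
    intermediate result.

    tables      -- dict mapping table name to estimated row count
    selectivity -- assumed fraction of row pairs surviving each join
    """
    remaining = dict(tables)
    current = min(remaining, key=remaining.get)   # smallest table first
    rows = remaining.pop(current)
    order, total_intermediate = [current], 0.0
    while remaining:
        # Estimated output size of joining the current result with each candidate.
        best = min(remaining, key=lambda t: rows * remaining[t] * selectivity)
        rows = rows * remaining.pop(best) * selectivity
        total_intermediate += rows
        order.append(best)
    return order, total_intermediate
```

A greedy strategy is only a heuristic; a Volcano-style optimizer instead enumerates alternative orders and prunes dominated plans, but the cost comparison it performs at each step has this same shape.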
Optiq is an open source, Apache licensed, query planning and execution framework. Many pieces of Optiq are derived from the Eigenbase project. Optiq has an optional JDBC server, a query parser and validator, a query optimizer, and pluggable data source adapters. One of the available Optiq optimizers is a cost based optimizer based on the Volcano paper. Currently, different pieces of Optiq are used in the following projects/products:
• Apache Drill
• Cascading (Lingual)
• LucidDB
• Mondrian/Pentaho
Optiq currently has over fifty cost based optimization rules. Some of the prominent cost based optimization rules are listed below:
• Push Join through Union
• Push Filter past Table Function
• Join Reordering
• Semi Join selection
• Push Aggregate through Union
• Pull Aggregate through Union
• Pull Constants through Aggregate
• Merge Unions
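As an illustration of the kind of rewrite these rules perform, the sketch below applies a "push Filter through Union" transformation to a toy operator tree. The tuple-based tree representation is an assumption made for brevity; Optiq's actual rules operate on its RelNode operator classes.

```python
# Toy operator tree nodes:
#   ("union", left_child, right_child)
#   ("filter", predicate, child)
#   ("scan", table_name)

def push_filter_through_union(node):
    """If a Filter sits on top of a Union, copy the filter onto each
    branch, so each branch can be filtered (and possibly pruned)
    independently before the union is materialized."""
    if node[0] == "filter" and node[2][0] == "union":
        pred, (_, left, right) = node[1], node[2]
        return ("union",
                push_filter_through_union(("filter", pred, left)),
                push_filter_through_union(("filter", pred, right)))
    return node
```

In a cost based framework such a rule only fires when the rewritten plan is estimated to be cheaper, e.g. when the pushed filter enables partition pruning on one of the union branches.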
In this document we propose to use Optiq's cost based optimizer, Volcano, to perform cost based optimizations in Hive. We propose to implement the Optiq based CBO in a phased manner. Note that the proposal is to use Optiq's optimizer only and nothing else. Listed below are the envisioned stages of introducing CBO into Hive using Optiq:
• Phase 1 – Join Reordering & Join Algorithm Selection
1. Table cardinality and boundary statistics will be used to compute operator cardinality.
2. The Hive operator tree will be converted to an Optiq operator tree.
3. The Volcano optimizer in Optiq will be used to rearrange joins and to pick the join algorithm.
4. The optimized Optiq operator tree will be converted back to a Hive AST and executed as before, so all of Hive's existing optimizations will run on top of the Optiq optimized SQL.
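Join algorithm selection in Phase 1 amounts to a cost comparison of the following shape. This is an illustrative simplification: the threshold value and the two-way choice are assumptions, not Hive's actual map-join conversion logic or its configuration defaults.

```python
def choose_join_algorithm(left_bytes, right_bytes, mapjoin_threshold=25 * 2**20):
    """Pick a broadcast (map-side) join when one input is estimated to be
    small enough to hold in each mapper's memory; otherwise fall back to
    a shuffle join, which sorts and shuffles both sides by the join key."""
    if min(left_bytes, right_bytes) <= mapjoin_threshold:
        return "map_join"      # broadcast the small side; no shuffle needed
    return "shuffle_join"      # both sides are shuffled by the join key
```

This is precisely where the statistics from step 1 matter: a misestimated input size can push the optimizer across the threshold and into the wrong algorithm.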
• Phase 2 – Add support for histograms, use other optimizations in Optiq
1. Introduce space efficient histograms.
2. Change operator cardinality computation to use histograms.
3. Register additional optimization rules available in Optiq, like the ones listed above.
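The equi-width histogram sketched below illustrates how Phase 2's histograms would refine selectivity estimates beyond the boundary statistics of Phase 1. The bucket layout and the linear interpolation within a bucket are assumptions for illustration, not the space efficient histograms the phase would actually introduce.

```python
def build_histogram(values, num_buckets, lo, hi):
    """Equi-width histogram: count how many values fall into each of
    num_buckets equal-width ranges over [lo, hi)."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts

def estimate_less_than(counts, lo, hi, x):
    """Estimate |{v : v < x}| by summing the full buckets below x and
    linearly interpolating within the bucket that contains x."""
    width = (hi - lo) / len(counts)
    total = 0.0
    for i, c in enumerate(counts):
        bucket_lo = lo + i * width
        if x >= bucket_lo + width:
            total += c                            # whole bucket below x
        elif x > bucket_lo:
            total += c * (x - bucket_lo) / width  # partial bucket
    return total
```

With only min/max boundary statistics, the same estimate would have to assume a uniform distribution over the whole range; a histogram localizes that uniformity assumption to a single bucket, which is what improves the cardinality estimates.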
• Phase 3 – Code restructuring so that Optiq generates the optimized Hive operator tree
1. Unlike Phase 1, the Hive AST will be directly translated into an Optiq operator tree.
2. The Optiq operator tree will be optimized using the Volcano optimizer.
3. The optimized Optiq operator tree will be converted back into a Hive operator tree. This is unlike Phase 1, where the optimized Optiq operator tree is converted to a Hive AST.