Performance Optimization Techniques for Apache Hive

Best Practices to Optimize Hive Queries and Improve Performance.

Posted by Aravind Nuthalapati on August 14, 2018

This Article explains about Hive Performance Optimization Techniques for CDH.

Hive Performance Optimization Techniques

1. Partitioning Tables

Partition Hive tables based on frequently queried columns (e.g., date, region) to reduce data scans and query latency.

CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (sales_date STRING);

2. Bucketing Tables

Bucket frequently joined tables on common columns to optimize join operations.

CREATE TABLE user_activity (user_id INT, action STRING)
CLUSTERED BY (user_id) INTO 10 BUCKETS;

3. Use ORC File Format

Use the Optimized Row Columnar (ORC) format for better compression and faster performance.

CREATE TABLE orders_orc (order_id INT, order_total DOUBLE)
STORED AS ORC;

4. Vectorization

Enable vectorized query execution in Hive to significantly improve query processing speed.

SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;

5. Tez Execution Engine

Switch Hive's execution engine from MapReduce to Tez for improved query performance.

SET hive.execution.engine=tez;

6. Enable Cost-Based Optimizer (CBO)

Enable Hive's cost-based optimizer for intelligent query planning and execution optimization.

SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

7. Compress Intermediate Results

Compress intermediate query outputs to reduce disk I/O and network usage.

SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

8. Optimize Joins

Use map-joins (broadcast joins) for small dimension tables.

SET hive.auto.convert.join=true;
SET hive.mapjoin.smalltable.filesize=25000000;

9. Parallel Execution and Reducers

Adjust the number of parallel reducers to enhance query speed.

SET hive.exec.parallel=true;
SET hive.exec.reducers.bytes.per.reducer=134217728; --128 MB/reducer

10. JVM Reuse

Enable JVM reuse to reduce overhead in task initialization.

SET mapreduce.job.jvm.numtasks=10;

11. Statistics Collection

Regularly gather statistics to help Hive choose optimized execution plans.

ANALYZE TABLE tablename COMPUTE STATISTICS;
ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS;

12. Resource Management with YARN

Tune YARN resource allocation to ensure Hive queries have sufficient resources.

  • Increase YARN container memory if tasks frequently fail with Out-of-Memory (OOM) errors.
  • Balance CPU and memory allocations based on workload type.

13. Avoid SELECT *

Explicitly define required columns instead of using SELECT * to minimize unnecessary data reading and processing.

14. Query Optimization Practices

  • Avoid unnecessary subqueries.
  • Reduce expensive aggregations and DISTINCT clauses when possible.
  • Filter data as early as possible in your queries.

15. Monitor and Analyze Execution

Regularly monitor Hive query logs and execution plans in Cloudera Manager to identify and fix bottlenecks proactively.

Summary

Implementing these optimization techniques significantly enhances Hive performance, improving query response times and maximizing your CDH cluster's efficiency.