Optimize the Performance of a Hive Union SQL Statement

Background


A Hive SQL statement like below cost several hours when querying with Spark on more than 100 executors.

select *
from
   (
     select table_a.tid, table_b.name, table_c.type
     from table_a
     join table_b on table_a.tid = table_b.tid
     left join table_c on table_a.tid = table_c.tid
     where to_date(from_unixtime(table_a.start_time)) <= '2018-05-27'
     and to_date(from_unixtime(table_a.end_time)) >= '2018-05-20'
   )
Union
  (
    select table_a.tid, table_b.name, table_c.type
    from table_a
    join table_b on table_a.tid = table_b.tid
    left join table_c on table_a.tid = table_c.tid
    where to_date(from_unixtime(table_a.start_time)) <= '2018-06-27'
    and to_date(from_unixtime(table_a.end_time)) >= '2018-06-20'
  )
table rows
table_a 27,837,053
table_b 317,261,723
table_c 888,925

Reason


The real bottleneck is the union action in the SQL statement. The union will do the distinct action, which is very expensive.

After joining, the union action will distinct on a very large dataset.

Usually, the join action cost about 10 minutes, but the distinct action cost nearly 2 hours.


Solution


It’s not easy to optimize the performance of the distinct algorithm. It’s easy to optimize the size of the dataset to distinct.

Below is the optimized SQL statement

select *
from
   (
     select a.tid, b.name, c.type
     from
        (
          select distinct tid from table a
          where to_date(from_unixtime(table_a.start_time)) <= '2018-05-27'
          and to_date(from_unixtime(table_a.end_time)) >= '2018-05-20'
        ) a
     join (select distinct tid, name from table_b) b
     on a.tid = b.tid
     left join (select distinct tid, type from table_c) c
     on a.tid = c.tid
   )
Union
  (
    select a.tid, b.name, c.type
    from
       (
         select distinct tid from table a
         where to_date(from_unixtime(table_a.start_time)) <= '2018-06-27'
         and to_date(from_unixtime(table_a.end_time)) >= '2018-06-20'
       ) a
    join (select distinct tid, name from table_b) b
    on a.tid = b.tid
    left join (select distinct tid, type from table_c) c
    on a.tid = c.tid
  )

The skill is well explained in Use Subqueries to Count Distinct 50X Faster and Performance Tuning SQL Queries