Tuesday, December 19, 2017

Partition-wise joins: "divide and conquer" for joins between partitioned table

Unlike inheritance-based partitioning, declarative partitioning introduced in PostgreSQL 10 leaves nothing to infer about how the data is divided into partitions. PostgreSQL 11's query optimizer is gearing up to take advantage of this "no-inference" representation. The first one that got committed was basic partition-wise join.

What is partition-wise join?

A join between two similarly partitioned tables can be broken down into joins between their matching partitions if there exists an equi-join condition between the partition keys of the joining tables. The equi-join between partition keys implies that all the join partners for a given row in a given partition of one partitioned table must be in the corresponding partition of the other partitioned table. Because of this, the join between partitioned tables can be broken down into joins between the matching partitions. This technique of breaking down a join between partition tables into joins between their partitions is called partition-wise join.

Partition-wise join in PostgreSQL

Let's start with an example. Consider two tables partitioned as follows:

CREATE TABLE prt1 (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (5000);
CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (5000) TO (15000);
CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (15000) TO (30000);

CREATE TABLE prt2 (a int, b int, c varchar) PARTITION BY RANGE(b);
CREATE TABLE prt2_p1 PARTITION OF prt2 FOR VALUES FROM (0) TO (5000);
CREATE TABLE prt2_p2 PARTITION OF prt2 FOR VALUES FROM (5000) TO (15000);
CREATE TABLE prt2_p3 PARTITION OF prt2 FOR VALUES FROM (15000) TO (30000);

All join partners for a row in prt1_p1 come from prt2_p1. All join partners for a row in prt1_p2 come from prt2_p2. And all join partners for a row in prt1_p3 come from prt2_p3. Those three form the matching pairs of partitions.

Without partition-wise join, the plan for a join between these two tables looks like:

explain (costs off)
select * from prt1 t1, prt2 t2 where t1.a = t2.b and t1.b = 0 and t2.b between 0 and 10000;
                       QUERY PLAN                       
--------------------------------------------------------
 Hash Join
   Hash Cond: (t2.b = t1.a)
   ->  Append
         ->  Seq Scan on prt2_p1 t2
               Filter: ((b >= 0) AND (b <= 10000))
         ->  Index Scan using prt2_p2_b on prt2_p2 t2_1
               Index Cond: ((b >= 0) AND (b <= 10000))
   ->  Hash
         ->  Append
               ->  Seq Scan on prt1_p1 t1
                     Filter: (b = 0)
               ->  Seq Scan on prt1_p2 t1_1
                     Filter: (b = 0)
               ->  Seq Scan on prt1_p3 t1_2
                     Filter: (b = 0)
(15 rows)

With partition-wise join the plan for the same query looks like:

explain (costs off)
select * from prt1 t1, prt2 t2 where t1.a = t2.b and t1.b = 0 and t2.b between 0 and 10000;
                               QUERY PLAN                               
------------------------------------------------------------------------
 Append
   ->  Hash Join
         Hash Cond: (t2.b = t1.a)
         ->  Seq Scan on prt2_p1 t2
               Filter: ((b >= 0) AND (b <= 10000))
         ->  Hash
               ->  Seq Scan on prt1_p1 t1
                     Filter: (b = 0)
   ->  Nested Loop
         ->  Seq Scan on prt1_p2 t1_1
               Filter: (b = 0)
         ->  Index Scan using prt2_p2_b on prt2_p2 t2_1
               Index Cond: ((b = t1_1.a) AND (b >= 0) AND (b <= 10000))
(13 rows)


There are a few things to be noted here:
  1. There exists an equi-join condition t1.a = t2.b, which includes partition keys from both the tables.
  2. Without partition-wise join, the join will be performed after "appending" all the rows from each partition of either partitioned table. With partition-wise join, the join between matching partitions is performed and the results are appended. This is advantageous when the size of join result is significantly smaller than the result of cross product. More advantageous, if the partitions themselves are foreign tables, i.e. data in the partitions resides on the foreign server.
  3. Without partition-wise join, it used hash join, but with partition-wise join it used different strategies for each join between partitions, choosing optimal strategy for each join. For example, the join between prt1_p2 and prt2_p2 uses nested loop join with index scan on prt2_p2_b as parameterized inner side, whereas the other join uses hash join.
  4. The condition t2.b between 0 and 10000 eliminated partition prt2_p3 so it does not get scanned by the plan without partition-wise join. But it didn't notice that no row in prt1_p3 had a join partner at all and still scanned that partition. With partition-wise join, it sensed the lack of a matching partition and eliminated a scan on prt1_p3 as well. Eliminating an entire partition is a significant improvement, since sequential scans are expensive.
Partition-wise join wins over unpartitioned join because it can take advantage of properties of the partitions and use smaller hash tables that may completely fit in memory, faster in-memory sorts, join push-down in case of foreign partitions and so on. We will talk about performance more in a follow-on post.

Beyond basic partition-wise join

In the basic version that was committed, the technique is applied when the joining tables have exactly the same partition key data types and have exactly matching partition bounds. But there are few enhancements possible:
  1. Even if the partition bounds do not match exactly, the technique can be used when every partition in one partitioned table has at most one matching partition from the other partitioned table. A patch for that is being worked on.
  2. A join between an unpartitioned table and a partitioned table can be executed using this technique by joining the unpartitioned table with each partition separately and combining the results of these joins. This may help the cases when a few tables in query are unpartitioned but other tables are similarly partitioned and an optimal plan interleaves partitioned and unpartitioned tables.
  3. The technique uses more memory and CPU even when partition-wise join is not an optimal strategy. Reduce the memory and CPU footprint of this technique.
  4. When two differently partitioned tables are joined, repartition one of them to match the partition scheme of the other and join using partition-wise join; a technique usually helpful to join differently sharded tables by redistributing the data.

Contributions

I authored the patch to implement basic partition-wise join, but a lot of people helped with testing and reviewing. Rajkumar Raghuvanshi, my colleague from EDB tested it. Another EDBean Rafia Sabih ran benchmarks and published results. Robert Haas committed it after several rounds of reviews. Thomas Munro, Jeevan Chalke from EDB, Amit Langote from NTT, Antonin Houska and few others also reviewed some portions of it.

Also read about this feature's role in analytics here.