Tuesday, December 19, 2017

Partition-wise joins: "divide and conquer" for joins between partitioned tables

Unlike inheritance-based partitioning, declarative partitioning, introduced in PostgreSQL 10, leaves nothing to infer about how the data is divided into partitions. PostgreSQL 11's query optimizer is gearing up to take advantage of this "no-inference" representation. The first such optimization to be committed was basic partition-wise join.

What is partition-wise join?

A join between two similarly partitioned tables can be broken down into joins between their matching partitions if there exists an equi-join condition between the partition keys of the joining tables. The equi-join between partition keys implies that all the join partners for a given row in a given partition of one partitioned table must be in the corresponding partition of the other partitioned table. The join between the partitioned tables can therefore be computed as the joins between the matching partitions, with their results combined. This technique of breaking down a join between partitioned tables into joins between their partitions is called partition-wise join.

Partition-wise join in PostgreSQL

Let's start with an example. Consider two tables partitioned as follows:

CREATE TABLE prt1 (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (5000);
CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (5000) TO (15000);
CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (15000) TO (30000);

CREATE TABLE prt2 (a int, b int, c varchar) PARTITION BY RANGE(b);
CREATE TABLE prt2_p1 PARTITION OF prt2 FOR VALUES FROM (0) TO (5000);
CREATE TABLE prt2_p2 PARTITION OF prt2 FOR VALUES FROM (5000) TO (15000);
CREATE TABLE prt2_p3 PARTITION OF prt2 FOR VALUES FROM (15000) TO (30000);
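
The plans below use an index scan on prt2_p2_b, so each partition of prt2 needs an index on column b, and the tables need some data and fresh statistics. A minimal, illustrative setup (index names chosen to match the plan output; the data here is made up, and the optimizer may pick different plans for different data) could be:

-- Indexes assumed by the example plans below, one per partition of prt2:
CREATE INDEX prt2_p1_b ON prt2_p1(b);
CREATE INDEX prt2_p2_b ON prt2_p2(b);
CREATE INDEX prt2_p3_b ON prt2_p3(b);

-- Some made-up data, and fresh statistics for the planner:
INSERT INTO prt1 SELECT i, i % 25, to_char(i, 'FM00000') FROM generate_series(0, 29999, 2) i;
INSERT INTO prt2 SELECT i % 25, i, to_char(i, 'FM00000') FROM generate_series(0, 29999, 3) i;
ANALYZE prt1;
ANALYZE prt2;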

All join partners for a row in prt1_p1 come from prt2_p1. All join partners for a row in prt1_p2 come from prt2_p2. And all join partners for a row in prt1_p3 come from prt2_p3. Those three form the matching pairs of partitions.

Without partition-wise join, the plan for a join between these two tables looks like:

explain (costs off)
select * from prt1 t1, prt2 t2 where t1.a = t2.b and t1.b = 0 and t2.b between 0 and 10000;
                       QUERY PLAN                       
--------------------------------------------------------
 Hash Join
   Hash Cond: (t2.b = t1.a)
   ->  Append
         ->  Seq Scan on prt2_p1 t2
               Filter: ((b >= 0) AND (b <= 10000))
         ->  Index Scan using prt2_p2_b on prt2_p2 t2_1
               Index Cond: ((b >= 0) AND (b <= 10000))
   ->  Hash
         ->  Append
               ->  Seq Scan on prt1_p1 t1
                     Filter: (b = 0)
               ->  Seq Scan on prt1_p2 t1_1
                     Filter: (b = 0)
               ->  Seq Scan on prt1_p3 t1_2
                     Filter: (b = 0)
(15 rows)
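
Note that basic partition-wise join is disabled by default, because the planner spends extra time and memory considering it. To get the partition-wise plan, it must be switched on explicitly (the parameter name below is the one in the development branch at the time of writing):

SET enable_partition_wise_join = on;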

With partition-wise join enabled, the plan for the same query looks like:

explain (costs off)
select * from prt1 t1, prt2 t2 where t1.a = t2.b and t1.b = 0 and t2.b between 0 and 10000;
                               QUERY PLAN                               
------------------------------------------------------------------------
 Append
   ->  Hash Join
         Hash Cond: (t2.b = t1.a)
         ->  Seq Scan on prt2_p1 t2
               Filter: ((b >= 0) AND (b <= 10000))
         ->  Hash
               ->  Seq Scan on prt1_p1 t1
                     Filter: (b = 0)
   ->  Nested Loop
         ->  Seq Scan on prt1_p2 t1_1
               Filter: (b = 0)
         ->  Index Scan using prt2_p2_b on prt2_p2 t2_1
               Index Cond: ((b = t1_1.a) AND (b >= 0) AND (b <= 10000))
(13 rows)


There are a few things to be noted here:
  1. There exists an equi-join condition, t1.a = t2.b, between the partition keys of the two tables.
  2. Without partition-wise join, the join is performed after "appending" all the rows from every partition of each partitioned table. With partition-wise join, each pair of matching partitions is joined and the results are appended. This is advantageous when the size of the join result is significantly smaller than the Cartesian product of the inputs, and even more so when the partitions are foreign tables, i.e. when the partition data resides on foreign servers.
  3. Without partition-wise join, the planner used a single hash join; with partition-wise join, it chose the optimal strategy separately for each join between partitions. For example, the join between prt1_p2 and prt2_p2 uses a nested loop with a parameterized index scan on prt2_p2_b as the inner side, whereas the join between prt1_p1 and prt2_p1 uses a hash join.
  4. The condition t2.b between 0 and 10000 eliminated partition prt2_p3, so it is not scanned even by the plan without partition-wise join. That plan, however, still scanned prt1_p3, not noticing that no row there could possibly have a join partner. With partition-wise join, the planner saw that prt1_p3 has no matching partition on the other side and eliminated the scan on prt1_p3 as well. Eliminating an entire partition is a significant improvement, since sequential scans are expensive.

Partition-wise join wins over a plain, non-partition-wise join because it can take advantage of the properties of the individual partitions: smaller hash tables that may fit completely in memory, faster in-memory sorts, join push-down in the case of foreign partitions, and so on. We will talk more about performance in a follow-on post.

Beyond basic partition-wise join

In the basic version that was committed, the technique is applied only when the joining tables have exactly the same partition key data types and exactly matching partition bounds. But a few enhancements are possible:
  1. Even if the partition bounds do not match exactly, the technique can be used when every partition of one partitioned table has at most one matching partition in the other partitioned table. A patch for that is being worked on.
  2. A join between an unpartitioned table and a partitioned table can be executed with this technique by joining the unpartitioned table with each partition separately and combining the results. This may help cases where a few tables in a query are unpartitioned while the other tables are similarly partitioned, and an optimal plan interleaves partitioned and unpartitioned tables.
  3. The technique currently costs extra memory and CPU during planning even when partition-wise join is not the optimal strategy; that footprint should be reduced.
  4. When two differently partitioned tables are joined, one of them can be repartitioned to match the partition scheme of the other and the join executed partition-wise; this technique is usually helpful for joining differently sharded tables by redistributing the data.

Contributions

I authored the patch to implement basic partition-wise join, but a lot of people helped with testing and reviewing. Rajkumar Raghuvanshi, my colleague from EDB, tested it. Another EDBean, Rafia Sabih, ran benchmarks and published results. Robert Haas committed it after several rounds of reviews. Thomas Munro, Jeevan Chalke from EDB, Amit Langote from NTT, Antonin Houska, and a few others also reviewed portions of it.

Also read about this feature's role in analytics here.

Wednesday, June 14, 2017

ALTERing partition bounds of a partition

PostgreSQL 10 is full of big, new and exciting features. Declarative partitioning is one of them; it is something users have wanted for years. During PGCon 2017, it was a hot topic of discussion. People wanted to know more about the feature and were eager to try it out. The un-conference and conference sessions on partitioning attracted large crowds. One of the frequently asked questions was whether a user can change the partition bounds of an existing partition. The question bears exploring, and an answer.

Implicit in the question is the use of the ALTER TABLE command. For those who are new to PostgreSQL or to PostgreSQL partitioning: partitions are tables in PostgreSQL, and database administrators use the CREATE TABLE command with the PARTITION OF clause to create them. Naturally, users then expect an ALTER TABLE subcommand that allows a change to the partition bounds, e.g. ALTER TABLE ... ALTER FOR VALUES ... or some similar command. But there is no such ALTER TABLE subcommand in PostgreSQL v10. We may add one in a future version, but no such proposal has surfaced yet.

DBAs planning to use the partitioning features introduced in PostgreSQL v10 should not do so lightly. First, bad partitioning is worse than no partitioning: it is critical to choose the partitioning keys, strategy, and ranges/lists after significant thought and testing. Second, v10 partitioning still lacks functionality such as SPLIT and MERGE. Like many other major features in PostgreSQL, partitioning will take a few (or possibly just a couple of) releases to be functionally complete or close to it. However, the deficiency above is not hard to overcome, and what follows is how to do it.

Let's create a partitioned table with three partitions:

CREATE TABLE t1 (a int, b int) PARTITION BY RANGE(a);
CREATE TABLE t1p1 PARTITION OF t1 FOR VALUES FROM (0) TO (100);
CREATE TABLE t1p2 PARTITION OF t1 FOR VALUES FROM (200) TO (300);
CREATE TABLE t1p3 PARTITION OF t1 FOR VALUES FROM (300) TO (400);

Let's see how that has come out:

\d+ t1
                                   Table "public.t1"
Column |  Type   | Collation | Nullable | Default | Storage | Stats target 
--------+---------+-----------+----------+---------+---------+--------------
a      | integer |           |          |         | plain   |              
b      | integer |           |          |         | plain   |              
Partition key: RANGE (a)
Partitions: t1p1 FOR VALUES FROM (0) TO (100),
           t1p2 FOR VALUES FROM (200) TO (300),
           t1p3 FOR VALUES FROM (300) TO (400)

You will notice that we do not have a partition to hold rows with values for column "a" from 100 (inclusive) to 200 (exclusive). So if you try to insert a row with a = 150, it will fail with an error.

INSERT INTO t1 VALUES (150, 150);
ERROR:  no partition of relation "t1" found for row
DETAIL:  Partition key of the failing row contains (a) = (150).

Let's say you realize your mistake and want to correct it. Here's a simple trick: detach the partition whose bounds need to change. This simply detaches the corresponding table from the partition hierarchy without removing it from the database, so the data in the partition remains untouched. Then attach the partition again with the changed bounds. If any data in the partition does not fit the new bounds, or some other partition has overlapping bounds, the command will fail, so there is little risk of mistakes. Here are the actual commands (see the ALTER TABLE … ATTACH/DETACH documentation for more details):

BEGIN TRANSACTION;
ALTER TABLE t1 DETACH PARTITION t1p1;
ALTER TABLE t1 ATTACH PARTITION t1p1 FOR VALUES FROM (0) TO (200);
COMMIT TRANSACTION;

If you noticed, and questioned, the BEGIN/COMMIT transaction block around the above commands: it ensures that the table remains inaccessible while the bounds are being changed. This prevents another transaction from adding a partition with a conflicting range, or from inserting rows into t1p1 that would conflict with the new partition bounds. Please note that ATTACH PARTITION checks that all the rows in the table comply with the new partition constraint. This scans the whole table and can thus affect performance if the table holds many rows.
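
If the partition holds many rows, one way to keep that validation scan out of the locked window is to first give the table a CHECK constraint matching the new partition constraint (including the IS NOT NULL test that range partitioning implies); PostgreSQL can then skip the scan during ATTACH PARTITION. A sketch, with an illustrative constraint name:

ALTER TABLE t1p1 ADD CONSTRAINT t1p1_new_bounds
    CHECK (a IS NOT NULL AND a >= 0 AND a < 200);
BEGIN TRANSACTION;
ALTER TABLE t1 DETACH PARTITION t1p1;
ALTER TABLE t1 ATTACH PARTITION t1p1 FOR VALUES FROM (0) TO (200);
COMMIT TRANSACTION;
ALTER TABLE t1p1 DROP CONSTRAINT t1p1_new_bounds; -- redundant once attached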

Here’s how the new partitions look. Notice the changed range of partition t1p1.

\d+ t1
                                   Table "public.t1"
Column |  Type   | Collation | Nullable | Default | Storage | Stats target
--------+---------+-----------+----------+---------+---------+--------------
a      | integer |           |          |         | plain   |          
b      | integer |           |          |         | plain   |          
Partition key: RANGE (a)
Partitions: t1p1 FOR VALUES FROM (0) TO (200),
           t1p2 FOR VALUES FROM (200) TO (300),
           t1p3 FOR VALUES FROM (300) TO (400)

The table will accept a row with a = 150.

INSERT INTO t1 VALUES (150, 150);
INSERT 0 1

This blog was first published on the EDB blog.

Tuesday, February 28, 2017

Combining partitioning and FDWs for real-time analytics

PostgreSQL has been steadily improving its Foreign Data Wrapper (FDW) capabilities since v9.1. FDWs are a way to access and manipulate data external to PostgreSQL from a PostgreSQL server. The technology is based on the SQL/MED standard, which represents foreign data in the form of relational tables called foreign tables. Starting with v10, PostgreSQL will support declarative partitioning, similar to the partitioning support in many popular DBMSes. What’s more interesting is that these partitions can be foreign tables, bringing two great technologies together! Here’s a glimpse of what this confluence can do.

Real time analytics scenario

Consider a database recording purchases made by customers in various stores. Since customers have an affinity for stores in the regions where they are located, each region has its own server, which records the transactions made by customers within that region. A region here can be as large as a continent or as small as a province, depending upon the nature of the business, the volume of transactions, etc. The transactions need to be analyzed periodically at a central server for deciding promotions, product distribution, etc. The traditional method to do so would be dump-load-analyze.

With the help of partitioning and FDWs, the central server can create tables partitioned by region to consolidate all the data. The partitions would be foreign tables pointing to their counterparts on the regional servers. This saves the time spent dumping data on the regional servers and loading it onto the central server, and reports on the central server can be generated at any time. If the central server only generates reports from the regional data, regional standbys can be used as the foreign servers.
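
Here is a sketch of such a setup using postgres_fdw, with just one region shown. The server name, connection options, table definitions and the 1M-wide customer id ranges are illustrative assumptions, not details from the benchmark below:

CREATE EXTENSION postgres_fdw;

CREATE SERVER region1_server FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'region1.example.com', dbname 'sales');
CREATE USER MAPPING FOR CURRENT_USER SERVER region1_server
    OPTIONS (user 'report_user', password 'secret');

-- Central tables are partitioned by customer id, whose leading digits
-- encode the region; each partition is a foreign table pointing to the
-- matching table on the regional server.
CREATE TABLE all_customers (id int, name text) PARTITION BY RANGE (id);
CREATE FOREIGN TABLE all_customers_r1 PARTITION OF all_customers
    FOR VALUES FROM (1000000) TO (2000000)
    SERVER region1_server OPTIONS (table_name 'customers');

CREATE TABLE all_orders (cust_id int, prod_id int, price numeric)
    PARTITION BY RANGE (cust_id);
CREATE FOREIGN TABLE all_orders_r1 PARTITION OF all_orders
    FOR VALUES FROM (1000000) TO (2000000)
    SERVER region1_server OPTIONS (table_name 'orders');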

Potential performance

I tried to measure the performance gains by simulating the above scenario on my laptop. Assume two tables on every regional server: customers, recording information about the customers, and orders, recording purchases made by the customers. The central server has partitioned tables 'all_customers' and 'all_orders'. The two highest digits of a customer id identify the customer's region (e.g. 01, 02, 03, ...), so partitioning the tables by customer id also partitions them by region. (Depending upon the other queries, we might instead add a 'region' column to all the required tables and partition the central server's tables by region.) I tried a query which produces the details of the top-10 consumers. I had four partitions, each holding 1M customers with up to 10 purchases per customer. Real data would have a lot more customers and more transactions per customer, and the scale-up would be correspondingly better.

SELECT count(o.prod_id) sale_count, -- sum(price) would be more appropriate
       c.id -- add c.name, c.email etc. here
FROM all_customers c, all_orders o
WHERE c.id = o.cust_id
GROUP BY c.id
ORDER BY sale_count DESC
LIMIT 10;

Without partition-wise join, the query fetched all the data from all the regions and performed the join and aggregation locally. It took 176 seconds to complete.

With partition-wise join and postgres_fdw join pushdown, but without partition-wise aggregation, the query took 155 seconds (a 12% improvement).

With partition-wise join, partition-wise aggregation, postgres_fdw join pushdown and aggregate pushdown, it took 62 seconds (a 65% improvement). In this case, aggregation produced a total of 4M rows from the regions, and each query on a foreign server took approximately 15 seconds. Extrapolating, a query which takes hours to run today would complete in minutes.

If we were to have asynchronous query capability in FDWs, each foreign server could be queried in parallel, completing the query in approximately 17-20 seconds (an 88% improvement). If we could also push the SORT and LIMIT down to the foreign servers, we would fetch only 10 rows from each server instead of 4M rows in total, improving the query even further.

If the company needs to make real-time decisions about promotions targeting high-profile consumers, product distribution and so on, it can do so using the combination of partitioning and FDWs, as compared to the decidedly non-real-time dump-then-load-then-analyze approach.

Patches under development

All the query optimizations we talked about are either already available in PostgreSQL or under development, to be available within the next few releases. I have proposed the partition-wise join optimization to the community here; it is being reviewed by Robert Haas. Jeevan Chalke is working on partition-wise aggregation, which will soon be proposed to the community. Asynchronous query execution is being discussed on hackers here.