Saturday, June 23, 2018

Upgrade your partitioning from inheritance to declarative

Before PostgreSQL 10, Postgres users partitioned their data using inheritance-based partitioning. That method uses constraints to define the partitions and rules or triggers to route data to the appropriate partition, and a user has to write and maintain code for all of that. PostgreSQL 10 introduced declarative partitioning, which is much easier to set up and requires almost no maintenance. PostgreSQL 11
adds a number of partitioning-related enhancements that work with declarative partitioning. Users who have implemented inheritance-based partitioning will want to move to declarative partitioning (after upgrading to v11, of course) to benefit from those features. Here's how they can do so.

Example setup

You may have created a parent table and several child tables, one per partition, with triggers, rules and constraints as required. Here's an example setup similar to the one described in the PostgreSQL documentation.

\d+ measurement
                                Table "inh_part.measurement"
  Column   |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
-----------+---------+-----------+----------+---------+---------+--------------+-------------
 city_id   | integer |           | not null |         | plain   |              | 
 logdate   | date    |           | not null |         | plain   |              | 
 peaktemp  | integer |           |          |         | plain   |              | 
 unitsales | integer |           |          |         | plain   |              | 
Child tables: measurement_y2006m02,
              measurement_y2006m03,
              measurement_y2006m04,
              measurement_y2006m05,
              measurement_y2006m06

-- here's what a child table looks like
\d+ measurement_y2006m03
                            Table "inh_part.measurement_y2006m03"
  Column   |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
-----------+---------+-----------+----------+---------+---------+--------------+-------------
 city_id   | integer |           | not null |         | plain   |              | 
 logdate   | date    |           | not null |         | plain   |              | 
 peaktemp  | integer |           |          |         | plain   |              | 
 unitsales | integer |           |          |         | plain   |              | 
Indexes:
    "measurement_y2006m03_logdate" btree (logdate)
Check constraints:
    "measurement_y2006m03_logdate_check" CHECK (logdate >= '2006-03-01'::date AND logdate < '2006-04-01'::date)
Inherits: measurement

Moving to declarative partitioning

One could simply create a partitioned table with the required number of partitions, create indexes and other objects on this partitioned table (except the constraints, rules and triggers used for inheritance partitioning), and then copy the data from the inheritance parent to the partitioned table using INSERT INTO ... SELECT. A user may optimize the data movement by copying data from each child table to the corresponding partition. But PostgreSQL offers something better: the ability to ATTACH an existing table as a partition of a partitioned table. This method is faster than the others since no data movement is involved. In an experiment I ran with a few MBs of partitioned data, it was 2x faster. As the data grows, data movement takes longer, even if you move data from child tables to partitions individually; the time to ATTACH child tables as partitions, however, doesn't increase with the size of the data. Here are the steps.
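
For reference, the data-movement approach boils down to a single statement (a sketch, assuming the partitioned table measurement_part created in step 2 below):

INSERT INTO measurement_part SELECT * FROM measurement;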

Step 0

Take a backup of the inheritance parent and all the child tables; a database-level backup would be even better. This step is optional but very important, so that you can restore the data in case something goes wrong while performing the next steps.
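
If a full pg_dump is impractical, even a crude in-database copy of the parent (which, through inheritance, includes all the children) is better than nothing; a sketch:

CREATE TABLE measurement_backup AS SELECT * FROM measurement;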

Step 1

Start a transaction, so that everything gets rolled back in case of an error.
BEGIN TRANSACTION;

Step 2

Create the partitioned table with the same definition as the inheritance parent, annotating the CREATE TABLE command with a PARTITION BY clause. You will need to specify the columns or expression to use as the partition key in the PARTITION BY clause, but those should be apparent from the constraints on the inheritance children. For example, in the above setup the constraints are all based on the column 'logdate', which is the intended partition key. The partition key may not be so evident if there's a spaghetti of constraints surrounding each child table. If the constraints, rules or triggers are well documented, it should not be difficult to spot the partition key; if not, a deeper examination of these objects will reveal it.

CREATE TABLE measurement_part (
    city_id         int not null,
    logdate         date not null,
    peaktemp        int,
    unitsales       int
) PARTITION BY RANGE (logdate);
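
If the partition key isn't obvious, a catalog query can list every child's check constraints in one place; a sketch using pg_inherits and pg_constraint:

SELECT c.conrelid::regclass AS child, pg_get_constraintdef(c.oid) AS def
FROM pg_constraint c JOIN pg_inherits i ON i.inhrelid = c.conrelid
WHERE i.inhparent = 'measurement'::regclass AND c.contype = 'c';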

Step 3

We need to add the child tables as partitions of the partitioned table using ALTER TABLE ... ATTACH PARTITION. To do that, first remove each child table from the inheritance hierarchy using the NO INHERIT clause. For example,

ALTER TABLE measurement_y2006m02 NO INHERIT measurement; 

Step 4

Craft the FOR VALUES clause from the constraints of a given child table. This should be straightforward if the partition key has been correctly identified. Then run the ALTER TABLE ... ATTACH PARTITION command as below for each of the child tables.

ALTER TABLE measurement_part ATTACH PARTITION measurement_y2006m02 FOR VALUES FROM ('2006-02-01') TO ('2006-03-01');

You may carry out steps 3 and 4 together for each child, or perform step 3 for all children followed by step 4 for all children. Do not drop the child-table constraints that have been converted into FOR VALUES clauses before you are done with these steps. If you keep them while carrying out the ATTACH step, and also set client_min_messages to INFO, you will see messages like

INFO:  partition constraint for table "measurement_y2006m02" is implied by existing constraints

Usually when we attach a table as a partition, the table is scanned to check whether it contains any rows which would not fit that partition's bounds. This scan is skipped if the table has constraints that imply the partition bounds. By retaining the original constraints, we avoid the scan, saving significant I/O and CPU time.
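
For example, here are steps 3 and 4 combined for the March child, with the FOR VALUES bounds read straight off the check constraint shown earlier:

ALTER TABLE measurement_y2006m03 NO INHERIT measurement;
ALTER TABLE measurement_part ATTACH PARTITION measurement_y2006m03 FOR VALUES FROM ('2006-03-01') TO ('2006-04-01');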

Step 5

Starting with PostgreSQL 11, users can create indexes on the partitioned table and the partitions automatically "inherit" those. The system is intelligent enough not to create an index on a partition if there's already one similar to the index on the partitioned table. In our example, all the child tables already had the
required index, so we just create an index on the partitioned table so that the optimizer knows about it.

CREATE INDEX measurement_logdate ON measurement_part(logdate);

Step 6

There may be views, constraints or other SQL objects defined on the inheritance parent. PostgreSQL associates a table's OID with the objects created on it. Since the partitioned table's OID is different from the inheritance parent's, the old views or triggers still point to the inheritance parent even if the partitioned table is given the same name as the inheritance parent (after renaming the inheritance parent itself). So they won't work as they are and need to be recreated on the partitioned table.

It would help if PostgreSQL had a command like ALTER TABLE ... PARTITION BY ... to convert a regular table into a partitioned table, but that's easier said than done. Hopefully somebody will put in the significant effort needed to implement that command.

Step 7

Drop the inheritance parent and all the objects created on it; DROP TABLE ... CASCADE might help here. These should be the same objects (except the partitioning constraints) that were recreated in step 6 on the partitioned table. This allows us to rename the partitioned table with the same name as the inheritance parent, so that queries, procedures and functions work on the partitioned table instead of the inheritance parent.

DROP TABLE measurement CASCADE;
ALTER TABLE measurement_part RENAME TO measurement;

Step 8

Now drop the partitioning constraints on the child tables, which are now partitions of the partitioned table and no longer need those constraints. You may perform this step right after step 4, but delaying it allows those constraints to be used in the later steps if necessary.

ALTER TABLE measurement_y2006m06 DROP CONSTRAINT measurement_y2006m06_logdate_check;

Step 9

Run any sanity tests before committing the transaction. For example, check the output of the \d+ command on the partitioned table and the individual partitions. Make sure the tests don't throw errors when everything is in fact right, lest everything we did till now gets rolled back.
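
One quick sanity test is to check that rows get routed to the right partition, wrapped in a savepoint so the probe row doesn't survive (the values here are arbitrary):

SAVEPOINT sanity;
INSERT INTO measurement VALUES (1, '2006-03-15', 10, 100);
SELECT tableoid::regclass AS partition FROM measurement WHERE city_id = 1 AND logdate = '2006-03-15';
ROLLBACK TO SAVEPOINT sanity;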

COMMIT TRANSACTION;

Your partitioned table is now ready.

Planning queries involving foreign PostgreSQL tables

Cost based optimization

A query may be executed in many different ways, modelled as plans in the query optimizer, differing in the resources required and/or the execution time. A typical DBMS's query optimizer tries to find all the possible plans for executing a given query and chooses the fastest among them. But it's not possible to calculate the time required by a plan without executing the query, so an optimizer associates an estimate of execution time with each possible plan and chooses the one with the least estimated value. PostgreSQL is no different: it associates a cost with each possible plan, where the cost is a rough estimate of the time required to execute the query, and the plan with the lowest cost is chosen for execution.

The time required to execute a query is the sum of the times required to perform the various operations involved in the plan, e.g. the time required to scan the tables in the query, the time required to compute joins, etc. Thus a plan's cost is the sum of the costs of the operations involved in it. In order to estimate the cost of a plan efficiently and correctly, PostgreSQL maintains statistics about the sizes of tables and indexes, the values stored in the various columns of tables, and so on. In a DBMS, where data keeps changing, the statistics often become stale and need to be updated. PostgreSQL keeps the statistics up-to-date by frequently sampling the tables. This works reasonably well as long as the tables involved in the query are part of the DBMS.

But nowadays applications often run queries which require data external to the DBMS. PostgreSQL supports querying external data through a Foreign Data Wrapper (FDW for short), a mechanism based on the SQL/MED standard. In this post we will discuss the methods employed by the query optimizer to plan such queries and to maintain statistics about the external data, especially data residing on other PostgreSQL servers.

Foreign tables and statistics

PostgreSQL allows external data to be represented as "foreign tables". While PostgreSQL scans the local regular tables frequently to keep their statistics up-to-date, it cannot do so for a foreign table, since accessing external data consumes precious network bandwidth and might take much longer than accessing local data; if the foreign table is not accessed frequently, that network I/O would be wasted. Hence PostgreSQL does not sample external data or foreign tables by itself. Instead a user is required to run the ANALYZE command on a foreign table periodically. As part of this command, the FDW brings a sample of the external data to PostgreSQL, which in turn derives the required statistics from it.
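
For completeness, here's a minimal postgres_fdw setup ending with that ANALYZE; the server name, connection options and user are hypothetical, and ft1 is trimmed down to two of the columns used later in this post:

CREATE EXTENSION postgres_fdw;
CREATE SERVER foreign_pg FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', dbname 'postgres');
CREATE USER MAPPING FOR CURRENT_USER SERVER foreign_pg
    OPTIONS (user 'remote_user', password 'secret');
CREATE FOREIGN TABLE ft1 (c1 int, c2 text)
    SERVER foreign_pg OPTIONS (table_name 't1');
ANALYZE ft1;  -- fetches a sample of t1 and derives local statistics from it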

Costing queries involving foreign tables

When a query involves foreign tables, the PostgreSQL optimizer works with the corresponding FDWs to produce various plans for that query. For example, if the query has a join between two foreign tables which use the same FDW, PostgreSQL has two choices.
  1. Fetch the foreign table data from the foreign server (optionally applying any conditions at the foreign server) and perform the join locally. In this case, the FDW is responsible for costing the scans on the foreign server.
  2. If the FDW is capable of performing the join itself, delegate the join to the FDW. In most cases this means that the two foreign tables reside on the same foreign server, which is capable of performing the join such that the result is the same as PostgreSQL would produce. In this case, the FDW is responsible for computing the cost of the join.
Similar logic applies to other SQL operations like grouping and sorting. An FDW may use the statistics collected by PostgreSQL and PostgreSQL's costing methods, or employ entirely different methods to collect statistics and/or compute costs. Each FDW may implement its own costing model, but it is expected to produce costs that are consistent with the rest of the optimizer. Next we will take the example of postgres_fdw.

postgres_fdw costing model

postgres_fdw is used for accessing external data residing on another PostgreSQL server. It uses two different modes for computing costs, governed by the option "use_remote_estimate". Read more about this option here.
  1. When "use_remote_estimate" is true, postgres_fdw fetches the costs from the foreign server using EXPLAIN.
  2. When it's false, postgres_fdw computes the costs based on the statistics about the external data available locally.
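
The option can be set per server or per foreign table through DDL; for example (server and table names as in the sketch above):

ALTER SERVER foreign_pg OPTIONS (ADD use_remote_estimate 'true');   -- all tables of this server
ALTER FOREIGN TABLE ft1 OPTIONS (ADD use_remote_estimate 'true');   -- just this table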

The second method works fine for a simple scan on a foreign table, as long as the statistics about the foreign table are kept up-to-date. But it doesn't do justice to complex operations like joins or grouping, whose performance depends upon a number of factors, such as the availability of suitable indexes or the memory for hash tables and sorting, which are not covered by the statistics.

Take, for example, the following query involving two foreign tables ft1 and ft2, pointing to tables t1 and t2 on the foreign server and each having columns c1 to c8. The plan with "use_remote_estimate" disabled looks like this:
explain (analyze) select * from ft1 t1 join ft2 t2 on t1.c1 = t2.c1 and t1.c1 + t2.c1 <= 20;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=247.94..301.01 rows=274 width=108) (actual time=8.289..13.270 rows=8 loops=1)
   Hash Cond: (t1.c1 = t2.c1)
   Join Filter: ((t1.c1 + t2.c1) <= 20)
   Rows Removed by Join Filter: 814
   ->  Foreign Scan on ft1 t1  (cost=100.00..137.66 rows=822 width=54) (actual time=1.223..5.918 rows=822 loops=1)
   ->  Hash  (cost=137.66..137.66 rows=822 width=54) (actual time=7.050..7.050 rows=822 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 83kB
         ->  Foreign Scan on ft2 t2  (cost=100.00..137.66 rows=822 width=54) (actual time=1.203..6.684 rows=822 loops=1)
 Planning Time: 0.307 ms
 Execution Time: 13.887 ms
(10 rows)

When we enable "use_remote_estimate" for both tables, the plan changes to
explain (analyze) select * from ft1 t1 join ft2 t2 on t1.c1 = t2.c1 and t1.c1 + t2.c1 <= 20;
                                            QUERY PLAN                                            
--------------------------------------------------------------------------------------------------
 Foreign Scan  (cost=131.47..172.63 rows=274 width=108) (actual time=2.153..2.154 rows=8 loops=1)
   Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
 Planning Time: 3.755 ms
 Execution Time: 2.336 ms
(4 rows)

Observe the difference in execution and planning time. When "use_remote_estimate" is true, the planning time is more than ten times that with "use_remote_estimate" off, but the execution time is almost 6 times shorter. The planning time increases so much because postgres_fdw fires many EXPLAIN queries on the foreign server while costing the possible plans for the join. If you have auditing enabled on the foreign server, you will find that, to plan the above query, postgres_fdw fired the following EXPLAIN commands on the foreign server.

  • EXPLAIN SELECT c1, c2, c3, c4, c5, c6, c7, c8 FROM t1
  • EXPLAIN SELECT c1, c2, c3, c4, c5, c6, c7, c8 FROM t2
  • EXPLAIN SELECT c1, c2, c3, c4, c5, c6, c7, c8 FROM t1 ORDER BY c1 ASC NULLS LAST
  • EXPLAIN SELECT c1, c2, c3, c4, c5, c6, c7, c8 FROM t1 WHERE (((c1 + ((SELECT null::integer)::integer)) <= 20)) AND ((((SELECT null::integer)::integer) = c1))
  • EXPLAIN SELECT c1, c2, c3, c4, c5, c6, c7, c8 FROM t2 ORDER BY c1 ASC NULLS LAST
  • EXPLAIN SELECT c1, c2, c3, c4, c5, c6, c7, c8 FROM t2 WHERE (((((SELECT null::integer)::integer) + c1) <= 20)) AND ((((SELECT null::integer)::integer) = c1))
  • EXPLAIN SELECT r1.c1, r1.c2, r1.c3, r1.c4, r1.c5, r1.c6, r1.c7, r1.c8, r2.c1, r2.c2, r2.c3, r2.c4, r2.c5, r2.c6, r2.c7, r2.c8 FROM (t1 r1 INNER JOIN t2 r2 ON ((((r1.c1 + r2.c1) <= 20)) AND ((r1.c1 = r2.c1))))

Each such EXPLAIN is a network trip to the foreign server, which consumes network bandwidth and time. While turning "use_remote_estimate" on improved query performance, it drastically increased query planning time. In some cases the planning and execution time together may turn out to be the same with and without "use_remote_estimate", which takes away all the benefit of join or grouping push-down. This may not sound that bad when only some queries involve foreign tables, or when the data volumes are so high that the planning time, even including those network trips, is only a tiny fraction of the total execution time. But it matters a lot when we move towards FDW-based built-in sharding, as described in a blog by Robert Haas, in the slides by my friends from Japan, and by Bruce Momjian.

We could reduce the network trips, and thus the planning time, by turning "use_remote_estimate" off, but then the plan comes out poor, as seen above. This happens because PostgreSQL tries to cost the plans without knowing the capabilities of the foreign server, i.e. the plans that the foreign server can "think of". What if we could combine the best of the two approaches by making the local PostgreSQL "think" like the foreign PostgreSQL server? That's possible, at least in theory.

Costing foreign operations locally

After all, the foreign server in this case is PostgreSQL, with the same costing model, optimizer and executor as the local PostgreSQL. If the local PostgreSQL knew the values of all the parameters which affect the query optimizer, the costs it computed locally would be much closer to the costs it gets from the EXPLAIN output. Here's a rough list of those parameters:
  1. Various GUCs that affect the query optimizer. These fall in three classes: a. GUCs that determine the costs of certain operations, like random_page_cost and cpu_operator_cost; b. GUCs that enable or disable certain planner strategies, like enable_hashjoin; c. GUCs that constrain the resources available to the query executor, e.g. parallel workers and memory.
  2. Statistics about the tables involved in the query. We already have a method to gather statistics for remote tables, but it requires scheduling ANALYZE commands manually. Furthermore, when a foreign table is ANALYZEd, postgres_fdw fetches sample data from the foreign server and derives statistics from it, which consumes network bandwidth and time. It could do better by fetching the statistics themselves from the foreign server; after all, the statistics on the foreign server were collected by a method similar to the one used to derive statistics locally. It would also help if whether or not to ANALYZE a foreign table automatically were left to the user to decide. A user may be happy to spend the network bandwidth to keep the statistics up-to-date; in that case, they would prefer that it happens automatically, as for a regular table, rather than having to set up and maintain a cron job.
  3. Metadata used by the query optimizer. The query optimizer uses its knowledge of the constraints and indexes on a table to create and cost various plans. Right now PostgreSQL supports "declarative constraints" on foreign tables, i.e. constraints which are not enforced locally but are used by the query optimizer (see the sketch after this list). But a user needs to set those up themselves; it would be better if postgres_fdw could set them up itself using its knowledge of the constraints on the foreign server. PostgreSQL has no knowledge of the indexes available on the foreign server at all; it would be better if postgres_fdw supported declarative indexes on foreign tables as well.
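
As an aside, declarative constraints can already be created by hand today, and postgres_fdw already exposes a couple of knobs of its local costing model; a sketch (names and values are illustrative):

ALTER FOREIGN TABLE ft1 ADD CONSTRAINT ft1_c1_positive CHECK (c1 > 0);  -- not enforced locally, but visible to the optimizer
ALTER SERVER foreign_pg OPTIONS (ADD fdw_tuple_cost '0.2');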

Empowered with this knowledge, the costs of delegated operations computed locally using the local costing model would be much closer to the actual costs computed at the foreign server. This would eliminate any need for the "use_remote_estimate" option.

The tricky part is keeping this knowledge about the foreign server up-to-date. The values of GUCs and the metadata may change on the foreign server without the local PostgreSQL knowing about it, but those changes are infrequent, and syncing them sufficiently often would suffice. Statistics about the foreign data, however, may change rapidly if the transaction rate on the foreign server is high; keeping them fresh enough to choose optimal plans would be a challenge. But it could be achieved if we allowed the foreign server to periodically push the information to the local server over the right wire protocol.

Wednesday, March 21, 2018

Containing bloat with partitions

PGConf India 2018 attracted a large number of PostgreSQL users and developers. I talked about "Query optimization techniques for partitioned tables" (slides). Last year, I gave an introductory talk about PostgreSQL's declarative partitioning support (slides). Many conference participants shared their perspectives on partitioning with me, and one particular question got me experimenting a bit.

The user had a huge table, almost 1TB in size, with one of the columns recording the data-creation time. Applications added MBs of new data daily and updated only the recent data; the old data was retained in the table for reporting and compliance purposes. The updates bloated the table, and autovacuum wasn't clearing the bloat efficiently. A manual VACUUM FULL was out of the question, as it would have locked the table for far too long. As a result queries were slow, and performance degraded day by day. (Read more about bloat and vacuum here and here.) The user was interested in knowing whether partitioning would help.

Hot and Cold partitioning

The concept of hot and cold partitioning isn't new. The idea is to separate data that is accessed and modified frequently (Hot data) from data that is accessed and modified rarely (Cold data). In the above case, that can be achieved by partitioning the data by the creation timestamp. The partitions should be sized such that updates and inserts touch only a handful of Hot partitions (ideally at most two). The Cold partitions, containing the stale data, remain almost unchanged. Since the updates take place in the Hot partitions, those get bloated, but their sizes are much smaller than the whole table, so vacuuming them doesn't take as much time as vacuuming the whole table. Once they become Cold, they hardly need any vacuuming, thus containing the bloat effectively.

In PostgreSQL, autovacuum, if enabled on a given table, does its job when the number of inserted, deleted or updated rows rises above certain thresholds (see details). Since all the action happens in the Hot partitions, only those partitions see their counts rise beyond the thresholds; the counts are hardly expected to change for Cold partitions (or for Hot partitions once they become cold). Thus autovacuum automatically works on only the Hot partitions instead of the entire table. This isn't the case with an unpartitioned table, for which autovacuum always runs on the entire table and possibly never completes its job, because of the sheer size of the table rather than the rate at which bloat is created.
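
For a given table, autovacuum kicks in roughly when the number of dead tuples exceeds autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples. Per-table storage parameters let you make the Hot partitions extra vacuum-eager; a sketch with a hypothetical partition name and illustrative values:

alter table hot_partition set (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_threshold = 1000);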

Experiment

That's all theory, but I ventured to see how effective this could be. I created a table with two partitions: a Hot partition, representing Hot data, and a Cold partition, representing Cold data. (Note: all the code below serves only as an example and should be used with necessary caution.)

create table part (a int, b int, c varchar, d varchar, e varchar) partition by range(a);
create table part_active partition of part for values from (990000) to (1000001) with (autovacuum_enabled = 'false');
create table part_default partition of part default with (autovacuum_enabled = 'false');

I then inserted one million rows into this table, each row with a distinct value of column a from 1 to 1000000. This means that the partition "part_active", which represents the Hot data (or the latest data, if a is interpreted as some kind of timestamp), contains 1% of the total data in table "part".
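
The population might have looked like this (a sketch; the exact values don't matter, only the distribution of column a):

insert into part select i, i % 100, md5(i::text), md5((i + 1)::text), md5((i + 2)::text) from generate_series(1, 1000000) i;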

For comparison, I also created an unpartitioned table "upart" with similar schema and populated it with the same data.


create table upart (a int, b int, c varchar, d varchar, e varchar) with (autovacuum_enabled = 'false');

As seen in the definitions above, I disabled autovacuum on these tables using "with (autovacuum_enabled = 'false')" in order to let bloat accumulate, but that's only for the experiment. In a production or test environment, one should set this option as per the requirements of the setup.

Then I updated each of the rows with a between 990000 and 1000000, three times over. In the partitioned table, these updates touched only the partition part_active, the Hot partition.
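
The updates might have looked like this (a sketch; the exact commands weren't shown):

update part set b = b + 1 where a between 990000 and 1000000;  -- run thrice
update upart set b = b + 1 where a between 990000 and 1000000; -- run thrice

Since autovacuum is disabled, the old versions of the updated tuples are not removed, so we see statistics like these: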

select n_tup_upd, n_tup_hot_upd, n_dead_tup, n_live_tup from pg_stat_user_tables where relid = 'upart'::regclass;
 n_tup_upd | n_tup_hot_upd | n_dead_tup | n_live_tup 
-----------+---------------+------------+------------
     30003 |            23 |      30003 |    1000000
(1 row)

select n_tup_upd, n_tup_hot_upd, n_dead_tup, n_live_tup from pg_stat_user_tables where relid = 'part_active'::regclass;
 n_tup_upd | n_tup_hot_upd | n_dead_tup | n_live_tup 
-----------+---------------+------------+------------
     30003 |            23 |      30003 |      10001
(1 row)

select n_tup_upd, n_tup_hot_upd, n_dead_tup, n_live_tup from pg_stat_user_tables where relid = 'part_default'::regclass;
 n_tup_upd | n_tup_hot_upd | n_dead_tup | n_live_tup 
-----------+---------------+------------+------------
         0 |             0 |          0 |     989999
(1 row)

As you can see, the unpartitioned table and the Hot partition have the same number of dead tuples; the Cold partition has none, since no row in it was updated. At this point the sizes of the unpartitioned table and the Hot partition are:

select pg_size_pretty(pg_relation_size('upart'::regclass));
 pg_size_pretty 
----------------
 326 MB
(1 row)

select pg_size_pretty(pg_relation_size('part_active'::regclass));
 pg_size_pretty 
----------------
 13 MB
(1 row)

The Hot partition, which contains only 1% of the rows of the unpartitioned table, is much larger than that proportion would suggest. That's because all the bloat in the partitioned table is concentrated in that partition.

Now, let's run VACUUM FULL ANALYZE on these tables to remove the bloat and update the statistics.

\timing on
vacuum full analyze upart;
Time: 4663.576 ms (00:04.664)
\timing off

\timing on
vacuum full analyze part_active;
Time: 53.314 ms
\timing off

After vacuuming, the sizes of the unpartitioned table and the Hot partition are:
select pg_size_pretty(pg_relation_size('upart'::regclass));
 pg_size_pretty 
----------------
 326 MB
(1 row)

select pg_size_pretty(pg_relation_size('part_active'::regclass));
 pg_size_pretty 
----------------
 3336 kB
(1 row)

Now the sizes of the tables are in the expected proportion, with the bloat removed.

Notice that the time required for vacuuming the Hot partition is about 80 times shorter than the time required for vacuuming the unpartitioned table. Effectively, the bloat in the entire partitioned table is cleared, since partitioning has restricted the bloat to the Hot partition alone. Since vacuum now takes much less time, it's possible to schedule it within the available down-time, and the time for which the table remains locked stays within reasonable limits. This isn't magic (nor is partitioning a spell that's simple to cast): observe that the reduction in time is in line with the proportion of Hot data in the total data. When vacuum runs on the unpartitioned table, it has to scan the whole table; on the partitioned table, it needs to scan only the Hot partition, which is much smaller, and thus takes much less time.

The customer had 1TB of data, while the experiment above runs with only MBs of data; that's all my laptop could afford and all that time permitted. But you get the idea. EnterpriseDB is implementing a zero-bloat heap which avoids bloat to start with, but that's going to take some time. Meanwhile you may try this option, but experiment with real-sized data.

Word of caution

Declarative partitioning is a new feature in PostgreSQL 10. Not all the functionality that works with a regular table, such as foreign keys, unique constraints and primary keys, works with a partitioned table. Much of it will be part of PostgreSQL 11, but it may take a few more releases to cover all the ground. It's always advisable to use the latest version of PostgreSQL and to test applications, for performance and correctness, before deploying in production.