Tuesday, August 8, 2023

Partitioning as a query optimization strategy?

I had discussed query optimization techniques applicable to queries involving partitioned tables at PGConf.India 2018 (Video recording, slides). (My previous blog discusses these techniques in detail.) The takeaway from that presentation was that these techniques improved query performance if the tables were already partitioned, but partitioning wasn't good enough as a query optimization strategy by itself, even when partitionwise join, partitionwise aggregate and parallel query were all used together, at least on small data sizes. The experiments back then hinted that partitioning would become a query optimization strategy in its own right if the data was large enough, but we didn't know how large was large enough. Establishing that would have required beefy machines with large resources, which were costly and took a long time to procure or get access to. On top of that, the runs themselves took a long time to set up and finish. At one point we stopped experimenting. Fast forward to today and things have changed drastically, thanks to the cloud!

EDB's BigAnimal comes to help

EnterpriseDB offers PostgreSQL-as-a-service in the form of a DBaaS platform called BigAnimal. It allows its users to deploy and run PostgreSQL in the cloud on a hardware configuration of their choice. It also provides free starter credits to try out the platform. I used BigAnimal to experiment with very large datasets. I ran the experiments on PostgreSQL 15 hosted on an m5.4xlarge instance (64 GB RAM, 16 vCPUs) with 1500 GB of storage. All of this without wasting much time or money; I destroyed the instance as soon as my experiments were over.

Experiment

I wanted to see the impact of partitioning alone as a query optimization strategy. So instead of using the whole TPC-H setup, I crafted a micro-benchmark with two queries involving two tables, li and ord, modeled after the lineitem and orders tables in the TPC-H benchmark. When partitioned, each of these two tables has 1000 matching partitions. The tables have the following schema:

\d+ li
                                            Table "public.li"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 key1   | integer |           | not null |         | plain    |             |              |
 key2   | integer |           | not null |         | plain    |             |              |
 d      | date    |           |          |         | plain    |             |              |
 m      | money   |           |          |         | plain    |             |              |
 t1     | text    |           |          |         | extended |             |              |
 t2     | text    |           |          |         | extended |             |              |
Indexes:
    "li_pkey" PRIMARY KEY, btree (key1, key2)
Access method: heap

\d+ ord
                                           Table "public.ord"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 key1   | integer |           | not null |         | plain    |             |              |
 d      | date    |           |          |         | plain    |             |              |
 m      | money   |           |          |         | plain    |             |              |
 t1     | text    |           |          |         | extended |             |              |
 t2     | text    |           |          |         | extended |             |              |
Indexes:
    "ord_pkey" PRIMARY KEY, btree (key1)
Access method: heap

When partitioned, both tables are partitioned by range on key1. Each row in ord has 3 matching rows in li, roughly imitating the data-size ratio between the corresponding tables in the TPC-H benchmark.
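
For reference, here is a minimal sketch of how the partitioned versions of these tables could be declared. The partition bounds and the number of partitions shown are illustrative; the actual experiment used 1000 matching partitions per table and varied the rows per partition.

CREATE TABLE ord (
    key1 integer NOT NULL,
    d    date,
    m    money,
    t1   text,
    t2   text,
    PRIMARY KEY (key1)
) PARTITION BY RANGE (key1);

CREATE TABLE li (
    key1 integer NOT NULL,
    key2 integer NOT NULL,
    d    date,
    m    money,
    t1   text,
    t2   text,
    PRIMARY KEY (key1, key2)
) PARTITION BY RANGE (key1);

-- matching partitions on both sides, e.g. 10K ord keys per partition
CREATE TABLE ord_p1 PARTITION OF ord FOR VALUES FROM (1) TO (10001);
CREATE TABLE li_p1  PARTITION OF li  FOR VALUES FROM (1) TO (10001);
-- ... and so on, up to ord_p1000 and li_p1000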

Query 1, which extracts the relevant parts of TPC-H Q3 or Q4, looks like

select count(*)
    from (select o.key1, sum(o.m) revenue, o.d
            from li l, ord o
            where l.key1 = o.key1 and
                o.d > current_date + 300 and
                l.d < current_date + 700
            group by o.key1, o.d
            order by revenue, o.d
    ) as t1

Query 2, which is a pure join between li and ord, looks like

select o.key1, l.key1, o.d
            from li l, ord o
            where l.key1 = o.key1 and
                o.d > current_date + 300 and
                l.d < current_date + 700

The time required to execute these two queries was measured using EXPLAIN ANALYZE. We varied the number of rows per partition as well as the number of partitions.
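
As a sketch of how each configuration was exercised (taking Query 2 as an example), the partitionwise GUCs were the knobs distinguishing the partitioned runs with and without PWJ, as also described with the graphs below:

-- partitioned tables, partitionwise operations disabled
SET enable_partitionwise_join = off;
SET enable_partitionwise_aggregate = off;
EXPLAIN ANALYZE
SELECT o.key1, l.key1, o.d
  FROM li l, ord o
 WHERE l.key1 = o.key1
   AND o.d > current_date + 300
   AND l.d < current_date + 700;

-- partitioned tables, partitionwise operations enabled
SET enable_partitionwise_join = on;
SET enable_partitionwise_aggregate = on;
EXPLAIN ANALYZE
SELECT o.key1, l.key1, o.d
  FROM li l, ord o
 WHERE l.key1 = o.key1
   AND o.d > current_date + 300
   AND l.d < current_date + 700;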

The execution times for the queries are given in the tables below.

Table 1: 10K rows per partition (average execution times in ms)

 No. of partitions | Q1 unpartitioned | Q1 partitioned, no PWJ | Q1 partitioned, PWJ | Q2 unpartitioned | Q2 partitioned, no PWJ | Q2 partitioned, PWJ
-------------------+------------------+------------------------+---------------------+------------------+------------------------+---------------------
                 5 |            83.05 |                  93.29 |               53.68 |            48.83 |                  60.55 |               50.85
                10 |           195.87 |                 221.33 |               90.24 |           104.40 |                 129.06 |              105.20
                50 |         1,183.25 |               1,487.00 |              432.07 |           584.31 |                 723.90 |              527.97
               100 |         2,360.19 |               3,001.81 |              888.46 |         1,342.69 |               1,595.53 |            1,053.91
               500 |        11,968.68 |              15,220.69 |            4,350.62 |         6,903.91 |               8,082.09 |            5,381.46
              1000 |        33,772.31 |              31,090.57 |            8,847.61 |        16,461.44 |              17,646.42 |           10,875.05

Table 2: 100K rows per partition (average execution times in ms)

 No. of partitions | Q1 unpartitioned | Q1 partitioned, no PWJ | Q1 partitioned, PWJ | Q2 unpartitioned | Q2 partitioned, no PWJ | Q2 partitioned, PWJ
-------------------+------------------+------------------------+---------------------+------------------+------------------------+---------------------
                 5 |         1,157.23 |               1,489.53 |              514.68 |           609.81 |                 773.31 |              582.07
                10 |         2,326.40 |               2,990.32 |            1,041.11 |         1,375.69 |               1,597.55 |            1,152.33
                50 |        11,899.34 |              15,181.49 |            4,792.88 |         7,196.35 |               8,446.64 |            5,828.54
               100 |        24,139.10 |              30,660.87 |            9,594.33 |        14,277.53 |              16,753.36 |           11,512.35
               500 |       153,922.35 |             165,550.06 |           50,308.85 |        74,387.34 |              85,175.79 |           58,282.17
              1000 |       313,534.59 |             338,718.63 |          131,482.31 |       203,569.14 |             132,688.60 |          123,643.18


The same numbers are easier to understand in the form of graphs. The graphs below depict the average execution time of each of these queries as the number of partitions varies. In each graph the Y-axis shows the execution time on a logarithmic scale and the X-axis shows the number of partitions. The blue line shows query execution times when the tables are not partitioned. The red line shows execution times when the tables are partitioned but partitionwise join and aggregation are not used (both enable_partitionwise_join and enable_partitionwise_aggregate turned OFF). The yellow line shows execution times when the tables are partitioned and partitionwise join and aggregate are used.

Note that the Y-axis denoting the execution time uses a logarithmic scale, so a constant vertical gap on that axis represents an improvement by an integer multiple rather than by a fraction. For example, Q1's execution time improves almost 4 times when the tables are partitioned and partitionwise join and aggregate are enabled.

Graph 1


Graph 2

Graph 3

Graph 4

Key takeaways

The graphs above make it clear that when data sizes are very large, partitioning can also be used as a query optimization technique, along with its other advantages. Some key points:

  1. When the total data size reaches the order of millions of rows, partitioning can be considered as a query optimization strategy. The exact number of partitions and the average number of rows per partition do not make much difference; we see similar performance whether 5M rows are divided into 500 partitions or 50 partitions.
  2. The exact thresholds depend upon the properties of the data and the queries, e.g. the size of each row, the columns used in the query, the operations performed by the query, etc.
  3. Since these optimization techniques depend heavily upon the partition key, choosing the right partition key is very important.
  4. When the tables are partitioned, queries perform better when partitionwise operations are used, irrespective of the data size.

Every workload is different. The charts above provide some guidance, but experimenting with the size and number of partitions as well as the partition key is important to know whether partitioning will help optimize queries in your application. Experimentation shouldn't be an issue anymore: EDB's BigAnimal platform allows its users to experiment quickly without requiring a large upfront investment.

Wednesday, June 14, 2023

PostgreSQL internals development with VSCode

In my brief stint with Hive (which resulted in this and this blog), I used IntelliJ IDEA. I was reintroduced to the marvels of IDEs and how easy they make a developer's life. I had used Visual Studio, Turbo C and many other language-specific IDEs back in my college days. But once I started working exclusively with C and Linux, I was confined to vim, gdb, cgdb and at the most ddd. (I didn't use emacs, but I hear that's a cool IDE as well.) I had kind of forgotten how comfortable it is to work in an IDE. Those tools are certainly great and, if one spends enough time on them, they can be better than any of the IDEs out there, but there's a sharp learning curve. So I was reminded of that comfort and sorely missed a good IDE when I started working with PostgreSQL again. But by then VSCode had been made available on Linux. It's not as fantastic as IntelliJ or GoLand, but it's good enough to improve a C developer's life, not to mention efficiency.

I like a. the ability to edit, browse and debug code simultaneously, b. all the contextual, language-specific auto-suggestions, c. and the ease of code navigation. I sorely miss vim's Ctrl+t and Ctrl+] tag stacking, but otherwise VSCode has a vim emulator too. I am yet to explore and utilize other features like git integration.

In this blog we will see how to configure VSCode for PostgreSQL internals development, including the development of extensions and proprietary forks. We will talk about two things: 1. how to configure make so that code browsing, navigation, error highlighting and auto-suggestions are sensible, and 2. how to configure a debugger. These are the two things I struggled with when it came to working with PostgreSQL code in VSCode. Otherwise, you will find plenty of references on C/C++ development in VSCode, like this, this and this.

Please feel free to send me suggestions and corrections. Drop your VSCode tips and tricks, or suggest a topic I can cover in a future blog.

1. Getting started with PostgreSQL code

I have a script which clones the PostgreSQL GitHub repository and runs configure. Assume that the code is cloned in the "$HOME/vscode_trial/coderoot/pg" directory. coderoot will contain all the VSCode-specific files and directories, whereas coderoot/pg will contain purely the PG clone. I am using the VSCode version shown in the image, on Ubuntu 22.04. I start by clicking the VSCode icon in the application tray.


Open the coderoot folder using File->Open Folder. Save the workspace in the coderoot folder using File->Save Workspace As. Add the folder coderoot/pg using File->Add Folder to Workspace.

2. Setting up make

PostgreSQL uses Makefiles to build and manage binaries, whereas VSCode uses CMake by default, so you will need to configure its build tasks to use Make instead. I have my own scripts to build PostgreSQL, so I don't need the tool to build binaries per se. But when we point VSCode to PostgreSQL's Makefile, its IntelliSense uses the Makefile and does a better job at code navigation, error detection and auto-suggestion.


Please install the Makefile Tools extension so that VSCode can use make. Point it to the PostgreSQL Makefile by selecting the options highlighted in the image below.

You will find the Makefile Tools extension button in the left side bar. Click it to configure the default tasks or to build binaries. The tool is smart and picks up all the make targets from the Makefile hierarchy. Click the "pencil" icon against "Build target" to choose the target you want and then click the "bug" icon at the top, highlighted in the image below. This will run make install. You may ignore an error about the launch task not being configured.

3. Debugging a PostgreSQL backend

This configuration baffled me a lot, especially in the newer versions of VSCode. Playing with the run and build symbols on the left hardly brought any success. The trick is to open a source file, any .c file really, and then click the configuration symbol (highlighted in the image) to configure debug tasks. I chose "C/C++: gcc build and debug active file".
This will open the launch.json file in the workspace. Click the Add Configuration button at the bottom right. Most of the time I have to debug a running backend, which requires configuring the "C/C++: gdb (Attach)" option. Add the processId value as shown below. Also provide the path to the postgres binary as program.

Click on the Run and Debug option in the left bar and choose the "(gdb) Attach" option at the top, as shown in the image below.

In order to debug a given backend, click Run->Start Debugging. This should pop up a list of PIDs; choose the relevant one from the list (you may want to search for postgres). This will attach gdb to that backend and you are ready to go. Enjoy all the blessings of debugging via a GUI as described in the documentation here.

More on debugging through VSCode is here.

4. TAP tests

PostgreSQL code includes TAP tests written in Perl. You will need the Perl Navigator and the Perl language server and debugger extensions. The second extension is only required if you want to debug TAP tests.

Wednesday, January 26, 2022

Advanced partition matching for partition-wise join

Earlier I had written a blog about partition-wise join in PostgreSQL. In that blog I talked about an advanced partition matching technique which allows partition-wise join to be used in more cases. In this blog we will discuss this technique in detail. I suggest reading my blog on basic partition-wise join again to get familiar with the technique.

The basic partition matching technique allows a join between two partitioned tables to be performed using the partition-wise join technique only if the two partitioned tables have exactly the same partitions (more precisely, exactly matching partition bounds). For example, consider two partitioned tables prt1 and prt2

\d+ prt1
... [output clipped]
Partition key: RANGE (a)
Partitions: prt1_p1 FOR VALUES FROM (0) TO (5000),
            prt1_p2 FOR VALUES FROM (5000) TO (15000),
            prt1_p3 FOR VALUES FROM (15000) TO (30000)
and

\d+ prt2
... [ output clipped ]
Partition key: RANGE (b)
Partitions: prt2_p1 FOR VALUES FROM (0) TO (5000),
            prt2_p2 FOR VALUES FROM (5000) TO (15000),
            prt2_p3 FOR VALUES FROM (15000) TO (30000)

A join between prt1 and prt2 is executed as a join between the matching partitions, i.e. prt1_p1 joins prt2_p1, prt1_p2 joins prt2_p2 and prt1_p3 joins prt2_p3. This has many advantages, as discussed in my previous blog.
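
For instance, with partition-wise join enabled, the plan for an equi-join on the partition keys breaks up into per-partition joins. A sketch (prt1 and prt2 are partitioned on columns a and b respectively, as shown above):

SET enable_partitionwise_join = on;
EXPLAIN (COSTS OFF)
SELECT *
  FROM prt1 t1
  JOIN prt2 t2 ON t1.a = t2.b;
-- The resulting plan is an Append of three child joins:
-- prt1_p1 with prt2_p1, prt1_p2 with prt2_p2, and prt1_p3 with prt2_p3.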

But basic partition matching can not join two partitioned tables with different partitions (more precisely, different partition bounds). For example, in the above case, if prt2 had an extra partition prt2_p4 FOR VALUES FROM (30000) TO (50000), a join between prt1 and prt2 would not use partition-wise join. Many applications use partitions to segregate actively used and stale data, a technique I discussed in another blog. The stale data is eventually removed by dropping partitions, and new partitions are created to accommodate fresh data. Let's say the partitioning schemes of two such tables are such that they usually have matching partitions. But when an active partition gets added to one of these tables, or a stale one gets dropped, they will have mismatched partitions for a small duration. We don't want a join hitting the database during this small window to perform badly just because it can not use partition-wise join. The advanced partition matching algorithm helps here.

Advanced partition matching algorithm

Advanced partition matching is very similar to the merge join algorithm. It takes the sorted partition bounds and finds matching partitions by comparing the bounds from both tables in their sorted order. Any two partitions, one from each partitioned table, whose bounds match exactly or overlap are considered to be joining partners, since they may contain rows that join. Continuing with the above example:

\d+ prt1
... [output clipped]
Partition key: RANGE (a)
Partitions: prt1_p1 FOR VALUES FROM (0) TO (5000),
            prt1_p2 FOR VALUES FROM (5000) TO (15000),
            prt1_p3 FOR VALUES FROM (15000) TO (30000)
and
\d+ prt2
... [ output clipped ]
Partition key: RANGE (b)
Partitions: prt2_p1 FOR VALUES FROM (0) TO (5000),
            prt2_p2 FOR VALUES FROM (5000) TO (15000),
            prt2_p3 FOR VALUES FROM (15000) TO (30000),
            prt2_p4 FOR VALUES FROM (30000) TO (50000)

Similar to the basic partition matching algorithm, this will join prt1_p1 and prt2_p1, prt1_p2 and prt2_p2, and prt1_p3 and prt2_p3. But unlike basic partition matching, it also knows that prt2_p4 does not have any join partner in prt1. Thus if the join between prt1 and prt2 is an INNER join, or prt2 is the INNER relation of the join, the join will contain only three child joins, leaving prt2_p4 aside. In PostgreSQL, for a join where prt2 is the OUTER relation, we won't be able to use partition-wise join; we will come back to this when we discuss outer joins further.

This is simple, right? But consider another example, of list partitioned tables:
\d+ plt1_adv
Partition key: LIST (c)
Partitions: plt1_adv_p1 FOR VALUES IN ('0001', '0003'),
            plt1_adv_p2 FOR VALUES IN ('0004', '0006'),
            plt1_adv_p3 FOR VALUES IN ('0008', '0009')

and

\d+ plt2_adv
Partition key: LIST (c)
Partitions: plt2_adv_p1 FOR VALUES IN ('0002', '0003'),
            plt2_adv_p2 FOR VALUES IN ('0004', '0006'),
            plt2_adv_p3 FOR VALUES IN ('0007', '0009')

Observe that there are exactly three partitions in both relations, and the list corresponding to plt1_adv_p2 exactly matches that of plt2_adv_p2, but the other two pairs of partitions do not have exactly matching lists. The advanced partition matching algorithm determines that plt1_adv_p1 and plt2_adv_p1 have overlapping lists and that their lists do not overlap with any other partition of the other relation; similarly for plt1_adv_p3 and plt2_adv_p3. Thus it allows a join between plt1_adv and plt2_adv to be executed as a partition-wise join by joining their matching partitions. The algorithm can find matching partitions in even more complex sets of partition bounds.

The problem with outer joins

Outer joins pose a particular problem in the PostgreSQL world. Consider again the example of prt2 LEFT JOIN prt1. prt2_p4 does not have a joining partner in prt1, and yet the rows in that partition will be part of the join result since they come from the outer side, albeit with the columns from prt1 all NULLed. Usually in PostgreSQL, when the INNER side is empty, it is represented by a "dummy" relation which emits no rows but still knows the schema of that relation. Without partition-wise join, a "concrete" relation which has some presence in the original query turns dummy, so the planner has "something" to join the outer relation with and doesn't have to do anything extra when such outer joins occur. But when there is no matching inner partition for an outer partition, e.g. prt2_p4, there is no entity which can represent the "dummy" inner side of that child join, and PostgreSQL currently has no way to induce such "dummy" relations during planning. That shouldn't really be required, though: ideally such a join with an empty inner side only needs the schema of the inner relation, not an entire relation. Once we build support to execute such a join with a solid outer relation and just the schema of the inner relation, we will be able to tackle partition-wise join where there are no matching partitions on the inner side. Hopefully we will solve that problem some time soon.
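
To illustrate, a query like the following, with prt2 (the table with the unmatched partition prt2_p4) on the outer side, can not be executed as a partition-wise join today. A sketch, using the column names from the examples above:

SET enable_partitionwise_join = on;
EXPLAIN (COSTS OFF)
SELECT *
  FROM prt2 t2
  LEFT JOIN prt1 t1 ON t2.b = t1.a;
-- prt2_p4 has no matching partition in prt1 to supply the NULL-extended side,
-- so the planner falls back to a non-partition-wise join.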

Conversely, when a partition on the inner side has no matching partition on the outer side, that inner partition does not contribute to the result of the join and can simply be ignored. So partition-wise joins where the unmatched partitions are all on the inner side are not a problem at all.

Multiple matching partitions

When the partitioned tables are such that multiple partitions from one side match one or more partitions on the other side, partition-wise join simply bails out, since there is currently no way to induce an "Append" relation during planning which represents two or more partitions together. Hopefully we will remove that limitation as well sometime.

Curious case of hash partitioned tables

Advanced partition matching doesn't make much sense for hash partitioned tables, since the partitions of two hash partitioned tables using the same modulus always match, and when the moduli differ, the data from a given partition of one table can find join partners in all the partitions of the other, rendering partition-wise join ineffective.

Even with all these limitations, what we have today is a very useful solution which serves most practical cases. Needless to say, this feature works seamlessly with FDW join push-down, adding to the sharding capabilities that PostgreSQL already has!

Monday, February 4, 2019

New addition to the Zoo: Metastore

Hive Metastore

Hive requires a metastore service which manages metadata about the data managed by Hive. It is backed by persistent storage in the form of tables in an RDBMS. There are multiple ways to configure a metastore. A metastore service can run in the same JVM as the Hive server, when it is configured as an embedded metastore. A remote metastore, on the other hand, is configured to run in a separate JVM or even on a separate host. Such a remote metastore can be shared by multiple Hive server instances and is also available to other applications. More about configuring a remote metastore can be found here. For high availability, load balancing and scalability, more than one metastore service (all serving a consistent view of the same metadata) can be started, with their URIs listed in "hive.metastore.uris" (notice the plural "uris"). A metastore client (running from within a Hive server) uses the URIs specified in the configuration "hive.metastore.uris" to connect to any one of the remote metastore servers listed there. In the process it may try multiple URIs if some of them are unreachable for various reasons.

Discovering metastore services

As long as the set of metastore services remains the same, this works fine. But in the real world that's not guaranteed: a metastore service can stop working, the network has its own whims, and the platforms hosting metastore services can themselves go crazy. Nor is the demand for the service static; we may require more metastore services when demand increases. When the set of metastore services changes, the list in "hive.metastore.uris" has to be changed manually so that the client is aware of new metastore servers that have come up, or to avoid time wasted by a client trying to connect to a failed metastore server, or to avoid connecting to an already loaded service. Obviously this has all the faults of any service requiring manual intervention: there can be long delays in detecting a failed or newly added metastore, and further delays in updating the list of URIs. What if the client had access to a list of URIs that is always up-to-date?

Metastore service comes to the Zoo

That's why the metastore has been admitted to the Zoo! HIVE-20794 added a Hive metastore feature to use ZooKeeper for dynamic metastore service discovery. ZooKeeper is a centralized service for maintaining configuration information, naming, and providing group services. When a metastore server is configured to use ZooKeeper, it registers itself with ZooKeeper on startup. When a new metastore server is started to meet higher demand, ZooKeeper makes a note of it. ZooKeeper keeps a watch on the status of each metastore service and deregisters it when it becomes unavailable. Thus ZooKeeper knows when a new metastore service becomes available and when an existing one becomes unavailable, keeping the list of metastore services up-to-date. A metastore client, when configured to use ZooKeeper, connects to the ZooKeeper ensemble, gets the up-to-date list of registered metastores and connects to any one of those at random, balancing the load across metastores. It's not necessary that the client connects to the first metastore in the list. It does not waste cycles trying to connect to a failed metastore, and it learns about new metastores automatically.


How to use this feature

You need a single ZooKeeper or a ZooKeeper ensemble set up and running in order to use this feature. Here's how you can do that. Once it's set up, change the metastore configuration as described below at all the sites where Hive servers and metastore services are running. The configuration needs to be consistent across all these sites.
  • Set "metastore.service.discovery.mode" to "zookeeper" so that the client and server know that ZooKeeper has been configured.
  • Set "hive.metastore.uris" to the URI (for a single ZooKeeper) or URIs (for an ensemble) of the ZooKeeper cluster. Yes, this is the same old configuration you would use to list metastore URIs earlier, and you can continue to use it in the old way if you don't want to use ZooKeeper. When all the ZooKeeper instances in an ensemble use the same port, one can specify only the host URIs in "hive.metastore.uris" and specify the port in "metastore.zookeeper.client.port".
  • Set "metastore.zookeeper.namespace" to any string; it is used as the root under which the metastore services are listed in ZooKeeper's hierarchical namespace. Once set, users don't need to worry about it much.
  • Some more configurations like "metastore.zookeeper.connection.timeout", "metastore.zookeeper.connection.max.retries" and "metastore.zookeeper.connection.basesleeptime" are used as parameters of the ZooKeeper connection. You may want to leave those untouched, or set different values at different sites depending upon the network bandwidth and other characteristics of those sites.
Know more about the feature and its usage here.

Saturday, June 23, 2018

Upgrade your partitioning from inheritance to declarative

Before PostgreSQL 10, Postgres users partitioned their data using inheritance-based partitioning. The method used constraints to define the partitions and rules or triggers to route data to the appropriate partition, and a user had to write and maintain code for all of that. PostgreSQL 10 introduced declarative partitioning, which is much easier to set up and requires almost no maintenance. PostgreSQL 11 adds a number of partitioning-related enhancements that work with declarative partitioning. Users who have implemented inheritance-based partitioning will want to move to declarative partitioning (after upgrading to v11, of course) to benefit from those features. Here's how they can do so.

Example setup

You may have created a parent table and several child tables, one per partition, along with the triggers, rules and constraints required to define the partitions and route the data. Here's an example setup similar to the one described in the PostgreSQL documentation; a sketch of the routing code follows the table definitions.

\d+ measurement
                                Table "inh_part.measurement"
  Column   |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
-----------+---------+-----------+----------+---------+---------+--------------+-------------
 city_id   | integer |           | not null |         | plain   |              | 
 logdate   | date    |           | not null |         | plain   |              | 
 peaktemp  | integer |           |          |         | plain   |              | 
 unitsales | integer |           |          |         | plain   |              | 
Child tables: measurement_y2006m02,
              measurement_y2006m03,
              measurement_y2006m04,
              measurement_y2006m05,
              measurement_y2006m06

-- here's how a child looks like
\d+ measurement_y2006m03
                            Table "inh_part.measurement_y2006m03"
  Column   |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
-----------+---------+-----------+----------+---------+---------+--------------+-------------
 city_id   | integer |           | not null |         | plain   |              | 
 logdate   | date    |           | not null |         | plain   |              | 
 peaktemp  | integer |           |          |         | plain   |              | 
 unitsales | integer |           |          |         | plain   |              | 
Indexes:
    "measurement_y2006m03_logdate" btree (logdate)
Check constraints:
    "measurement_y2006m03_logdate_check" CHECK (logdate >= '2006-03-01'::date AND logdate < '2006-04-01'::date)
Inherits: measurement
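
The routing piece of such a setup typically looks something like the following (a sketch modeled on the inheritance partitioning example in the PostgreSQL documentation; only two branches shown):

CREATE OR REPLACE FUNCTION measurement_insert_trigger() RETURNS trigger AS $$
BEGIN
    IF NEW.logdate >= DATE '2006-02-01' AND NEW.logdate < DATE '2006-03-01' THEN
        INSERT INTO measurement_y2006m02 VALUES (NEW.*);
    ELSIF NEW.logdate >= DATE '2006-03-01' AND NEW.logdate < DATE '2006-04-01' THEN
        INSERT INTO measurement_y2006m03 VALUES (NEW.*);
    -- ... one branch per child table ...
    ELSE
        RAISE EXCEPTION 'Date out of range. Fix measurement_insert_trigger()!';
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER insert_measurement_trigger
    BEFORE INSERT ON measurement
    FOR EACH ROW EXECUTE PROCEDURE measurement_insert_trigger();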

Moving to declarative partitioning

One could simply create a partitioned table and the required number of partitions, then create indexes and other objects on this partitioned table (except the constraints, rules and triggers used for inheritance partitioning), and then copy the data from the inheritance parent to the partitioned table using INSERT INTO ... SELECT. A user may optimize the data movement by copying data from each child table to the corresponding partition, again using INSERT INTO ... SELECT. But PostgreSQL offers something better: the ability to ATTACH an existing table as a partition of a partitioned table. This method is faster than the others since no data movement is involved. In an experiment I ran with a few MBs of partitioned data, it was 2x faster. As the data grows, the data movement takes longer, even if you move data from child tables to partitions. The time to ATTACH child tables as partitions, however, doesn't increase with the size of the data. Here are the steps.

Step 0

Take a backup of the inheritance parent table and all the child tables. A database-level backup would be awesome! This step is optional but very important, so that you can restore the data in case something goes wrong while performing the next steps.

Step 1

Start a transaction, so that everything gets rolled back in case of an error.
BEGIN TRANSACTION;

Step 2

Create the partitioned table with the same definition as the inheritance parent, annotating the CREATE TABLE command with a PARTITION BY clause. You will need to specify the columns or expression to use as the partition key in the PARTITION BY clause, but those should be apparent from the constraints on the inheritance children. For example, in the above setup, the constraints are all based on the column 'logdate', which is the intended partition key. The partition key may not be so evident if there's a spaghetti of constraints surrounding each child table. If the constraints, rules or triggers are well documented, it should not be difficult to spot the partition key; if not, a deeper examination of these objects will reveal it.

CREATE TABLE measurement_part (
    city_id         int not null,
    logdate         date not null,
    peaktemp        int,
    unitsales       int
) PARTITION BY RANGE (logdate);

Step 3

We need to add the child tables as partitions of the partitioned table using ALTER TABLE ... ATTACH PARTITION. To do that, first remove each child table from the inheritance hierarchy using the NO INHERIT clause. For example,

ALTER TABLE measurement_y2006m02 NO INHERIT measurement; 

Step 4

Craft the FOR VALUES clause from the constraints of a given child table. This should be straightforward if the partition key has been correctly identified. Then run an ALTER TABLE ... ATTACH PARTITION command as below for each of the child tables.

ALTER TABLE measurement_part ATTACH PARTITION measurement_y2006m02 FOR VALUES FROM ('2006-02-01') TO ('2006-03-01');

You may carry out steps 3 and 4 together for each child, or perform step 3 for all children followed by step 4 for all children. Do not drop the constraints on the child tables which have been converted into FOR VALUES clauses until you are done with these steps. If you keep them while carrying out the ATTACH step and also set client_min_messages to INFO, you will see messages like

INFO:  partition constraint for table "measurement_y2006m02" is implied by existing constraints

Usually when we attach a table as a partition, the table is scanned to check whether it contains any rows which would not fit that partition (to be specific, which would not fit that partition's bounds). This scan is avoided if the table has a constraint, or constraints, that imply the partition bounds. By retaining the original constraints, we avoid the scan, saving significant I/O and CPU time.

Step 5

Starting with PostgreSQL 11, users can create indexes on the partitioned table and the partitions automatically "inherit" those. The system is intelligent enough not to create an index on a partition if it already has one similar to the index on the partitioned table. In our example, all the child tables already had the required index, so we just create an index on the partitioned table so that the optimizer knows about it.

CREATE INDEX measurement_logdate ON measurement(logdate);

Step 6

There may be views, constraints, triggers or other SQL objects defined on the inheritance parent. PostgreSQL associates a table's OID with the objects created on it. Since the partitioned table's OID is different from the inheritance parent's, the old views or triggers still point to the inheritance parent, even if the partitioned table ends up with the same name as the inheritance parent (after the inheritance parent itself is renamed). So they won't work as they are and need to be recreated on the partitioned table.
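
For example, a hypothetical reporting view that was defined on top of the inheritance parent could be recreated against the partitioned table like this (the view name and definition are purely illustrative; OR REPLACE keeps the view's name while repointing it at the new table):

CREATE OR REPLACE VIEW monthly_sales AS
    SELECT city_id,
           date_trunc('month', logdate) AS month,
           sum(unitsales) AS total_sales
      FROM measurement_part
     GROUP BY city_id, date_trunc('month', logdate);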

It would actually help if PostgreSQL had a command like ALTER TABLE ... PARTITION BY ... to convert a regular table into a partitioned table. But that's easier said than done. I hope somebody puts in the significant effort needed to implement that command.

Step 7

Drop the inheritance parent and all the objects created on it; DROP TABLE ... CASCADE might help here. These should be the same objects, except the partitioning constraints, that were recreated in step 6 on the partitioned table. This allows us to rename the partitioned table with the same name as the inheritance parent, so that queries, procedures and functions work on the partitioned table instead of the inheritance parent.

DROP TABLE measurement CASCADE;
ALTER TABLE measurement_part RENAME TO measurement;

Step 8

Now drop the partitioning constraints present on the child tables, which are now partitions of the partitioned table and no longer need those constraints. You may perform this step right after step 4, but delaying it allows those constraints to be used in the later steps if necessary.

ALTER TABLE measurement_y2006m06 DROP CONSTRAINT  measurement_y2006m06_logdate_check;

Step 9

Run some sanity tests before committing the transaction. For example, check the output of the \d+ command on the partitioned table and the individual partitions. Make sure the tests don't throw errors when everything is actually right, lest everything we did till now gets rolled back.

COMMIT TRANSACTION;

Your partitioned table is now ready.

Planning queries involving foreign PostgreSQL tables

Cost based optimization

A query may be executed in many different ways, modelled as plans in the query optimizer, differing in the resources required and/or the execution time. A typical DBMS's query optimizer tries to find all the possible plans for executing a given query and chooses the fastest among them. But it's not possible to calculate the time required by a plan unless the query is executed, so an optimizer associates an estimate of execution time with each possible plan and chooses the one with the least estimated value. PostgreSQL is no different: it associates a cost with each possible plan. The cost is a rough estimate of the time required to execute the query, and the plan with the lowest cost is chosen for execution. The time required to execute a query is the sum of the times required to perform the various operations involved in the plan, e.g. the time required to scan the tables in the query, the time required to compute joins, etc. Thus a plan's cost is the sum of the costs of the operations involved in the plan. In order to estimate the cost of a plan efficiently and correctly, PostgreSQL maintains statistics about the sizes of tables and indexes, the values stored in various columns of tables, and so on. In a DBMS where data keeps changing, these statistics often get stale and need to be updated. PostgreSQL keeps the statistics up-to-date by frequently sampling the tables. This works reasonably well as long as the tables involved in the query are part of the DBMS.

But nowadays applications often run queries which require data external to the DBMS. PostgreSQL supports querying external data through a Foreign Data Wrapper (FDW for short), a method based on the SQL/MED standard. In this post we will discuss the methods employed by the query optimizer to plan such queries, and the methods used to maintain statistics about external data, especially data residing on other PostgreSQL servers.

Foreign tables and statistics

PostgreSQL allows external data to be represented as "foreign tables". While PostgreSQL scans local regular tables frequently to keep their statistics up-to-date, it can not do so for a foreign table, since accessing the external data itself consumes precious network bandwidth and might take longer than accessing local data; if the foreign table is not accessed frequently, that network I/O would be wasted. Hence PostgreSQL does not sample external data or foreign tables by itself. Instead a user is required to run the ANALYZE command on a foreign table periodically. As part of this command the FDW brings a sample of the external data to PostgreSQL, which in turn derives the required statistics from it.
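
A minimal sketch of that workflow, with a hypothetical foreign server, user mapping and column list:

CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER foreign_pg FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', dbname 'postgres');

CREATE USER MAPPING FOR CURRENT_USER SERVER foreign_pg
    OPTIONS (user 'remote_user', password 'secret');

CREATE FOREIGN TABLE ft1 (c1 integer, c2 integer)
    SERVER foreign_pg OPTIONS (schema_name 'public', table_name 't1');

-- Foreign tables are not sampled automatically; refresh their statistics by hand:
ANALYZE ft1;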

Costing queries involving foreign tables

When a query involves foreign tables, the PostgreSQL optimizer works with the corresponding FDWs to produce various plans for that query. For example, if the query has a join between two foreign tables which use the same FDW, PostgreSQL has two choices.
  1. Fetch the foreign table data from the foreign server (optionally applying any conditions at the foreign server) and perform the join locally. In this case, the FDW is responsible for costing the scans on the foreign server.
  2. If the FDW is capable of performing the join itself, delegate the join to the FDW. In most cases this means that the two foreign tables reside on the same foreign server, which can perform the join in a way that produces the same result as PostgreSQL. In this case, the FDW is responsible for computing the cost of the join.
Similar logic applies to other SQL operations like grouping and sorting. An FDW may use the statistics collected by PostgreSQL, use PostgreSQL's costing methods, or employ entirely different methods to collect statistics and/or compute costs. Each FDW may implement its own costing model, but it is expected to produce costs that are consistent with the rest of the optimizer. Next we take the example of postgres_fdw.

postgres_fdw costing model

postgres_fdw is used for accessing external data residing on another PostgreSQL server. It uses two different modes for computing costs, governed by the option "use_remote_estimate" (read more about this option here; a sketch of how to set it follows the list).
  1. When "use_remote_estimate" is true, postgres_fdw fetches the costs from the foreign server using EXPLAIN.
  2. When it's false, postgres_fdw computes the costs based on the statistics about the external data available locally.
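
The option can be set per foreign server or per foreign table; for instance, assuming the server and table names from the sketch above:

ALTER SERVER foreign_pg OPTIONS (ADD use_remote_estimate 'true');
-- or only for a specific foreign table:
ALTER FOREIGN TABLE ft1 OPTIONS (ADD use_remote_estimate 'true');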

The second method works fine for a simple scan on a foreign table as long as the statistics about the foreign table are kept up-to-date. But it doesn't do justice to complex operations like joins and grouping, whose performance depends upon a number of factors, such as the availability of suitable indexes and the memory available for hash tables or sorting, which are not covered by the statistics.

Take, for example, the following query involving two foreign tables ft1 and ft2, pointing to tables t1 and t2 on the foreign server and each having columns c1 to c8. The plan with "use_remote_estimate" disabled looks like:
explain (analyze) select * from ft1 t1 join ft2 t2 on t1.c1 = t2.c1 and t1.c1 + t2.c1 <= 20;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=247.94..301.01 rows=274 width=108) (actual time=8.289..13.270 rows=8 loops=1)
   Hash Cond: (t1.c1 = t2.c1)
   Join Filter: ((t1.c1 + t2.c1) <= 20)
   Rows Removed by Join Filter: 814
   ->  Foreign Scan on ft1 t1  (cost=100.00..137.66 rows=822 width=54) (actual time=1.223..5.918 rows=822 loops=1)
   ->  Hash  (cost=137.66..137.66 rows=822 width=54) (actual time=7.050..7.050 rows=822 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 83kB
         ->  Foreign Scan on ft2 t2  (cost=100.00..137.66 rows=822 width=54) (actual time=1.203..6.684 rows=822 loops=1)
 Planning Time: 0.307 ms
 Execution Time: 13.887 ms
(10 rows)

When we enable "use_remote_estimate" for both the tables, the plan changes to
explain (analyze) select * from ft1 t1 join ft2 t2 on t1.c1 = t2.c1 and t1.c1 + t2.c1 <= 20;
                                            QUERY PLAN                                            
--------------------------------------------------------------------------------------------------
 Foreign Scan  (cost=131.47..172.63 rows=274 width=108) (actual time=2.153..2.154 rows=8 loops=1)
   Relations: (public.ft1 t1) INNER JOIN (public.ft2 t2)
 Planning Time: 3.755 ms
 Execution Time: 2.336 ms
(4 rows)

Observe the difference in execution and planning times. When "use_remote_estimate" is true, the planning time is ten times that with "use_remote_estimate" off, but the execution time is almost 6 times less. The planning time increases so much because postgres_fdw fires many EXPLAIN queries on the foreign server while costing the possible plans for the join. If you have turned auditing ON on the foreign server, you will find that, for planning the above query, postgres_fdw fired the following EXPLAIN commands on the foreign server.

  • EXPLAIN SELECT c1, c2, c3, c4, c5, c6, c7, c8 FROM t1
  • EXPLAIN SELECT c1, c2, c3, c4, c5, c6, c7, c8 FROM t2
  • EXPLAIN SELECT c1, c2, c3, c4, c5, c6, c7, c8 FROM t1 ORDER BY c1 ASC NULLS LAST
  • EXPLAIN SELECT c1, c2, c3, c4, c5, c6, c7, c8 FROM t1 WHERE (((c1 + ((SELECT null::integer)::integer)) <= 20)) AND ((((SELECT null::integer)::integer) = c1))
  • EXPLAIN SELECT c1, c2, c3, c4, c5, c6, c7, c8 FROM t2 ORDER BY c1 ASC NULLS LAST
  • EXPLAIN SELECT c1, c2, c3, c4, c5, c6, c7, c8 FROM t2 WHERE (((((SELECT null::integer)::integer) + c1) <= 20)) AND ((((SELECT null::integer)::integer) = c1))
  • EXPLAIN SELECT r1.c1, r1.c2, r1.c3, r1.c4, r1.c5, r1.c6, r1.c7, r1.c8, r2.c1, r2.c2, r2.c3, r2.c4, r2.c5, r2.c6, r2.c7, r2.c8 FROM (t1 r1 INNER JOIN t2 r2 ON ((((r1.c1 + r2.c1) <= 20)) AND ((r1.c1 = r2.c1))))

Each such EXPLAIN is a network trip to the foreign server, which consumes network bandwidth and time. While turning "use_remote_estimate" ON improved query performance, it drastically increased query planning time. In some cases the planning and execution time taken together may turn out to be the same with and without "use_remote_estimate", which takes away all the benefit of join or grouping push-down, and that isn't good. This may not sound that bad when only some of the queries involve foreign tables, or when the data volumes are so high that the planning time, even including those network trips, is only a tiny fraction of the total execution time. But it matters a lot when we move towards FDW-based built-in sharding, as described in a blog by Robert Haas and in the slides by my friends from Japan and by Bruce Momjian.

We could reduce the network trips, and thus the planning time, by turning "use_remote_estimate" OFF, but then the plan comes out poor, as seen above. This happens because PostgreSQL tries to cost the plans without knowing the capabilities of the foreign server, i.e. the plans that the foreign server can "think of". What if we could combine the best of the two approaches by making the local PostgreSQL "think" like the foreign PostgreSQL server? That's possible, at least in theory.

Costing foreign operations locally

After all, the foreign server in this case is a PostgreSQL server with the same costing model, optimizer and executor as the local PostgreSQL. If the local PostgreSQL knew the values of all the parameters which affect the foreign server's query optimizer, the costs it computes locally would be much closer to the costs it gets from the EXPLAIN output. Here's a rough list of those parameters:
  1. Various GUCs that affect the query optimizer. There are three classes of GUCs: a. GUCs that determine the costs of certain operations, like random_page_cost and cpu_operator_cost; b. GUCs that enable or disable certain planner strategies, like enable_hashjoin; c. GUCs that constrain the resources the query executor can use, e.g. parallel workers and memory.
  2. Statistics about the tables involved in the query. We already have a method to gather statistics on remote tables, but it requires scheduling ANALYZE commands manually. Furthermore, when a foreign table is ANALYZEd, postgres_fdw fetches sample data from the foreign server and derives statistics from it, which consumes network bandwidth and time. It could do better by fetching the statistics themselves from the foreign server; after all, the statistics on the foreign server were collected by a method similar to the one used to derive statistics locally. It would also help if whether or not to ANALYZE a foreign table automatically were left for the user to decide. A user may be happy to spend the network bandwidth to keep the statistics up-to-date; in that case, s/he would let it happen automatically, like for a regular table, rather than set up and maintain a cron job.
  3. Metadata used by the query optimizer. The query optimizer uses its knowledge of the constraints and indexes on a table to create and cost various plans. Right now PostgreSQL supports "declarative constraints" on foreign tables, i.e. constraints which are not enforced but are used by the query optimizer (see the sketch below), but a user needs to set those up themselves. It would be better if postgres_fdw could set them up itself based on the constraints present on the foreign server. PostgreSQL also has no knowledge of the indexes available on the foreign server; it would be better if postgres_fdw supported declarative indexes on foreign tables as well.
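
For instance, a constraint known to hold on the remote table can be declared on the foreign table so that the local planner can use it; PostgreSQL does not verify it against the remote data (the constraint below is a hypothetical example on the ft1 sketched earlier):

ALTER FOREIGN TABLE ft1
    ADD CONSTRAINT ft1_c1_positive CHECK (c1 > 0);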

Empowered with this knowledge, the costs of delegated operations computed locally using the local costing model would be much closer to the actual costs computed at the foreign server. This would eliminate any need to have a "use_remote_estimate" option.

The tricky part is to keep this knowledge about the foreign server up-to-date. The values of GUCs and the metadata may change on the foreign server without the local PostgreSQL knowing about it, but those changes are not that frequent, and syncing them sufficiently often would suffice. Statistics about the foreign data, however, may change rapidly if the transaction rate on the foreign server is high. Keeping them fresh enough to choose optimal plans would be a challenge, but it can be achieved if we allow the foreign server to periodically push that information to the local server over an appropriate wire protocol.