Friday, June 7, 2024

SQL/PGQ and graph theory

The story goes back almost two decades. I was studying computer engineering at College of Engineering Pune (COEP). Prof. Vinayak Joshi had recently joined COEP and had a research interest in discrete mathematics, especially lattices. He needed some software for exploring patterns in graphs. Matlab, though almost three decades old by then, was new to us and quite expensive. He approached my scholar friend Amit Bose and requested him to write a program for exploring the desired graph patterns. If memory serves, Amit wrote the program in C and quite possibly used TurboC as an IDE. That was a remarkable feat given that graphs could consume a huge amount of memory, C is a very basic language and TurboC was clumsy and clunky back then. Maybe Amit knew how to use gcc and Linux, which allowed huge memory models. (I am sure today's software engineers haven't heard about memory models.) In due course we studied discrete mathematics, graph theory and also lattices.

A few years later I joined the post-graduate program in Computer Science and Engineering at IIT Bombay. The institute employed eminent faculty in the area of theoretical computer science, including graph theory. I worked as a teaching assistant to Prof. Ajit Diwan and Prof. Abhiram Ranade. I also took a course run by Prof. Sundar Vishwanathan. That developed my interest in graph theory. But I felt any work in graph theory required a lot of patience (more details to follow) and it was too difficult and intangible. Like many others in my class, I dared not take a research project in the subject.

Graph theory still fascinates me. Graphs are still my first go-to tool to model and solve any problem in profession or life. If graph theory fails I use other tools. Fast forward to the present: when I saw an opportunity to work with graphs in an RDBMS, I immediately grabbed it. Follow the pgsql-hackers thread for more about that work. I don't know whether Prof. Ajit had graph databases in mind when he said to our class, "Do you think that I cannot offer projects in databases? I can offer a lot of them". But that's quite true. In the age of AI and analytics, graph databases are once again a hot topic.

Prof. Sundar prescribed "Introduction to Graph Theory (second edition)" by Douglas West for his course. The exercises in that book were like puzzles for me. I liked to work them out myself. Most of the time I couldn't. One such problem was related to a "king". If memory serves me, it was 1.4.38 on page 66 of that book, which I still have with me. I spent hours trying to prove the theorem but did not succeed. I went to Prof. Sundar for help. He patiently listened to all the things I had tried to solve that problem. He said I was very close to the solution and any hint from him would be as good as the solution itself. He suggested that I sit in his room and try again. After an hour of struggle, I left his room without any success. The problem still haunts me.

I don't know whether Matlab is still expensive and whether Prof. Joshi is still using programs to explore his graphs. But if he is, SQL/PGQ might come in handy. Having it in PostgreSQL means they can use it for free. All the database capabilities allow them to store and retrieve the graphs they have tried in their research. Let's take a simple example, that of a king.

In a digraph, a king is a vertex from which every vertex is reachable by a path of length at most 2. In other words, if a vertex v is a king, it is connected to every other vertex by a path of length at most two. Let's see how to do that with SQL/PGQ. Assume a table "vertexes" which contains all the vertexes of the graph and a table "edges" which contains all the edges in the graph connecting those vertices.

create table vertexes (id int primary key,
                        name varchar(10));
create table edges (id int primary key,
                    src int references vertexes(id),
                    dest int references vertexes(id),
                    name varchar(10));
create property graph tournament
    vertex tables (vertexes default label)
    edge tables (edges source key (src) references vertexes(id)
                        destination key(dest) references vertexes(id)
                        default label);

Let's build the query to find a king step by step. First step would be to find all the nodes reachable from a given node by a path of length at most 2.

select src_name, dest_name
    from graph_table (tournament
                        match (src is vertexes)->{1,2}(dest is vertexes)
                        where (src.id <> dest.id)
                        columns (src.name as src_name, dest.name as dest_name))
    order by src_name;

I have discussed most of the constructs in my previous posts on DBaaG and its components. {1, 2} is a new construct used here; it is a quantifier indicating that the path between src and dest has at least 1 and at most 2 edges. Thus the query lists every src node together with the dest nodes reachable from it by a path of length at most 2. Also notice that we eliminate rows where the src node is reported as its own dest node, to simplify the next step in the query.

Now we have to find the node(s) connected to all the other nodes in this way. To do that we simply count the distinct dest nodes reachable from each src node. If this count equals the number of vertexes in the graph minus one, the corresponding src node is a king.

select src_name, count(distinct dest_name) num_reachable_nodes 
    from graph_table (tournament
                match (src is vertexes)->{1,2}(dest is vertexes)
                where (src.id <> dest.id)
                columns (src.name as src_name, dest.name as dest_name))
    group by src_name
    having count(distinct dest_name) = (select count(*) - 1 from vertexes);

distinct in the count aggregate makes sure each dest node is counted only once. The having clause filters out every node which is not connected to all the other nodes in the graph. If we populate the vertex and edge tables as follows:

insert into vertexes values (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e');
insert into edges values (1, 1, 2, 'a-b'), (2, 2, 3, 'b-c'), (3, 1, 3, 'a-c'), (4, 3, 4, 'c-d'), (5, 4, 5, 'd-e'), (6, 2, 1, 'b-a');

This will create a graph as shown in the figure below. Notice the cycle a->b->a. Node e is reachable from node a by a path of length 3. All other nodes are reachable from node a by a path of length at most 2.

The above query does not return any rows since there is no king in this graph.

Let's remove the outlier node e and the edge connecting nodes d and e.

delete from edges where name = 'd-e';
delete from vertexes where name = 'e';

With that, the above query returns two king nodes, a and b.

 src_name | num_reachable_nodes
----------+---------------------
 a        |                   3
 b        |                   3
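
As a cross-check outside the database, here is a minimal Python sketch of the same king computation. It is not SQL/PGQ and not part of the PostgreSQL work; the function name `kings` and the edge-list encoding are assumptions made only for this illustration. It collects, for every vertex, the vertices reachable in one or two hops and compares the count with the number of vertexes minus one, just like the having clause does.

```python
from collections import defaultdict


def kings(vertices, edges):
    """Return the vertices that reach every other vertex by a path of length <= 2."""
    successors = defaultdict(set)
    for src, dest in edges:
        successors[src].add(dest)

    result = []
    for v in vertices:
        reachable = set(successors[v])       # paths of length 1
        for w in successors[v]:
            reachable |= successors[w]       # paths of length 2
        reachable.discard(v)                 # mirrors "src.id <> dest.id"
        if len(reachable) == len(vertices) - 1:
            result.append(v)
    return sorted(result)


# Same data as the insert statements in the post (edge names dropped).
all_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e"), ("b", "a")]

kings_with_e = kings(["a", "b", "c", "d", "e"], all_edges)
# Remove node e and the edge d->e, as the delete statements do.
kings_without_e = kings(["a", "b", "c", "d"],
                        [edge for edge in all_edges if "e" not in edge])

print(kings_with_e)
print(kings_without_e)
```

The first call finds no king because e is three hops away from a; the second one reports a and b, matching the query output.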

PostgreSQL is loved by developers. I hope the introduction of SQL/PGQ makes it popular among graph theory researchers as well.

Friday, May 3, 2024

Property graphs: elements, labels and properties

A property graph consists of three types of "things" in it: elements, labels and properties. 

Elements are nodes or edges in the graphs. They form the basic structure of a graph. An edge connects two nodes. Two nodes may be connected by multiple edges corresponding to different relationships between them.

Labels classify the elements. An element may belong to multiple classes and thus have multiple labels.

Properties are key-value pairs providing more information about an element. All the elements with the same label expose the same set of keys or properties. A property of a given element may be exposed through multiple labels associated with that element.

Let's use the diagram here to understand the concepts better. There are three elements (the vertexes N1 and N2 and an edge connecting them), labels L1 to L4 and properties P1 to P7. An arrow connecting a label to a property indicates that the label exposes that property. E.g. label L3 exposes properties P2, P3 and P5. Property P1 is exposed by both L1 and L2. An arrow between an element and a label indicates that the label is associated with that element. N1 has labels L1 and L2 whereas the edge has just one label, L4. The properties that are associated with (and are exposed by) an element are decided by the labels associated with it. E.g. the properties P1, P2 and P4, which are the union of the properties associated with labels L1 and L2, are exposed by element N1. P1 has the same value v1 irrespective of which label is considered for this association. E.g. the height of a person will not change whether that person is classified as a teacher, a businessman or a plumber. Similarly, notice that the edge exposes properties P6 and P7 since it is labelled as L4.

SQL/PGQ's path pattern specification language allows us to specify paths in terms of labels, ultimately exposing the properties of the individual paths that obey those patterns. E.g. (a IS L1 | L2)-[]->(b IS L3) COLUMNS (a.P3) will return values of property P3 of all the nodes with labels L1 or L2 that are connected to a node with label L3. Notice that N1 and N2 are the elements associated with either L1 or L2 or both. But N1 does not expose property P3. Hence we might expect the above query to return an error. Instead, the standard specifies that it should report NULL, quite in line with the spirit of SQL NULL, which means unknown.
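
To make the label-to-property lookup concrete, here is a small Python sketch of the model described above. It is only an illustration, not how PostgreSQL or the standard implements it. The split of P2 and P4 between L1 and L2, the values other than v1, and the names `LABEL_PROPERTIES`, `exposed_properties` and `property_value` are all assumptions; the text only states that L1 and L2 together expose P1, P2 and P4.

```python
# Label -> properties it exposes. L3 and L4 follow the text; the exact split
# of P2 and P4 between L1 and L2 is an assumption made for this sketch.
LABEL_PROPERTIES = {
    "L1": {"P1", "P2"},
    "L2": {"P1", "P4"},
    "L3": {"P2", "P3", "P5"},
    "L4": {"P6", "P7"},
}

# Element -> labels associated with it (N2 is left out because the text
# does not spell out its labels).
ELEMENT_LABELS = {"N1": {"L1", "L2"}, "edge": {"L4"}}

# Property values per element; everything except "v1" is a placeholder.
ELEMENT_VALUES = {
    "N1": {"P1": "v1", "P2": "v2", "P4": "v4"},
    "edge": {"P6": "v6", "P7": "v7"},
}


def exposed_properties(element):
    """Union of the properties exposed by all labels of the element."""
    props = set()
    for label in ELEMENT_LABELS[element]:
        props |= LABEL_PROPERTIES[label]
    return props


def property_value(element, prop):
    """Value of a property, or None (SQL NULL) when the element does not expose it."""
    if prop not in exposed_properties(element):
        return None
    return ELEMENT_VALUES[element][prop]


print(sorted(exposed_properties("N1")))
print(property_value("N1", "P1"))
print(property_value("N1", "P3"))
```

Asking N1 for P3 yields None rather than an error, which mirrors the NULL behaviour described above.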


The way I see it, a property cannot exist without at least one label exposing it. A label cannot exist without being associated with at least one element. But once defined, they have quite an independent existence.

Wednesday, April 24, 2024

PostgreSQL's memory allocations

There's a thread on pgsql-hackers about recovering memory consumed by paths. A reference count is maintained in each path. Once paths are created for all the upper level relations that a given relation participates in, any unused paths, whose reference count is 0, are freed. This adds extra code and CPU cycles to traverse the paths, maintain reference counts and free the paths. Yet, the patch did not show any performance degradation. I was curious to know why. I ran a small experiment.

Experiment

I wrote an extension palloc_test which adds two SQL-callable functions, palloc_pfree() and mem_context_free(), written in C. Function definitions can be found here. The first function palloc's some memory and then pfree's it immediately. The other function just palloc's but never pfree's, assuming that the memory will be freed when the per-tuple memory context is freed. Both functions take the number of iterations and the size of memory allocated in each iteration as inputs, in that order. They return the amount of time taken to execute the loop allocating memory. At first glance, the first function spends extra CPU cycles freeing memory and the second one doesn't, so the first one should be slower than the second.

Results

The table below shows the amount of time reported by the respective functions to execute the loop as many times as the value in the first column, each iteration allocating 100 bytes. The figure shows the same as a plot. The time taken to finish the loop increases linearly for both functions, indicating that the palloc logic is O(n) in terms of the number of allocations. But the lines cross each other around 300K allocations.


 

  count  | palloc_pfree | memory context reset
---------+--------------+----------------------
     100 |       0.0029 |             0.007124
  100100 |       2.5646 |             5.079862
  200100 |       5.1682 |            10.375552
  300100 |       7.6373 |            15.704286
  400100 |      10.1827 |            19.038238
  500100 |      12.7013 |            23.847599
  600100 |      15.2838 |            28.708501
  700100 |      17.8255 |            36.982928
  800100 |      20.3718 |            41.863638
  900100 |      23.0706 |            44.332727
 1000100 |      51.3311 |            52.546201
 2000100 |      56.7407 |           104.747792
 3000100 |      76.3961 |           154.225157
 4000100 |     102.3415 |           206.510045
 5000100 |     126.1954 |           256.367685
 6000100 |     155.8812 |           314.178951
 7000100 |     179.9267 |           367.597501
 8000100 |     206.2112 |           420.003351
 9000100 |     234.7584 |           474.137076


Inference and conclusion

This agrees with the observations I posted on the thread. Instead of letting all the useless paths be freed when the query finishes, freeing them periodically during planning is time efficient as well as memory efficient. It compensates for the extra CPU cycles spent to maintain reference counts, traverse and free paths.

The actual memory allocation and freeing pattern as implemented in that patch is different from that in the experiment, so it might be worth repeating those experiments by simulating a similar pattern.

I used a chunk size of 100 since I thought it's closer to the order of the average path size. But it might be worth repeating the experiment with larger chunk sizes to generalize the result.

Tuesday, April 23, 2024

DBaaG with SQL/PGQ

For those who have studied ERD-lore, it's not new that a relational database is very much like a graph. But it has taken SQL more than 30 years since it became a standard, and almost half a century since its inception, to incorporate constructs that allow a DataBase to be treated as a Graph (DBaaG). This is surprising given that SQL was developed as a language for relational databases, which are modeled using ER diagrams. Better late than never. SQL/PGQ has arrived as part 16 of SQL:2023.

Entity Relationship Diagram, ERD in short, is a tool to model and visualize a database as entity types (which classify the things of interest) and relationships that can exist between them. Entity types and the relationships both map to relations in a Relational DataBase Management System (RDBMS in short). The rows in the relations represent entities (instances of entity types) and relationship between entities respectively. Fig. 1 below shows an ERD for a hypothetical shop.

This diagram looks very much like a graph with entity types represented as nodes and relationships represented by edges. That's exactly what SQL/PGQ is about. It adds language constructs to SQL to present the underlying database as a "Property Graph". For example, the property graph definition corresponding to the above ERD would look like this:

CREATE PROPERTY GRAPH shop
VERTEX TABLES (
    CreditCard label Payment,
    BankAccount label Payment,
    Person label Customer,
    Company label Customer,
    Trust label Customer,
    Wishlist label ProdLink,
    Order label ProdLink,
    Product)
EDGE TABLES (
    CCOwns label Owns,
    BAHolds label Owns,
    CustOrders label CustLink,
    CustWishlist label CustLink,
    CompanyOrders label CustLink,
    CompanyWishlist label CustLink,
    TrustOrders label CustLink,
    TrustWishlist label CustLink,
    OrderCCPayment label OrderPayment,
    OrderBAPayment label OrderPayment,
    OrderItems label ItemLink,
    WishlistItems label ItemLink);
 
Clever readers may have noticed that some of the entity types have some commonality. CreditCard and BankAccount are both Payment methods. Person, Company and Trust can all be considered "Customers" by the shop. In a graph, entities with commonalities will be represented by visual annotations like the colors in Fig. 2. SQL/PGQ chooses to represent them by labels. Columns of the underlying tables are exposed through properties of labels. Traditionally the labels may be implemented as table inheritance or through tables abstracting commonalities. But that may not be necessary anymore.

Augmented with the query constructs in SQL/PGQ, labels make it easy to write queries, especially analytical ones. Imagine a query to find all the products paid for via credit card. In plain SQL there will be tons of JOINs, and UNIONs over those joins. That's almost like "implementing" the logic in SQL. You would ask which parts do I JOIN before UNION, which parts do I UNION before JOIN, and so on. That's against the "declarative" spirit of SQL, which should allow you to tell "what" you want and leave the "how" for the DBMS to figure out. With SQL/PGQ you tell the DBaaG which paths in the graph to traverse. How to traverse them is the system's responsibility. So the SQL/PGQ query looks like the one below. Much simpler than joins and unions. In fact, it allows me not to mention edge tables at all in the query.

SELECT distinct name FROM
    GRAPH_TABLE (shop
                MATCH
                   (o IS Order)->(py IS Payment WHERE py.type = 'CC')<-(c IS Customer)->(o IS Order)->(p IS Product)
                COLUMNS (p.name));
 
I must note that the query looks more like a mathematical equation than SQL which till now followed natural language syntax. But well, there it is.

What do those () mean? What about the various literals in it? How do you specify properties? I am sure I have raised more questions than I have answered here. I plan to write more about it in the future. This blog has gone longer than I initially intended, but I hope it has aroused your interest in SQL/PGQ nonetheless.

Oh! But before I end, please note that we are working on implementing SQL/PGQ in PostgreSQL. If you are interested and want to contribute, please follow and respond on the pgsql-hackers thread.

   
 

Tuesday, August 8, 2023

Partitioning as a query optimization strategy?

I had discussed query optimization techniques applicable to queries involving partitioned tables at PGConf.India 2018 (Video recording, slides). (My previous blog discusses these techniques in detail.) The takeaway from that presentation was that these query optimization techniques improved query performance if the tables were already partitioned. But partitioning wasn't good enough as a query optimization strategy by itself on small data sizes, even when partitionwise join, partitionwise aggregate and parallel query were all used together. Experiments then hinted that if the data was large enough, partitioning would become a query optimization strategy. But we didn't know how large is large enough. Experiments to establish that would require beefy machines with larger resources, which were costly and took a long time to procure or get access to. On top of that, it took a long time to set up and finish the runs. At one point we stopped experimenting. Fast forward to today and things have changed drastically, thanks to the cloud!

EDB's BigAnimal comes to help

EnterpriseDB offers PostgreSQL-as-a-service in the form of a DBaaS platform called BigAnimal. It allows its users to deploy and run PostgreSQL in the cloud on a hardware configuration of their choice. It also provides a starter free credit to try out this platform. I experimented with very large datasets by using BigAnimal. I ran the experiments on PostgreSQL 15 hosted on an m5.4xlarge instance (64 GB RAM, 16 vCPUs) with 1500 GB storage. All of this without wasting much time or money; I destroyed the instance as soon as my experiments were over.

Experiment

I wanted to see the impact of only partitioning as a query optimization strategy. So instead of using the whole TPCH setup, I crafted a micro-benchmark with two queries involving two tables, li and ord, modeled after the lineitem and orders tables in the TPCH benchmark. When partitioned, the two tables have matching partitions, up to 1000 each. The tables have the following schema:

$\d+ li
                                            Table "public.li"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 key1   | integer |           | not null |         | plain    |             |              |
 key2   | integer |           | not null |         | plain    |             |              |
 d      | date    |           |          |         | plain    |             |              |
 m      | money   |           |          |         | plain    |             |              |
 t1     | text    |           |          |         | extended |             |              |
 t2     | text    |           |          |         | extended |             |              |
Indexes:
    "li_pkey" PRIMARY KEY, btree (key1, key2)
Access method: heap

$\d+ ord
                                           Table "public.ord"
 Column |  Type   | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
 key1   | integer |           | not null |         | plain    |             |              |
 d      | date    |           |          |         | plain    |             |              |
 m      | money   |           |          |         | plain    |             |              |
 t1     | text    |           |          |         | extended |             |              |
 t2     | text    |           |          |         | extended |             |              |
Indexes:
    "ord_pkey" PRIMARY KEY, btree (key1)
Access method: heap

When partitioned they are partitioned by range on key1. Each row in ord has 3 matching rows in li, roughly imitating the data-size ratio between corresponding tables in TPCH benchmark.

Query 1 which extracts relevant parts of TPCH Q3 or Q4 looks like

select count(*)
    from (select o.key1, sum(o.m) revenue, o.d
            from li l, ord o
            where l.key1 = o.key1 and
                o.d > current_date + 300 and
                l.d < current_date + 700
            group by o.key1, o.d
            order by revenue, o.d
    ) as t1;

 Query 2 which is a pure join between li and ord looks like

select o.key1, l.key1, o.d
            from li l, ord o
            where l.key1 = o.key1 and
                o.d > current_date + 300 and
                l.d < current_date + 700;

The time required to execute these two queries is measured using EXPLAIN ANALYZE. We varied the number of rows per partition as well as the number of partitions.

The execution times for queries are given in tables below.

Table 1: 10K rows per partition

                  | Average execution time Q1 (ms)                | Average execution time Q2 (ms)
No. of partitions | unpartitioned |   partitioned |   partitioned | unpartitioned |   partitioned |   partitioned
                  |         table |   without PWJ |      with PWJ |         table |   without PWJ |      with PWJ
------------------+---------------+---------------+---------------+---------------+---------------+---------------
5                 |         83.05 |         93.29 |         53.68 |         48.83 |         60.55 |         50.85
10                |        195.87 |        221.33 |         90.24 |        104.40 |        129.06 |        105.20
50                |      1,183.25 |      1,487.00 |        432.07 |        584.31 |        723.90 |        527.97
100               |      2,360.19 |      3,001.81 |        888.46 |      1,342.69 |      1,595.53 |      1,053.91
500               |     11,968.68 |     15,220.69 |      4,350.62 |      6,903.91 |      8,082.09 |      5,381.46
1000              |     33,772.31 |     31,090.57 |      8,847.61 |     16,461.44 |     17,646.42 |     10,875.05

Table 2: 100K rows per partition

                  | Average execution time Q1 (ms)                | Average execution time Q2 (ms)
No. of partitions | unpartitioned |   partitioned |   partitioned | unpartitioned |   partitioned |   partitioned
                  |         table |   without PWJ |      with PWJ |         table |   without PWJ |      with PWJ
------------------+---------------+---------------+---------------+---------------+---------------+---------------
5                 |      1,157.23 |      1,489.53 |        514.68 |        609.81 |        773.31 |        582.07
10                |      2,326.40 |      2,990.32 |      1,041.11 |      1,375.69 |      1,597.55 |      1,152.33
50                |     11,899.34 |     15,181.49 |      4,792.88 |      7,196.35 |      8,446.64 |      5,828.54
100               |     24,139.10 |     30,660.87 |      9,594.33 |     14,277.53 |     16,753.36 |     11,512.35
500               |    153,922.35 |    165,550.06 |     50,308.85 |     74,387.34 |     85,175.79 |     58,282.17
1000              |    313,534.59 |    338,718.63 |    131,482.31 |    203,569.14 |    132,688.60 |    123,643.18


The same numbers are easier to understand as graphs. Next we see graphs depicting the average execution time of each of these queries varying with the number of partitions. In each graph the Y-axis shows the execution time in logarithmic scale and the X-axis shows the number of partitions. The blue line shows the query execution times when tables are not partitioned. The red line shows query execution times when tables are partitioned but partitionwise join and aggregation are not used (turning both enable_partitionwise_join and enable_partitionwise_aggregate OFF). The yellow line shows query execution times when tables are partitioned and partitionwise join and partitionwise aggregate are used.

Note that the Y-axis denoting the execution time is drawn with logarithmic scale. Thus a constant vertical distance on that axis corresponds to a multiplicative improvement rather than an additive one. For example, Q1's execution time improves by almost 4 times when tables are partitioned and partitionwise join and aggregate are enabled.
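
The "almost 4 times" figure can be checked directly from the numbers. The following Python sketch is only an illustration using the Q1 timings copied from Table 1; the variable names are made up for this example. It computes the speedup of the partitioned run with PWJ over the unpartitioned run, along with the vertical gap the two lines would have on a log10 axis.

```python
import math

# Q1 timings in ms copied from Table 1 (10K rows per partition):
# number of partitions -> (unpartitioned, partitioned with PWJ)
q1_table1 = {
    5: (83.05, 53.68),
    10: (195.87, 90.24),
    50: (1183.25, 432.07),
    100: (2360.19, 888.46),
    500: (11968.68, 4350.62),
    1000: (33772.31, 8847.61),
}

speedup = {}
log_gap = {}
for partitions, (unpartitioned, with_pwj) in q1_table1.items():
    speedup[partitions] = unpartitioned / with_pwj
    # Vertical distance between the two lines on a log10 Y-axis.
    log_gap[partitions] = math.log10(unpartitioned) - math.log10(with_pwj)
    print(f"{partitions:>5} partitions: {speedup[partitions]:.2f}x, "
          f"log10 gap {log_gap[partitions]:.3f}")
```

Since 10 raised to the log gap equals the speedup, equal vertical distances in the graphs mean equal speedup factors.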

Graph 1


Graph 2

Graph 3

Graph 4

Key takeaways

The graphs above make it clear that when data sizes are very large, partitioning can also be used as a query optimization technique, along with its other advantages. Some key points:

  1. When the total data size reaches the order of millions of rows, partitioning can be considered as a query optimization strategy. The exact number of partitions and average rows per partition do not make much difference. We see similar performance whether 5M rows are divided into 500 partitions or 50 partitions.
  2. The exact thresholds depend upon properties of the data and the queries, e.g. size of each row, columns used in the query, operations performed by the query etc.
  3. Since these optimization techniques are very much dependent upon the partition key, choosing the right partition key is very important.
  4. When tables are partitioned, queries perform better when partitionwise operations are used, irrespective of the data size.
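
The first takeaway can be read off the tables: 500 partitions of 10K rows (Table 1) and 50 partitions of 100K rows (Table 2) both hold 5M rows. A small Python sketch, with made-up variable names and the timings copied from those two rows, shows how close the PWJ timings are:

```python
# (partitions, rows per partition) -> timings in ms with PWJ enabled,
# copied from Table 1 (500 x 10K) and Table 2 (50 x 100K).
runs = {
    (500, 10_000): {"Q1": 4350.62, "Q2": 5381.46},
    (50, 100_000): {"Q1": 4792.88, "Q2": 5828.54},
}

# Both layouts hold the same total number of rows.
total_rows = {layout: layout[0] * layout[1] for layout in runs}

many_small = runs[(500, 10_000)]
few_large = runs[(50, 100_000)]
relative_gap = {
    query: abs(few_large[query] - many_small[query]) / many_small[query]
    for query in many_small
}

for query, gap in relative_gap.items():
    print(f"{query}: timings differ by {gap:.1%}")
```

The gap between the two layouts is small compared with the gap between partitioned and unpartitioned runs in the same tables.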

Each workload is different. The above charts provide some guidance. But experimenting with the size and number of partitions as well as the partition key is important to know whether partitioning will help you optimize queries in your application or not. Experimentation shouldn't be an issue anymore. EDB's BigAnimal platform allows its users to experiment quickly without requiring a large upfront investment.

Wednesday, June 14, 2023

PostgreSQL internals development with VSCode

In my brief stint with Hive (which resulted in this and this blog), I used IntelliJ IDEA. I was reintroduced to the marvels of IDEs and how easy they make a developer's life. I had used Visual Studio, TurboC and many other language-specific IDEs back in my college days. But once I started working exclusively with C and Linux, I was confined to vim, gdb, cgdb and at most ddd. (I didn't use emacs. But I hear that's a cool IDE as well.) I had kind of forgotten what a comfort it is to work in an IDE. Tools like vim and gdb are certainly great and, if one spends enough time with them, they can be better than any of the IDEs out there. But there's a steep learning curve. So, I was reminded of that comfort and sorely missed a good IDE when I started working with PostgreSQL again. But by then VSCode had become available on Linux. It's not as fantastic as IntelliJ or GoLand but it's good enough to improve a C developer's life, not to mention efficiency.

I like (a) the ability to edit, browse and debug code simultaneously, (b) all the contextual, language-specific auto-suggestions and (c) the ease of code navigation. I sorely miss Ctrl+t and Ctrl+] stacking in vim, but otherwise it has a vim emulator too. I am yet to explore and utilize other features like git.

In this blog we will see how to configure VSCode for PostgreSQL internals development, including the development of extensions and proprietary forks. We will talk about two things in this blog: (1) how to configure make so that code browsing, navigation, error highlighting and auto-suggestions are sensible and (2) how to configure a debugger. These are the two things I struggled with when it came to working with PostgreSQL code in VSCode. Otherwise, you will find plenty of references on C/C++ development in VSCode like this, this and this.

Please feel free to hit me with suggestions and corrections. Drop your VSCode tips and tricks or suggest a topic I can cover in a future blog.

1. Getting started with PostgreSQL code

I have a script which clones the PostgreSQL github repository and runs configure. Assume that the code is cloned in the "$HOME/vscode_trial/coderoot/pg" directory. coderoot will contain all the VSCode specific files and directories whereas coderoot/pg will contain purely the PG clone. I am using the VSCode version shown in the image on Ubuntu 22.04. I start by clicking the VSCode icon in the application tray.


Open coderoot folder using File->Open Folder. Save workspace using File->Save Workspace As in coderoot folder. Add folder coderoot/pg using File->Add Folder to Workspace.

2. Setting up make

PostgreSQL uses Makefiles to build and manage binaries. VSCode by default uses CMake. So you will need to configure its build tasks to use Make instead of CMake. I have my own scripts to build PostgreSQL so I don't need the tool to build binaries per se. But when we point VSCode to PostgreSQL's Makefile, its IntelliSense uses the Makefile and does a better job at code navigation, error detection and auto-suggestion.


Please install Makefile Tools extension so that VSCode can use make. Point it to the PostgreSQL Makefile selecting the options highlighted in the image below





You will find Makefile Tools extension button on the left side bar. Click it to configure the default tasks or to build binaries. The tool is smart and picks up all the make targets from the Makefile hierarchy. Click the "pencil" icon against "Build target" to choose the target you want and then click the "bug" icon at the top highlighted in the image below. This will run make install. You may ignore an error about launch task not being configured.











3. Debugging a PostgreSQL backend

This configuration baffled me a lot, especially in the newer version of VSCode. Playing with the run and build symbol on the left hardly had any success. The trick is to open a source file, any .c file really, and then click the configuration symbol (highlighted in the image) to configure debug tasks. I choose C/C++: gcc build and debug active file.
This will open the launch.json file in the workspace. Click the Add Configuration button on the bottom right. Most of the time I have to debug a running backend. This requires configuring the C/C++ gdb (Attach) option. Add the processId value as shown below. Also provide the path to the postgres binary as program.

Click on the run and debug option from the left bar and choose (gdb) Attach option at the top as shown in the image below.

In order to debug a given backend, click Run->Start Debugging. This should pop up a list of all PIDs. Choose the one you want from the list (you may want to search for postgres). This will attach gdb to that backend and you are ready to go. Enjoy all the blessings of debugging via GUI as described in the documentation here.

More on debugging through VSCode is here.

4. TAP tests

PostgreSQL code includes TAP tests written in Perl. You will need the Perl navigator and the Perl language server and debugger extensions. The second extension is only required if you want to debug TAP tests.

Wednesday, January 26, 2022

Advanced partition matching for partition-wise join

Earlier I had written a blog about partition-wise join in PostgreSQL. In that blog I had talked about an advanced partition matching technique which allows partition-wise join to be used in more cases. In this blog we will discuss this technique in detail. I suggest reading my blog on basic partition-wise join again to get familiar with the technique.

The basic partition matching technique allows a join between two partitioned tables to be performed using the partition-wise join technique if the two partitioned tables have exactly the same partitions (more precisely, exactly matching partition bounds). For example, consider two partitioned tables prt1 and prt2:

\d+ prt1
... [output clipped]
Partition key: RANGE (a)
Partitions: prt1_p1 FOR VALUES FROM (0) TO (5000),
            prt1_p2 FOR VALUES FROM (5000) TO (15000),
            prt1_p3 FOR VALUES FROM (15000) TO (30000)
and

\d+ prt2
... [ output clipped ]
Partition key: RANGE (b)
Partitions: prt2_p1 FOR VALUES FROM (0) TO (5000),
            prt2_p2 FOR VALUES FROM (5000) TO (15000),
            prt2_p3 FOR VALUES FROM (15000) TO (30000)

A join between prt1 and prt2 is executed as join between matching partitions i.e. prt1_p1 joins prt2_p1, prt1_p2 joins prt2_p2 and prt1_p3 joins prt2_p3. This has many advantages as discussed in my previous blog.

But basic partition matching cannot join two partitioned tables with different partitions (more precisely, different partition bounds). For example, in the above case if prt1 had an extra partition prt1_p4 FOR VALUES FROM (30000) TO (50000), a join between prt1 and prt2 would not use partition-wise join. Many applications use partitions to segregate actively used and stale data, a technique I discussed in another blog of mine. The stale data is eventually removed by dropping partitions. New partitions are created to accommodate fresh data. Let's say the partition scheme of two such tables is such that they usually have matching partitions. But when an active partition gets added to one of these tables or a stale one gets deleted, they will have mismatched partitions for a small duration. We don't want a join hitting the database during this small duration to perform badly because it cannot use partition-wise join. The advanced partition matching algorithm helps here.

Advanced partition matching algorithm

Advanced partition matching is very similar to the merge join algorithm. It takes the sorted partition bounds of both tables and finds matching partitions by comparing the bounds in their sorted order. Any two partitions, one from each partitioned table, whose bounds match exactly or overlap are considered joining partners since they may contain rows that join. Continuing with the above example, but this time with the extra partition on prt2:

\d+ prt1
... [output clipped]
Partition key: RANGE (a)
Partitions: prt1_p1 FOR VALUES FROM (0) TO (5000),
            prt1_p2 FOR VALUES FROM (5000) TO (15000),
            prt1_p3 FOR VALUES FROM (15000) TO (30000)
and
\d+ prt2
... [ output clipped ]
Partition key: RANGE (b)
Partitions: prt2_p1 FOR VALUES FROM (0) TO (5000),
            prt2_p2 FOR VALUES FROM (5000) TO (15000),
            prt2_p3 FOR VALUES FROM (15000) TO (30000),
            prt2_p4 FOR VALUES FROM (30000) TO (50000)

Similar to the basic partition matching algorithm, this will join prt1_p1 and prt2_p1, prt1_p2 and prt2_p2, and prt1_p3 and prt2_p3. But unlike basic partition matching, it will also know that prt2_p4 does not have any join partner in prt1. Thus if the join between prt1 and prt2 is an INNER join, or when prt2 is the INNER relation of the join, the result will contain only three joins, leaving prt2_p4 aside. In PostgreSQL, when prt2 is the OUTER relation of the join, we won't be able to use partition-wise join. We will come back to this when we discuss outer joins further.
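
The merge-like walk over sorted bounds can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the code in PostgreSQL: partitions are hypothetical (name, lower, upper) tuples with an exclusive upper bound, already sorted by lower bound, and match_range_partitions is a made-up name.

```python
def match_range_partitions(left, right):
    """Walk two sorted lists of (name, lower, upper) like a merge join.

    Returns (pairs, only_left, only_right): overlapping partition pairs and
    the partitions on each side that found no partner.
    """
    pairs = []
    i = j = 0
    while i < len(left) and j < len(right):
        lname, llo, lhi = left[i]
        rname, rlo, rhi = right[j]
        if llo < rhi and rlo < lhi:  # the two ranges overlap
            pairs.append((lname, rname))
        # Advance whichever side ends first; advance both on a tie.
        if lhi < rhi:
            i += 1
        elif rhi < lhi:
            j += 1
        else:
            i += 1
            j += 1
    matched_left = {lname for lname, _ in pairs}
    matched_right = {rname for _, rname in pairs}
    only_left = [name for name, _, _ in left if name not in matched_left]
    only_right = [name for name, _, _ in right if name not in matched_right]
    return pairs, only_left, only_right


prt1 = [("prt1_p1", 0, 5000), ("prt1_p2", 5000, 15000),
        ("prt1_p3", 15000, 30000)]
prt2 = [("prt2_p1", 0, 5000), ("prt2_p2", 5000, 15000),
        ("prt2_p3", 15000, 30000), ("prt2_p4", 30000, 50000)]

pairs, only_prt1, only_prt2 = match_range_partitions(prt1, prt2)
print(pairs, only_prt1, only_prt2)
```

For the example tables this yields the three matching pairs, no leftover partition in prt1, and prt2_p4 as the only partition without a partner.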

This is simple, right? But consider another example, this time of list partitioned tables:
\d+ plt1_adv
Partition key: LIST (c)
Partitions: plt1_adv_p1 FOR VALUES IN ('0001', '0003'),
            plt1_adv_p2 FOR VALUES IN ('0004', '0006'),
            plt1_adv_p3 FOR VALUES IN ('0008', '0009')

and

\d+ plt2_adv
Partition key: LIST (c)
Partitions: plt2_adv_p1 FOR VALUES IN ('0002', '0003'),
            plt2_adv_p2 FOR VALUES IN ('0004', '0006'),
            plt2_adv_p3 FOR VALUES IN ('0007', '0009')

Observe that there are exactly three partitions in both relations, but only the list of plt1_adv_p2 exactly matches that of plt2_adv_p2; the other two pairs of partitions do not have exactly matching lists. The advanced partition matching algorithm determines that plt1_adv_p1 and plt2_adv_p1 have overlapping lists and that their lists do not overlap with any other partition from the other relation. The same holds for plt1_adv_p3 and plt2_adv_p3. Thus it allows the join between plt1_adv and plt2_adv to be executed as a partition-wise join by joining their matching partitions. The algorithm can find matching partitions in even more complex sets of partition bounds.
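
The same matching can be illustrated for list partitions with a small Python sketch. For brevity it compares every pair of partitions instead of walking sorted values the way a merge would, so treat it as an illustration of the result rather than of the actual algorithm. The function name match_list_partitions and the set-based model are my own assumptions.

```python
# Hypothetical model: partition name -> set of listed values.
plt1_adv = {
    "plt1_adv_p1": {"0001", "0003"},
    "plt1_adv_p2": {"0004", "0006"},
    "plt1_adv_p3": {"0008", "0009"},
}
plt2_adv = {
    "plt2_adv_p1": {"0002", "0003"},
    "plt2_adv_p2": {"0004", "0006"},
    "plt2_adv_p3": {"0007", "0009"},
}


def match_list_partitions(left, right):
    """Pair up partitions whose value lists share at least one value."""
    return [
        (lname, rname)
        for lname, lvalues in left.items()
        for rname, rvalues in right.items()
        if lvalues & rvalues  # non-empty intersection: rows may join
    ]


print(match_list_partitions(plt1_adv, plt2_adv))
```

Each partition ends up with exactly one partner: p1 with p1 through '0003', p2 with p2 through both values, and p3 with p3 through '0009'.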

The problem with outer joins

Outer joins pose a particular problem in the PostgreSQL world. Consider again the example above, this time as prt2 LEFT JOIN prt1. prt2_p4 does not have a joining partner in prt1, and yet the rows in that partition will be part of the join result since prt2 is the outer relation, albeit with all the columns from prt1 set to NULL. Usually in PostgreSQL, when the INNER side is empty, it is represented by a "dummy" relation which emits no rows but still knows the schema of that relation. Without partition-wise join, a "concrete" relation which has some presence in the original query turns dummy, and thus the planner has "something" to join the outer relation with. So PostgreSQL's planner doesn't have to do anything extra when such outer joins occur. But when there is no matching inner partition for an outer partition, e.g. prt2_p4, there is "no entity" which can represent the "dummy" inner side of that outer join. PostgreSQL does not have a way right now to induce such "dummy" relations during planning. But that is not really required. Ideally, such a join with an empty inner side only requires the schema of the inner relation, not an entire relation. Once we build support to execute such a join with a solid outer relation and just the schema of the inner relation, we will be able to tackle partition-wise join where there are no matching partitions on the inner side. Hopefully we will solve that problem some time soon.

When there is no matching partition on the outer side of the join, the inner partition does not contribute to the result of the join and can simply be ignored. So partition-wise joins where there are no matching partitions on the outer side are not a problem at all.
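
The rule described in this section can be summarised as a small decision function. This is only my reading of the rule as a sketch, not planner code: the function name and the join type strings are hypothetical, and the FULL join case is my own extension of the same reasoning (both sides act as outer), which the text above does not spell out.

```python
def partitionwise_join_possible(join_type, only_left, only_right):
    """Decide whether unmatched partitions block partition-wise join.

    only_left / only_right list the partitions of the left and right join
    operands that have no partner on the other side.
    """
    if join_type == "inner":
        return True  # unmatched partitions on either side are just ignored
    if join_type == "left":
        return not only_left  # the left (outer) side must be fully matched
    if join_type == "right":
        return not only_right  # the right (outer) side must be fully matched
    if join_type == "full":
        # Assumption: both sides are outer, so neither may have leftovers.
        return not only_left and not only_right
    raise ValueError(f"unsupported join type: {join_type}")


# prt2 LEFT JOIN prt1: prt2_p4 is an outer partition without a partner.
print(partitionwise_join_possible("left", ["prt2_p4"], []))
# prt1 LEFT JOIN prt2: prt2_p4 is on the inner side and can be ignored.
print(partitionwise_join_possible("left", [], ["prt2_p4"]))
```

The first call reports that partition-wise join is not possible, the second that it is.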

Multiple matching partitions

When the partitioned tables are such that multiple partitions from one side match one or more partitions on the other side, partition-wise join simply bails out, since there is no way to induce an "Append" relation at planning time which represents two or more partitions together. Hopefully we will remove that limitation as well sometime.
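
As a rough illustration, the bail-out condition amounts to checking that no partition appears in more than one matching pair. The sketch below assumes the matching step has already produced a list of (left partition, right partition) pairs. The partition names t1_p1, t2_p1 and t2_p2 and the function name are hypothetical.

```python
from collections import Counter


def is_one_to_one(pairs):
    """True if every partition takes part in at most one matching pair."""
    left_counts = Counter(lname for lname, _ in pairs)
    right_counts = Counter(rname for _, rname in pairs)
    return (all(count == 1 for count in left_counts.values())
            and all(count == 1 for count in right_counts.values()))


# Hypothetical case: t1_p1 covers 0..10000 while t2 splits the same range
# into t2_p1 (0..5000) and t2_p2 (5000..10000), so t1_p1 has two partners.
pairs = [("t1_p1", "t2_p1"), ("t1_p1", "t2_p2")]
print(is_one_to_one(pairs))
```

Here t1_p1 matches two partitions, so the check fails and, following the text above, partition-wise join would not be used.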

Curious case of hash partitioned tables

It doesn't make much sense to use advanced partition matching for hash partitioned tables, since the partitions of two hash partitioned tables using the same modulus usually match anyway. When the moduli are different, the data from a given partition of one table can find its join partners in all the partitions of the other, rendering partition-wise join ineffective.
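
A toy Python experiment shows the effect of differing moduli. It is a deliberate simplification: it applies the modulus to the integer key itself, whereas PostgreSQL applies it to a hash of the partition key. The function name join_partners and the key range are made up for the illustration.

```python
def join_partners(modulus_a, modulus_b, partition_a, keys=range(1000)):
    """Partitions of table B that hold keys stored in one partition of table A.

    Simplification: partition number = key % modulus (no real hash function).
    """
    return sorted({key % modulus_b
                   for key in keys
                   if key % modulus_a == partition_a})


# Same modulus on both sides: partition 1 only ever meets partition 1.
print(join_partners(4, 4, 1))
# Different moduli (3 vs 4): partition 0 of A meets every partition of B.
print(join_partners(3, 4, 0))
```

With equal moduli each partition has a single partner, while with moduli 3 and 4 the keys of one partition are spread over all four partitions of the other table.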

Even with all these limitations, what we have today is a very useful solution which serves most practical cases. Needless to say, this feature works seamlessly with FDW join push down, adding to the sharding capabilities that PostgreSQL already has!