Walking with the Elephants
My encounter with open source databases.
Friday, June 7, 2024
SQL/PGQ and graph theory
Friday, May 3, 2024
Property graphs: elements, labels and properties
A property graph consists of three kinds of "things": elements, labels and properties.
Elements are the nodes and edges of the graph; they form its basic structure. An edge connects two nodes. Two nodes may be connected by multiple edges corresponding to different relationships between them.
Labels classify the elements. An element may belong to multiple classes and thus have multiple labels.
Properties are key-value pairs providing more information about an element. All the elements with the same label expose the same set of keys, or properties. A property of a given element may be exposed through multiple labels associated with that element.
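To make these terms concrete, here is an illustrative query (not from the original post) over a hypothetical property graph shop_graph, where customer, ordered and product are assumed labels and name is a property exposed by the customer and product labels:

-- match customer elements connected to product elements by an "ordered" edge
SELECT customer_name, product_name
  FROM GRAPH_TABLE (shop_graph
         MATCH (c IS customer)-[o IS ordered]->(p IS product)
         COLUMNS (c.name AS customer_name, p.name AS product_name));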
Wednesday, April 24, 2024
PostgreSQL's memory allocations
There's a thread on hackers about recovering the memory consumed by paths. A reference count is maintained in each path. Once paths are created for all the upper level relations that a given relation participates in, any unused paths, i.e. those with a reference count of 0, are freed. This adds extra code and CPU cycles to traverse the paths, maintain the reference counts and free the paths. Yet, the patch did not show any performance degradation. I was curious to know why, so I ran a small experiment.
Experiment
I wrote an extension palloc_test which adds two SQL-callable functions, palloc_pfree() and mem_context_free(), written in C. The function definitions can be found here. The first function palloc's some memory and then pfree's it immediately. The other function only palloc's and never pfree's, relying on the memory being freed when the per-tuple memory context is reset. Both functions take the number of iterations and the size of memory allocated in each iteration, respectively, as inputs, and return the time taken to execute the allocation loop. On the face of it, the first function spends CPU cycles freeing memory and the second one doesn't, so the first should be slower than the second.
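For example, a single data point in the results below could have been gathered with calls like these (the argument order and return value follow the description above; this is a sketch, not the exact session from the post):

-- allocate 100 bytes in each of 1,000,100 iterations, pfree-ing immediately
SELECT palloc_pfree(1000100, 100);
-- same allocations, but left to be freed when the per-tuple memory context is reset
SELECT mem_context_free(1000100, 100);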
Results
The table below shows the amount of time reported by the respective functions to execute the loop as many times as the value in the first column, each iteration allocating 100 bytes. The figure shows the same data as a plot. The time taken to finish the loop increases roughly linearly for both functions, indicating that the palloc logic is O(n) in the number of allocations. The palloc_pfree timings stay below the memory context reset timings throughout, the two lines coming closest to each other around 1M allocations.
count | palloc_pfree | memory context reset |
100 | 0.0029 | 0.007124 |
100100 | 2.5646 | 5.079862 |
200100 | 5.1682 | 10.375552 |
300100 | 7.6373 | 15.704286 |
400100 | 10.1827 | 19.038238 |
500100 | 12.7013 | 23.847599 |
600100 | 15.2838 | 28.708501 |
700100 | 17.8255 | 36.982928 |
800100 | 20.3718 | 41.863638 |
900100 | 23.0706 | 44.332727 |
1000100 | 51.3311 | 52.546201 |
2000100 | 56.7407 | 104.747792 |
3000100 | 76.3961 | 154.225157 |
4000100 | 102.3415 | 206.510045 |
5000100 | 126.1954 | 256.367685 |
6000100 | 155.8812 | 314.178951 |
7000100 | 179.9267 | 367.597501 |
8000100 | 206.2112 | 420.003351 |
9000100 | 234.7584 | 474.137076 |
Inference and conclusion
This agrees with the observations I posted on the thread. Instead of letting all the useless paths be freed when the query finishes, freeing them periodically during planning is both time and memory efficient. That compensates for the extra CPU cycles spent maintaining reference counts and traversing and freeing paths.
The actual memory allocation and freeing pattern implemented in that patch is different from the one in this experiment, so it might be worth repeating the experiment simulating a similar pattern.
I used a chunk size of 100 bytes since I thought it is close to the order of an average path's size. But it might be worth repeating the experiment with larger chunk sizes to generalize the result.
Tuesday, April 23, 2024
DBaaG with SQL/PGQ
For those who have studied ERD-lore, it's not news that a relational database is very much like a graph. But it has taken SQL more than 30 years since it became a standard, and almost half a century since its inception, to incorporate constructs that allow a DataBase to be treated as a Graph, DBaaG. This is surprising given that SQL was developed as a language for relational databases, which are modeled using ER diagrams. Better late than never. SQL/PGQ has arrived as the 16th part of SQL:2023.
An Entity Relationship Diagram, ERD in short, is a tool to model and visualize a database as entity types (which classify the things of interest) and the relationships that can exist between them. Entity types and relationships both map to relations in a Relational DataBase Management System (RDBMS in short). The rows in those relations represent entities (instances of entity types) and relationships between entities, respectively. Fig. 1 below shows an ERD for a hypothetical shop.
This diagram looks very much like a graph, with entity types represented as nodes and relationships represented by edges. That's exactly what SQL/PGQ is about. It adds language constructs to SQL to present the underlying database as a "Property Graph". For example, a property graph definition corresponding to the above ERD would look like
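the sketch below. The definition from the original post is not reproduced here; this is only an illustration of the SQL:2023 syntax, assuming hypothetical tables customers, products and orders from the shop ERD:

CREATE PROPERTY GRAPH shop_graph
  VERTEX TABLES (
    -- entity types become vertex tables exposing labels and properties
    customers KEY (customer_id) LABEL customer PROPERTIES (name, address),
    products  KEY (product_id)  LABEL product  PROPERTIES (name, price)
  )
  EDGE TABLES (
    -- relationships become edge tables connecting two vertex tables
    orders KEY (order_id)
      SOURCE      KEY (customer_id) REFERENCES customers (customer_id)
      DESTINATION KEY (product_id)  REFERENCES products (product_id)
      LABEL ordered PROPERTIES (order_date)
  );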
Tuesday, August 8, 2023
Partitioning as a query optimization strategy?
I had discussed query optimization techniques applicable to queries involving partitioned tables at PGConf.India 2018 (Video recording, slides). (My previous blog discusses these techniques in detail.) The takeaway from that presentation was that these techniques improved query performance if the tables were already partitioned. But partitioning wasn't good enough as a query optimization strategy by itself, even when partitionwise join, partitionwise aggregate and parallel query were all used together, on small data sizes. Experiments then hinted that if the data was large enough, partitioning would become a query optimization strategy. But we didn't know how large is large enough. Establishing that would require beefy machines with larger resources, which were costly and took a long time to procure or get access to. On top of that, it took a long time to set up and finish the runs. At one point we stopped experimenting. Fast forward to today and things have changed drastically, thanks to the cloud!
EDB's BigAnimal comes to help
EnterpriseDB offers PostgreSQL-as-a-service in the form of a DBaaS platform called BigAnimal. It allows its users to deploy and run PostgreSQL in the cloud on a hardware configuration of their choice. It also provides starter free credits to try out the platform. I used BigAnimal to experiment with very large datasets. I ran the experiments on PostgreSQL 15 hosted on an m5.4xlarge instance (64 GB RAM, 16 vCPUs) with 1500 GB storage. All of this without wasting much time or money; I destroyed the instance as soon as my experiments were over.
Experiment
Table "public.li"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
key1 | integer | | not null | | plain | | |
key2 | integer | | not null | | plain | | |
d | date | | | | plain | | |
m | money | | | | plain | | |
t1 | text | | | | extended | | |
t2 | text | | | | extended | | |
Indexes:
"li_pkey" PRIMARY KEY, btree (key1, key2)
Access method: heap
\d+ ord
Table "public.ord"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
--------+---------+-----------+----------+---------+----------+-------------+--------------+-------------
key1 | integer | | not null | | plain | | |
d | date | | | | plain | | |
m | money | | | | plain | | |
t1 | text | | | | extended | | |
t2 | text | | | | extended | | |
Indexes:
"ord_pkey" PRIMARY KEY, btree (key1)
Access method: heap
When partitioned, both tables are partitioned by range on key1. Each row in ord has 3 matching rows in li, roughly imitating the data-size ratio between the corresponding tables in the TPCH benchmark.
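The exact DDL is not shown in the post; a minimal sketch of the partitioned variant of the schema, with illustrative partition bounds, might look like this:

CREATE TABLE ord (key1 integer NOT NULL, d date, m money, t1 text, t2 text,
                  PRIMARY KEY (key1))
  PARTITION BY RANGE (key1);
CREATE TABLE ord_p1 PARTITION OF ord FOR VALUES FROM (0) TO (10000);
CREATE TABLE ord_p2 PARTITION OF ord FOR VALUES FROM (10000) TO (20000);
-- ... further partitions as needed

CREATE TABLE li (key1 integer NOT NULL, key2 integer NOT NULL, d date, m money,
                 t1 text, t2 text, PRIMARY KEY (key1, key2))
  PARTITION BY RANGE (key1);
CREATE TABLE li_p1 PARTITION OF li FOR VALUES FROM (0) TO (10000);
CREATE TABLE li_p2 PARTITION OF li FOR VALUES FROM (10000) TO (20000);
-- ... further partitions as needed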
Query 1, which extracts the relevant parts of TPCH Q3 or Q4, looks like
select count(*)
from (select o.key1, sum(o.m) revenue, o.d
from li l, ord o
where l.key1 = o.key1 and
o.d > current_date + 300 and
l.d < current_date + 700
group by o.key1, o.d
order by revenue, o.d
) as t1
Query 2, which is a pure join between li and ord, looks like
select o.key1, l.key1, o.d
from li l, ord o
where l.key1 = o.key1 and
o.d > current_date + 300 and
l.d < current_date + 700
The time required to execute these two queries is measured using EXPLAIN ANALYZE. We varied the number of rows per partition as well as the number of partitions.
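A sketch of how one such measurement might be taken (the exact session is not shown in the post; enable_partitionwise_join and enable_partitionwise_aggregate are the settings referred to later):

-- enable partitionwise operations for the "with PWJ" runs, turn them off for
-- the "without PWJ" runs
SET enable_partitionwise_join = on;
SET enable_partitionwise_aggregate = on;
EXPLAIN (ANALYZE)
select o.key1, l.key1, o.d
  from li l, ord o
 where l.key1 = o.key1 and
       o.d > current_date + 300 and
       l.d < current_date + 700;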
The execution times for queries are given in tables below.
Table 1: 10K rows per partition (average execution time in ms)
No. of partitions | Q1: unpartitioned table | Q1: partitioned without PWJ | Q1: partitioned with PWJ | Q2: unpartitioned table | Q2: partitioned without PWJ | Q2: partitioned with PWJ |
5 | 83.05 | 93.29 | 53.68 | 48.83 | 60.55 | 50.85 |
10 | 195.87 | 221.33 | 90.24 | 104.40 | 129.06 | 105.20 |
50 | 1,183.25 | 1,487.00 | 432.07 | 584.31 | 723.90 | 527.97 |
100 | 2,360.19 | 3,001.81 | 888.46 | 1,342.69 | 1,595.53 | 1,053.91 |
500 | 11,968.68 | 15,220.69 | 4,350.62 | 6,903.91 | 8,082.09 | 5,381.46 |
1000 | 33,772.31 | 31,090.57 | 8,847.61 | 16,461.44 | 17,646.42 | 10,875.05 |
Table 2: 100K rows per partition (average execution time in ms)
No. of partitions | Q1: unpartitioned table | Q1: partitioned without PWJ | Q1: partitioned with PWJ | Q2: unpartitioned table | Q2: partitioned without PWJ | Q2: partitioned with PWJ |
5 | 1,157.23 | 1,489.53 | 514.68 | 609.81 | 773.31 | 582.07 |
10 | 2,326.40 | 2,990.32 | 1,041.11 | 1,375.69 | 1,597.55 | 1,152.33 |
50 | 11,899.34 | 15,181.49 | 4,792.88 | 7,196.35 | 8,446.64 | 5,828.54 |
100 | 24,139.10 | 30,660.87 | 9,594.33 | 14,277.53 | 16,753.36 | 11,512.35 |
500 | 153,922.35 | 165,550.06 | 50,308.85 | 74,387.34 | 85,175.79 | 58,282.17 |
1000 | 313,534.59 | 338,718.63 | 131,482.31 | 203,569.14 | 132,688.60 | 123,643.18 |
The same numbers are easier to understand as graphs. The graphs below depict the average execution time of each of these queries varying with the number of partitions. In each graph the Y-axis shows the execution time on a logarithmic scale and the X-axis shows the number of partitions. The blue line shows query execution times when the tables are not partitioned. The red line shows query execution times when the tables are partitioned but partitionwise join and aggregation are not used (both enable_partitionwise_join and enable_partitionwise_aggregate turned OFF). The yellow line shows query execution times when the tables are partitioned and partitionwise join and partitionwise aggregate are used.
Note that the Y-axis denoting the execution time is drawn on a logarithmic scale, so a linear difference on that axis represents an improvement by integer multiples rather than fractions. For example, Q1's execution time improves by almost 4 times when the tables are partitioned and partitionwise join and aggregate are enabled (in Table 1, at 1000 partitions, roughly 33.8 s unpartitioned vs 8.8 s with partitionwise operations).
Graph 1
Graph 2
Graph 3
Graph 4
Key takeaways
The graphs above make it clear that when data sizes are very large, partitioning can also be used as a query optimization technique along with its other advantages. I will share some key points here:
- When the total data size reaches the order of millions of rows, partitioning can be considered as a query optimization strategy. The exact number of partitions and average rows per partition do not make much difference: we see similar performance whether 5M rows are divided into 500 partitions or 50 partitions.
- The exact thresholds depend upon the properties of the data and the queries, e.g. the size of each row, the columns used in the query, the operations performed by the query, etc.
- Since these optimization techniques are very much dependent upon the partition key, choosing the right partition key is very important.
- When tables are partitioned, queries perform better when partitionwise operations are used, irrespective of the data size.
Each workload is different. The charts above provide some guidance, but experimenting with the size and number of partitions, as well as the partition key, is important to know whether partitioning will help optimize queries in your application. Experimentation shouldn't be an issue anymore: EDB's BigAnimal platform allows its users to experiment quickly without requiring a large upfront investment.
Wednesday, June 14, 2023
PostgreSQL internals development with VSCode
In my brief stint with Hive (which resulted in this and this blog), I used IntelliJ IDEA. I was reintroduced to the marvels of IDEs and how easy they make a developer's life. I had used Visual Studio, TurboC and many other language-specific IDEs back in my college days. But once I started working exclusively with C and Linux, I was confined to vim, gdb, cgdb and at the most ddd. (I didn't use emacs, but I hear that's a cool IDE as well.) I had kind of forgotten what a comfort it is to work in an IDE. These tools are certainly great, and if one spends enough time on them, they can be better than any of the IDEs out there. But there's a steep learning curve. So I was reminded of that comfort and sorely missed a good IDE when I started working with PostgreSQL again. But by then VSCode was available on Linux. It's not as fantastic as IntelliJ or GoLand, but it's good enough to improve a C developer's life, not to mention efficiency.
I like a. the ability to edit, browse and debug code simultaneously, b. the contextual, language-specific auto-suggestions, and c. the ease of code navigation. I sorely miss vim's Ctrl+t and Ctrl+] tag stacking, but otherwise VSCode has a vim emulator too. I am yet to explore and utilize other features like git integration.
In this blog we will see how to configure VSCode for PostgreSQL internals development, including the development of extensions and proprietary forks. We will talk about two things: 1. how to configure make so that code browsing, navigation, error highlighting and auto-suggestions are sensible, and 2. how to configure a debugger. These are the two things I struggled with when it came to working with PostgreSQL code in VSCode. Otherwise, you will find plenty of references on C/C++ development in VSCode, like this, this and this.
Please feel free to hit me with suggestions and corrections. Drop your VSCode tips and tricks, or suggest a topic I can cover in a future blog.
1. Getting started with PostgreSQL code
Open the coderoot folder using File->Open Folder. Save the workspace in the coderoot folder using File->Save Workspace As. Add the folder coderoot/pg using File->Add Folder to Workspace.
2. Setting up make
3. Debugging a PostgreSQL backend
4. TAP tests
Wednesday, January 26, 2022
Advanced partition matching for partition-wise join
The basic partition matching technique allows a join between two partitioned tables to be performed using the partition-wise join technique if the two partitioned tables have exactly the same partitions (more precisely, exactly matching partition bounds). For example, consider two partitioned tables prt1 and prt2
\d+ prt1
... [output clipped]
Partition key: RANGE (a)
Partitions: prt1_p1 FOR VALUES FROM (0) TO (5000),
prt1_p2 FOR VALUES FROM (5000) TO (15000),
prt1_p3 FOR VALUES FROM (15000) TO (30000)
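The matching definition of prt2 is clipped above; for basic partition matching it would need exactly the same bounds as prt1. A hypothetical example (not from the original post) of such a table and a join that can use partition-wise join:

CREATE TABLE prt2 (a integer, b integer) PARTITION BY RANGE (a);
CREATE TABLE prt2_p1 PARTITION OF prt2 FOR VALUES FROM (0) TO (5000);
CREATE TABLE prt2_p2 PARTITION OF prt2 FOR VALUES FROM (5000) TO (15000);
CREATE TABLE prt2_p3 PARTITION OF prt2 FOR VALUES FROM (15000) TO (30000);

-- with identical bounds, a join on the partition key can be broken into
-- joins between the matching partitions
SET enable_partitionwise_join = on;
EXPLAIN (COSTS OFF)
SELECT * FROM prt1 t1 JOIN prt2 t2 ON t1.a = t2.a;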
Advanced partition matching algorithm
The problem with outer joins
Multiple matching partitions
Curious case of hash partitioned tables
Even with all these limitations, what we have today is a very useful solution which serves most practical cases. Needless to say, this feature works seamlessly with FDW join push-down, adding to the sharding capabilities that PostgreSQL already has!