Walking with the Elephants

Monday, July 21, 2025

Perks of speaking at PostgreSQL conferences

A couple weeks back, I received my speaker's gift from POSETTE: An Event for Postgres.

The woolen elephant was claimed by my daughter as soon as the parcel ws opened (and gave me this photo in exchange). My son has been collecting the activity books and stickers for some time now. He claimed those two. Me and my wife are eyeing the pair of socks. The most precious, a handwritten personal thank you note, remains with me!

PostgreSQL conferences are an important part of any PostgreSQL community member's life, developers, users, DBAs, anybody who is involved with PostgreSQL in any role. There are plenty of them that run throughout the year. Some of them are meetups which run for a few hours, some are PGDays which usually run for a day and then some are full blown conferences which run for at least a couple of days and have auxilliary events surrounding the conference days. Each of them is unique because of geography, language or the content. While their schedules facilitate networking and people interaction more, they are centered around talks. The conferences usually do not pay their speakers but they convey their appreciation through speaker's gifts. I thought I would post about speaker's gifts I have received so far and are still with me, the ones in the photo.

Leftmost is a bottle, made of mostly bamboo and thus environment friendly, which I received from Hari Kiran who runs https://lnkd.in/giwyMh96 and PGDay Hyderabad. Next to that is a black bottle I received from PGConf India last year. PGConf.Asia, when it was hosted in Tokyo, presented a steel glass with each speaker's name embossed on it. That is the oldest surviving gift but the steel is still strong and shiny as it was back then. I actually liked the box with deep shiny blue satin lining that contained it. I still have the box too.

The frame is again from PGConf.Asia when it was hosted in China but virtually. It accompanied me in my office in the COVID lock-down. It's eyes are peculiar and make it quite lively. I don't know what cultural significance the mask has in China but some people visiting me in my office have saluted it thinking it to be some protector deity.

Next to it is a simple looking bottle of candies from PGConf.dev 🐘 this year. They are different from usual candies we get in India or those that I have tasted till now. first they have something written on them. Every candy reminds you of pgconf.dev 2025. They seem to be small pieces of long bars of made of stands of candies which are made of different flavours. So when you put a candy in your mouth those strands separate and leave different flavours on your tongue at a time. Very enjoyable experience. That's why it's only half empty - we are consuming them carefully.

While the conference organizers work hard to make each conference memorable, these gifts make them even more so.

Friday, July 11, 2025

Effect of hash table size on hash operations

When implementing an optimization for derived clause lookup myself, Amit Langote and David Rowley argued about the initial size of hash table (which would hold the clauses). See some discussions around this email on pgsql-hackers.

The hash_create() API in PostgreSQL takes initial size as an argument. It allocates memory for those many hash entries upfront. If more entries are added, it will expand that memory later. The point of argument was what should be the initial size of the hash table, introduced by that patch, containing the derived clauses. During the discussion, David hypothesised that the size of the hash table affects the efficiency of the hash table operations depending upon whether the hash table fits cache line. While I thought it's reasonable to assume so, the practical impact wouldn't be noticeable. I thought that beyond saving a few bytes choosing the right hash table size wasn't going to have any noticeable effects. If an derived clause lookup or insert became a bit slower, nobody would even notice it. It was practically easy to address David's concern by using the number of derived clauses at the time of creating the hash table to decide initial size of the hash table. The patch was committed.

Within a few months, I faced the same problem again when working on resizing shared buffers without server restart. The buffer manager maintains a buffer look table in the form of a hash table to map a page to buffer. When the number of configured buffers changes upon a server restart the size of buffer lookup table also changes. Doing that in a running server would be significant work. To avoid that, we could create a buffer lookup table large enough to accommodate future buffer size needs. Even if the buffer pool shrinks or expands, the size of the buffer lookup table would not change. As long as the expansion is within the buffer lookup table size limit, it could be done without a restart. Buffer lookup table isn't as large as the buffer pool itself, thus wasting a bit of memory can be considered worth the flexibility it provided. However, David's concern about the hash table size came back again. This time though, I decided to actually measure the impact.

Experiment

I am building an extension that can be used to measure performance of various PostgreSQL APIs or data structures. I used function sparse_hashes() in branch for this experiment. The function takes number of entries in the hash table and inital size of hash table as inputs and outputs time taken to insert, search and delete the given number of entries from the hash table. Since I was interested in measuring the effect of hash table size on buffer lookup table, I used buffer look up table's key and entry structures to create the hash table.

/* Create a hash table, BufferTag maps to Buffer */
info.keysize = sizeof(BufferTag);
info.entrysize = sizeof(BufferLookupEnt);
info.num_partitions = NUM_BUFFER_PARTITIONS;

hashtab = hash_create("experimental buffer hash table", max_entries, &info,
HASH_ELEM | HASH_BLOBS);

I invoked this function with varying hash table sizes but keeping the number of elements to 16384 (the default buffer pool size), using following query:

create table msmts_v as
select run,
hash_size,
(sparse_hashes(hash_size, 16 * 1024)).*
from generate_series(16 * 1024, 16 * 1024 * 1024, 8 * 1024) hash_size,
generate_series(1, 120) run;

Mind you, the query takes a lot of time to run. Following query is used to consolidate the results.

select hash_size/(16 * 1024) "hash size as multiple of number of elements",
round((log(hash_size/(16 * 1024))/log(64))::numeric, 1) as "logarithmic hash size steps",
avg(insert_time) insert,
avg(search_time) search,
avg(delete_time) delete
from msmts
group by hash_size
order by hash_size;

with that, plot of insert, search and delete timings, for 16384 elements, against the hash size looks like

The plot clearly shows that the hash table performance degrades as the hash table size increases. So David's hypotheses was right. However it doesn't degrade with a small change in hash table size which was what was being debated in the derived clauses thread. Even if the hash table size is hundreds of times higher than the actual number of elements in it, there's hardly noticeable effect on performance. Next interesting thing to notice is the degradation is step-wise. Looking at the purple, green and blue lines against the orange line, we can infer that the steps roughly correspond to some multiple of log of hash table to the base of 64, which is roughly the cache line size. Again David seems to be correct in that the degradation is because of cache line faults. Nonetheless our notion of hash table size being immaterial in O(1) is questionable!

Going back to the shared buffer lookup table size, configuring a very large buffer lookup table would certainly impact the performance but whether that will be noticeable in TPS is subject of another experiment and another post.

Friday, June 6, 2025

Avoiding disk spills due to PostgreSQL's logical replication

Logical replication is a versatile feature offered in PostgreSQL. I have discussed the the theoretical background of this feature in detail in my POSETTE talk. At the end of the talk, I emphasize the need for monitoring logical replication setup. If you are using logical replication and have setup monitoring you will be familiar with pg_stat_replication_slots. In some cases this view shows high amount of spill_txns, spill_count and spill_bytes, which indicates that the WAL sender corresponding to that replication slot is using high amount of disk space. This increases load on the IO subsystem affecting the performance. It also means that there is less disk available for user data and regular transactions to operate. This is an indication that logical_decoding_work_mem has been configured too low. That's the subject of this blog: how to decide the right configuration value for logical_decoding_work_mem. Let's first discuss the purpose of this GUC. Blog might serve as a good background before reading further.

Reorder buffer and logical_decoding_work_mem

When decoding WAL, a logical WAL sender accumulates the transaction in an in-memory data structure called reorder buffer. For every transaction that WAL sender encounters, it maintains a queue of changes in that transaction. As it reads each WAL records, it finds the transaction ID which it belongs to and adds it to the corresponding queue of changes. As soon as it sees a COMMIT record of a transaction, it decodes all the changes in the corresponding queue and sends downstream. If the reorder buffer fills up by transactions whose COMMIT record is yet to be seen, it spills the queue to the disk. We see such disk spills accounted in spill_txns, spill_count and spill_bytes. The amount of memory allocated to reorder buffer is decided by logical_decoding_work_mem GUC. If GUC value is lower, it will cause high disk spills and if the value is higher it will waste memory. Every WAL sender in the server will allocate logical_decoding_work_mem amount of memory, thus the total memory consumed for maintaining reorder buffer is {number of WAL senders} * logical_decoding_work_mem which can go upto max_wal_senders * logical_decoding_work_mem.

Setting logical_decoding_work_mem optimally

It's clear that reorder buffer should be able to hold WAL records of all the concurrent transactions to avoid disk spills. How many concurrent transactions there can be? Every backend in PostgreSQL, client as well as worker can potentially start a transaction and there can be only one transaction active at a given time in a given backend. Thus the higher bound on the number of concurrent transactions in a server is decided by max_connections which decided the maximum number of client backends in the server, max_prepared_transactions which decides the number of prepared transaction in addition to the transactions in client backends, max_worker_processes and autovacuum_max_workers which together decide the backends other than the client backends which may execute transactions. The sum of all these GUCs gives the higher bound on the number of concurrent transactions that can be running at a time in a server. Assuming that average amount of WAL produced by each transaction is known, the total amount WAL that may get added to reorder buffers is {maximum number of transactions in the system} * {average amount of WAL produced by each transaction}. The question is how to find the average?

Transactions by different applications and transactions by worker processes all may have different characteristics and thus produce different amounts of WAL. But they all compete for space in reorder buffer and they all are part of a single WAL stream, which can be examined by pg_waldump. There are a few ways, we can utilize this utility to estimate logical_decoding_work_mem.

Count the number of commits or aborts in a given set of WAL segments and divide the total size of WAL segments by that count. The total size of WAL segments will be {number of WAL segments] * {size of each WAL segment}. If you are seeing transactions being spilled to disk, the total amount of WAL generated by concurrent transactions is higher than logical_decoding_work_mem which by default is 64MB which is equivalent to 4 WAL segments of, default size, 16MB each. So you will need to analyze several WAL segments not just a few.
pg_waldump reports WAL records by transaction. It can be used for a better estimate by sampling typical transactions from pg_waldump and estimating sizes of each such typical transactions and their counts.
Modify pg_waldump to keep a running count of amount of WAL accumulated in reorg buffer. The algorithm would look like below

T = 0
Read a WAL record. If the record belongs to transaction x, Cx = Cx + size of WAL record, where Cx maintains the total size of WAL records of transaction x so far. If x is a new transaction, Cx = size of WAL record
T = T + Cx where T is the total size of WAL records accumulated in reorder buffer when that record was read.
When a COMMIT or ABORT WAL record of transaction x is read, T = T - Cx.

This way T tracks the size of WAL records accumulated in the reorder buffer at a given point in time. Maximum over T can be used to estimate logical_decoding_work_mem.

If you are not comfortable with C or pg_waldump, above option can be implemented by parsing output of pg_waldump using higher level languages like python.

Once you have estimated the maximum amount of WAL that may get accumulated in the reorder buffer, add about 5% overhead of other data structures with reorder buffer and you have your first estimate of logical_decoding_work_mem. It can be refined further by setting the GUC and monitoring pg_stat_replication_slots.

However, remember that each WAL sender will consume logical_decoding_work_mem amount of memory which may affect the total memory available for the regular server operation. You may find an optimal value which leaves enough memory for regular server operation while reducing the disk spills. Option 3 and 4 would help you with that. If you plot the curve of T against time, you will find memory consumed by the WAL senders in the steady state, eliminating any picks or troughs in memory usage by logical decoding. logical_decoding_work_mem should be configured keeping this steady state consumption in mind.

Even after doing all this, the disk spill is high or there's too much memory consumed by WAL senders, you best bet is to use streaming in-progress transactions by specifying streaming parameter to logical replication protocol. Find more about that in this blog.

If you know other ways to estimate logical decoding work memory or avoiding disk spill, please comment on this blog.

Friday, November 29, 2024

The PostgreSQL operator labyrinth

While working on SQL/PGQ patch I wanted to find an equality operator for given left and right argument types to construct a condition to match an edge with its adjacent vertexes. It would look as simple as calling C function oper() with operator as "=" and required left and right data types. But soon it turned out to be a walk in PostgreSQL's operator labyrinth, which held my equality operator at the center instead of Minotaur.

First and foremost in PostgreSQL '=' does not necessarily mean and equality operator. It's simply a name of an operator used for comparing operands for equality. One could get swallowed by Sphinx for that. So oper() is useless. Equality operators are instead identified by strategies EqualStrategyNumbers like HTEqualStrategyNumber, BTEqualStrategyNumber, RTEqualStrategyNumber and so on. But there's no C function which would provide you an equality operator given the strategy number and data types of left and right operands. Suddenly I found myself trapped in the index labyrinth since BT, HT and RT are related to hash, b-tree and R-tree indexes. Now, all I was doing was begging to get out of the labyrinth rather than finding answer to my seemingly simple question. But this Wit-Sharpening Potion helped me to find my path out of the labyrinth and also answered my question.

The path is surprising simple Index -> Operator Class -> Operator Family -> Operator. Like Daedalus's labyrinth, it's unicursal but has a four course design instead of seven course. Like the An index needs operators to compare values of a column or an indexed expression. All values being indexes are of the same datatype. An operator class holds all the required comparison operators for that datatype. However, a a value being searched or compared to in that index may not necessarily have the same datatype. For example an index may be on an column of type int4 but it could still be used to search a value of type int2. PostgreSQL requires different operators for different pairs of operand data types as the semantics to compare values from same datatype may be different from those from different data types. That's where an operator family comes into picture. It holds operator classes, one for each data type in the "family" of data types e.g. integers. Each operator class would still contains operators comparing values of the same datatype. "Loose" operators in an operator family are used to compare values from different datatypes.

If you know an operator family, equality strategy and the data types of left and right operands, you can find the operator using get_opfamily_member(). But there's no ready function to get operator family given the data types of operands. Instead you have to taken a convoluted route, otherwise I would't call that simple path a labyrinth. From the two datatypes we choose one, usually the datatype of the values in a set being searched. Like datatype of primary key, which holds the set of values in which we search for a foreign key value. Find the comparison operators for that datatype using get_sort_group_operators(). Using sorting operator returned by that function, search for the operator family using get_ordering_op_properties(). Pass that operator family (and strategy) to get_opfamily_member() along with the datatypes of operands to reach the operator you want. Interestingly, get_sort_group_operators() calls lookup_type_cache() which saves the preferred operator family tree in type cache. But it's not exposed outside.

Hope this blog serves as a Cretan coin depicting PostgreSQL operator labyrinth.

Update on 3rd December 2024 - There's another passage from datatypes to equality operator. Use GetDefaultOpclass() to get the default operator class for the chosen datatype. From there get the operator family of that operator class using get_opclass_family(). Use get_opfamily_member() to get the desired operator. With this method, you can try both hash and btree methods since an equality operator is available in both the methods. In the earlier method you could get the operator family only if there existed an ordering which is not available in hash method. It doesn't look unicursal anymore and thus not a labyrinth but a maze!

Friday, June 7, 2024

SQL/PGQ and graph theory

The story goes almost two decades back. I was studying computer engineering in College of Engineering Pune. Prof Vinayak Joshi had recently joined COEP and had research interest in discrete mathematic, especially Lattices. He needed some software for exploring some patterns in graphs. Matlab, though almost three decades old by then, was new to us and quite expensive. He approached my scholar friend Amit Bose and requested him to write a program for exploring the desired graph patterns. If memory serves, Amit wrote the program in C and quite possibly used TurboC as an IDE. That was a remarkable feat given that graphs could consume a huge memory, C is a very basic language and TurboC was clumsy and clunky back then. May be Amit knew how to use gcc and linux which allowed huge memory models. (I am sure today's software engineers haven't heard about memory models.) In the due course we studied discrete mathematics, graph theory and also lattices.

Few years later I joined post-graduate program in Computer Science and Engineering at IIT Bombay. The institute employed eminent faculty in the area of theoretical computer science, including graph theory. I worked as a teaching assistant to Prof. Ajit Diwan and Prof. Abhiram Ranade. I also took a course run by Prof. Sundar Vishwanathan. That developed my interest in the graph theory. But I felt any work in graph theory required a lot of patience (more details to follow) and it was too difficult and intangible. Like many others in my class, I dared not take a research project in the subject.

Graph theory still fascinates me. Graphs are still my first go-to-tool to model and solve any problem in profession or life. If graph theory fails I use other tools. Fast forward to the present, when I saw opportunity to work with graphs in RDBMS, I immediately grabbed it. Follow pgsql-hackers thread for more about that work. I don't know whether Prof. Ajit had graph databases in his mind when he said to our class, "Do you think that I can not offer projects in databases? I can offer a lot of them". But that's quite true. In the age of AI and analytics, graph databases are once again a hot topic.

In Prof. Sundar prescribed "Introduction to graph theory (second edition)" by Douglas West for his course. the exercises in that book were like puzzles for me. I liked to work those out myself. Most of the times I didn't. One such problem was related to "king". If memory serves me, it was 1.4.38 on page 66 of that book, which I still have with me. I spent hours trying to prove the theorem but did not succeed. I went to Prof. Sundar for help. He patiently listened to all the things I had tried to solve that problem. He said I was very close to the solution and any hint from him would be as good as the solution itself. He suggested that I sit in his room and try again. After an hour of struggle, I left his room without any success. The problem still haunts me.

I don't know whether Matlab is still expensive and whether Prof. Joshi is still using programs to explore his graphs. But if he is, SQL/PGQ might come handy. Having it in PostgreSQL means they can use it for free. All the database capabilities allow them to store and retrieve the graphs they have tried in their research. Let's take a simple example, of a king.

In a digraph, a king is a vertex from which every vertex is reachable by a path of length at most 2. In other words, if a vertex v is a king, it is connected to every other vertex by a path of length at most two. Let's see how to do that with SQL/PGQ. Assume a table "vertexes" which contains all the vertexes of the graph and a table "edges" which contains all the edges in the graph connecting those vertices.

create table vertexes (id int primary key,

name varchar(10));

create table edges (id int primary key,

src int references vertexes(id),

dest int references vertexes(id),

name varchar(10));

create property graph tournament

vertex tables (vertexes default label)

edge tables (edges source key (src) references vertexes(id)

destination key(dest) references vertexes(id)

default label)

Let's build the query to find a king step by step. First step would be to find all the nodes reachable from a given node by a path of length at most 2.

select src_name, dest_name

from graph_table (tournament

match (src is vertexes)->{1,2}(dest is vertexes)

where (src.id <> dest.id)

columns (src.name as src_name, dest.name as dest_name))

order by src_name;;

I have discussed most of the constructs in my previous posts on DBaaG and its components. {1, 2} is a new construct being used here which indicates that the path between src and dest is of length maximum 2. Thus it lists all src nodes and respective dest nodes which are connected to their respective src node by a path of length at most 2. Also notice that we are eliminating the src node being reported as dest node to simplify the next step in the query.

Now we have to find node/s which is connected to all the nodes in such a way. To do that we simply count the distinct dest nodes an src node is reachable to. If this count is same as the number of vertexes in the graph but one, corresponding src node is the king.

select src_name, count(distinct dest_name) num_reachable_nodes

from graph_table (tournament

match (src is vertexes)->{1,2}(dest is vertexes)

where (src.id <> dest.id)

columns (src.name as src_name, dest.name as dest_name))

group by src_name

having count(distinct dest_name) = (select count(*) - 1 from vertexes);

distinct in aggregate count makes sure to count each dest node only once. having clause filters every node which is not connected to all the other nodes in the graph. If we populate vertex and edge tables as follows:

insert into vertexes values (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e');

insert into edges values (1, 1, 2, 'a-b'), (2, 2, 3, 'b-c'), (3, 1, 3, 'a-c'), (4, 3, 4, 'c-d'), (5, 4, 5, 'd-e'), (6, 2, 1, 'b-a');

This will create a graph as shown in the figure below. Notice the cycle a->b->a. Node e is reachable from node a by a path of length 3. All other nodes are reachable from node a by a path of length at most 2.

Above query does not return any rows since there is no king in this graph.

Let's remove outlier node e and the edge connecting nodes d and e.

delete from edges where name = 'd-e';

delete from vertexes where name = 'e';

With that the above query returns two king nodes a and b.

src_name num_reachable_nodes

---------- -------------------

a | 3

b | 3

PostgreSQL is loved by developers. Hope introduction of SQL/PGQ makes it popular among graph theory researchers as well.

Friday, May 3, 2024

Property graphs: elements, labels and properties

A property graph consists of three types of "things" in it: elements, labels and properties.

Elements are nodes or edges in the graphs. They form the basic structure of a graph. An edge connects two nodes. Two nodes may be connected by multiple edges corresponding to different relationships between them.

Labels classify the elements. An element may belong to multiple classes and thus have multiple labels.

Properties are key-value pairs providing more information about an element. All the elements with the same label expose same set of keys or properties. A property of a given element may be exposed through multiple labels associated with that element.

Let's use the diagram here to understand the concepts better. There are three elements: N1, N2, the vertexes and an edge connecting them, labels L1 to L4, properties P1 to P7. The arrows connecting a label to an property indicates that that label exposes that property. E.g. label L3 exposes properties P2, P3, P5. Property P1 is exposed by both L1 and L2. An arrow between an element and a label indicates that that label is associated with that element. N1 has labels L1 and L2 whereas the edge has just one label L4. The properties that are associated with (and are exposed by) an element are decided by the labels associated with it. E.g. the properties P1, P2 and P4, which are union of properties associated with labels L1 and L2, are exposed by element N1. P1 has the same value v1 irrespective of which label is considered for this association. E.g height of a person will not change whether that person is classified as a teacher, businessman or a plumber. Similarly notice that the edge exposes properties P6 and P7 since it is labelled as L4.

SQL/PGQ's path pattern specification language allows to specify paths in terms of labels ultimately exposing the properties of individual paths that obey that patterns. E.g. (a IS L1 | L2)-[]->(b IS L3) COLUMNS (a.P3) will returns values of property P2 of all the nodes with labels L1 or L2. If you notice that N1 and N2 are the elements associated with either L1 or L2 or both. But N1 does not expose property P3. Hence we might expect that the above query would return an error. But instead the standard specified that it should report NULL, quite inline with the spirit of SQL NULL which means unknown.

The way I see it, a property can not exist without at least one label exposing it. A label can not exist without being associated with at least an element. But once defined, they have quite an independent existence.

Wednesday, April 24, 2024

PostgreSQL's memory allocations

There's a thread on hackers about recovering memory consumed by paths. A reference count is maintained in each path. Once paths are created for all the upper level relations that a given relation participates in, any unused paths, for which reference count is 0, are freed. This adds extra code and CPU cycles to traverse the paths, maintain reference counts and free the paths. Yet, the patch did not show any performance degradation. I was curious to know why. I ran a small experiment.

Experiment

I wrote an extension palloc_test which adds two SQL-callable functions palloc_pfree() and mem_context_free() written in C. Function definitions can be found here. The first function palloc's some memory and then pfree's it immediately. Other function just palloc's but never pfrees, assuming that the memory will be freed when the per-tuple memory context is freed. Both functions take the number of iterations and size of memory allocated in each iteration respectively as inputs. These functions return amount of time taken to execute the loop allocating memory. It appears that the first function spends CPU cycles to free memory and the second one doesn't. So the first one should be slower than the second one.

Results

The table below shows the amount of time reported by the respective functions to execute the loop as many times as the value in the first column, each iteration allocating 100 bytes. The figure shows the same as a plot. The time taken to finish the loop increases linearly for both the function indicating that the palloc logic is O(n) in terms of number of allocations. But the lines cross each other around 300K allocations.

count	palloc_pfree	memory context reset
100	0.0029	0.007124
100100	2.5646	5.079862
200100	5.1682	10.375552
300100	7.6373	15.704286
400100	10.1827	19.038238
500100	12.7013	23.847599
600100	15.2838	28.708501
700100	17.8255	36.982928
800100	20.3718	41.863638
900100	23.0706	44.332727
1000100	51.3311	52.546201
2000100	56.7407	104.747792
3000100	76.3961	154.225157
4000100	102.3415	206.510045
5000100	126.1954	256.367685
6000100	155.8812	314.178951
7000100	179.9267	367.597501
8000100	206.2112	420.003351
9000100	234.7584	474.137076

Inference and conclusion

This agrees with the observations I posted on the thread. Instead of letting all the useless path to be freed when query finishes, freeing them periodically during planning is time efficient as well as memory efficient. It compensates for the extra CPU cycles spent to maintain reference counts, traverse and free paths.

The actual memory allocation and freeing pattern as implemented in that patch is different from that in the experiment, so it might be worth repeating those experiments by simulating similar pattern.

I used chunk size of 100 since I thought it's closer to the order of average path size. But it might be worth repeating the experiment with larger chunk sizes to generalize the result.