Walking with the Elephants: April 2024

There's a thread on hackers about recovering memory consumed by paths. A reference count is maintained in each path. Once paths are created for all the upper level relations that a given relation participates in, any unused paths, for which reference count is 0, are freed. This adds extra code and CPU cycles to traverse the paths, maintain reference counts and free the paths. Yet, the patch did not show any performance degradation. I was curious to know why. I ran a small experiment.

Experiment

I wrote an extension palloc_test which adds two SQL-callable functions palloc_pfree() and mem_context_free() written in C. Function definitions can be found here. The first function palloc's some memory and then pfree's it immediately. Other function just palloc's but never pfrees, assuming that the memory will be freed when the per-tuple memory context is freed. Both functions take the number of iterations and size of memory allocated in each iteration respectively as inputs. These functions return amount of time taken to execute the loop allocating memory. It appears that the first function spends CPU cycles to free memory and the second one doesn't. So the first one should be slower than the second one.

Results

The table below shows the amount of time reported by the respective functions to execute the loop as many times as the value in the first column, each iteration allocating 100 bytes. The figure shows the same as a plot. The time taken to finish the loop increases linearly for both the function indicating that the palloc logic is O(n) in terms of number of allocations. But the lines cross each other around 300K allocations.

count	palloc_pfree	memory context reset
100	0.0029	0.007124
100100	2.5646	5.079862
200100	5.1682	10.375552
300100	7.6373	15.704286
400100	10.1827	19.038238
500100	12.7013	23.847599
600100	15.2838	28.708501
700100	17.8255	36.982928
800100	20.3718	41.863638
900100	23.0706	44.332727
1000100	51.3311	52.546201
2000100	56.7407	104.747792
3000100	76.3961	154.225157
4000100	102.3415	206.510045
5000100	126.1954	256.367685
6000100	155.8812	314.178951
7000100	179.9267	367.597501
8000100	206.2112	420.003351
9000100	234.7584	474.137076

Inference and conclusion

This agrees with the observations I posted on the thread. Instead of letting all the useless path to be freed when query finishes, freeing them periodically during planning is time efficient as well as memory efficient. It compensates for the extra CPU cycles spent to maintain reference counts, traverse and free paths.

The actual memory allocation and freeing pattern as implemented in that patch is different from that in the experiment, so it might be worth repeating those experiments by simulating similar pattern.

I used chunk size of 100 since I thought it's closer to the order of average path size. But it might be worth repeating the experiment with larger chunk sizes to generalize the result.

For those who have studied ERD-lore, it's not new that a relational database is very much like a graph. But it has taken SQL, more than 30 years since it became a standard and almost half a century since its inception to incorporate construct that will allow a DataBase to be treated as a Graph, DBaaG. This is surprising given that SQL was developed as language for relational databases which are modeled using ER diagrams. Better late than never. SQL/PGQ has arrived as 16th part of SQL:2023.

Entity Relationship Diagram, ERD in short, is a tool to model and visualize a database as entity types (which classify the things of interest) and relationships that can exist between them. Entity types and the relationships both map to relations in a Relational DataBase Management System (RDBMS in short). The rows in the relations represent entities (instances of entity types) and relationship between entities respectively. Fig. 1 below shows an ERD for a hypothetical shop.

This diagram very much looks like a graph with entity types represented as nodes and relationships represented by edges. That's exactly what SQL/PGQ is about. It adds language constructs to SQL to present underlying database as a "Property Graph". For example, property graph definition corresponding to the above ERD would look like

CREATE PROPERTY GRAPH shop

VERTEX TABLES (

CreditCard label Payment,

BankAccount label Payment,

Person label Customer,

Company label Customer,

Trust label Customer,

Wishlist label ProdLink,

Order label ProdLink,

Product)

EDGE TABLES (

CCOwns label Owns

BAHolds lable Owns,

CustOrders label CustLink,

CustWishlist label CustLink,

CompanyOrders label CustLink,

CompanyWishlist label CustLink,

TrustOrders label CustLink,

TrustWishlist label CustLink,

OrderCCPayment label OrderPayment,

OrderBAPayment label OrderPayment,

OrderItems label ItemLink,

WishlistItems label ItemLink);

Clever readers may have noticed that some of the entity types have some commonality. CreditCard and BankAccount are both Payment methods. Person, Company and Trust all can be considered as "Customers" by the shop. In a graph entities with commonalities will be represented by visual annotations like colors in Fig. 2. SQL/PGQ chooses to represents them by labels. Columns of the underlying tables are exposed through properties of labels. Traditionally the labels may be implemented as table inheritance or through tables abstracting commonalities. But it may not be necessary anymore.

Augmented with query constructs in SQL/PGQ they make it easy to write queries, especially analytical. Imagine a query to find all the products paid via credit card. There will be tons of JOIN and UNIONs over those joins. That's almost like "implementing" a logic in SQL. You would ask which parts do I JOIN before UNION, and which parts do I UNION before JOIN and so on. That's against the "imperative" spirit of SQL which should allow you to tell "what" you want and leave "how" for the DBMS to figure out. With SQL/PGQ you tell the DBaaG which paths in the graph to traverse. How to traverse them is the system's responsibility. So the SQL/PGQ query looks like below. Much simpler than joins and unions. In fact, it allows me not to mention edge tables at all in the query.

SELECT distinct name FROM

GRAPH_TABLE (shop

MATCHES

(o IS Orders)->(py IS Payment WHERE py.type = 'CC')<-(c IS Customer)->(o IS Order)->(p is Product)

COLUMNS (p.name));

I must note that the query looks more like a mathematical equation than SQL which till now followed natural language syntax. But well, there it is.

What those () mean? What about various literals in it? How to specify properties? I am sure I have roused more questions than those answered here. I plan to write more about it in future. This blog has gone longer than I initially intended it to be, but I hope it has aroused your interest in SQL/PGQ nonetheless.

Oh! but before I end, please note that we are working on implementing SQL/PGQ in PostgreSQL. If you are interested and want to contribute, please follow and respond on pgsql-hackers thread.

Walking with the Elephants

Wednesday, April 24, 2024

PostgreSQL's memory allocations

Tuesday, April 23, 2024

DBaaG with SQL/PGQ