Wednesday, March 21, 2018

Containing bloat with partitions

PGConf India 2018 attracted a large number of PostgreSQL users and developers. I talked about "Query optimization techniques for partitioned tables" (slides). Last year, I had held an introductory talk about PostgreSQL's declarative partitioning support (slides). Many conference participants shared their perspective on partitioning with me. One particular query got me experimenting a bit.

The user had a huge table, almost 1TB in size, with one of the columns recording the data-creation time. Applications added MBs of new data daily and updated only the recent data. The old data was retained in the table for reporting and compliance purposes. The updates bloated the table, autovacuum wasn't clearing the bloat efficiently. Manual vacuum was out of scope as that would have locked the table for much longer. As a result queries were slow, and performance degraded day by day. (Read more about bloats and vacuum here and here.) The user was interested in knowing if partitioning would help.

Hot and Cold partitioning

The concept of hot and cold partitioning isn't new. The idea is to separate data being accessed and modified frequently (Hot data) from the data which is accessed and modified rarely (Cold data). In the above case, that can be achieved by partitioning the data by the creation timestamp. The partitions should be sized such that the updates and inserts access only a handful Hot partitions (ideally at most two). The Cold partitions containing the stale data would remain almost unchanged. Since the updates are taking place in the Hot partitions, those get bloated, but their sizes are much smaller than the whole table. Vacuuming those doesn't take as much time as the whole table. Once they become Cold, they hardly need any Vacuuming. Thus containing the bloat effectively.

In PostgreSQL autovacuum, if enabled on the given table, runs its job when the number of inserted, deleted or updated rows are above certain thresholds (See details). Since all the action happens in the Hot partitions, only those partitions can have their counts rise beyond the threshold. Those counts are hardly expected to change for Cold partitions (or once they become cold, if they were hot in the past). Thus autovacuum too automatically starts working on only the Hot partitions instead of entire table. This isn't the case with an unpartitioned table, for which autovacuum always runs on the entire table; it possibly never completes its job because of sheer size of the table and not necessarily because of the rate at which bloat is created.

Experiment

That's all theory, but I ventured to see how effective this could be. I created a table with two partitions, one Hot partition, representing Hot data and one Cold partition, representing Cold data. (Note: all the code below serves only as example and should be used with necessary caution.).

create table part (a int, b int, c varchar, d varchar, e varchar) partition by range(a);
create table part_active partition of part for values from (990000) to (1000001);
create table part_default partition of part default;

I then inserted one million rows in this table, each row with distinct value for column a from 1 to 1000000. This means that the partition "part_active" which represents the Hot data (or latest data, if a is interpreted as some kind of timestamp) contains 1% of the total data in table "part".

For comparison, I also created an unpartitioned table "upart" with similar schema and populated it with the same data.


create table upart (a int, b int, c varchar, d varchar, e varchar);

I disabled autovacuum on these tables to create bloat using "with (autovacuum_enabled = 'false')", but that's only for the experimentation. In production or test environment, one should set this option as per the requirements of the setup.

Then I ran commands to update each of the rows with a between 990000 and 1000000 thrice. In the partitioned table this updated the rows in partition part_active, the Hot partition. Since autovacuum is disabled, it would not remove the old versions of the tuples. So, we see the statistics as

select n_tup_upd, n_tup_hot_upd, n_dead_tup, n_live_tup from pg_stat_user_tables where relid = 'upart'::regclass;
 n_tup_upd | n_tup_hot_upd | n_dead_tup | n_live_tup 
-----------+---------------+------------+------------
     30003 |            23 |      30003 |    1000000
(1 row)

select n_tup_upd, n_tup_hot_upd, n_dead_tup, n_live_tup from pg_stat_user_tables where relid = 'part_active'::regclass;
 n_tup_upd | n_tup_hot_upd | n_dead_tup | n_live_tup 
-----------+---------------+------------+------------
     30003 |            23 |      30003 |      10001
(1 row)

select n_tup_upd, n_tup_hot_upd, n_dead_tup, n_live_tup from pg_stat_user_tables where relid = 'part_default'::regclass;
 n_tup_upd | n_tup_hot_upd | n_dead_tup | n_live_tup 
-----------+---------------+------------+------------
         0 |             0 |          0 |     989999

(1 row)

As you will see that the unpartitioned table and the Hot partition both have same number of dead tuples. The Cold partition doesn't have any dead tuples since no row in that partition was updated. At this point the sizes of unpartitioned table and the Hot partition are.

select pg_size_pretty(pg_relation_size('upart'::regclass));
 pg_size_pretty 
----------------
 326 MB

(1 row)

select pg_size_pretty(pg_relation_size('part_active'::regclass));
 pg_size_pretty 
----------------
 13 MB
(1 row)

The Hot partition which contains only 1% of the rows of the unpartitioned table has its size much larger than that proportion. That's because all the bloat in partitioned table is concentrated in that partition.

Now, let's try to run VACUUM ANALYZE on these tables to remove the bloat and update statistics.

\timing on
vacuum full analyze upart;
Time: 4663.576 ms (00:04.664)
\timing off

\timing on
vacuum full analyze part_active;
Time: 53.314 ms
\timing off

After vacuuming the sizes of the unpartitioned table and the Hot partition are
select pg_size_pretty(pg_relation_size('upart'::regclass));
 pg_size_pretty 
----------------
 326 MB
(1 row)

select pg_size_pretty(pg_relation_size('part_active'::regclass));
 pg_size_pretty 
----------------
 3336 kB
(1 row)

Now the sizes of the table are in expected proportion with the bloat removed.

Notice that the time required for vacuuming Hot partition is 80 times lesser than the time required for vacuuming the unpartitioned table. Effectively, the bloat in entire partitioned table is cleared since partitioning has restricted the bloat only to the Hot partition. Since vacuum is now taking much lesser time it's possible to schedule it within the down-time and the time for which the table remains locked is also within the reasonable limits. This isn't magic (neither is partitioning a spell simple to cast). Observe that the reduction in time is inline with the proportion of Hot data in the total data. When vacuum is run on the unpartitioned table, it has to scan the whole unpartitioned table. But in case of a partitioned table, it needs to scan only the Hot partition which is much smaller than the unpartitioned table and thus takes much lesser time.

The customer had 1TB of data and the experiment above runs with only MBs of data. But that's all my laptop could afford and that's all time permitted me. But you got the idea. EnterpriseDB is implementing zero bloat heap which avoids bloat to start with, but it's going to take some time. Meanwhile you may try this option, but experiment with real-sized data.

Word of caution

Declarative partitioning is a new feature in PostgreSQL 10. Not all the functionalities like, foreign key, unique constraints, primary key that work with a regular table work with a partitioned table. Many of them will be part of PostgreSQL 11, but it may take few more releases to cover all the ground. It's always advisable to use the latest version of PostgreSQL and test the applications, for performance and correctness, before deploying in production.