Feature proposal: immutable/sealed partitions (and maybe tables, too) - Mailing list pgsql-general

From Levi Aul
Subject Feature proposal: immutable/sealed partitions (and maybe tables, too)
Date
Msg-id CAMFocdCm0qqzUy30xZn4_tR4VLnv599GD8qx3VRJjCQmtz-6Ag@mail.gmail.com
Whole thread Raw
Responses Re: Feature proposal: immutable/sealed partitions (and maybe tables, too)
Re: Feature proposal: immutable/sealed partitions (and maybe tables, too)
List pgsql-general
My company runs some large OLAP data warehouses with append-only, time-partitioned datasets. Our workloads involve aggregations and joins, and query the data in ways not amenable to constraint-exclusion; and we serve a high concurrent number of these queries at once from a single DB.

In other words, our workload is inherently one that acquires "way too many locks." Our largest performance bottleneck, according to pg_wait_sampling, is the LockManager itself. Despite most of our queries spending only milliseconds actually executing, they often spend seconds during planning waiting to acquire hundreds of access-shared locks.

Given that our datasets are append-only, all our partitions for each table save for the one "active" one (the one for the current time period) are effectively immutable. No DML-triggered writes will occur to these. I think this is pretty common in data-warehouse use-cases of PG.

If PG could avoid the need to acquire the locks for these effectively-immutable partitions, then the remaining number of tables would be low enough to fit into the per-backend LWLock slots set, and so avoid LockManager contention. I believe this could be a large optimization not just for our use-case, but in a number of other high-concurrency OLAP use-cases.

My proposal for how this "lock elision under large numbers of immutable partitions" could be accomplished:

1. Add some DDL statement to mark partitions as sealed/unsealed. (ALTER TABLE ... SEAL PARTITION foo)
2. When query-planning DML against a partition or a partitioned table, treat a sealed partition as if it had an always-false check constraint.
2. Define a "locking group" abstraction, where many entities can register themselves under the same lock, such that access to all members of the locking group requires only acquiring the single locking-group lock. All sealed partitions of the same table would share a locking group.

Under such a setup, querying a time-based partitioned table with one active (unsealed) partition would only ever require acquiring, at most, two locks — the one for the active partition, and the one for the sealed-partitions locking group.

The trade-off for this is that acquiring an exclusive-access lock on the sealed-partitions locking-group for a table becomes much more expensive than it would have been to acquire for a single partition. But this isn't a problem in practice, because hot-path operations that take an exclusive-access lock (DML writes) are disallowed against sealed partitions. The only time the lock-group would need to be exclusive-access acquired, would be to change its membership — an administrative DDL operation.

Besides being useful operationally, such a mechanism would also be helpful on the business-logic level, as you can rely on partition sealing to turn accidental insertions of new data into any but the active partition(s) into a constraint violation. (Currently, to achieve this, separate triggers need to be maintained on each sealed partition.)

And, with knowledge of the administrative intent for a table to be immutable, further operational optimizations could be performed. A few off the top of my head:

1. Running CLUSTER or VACUUM (FULL, FREEZE) after the partition is marked as immutable, could rewrite the table using an implicit "heap_immutable" access method (still reported as "heap"), which would drop the min_xid column (as everything in a sealed table is guaranteed to be always-visible), and thus remove the table for consideration for xid-wraparound-protection rewriting. Such partitions would then require a rewrite back to "heap" if unsealed.

2. Alternatively, such storage-rewriting DDL statements could switch the table — and its indices — over to using an entirely different access-methods, which would store the data+indices in "perfect" packed forms, to maximize read performance while also minimizing disk usage.

3. ANALYZE could have an (otherwise-impractical) "EXACT" argument, to populate statistics with exact aggregate values, requiring reading all rows rather than sampling rows. This could pre-bake table-level aggregates for most columns, like having a single, table-sized BRIN block-range.

If this concept of "marking as sealed" were extended to tables rather than only partitions, then further work could be done related to optimization of bulk loads — e.g. having CREATE TABLE AS ... SEALED not generate WAL segments for the table as it is populated, but rather treat the table as UNLOGGED during population, and then, after creation, take the entire finalized/sealed table's backing files and either pass them directly to archive_command / send them directly to WAL receivers; or split+stream them into retrospective WAL segments (each segment containing a single "put this 16MB of data into this file at this position" op), and send those.

pgsql-general by date:

Previous
From: "Dirschel, Steve"
Date:
Subject: log_min_messages = warning
Next
From: Adrian Klaver
Date:
Subject: Re: log_min_messages = warning