[Patch] Add WHERE clause support to REFRESH MATERIALIZED VIEW - Mailing list pgsql-hackers
| From | Adam Brusselback |
|---|---|
| Subject | [Patch] Add WHERE clause support to REFRESH MATERIALIZED VIEW |
| Date | |
| Msg-id | CAMjNa7eFzTQ5=oZMQiB2bMkez5KP4A77JC7SRjeVEkOrh7cUHw@mail.gmail.com Whole thread Raw |
| Responses |
Re: [Patch] Add WHERE clause support to REFRESH MATERIALIZED VIEW
|
| List | pgsql-hackers |
Attached is a patch implementing support for a WHERE clause in REFRESH MATERIALIZED VIEW.
The syntax allows for targeted refreshes:
REFRESH MATERIALIZED VIEW mv WHERE invoice_id = ANY('{1,2,3}');
REFRESH MATERIALIZED VIEW CONCURRENTLY mv WHERE customer_id = 42;
REFRESH MATERIALIZED VIEW mv WHERE order_date >= '2023-01-01';
I was inspired to implement this feature after watching the Hacking Postgres discussion on the topic: https://www.youtube.com/watch?v=6cZvHjDrmlQ
This allows the user to restrict the refresh operation to a subset of the view. The qualification is applied to the view's output columns. The optimizer can then push this condition down to the underlying base tables, avoiding a full scan when only a known subset of data has changed.
Implementation notes:
1. The grammar accepts an optional WHERE clause. We forbid volatile functions in the clause to ensure correctness.
2. Non-Concurrent Partial Refresh: When `CONCURRENTLY` is not specified, the operation performs an in-place modification using a `ROW EXCLUSIVE` lock.
* This mode requires a unique index to ensure constraint violations are handled correctly (e.g., when a row's values change such that it "drifts" into or out of the `WHERE` clause scope).
* It executes a Prune + Upsert strategy:
* `DELETE` all rows in the materialized view that match the `WHERE` clause.
* `INSERT` the new data from the source query.
* It uses `ON CONFLICT DO UPDATE` during the insert phase to handle concurrency edge cases, ensuring the refresh is robust against constraint violations.
3. Concurrent Partial Refresh: When `CONCURRENTLY` is specified, it uses the existing diff/merge infrastructure (`refresh_by_match_merge`), limiting the scope of the diff (and the temporary table population) to the rows matching the predicate. This requires an `EXCLUSIVE` lock and a unique index, consistent with existing concurrent refresh behavior. It is much slower than `Non-Concurrent Partial Refresh`
4. The execution logic uses SPI to inject the predicate into the source queries during execution.
I have attached a benchmark suite to validate performance and correctness:
* `setup.sql`: Creates a schema `mv_benchmark` modeling an invoicing system (`invoices` and `invoice_lines`). It includes an aggregated materialized view (`invoice_summary`) and a control table (`invoice_summary_table`).
* `workload_*.sql`: pgbench scripts simulating a high-churn environment (45% inserts, 10% updates, 45% deletes) to maintain roughly stable dataset sizes while generating significant refresh work.
* `run_benchmark_comprehensive.sh`: Orchestrates the benchmark across multiple scale factors and concurrency levels.
The benchmark compares strategies for keeping a summary up to date (vs baseline):
* Partial Refresh: Triggers on the base table collect modified IDs and execute `REFRESH MATERIALIZED VIEW ... WHERE ...`.
* Materialized Table (Control): A standard table maintained via complex PL/pgSQL triggers (the traditional manual workaround).
* Full Refresh (Legacy): Manually refresh the view after changes.
Results are below:
Concurrency: 1 client(s)
----------------------------------------------------------------------------------
Scale Batch | Baseline TPS | Full (Rel) Partial (Rel) Table (Rel)
---------- ------ | ------------ | ------------ ------------ ------------
20000 1 | 5309.05 | 0.002x 0.437x 0.470x
20000 50 | 1209.32 | 0.010x 0.600x 0.598x
20000 1000 | 56.05 | 0.164x 0.594x 0.576x
400000 1 | 5136.91 | 0 x 0.450x 0.487x
400000 50 | 1709.17 | 0 x 0.497x 0.482x
400000 1000 | 110.35 | 0.006x 0.507x 0.460x
Concurrency: 4 client(s)
----------------------------------------------------------------------------------
Scale Batch | Baseline TPS | Full (Rel) Partial (Rel) Table (Rel)
---------- ------ | ------------ | ------------ ------------ ------------
20000 1 | 19197.50 | 0x 0.412x 0.435x
20000 50 | 1016.14 | 0.007x 0.966x 1.036x
20000 1000 | 9.94 | 0.708x 1.401x 1.169x
400000 1 | 19637.36 | 0x 0.436x 0.483x
400000 50 | 4669.32 | 0x 0.574x 0.566x
400000 1000 | 23.26 | 0.029x 1.147x 0.715x
Concurrency: 8 client(s)
----------------------------------------------------------------------------------
Scale Batch | Baseline TPS | Full (Rel) Partial (Rel) Table (Rel)
---------- ------ | ------------ | ------------ ------------ ------------
20000 1 | 30358.32 | 0x 0.440x 0.457x
20000 50 | 262.75 | 0.026x 2.943x 2.740x
20000 1000 | 11.28 | 0.575x 0.840x 0.578x
400000 1 | 36007.15 | 0x 0.430x 0.464x
400000 50 | 6664.58 | 0x 0.563x 0.494x
400000 1000 | 11.61 | 0.058x 1.000x 1.277x
In these tests, the partial refresh behaves as O(delta) rather than O(total), performing comparably to the manual PL/pgSQL approach but with significantly lower code complexity for the user.
I recognize that adding a WHERE clause to REFRESH is an extension to the SQL standard. I believe the syntax is intuitive, but I am open to discussion regarding alternative implementation strategies or syntax if the community feels a different approach is warranted.
New regression tests are included in the patch.
This is my first time submitting a patch to PostgreSQL, so please bear with me if I've missed anything or made any procedural mistakes. I'm happy to address any feedback.
Thanks,
Adam Brusselback
The syntax allows for targeted refreshes:
REFRESH MATERIALIZED VIEW mv WHERE invoice_id = ANY('{1,2,3}');
REFRESH MATERIALIZED VIEW CONCURRENTLY mv WHERE customer_id = 42;
REFRESH MATERIALIZED VIEW mv WHERE order_date >= '2023-01-01';
I was inspired to implement this feature after watching the Hacking Postgres discussion on the topic: https://www.youtube.com/watch?v=6cZvHjDrmlQ
This allows the user to restrict the refresh operation to a subset of the view. The qualification is applied to the view's output columns. The optimizer can then push this condition down to the underlying base tables, avoiding a full scan when only a known subset of data has changed.
Implementation notes:
1. The grammar accepts an optional WHERE clause. We forbid volatile functions in the clause to ensure correctness.
2. Non-Concurrent Partial Refresh: When `CONCURRENTLY` is not specified, the operation performs an in-place modification using a `ROW EXCLUSIVE` lock.
* This mode requires a unique index to ensure constraint violations are handled correctly (e.g., when a row's values change such that it "drifts" into or out of the `WHERE` clause scope).
* It executes a Prune + Upsert strategy:
* `DELETE` all rows in the materialized view that match the `WHERE` clause.
* `INSERT` the new data from the source query.
* It uses `ON CONFLICT DO UPDATE` during the insert phase to handle concurrency edge cases, ensuring the refresh is robust against constraint violations.
3. Concurrent Partial Refresh: When `CONCURRENTLY` is specified, it uses the existing diff/merge infrastructure (`refresh_by_match_merge`), limiting the scope of the diff (and the temporary table population) to the rows matching the predicate. This requires an `EXCLUSIVE` lock and a unique index, consistent with existing concurrent refresh behavior. It is much slower than `Non-Concurrent Partial Refresh`
4. The execution logic uses SPI to inject the predicate into the source queries during execution.
I have attached a benchmark suite to validate performance and correctness:
* `setup.sql`: Creates a schema `mv_benchmark` modeling an invoicing system (`invoices` and `invoice_lines`). It includes an aggregated materialized view (`invoice_summary`) and a control table (`invoice_summary_table`).
* `workload_*.sql`: pgbench scripts simulating a high-churn environment (45% inserts, 10% updates, 45% deletes) to maintain roughly stable dataset sizes while generating significant refresh work.
* `run_benchmark_comprehensive.sh`: Orchestrates the benchmark across multiple scale factors and concurrency levels.
The benchmark compares strategies for keeping a summary up to date (vs baseline):
* Partial Refresh: Triggers on the base table collect modified IDs and execute `REFRESH MATERIALIZED VIEW ... WHERE ...`.
* Materialized Table (Control): A standard table maintained via complex PL/pgSQL triggers (the traditional manual workaround).
* Full Refresh (Legacy): Manually refresh the view after changes.
Results are below:
Concurrency: 1 client(s)
----------------------------------------------------------------------------------
Scale Batch | Baseline TPS | Full (Rel) Partial (Rel) Table (Rel)
---------- ------ | ------------ | ------------ ------------ ------------
20000 1 | 5309.05 | 0.002x 0.437x 0.470x
20000 50 | 1209.32 | 0.010x 0.600x 0.598x
20000 1000 | 56.05 | 0.164x 0.594x 0.576x
400000 1 | 5136.91 | 0 x 0.450x 0.487x
400000 50 | 1709.17 | 0 x 0.497x 0.482x
400000 1000 | 110.35 | 0.006x 0.507x 0.460x
Concurrency: 4 client(s)
----------------------------------------------------------------------------------
Scale Batch | Baseline TPS | Full (Rel) Partial (Rel) Table (Rel)
---------- ------ | ------------ | ------------ ------------ ------------
20000 1 | 19197.50 | 0x 0.412x 0.435x
20000 50 | 1016.14 | 0.007x 0.966x 1.036x
20000 1000 | 9.94 | 0.708x 1.401x 1.169x
400000 1 | 19637.36 | 0x 0.436x 0.483x
400000 50 | 4669.32 | 0x 0.574x 0.566x
400000 1000 | 23.26 | 0.029x 1.147x 0.715x
Concurrency: 8 client(s)
----------------------------------------------------------------------------------
Scale Batch | Baseline TPS | Full (Rel) Partial (Rel) Table (Rel)
---------- ------ | ------------ | ------------ ------------ ------------
20000 1 | 30358.32 | 0x 0.440x 0.457x
20000 50 | 262.75 | 0.026x 2.943x 2.740x
20000 1000 | 11.28 | 0.575x 0.840x 0.578x
400000 1 | 36007.15 | 0x 0.430x 0.464x
400000 50 | 6664.58 | 0x 0.563x 0.494x
400000 1000 | 11.61 | 0.058x 1.000x 1.277x
In these tests, the partial refresh behaves as O(delta) rather than O(total), performing comparably to the manual PL/pgSQL approach but with significantly lower code complexity for the user.
I recognize that adding a WHERE clause to REFRESH is an extension to the SQL standard. I believe the syntax is intuitive, but I am open to discussion regarding alternative implementation strategies or syntax if the community feels a different approach is warranted.
New regression tests are included in the patch.
This is my first time submitting a patch to PostgreSQL, so please bear with me if I've missed anything or made any procedural mistakes. I'm happy to address any feedback.
Thanks,
Adam Brusselback
Attachment
pgsql-hackers by date: