Thread: Query Rewrite for Materialized Views (FDW Extension)
Hi,

I wanted to share the project I've been working on, which dynamically rewrites queries to target materialized views when views are available that can satisfy part of a query with lower cost plans.

I embarked upon it as an interesting side project. It took me a bit more time than I anticipated, but the result works for my use case, so I thought it worth sharing. However, I would caution that my use case is not exactly of a commercial scale... so please heed the following obligatory warning:

**NOTE: this is not "production ready" code — if it works for you, then great, but use it after thorough testing, and with appropriate caution.**

There are some limitations to the rewrite opportunities it takes up, and it will almost certainly fail on complex materialized views composed of deeply nested queries.

The extension does not have extensive (actually: any) documentation, but the few test cases should make obvious to the inclined reader how it works. This is deliberate at this early stage: I don't want to encourage uninformed adoption, because of the possibility of data loss or incorrect query rewrites.

The extension is written against a Postgres 10.1 source tree.

Source code: https://github.com/d-e-n-t-y/pg_fdw_mv_rewrite

Questions or comments are very welcome.

denty.
Hi folks,
I thought I’d share an update to my pet project, which dynamically rewrites queries to target materialized views when they are available and can satisfy a query (or part of it) with a lower cost plan.
The extension is now a regular EXTENSION and no longer tied in to the FDW mechanism. As a result, it may now be more generally usable, and less complicated to integrate into an existing system. (The FDW approach was an easy way for me to get started, but it ultimately added complexity and was rather limiting.)
Same caution as before applies:
**NOTE: this is not "production ready" code — if it works for you, then great, but use it after thorough testing, and with appropriate caution.**
It builds, and has had rudimentary testing, against Postgres 10.1 through 10.3.
Source code: https://github.com/d-e-n-t-y/pg_fdw_mv_rewrite
Hope it is useful or interesting for someone! Questions or comments are very welcome.
denty.
Begin original message:

From: Dent John <denty@QQdd.eu>
Subject: Query Rewrite for Materialized Views (FDW Extension)
Date: 5 April 2018 at 14:41:15 BST

Hi,
I wanted to share the project I've been working on which dynamically rewrites queries to target materialized views when views are available that can satisfy part of a query with lower cost plans.
I embarked upon it as an interesting side project. It took me a bit more time than I anticipated, but the result works for my use case, so I thought it worth sharing. However, I would caution that my use case is not exactly of a commercial scale... so please heed the following obligatory warning:
**NOTE: this is not "production ready" code — if it works for you, then great, but use it after thorough testing, and with appropriate caution.**
There are some limitations to the rewrite opportunities it takes up, and it will almost certainly fail on complex materialized views composed of deeply nested queries.
The extension does not have extensive (actually: any) documentation, but the few test cases should make obvious to the inclined reader how it works. This is deliberate at this early stage: I don't want to encourage uninformed adoption, because of the possibility of data loss or incorrect query rewrites.
The extension is written against a Postgres 10.1 source tree.
Source code: https://github.com/d-e-n-t-y/pg_fdw_mv_rewrite
Questions or comments are very welcome.
denty.
2018-06-16 16:21 GMT+02:00 John Dent <denty@qqdd.eu>:
good idea.
Regards
Pavel
In a recent PgConf NYC presentation [1] I was talking about the technical hurdles to implementing materialized views that could be kept up to date at all times, and the benefits of having such a thing.
Some use cases can be addressed with eventually-consistent derivative table structures (Vertica's projections, PipelineDB's continuous views, etc), but those methods rely on the source data never having deletes or updates, or confining those updates to the "hot" part of the source tables, so it generally works for time-series data, but not for other cases.
It has occurred to me that Dave Fetter's work on ASSERTIONS [2] has common underpinnings with true continuous materialized views. In both cases, the creation of a system object causes the creations of insert/update/delete triggers on one or more existing tables. In the case of assertions, those triggers are run with the goal of raising an error if rows are returned from a query. In the case of a materialized view, those same triggers would be used to delete rows from a CMV and insert replacements rows.
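To make the parallel concrete, here is a minimal sketch (all table, view, and function names are hypothetical) of the kind of delete-then-insert maintenance trigger described above, for a simple single-table aggregate. A real implementation would generate these from the view definition; this hand-written version also glosses over an UPDATE that changes the grouping key, which would need both OLD and NEW rows handled.

```sql
-- Source table and the plain table backing the "continuous" MV.
CREATE TABLE orders (customer_id int, amount numeric);
CREATE TABLE order_totals_cmv (customer_id int PRIMARY KEY, total numeric);

CREATE FUNCTION order_totals_cmv_maintain() RETURNS trigger AS $$
DECLARE
  cid int;
BEGIN
  -- NEW exists for INSERT/UPDATE, OLD for UPDATE/DELETE.
  IF TG_OP = 'INSERT' THEN
    cid := NEW.customer_id;
  ELSE
    cid := OLD.customer_id;
  END IF;
  -- Delete the affected row and insert a freshly computed replacement.
  DELETE FROM order_totals_cmv WHERE customer_id = cid;
  INSERT INTO order_totals_cmv
    SELECT customer_id, sum(amount)
      FROM orders
     WHERE customer_id = cid
     GROUP BY customer_id;
  RETURN NULL;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_totals_cmv_sync
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE PROCEDURE order_totals_cmv_maintain();
```

An ASSERTION would hang a very similar trigger off the same tables, but raise an error instead of rewriting rows.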
If we can get always-up-to-date materialized views, then Denty's work on query rewrite would have greatly enhanced utility.
I commented to Corey (privately) that, while my rewrite extension has gotten me a server that responds quickly to aggregate queries, the constant need to refresh the supporting MVs means the system’s load average is constant and much higher than before. I’m happy with the tradeoff for now, but it’s a huge waste of energy, and I’m sure it must thrash my disk.

I’m very interested in what other people think of Corey’s idea.
Hi John,

Can you characterize the class of queries that the MVs eligible for query rewrite may include, and the class of statements that may be rewritten using those MVs, in terms of joins and join types, aggregation, constraints, types of aggregate operators, and allowed aggregate expressions?

For example:

- the MV may contain any number of joins, which may be either inner, left outer, semi-, or anti-joins
- aggregates are optional, but if present, all columns in aggregate expressions must refer to the same table
- MIN, MAX, SUM, COUNT, AVG aggregate operations are allowed
- to be eligible for query rewrite a user must ... <fill in the details>
- local predicates are / are not allowed in the MV
- the statement being rewritten may have one or more joins
- the joins need not be identical to the corresponding joins in the MV, but <fill in the details>
- there may be joins in the MV that are not in the statement, provided that ... <fill in the details>
- any restrictions or assumed dependence on FDWs, since the name "FDW extension" implies as much
- MV rewrite will / will not be cost-based
- the statement may / may not have local predicates that are not identical to the local predicates in the MV

If you're going to do MV rewrite using MVs that aren't transactionally consistent with the underlying tables, then you're going to need a different sort of isolation mode, or some sort of permissions model that enables users to opt in to permit the optimizer to give back "wrong results" from potentially stale MVs.

I didn't attend the Ottawa conference this year, so I'd be interested to hear a summary of what the community thinks about MV incremental maintenance as well.
There are several important subtypes of MV incremental maintenance:

1) incremental refresh on commit, for individual row DML (enables MVs to be transactionally consistent, but incurs overhead at COMMIT time)
2) bulk incremental refresh on commit or on demand, after a COPY operation (more for a DW use case, but much more efficient for bulk-insert scenarios, and permits some optimizations that are not possible for the first case)
3) full refresh by partition (not an automatic solution, and there are some subtleties here about rows that move from one partition to another, but it's an easy first step toward incremental MV maintenance without most of the complexity)

There is also the possibility of doing MV refresh from other MVs that have already been refreshed, so MV rewrite and MV refresh sometimes interact.

thank you,

/Jim F
On Mon, Jun 18, 2018 at 07:38:13PM +0100, Dent John wrote:
> I commented to Corey (privately) that, while my rewrite extension has
> gotten me a server that responds quickly to aggregate queries, the
> constant need to refresh the supporting MVs means the system’s load
> average is constant and much higher than before. I’m happy with the
> tradeoff for now, but it’s a huge waste of energy, and I’m sure it
> must thrash my disk.
>
> I’m very interested in what other people think of Corey’s idea.

I've written an alternative materialization extension (entirely as PlPgSQL) based on PG's internals, but my version has a few big wins that might help here. I'm thinking of properly integrating it with PG.

Some of the features include:

- you can write triggers that update the materialization

  This is because the materialization is just a regular table in my implementation.

- you can mark a view as needing a refresh (e.g., in a trigger)

- you can declare a PK, other constraints, and indexes on a materialization

  The DMLs used to refresh a view concurrently can take advantage of the PK and/or other indexes to go fast.

- you get a history table which records updates to the materialization

  This is useful for generating incremental updates to external systems.

Keeping track of refresh times should help decide whether to use or not use a materialization in some query, or whether to refresh it first, or not use it at all.

One of the things I'd eventually like to do is analyze the view query AST to automatically generate triggers to update materializations or mark them as needing refreshes.
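For contrast, stock Postgres already has part of the "indexes on a materialization" feature: a unique index can be declared on a materialized view, and it is exactly what `REFRESH MATERIALIZED VIEW CONCURRENTLY` requires in order to apply the refresh as DMLs rather than a full heap rewrite. A small illustration (table and view names hypothetical):

```sql
-- A unique index on the MV enables CONCURRENTLY, which diffs old and new
-- contents and applies INSERT/UPDATE/DELETE against the existing heap.
CREATE MATERIALIZED VIEW customer_totals AS
  SELECT customer_id, sum(amount) AS total
    FROM orders
   GROUP BY customer_id;

CREATE UNIQUE INDEX customer_totals_pk ON customer_totals (customer_id);

REFRESH MATERIALIZED VIEW CONCURRENTLY customer_totals;
```

What stock Postgres does not allow is user triggers or arbitrary constraints on the MV itself, which is where the regular-table approach described above differs.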
A first, very very rough sketch of such an analysis looks like this:

- if the view query has CTEs
  -> create triggers on all its table sources to mark the materialization as needing a refresh
- else if a table appears more than once as a table source in the view query
  -> create triggers on that table that mark the materialization as needing a refresh
- else if a table appears anywhere other than the top level
  -> create triggers .. mark as needing refresh
- else if a table is the right side of a left join
  -> create triggers .. mark as needing refresh
- else if a table has no PK
  -> create triggers .. mark as needing refresh
- else if the query has no GROUP BY, or only does a GROUP BY on this table and a list of columns prefixed by the table's PK
  -> rewrite the query to have WHERE eq conditions on values for the table's PK columns; analyze this query; if the result shows this table source as the first table in the plan
     -> create triggers on this table to update the materialization directly from querying the source view
- else
  -> create triggers .. mark as needing refresh

Nico
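The "mark as needing a refresh" fallback that most of the branches above end in could be sketched like this (a minimal illustration; the `mv_state` table, function, and trigger names are hypothetical, not part of the extension described):

```sql
-- One row of bookkeeping per materialization.
CREATE TABLE mv_state (mv_name text PRIMARY KEY, needs_refresh boolean NOT NULL DEFAULT false);

CREATE FUNCTION mark_mv_stale() RETURNS trigger AS $$
BEGIN
  -- TG_ARGV[0] names the materialization that depends on this table source.
  UPDATE mv_state SET needs_refresh = true WHERE mv_name = TG_ARGV[0];
  RETURN NULL;
END
$$ LANGUAGE plpgsql;

-- A statement-level trigger is enough: we only record that the MV is stale,
-- not which rows changed.
CREATE TRIGGER orders_marks_customer_totals_stale
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH STATEMENT EXECUTE PROCEDURE mark_mv_stale('customer_totals');
```

A background job (or a query-rewrite extension deciding whether the MV is trustworthy) can then consult `mv_state` and refresh on demand.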
Hi Nico,

I’m pretty impressed anything in this space can be written entirely in PlPgSQL!

If you did integrate your implementation, it would be easy for my extension to read from a table other than the one which it gets the MV definition from... Although having said that, if you went down the route you suggest, would you not make that “regular table” into a first-class schema object, very much like Corey’s CONTINUOUS MATERIALIZED VIEW object concept?

It is interesting that you can put triggers onto the table, though, as that leads well into use cases where it is desirable to “stack” MVs upon each other. (I’m not immediately sure whether such a use case is still needed in the face of an always-up-to-date MV feature such as is described, but I’ve seen it elsewhere.)

You say you’d like to base it off a VIEW’s AST (rather than, I presume, parsing the reconstructed VIEW source text as SQL?), and I do agree — that’s probably the right direction... It does seem to me there is scope to leverage the “bottom half” of the ASSERTION stuff from Dave Fetter that Corey linked to — i.e., the part of it that manages triggers. That still leaves the AST crawling to decide what to actually do once a change is caught.

Really good to hear about progress in this area.

d.
On Tue, Jun 19, 2018 at 08:46:06AM +0100, Dent John wrote:
> I’m pretty impressed anything in this space can be written entirely in
> PlPgSQL!

https://github.com/twosigma/postgresql-contrib

PG is quite powerful! I have even implemented a COMMIT TRIGGER in pure PlPgSQL. You'll notice I make extensive use of record/table types.

> If you did integrate your implementation, it would be easy for my
> extension to read from a table other than the one which it gets the MV
> definition from... Although having said that, if you went down the
> route you suggest, would you not make that “regular table” into a
> first-class schema object, very much like Corey’s CONTINUOUS
> MATERIALIZED VIEW object concept?

I know nothing about the CONTINUOUS MATERIALIZED VIEW concept. What that would imply to me seems... difficult to achieve. There will be view queries that are difficult or impossible to automatically write triggers for that update an MV synchronously.

> It is interesting that you can put triggers onto the table, though, as
> that leads well into use cases where it is desirable to “stack” MVs
> upon each other. (I’m not immediately sure whether such a use case is
> still needed in the face of an always-up-to-date MV feature such as is
> described, but I’ve seen it elsewhere.)

I have done exactly this sort of MV chaining. In my use case I had an MV of a nesting-group transitive closure, and then another of a join between that and user group memberships, to get a complete user group transitive closure. The transitive closure of nesting groups being computed via a RECURSIVE CTE...
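The chained-MV shape described above might look roughly like this (a hypothetical sketch; the `group_nesting` and `user_group_membership` tables and all names are illustrative, not the actual implementation):

```sql
-- First MV: transitive closure of nested groups, via a recursive CTE.
CREATE MATERIALIZED VIEW group_closure AS
  WITH RECURSIVE closure(parent, child) AS (
    SELECT parent, child FROM group_nesting
    UNION
    SELECT c.parent, g.child
      FROM closure c
      JOIN group_nesting g ON g.parent = c.child
  )
  SELECT parent, child FROM closure;

-- Second MV, stacked on the first: join the closure to direct memberships
-- to get every user's effective (direct or inherited) group memberships.
CREATE MATERIALIZED VIEW user_group_closure AS
  SELECT gc.parent AS grp, m.user_name
    FROM group_closure gc
    JOIN user_group_membership m ON m.grp = gc.child;
```

The recursive first step is exactly the kind of query that is hard to maintain incrementally, which is what motivates the mark-stale-and-refresh fallback.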
In principle one can understand such a query and automatically write DMLs to update the MV on the fly (I've done this _manually_), but the moment you do any bulk updates out of sequence you can't, and then you have to refresh the view. So, you see, I don't quite believe we can have a true continuously materialized view :(

For me the key requirement is the ability to generate incremental updates to an external system, but also the whole thing has to be fast.

> You say you’d like to base it off a VIEW’s AST (rather than, I
> presume, parsing the reconstructed VIEW source text as SQL?),

PG already stores the AST. There's no need to write a new parser when PG already has one. At the end of the day you need to analyze an AST for the MV's source query in order to automatically write the triggers to keep it updated (or mark it as needing a refresh).

> and I do agree — that’s probably the right direction... It does seem
> to me there is scope to leverage the “bottom half” of the ASSERTION
> stuff from Dave Fetter that Corey linked to — i.e., the part of it
> that manages triggers.

I'll search for this.

> Really good to hear about progress in this area.

Eh, I've not actually implemented any automatic generation of triggers to update MVs. I've written enough such triggers manually to believe that *some* of them could be written by software. If you look at my sketch for how to do it, you'll notice that many of the sorts of queries that one would choose to materialize... are not really amenable to this treatment -- that's precisely because those make for the sorts of slow queries that make you reach for materialization in the first place :(

But even so, automatically-generated triggers that mark an MV as needing a refresh are always possible, and that is a huge improvement anyway, especially if concurrent view refreshes can be made to go faster (by having PKs on the MVs).
The idea is to have some sort of adaptive, automatic, background, concurrent MV refresh, running on a frequency based in part on the amount of time it takes to refresh the view.

BTW, MERGE would be a significant optimization for concurrent MV refresh. Think of MERGE as a statement that can scan a source, FULL OUTER JOIN it to a target table, and for each row do an INSERT, UPDATE, or DELETE — this is 3x faster than the three INSERT/UPDATE/DELETE statements you need to do the same work without a MERGE!

Nico
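Postgres had no MERGE at the time of this thread, so purely as an illustration of the shape being described (names hypothetical; note that the `NOT MATCHED BY SOURCE` clause, needed for the DELETE arm, is a still later addition than MERGE itself):

```sql
-- One statement converges the materialization on its source aggregate:
-- per joined row it updates a changed total, inserts a missing one, or
-- deletes a row no longer produced by the source query.
MERGE INTO customer_totals t
USING (SELECT customer_id, sum(amount) AS total
         FROM orders
        GROUP BY customer_id) s
   ON t.customer_id = s.customer_id
 WHEN MATCHED AND t.total IS DISTINCT FROM s.total THEN
   UPDATE SET total = s.total
 WHEN NOT MATCHED THEN
   INSERT (customer_id, total) VALUES (s.customer_id, s.total)
 WHEN NOT MATCHED BY SOURCE THEN
   DELETE;
```

This is the single-scan FULL OUTER JOIN behaviour described above, versus three separate UPDATE/INSERT/DELETE passes over the same join.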
Hi Nico,
By the way, I do agree with your point about MERGE — if we can factor MV updates in that fashion, it will certainly save.
I didn’t reply immediately because your point caught me off guard:
> […] If you look at my
> sketch for how to do it, you'll notice that many of the sorts of queries
> that one would choose to materialize... are not really amenable to this
> treatment […]
I’d rather presumed that there would be many examples of DML on base relations that could trigger a direct (incremental) update to the MV. I had presumed it, but not actually done any research.
So I did a bit of a trawl. There’s actually quite a lot of academic research out there, including [1] and [2]. [2] references a bunch of methods for incremental MV refresh, and ties them into a graph query context. I’m not sure if the graph query context itself is relevant for Postgres, but it’s certainly interesting and perhaps suggests that incremental refresh of at least some RECURSIVE MVs may not be entirely impossible. I also found [3], which is /very/ dated, but it strongly supports that MVs are a performant path to executing certain types of query.
So I definitely agree with you in the general case, but it seems there is scope to provide an incremental MV refresh capability that is broadly useful.
Almost certainly, any initial implementation would quickly fall back to a “full refresh”. But as the refresh planner’s capability develops, I wonder if it could not embody an intelligent strategy that might even recognise common recursive patterns, such as the transitive closure you mention, and refresh on an incremental basis — that would be really quite a cool capability to have.
denty.