Re: matview incremental maintenance - Mailing list pgsql-hackers
From | Nico Williams |
---|---|
Subject | Re: matview incremental maintenance |
Date | |
Msg-id | 20161128231816.GA24797@localhost Whole thread Raw |
In response to | matview incremental maintenance (Kevin Grittner <kgrittn@ymail.com>) |
List | pgsql-hackers |
On Mon, Jun 17, 2013 at 07:41:15AM -0700, Kevin Grittner wrote: > Since there seems to be interest in discussing incremental > maintenance of materialized views *now*, I'm starting this thread > to try to avoid polluting unrelated threads with the discussion. I > don't intend to spend a lot of time on it until the CF in progress > completes, but at that point the work will start in earnest. So > I'll say where I'm at, and welcome anyone who has time to spare > outside of the CF to comment or contribute ideas. I have an implementation that supports updates, but it doesn't implement automatic updates as described in the paper you cited. > The paper at the core of the discussion can be found by searching > for "maintaining views incrementally gupta mumick subrahmanian" -- > it's on both the ACM and CiteSeerX websites. Of course, one > doesn't need to understand that paper to discuss techniques for > capturing the base deltas, but I'm hoping that's not what takes up > most of the discussion. I expect the most important discussions to > be around how best to handle the "count(t)" (or count_t) column, > what form should be use for intermediate results, how to modify or > add execution nodes which know how to deal with the count, how to > generate set operations to use those nodes, and how to modify the > planner to choose the best plan for these operations. Whether to > pull the deltas off the WAL stream or stuff them into a tuplestore > as they are written seems to me to be a relatively minor point. If > properly abstracted, the performance and complexity of alternatives > can be compared. Sure. Automatically converting INSERTs/UPDATEs/DELETEs of MATERIALIZED VIEW table sources is not trivial, though I've had some luck with a particular multi-join query by just manually adding constraints from the OLD.* and NEW.* rows, though unfortunately the PG query planner was unable to find the obvious fast query plan (more on that in another thread) and I ended up having to manually optimize the query using WITH... So at least for some queries it's easy enough to automatically propagate constraints into the VIEW's query. For some recursive queries it may also be easy to propagate OLD.*/NEW.* into a seed. Anyways, this is a bit far afield though, as what I have managed so far is very useful even without automatically producing updates based only on the VIEW's query. Nor am I certain that automatically updating a materialized view is always the right thing to do. A case in point for me is an authorization system (which generally means there's something of a transitive closure involved, which means recursive queries). In this system adding grants is cheap enough, but massive revocation (e.g., deleting all of a user's entitlements, perhaps by deleting the user and cascading the deletion) is not: it can be faster to just refresh the view old-style than to update it dynamically! (You noted this problem.) The queries I use for dynamically updating the materialized views are hand-optimized as mentioned above. They are not too unlike what an automatic system would have generated. Granted, the problem partly is that an ORM is involved, which adds obnoxious overhead: one statement per grant deletion versus one statement for deleting all of them. But it's still possible that at some point it's best to refresh, even if an ORM were not involved. > At the developer meeting last month, we talked about the special > new count column for a bit, and everyone seemed to agree that > adding such an animal, ... What's that? > Long term, timings for incremental maintenance that people would > like to see (from most eager to least eager) are: > > - as part of completing each statement, so that the affect on the > matview is immediately visible to the transaction which modifies a > supporting table, and becomes visible at commit to other > transactions That can be done now with hand-coded triggers, as I do in my use case. Though I do leave some cases to a deferred refresh as mentioned above. > - at transaction commit time, so that other transactions see the > changes to the base tables and the referencing matviews at the same > point in time See above. > - from a FIFO queue which is processed by a background process > whenever data is present (possibly with pacing) I do this. I use NOTIFY/LISTEN and timeouts to drive refreshes as needed. > - from a FIFO queue based on a schedule, so that matviews are > stable between applications and/or to avoid burdening the machine > during peak periods Ditto. > - incremental update, or even full refresh, on an attempt to query > a "stale" matview I record and store (in a per-view history table) differences computed during a refresh. > - explicit request to apply incremental updates or refresh I make this decision by using one trigger function or another. One set of trigger functions generates updates properly (by updating the materialization table), and another merely marks the view as needing a refresh and NOTIFYies a waiting daemon. > Incremental maintenance of a materialized view is a heuristic, to > refresh contents more quickly than might happen by re-running the > query which defines the matview. There will always be cases where > the changes are so extensive that applying the delta will be slower > than a refresh. At some point we should have a cost-based way to > recognize when we have crossed that threshold, and fall back to the > refresh technique. That's not for this release, though. Yes. One way to do this is to keep track of refresh times and interrupt dynamic updates that take longer, then running a refresh instead. However, I do feel that users will want some measure of control over this at the schema level -- after all, I do! The solution I use lets me have this control. > In previous discussion there seemed to be a consensus that before > incremental maintenance for a materialized view could be turned on, > the matview would need to be populated and all referenced tables > would need to be flagged as generating delta information, through a > new ALTER TABLE option. That seems strange to me. I'd rather express this through CREATE/ALTER MATERIALIZED VIEW. Enabling updates for changes of a particular table source should cause PG to create internal trigger functions and triggers on the given table source. Using triggers for this (even hidden, internal ones) does create a problem in that we might need a way to control order of execution, so that user-defined triggers run first so that they can make any additional changes that might also trigger updates of the view, but this can be addressed later. In any case, even now, three years later, isn't it the case that the code needed to perform query algebra to produce the necessary "triggers" is missing in action? Wouldn't it be better to do something in the interim? Here's my attempt: https://github.com/twosigma/postgresql-contrib Summary: - PlPgSQL-coded materialized views (no syntax, just functions) - only concurrent refresh available - deltas computed on refresh are stored in a history table, with "old" and "new" columns of the record type of the materializationtable - triggers on the materialization table automatically record updates of the materialization table in the history table - last refresh timestamp and "refresh needed" state kept for each materialized view - NOTIFY is used to broadcast the need to refresh (for cases where it would be faster to refres) to a waiting daemon (notincluded, but it is scripted around the pqasyncnotifier tool included in this repo) - hand-coded triggers (not included here, naturally) can update the materialized views, which as explained above, automaticallyupdates the history table This works very well. I use the history table to generate deltas for an external non-SQL system. So I get something close to real-time updates, and if NOTIFY/LISTEN were safer (e.g., indicated who NOTIFYied, and encoded proper row values), then I could use NOTIFY/LISTEN to distribute updates in something closer to real-time. I would really like to see something like this integrated into PG. I can do some of the work for that. I do believe this will not only not conflict, but instead maybe even enable, further work on automatic mat view updates. Sorry for the length of this reply, Nico --
pgsql-hackers by date: