Re: Alternative MATERIALIZED VIEW design and implementation with history table and other features - Mailing list pgsql-hackers
From | Nico Williams |
---|---|
Subject | Re: Alternative MATERIALIZED VIEW design and implementation with history table and other features |
Date | |
Msg-id | 20161128164447.GA11117@localhost Whole thread Raw |
In response to | Re: Alternative MATERIALIZED VIEW design and implementation with history table and other features (Jim Nasby <Jim.Nasby@BlueTreble.com>) |
List | pgsql-hackers |
On Sat, Nov 26, 2016 at 04:13:19PM -0600, Jim Nasby wrote: > On 11/22/16 9:11 PM, Nico Williams wrote: > >But we needed a method for recording deltas from REFRESHes, and that's > >not supported. So I coded up my own version of materialized views, in > >PlPgSQL, that does provide a history feature. > > Getting history tracking included in core is going to take a LOT of effort; > not from a code standpoint, but to get everyone to agree on the use cases, > interface, etc. In short: I wouldn't go there. Thanks for the reply. Well, I want to go there :) And, frankly, I think the interface details aren't that hard, though I could be wrong. The bare minimum[0] columns for the history table are: a transaction ID, a column of the view's record type[1] for "old"/deleted rows, and a column of the view's record type for "new"/inserted rows. If a row was deleted, then you get a row in the history table with that row's record value as the "old" column's value, with the "new" column's value being NULL. Similarly for insertions. For updates, if there's a primary key[2] we could have a single history row with non-NULL values for all three columns, else two history rows, one with the old row as the "old" column's value and NULL for the "new" column, and another history row with a NULL for the "old" column and the new row as the value of the "new" column. The trickiest thing here is the "transaction ID". In my use case I have to deal with an ORM's notion of transaction ID, so no PG notion of TX ID helps me :( > What *would* be useful is work on generating delta information from a set of > changes to a table: see below. My implementation does that too (via triggers). > >Besides a history feature, this includes the ability to record changes > >made to a materialized view's materialization table, which means I can > >have triggers that update the materialized view. > > Just looking at the commend block in the file, it's not clear what you mean > by this. Does this mean if you ALTER the "materialization table" (which I > assume is the materialized output?), does the view somehow get modified? It means that if you INSERT/UPDATE/DELETE on a VIEW materialized with my implementation, then those changes are recorded in the history table for the view. However, it is important that such changes be consistent with the VIEW, else the next refresh will undo them (naturally). > >Of particular interest may be the fact that the FULL OUTER JOIN that PG > >does for REFRESH CONCURRENTLY, and which I copied here, doesn't deal > >well with views that have NULLs in any columns used in the join. It > >would be nice to have an equijoin that uses IS NOT DISTINCT FROM rather > >than just =, and then refreshing could use such a join in order to deal > >properly with NULLs. > > Well, you could potentially just cast the composite types to text and join > on that, but... Only if there's a non-ambiguous text representation of row values. Also, this might make refreshes slow. > Kevin Grittner (author of the original matviews patch) has mentioned > detecting changes to underlying relations previously[1]. Getting even a > simple form of that working is going to be critical for any kind of > incremental matview updating. It sounds like you're doing something > different from incremental update, but the core problem is still the same: > how to efficiently identify changes to underlying matview data and apply > those changes to the view. Incremental update is *exactly* what I'm doing. It's also exactly what REFRESH CONCURRENTLY does today, except that REFRESH CONCURRENTLY doesn't save the history but my implementation does. > I suggest taking a look at what Kevin's written about that, as well as the Sure. > paper he mentioned. Some of this does assume that you have the equivalent to > NEW and OLD, but IIRC those are not actually mandatory (ie: you could use > the current matview contents as OLD and the dynamic view as NEW). Even a My implementation is rather trivial. You might want to look at it. Its guts are actually copied from what PG already does for REFRESH CONCURRENTLY. > pure SQL/plpgsql implementation of what's in the paper that Kevin mentioned > would probably be valuable to ongoing work, as well as being applicable to > what you've done. > > 1: https://www.postgresql.org/message-id/1371480075.55528.YahooMailNeo@web162901.mail.bf1.yahoo.com Thanks for the link. My implementation even supports auto-refresh of stale VIEWs and uses NOTIFY/LISTEN channels as the FIFO. There is a PGSQL-coded conditional refresh function (refresh_if_needed()) that a daemon calls periodically and whenever it gets a NOTIFY on a channel that it LISTENs on; the "sla", last refresh timestamp, and "refresh needed" indicators for each materialized view, are recorded in a table. That's what the pqasyncnotifier.c file is about: it makes it possible[3] to shell-code a daemon that refreshes stale materialized views automatically. The daemon for that is not included in our contribution, but it is trivial, though ideally it should be coded in C in the postgres server and run as a child process. [0] In my implementation I also added a timestamp, but this -and related information about transactions- should be normalizedaway. [1] That's right, I use record types so as to make sure that all history tables have the exact same number of columns. I must say it's annoying that I can't declare a TABLE as having the same record type as another TABLE, and that I can'tCAST from one record type to another with the exact same number of columns and the same types for them. [2] VIEWs don't get PRIMARY KEYs, naturally, but a MATERIALIZED VIEW is a VIEW plus a hidden TABLE to hold the materialization. A PK makes a lot of sense for a MATERIALIZED VIEW, which is why the syntax I gave for an ideal CREATEMATERIALIZED VIEW includes an optional PK declaration. [3] It's not possible to shell-code this with psql(1) because psql(1) doesn't write to stdout when a notification is receivedon a channel; it writes to stdout about notifications only when input on stdin is _also_ received. Nico --
pgsql-hackers by date: