Re: Alternative MATERIALIZED VIEW design and implementation with history table and other features - Mailing list pgsql-hackers

From Nico Williams
Subject Re: Alternative MATERIALIZED VIEW design and implementation with history table and other features
Date
Msg-id 20161128164447.GA11117@localhost
Whole thread Raw
In response to Re: Alternative MATERIALIZED VIEW design and implementation with history table and other features  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
List pgsql-hackers
On Sat, Nov 26, 2016 at 04:13:19PM -0600, Jim Nasby wrote:
> On 11/22/16 9:11 PM, Nico Williams wrote:
> >But we needed a method for recording deltas from REFRESHes, and that's
> >not supported.  So I coded up my own version of materialized views, in
> >PlPgSQL, that does provide a history feature.
> 
> Getting history tracking included in core is going to take a LOT of effort;
> not from a code standpoint, but to get everyone to agree on the use cases,
> interface, etc. In short: I wouldn't go there.

Thanks for the reply.

Well, I want to go there :)  And, frankly, I think the interface details
aren't that hard, though I could be wrong.

The bare minimum[0] columns for the history table are: a transaction ID,
a column of the view's record type[1] for "old"/deleted rows, and a
column of the view's record type for "new"/inserted rows.

If a row was deleted, then you get a row in the history table with that
row's record value as the "old" column's value, with the "new" column's
value being NULL.  Similarly for insertions.  For updates, if there's a
primary key[2] we could have a single history row with non-NULL values
for all three columns, else two history rows, one with the old row as
the "old" column's value and NULL for the "new" column, and another
history row with a NULL for the "old" column and the new row as the
value of the "new" column.

The trickiest thing here is the "transaction ID".  In my use case I have
to deal with an ORM's notion of transaction ID, so no PG notion of TX ID
helps me :(

> What *would* be useful is work on generating delta information from a set of
> changes to a table: see below.

My implementation does that too (via triggers).

> >Besides a history feature, this includes the ability to record changes
> >made to a materialized view's materialization table, which means I can
> >have triggers that update the materialized view.
> 
> Just looking at the commend block in the file, it's not clear what you mean
> by this. Does this mean if you ALTER the "materialization table" (which I
> assume is the materialized output?), does the view somehow get modified?

It means that if you INSERT/UPDATE/DELETE on a VIEW materialized with my
implementation, then those changes are recorded in the history table for
the view.  However, it is important that such changes be consistent with
the VIEW, else the next refresh will undo them (naturally).

> >Of particular interest may be the fact that the FULL OUTER JOIN that PG
> >does for REFRESH CONCURRENTLY, and which I copied here, doesn't deal
> >well with views that have NULLs in any columns used in the join.  It
> >would be nice to have an equijoin that uses IS NOT DISTINCT FROM rather
> >than just =, and then refreshing could use such a join in order to deal
> >properly with NULLs.
> 
> Well, you could potentially just cast the composite types to text and join
> on that, but...

Only if there's a non-ambiguous text representation of row values.
Also, this might make refreshes slow.

> Kevin Grittner (author of the original matviews patch) has mentioned
> detecting changes to underlying relations previously[1]. Getting even a
> simple form of that working is going to be critical for any kind of
> incremental matview updating. It sounds like you're doing something
> different from incremental update, but the core problem is still the same:
> how to efficiently identify changes to underlying matview data and apply
> those changes to the view.

Incremental update is *exactly* what I'm doing.  It's also exactly what
REFRESH CONCURRENTLY does today, except that REFRESH CONCURRENTLY
doesn't save the history but my implementation does.

> I suggest taking a look at what Kevin's written about that, as well as the

Sure.

> paper he mentioned. Some of this does assume that you have the equivalent to
> NEW and OLD, but IIRC those are not actually mandatory (ie: you could use
> the current matview contents as OLD and the dynamic view as NEW). Even a

My implementation is rather trivial.  You might want to look at it.  Its
guts are actually copied from what PG already does for REFRESH
CONCURRENTLY.

> pure SQL/plpgsql implementation of what's in the paper that Kevin mentioned
> would probably be valuable to ongoing work, as well as being applicable to
> what you've done.
> 
> 1: https://www.postgresql.org/message-id/1371480075.55528.YahooMailNeo@web162901.mail.bf1.yahoo.com

Thanks for the link.

My implementation even supports auto-refresh of stale VIEWs and uses
NOTIFY/LISTEN channels as the FIFO.  There is a PGSQL-coded conditional
refresh function (refresh_if_needed()) that a daemon calls periodically
and whenever it gets a NOTIFY on a channel that it LISTENs on; the
"sla", last refresh timestamp, and "refresh needed" indicators for each
materialized view, are recorded in a table.

That's what the pqasyncnotifier.c file is about: it makes it possible[3]
to shell-code a daemon that refreshes stale materialized views
automatically.  The daemon for that is not included in our contribution,
but it is trivial, though ideally it should be coded in C in the
postgres server and run as a child process.

[0] In my implementation I also added a timestamp, but this -and related   information about transactions- should be
normalizedaway.
 

[1] That's right, I use record types so as to make sure that all history   tables have the exact same number of
columns.
   I must say it's annoying that I can't declare a TABLE as having the   same record type as another TABLE, and that I
can'tCAST from one   record type to another with the exact same number of columns and the   same types for them.
 

[2] VIEWs don't get PRIMARY KEYs, naturally, but a MATERIALIZED VIEW is   a VIEW plus a hidden TABLE to hold the
materialization. A PK makes   a lot of sense for a MATERIALIZED VIEW, which is why the syntax I   gave for an ideal
CREATEMATERIALIZED VIEW includes an optional PK   declaration.
 

[3] It's not possible to shell-code this with psql(1) because psql(1)   doesn't write to stdout when a notification is
receivedon a   channel; it writes to stdout about notifications only when input on   stdin is _also_ received.
 

Nico
-- 



pgsql-hackers by date:

Previous
From: David Fetter
Date:
Subject: Re: Tackling JsonPath support
Next
From: Tom Lane
Date:
Subject: Re: A bug of psql completion