Re: Implementing Incremental View Maintenance - Mailing list pgsql-hackers

From Mitar
Subject Re: Implementing Incremental View Maintenance
Msg-id CAKLmikMjJqVJbKijOxivhb9=A5PfoZcit2QowbJ4rFszzr4Zhw@mail.gmail.com
In response to Implementing Incremental View Maintenance  (Yugo Nagata <nagata@sraoss.co.jp>)
Responses Re: Implementing Incremental View Maintenance  (Yugo Nagata <nagata@sraoss.co.jp>)
List pgsql-hackers
Hi!

On Thu, Dec 27, 2018 at 4:57 AM Yugo Nagata <nagata@sraoss.co.jp> wrote:
> I would like to implement Incremental View Maintenance (IVM) on PostgreSQL.
> IVM is a technique to maintain materialized views which computes and applies
> only the incremental changes to the materialized views rather than
> recomputing the contents as the current REFRESH command does.

That sounds great! I am interested in this topic because I am
interested in reactive/live queries and support for them in
PostgreSQL. [1]

In that context, the problem is very similar: based on some state of
the query results and updated source tables, determine what new
updates should be sent to the client to describe the changes to the
query results. So after computing those incremental changes, instead
of applying them to a materialized view I would send them to the
client. One could see materialized views as just one type of consumer
of such incremental-change information.

So I would like to ask whether whatever is done in this setting could
be done in a way that is also usable outside the context of
materialized views. I am not sure what the API would be, though.

From the perspective of reactive/live queries, this package [2] is
interesting. To my understanding, it adds two columns to all base
tables, one for a unique ID and one for the revision of the row, and
then rewrites queries so that this information is passed all the way
through to the query results. In this way it can determine the mapping
between inputs and outputs. I am not sure if it then does an
incremental update or just uses that to determine whether the view is
invalidated. Nor am I sure whether there is anything about such an
approach in the literature, or why both the ID and revision columns
are needed.
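To make my reading of that mechanism concrete, here is a minimal Python sketch. Everything in it is my own guess at the scheme, not code or names from the pg-live-query package: each base row carries a unique ID and a revision counter, a computed result records the `(table, id) -> revision` pairs it was derived from, and the result is known to be stale once any of those revisions has moved on.

```python
# Hypothetical sketch of the id/revision approach described above; the
# function and structures here are mine, not from the package itself.

def is_stale(recorded_sources, current_revisions):
    """recorded_sources / current_revisions: {(table, id): rev} mappings.

    recorded_sources is the snapshot taken when the query result was
    computed; current_revisions reflects the base tables now.
    """
    for key, rev in recorded_sources.items():
        if current_revisions.get(key) != rev:
            return True  # source row was updated or deleted since the snapshot
    return False

# A result derived from one users row and one orders row:
sources = {("users", 1): 3, ("orders", 7): 1}
unchanged = {("users", 1): 3, ("orders", 7): 1}
bumped = {("users", 1): 4, ("orders", 7): 1}  # users row revised
```

This only answers "is the view invalidated?"; computing the actual incremental update would need more than the revision numbers, which may be why the package keeps both columns.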

> For these reasons, we started to think to implement IVM without relying on OIDs
> and made a bit more surveys.

I also do not see much difference between asking users to have a
primary key on base tables and asking them to have OIDs. Why do you
think that a requirement for primary keys is a hard one? I think we
should first focus on having IVM for base tables with primary keys.
Maybe later on we could improve on that and make it work without them
as well.

To me personally, requiring a unique index on the source tables and
also on the materialized view is a reasonable restriction for this
feature, especially for its initial versions.

> However, the discussion about IVM is now stopped, so we would like to restart and
> progress this.

What would be the next steps, in your view, to move this further?

> If we can represent a change of UPDATE on a base table as query-like rather than
> OLD and NEW, it may be possible to update the materialized view directly instead
> of performing delete & insert.

Why do you need OLD and NEW? Don't you need just NEW and a list of
columns which changed from those in NEW? I use such a diffing query
[4] to represent changes: the first column has a flag telling whether
the row represents an insert, an update, or a remove; the second
column tells which columns are being changed in the case of an update;
and then the NEW columns follow.
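To illustrate that change format in Python (a sketch with a row representation I am making up here, not the actual SQL from [4]): diffing two keyed row sets yields one `(op, changed_columns, new_row)` tuple per change, mirroring the flag / changed-columns / NEW-columns layout.

```python
# Sketch: compute a change set between two versions of a keyed row set.

def diff(old_rows, new_rows):
    """old_rows / new_rows: {key: {column: value}}. Returns change tuples."""
    changes = []
    for key, new in new_rows.items():
        old = old_rows.get(key)
        if old is None:
            changes.append(("insert", None, new))
        elif old != new:
            # list the columns whose values differ from the old row
            changed = sorted(c for c in new if new[c] != old.get(c))
            changes.append(("update", changed, new))
    for key, old in old_rows.items():
        if key not in new_rows:
            changes.append(("delete", None, old))
    return changes

old = {1: {"id": 1, "name": "a"}, 2: {"id": 2, "name": "b"}}
new = {1: {"id": 1, "name": "a2"}, 3: {"id": 3, "name": "c"}}
changes = diff(old, new)
```

Note that a "delete" change carries the old row here, since there is no NEW row to report; only its key is really needed.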

I think that standardizing a structure for representing those changes
would be a good step towards making this modular and reusable, because
then we can have three parts:

* Recording and storing changes in a standard format.
* A function which, given the original data and the stored changes,
computes the updates needed, also in some standard format.
* A function which, given the original data and the updates needed, applies them.
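As a rough sketch of the third part (every name here is hypothetical, not an existing API), the materialized data can be modeled as a `{key: row}` mapping and the updates as `(op, changed_columns, row)` tuples in the format described above:

```python
# Sketch: apply a standard-format change set to materialized rows in place.

def apply_changes(rows, changes, key_column="id"):
    """rows: {key: {column: value}}; changes: (op, changed_columns, row) tuples."""
    for op, _changed, row in changes:
        key = row[key_column]
        if op in ("insert", "update"):
            rows[key] = dict(row)  # replace the whole row with the NEW columns
        elif op == "delete":
            rows.pop(key, None)
    return rows

view = {1: {"id": 1, "total": 10}}
apply_changes(view, [
    ("update", ["total"], {"id": 1, "total": 15}),
    ("insert", None, {"id": 2, "total": 3}),
])
```

In a real implementation this last part would be DELETE/INSERT/UPDATE statements against the materialized view, but the interface, taking a standard change set, could stay the same.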

> In the previous discussion[4], it is planned to start from "eager" approach. In our PoC
> implementation, we used the other approach, that is, using REFRESH command to perform IVM.
> I am not sure which is better as a start point, but I begin to think that the eager
> approach may be more simple since we don't have to maintain base table changes in other
> past transactions.

I think if we split things into three parts as I described above, then
this is just a question of configuration. Either you call all three
inside one trigger to update in an "eager" fashion, or you store the
computed updates somewhere and later apply them on demand in a "lazy"
fashion.
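Wiring-wise (a sketch only; all names are hypothetical and the computation step is a placeholder, not real IVM logic), the same three pieces could serve both modes:

```python
# Sketch: the same record/compute/apply pieces wired eagerly or lazily.

view = {}
pending = []  # lazily recorded base-table changes

def compute_view_updates(changes):
    # placeholder for the real incremental computation; identity here
    return changes

def apply_to_view(updates):
    for key, value in updates:
        view[key] = value

def on_base_table_change(change, mode):
    if mode == "eager":
        # all three steps run immediately, e.g. inside a trigger
        apply_to_view(compute_view_updates([change]))
    else:
        pending.append(change)  # lazy: only record the change

def refresh_view():
    # lazy path: compute and apply everything queued, on demand
    apply_to_view(compute_view_updates(pending))
    pending.clear()
```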

> In the eager maintenance approach, we have to consider a race condition where two
> different transactions change base tables simultaneously as discussed in [4].

But in the case of "lazy" maintenance there is a mirror problem: later
changes to the base tables can invalidate a previously computed change
to the materialized view. Imagine that one cell in a base table is
first updated to "foo" and we compute an update for the materialized
view to set it to "foo". Then the same cell is updated to "bar" and we
compute an update for the materialized view again. If we have not yet
applied either of those updates (because we are "lazy"), the
previously computed update can be discarded. We could still apply
both, but that would not be efficient.
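One simple way to avoid that inefficiency (a sketch of the idea, not anything from the proposed patch) is to coalesce the pending updates per key before applying them, so only the latest update for each key survives:

```python
# Sketch: the ("foo" then "bar") sequence above collapses to a single
# "bar" update once pending updates are coalesced per key.

def coalesce(pending_updates):
    """pending_updates: (key, value) tuples in arrival order."""
    latest = {}
    for key, value in pending_updates:  # later entries overwrite earlier ones
        latest[key] = value
    return list(latest.items())

pending_updates = [("cell", "foo"), ("cell", "bar"), ("other", 1)]
```

This only works for updates that fully supersede each other; changes that must be combined (for example, increments to an aggregate) would need a merge step instead of a plain overwrite.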

[1] https://www.postgresql.org/message-id/flat/CAKLmikP%2BPPB49z8rEEvRjFOD0D2DV72KdqYN7s9fjh9sM_32ZA%40mail.gmail.com
[2] https://github.com/nothingisdead/pg-live-query
[3] https://www.postgresql.org/docs/devel/sql-createtable.html
[4] https://github.com/tozd/node-reactive-postgres/blob/eeda4f28d096b6e552d04c5ea138c258cb5b9389/index.js#L329-L340


Mitar

-- 
http://mitar.tnode.com/
https://twitter.com/mitar_m

