Re: Implementing Incremental View Maintenance - Mailing list pgsql-hackers

From Yugo Nagata
Subject Re: Implementing Incremental View Maintenance
Date
Msg-id 20191223100716.af0796faef3a1bd94890b3a3@sraoss.co.jp
Whole thread Raw
In response to Re: Implementing Incremental View Maintenance  (nuko yokohama <nuko.yokohama@gmail.com>)
List pgsql-hackers
On Sun, 22 Dec 2019 20:54:41 +0900
nuko yokohama <nuko.yokohama@gmail.com> wrote:

> SELECT statement that is not IMMUTABLE must not be specified when creating
> a view.
> 
> An expression SELECT statement that is not IMMUTABLE must not be specified
> when creating a view.
> 
> In the current implementation, a SELECT statement containing an expression
> that is not IMMUTABLE can be specified when creating a view.
> If an incremental materialized view is created from a SELECT statement that
> contains an expression that is not IMMUTABLE, applying the SELECT statement
> to the view returns incorrect results.
> To prevent this, we propose that the same error occur when a non-IMMUTABLE
> expression is specified in the "CREATE INDEX" statement.

Thank you for pointing out this. That makes sense.  The check of not-IMMUTABLE
epressions is missing at creating IMMV.  We'll add this.

Thanks,
Yugo Nagata

> 
> The following is an inappropriate example.
> ----
> CREATE TABLE base (id int primary key, data text, ts timestamp);
> CREATE TABLE
> CREATE VIEW base_v AS SELECT * FROM base
>   WHERE ts >= (now() - '3 second'::interval);
> CREATE VIEW
> CREATE MATERIALIZED VIEW base_mv AS SELECT * FROM base
>   WHERE ts >= (now() - '3 second'::interval);
> SELECT 0
> CREATE INCREMENTAL MATERIALIZED VIEW base_imv AS SELECT * FROM base
>   WHERE ts >= (now() - '3 second'::interval);
> SELECT 0
>                                       View "public.base_v"
>  Column |            Type             | Collation | Nullable | Default |
> Storage  | Description
> --------+-----------------------------+-----------+----------+---------+----------+-------------
>  id     | integer                     |           |          |         |
> plain    |
>  data   | text                        |           |          |         |
> extended |
>  ts     | timestamp without time zone |           |          |         |
> plain    |
> View definition:
>  SELECT base.id,
>     base.data,
>     base.ts
>    FROM base
>   WHERE base.ts >= (now() - '00:00:03'::interval);
> 
>                                       Materialized view "public.base_mv"
>  Column |            Type             | Collation | Nullable | Default |
> Storage  | Stats target | Description
> --------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
>  id     | integer                     |           |          |         |
> plain    |              |
>  data   | text                        |           |          |         |
> extended |              |
>  ts     | timestamp without time zone |           |          |         |
> plain    |              |
> View definition:
>  SELECT base.id,
>     base.data,
>     base.ts
>    FROM base
>   WHERE base.ts >= (now() - '00:00:03'::interval);
> Access method: heap
> 
>                                          Materialized view "public.base_imv"
>     Column     |            Type             | Collation | Nullable |
> Default | Storage  | Stats target | Description
>
---------------+-----------------------------+-----------+----------+---------+----------+--------------+-------------
>  id            | integer                     |           |          |
>   | plain    |              |
>  data          | text                        |           |          |
>   | extended |              |
>  ts            | timestamp without time zone |           |          |
>   | plain    |              |
>  __ivm_count__ | bigint                      |           |          |
>   | plain    |              |
> View definition:
>  SELECT base.id,
>     base.data,
>     base.ts
>    FROM base
>   WHERE base.ts >= (now() - '00:00:03'::interval);
> Access method: heap
> Incremental view maintenance: yes
> 
> INSERT INTO base VALUES (generate_series(1,3), 'dummy', clock_timestamp());
> INSERT 0 3
> SELECT * FROM base_v ORDER BY id;
>  id | data  |             ts
> ----+-------+----------------------------
>   1 | dummy | 2019-12-22 11:38:26.367481
>   2 | dummy | 2019-12-22 11:38:26.367599
>   3 | dummy | 2019-12-22 11:38:26.367606
> (3 rows)
> 
> SELECT * FROM base_mv ORDER BY id;
>  id | data | ts
> ----+------+----
> (0 rows)
> 
> REFRESH MATERIALIZED VIEW base_mv;
> REFRESH MATERIALIZED VIEW
> SELECT * FROM base_mv ORDER BY id;
>  id | data  |             ts
> ----+-------+----------------------------
>   1 | dummy | 2019-12-22 11:38:26.367481
>   2 | dummy | 2019-12-22 11:38:26.367599
>   3 | dummy | 2019-12-22 11:38:26.367606
> (3 rows)
> 
> SELECT * FROM base_imv ORDER BY id;
>  id | data  |             ts
> ----+-------+----------------------------
>   1 | dummy | 2019-12-22 11:38:26.367481
>   2 | dummy | 2019-12-22 11:38:26.367599
>   3 | dummy | 2019-12-22 11:38:26.367606
> (3 rows)
> 
> SELECT pg_sleep(3);
>  pg_sleep
> ----------
> 
> (1 row)
> 
> INSERT INTO base VALUES (generate_series(4,6), 'dummy', clock_timestamp());
> INSERT 0 3
> SELECT * FROM base_v ORDER BY id;
>  id | data  |             ts
> ----+-------+----------------------------
>   4 | dummy | 2019-12-22 11:38:29.381414
>   5 | dummy | 2019-12-22 11:38:29.381441
>   6 | dummy | 2019-12-22 11:38:29.381444
> (3 rows)
> 
> SELECT * FROM base_mv ORDER BY id;
>  id | data  |             ts
> ----+-------+----------------------------
>   1 | dummy | 2019-12-22 11:38:26.367481
>   2 | dummy | 2019-12-22 11:38:26.367599
>   3 | dummy | 2019-12-22 11:38:26.367606
> (3 rows)
> 
> REFRESH MATERIALIZED VIEW base_mv;
> REFRESH MATERIALIZED VIEW
> SELECT * FROM base_mv ORDER BY id;
>  id | data  |             ts
> ----+-------+----------------------------
>   4 | dummy | 2019-12-22 11:38:29.381414
>   5 | dummy | 2019-12-22 11:38:29.381441
>   6 | dummy | 2019-12-22 11:38:29.381444
> (3 rows)
> 
> SELECT * FROM base_imv ORDER BY id;
>  id | data  |             ts
> ----+-------+----------------------------
>   1 | dummy | 2019-12-22 11:38:26.367481
>   2 | dummy | 2019-12-22 11:38:26.367599
>   3 | dummy | 2019-12-22 11:38:26.367606
>   4 | dummy | 2019-12-22 11:38:29.381414
>   5 | dummy | 2019-12-22 11:38:29.381441
>   6 | dummy | 2019-12-22 11:38:29.381444
> (6 rows)
> 
> REFRESH MATERIALIZED VIEW base_mv;
> REFRESH MATERIALIZED VIEW
> SELECT * FROM base_imv ORDER BY id;
>  id | data  |             ts
> ----+-------+----------------------------
>   1 | dummy | 2019-12-22 11:38:26.367481
>   2 | dummy | 2019-12-22 11:38:26.367599
>   3 | dummy | 2019-12-22 11:38:26.367606
>   4 | dummy | 2019-12-22 11:38:29.381414
>   5 | dummy | 2019-12-22 11:38:29.381441
>   6 | dummy | 2019-12-22 11:38:29.381444
> (6 rows)
> ----
> 
> 2018年12月27日(木) 21:57 Yugo Nagata <nagata@sraoss.co.jp>:
> 
> > Hi,
> >
> > I would like to implement Incremental View Maintenance (IVM) on
> > PostgreSQL.
> > IVM is a technique to maintain materialized views which computes and
> > applies
> > only the incremental changes to the materialized views rather than
> > recomputate the contents as the current REFRESH command does.
> >
> > I had a presentation on our PoC implementation of IVM at PGConf.eu 2018
> > [1].
> > Our implementation uses row OIDs to compute deltas for materialized
> > views.
> > The basic idea is that if we have information about which rows in base
> > tables
> > are contributing to generate a certain row in a matview then we can
> > identify
> > the affected rows when a base table is updated. This is based on an idea of
> > Dr. Masunaga [2] who is a member of our group and inspired from ID-based
> > approach[3].
> >
> > In our implementation, the mapping of the row OIDs of the materialized view
> > and the base tables are stored in "OID map". When a base relation is
> > modified,
> > AFTER trigger is executed and the delta is recorded in delta tables using
> > the transition table feature. The accual udpate of the matview is triggerd
> > by REFRESH command with INCREMENTALLY option.
> >
> > However, we realize problems of our implementation. First, WITH OIDS will
> > be removed since PG12, so OIDs are no longer available. Besides this, it
> > would
> > be hard to implement this since it needs many changes of executor nodes to
> > collect base tables's OIDs during execuing a query. Also, the cost of
> > maintaining
> > OID map would be high.
> >
> > For these reasons, we started to think to implement IVM without relying on
> > OIDs
> > and made a bit more surveys.
> >
> > We also looked at Kevin Grittner's discussion [4] on incremental matview
> > maintenance.  In this discussion, Kevin proposed to use counting algorithm
> > [5]
> > to handle projection views (using DISTNICT) properly. This algorithm need
> > an
> > additional system column, count_t, in materialized views and delta tables
> > of
> > base tables.
> >
> > However, the discussion about IVM is now stoped, so we would like to
> > restart and
> > progress this.
> >
> >
> > Through our PoC inplementation and surveys, I think we need to think at
> > least
> > the followings for implementing IVM.
> >
> > 1. How to extract changes on base tables
> >
> > I think there would be at least two approaches for it.
> >
> >  - Using transition table in AFTER triggers
> >  - Extracting changes from WAL using logical decoding
> >
> > In our PoC implementation, we used AFTER trigger and transition tables,
> > but using
> > logical decoding might be better from the point of performance of base
> > table
> > modification.
> >
> > If we can represent a change of UPDATE on a base table as query-like
> > rather than
> > OLD and NEW, it may be possible to update the materialized view directly
> > instead
> > of performing delete & insert.
> >
> >
> > 2. How to compute the delta to be applied to materialized views
> >
> > Essentially, IVM is based on relational algebra. Theorically, changes on
> > base
> > tables are represented as deltas on this, like "R <- R + dR", and the
> > delta on
> > the materialized view is computed using base table deltas based on "change
> > propagation equations".  For implementation, we have to derive the
> > equation from
> > the view definition query (Query tree, or Plan tree?) and describe this as
> > SQL
> > query to compulte delta to be applied to the materialized view.
> >
> > There could be several operations for view definition: selection,
> > projection,
> > join,  aggregation, union, difference, intersection, etc.  If we can
> > prepare a
> > module for each operation, it makes IVM extensable, so we can start a
> > simple
> > view definition, and then support more complex views.
> >
> >
> > 3. How to identify rows to be modifed in materialized views
> >
> > When applying the delta to the materialized view, we have to identify
> > which row
> > in the matview is corresponding to a row in the delta.  A naive method is
> > matching
> > by using all columns in a tuple, but clearly this is unefficient. If
> > thematerialized
> > view has unique index, we can use this. Maybe, we have to force
> > materialized views
> > to have all primary key colums in their base tables.  In our PoC
> > implementation, we
> > used OID to identify rows, but this will be no longer available as said
> > above.
> >
> >
> > 4. When to maintain materialized views
> >
> > There are two candidates of the timing of maintenance, immediate (eager)
> > or deferred.
> >
> > In eager maintenance, the materialized view is updated in the same
> > transaction
> > where the base table is updated. In deferred maintenance, this is done
> > after the
> > transaction is commited, for example, when view is accessed, as a response
> > to user
> > request, etc.
> >
> > In the previous discussion[4], it is planned to start from "eager"
> > approach. In our PoC
> > implementaion, we used the other aproach, that is, using REFRESH command
> > to perform IVM.
> > I am not sure which is better as a start point, but I begin to think that
> > the eager
> > approach may be more simple since we don't have to maintain base table
> > changes in other
> > past transactions.
> >
> > In the eager maintenance approache, we have to consider a race condition
> > where two
> > different transactions change base tables simultaneously as discussed in
> > [4].
> >
> >
> > [1]
> >
https://www.postgresql.eu/events/pgconfeu2018/schedule/session/2195-implementing-incremental-view-maintenance-on-postgresql/
> > [2]
> >
https://ipsj.ixsq.nii.ac.jp/ej/index.php?active_action=repository_view_main_item_detail&page_id=13&block_id=8&item_id=191254&item_no=1
> > (Japanese only)
> > [3] https://dl.acm.org/citation.cfm?id=2750546
> > [4]
> > https://www.postgresql.org/message-id/flat/1368561126.64093.YahooMailNeo%40web162904.mail.bf1.yahoo.com
> > [5] https://dl.acm.org/citation.cfm?id=170066
> >
> > Regards,
> > --
> > Yugo Nagata <nagata@sraoss.co.jp>
> >
> >


-- 
Yugo Nagata <nagata@sraoss.co.jp>



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Drongo vs. 9.4 initdb TAP test
Next
From: Tatsuo Ishii
Date:
Subject: Re: Implementing Incremental View Maintenance