Thread: Materialized views proposal

Materialized views proposal

From

Jonathan Gardner

Date:

26 November 2003, 13:04:18

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I apologize if this post is inappropriate.

Doing some development work, me and my co-worker discussed some
optimizations strategies. One of the ideas that came up was materialized
views. Trading disk space to summarize queries, and paying for a trigger
waterfall on certain kinds of updates seems like a small price to pay right
now in our situation.

I searched through the archives and I couldn't seem to find anything
relevant.

As I've no familiarity with the internals, but I do know the front-end
pretty well, what can I do to help implement materialized views? Is such an
idea feasible? Is there some reason why it just isn't a good idea? (And if
not, what is a better idea?)

If someone is willing to guide me, I can probably contribute a fair amount
of C code. I do have experience with C.

The bird's eye view of the project would probably be turning a statement (is
there such a statement in SQL92?) in the creation of a table, the initial
population of the table, and the creation of several triggers strategically
placed so that the data in the materialized view table is always in sync.
Direct inserts or updates on the materialized view should be illegal,
except by the triggers. However, perhaps like views work today, we can
allow rules to be added to the table.

Certain restrictions on the materialized views should be enforced: No
mutables, in particular.

- --
Jonathan Gardner
jgardner@jonathangardner.net
Live Free, Use Linux!
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE/xNzcWgwF3QvpWNwRAkbiAJ4m1zRPe+y3Tha0649QXH30y9eITwCfTjsv
ow9Nwnnwrc6x9QaAB1AfHWQ=
=Ofb5
-----END PGP SIGNATURE-----

Re: Materialized views proposal

From

Hannu Krosing

Date:

26 November 2003, 13:20:14

Jonathan Gardner kirjutas K, 26.11.2003 kell 19:03:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> I apologize if this post is inappropriate.
> 
> Doing some development work, me and my co-worker discussed some 
> optimizations strategies. One of the ideas that came up was materialized 
> views. Trading disk space to summarize queries, and paying for a trigger 
> waterfall on certain kinds of updates seems like a small price to pay right 
> now in our situation.
> 
> I searched through the archives and I couldn't seem to find anything 
> relevant.
> 
> As I've no familiarity with the internals, but I do know the front-end 
> pretty well, what can I do to help implement materialized views? 

First, You could start by implementing materialized views manually,
using tables and triggers, to get the feel of what should be generated.

Next, still working from frontend, try to make some code (probably in
some RAD/scripring language) which does the transform from CREATE VIEW
(or whatever) syntax (or manually constructed syntax tree) into create
table + create trigger statements.

If you have the above working well, talk about moving this to backend.

> Is such an 
> idea feasible? Is there some reason why it just isn't a good idea? (And if 
> not, what is a better idea?)
> 
> If someone is willing to guide me, I can probably contribute a fair amount 
> of C code. I do have experience with C.

What is needed is good algorithms. Writing C code is secondary to that.

Similar problem has kept us from implementing updatable views for quite
some time.

--------------------
Hannu

Re: Materialized views proposal

From

Jonathan Gardner

Date:

28 November 2003, 16:01:20

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wednesday 26 November 2003 09:19, Hannu Krosing wrote:
>
> First, You could start by implementing materialized views manually,
> using tables and triggers, to get the feel of what should be generated.
>
> Next, still working from frontend, try to make some code (probably in
> some RAD/scripring language) which does the transform from CREATE VIEW
> (or whatever) syntax (or manually constructed syntax tree) into create
> table + create trigger statements.
>
> If you have the above working well, talk about moving this to backend.
>

Thanks for the suggestion. I will definitely do that.


> What is needed is good algorithms. Writing C code is secondary to that.
>
> Similar problem has kept us from implementing updatable views for quite
> some time.
>

You are definitely correct.

- --
Jonathan M. Gardner
Web Developer, Amazon.com
jonagard@amazon.com - (206) 266-2906
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE/xPfOBFeYcclU5Q0RAhgfAKDAYmm67Jc5n7WwyUDtn5IJOhhrXwCfcRud
EAR9U63FEqlHbGrdzKxGmdw=
=a6RK
-----END PGP SIGNATURE-----

Re: Materialized views proposal

From

Jonathan Gardner

Date:

29 November 2003, 17:00:22

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wednesday 26 November 2003 10:58 am, Jonathan Gardner wrote:
> On Wednesday 26 November 2003 09:19, Hannu Krosing wrote:
> > What is needed is good algorithms. Writing C code is secondary to that.
> >
> > Similar problem has kept us from implementing updatable views for quite
> > some time.
>
> You are definitely correct.

Preliminary research has shown that:
1) We would put triggers on each table that contributes to a view.

2) We would only be interested in those inserts and deletes (counting an
update as a delete and then an insert) that satisfy the "WHERE" clause of
the view.

3) We would implement some sort of differential view update scheme based on
the paper "Efficiently Updating Materialized Views"[1]. They classify all
select queries into one of four categories and describe how to go about
producing a differential update. I couldn't understand this part at all.

- ---

I am already too deep in database theory. While I enjoy reading a good
paper, I just don't have the background in DB science like I do in Physics.
For instance, my set theory is very weak (I can't tell if U is a union or
an intersection, for instanct).

At this point, I have a choice. Drop a couple hundred dollars on database
theory textbooks and spend the next three months learning it, or hoping one
of you database theorists out there take pity on me and pick this up or
coach me through it. Any takers on the second one?

- ---

I have a proposal. Let me hear what you think. There will be two kinds of
materialized view in PostgreSQL:

The first is the "snapshot" approach. Provide a materialized view, but don't
provide automatic updating. The user can call a refresh statement to
repopulate the entire view from time to time as appropriate. The benefit of
this is that you can use queries that have functions that are immutable or
stable. (You could probably use volatile as well but I wouldn't recommend
it). This is so easy to implement, it isn't even funny.

Optimizing this would involve collecting all the inserted / updated /deleted
rows since the last snapshot. We can have a logging table that accumulates
all the changes since the last refresh for this purpose. (This kind of
table may exist anyway for replication or other purposes.)

Finally, we could examine the changed rows, ignore the irrelevant ones,
ignore the redundant ones (IE, rows that have been inserted and then
deleted during that time), and decide whether doing a complete refresh
would be quicker than doing several differential updates to the
materialized view.

The second is the automatically updated materialized view. Each insert,
update, or delete executed against the table that are queried for the view
will trigger a function call using the algorithms proposed in [1]. Every
function in the query must be stable.

We could provide a mechanism to alter the materialized view between snapshot
to auto-updated. It would be as simple as refreshing the snapshot and then
enabling the triggers, or as simple as disabling the triggers.

I already have written a set of scripts in perl that provides the first
proposal without the optimization idea. I will put them up at GBorg.

Footnotes:
[1] ftp://ftp.research.microsoft.com/users/palarson/sigmod86.ps

- --
Jonathan Gardner
jgardner@jonathangardner.net
Live Free, Use Linux!
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE/yQjZWgwF3QvpWNwRAnqkAKCasoODeV2+KcP52DMXiEyq7pfhmACcCFBV
X28Nd5PvlhR8Xta/G4w2qBQ=
=OLa8
-----END PGP SIGNATURE-----

Re: Materialized views proposal

From

Neil Conway

Date:

29 November 2003, 20:18:35

Jonathan Gardner <jgardner@jonathangardner.net> writes:
> 3) We would implement some sort of differential view update scheme
>    based on the paper "Efficiently Updating Materialized Views"[1].

One resource on this topic that appears to be quite authoritative is
"Materialized Views: Techniques, Implementations, and Applications" by
A. Gupta and I. Mumick (MIT Press, 1999). That might present the
material in a manner that is easier to grok than the original papers
themselves.

I didn't realize that it was possible to spend 600 pages only
discussing materialized views, but it apparently it is :-)

-Neil

Re: Materialized views proposal

From

Hannu Krosing

Date:

30 November 2003, 09:09:46

Neil Conway kirjutas P, 30.11.2003 kell 02:18:
> Jonathan Gardner <jgardner@jonathangardner.net> writes:
> > 3) We would implement some sort of differential view update scheme
> >    based on the paper "Efficiently Updating Materialized Views"[1].

Maybe the TelegraphCQ engine can give some ideas

http://telegraph.cs.berkeley.edu/

from brief reading, it seems that all they do is kind of materialized
views ;) - they call it Continuous Dataflow Processing, i.e. queries
that run continuously over incoming data.

----------------
Hannu

Re: Materialized views proposal

From

Sailesh Krishnamurthy

Date:

30 November 2003, 18:47:26

>>>>> "Hannu" == Hannu Krosing <hannu@tm.ee> writes:
Hannu> Neil Conway kirjutas P, 30.11.2003 kell 02:18: >> Jonathan Gardner <jgardner@jonathangardner.net> writes: >
3)We >> would implement some sort of differential view update scheme > >> based on the paper "Efficiently Updating
Materialized >> Views"[1].

Hannu> Maybe the TelegraphCQ engine can give some ideas
Hannu> http://telegraph.cs.berkeley.edu/
Hannu> from brief reading, it seems that all they do is kind of Hannu> materialized views ;) - they call it
ContinuousDataflow Hannu> Processing, i.e. queries that run continuously over Hannu> incoming data.

A fair portion of the community argues that everything that we do is
merely materialized views :-)

While materialized views are certainly related, there are however,
things that are quite different in processing continuous queries over
data streams. The first ideas on stream processing did come from work
on scalable triggers which is intimately related to materialized
views. Work on mviews isn't really concerned with things like windowed
joins and aggregate processing.

My advice is to take a very incremental approach to materialized view
implementation. There are essentially 2 pieces - one the maintenance
of the views and second the routing of queries to automagically use
the materialized views. The former is done by something like triggers
and the latter is through either a query-rewrite mechanism or
something like an optimizer choosing a different access method.

As for the former, there are again two subdivisions - what changes you
propagate and when you apply 'em. What you propagate are the
deltas. So you have the following choices:

IPIA - Immediate Propagate, Immediate Apply
IPDA - Immediate Propagate, Deferred Apply
DPDA - Deferred Propagate, Deferred Apply

DPIA - makes no sense ..

The easiest is DPDA .. in this model you essentially recompute the
view on demand. The next is IPIA.. with IPIA, you can choose to either
recompute the entire view or only figure out how to translate a delta
into appropriate changes in the mview. After that there is IPDA
.. this involves creating a delta table that is populated based on the
triggers. Every so often you process a set of deltas and recompute the
view (either in its entirety or in parts>

Another way to slice the problem is to limit the scope of queries that
can be used to define views.

I suggest that first we don't consider joins .. and only consider
grouped aggregates over a single table. Joins are more hairy with
changes in one table potentially requiring a wholesale recomputation
of the join.

Note that all this discussion is only about the first part of the
problem .. maintenance of the views. The other part, routing is almost
equally complicated .. there you have to solve the query subsumption
problem without also eliminating choices in your access plan.

As people have said there are plenty of papers on this in the
literature. While I am no theorist I can certainly help with reading
the papers .. not every bit of a paper is very useful.

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh