Thread: GSoC - proposal - Materialized Views in PostgreSQL

GSoC - proposal - Materialized Views in PostgreSQL

From

pavelbaros

Date:

09 April 2010, 17:38:22

Hello,

I am sending my proposal on GSoC. Details are listed below. Please, if 
you have ideas, tips, or if you only want to say you opinion about my 
project, go ahead.

thanks,
Pavel Baros

Abstract:

It is effort to implement snapshot materialized view (are only updated 
when refreshed) in PostgreSQL. In this time I finished some part of it 
and I am trying to publish my present work on git.postgresql.org


Benefits to the PostgreSQL Community

First of all, it would be the best if my work is helpful to everybody 
who misses materialized views in PostgreSQL, because PostgreSQL do not 
have still implemented materialized views. In addition, MV is mentioned 
as feature in TODO list.


Deliverables

First of all, at the end of whole my project is not only writing 
bachelors thesis, but finish it as patch and if possible, get patch into 
next PostgrSQL release, or keep git repository actual to last PosgreSQL 
version. I have also personal goals. Arouse the interest about 
implementing MV in PostgreSQL, or at least arouse discussion about it.


Project Schedule

My work goes quite good, I am on good way to finish main parts on 
backend in few weeks. After that I will make and run tests and implement 
related changes to PosgreSQL tools (psql, pg_dump, etc.). I am also 
waiting for approval for my repository named "materialized_view" on 
git.postgresql.org, so I could publish completed parts. For now next 
step will be to discuss implementation on postgresql.hackers.


Bio

I am from Czech Republic and I am studying on Faculty of Electrical 
Engineering on Czech Technical University in Prague www.fel.cvut.cz/en/. 
My bachelor thesis is based on this project, implementing MV in PostgreSQL.

I've experienced many different jobs. The best experience for me was, 
when I've worked as tester and software engineer in C/C++ and C# for 
Radiant Systems Inc. for more than year. After that I've worked as Web 
developer with Internet technologies (PHP, HTML, CSS, ...), where the 
goal was to make an internal system for an advertising agency. Finally 
my recent job was as Windows Mobile Developer. Except the first 
experience, others lasts only few months mainly because those were 
temporary projects. For now I am looking for some part time job, of 
course, preferably something closer to database systems.


Implementation:  could be divided to few steps:

1) create materialized view
- modify grammar (parser): CREATE MATERIALIZED VIEW mv_name AS SELECT ...
- change executor, so that it will create physical table defined by 
select statement

2) change rewriter
- usually, view is relation with defined rule and when rewriting, rule 
is fired and relation (view) is replaced by definition of view. If 
relation do not have rule, planner and executor behave to it as physical 
table (relation). In case of materialized view we want to rewrite select 
statement only in case when we refreshing MV. In other cases rewriter 
should skip rewriting and pick up physical relation. Exclude situation 
when other rewrite rules which are not related to MV definition are 
specified.

3) create command that takes snapshot (refresh MV)
- modify grammar: ALTER MATERIALIZED VIEW mv_name REFRESH;
- taking snapshot (refreshing) is similar to command "SELECT INTO ..." 
and I decided to follow the way it works. After parsing query and before 
transformation is MANUALLY created tree representation of "SELECT * INTO 
..." with flag IntoClause->isrefresh set true, indicating it is 
refreshing materialized view. Everithing acts as it would be regular 
"SELECT INTO ..." except functions OpenIntoRel() and CloseIntoRel(). In 
function OpenIntoRel is created temp table (without catalog) and set as 
destination for result of select. In function CloseIntoRel executor swap 
relfilenode's of temp table and original table and finally delete temp 
table. Behavior of CloseIntoRel function is inspired by CLUSTER statement.


Contacts:  baros.p@seznam.cz

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

"Kevin Grittner"

Date:

09 April 2010, 17:54:19

pavelbaros <baros.p@seznam.cz> wrote:
> I am also waiting for approval for my repository named
> "materialized_view" on git.postgresql.org
They seem to prefer that you get a repository under your name and
use materialized_view as a branch name.  See my account on
git.postgresql.org and its serializable branch for an example.
I learned by putting in a request similar to your pending one.
;-)
-Kevin

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Robert Haas

Date:

09 April 2010, 19:05:46

2010/4/9 pavelbaros <baros.p@seznam.cz>:
> Implementation:  could be divided to few steps:
>
> 1) create materialized view
> - modify grammar (parser): CREATE MATERIALIZED VIEW mv_name AS SELECT ...
> - change executor, so that it will create physical table defined by select
> statement

This basically needs to work the same was as CREATE TABLE ... AS
SELECT ... - save that it should also stuff the rewritten query
someplace, so that it can be re-executed.  I think one of the
important design questions here is figuring out exactly where that
"someplace" should be.

I also suspect that we want to block any write access to the relation
except for view refreshes.  IOW, INSERT, UPDATE, and DELETE on the
underlying relation should be rejected (though perhaps rewrite rules
redirecting such operations to other tables could be allowed).

> 2) change rewriter
> - usually, view is relation with defined rule and when rewriting, rule is
> fired and relation (view) is replaced by definition of view. If relation do
> not have rule, planner and executor behave to it as physical table
> (relation). In case of materialized view we want to rewrite select statement
> only in case when we refreshing MV. In other cases rewriter should skip
> rewriting and pick up physical relation. Exclude situation when other
> rewrite rules which are not related to MV definition are specified.
>
> 3) create command that takes snapshot (refresh MV)
> - modify grammar: ALTER MATERIALIZED VIEW mv_name REFRESH;
> - taking snapshot (refreshing) is similar to command "SELECT INTO ..." and I
> decided to follow the way it works. After parsing query and before
> transformation is MANUALLY created tree representation of "SELECT * INTO
> ..." with flag IntoClause->isrefresh set true, indicating it is refreshing
> materialized view. Everithing acts as it would be regular "SELECT INTO ..."
> except functions OpenIntoRel() and CloseIntoRel(). In function OpenIntoRel
> is created temp table (without catalog) and set as destination for result of
> select. In function CloseIntoRel executor swap relfilenode's of temp table
> and original table and finally delete temp table. Behavior of CloseIntoRel
> function is inspired by CLUSTER statement.

I'll have to read the code before I can comment on the rest of this in detail.

...Robert

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Greg Smith

Date:

10 April 2010, 00:53:49

pavelbaros wrote:
> I am also waiting for approval for my repository named 
> "materialized_view" on git.postgresql.org, so I could publish 
> completed parts. 

Presuming that you're going to wander there and get assigned what looks 
like an official repo name for this project is a bit...optimistic.  I 
would recommend that you publish to something like github instead (you 
can fork http://github.com/postgres/postgres ), and if the work looks 
good enough that it gets picked up by the community maybe you migrate it 
onto the main site eventually.  git.postgresql.org is really not setup 
to be general hosting space for everyone who has a PostgreSQL related 
project; almost every repo on there belongs to someone who has already 
been a steady project contributor for a number of years.

(Switching to boilerplate mode for a paragraph...) You have picked a 
PostgreSQL feature that is dramatically more difficult than it appears 
to be, and I wouldn't expect you'll actually finish even a fraction of 
your goals in a summer of work.  You're at least in plentiful 
company--most students do the same.  As a rule, if you see a feature on 
our TODO list that looks really useful and fun to work on, it's only 
still there because people have tried multiple times to build it 
completely but not managed to do so because it's harder than it 
appears.  This is certainly the case with materialized views.

You've outlined a reasonable way to build a prototype that does a 
limited implementation here.  The issue is what it will take to extend 
that into being production quality for the real-world uses of 
materialized views.  How useful your prototype is depends on how well it 
implements a subset of that in a way that will get used by the final design.

The main hidden complexity in this particular project relates to 
handling view refreshes.  The non-obvious problem is that when the view 
updates, you need something like a SQL MERGE to really handle that in a 
robust way that doesn't conflict with concurrent access to queries 
against the materialized view.  And work on MERGE support is itself 
blocked behind the fact that PostgreSQL doesn't have a good way to lock 
access to a key value that doesn't exist yet--what other databases call 
key range locking.  See the notes for "Add SQL-standard 
MERGE/REPLACE/UPSERT command" at http://wiki.postgresql.org/wiki/Todo 
for more information.

You can work around that to build a prototype by grabbing a full table 
lock on the materialized view when updating it, but that's not a 
production quality solution.  Solving that little detail is actually 
more work than the entire project you've outlined.  Your suggested 
implementation--"In function CloseIntoRel executor swap relfilenode's of 
temp table and original table and finally delete temp table"--is where 
the full table lock is going to end up at.  The exact use cases that 
need materialized views cannot handle a CLUSTER-style table recreation 
each time that needs an exclusive lock to switchover, so that whole part 
of your design is going to be a prototype that doesn't work at all like 
what needs to get built to make this feature committable.  It's also not 
a reasonable assumption that you have enough disk space to hold a second 
copy of the MV in a production system.

Once there's a good way to merge updates, how to efficiently generate 
them against the sort of large data sets that need materalized views--so 
you just write out the updates rather than a whole new copy--is itself a 
large project with a significant quantity of academic research to absorb 
before starting.  Dan Colish at Portland State has been playing around 
with prototypes for the specific problem of finding a good algorithm for 
view refreshing that is compatible with PostgreSQL's execution model.  
He's already recognized the table lock issue here and for the moment is 
ignoring that part.  I don't have a good feel yet for how long the 
targeted update code will take to mature, but based on what I do know I 
suspect that little detail is also a larger effort than the entire scope 
you're envisioning.  There's a reason why the MIT Press compendium 
"Materialized Views: Techniques, Implementations, and Applications" is 
over 600 pages long--I hope you've already started digging through that 
material.

Now, with all that said, that doesn't mean there's not a useful project 
for you buried in this mess.  The first two steps in your plan:

1) create materialized view
2) change rewriter

Include building a prototype grammer, doing an initial executor 
implementation, and getting some sort of rewriter working.  That is 
potentially good groundwork to lay here.  I would suggest that you 
completely drop your step 3:

3) create command that takes snapshot (refresh MV)

Because you cannot built that in a way that will be useful (and by that 
I mean committable quality) until there's a better way to handle updates 
than writing a whole new table and grabbing a full relation lock to 
switch to it.  To do a good job just on the first two steps should take 
at least a whole summer anyway--there's a whole stack of background 
research needed I haven't seen anyone do yet, and that isn't on your 
plan yet.  There is a precedent for taking this approach.  After getting 
stalled trying to add the entirety of easy partitioning to PostgreSQL, 
the current scope has been scaled back to just trying to get the syntax 
and on-disk structure right, then finish off the implementation.  See 
http://wiki.postgresql.org/wiki/Table_partitioning to get an idea how 
that's been broken into those two major chunks.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Robert Haas

Date:

10 April 2010, 01:32:52

2010/4/9 Greg Smith <greg@2ndquadrant.com>:
> The main hidden complexity in this particular project relates to handling
> view refreshes.  The non-obvious problem is that when the view updates, you
> need something like a SQL MERGE to really handle that in a robust way that
> doesn't conflict with concurrent access to queries against the materialized
> view.  And work on MERGE support is itself blocked behind the fact that
> PostgreSQL doesn't have a good way to lock access to a key value that
> doesn't exist yet--what other databases call key range locking.  See the
> notes for "Add SQL-standard MERGE/REPLACE/UPSERT command" at
> http://wiki.postgresql.org/wiki/Todo for more information.
>
> You can work around that to build a prototype by grabbing a full table lock
> on the materialized view when updating it, but that's not a production
> quality solution.  Solving that little detail is actually more work than the

Hmm... I am not sure you're right about this.  It's not obvious to me
that a brief full-table lock wouldn't be acceptable for an initial
implementation.  Obviously it wouldn't be suitable for every use case
but since we're talking about manually refreshed views that was bound
to be true anyway.

...Robert

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Heikki Linnakangas

Date:

10 April 2010, 04:07:54

Greg Smith wrote:
> The main hidden complexity in this particular project relates to
> handling view refreshes.  The non-obvious problem is that when the view
> updates, you need something like a SQL MERGE to really handle that in a
> robust way that doesn't conflict with concurrent access to queries
> against the materialized view.  And work on MERGE support is itself
> blocked behind the fact that PostgreSQL doesn't have a good way to lock
> access to a key value that doesn't exist yet--what other databases call
> key range locking.  See the notes for "Add SQL-standard
> MERGE/REPLACE/UPSERT command" at http://wiki.postgresql.org/wiki/Todo
> for more information.
> 
> You can work around that to build a prototype by grabbing a full table
> lock on the materialized view when updating it, but that's not a
> production quality solution.

It would still be useful for many applications. And it would provide a
basis to extend later. You don't need to solve all problems at once, as
long as what you implement is a useful subset.

> Now, with all that said, that doesn't mean there's not a useful project
> for you buried in this mess.  The first two steps in your plan:
> 
> 1) create materialized view
> 2) change rewriter
> 
> Include building a prototype grammer, doing an initial executor
> implementation, and getting some sort of rewriter working.  That is
> potentially good groundwork to lay here.  I would suggest that you
> completely drop your step 3:
> 
> 3) create command that takes snapshot (refresh MV)
> 
> Because you cannot built that in a way that will be useful (and by that
> I mean committable quality) until there's a better way to handle updates
> than writing a whole new table and grabbing a full relation lock to
> switch to it.  To do a good job just on the first two steps should take
> at least a whole summer anyway--there's a whole stack of background
> research needed I haven't seen anyone do yet, and that isn't on your
> plan yet.  There is a precedent for taking this approach.  After getting
> stalled trying to add the entirety of easy partitioning to PostgreSQL,
> the current scope has been scaled back to just trying to get the syntax
> and on-disk structure right, then finish off the implementation.  See
> http://wiki.postgresql.org/wiki/Table_partitioning to get an idea how
> that's been broken into those two major chunks.

The good thing about this subject for GSoC is that it can be divided
into many small steps. There's two largely independent main parts:

1. Keep the materialized view up-to-date when the base tables change.
This can be further divided into many steps, you can begin by supporting
automatic updates only on very simple views with e.g a single table and
a where clause. Then extend that to support joins, aggregates,
subqueries etc. Keeping it really limited, you could even require the
user to write the required triggers himself.

2. Teach the planner to use materialized views automatically when a
query references the base tables. So if you issue the query "SELECT *
FROM table WHERE foo > 10 AND bar = 10", and there's a materialized view
on "SELECT * FROM table WHERE bar = 10", the planner can transform the
original query into "SELECT * FROM materializedview WHERE foo > 10".
This largely depends on 1, although some DBMSs offer the option to use
manually refreshed materialized views too, knowing that they might not
be completely up-to-date.

There's a lot room to choose which problems you want to tackle, which is
good for a summer-of-code project. Your proposal basically describes
doing 1, in a limited fashion where the view is not updated
automatically, but only when the DBA runs a command to refresh it. I'm
not sure if that's useful enough on its own, writing "CREATE
MATERIALIZED VIEW ... SELECT ..." doesn't seem any easier than just
writing "CREATE TABLA AS ...". But if you can do something about 2, or
even a very limited part of 1, keeping the view up-to-date
automatically, it becomes much more useful.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Andrew Dunstan

Date:

10 April 2010, 10:15:39

Heikki Linnakangas wrote:
>
> 1. Keep the materialized view up-to-date when the base tables change.
> This can be further divided into many steps, you can begin by supporting
> automatic updates only on very simple views with e.g a single table and
> a where clause. Then extend that to support joins, aggregates,
> subqueries etc. Keeping it really limited, you could even require the
> user to write the required triggers himself.
>
>   

That last bit doesn't strike me as much of an advance. Isn't the whole 
point of this to automate it? Creating greedy materialized views is 
usually not terribly difficult now, but you do have to write the triggers.

The other thing that could be interesting about this would be some 
scheme for lazy refresh that didn't involve re-extracting the whole data 
set.

cheers

andrew

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Robert Haas

Date:

10 April 2010, 10:18:47

2010/4/10 Andrew Dunstan <andrew@dunslane.net>:
> Heikki Linnakangas wrote:
>>
>> 1. Keep the materialized view up-to-date when the base tables change.
>> This can be further divided into many steps, you can begin by supporting
>> automatic updates only on very simple views with e.g a single table and
>> a where clause. Then extend that to support joins, aggregates,
>> subqueries etc. Keeping it really limited, you could even require the
>> user to write the required triggers himself.
>
> That last bit doesn't strike me as much of an advance. Isn't the whole point
> of this to automate it? Creating greedy materialized views is usually not
> terribly difficult now, but you do have to write the triggers.

Yeah, I agree.

> The other thing that could be interesting about this would be some scheme
> for lazy refresh that didn't involve re-extracting the whole data set.

One way to do this would be to infer a primary key for the result set
based on the input query.  But I think we don't really have the
infrastructure to do this right now, so not really a project for a
beginner.

...Robert

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Greg Smith

Date:

11 April 2010, 00:04:02

Heikki Linnakangas wrote:
> Your proposal basically describes
> doing 1, in a limited fashion where the view is not updated
> automatically, but only when the DBA runs a command to refresh it. I'm
> not sure if that's useful enough on its own, writing "CREATE
> MATERIALIZED VIEW ... SELECT ..." doesn't seem any easier than just
> writing "CREATE TABLA AS ...". But if you can do something about 2, or
> even a very limited part of 1, keeping the view up-to-date
> automatically, it becomes much more useful.
>   

You've hit upon the core issue here.  You can build materialized views 
right now using "CREATE TABLE AS".  You can even update them by creating 
a new table the same way, with a new name, and doing the 
LOCK/RENAME/DROP shuffle--what people used to do for rewriting bloated 
tables before there was CLUSTER.  The first step in the proposal here is 
essentially syntax to give an easier UI for that.  That's an interesting 
step, but recognize that it doesn't actually provide anything you can't 
do already.

If you then note that doing any sort of incremental update to the view 
is a hard problem, and that a lot of the useful cases for materialized 
views involve tables where it's impractical to recreate the whole thing 
anyway, you'll inevitably find yourself deeply lost in the minutia of 
how to handle the updates.  It's really the core problem in building 
what people expect from a materialized view implementation in a serious 
database.  Chipping away at the other pieces around it doesn't move the 
feature that far forward, even if you get every single one of them 
except incremental updates finished, because everything else combined is 
still not that much work in comparison to the issues around updates.

There certainly are a fair number of subproblems you can break out of 
here.  I just think it's important to recognize that the path that leads 
to a useful GSoC project and the one that gives a production quality 
materialized view implementation may not have that much in common, and 
to manage expectations on both sides accordingly.  If Pavel thinks he's 
going to end up being able to say "I added materialized views to 
PostgreSQL" at the end of the summer, that's going to end in 
disappointment.  And if people think this project plan will lead to 
being able to claim PostgreSQL now has this feature, that's also not 
going to go well.  If the scope is "add initial grammar and rewriting 
moving toward a future materialized view feature", which the underlying 
implementation noted as a stub prototype, that might work out OK.  This 
is why I likened it to the work on "Syntax for partitioning", which has 
a similarly focused subgoal structure.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Greg Smith

Date:

11 April 2010, 00:41:02

Robert Haas wrote:
> It's not obvious to me
> that a brief full-table lock wouldn't be acceptable for an initial
> implementation.  Obviously it wouldn't be suitable for every use case
> but since we're talking about manually refreshed views that was bound
> to be true anyway.
>   

There already is an initial implementation of sorts.  There are a couple 
of ways you can build these right now, so any new development has to 
look like it will end with good odds of being an improvement over what's 
already available before it's worth putting development resources into.

As a rough idea of what people want these for in the field, based on 
what I've seen requests for, imagine that someone has a 1TB table 
they're materializing a view on in order to get at least a 10:1, and 
hopefully close to a 100:1, speedup on viewing summary data.  Now, 
picture what happens if you have someone doing a sequential scan on the 
MV, which is still quite big, the updater process lines up to grab an 
exclusive lock when it's done, and now a second user wanting to read a 
single row quickly comes along behind it.  Given a full-table lock 
implementation, that scenario is unlikely to play out with the second 
user getting a fast response.  They'll likely sit in a lock queue for 
some potentially long period of time instead, waiting for the active seq 
scan to finish then the update to happen.  You have to build it that way 
or a steady stream of people reading could block out updates forever.

To be frank, that makes for a materalized view implementation of little 
value over what you can currently do as far as I'm concerned.  It might 
be interesting as a prototype, but that's not necessarily going to look 
like what's needed to do this for real at all.  I'm not a big fan of 
dumping work into projects when you can see exactly how it's going to 
fail before you even get started.  As I see if, if you know where it's 
going to fall down, you don't need to build a prototype as an exercise 
to show you how to build it--you should work on that part first instead.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Robert Haas

Date:

11 April 2010, 02:08:57

On Sat, Apr 10, 2010 at 11:40 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> To be frank, that makes for a materalized view implementation of little
> value over what you can currently do as far as I'm concerned.  It might be
> interesting as a prototype, but that's not necessarily going to look like
> what's needed to do this for real at all.  I'm not a big fan of dumping work
> into projects when you can see exactly how it's going to fail before you
> even get started.  As I see if, if you know where it's going to fall down,
> you don't need to build a prototype as an exercise to show you how to build
> it--you should work on that part first instead.

Hopefully, you're already aware that I have enormous respect for your
opinions on a wide variety of topics; if not, let me publicly say that
I absolutely do.

Having said that, I disagree with your conclusions in this instance.
I see nothing but upside from this work.  It is vastly easier to write
a patch that builds on existing functionality than it is to write
something new from scratch.  If there's any value in having manually
refreshed materialized views, then having the simplest possible
implementation of what those can look like committed will make it far
easier to plan out next steps.  While the proposed implementation may
not solve a huge number of real-world problems, I think there's a good
argument that some people will get good use of it.  Not everyone has
1TB tables with continuous access patterns.  And, provided that it
doesn't conflict with anything we want to do in the future, being
useful to some people is a good enough reason to put it in.

I also think that you're underestimating the number of problems that
will have to be solved to get this done.  It's going to take some
significant work - both design work and coding work - to figure out
how this should integrate into the rest of the system.  (What should
be the value of pg_class.relkind?  Where should the node
representation of the snapshot query be stored?  And did we handle all
of those OID dependencies correctly?)

Where I can see this possibly falling down (other than being just too
much work for a relative PostgreSQL novice to get it done in one
summer) is if there are concerns about it being incompatible with
incrementally-updated views.  I imagine that we're going to want to
eventually support both, so we need to make sure that this
implementation doesn't box us into a corner.  But as far as snapshot
views go, complaining that the proposed locking is too strong doesn't
seem quite fair.  Fixing that, AFAICS, is a very hard project,
possibly involving significant planner support and an implementation
of MERGE, and I would much rather try to land a fundamentals patch
like this first and then deal with the gyrations that will be involved
in making this work than try to land the whole thing all at once.

Of course, if I'm missing something, and there's a SIMPLE way to get
materialized views that can be refreshed without a full-table lock,
that's another story altogether - maybe you have an idea?

Finally, even if we decided NOT to merge this patch because of the
limitations you mention (and right now that doesn't seem to be the
consensus), having this part of it completed as a starting point for
future work might be reason enough by itself.

In short: I think you may be letting the perfect be the enemy of the good.

...Robert

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Greg Smith

Date:

11 April 2010, 06:24:31

Robert Haas wrote:
> I also think that you're underestimating the number of problems that
> will have to be solved to get this done.  It's going to take some
> significant work - both design work and coding work - to figure out
> how this should integrate into the rest of the system.  (What should
> be the value of pg_class.relkind?  Where should the node
> representation of the snapshot query be stored?  And did we handle all
> of those OID dependencies correctly?)
>   

I don't think I'm underestimating all that, but I suspect Pavel is by a 
considerable amount.  This is why I've been suggesting that a GSoC scope 
here might just be wrestling with this area of the problem for the whole 
summer--not even getting into updates beyond a completely trivial 
implementation, if any at all.  Things like "handle OID dependencies" 
are definitely not on the fun side of the development work that people 
tend to think about in advance.

> Where I can see this possibly falling down (other than being just too
> much work for a relative PostgreSQL novice to get it done in one
> summer) is if there are concerns about it being incompatible with
> incrementally-updated views.  I imagine that we're going to want to
> eventually support both, so we need to make sure that this
> implementation doesn't box us into a corner.

Exactly my concern; comitting this part without knowing how that's later 
going to fit into place strikes me the sort of the thing this project 
doesn't like to do.  The alternate approach of starting with the update 
machinery is less likely IMHO to get stuck wondering if there's a future 
blind spot coming or not, since you'd be building from the bottom up 
starting with the hardest parts.
From the rest of your comments, I'm comfortable that you're in sync 
with the not necessarily obvious risky spots here I wanted to raise 
awareness of.  It's unreasonable to expect we'll have exactly the same 
priorities  here, and I doubt it's useful to debate how I perceive the 
merit of various development subsets here compared to yourself.  I don't 
think it's really important whether anyone agrees with me or not about 
exactly the value of a full table lock implementation.  The main thing 
I'm concerned about is just that it's noted as a known risky part, one 
that could end up blocking the project's ability to commit even a subset 
of the proposed patch here.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Heikki Linnakangas

Date:

11 April 2010, 11:26:20

Robert Haas wrote:
> 2010/4/10 Andrew Dunstan <andrew@dunslane.net>:
>> Heikki Linnakangas wrote:
>>> 1. Keep the materialized view up-to-date when the base tables change.
>>> This can be further divided into many steps, you can begin by supporting
>>> automatic updates only on very simple views with e.g a single table and
>>> a where clause. Then extend that to support joins, aggregates,
>>> subqueries etc. Keeping it really limited, you could even require the
>>> user to write the required triggers himself.
>> That last bit doesn't strike me as much of an advance. Isn't the whole point
>> of this to automate it? Creating greedy materialized views is usually not
>> terribly difficult now, but you do have to write the triggers.
> 
> Yeah, I agree.

It doesn't accomplish anything interesting on its own. But if you do the
planner changes to automatically use the materialized view to satisfy
queries (item 2. in my previous email), it's useful.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Robert Haas

Date:

11 April 2010, 15:47:49

On Sun, Apr 11, 2010 at 10:26 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Robert Haas wrote:
>> 2010/4/10 Andrew Dunstan <andrew@dunslane.net>:
>>> Heikki Linnakangas wrote:
>>>> 1. Keep the materialized view up-to-date when the base tables change.
>>>> This can be further divided into many steps, you can begin by supporting
>>>> automatic updates only on very simple views with e.g a single table and
>>>> a where clause. Then extend that to support joins, aggregates,
>>>> subqueries etc. Keeping it really limited, you could even require the
>>>> user to write the required triggers himself.
>>> That last bit doesn't strike me as much of an advance. Isn't the whole point
>>> of this to automate it? Creating greedy materialized views is usually not
>>> terribly difficult now, but you do have to write the triggers.
>>
>> Yeah, I agree.
>
> It doesn't accomplish anything interesting on its own. But if you do the
> planner changes to automatically use the materialized view to satisfy
> queries (item 2. in my previous email), it's useful.

But you can't do that with a snapshot view, only a continuous updated one.

...Robert

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

"Florian G. Pflug"

Date:

11 April 2010, 23:20:30

On 11.04.10 20:47 , Robert Haas wrote:
> On Sun, Apr 11, 2010 at 10:26 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> Robert Haas wrote:
>>> 2010/4/10 Andrew Dunstan<andrew@dunslane.net>:
>>>> Heikki Linnakangas wrote:
>>>>> 1. Keep the materialized view up-to-date when the base tables
>>>>> change. This can be further divided into many steps, you can
>>>>> begin by supporting automatic updates only on very simple
>>>>> views with e.g a single table and a where clause. Then extend
>>>>> that to support joins, aggregates, subqueries etc. Keeping it
>>>>> really limited, you could even require the user to write the
>>>>> required triggers himself.
>>>> That last bit doesn't strike me as much of an advance. Isn't
>>>> the whole point of this to automate it? Creating greedy
>>>> materialized views is usually not terribly difficult now, but
>>>> you do have to write the triggers.
>>>
>>> Yeah, I agree.
>>
>> It doesn't accomplish anything interesting on its own. But if you
>> do the planner changes to automatically use the materialized view
>> to satisfy queries (item 2. in my previous email), it's useful.
>
> But you can't do that with a snapshot view, only a continuous updated
> one.

If continuous updates prove to be too hard initially, you could instead
update the view on select if it's outdated. Such a materialized view
would be a kind of inter-session cache for subselects.

The hard part would probably be to figure out how to decide whether the
view is outdated or not, and to deal with two concurrent transactions
trying to use an outdates view (and both trying to refresh it). What
makes the second problem hard is that you wouldn't want one of the
transactions to wait for the other to complete, because this is not how
SELECTs traditionally behave.

best regards, Florian Pflug

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Robert Haas

Date:

12 April 2010, 00:01:11

On Sun, Apr 11, 2010 at 10:13 PM, Florian G. Pflug <fgp@phlo.org> wrote:
> If continuous updates prove to be too hard initially, you could instead
> update the view on select if it's outdated. Such a materialized view
> would be a kind of inter-session cache for subselects.
>
> The hard part would probably be to figure out how to decide whether the
> view is outdated or not, and to deal with two concurrent transactions
> trying to use an outdates view (and both trying to refresh it). What
> makes the second problem hard is that you wouldn't want one of the
> transactions to wait for the other to complete, because this is not how
> SELECTs traditionally behave.

Well, the proposed project is to create views that only get refreshed manually.

...Robert

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Robert Haas

Date:

12 April 2010, 00:09:57

On Sun, Apr 11, 2010 at 5:24 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> From the rest of your comments, I'm comfortable that you're in sync with the
> not necessarily obvious risky spots here I wanted to raise awareness of.
>  It's unreasonable to expect we'll have exactly the same priorities  here,
> and I doubt it's useful to debate how I perceive the merit of various
> development subsets here compared to yourself.  I don't think it's really
> important whether anyone agrees with me or not about exactly the value of a
> full table lock implementation.  The main thing I'm concerned about is just
> that it's noted as a known risky part, one that could end up blocking the
> project's ability to commit even a subset of the proposed patch here.

I think that one of the things that we need to get our hands around is
how we're going to distinguish the "snapshot" flavor of materialized
view from the "continuous update" flavor.  By definition, the latter
will only ever be supportable for a fairly restricted subset of all
possible queries, and I am assuming that we will not want to decide
what the behavior is going to be based on the query but rather based
on what the user specifies.  Anything else seems like it would be have
the potential for severe POLA violations.  So we need to think now
about how we'll distinguish between the two flavors.  I imagine some
sort of syntactic marker would be appropriate; not sure what.

Reading this thread, I'm starting to grow concerned that some people
may feel that manually refreshed materialized views are not even worth
bothering with, because (the argument goes) you could just use some
table and write a function that updates it.  There's probably some
truth to that, but I guess my thought is that it would have some value
as a convenience feature; and eventually we might optimize it to the
point where it would make more sense to use the built-in feature
rather than rolling your own.  However, if we're going to have
complaints that manually refreshed materialized views suck and we
should only ever support materialized views to the extent that we can
make them automatically update on-the-fly, then let's have those
complaints now before someone spends several months of their life on
the project only to be told that we don't want it.  Let's be clear: I
think it's useful, but, if other people disagree, we need to iron that
out now.

...Robert

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Heikki Linnakangas

Date:

12 April 2010, 02:26:39

Robert Haas wrote:
> On Sun, Apr 11, 2010 at 10:26 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> Robert Haas wrote:
>>> 2010/4/10 Andrew Dunstan <andrew@dunslane.net>:
>>>> Heikki Linnakangas wrote:
>>>>> 1. Keep the materialized view up-to-date when the base tables change.
>>>>> This can be further divided into many steps, you can begin by supporting
>>>>> automatic updates only on very simple views with e.g a single table and
>>>>> a where clause. Then extend that to support joins, aggregates,
>>>>> subqueries etc. Keeping it really limited, you could even require the
>>>>> user to write the required triggers himself.
>>>> That last bit doesn't strike me as much of an advance. Isn't the whole point
>>>> of this to automate it? Creating greedy materialized views is usually not
>>>> terribly difficult now, but you do have to write the triggers.
>>> Yeah, I agree.
>> It doesn't accomplish anything interesting on its own. But if you do the
>> planner changes to automatically use the materialized view to satisfy
>> queries (item 2. in my previous email), it's useful.
> 
> But you can't do that with a snapshot view, only a continuous updated one.

A materialized view with manually-defined triggers to keep it up-to-date
is a continuously updated one.

Other DBMSs allow that with snapshot views too, you just don't get
totally up-to-date results, but I not sure we want to go there.

--  Heikki Linnakangas EnterpriseDB   http://www.enterprisedb.com

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Pavel Stehule

Date:

12 April 2010, 03:16:29

2010/4/12 Robert Haas <robertmhaas@gmail.com>:
> On Sun, Apr 11, 2010 at 5:24 AM, Greg Smith <greg@2ndquadrant.com> wrote:
>> From the rest of your comments, I'm comfortable that you're in sync with the
>> not necessarily obvious risky spots here I wanted to raise awareness of.
>>  It's unreasonable to expect we'll have exactly the same priorities  here,
>> and I doubt it's useful to debate how I perceive the merit of various
>> development subsets here compared to yourself.  I don't think it's really
>> important whether anyone agrees with me or not about exactly the value of a
>> full table lock implementation.  The main thing I'm concerned about is just
>> that it's noted as a known risky part, one that could end up blocking the
>> project's ability to commit even a subset of the proposed patch here.
>
> I think that one of the things that we need to get our hands around is
> how we're going to distinguish the "snapshot" flavor of materialized
> view from the "continuous update" flavor.  By definition, the latter
> will only ever be supportable for a fairly restricted subset of all
> possible queries, and I am assuming that we will not want to decide
> what the behavior is going to be based on the query but rather based
> on what the user specifies.  Anything else seems like it would be have
> the potential for severe POLA violations.  So we need to think now
> about how we'll distinguish between the two flavors.  I imagine some
> sort of syntactic marker would be appropriate; not sure what.
>
> Reading this thread, I'm starting to grow concerned that some people
> may feel that manually refreshed materialized views are not even worth
> bothering with, because (the argument goes) you could just use some
> table and write a function that updates it.  There's probably some
> truth to that, but I guess my thought is that it would have some value
> as a convenience feature; and eventually we might optimize it to the
> point where it would make more sense to use the built-in feature
> rather than rolling your own.  However, if we're going to have
> complaints that manually refreshed materialized views suck and we
> should only ever support materialized views to the extent that we can
> make them automatically update on-the-fly, then let's have those
> complaints now before someone spends several months of their life on
> the project only to be told that we don't want it.  Let's be clear: I
> think it's useful, but, if other people disagree, we need to iron that
> out now.
>
> ...Robert

I thing so manually refreshed materialized views has sense. It is
similar to replication - there was replications like slony, but for
some people is more important integrated replication in 9.0. More -
manually refreshed (periodically refreshed) views can share lot if
infrastructure with dynamically actualised views. I am sure so
dynamical materialised views is bad task for GSoC - it is too large,
too complex. Manually refreshed views is adequate to two months work
and it has sense.

Regards
Pavel Stehule

>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Robert Haas

Date:

12 April 2010, 07:52:18

On Mon, Apr 12, 2010 at 2:16 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I am sure so
> dynamical materialised views is bad task for GSoC - it is too large,
> too complex. Manually refreshed views is adequate to two months work
> and it has sense.

That is my feeling also - though I fear that even the simplest
possible implementation of this feature may be a stretch.  Anyway we
agree: keep it simple.

...Robert

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Josh Berkus

Date:

12 April 2010, 14:51:09

On 4/9/10 1:36 PM, pavelbaros wrote:
> 2) change rewriter
> - usually, view is relation with defined rule and when rewriting, rule
> is fired and relation (view) is replaced by definition of view. If
> relation do not have rule, planner and executor behave to it as physical
> table (relation). In case of materialized view we want to rewrite select
> statement only in case when we refreshing MV. In other cases rewriter
> should skip rewriting and pick up physical relation. Exclude situation
> when other rewrite rules which are not related to MV definition are
> specified.

This was done (although not completed) against PostgreSQL 7.1 by
students in Georgia, USA, I believe.  It might be worthwhile looking at
their work if I can find it (if nowhere else, it should be in the ACM).

There are basically 2 major parts for materialized views:

A) Planner: Getting the query planner to swap in the MatView for part of
a query automatically for query plan portions which the MatView supports;

B) Maintenance: maintaining the MatView data according to the programmed
scheme (synch, asynch, periodic).

I do not believe it is possible to do both of the above in one summer.
Of the two, (A) would be more useful since it is possible to manually
implement (B) using triggers, queues and cron jobs today.

--                                  -- Josh Berkus                                    PostgreSQL Experts Inc.
                        http://www.pgexperts.com

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Robert Haas

Date:

12 April 2010, 15:04:35

On Mon, Apr 12, 2010 at 1:50 PM, Josh Berkus <josh@agliodbs.com> wrote:
> On 4/9/10 1:36 PM, pavelbaros wrote:
>> 2) change rewriter
>> - usually, view is relation with defined rule and when rewriting, rule
>> is fired and relation (view) is replaced by definition of view. If
>> relation do not have rule, planner and executor behave to it as physical
>> table (relation). In case of materialized view we want to rewrite select
>> statement only in case when we refreshing MV. In other cases rewriter
>> should skip rewriting and pick up physical relation. Exclude situation
>> when other rewrite rules which are not related to MV definition are
>> specified.
>
> This was done (although not completed) against PostgreSQL 7.1 by
> students in Georgia, USA, I believe.  It might be worthwhile looking at
> their work if I can find it (if nowhere else, it should be in the ACM).
>
> There are basically 2 major parts for materialized views:
>
> A) Planner: Getting the query planner to swap in the MatView for part of
> a query automatically for query plan portions which the MatView supports;
>
> B) Maintenance: maintaining the MatView data according to the programmed
> scheme (synch, asynch, periodic).
>
> I do not believe it is possible to do both of the above in one summer.
> Of the two, (A) would be more useful since it is possible to manually
> implement (B) using triggers, queues and cron jobs today.

I don't believe that it's possible to do EITHER of those things in one
summer.  I believe that a basic implementation that has NO bells and
whistles at all, as originally proposed, is going to be a Very Hard
Project.

...Robert

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Greg Smith

Date:

12 April 2010, 16:44:33

Josh Berkus wrote:
> There are basically 2 major parts for materialized views:
> A) Planner: Getting the query planner to swap in the MatView for part of
> a query automatically for query plan portions which the MatView supports;
> B) Maintenance: maintaining the MatView data according to the programmed
> scheme (synch, asynch, periodic).
>   

I'm run more into problems where it's perfectly fine to specify using 
the materialized view directly in the query, but keeping that view up to 
date usefully was the real problem.  The whole idea of getting a MV used 
automatically is valuable, but far down the roadmap as I see it.

Not everyone would agree of course, and your description does suggest a 
better way to organize a high-level summary though; here's a first cut:

1) Creation of materalized view
Current state:  using "CREATE TABLE AS" or similar mechanism, maintain 
manually
Optimal:  "CREATE MATERIALIZED VIEW" grammar, metadata to store MV data, 
dump/reload support

2) Updating materialized views
Current state:  periodically create new snapshots, or maintain using 
triggers
Optimal:  Built-in refresh via multiple strategies, with minimal locking 
as to improve concurrent access

3) Using materialized views in the planner
Current state:  specify the manually created MV in queries that can use it
Optimal:  Automatically accelerate queries that could be satisfied by 
substituting available MVs

With (1) being what I think is the only GSoC sized subset here.

I'm not saying someone can't jump right into (3), using the current 
implementations for (1) and (2) that are floating around out there.  I 
just think it would end up wasting a fair amount of work on prototypes 
that don't work quite the same way as the eventual fully integrated 
version.  You certainly can start working on (3) without a fully fleshed 
out implementation of (2), I don't know that it makes sense to work on 
before (1) though.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Robert Haas

Date:

12 April 2010, 17:03:22

On Mon, Apr 12, 2010 at 3:43 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Josh Berkus wrote:
>>
>> There are basically 2 major parts for materialized views:
>> A) Planner: Getting the query planner to swap in the MatView for part of
>> a query automatically for query plan portions which the MatView supports;
>> B) Maintenance: maintaining the MatView data according to the programmed
>> scheme (synch, asynch, periodic).
>>
>
> I'm run more into problems where it's perfectly fine to specify using the
> materialized view directly in the query, but keeping that view up to date
> usefully was the real problem.  The whole idea of getting a MV used
> automatically is valuable, but far down the roadmap as I see it.
>
> Not everyone would agree of course, and your description does suggest a
> better way to organize a high-level summary though; here's a first cut:
>
> 1) Creation of materalized view
> Current state:  using "CREATE TABLE AS" or similar mechanism, maintain
> manually
> Optimal:  "CREATE MATERIALIZED VIEW" grammar, metadata to store MV data,
> dump/reload support
>
> 2) Updating materialized views
> Current state:  periodically create new snapshots, or maintain using
> triggers
> Optimal:  Built-in refresh via multiple strategies, with minimal locking as
> to improve concurrent access
>
> 3) Using materialized views in the planner
> Current state:  specify the manually created MV in queries that can use it
> Optimal:  Automatically accelerate queries that could be satisfied by
> substituting available MVs
>
> With (1) being what I think is the only GSoC sized subset here.
>
> I'm not saying someone can't jump right into (3), using the current
> implementations for (1) and (2) that are floating around out there.  I just
> think it would end up wasting a fair amount of work on prototypes that don't
> work quite the same way as the eventual fully integrated version.  You
> certainly can start working on (3) without a fully fleshed out
> implementation of (2), I don't know that it makes sense to work on before
> (1) though.

Good summary.

...Robert

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Josh Berkus

Date:

12 April 2010, 19:22:10

Greg,

> I'm not saying someone can't jump right into (3), using the current
> implementations for (1) and (2) that are floating around out there.  I
> just think it would end up wasting a fair amount of work on prototypes
> that don't work quite the same way as the eventual fully integrated
> version.  You certainly can start working on (3) without a fully fleshed
> out implementation of (2), I don't know that it makes sense to work on
> before (1) though.

What would be the use case for (1) by itself?


--                                  -- Josh Berkus                                    PostgreSQL Experts Inc.
                        http://www.pgexperts.com

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Greg Smith

Date:

12 April 2010, 19:47:04

Josh Berkus wrote:
> What would be the use case for (1) by itself?
>   

There isn't any use case for just working on the infrastructure, just 
like there's no use case for "Syntax for partitioning" on its own.  That 
why people rarely work on that part of these problems--it's boring and 
produces no feature of value on its own.  I believe that in both cases, 
attempts to build the more complicated parts, ones that don't first 
address some of the core infrastructure first, will continue to produce 
only prototypes.

I don't want to see Materialized Views wander down the same path as 
partitioning, where lots of people produce "fun parts" patches, while 
ignoring the grunt work of things like production quality catalog 
support for the feature.  I think Pavel's proposal got that part right 
by starting with the grammar and executor setup trivia.  And Robert's 
comments about the details in that area it's easy to forget about hit 
the mark too.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Josh Berkus

Date:

12 April 2010, 22:31:35

> I don't want to see Materialized Views wander down the same path as
> partitioning, where lots of people produce "fun parts" patches, while
> ignoring the grunt work of things like production quality catalog
> support for the feature.  I think Pavel's proposal got that part right
> by starting with the grammar and executor setup trivia.  And Robert's
> comments about the details in that area it's easy to forget about hit
> the mark too.

Good point.  And GSoC may be one of the few times we can get people to
do that kind of work.  Other than Simon, of course.  ;-)

I just worry about any feature which doesn't get as far as a
user-visible implementation.  If someone doesn't do the rest of the
parts soon, such features tend to atrophy because nobody is using them.


--                                  -- Josh Berkus                                    PostgreSQL Experts Inc.
                        http://www.pgexperts.com

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Greg Smith

Date:

13 April 2010, 02:20:34

Josh Berkus wrote:
> I just worry about any feature which doesn't get as far as a
> user-visible implementation.  If someone doesn't do the rest of the
> parts soon, such features tend to atrophy because nobody is using them.
>   

While they're limited, there are complexly viable prototype quality 
implementations possible here without a large amount of work to get them 
started.  I'm not worried too much about this feature being unused.  As 
I was just reminded when assembling an page on the wiki about it:  
http://wiki.postgresql.org/wiki/Materalized_Views it's currently ranked 
#1--by a large margin--on the UserVoice feature request survey that 
Peter kicked off.

-- 
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Tom Lane

Date:

13 April 2010, 17:01:28

Josh Berkus <josh@agliodbs.com> writes:
> There are basically 2 major parts for materialized views:

> A) Planner: Getting the query planner to swap in the MatView for part of
> a query automatically for query plan portions which the MatView supports;

> B) Maintenance: maintaining the MatView data according to the programmed
> scheme (synch, asynch, periodic).

> I do not believe it is possible to do both of the above in one summer.
> Of the two, (A) would be more useful since it is possible to manually
> implement (B) using triggers, queues and cron jobs today.

A patch that implements only (A) will be DOA.  The reason is that the
planner can *never* "swap in a MatView" on its own authority, unless it
can prove that this does not change the semantics of the query.  Which
it obviously will be unable to do unless there's a fully transparent
continuous-update scheme in place.

So the correct approach is to work on (B) first.  When and if we get to
a sufficiently transparent update implementation, we can think about
changing the planner.
        regards, tom lane

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Pavel

Date:

20 April 2010, 16:35:20

Greg Smith wrote:

> pavelbaros wrote:
>> I am also waiting for approval for my repository named 
>> "materialized_view" on git.postgresql.org, so I could publish 
>> completed parts. 
>
> Presuming that you're going to wander there and get assigned what 
> looks like an official repo name for this project is a 
> bit...optimistic.  I would recommend that you publish to something 
> like github instead (you can fork http://github.com/postgres/postgres 
> ), and if the work looks good enough that it gets picked up by the 
> community maybe you migrate it onto the main site eventually.  
> git.postgresql.org is really not setup to be general hosting space for 
> everyone who has a PostgreSQL related project; almost every repo on 
> there belongs to someone who has already been a steady project 
> contributor for a number of years.

Yes, you're true, I'm kind of newbe in this kind of project and 
specially in PostgreSQL. But I think it is best way to get into 
PostgreSQL. When I chose my bachelor thesis I did not know I could 
participate GSoC or try to make it commitable. Anyway I will make repo 
on github, so everybody could look at it, as soon as posible.

<http://github.com/pbaros/postgres>
>  
> (Switching to boilerplate mode for a paragraph...) You have picked a 
> PostgreSQL feature that is dramatically more difficult than it appears 
> to be, and I wouldn't expect you'll actually finish even a fraction of 
> your goals in a summer of work.  You're at least in plentiful 
> company--most students do the same.  As a rule, if you see a feature 
> on our TODO list that looks really useful and fun to work on, it's 
> only still there because people have tried multiple times to build it 
> completely but not managed to do so because it's harder than it 
> appears.  This is certainly the case with materialized views.
>
> You've outlined a reasonable way to build a prototype that does a 
> limited implementation here.  The issue is what it will take to extend 
> that into being production quality for the real-world uses of 
> materialized views.  How useful your prototype is depends on how well 
> it implements a subset of that in a way that will get used by the 
> final design.
>
> The main hidden complexity in this particular project relates to 
> handling view refreshes.  The non-obvious problem is that when the 
> view updates, you need something like a SQL MERGE to really handle 
> that in a robust way that doesn't conflict with concurrent access to 
> queries against the materialized view.  And work on MERGE support is 
> itself blocked behind the fact that PostgreSQL doesn't have a good way 
> to lock access to a key value that doesn't exist yet--what other 
> databases call key range locking.  See the notes for "Add SQL-standard 
> MERGE/REPLACE/UPSERT command" at http://wiki.postgresql.org/wiki/Todo 
> for more information.
>
> You can work around that to build a prototype by grabbing a full table 
> lock on the materialized view when updating it, but that's not a 
> production quality solution.  Solving that little detail is actually 
> more work than the entire project you've outlined.  Your suggested 
> implementation--"In function CloseIntoRel executor swap relfilenode's 
> of temp table and original table and finally delete temp table"--is 
> where the full table lock is going to end up at.  The exact use cases 
> that need materialized views cannot handle a CLUSTER-style table 
> recreation each time that needs an exclusive lock to switchover, so 
> that whole part of your design is going to be a prototype that doesn't 
> work at all like what needs to get built to make this feature 
> committable.  It's also not a reasonable assumption that you have 
> enough disk space to hold a second copy of the MV in a production system.

For now I know it is not commitable in actual state, but for my thesis 
it is enough and I know it will not be commitable with this design at 
all. In case of GSoC it will depends on the time I will be able to spend 
on it, if I will consider some other design.

>
> Once there's a good way to merge updates, how to efficiently generate 
> them against the sort of large data sets that need materalized 
> views--so you just write out the updates rather than a whole new 
> copy--is itself a large project with a significant quantity of 
> academic research to absorb before starting.  Dan Colish at Portland 
> State has been playing around with prototypes for the specific problem 
> of finding a good algorithm for view refreshing that is compatible 
> with PostgreSQL's execution model.  He's already recognized the table 
> lock issue here and for the moment is ignoring that part.  I don't 
> have a good feel yet for how long the targeted update code will take 
> to mature, but based on what I do know I suspect that little detail is 
> also a larger effort than the entire scope you're envisioning.  
> There's a reason why the MIT Press compendium "Materialized Views: 
> Techniques, Implementations, and Applications" is over 600 pages 
> long--I hope you've already started digging through that material.

I would like to start to dig through that, but I'm in a hurry now. I 
already have made a small research on MV as part of my thesis. I also 
plan to continue study PostgreSQL and Materialized Views more into the 
depth, preferably as my master thesis. But I realize MV feature 
commitable to PostgreSQL is not project for one person, of course.

>
> Now, with all that said, that doesn't mean there's not a useful 
> project for you buried in this mess.  The first two steps in your plan:
>
> 1) create materialized view
> 2) change rewriter
>
> Include building a prototype grammer, doing an initial executor 
> implementation, and getting some sort of rewriter working.  That is 
> potentially good groundwork to lay here.  I would suggest that you 
> completely drop your step 3:
>
> 3) create command that takes snapshot (refresh MV)
>
> Because you cannot built that in a way that will be useful (and by 
> that I mean committable quality) until there's a better way to handle 
> updates than writing a whole new table and grabbing a full relation 
> lock to switch to it.  To do a good job just on the first two steps 
> should take at least a whole summer anyway--there's a whole stack of 
> background research needed I haven't seen anyone do yet, and that 
> isn't on your plan yet.  There is a precedent for taking this 
> approach.  After getting stalled trying to add the entirety of easy 
> partitioning to PostgreSQL, the current scope has been scaled back to 
> just trying to get the syntax and on-disk structure right, then finish 
> off the implementation.  See 
> http://wiki.postgresql.org/wiki/Table_partitioning to get an idea how 
> that's been broken into those two major chunks.
>

Anyway thanks for all of your advices and help.

best regards,       Pavel Baros

Re: GSoC - proposal - Materialized Views in PostgreSQL

From

Robert Haas

Date:

21 April 2010, 12:59:24

2010/4/20 Pavel <baros.p@seznam.cz>:
> For now I know it is not commitable in actual state, but for my thesis it is
> enough and I know it will not be commitable with this design at all. In case
> of GSoC it will depends on the time I will be able to spend on it, if I will
> consider some other design.

I am not sure about this, but I would think we would not want to
accept the project unless you intend to try to make it committable.  I
haven't looked at your actual code to see how much work I think that
would take.

...Robert