Thread: GSoC - proposal - Materialized Views in PostgreSQL
Hello, I am sending my proposal on GSoC. Details are listed below. Please, if you have ideas, tips, or if you only want to say you opinion about my project, go ahead. thanks, Pavel Baros Abstract: It is effort to implement snapshot materialized view (are only updated when refreshed) in PostgreSQL. In this time I finished some part of it and I am trying to publish my present work on git.postgresql.org Benefits to the PostgreSQL Community First of all, it would be the best if my work is helpful to everybody who misses materialized views in PostgreSQL, because PostgreSQL do not have still implemented materialized views. In addition, MV is mentioned as feature in TODO list. Deliverables First of all, at the end of whole my project is not only writing bachelors thesis, but finish it as patch and if possible, get patch into next PostgrSQL release, or keep git repository actual to last PosgreSQL version. I have also personal goals. Arouse the interest about implementing MV in PostgreSQL, or at least arouse discussion about it. Project Schedule My work goes quite good, I am on good way to finish main parts on backend in few weeks. After that I will make and run tests and implement related changes to PosgreSQL tools (psql, pg_dump, etc.). I am also waiting for approval for my repository named "materialized_view" on git.postgresql.org, so I could publish completed parts. For now next step will be to discuss implementation on postgresql.hackers. Bio I am from Czech Republic and I am studying on Faculty of Electrical Engineering on Czech Technical University in Prague www.fel.cvut.cz/en/. My bachelor thesis is based on this project, implementing MV in PostgreSQL. I've experienced many different jobs. The best experience for me was, when I've worked as tester and software engineer in C/C++ and C# for Radiant Systems Inc. for more than year. After that I've worked as Web developer with Internet technologies (PHP, HTML, CSS, ...), where the goal was to make an internal system for an advertising agency. Finally my recent job was as Windows Mobile Developer. Except the first experience, others lasts only few months mainly because those were temporary projects. For now I am looking for some part time job, of course, preferably something closer to database systems. Implementation: could be divided to few steps: 1) create materialized view - modify grammar (parser): CREATE MATERIALIZED VIEW mv_name AS SELECT ... - change executor, so that it will create physical table defined by select statement 2) change rewriter - usually, view is relation with defined rule and when rewriting, rule is fired and relation (view) is replaced by definition of view. If relation do not have rule, planner and executor behave to it as physical table (relation). In case of materialized view we want to rewrite select statement only in case when we refreshing MV. In other cases rewriter should skip rewriting and pick up physical relation. Exclude situation when other rewrite rules which are not related to MV definition are specified. 3) create command that takes snapshot (refresh MV) - modify grammar: ALTER MATERIALIZED VIEW mv_name REFRESH; - taking snapshot (refreshing) is similar to command "SELECT INTO ..." and I decided to follow the way it works. After parsing query and before transformation is MANUALLY created tree representation of "SELECT * INTO ..." with flag IntoClause->isrefresh set true, indicating it is refreshing materialized view. Everithing acts as it would be regular "SELECT INTO ..." except functions OpenIntoRel() and CloseIntoRel(). In function OpenIntoRel is created temp table (without catalog) and set as destination for result of select. In function CloseIntoRel executor swap relfilenode's of temp table and original table and finally delete temp table. Behavior of CloseIntoRel function is inspired by CLUSTER statement. Contacts: baros.p@seznam.cz
pavelbaros <baros.p@seznam.cz> wrote: > I am also waiting for approval for my repository named > "materialized_view" on git.postgresql.org They seem to prefer that you get a repository under your name and use materialized_view as a branch name. See my account on git.postgresql.org and its serializable branch for an example. I learned by putting in a request similar to your pending one. ;-) -Kevin
2010/4/9 pavelbaros <baros.p@seznam.cz>: > Implementation: could be divided to few steps: > > 1) create materialized view > - modify grammar (parser): CREATE MATERIALIZED VIEW mv_name AS SELECT ... > - change executor, so that it will create physical table defined by select > statement This basically needs to work the same was as CREATE TABLE ... AS SELECT ... - save that it should also stuff the rewritten query someplace, so that it can be re-executed. I think one of the important design questions here is figuring out exactly where that "someplace" should be. I also suspect that we want to block any write access to the relation except for view refreshes. IOW, INSERT, UPDATE, and DELETE on the underlying relation should be rejected (though perhaps rewrite rules redirecting such operations to other tables could be allowed). > 2) change rewriter > - usually, view is relation with defined rule and when rewriting, rule is > fired and relation (view) is replaced by definition of view. If relation do > not have rule, planner and executor behave to it as physical table > (relation). In case of materialized view we want to rewrite select statement > only in case when we refreshing MV. In other cases rewriter should skip > rewriting and pick up physical relation. Exclude situation when other > rewrite rules which are not related to MV definition are specified. > > 3) create command that takes snapshot (refresh MV) > - modify grammar: ALTER MATERIALIZED VIEW mv_name REFRESH; > - taking snapshot (refreshing) is similar to command "SELECT INTO ..." and I > decided to follow the way it works. After parsing query and before > transformation is MANUALLY created tree representation of "SELECT * INTO > ..." with flag IntoClause->isrefresh set true, indicating it is refreshing > materialized view. Everithing acts as it would be regular "SELECT INTO ..." > except functions OpenIntoRel() and CloseIntoRel(). In function OpenIntoRel > is created temp table (without catalog) and set as destination for result of > select. In function CloseIntoRel executor swap relfilenode's of temp table > and original table and finally delete temp table. Behavior of CloseIntoRel > function is inspired by CLUSTER statement. I'll have to read the code before I can comment on the rest of this in detail. ...Robert
pavelbaros wrote: > I am also waiting for approval for my repository named > "materialized_view" on git.postgresql.org, so I could publish > completed parts. Presuming that you're going to wander there and get assigned what looks like an official repo name for this project is a bit...optimistic. I would recommend that you publish to something like github instead (you can fork http://github.com/postgres/postgres ), and if the work looks good enough that it gets picked up by the community maybe you migrate it onto the main site eventually. git.postgresql.org is really not setup to be general hosting space for everyone who has a PostgreSQL related project; almost every repo on there belongs to someone who has already been a steady project contributor for a number of years. (Switching to boilerplate mode for a paragraph...) You have picked a PostgreSQL feature that is dramatically more difficult than it appears to be, and I wouldn't expect you'll actually finish even a fraction of your goals in a summer of work. You're at least in plentiful company--most students do the same. As a rule, if you see a feature on our TODO list that looks really useful and fun to work on, it's only still there because people have tried multiple times to build it completely but not managed to do so because it's harder than it appears. This is certainly the case with materialized views. You've outlined a reasonable way to build a prototype that does a limited implementation here. The issue is what it will take to extend that into being production quality for the real-world uses of materialized views. How useful your prototype is depends on how well it implements a subset of that in a way that will get used by the final design. The main hidden complexity in this particular project relates to handling view refreshes. The non-obvious problem is that when the view updates, you need something like a SQL MERGE to really handle that in a robust way that doesn't conflict with concurrent access to queries against the materialized view. And work on MERGE support is itself blocked behind the fact that PostgreSQL doesn't have a good way to lock access to a key value that doesn't exist yet--what other databases call key range locking. See the notes for "Add SQL-standard MERGE/REPLACE/UPSERT command" at http://wiki.postgresql.org/wiki/Todo for more information. You can work around that to build a prototype by grabbing a full table lock on the materialized view when updating it, but that's not a production quality solution. Solving that little detail is actually more work than the entire project you've outlined. Your suggested implementation--"In function CloseIntoRel executor swap relfilenode's of temp table and original table and finally delete temp table"--is where the full table lock is going to end up at. The exact use cases that need materialized views cannot handle a CLUSTER-style table recreation each time that needs an exclusive lock to switchover, so that whole part of your design is going to be a prototype that doesn't work at all like what needs to get built to make this feature committable. It's also not a reasonable assumption that you have enough disk space to hold a second copy of the MV in a production system. Once there's a good way to merge updates, how to efficiently generate them against the sort of large data sets that need materalized views--so you just write out the updates rather than a whole new copy--is itself a large project with a significant quantity of academic research to absorb before starting. Dan Colish at Portland State has been playing around with prototypes for the specific problem of finding a good algorithm for view refreshing that is compatible with PostgreSQL's execution model. He's already recognized the table lock issue here and for the moment is ignoring that part. I don't have a good feel yet for how long the targeted update code will take to mature, but based on what I do know I suspect that little detail is also a larger effort than the entire scope you're envisioning. There's a reason why the MIT Press compendium "Materialized Views: Techniques, Implementations, and Applications" is over 600 pages long--I hope you've already started digging through that material. Now, with all that said, that doesn't mean there's not a useful project for you buried in this mess. The first two steps in your plan: 1) create materialized view 2) change rewriter Include building a prototype grammer, doing an initial executor implementation, and getting some sort of rewriter working. That is potentially good groundwork to lay here. I would suggest that you completely drop your step 3: 3) create command that takes snapshot (refresh MV) Because you cannot built that in a way that will be useful (and by that I mean committable quality) until there's a better way to handle updates than writing a whole new table and grabbing a full relation lock to switch to it. To do a good job just on the first two steps should take at least a whole summer anyway--there's a whole stack of background research needed I haven't seen anyone do yet, and that isn't on your plan yet. There is a precedent for taking this approach. After getting stalled trying to add the entirety of easy partitioning to PostgreSQL, the current scope has been scaled back to just trying to get the syntax and on-disk structure right, then finish off the implementation. See http://wiki.postgresql.org/wiki/Table_partitioning to get an idea how that's been broken into those two major chunks. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
2010/4/9 Greg Smith <greg@2ndquadrant.com>: > The main hidden complexity in this particular project relates to handling > view refreshes. The non-obvious problem is that when the view updates, you > need something like a SQL MERGE to really handle that in a robust way that > doesn't conflict with concurrent access to queries against the materialized > view. And work on MERGE support is itself blocked behind the fact that > PostgreSQL doesn't have a good way to lock access to a key value that > doesn't exist yet--what other databases call key range locking. See the > notes for "Add SQL-standard MERGE/REPLACE/UPSERT command" at > http://wiki.postgresql.org/wiki/Todo for more information. > > You can work around that to build a prototype by grabbing a full table lock > on the materialized view when updating it, but that's not a production > quality solution. Solving that little detail is actually more work than the Hmm... I am not sure you're right about this. It's not obvious to me that a brief full-table lock wouldn't be acceptable for an initial implementation. Obviously it wouldn't be suitable for every use case but since we're talking about manually refreshed views that was bound to be true anyway. ...Robert
Greg Smith wrote: > The main hidden complexity in this particular project relates to > handling view refreshes. The non-obvious problem is that when the view > updates, you need something like a SQL MERGE to really handle that in a > robust way that doesn't conflict with concurrent access to queries > against the materialized view. And work on MERGE support is itself > blocked behind the fact that PostgreSQL doesn't have a good way to lock > access to a key value that doesn't exist yet--what other databases call > key range locking. See the notes for "Add SQL-standard > MERGE/REPLACE/UPSERT command" at http://wiki.postgresql.org/wiki/Todo > for more information. > > You can work around that to build a prototype by grabbing a full table > lock on the materialized view when updating it, but that's not a > production quality solution. It would still be useful for many applications. And it would provide a basis to extend later. You don't need to solve all problems at once, as long as what you implement is a useful subset. > Now, with all that said, that doesn't mean there's not a useful project > for you buried in this mess. The first two steps in your plan: > > 1) create materialized view > 2) change rewriter > > Include building a prototype grammer, doing an initial executor > implementation, and getting some sort of rewriter working. That is > potentially good groundwork to lay here. I would suggest that you > completely drop your step 3: > > 3) create command that takes snapshot (refresh MV) > > Because you cannot built that in a way that will be useful (and by that > I mean committable quality) until there's a better way to handle updates > than writing a whole new table and grabbing a full relation lock to > switch to it. To do a good job just on the first two steps should take > at least a whole summer anyway--there's a whole stack of background > research needed I haven't seen anyone do yet, and that isn't on your > plan yet. There is a precedent for taking this approach. After getting > stalled trying to add the entirety of easy partitioning to PostgreSQL, > the current scope has been scaled back to just trying to get the syntax > and on-disk structure right, then finish off the implementation. See > http://wiki.postgresql.org/wiki/Table_partitioning to get an idea how > that's been broken into those two major chunks. The good thing about this subject for GSoC is that it can be divided into many small steps. There's two largely independent main parts: 1. Keep the materialized view up-to-date when the base tables change. This can be further divided into many steps, you can begin by supporting automatic updates only on very simple views with e.g a single table and a where clause. Then extend that to support joins, aggregates, subqueries etc. Keeping it really limited, you could even require the user to write the required triggers himself. 2. Teach the planner to use materialized views automatically when a query references the base tables. So if you issue the query "SELECT * FROM table WHERE foo > 10 AND bar = 10", and there's a materialized view on "SELECT * FROM table WHERE bar = 10", the planner can transform the original query into "SELECT * FROM materializedview WHERE foo > 10". This largely depends on 1, although some DBMSs offer the option to use manually refreshed materialized views too, knowing that they might not be completely up-to-date. There's a lot room to choose which problems you want to tackle, which is good for a summer-of-code project. Your proposal basically describes doing 1, in a limited fashion where the view is not updated automatically, but only when the DBA runs a command to refresh it. I'm not sure if that's useful enough on its own, writing "CREATE MATERIALIZED VIEW ... SELECT ..." doesn't seem any easier than just writing "CREATE TABLA AS ...". But if you can do something about 2, or even a very limited part of 1, keeping the view up-to-date automatically, it becomes much more useful. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote: > > 1. Keep the materialized view up-to-date when the base tables change. > This can be further divided into many steps, you can begin by supporting > automatic updates only on very simple views with e.g a single table and > a where clause. Then extend that to support joins, aggregates, > subqueries etc. Keeping it really limited, you could even require the > user to write the required triggers himself. > > That last bit doesn't strike me as much of an advance. Isn't the whole point of this to automate it? Creating greedy materialized views is usually not terribly difficult now, but you do have to write the triggers. The other thing that could be interesting about this would be some scheme for lazy refresh that didn't involve re-extracting the whole data set. cheers andrew
2010/4/10 Andrew Dunstan <andrew@dunslane.net>: > Heikki Linnakangas wrote: >> >> 1. Keep the materialized view up-to-date when the base tables change. >> This can be further divided into many steps, you can begin by supporting >> automatic updates only on very simple views with e.g a single table and >> a where clause. Then extend that to support joins, aggregates, >> subqueries etc. Keeping it really limited, you could even require the >> user to write the required triggers himself. > > That last bit doesn't strike me as much of an advance. Isn't the whole point > of this to automate it? Creating greedy materialized views is usually not > terribly difficult now, but you do have to write the triggers. Yeah, I agree. > The other thing that could be interesting about this would be some scheme > for lazy refresh that didn't involve re-extracting the whole data set. One way to do this would be to infer a primary key for the result set based on the input query. But I think we don't really have the infrastructure to do this right now, so not really a project for a beginner. ...Robert
Heikki Linnakangas wrote: > Your proposal basically describes > doing 1, in a limited fashion where the view is not updated > automatically, but only when the DBA runs a command to refresh it. I'm > not sure if that's useful enough on its own, writing "CREATE > MATERIALIZED VIEW ... SELECT ..." doesn't seem any easier than just > writing "CREATE TABLA AS ...". But if you can do something about 2, or > even a very limited part of 1, keeping the view up-to-date > automatically, it becomes much more useful. > You've hit upon the core issue here. You can build materialized views right now using "CREATE TABLE AS". You can even update them by creating a new table the same way, with a new name, and doing the LOCK/RENAME/DROP shuffle--what people used to do for rewriting bloated tables before there was CLUSTER. The first step in the proposal here is essentially syntax to give an easier UI for that. That's an interesting step, but recognize that it doesn't actually provide anything you can't do already. If you then note that doing any sort of incremental update to the view is a hard problem, and that a lot of the useful cases for materialized views involve tables where it's impractical to recreate the whole thing anyway, you'll inevitably find yourself deeply lost in the minutia of how to handle the updates. It's really the core problem in building what people expect from a materialized view implementation in a serious database. Chipping away at the other pieces around it doesn't move the feature that far forward, even if you get every single one of them except incremental updates finished, because everything else combined is still not that much work in comparison to the issues around updates. There certainly are a fair number of subproblems you can break out of here. I just think it's important to recognize that the path that leads to a useful GSoC project and the one that gives a production quality materialized view implementation may not have that much in common, and to manage expectations on both sides accordingly. If Pavel thinks he's going to end up being able to say "I added materialized views to PostgreSQL" at the end of the summer, that's going to end in disappointment. And if people think this project plan will lead to being able to claim PostgreSQL now has this feature, that's also not going to go well. If the scope is "add initial grammar and rewriting moving toward a future materialized view feature", which the underlying implementation noted as a stub prototype, that might work out OK. This is why I likened it to the work on "Syntax for partitioning", which has a similarly focused subgoal structure. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Robert Haas wrote: > It's not obvious to me > that a brief full-table lock wouldn't be acceptable for an initial > implementation. Obviously it wouldn't be suitable for every use case > but since we're talking about manually refreshed views that was bound > to be true anyway. > There already is an initial implementation of sorts. There are a couple of ways you can build these right now, so any new development has to look like it will end with good odds of being an improvement over what's already available before it's worth putting development resources into. As a rough idea of what people want these for in the field, based on what I've seen requests for, imagine that someone has a 1TB table they're materializing a view on in order to get at least a 10:1, and hopefully close to a 100:1, speedup on viewing summary data. Now, picture what happens if you have someone doing a sequential scan on the MV, which is still quite big, the updater process lines up to grab an exclusive lock when it's done, and now a second user wanting to read a single row quickly comes along behind it. Given a full-table lock implementation, that scenario is unlikely to play out with the second user getting a fast response. They'll likely sit in a lock queue for some potentially long period of time instead, waiting for the active seq scan to finish then the update to happen. You have to build it that way or a steady stream of people reading could block out updates forever. To be frank, that makes for a materalized view implementation of little value over what you can currently do as far as I'm concerned. It might be interesting as a prototype, but that's not necessarily going to look like what's needed to do this for real at all. I'm not a big fan of dumping work into projects when you can see exactly how it's going to fail before you even get started. As I see if, if you know where it's going to fall down, you don't need to build a prototype as an exercise to show you how to build it--you should work on that part first instead. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Sat, Apr 10, 2010 at 11:40 PM, Greg Smith <greg@2ndquadrant.com> wrote: > To be frank, that makes for a materalized view implementation of little > value over what you can currently do as far as I'm concerned. It might be > interesting as a prototype, but that's not necessarily going to look like > what's needed to do this for real at all. I'm not a big fan of dumping work > into projects when you can see exactly how it's going to fail before you > even get started. As I see if, if you know where it's going to fall down, > you don't need to build a prototype as an exercise to show you how to build > it--you should work on that part first instead. Hopefully, you're already aware that I have enormous respect for your opinions on a wide variety of topics; if not, let me publicly say that I absolutely do. Having said that, I disagree with your conclusions in this instance. I see nothing but upside from this work. It is vastly easier to write a patch that builds on existing functionality than it is to write something new from scratch. If there's any value in having manually refreshed materialized views, then having the simplest possible implementation of what those can look like committed will make it far easier to plan out next steps. While the proposed implementation may not solve a huge number of real-world problems, I think there's a good argument that some people will get good use of it. Not everyone has 1TB tables with continuous access patterns. And, provided that it doesn't conflict with anything we want to do in the future, being useful to some people is a good enough reason to put it in. I also think that you're underestimating the number of problems that will have to be solved to get this done. It's going to take some significant work - both design work and coding work - to figure out how this should integrate into the rest of the system. (What should be the value of pg_class.relkind? Where should the node representation of the snapshot query be stored? And did we handle all of those OID dependencies correctly?) Where I can see this possibly falling down (other than being just too much work for a relative PostgreSQL novice to get it done in one summer) is if there are concerns about it being incompatible with incrementally-updated views. I imagine that we're going to want to eventually support both, so we need to make sure that this implementation doesn't box us into a corner. But as far as snapshot views go, complaining that the proposed locking is too strong doesn't seem quite fair. Fixing that, AFAICS, is a very hard project, possibly involving significant planner support and an implementation of MERGE, and I would much rather try to land a fundamentals patch like this first and then deal with the gyrations that will be involved in making this work than try to land the whole thing all at once. Of course, if I'm missing something, and there's a SIMPLE way to get materialized views that can be refreshed without a full-table lock, that's another story altogether - maybe you have an idea? Finally, even if we decided NOT to merge this patch because of the limitations you mention (and right now that doesn't seem to be the consensus), having this part of it completed as a starting point for future work might be reason enough by itself. In short: I think you may be letting the perfect be the enemy of the good. ...Robert
Robert Haas wrote: > I also think that you're underestimating the number of problems that > will have to be solved to get this done. It's going to take some > significant work - both design work and coding work - to figure out > how this should integrate into the rest of the system. (What should > be the value of pg_class.relkind? Where should the node > representation of the snapshot query be stored? And did we handle all > of those OID dependencies correctly?) > I don't think I'm underestimating all that, but I suspect Pavel is by a considerable amount. This is why I've been suggesting that a GSoC scope here might just be wrestling with this area of the problem for the whole summer--not even getting into updates beyond a completely trivial implementation, if any at all. Things like "handle OID dependencies" are definitely not on the fun side of the development work that people tend to think about in advance. > Where I can see this possibly falling down (other than being just too > much work for a relative PostgreSQL novice to get it done in one > summer) is if there are concerns about it being incompatible with > incrementally-updated views. I imagine that we're going to want to > eventually support both, so we need to make sure that this > implementation doesn't box us into a corner. Exactly my concern; comitting this part without knowing how that's later going to fit into place strikes me the sort of the thing this project doesn't like to do. The alternate approach of starting with the update machinery is less likely IMHO to get stuck wondering if there's a future blind spot coming or not, since you'd be building from the bottom up starting with the hardest parts. From the rest of your comments, I'm comfortable that you're in sync with the not necessarily obvious risky spots here I wanted to raise awareness of. It's unreasonable to expect we'll have exactly the same priorities here, and I doubt it's useful to debate how I perceive the merit of various development subsets here compared to yourself. I don't think it's really important whether anyone agrees with me or not about exactly the value of a full table lock implementation. The main thing I'm concerned about is just that it's noted as a known risky part, one that could end up blocking the project's ability to commit even a subset of the proposed patch here. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Robert Haas wrote: > 2010/4/10 Andrew Dunstan <andrew@dunslane.net>: >> Heikki Linnakangas wrote: >>> 1. Keep the materialized view up-to-date when the base tables change. >>> This can be further divided into many steps, you can begin by supporting >>> automatic updates only on very simple views with e.g a single table and >>> a where clause. Then extend that to support joins, aggregates, >>> subqueries etc. Keeping it really limited, you could even require the >>> user to write the required triggers himself. >> That last bit doesn't strike me as much of an advance. Isn't the whole point >> of this to automate it? Creating greedy materialized views is usually not >> terribly difficult now, but you do have to write the triggers. > > Yeah, I agree. It doesn't accomplish anything interesting on its own. But if you do the planner changes to automatically use the materialized view to satisfy queries (item 2. in my previous email), it's useful. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Sun, Apr 11, 2010 at 10:26 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > Robert Haas wrote: >> 2010/4/10 Andrew Dunstan <andrew@dunslane.net>: >>> Heikki Linnakangas wrote: >>>> 1. Keep the materialized view up-to-date when the base tables change. >>>> This can be further divided into many steps, you can begin by supporting >>>> automatic updates only on very simple views with e.g a single table and >>>> a where clause. Then extend that to support joins, aggregates, >>>> subqueries etc. Keeping it really limited, you could even require the >>>> user to write the required triggers himself. >>> That last bit doesn't strike me as much of an advance. Isn't the whole point >>> of this to automate it? Creating greedy materialized views is usually not >>> terribly difficult now, but you do have to write the triggers. >> >> Yeah, I agree. > > It doesn't accomplish anything interesting on its own. But if you do the > planner changes to automatically use the materialized view to satisfy > queries (item 2. in my previous email), it's useful. But you can't do that with a snapshot view, only a continuous updated one. ...Robert
On 11.04.10 20:47 , Robert Haas wrote: > On Sun, Apr 11, 2010 at 10:26 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> Robert Haas wrote: >>> 2010/4/10 Andrew Dunstan<andrew@dunslane.net>: >>>> Heikki Linnakangas wrote: >>>>> 1. Keep the materialized view up-to-date when the base tables >>>>> change. This can be further divided into many steps, you can >>>>> begin by supporting automatic updates only on very simple >>>>> views with e.g a single table and a where clause. Then extend >>>>> that to support joins, aggregates, subqueries etc. Keeping it >>>>> really limited, you could even require the user to write the >>>>> required triggers himself. >>>> That last bit doesn't strike me as much of an advance. Isn't >>>> the whole point of this to automate it? Creating greedy >>>> materialized views is usually not terribly difficult now, but >>>> you do have to write the triggers. >>> >>> Yeah, I agree. >> >> It doesn't accomplish anything interesting on its own. But if you >> do the planner changes to automatically use the materialized view >> to satisfy queries (item 2. in my previous email), it's useful. > > But you can't do that with a snapshot view, only a continuous updated > one. If continuous updates prove to be too hard initially, you could instead update the view on select if it's outdated. Such a materialized view would be a kind of inter-session cache for subselects. The hard part would probably be to figure out how to decide whether the view is outdated or not, and to deal with two concurrent transactions trying to use an outdates view (and both trying to refresh it). What makes the second problem hard is that you wouldn't want one of the transactions to wait for the other to complete, because this is not how SELECTs traditionally behave. best regards, Florian Pflug
On Sun, Apr 11, 2010 at 10:13 PM, Florian G. Pflug <fgp@phlo.org> wrote: > If continuous updates prove to be too hard initially, you could instead > update the view on select if it's outdated. Such a materialized view > would be a kind of inter-session cache for subselects. > > The hard part would probably be to figure out how to decide whether the > view is outdated or not, and to deal with two concurrent transactions > trying to use an outdates view (and both trying to refresh it). What > makes the second problem hard is that you wouldn't want one of the > transactions to wait for the other to complete, because this is not how > SELECTs traditionally behave. Well, the proposed project is to create views that only get refreshed manually. ...Robert
On Sun, Apr 11, 2010 at 5:24 AM, Greg Smith <greg@2ndquadrant.com> wrote: > From the rest of your comments, I'm comfortable that you're in sync with the > not necessarily obvious risky spots here I wanted to raise awareness of. > It's unreasonable to expect we'll have exactly the same priorities here, > and I doubt it's useful to debate how I perceive the merit of various > development subsets here compared to yourself. I don't think it's really > important whether anyone agrees with me or not about exactly the value of a > full table lock implementation. The main thing I'm concerned about is just > that it's noted as a known risky part, one that could end up blocking the > project's ability to commit even a subset of the proposed patch here. I think that one of the things that we need to get our hands around is how we're going to distinguish the "snapshot" flavor of materialized view from the "continuous update" flavor. By definition, the latter will only ever be supportable for a fairly restricted subset of all possible queries, and I am assuming that we will not want to decide what the behavior is going to be based on the query but rather based on what the user specifies. Anything else seems like it would be have the potential for severe POLA violations. So we need to think now about how we'll distinguish between the two flavors. I imagine some sort of syntactic marker would be appropriate; not sure what. Reading this thread, I'm starting to grow concerned that some people may feel that manually refreshed materialized views are not even worth bothering with, because (the argument goes) you could just use some table and write a function that updates it. There's probably some truth to that, but I guess my thought is that it would have some value as a convenience feature; and eventually we might optimize it to the point where it would make more sense to use the built-in feature rather than rolling your own. However, if we're going to have complaints that manually refreshed materialized views suck and we should only ever support materialized views to the extent that we can make them automatically update on-the-fly, then let's have those complaints now before someone spends several months of their life on the project only to be told that we don't want it. Let's be clear: I think it's useful, but, if other people disagree, we need to iron that out now. ...Robert
Robert Haas wrote: > On Sun, Apr 11, 2010 at 10:26 AM, Heikki Linnakangas > <heikki.linnakangas@enterprisedb.com> wrote: >> Robert Haas wrote: >>> 2010/4/10 Andrew Dunstan <andrew@dunslane.net>: >>>> Heikki Linnakangas wrote: >>>>> 1. Keep the materialized view up-to-date when the base tables change. >>>>> This can be further divided into many steps, you can begin by supporting >>>>> automatic updates only on very simple views with e.g a single table and >>>>> a where clause. Then extend that to support joins, aggregates, >>>>> subqueries etc. Keeping it really limited, you could even require the >>>>> user to write the required triggers himself. >>>> That last bit doesn't strike me as much of an advance. Isn't the whole point >>>> of this to automate it? Creating greedy materialized views is usually not >>>> terribly difficult now, but you do have to write the triggers. >>> Yeah, I agree. >> It doesn't accomplish anything interesting on its own. But if you do the >> planner changes to automatically use the materialized view to satisfy >> queries (item 2. in my previous email), it's useful. > > But you can't do that with a snapshot view, only a continuous updated one. A materialized view with manually-defined triggers to keep it up-to-date is a continuously updated one. Other DBMSs allow that with snapshot views too, you just don't get totally up-to-date results, but I not sure we want to go there. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
2010/4/12 Robert Haas <robertmhaas@gmail.com>: > On Sun, Apr 11, 2010 at 5:24 AM, Greg Smith <greg@2ndquadrant.com> wrote: >> From the rest of your comments, I'm comfortable that you're in sync with the >> not necessarily obvious risky spots here I wanted to raise awareness of. >> It's unreasonable to expect we'll have exactly the same priorities here, >> and I doubt it's useful to debate how I perceive the merit of various >> development subsets here compared to yourself. I don't think it's really >> important whether anyone agrees with me or not about exactly the value of a >> full table lock implementation. The main thing I'm concerned about is just >> that it's noted as a known risky part, one that could end up blocking the >> project's ability to commit even a subset of the proposed patch here. > > I think that one of the things that we need to get our hands around is > how we're going to distinguish the "snapshot" flavor of materialized > view from the "continuous update" flavor. By definition, the latter > will only ever be supportable for a fairly restricted subset of all > possible queries, and I am assuming that we will not want to decide > what the behavior is going to be based on the query but rather based > on what the user specifies. Anything else seems like it would be have > the potential for severe POLA violations. So we need to think now > about how we'll distinguish between the two flavors. I imagine some > sort of syntactic marker would be appropriate; not sure what. > > Reading this thread, I'm starting to grow concerned that some people > may feel that manually refreshed materialized views are not even worth > bothering with, because (the argument goes) you could just use some > table and write a function that updates it. There's probably some > truth to that, but I guess my thought is that it would have some value > as a convenience feature; and eventually we might optimize it to the > point where it would make more sense to use the built-in feature > rather than rolling your own. However, if we're going to have > complaints that manually refreshed materialized views suck and we > should only ever support materialized views to the extent that we can > make them automatically update on-the-fly, then let's have those > complaints now before someone spends several months of their life on > the project only to be told that we don't want it. Let's be clear: I > think it's useful, but, if other people disagree, we need to iron that > out now. > > ...Robert I thing so manually refreshed materialized views has sense. It is similar to replication - there was replications like slony, but for some people is more important integrated replication in 9.0. More - manually refreshed (periodically refreshed) views can share lot if infrastructure with dynamically actualised views. I am sure so dynamical materialised views is bad task for GSoC - it is too large, too complex. Manually refreshed views is adequate to two months work and it has sense. Regards Pavel Stehule > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
On Mon, Apr 12, 2010 at 2:16 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote: > I am sure so > dynamical materialised views is bad task for GSoC - it is too large, > too complex. Manually refreshed views is adequate to two months work > and it has sense. That is my feeling also - though I fear that even the simplest possible implementation of this feature may be a stretch. Anyway we agree: keep it simple. ...Robert
On 4/9/10 1:36 PM, pavelbaros wrote: > 2) change rewriter > - usually, view is relation with defined rule and when rewriting, rule > is fired and relation (view) is replaced by definition of view. If > relation do not have rule, planner and executor behave to it as physical > table (relation). In case of materialized view we want to rewrite select > statement only in case when we refreshing MV. In other cases rewriter > should skip rewriting and pick up physical relation. Exclude situation > when other rewrite rules which are not related to MV definition are > specified. This was done (although not completed) against PostgreSQL 7.1 by students in Georgia, USA, I believe. It might be worthwhile looking at their work if I can find it (if nowhere else, it should be in the ACM). There are basically 2 major parts for materialized views: A) Planner: Getting the query planner to swap in the MatView for part of a query automatically for query plan portions which the MatView supports; B) Maintenance: maintaining the MatView data according to the programmed scheme (synch, asynch, periodic). I do not believe it is possible to do both of the above in one summer. Of the two, (A) would be more useful since it is possible to manually implement (B) using triggers, queues and cron jobs today. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
On Mon, Apr 12, 2010 at 1:50 PM, Josh Berkus <josh@agliodbs.com> wrote: > On 4/9/10 1:36 PM, pavelbaros wrote: >> 2) change rewriter >> - usually, view is relation with defined rule and when rewriting, rule >> is fired and relation (view) is replaced by definition of view. If >> relation do not have rule, planner and executor behave to it as physical >> table (relation). In case of materialized view we want to rewrite select >> statement only in case when we refreshing MV. In other cases rewriter >> should skip rewriting and pick up physical relation. Exclude situation >> when other rewrite rules which are not related to MV definition are >> specified. > > This was done (although not completed) against PostgreSQL 7.1 by > students in Georgia, USA, I believe. It might be worthwhile looking at > their work if I can find it (if nowhere else, it should be in the ACM). > > There are basically 2 major parts for materialized views: > > A) Planner: Getting the query planner to swap in the MatView for part of > a query automatically for query plan portions which the MatView supports; > > B) Maintenance: maintaining the MatView data according to the programmed > scheme (synch, asynch, periodic). > > I do not believe it is possible to do both of the above in one summer. > Of the two, (A) would be more useful since it is possible to manually > implement (B) using triggers, queues and cron jobs today. I don't believe that it's possible to do EITHER of those things in one summer. I believe that a basic implementation that has NO bells and whistles at all, as originally proposed, is going to be a Very Hard Project. ...Robert
Josh Berkus wrote: > There are basically 2 major parts for materialized views: > A) Planner: Getting the query planner to swap in the MatView for part of > a query automatically for query plan portions which the MatView supports; > B) Maintenance: maintaining the MatView data according to the programmed > scheme (synch, asynch, periodic). > I'm run more into problems where it's perfectly fine to specify using the materialized view directly in the query, but keeping that view up to date usefully was the real problem. The whole idea of getting a MV used automatically is valuable, but far down the roadmap as I see it. Not everyone would agree of course, and your description does suggest a better way to organize a high-level summary though; here's a first cut: 1) Creation of materalized view Current state: using "CREATE TABLE AS" or similar mechanism, maintain manually Optimal: "CREATE MATERIALIZED VIEW" grammar, metadata to store MV data, dump/reload support 2) Updating materialized views Current state: periodically create new snapshots, or maintain using triggers Optimal: Built-in refresh via multiple strategies, with minimal locking as to improve concurrent access 3) Using materialized views in the planner Current state: specify the manually created MV in queries that can use it Optimal: Automatically accelerate queries that could be satisfied by substituting available MVs With (1) being what I think is the only GSoC sized subset here. I'm not saying someone can't jump right into (3), using the current implementations for (1) and (2) that are floating around out there. I just think it would end up wasting a fair amount of work on prototypes that don't work quite the same way as the eventual fully integrated version. You certainly can start working on (3) without a fully fleshed out implementation of (2), I don't know that it makes sense to work on before (1) though. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
On Mon, Apr 12, 2010 at 3:43 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Josh Berkus wrote: >> >> There are basically 2 major parts for materialized views: >> A) Planner: Getting the query planner to swap in the MatView for part of >> a query automatically for query plan portions which the MatView supports; >> B) Maintenance: maintaining the MatView data according to the programmed >> scheme (synch, asynch, periodic). >> > > I'm run more into problems where it's perfectly fine to specify using the > materialized view directly in the query, but keeping that view up to date > usefully was the real problem. The whole idea of getting a MV used > automatically is valuable, but far down the roadmap as I see it. > > Not everyone would agree of course, and your description does suggest a > better way to organize a high-level summary though; here's a first cut: > > 1) Creation of materalized view > Current state: using "CREATE TABLE AS" or similar mechanism, maintain > manually > Optimal: "CREATE MATERIALIZED VIEW" grammar, metadata to store MV data, > dump/reload support > > 2) Updating materialized views > Current state: periodically create new snapshots, or maintain using > triggers > Optimal: Built-in refresh via multiple strategies, with minimal locking as > to improve concurrent access > > 3) Using materialized views in the planner > Current state: specify the manually created MV in queries that can use it > Optimal: Automatically accelerate queries that could be satisfied by > substituting available MVs > > With (1) being what I think is the only GSoC sized subset here. > > I'm not saying someone can't jump right into (3), using the current > implementations for (1) and (2) that are floating around out there. I just > think it would end up wasting a fair amount of work on prototypes that don't > work quite the same way as the eventual fully integrated version. You > certainly can start working on (3) without a fully fleshed out > implementation of (2), I don't know that it makes sense to work on before > (1) though. Good summary. ...Robert
Greg, > I'm not saying someone can't jump right into (3), using the current > implementations for (1) and (2) that are floating around out there. I > just think it would end up wasting a fair amount of work on prototypes > that don't work quite the same way as the eventual fully integrated > version. You certainly can start working on (3) without a fully fleshed > out implementation of (2), I don't know that it makes sense to work on > before (1) though. What would be the use case for (1) by itself? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus wrote: > What would be the use case for (1) by itself? > There isn't any use case for just working on the infrastructure, just like there's no use case for "Syntax for partitioning" on its own. That why people rarely work on that part of these problems--it's boring and produces no feature of value on its own. I believe that in both cases, attempts to build the more complicated parts, ones that don't first address some of the core infrastructure first, will continue to produce only prototypes. I don't want to see Materialized Views wander down the same path as partitioning, where lots of people produce "fun parts" patches, while ignoring the grunt work of things like production quality catalog support for the feature. I think Pavel's proposal got that part right by starting with the grammar and executor setup trivia. And Robert's comments about the details in that area it's easy to forget about hit the mark too. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
> I don't want to see Materialized Views wander down the same path as > partitioning, where lots of people produce "fun parts" patches, while > ignoring the grunt work of things like production quality catalog > support for the feature. I think Pavel's proposal got that part right > by starting with the grammar and executor setup trivia. And Robert's > comments about the details in that area it's easy to forget about hit > the mark too. Good point. And GSoC may be one of the few times we can get people to do that kind of work. Other than Simon, of course. ;-) I just worry about any feature which doesn't get as far as a user-visible implementation. If someone doesn't do the rest of the parts soon, such features tend to atrophy because nobody is using them. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus wrote: > I just worry about any feature which doesn't get as far as a > user-visible implementation. If someone doesn't do the rest of the > parts soon, such features tend to atrophy because nobody is using them. > While they're limited, there are complexly viable prototype quality implementations possible here without a large amount of work to get them started. I'm not worried too much about this feature being unused. As I was just reminded when assembling an page on the wiki about it: http://wiki.postgresql.org/wiki/Materalized_Views it's currently ranked #1--by a large margin--on the UserVoice feature request survey that Peter kicked off. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
Josh Berkus <josh@agliodbs.com> writes: > There are basically 2 major parts for materialized views: > A) Planner: Getting the query planner to swap in the MatView for part of > a query automatically for query plan portions which the MatView supports; > B) Maintenance: maintaining the MatView data according to the programmed > scheme (synch, asynch, periodic). > I do not believe it is possible to do both of the above in one summer. > Of the two, (A) would be more useful since it is possible to manually > implement (B) using triggers, queues and cron jobs today. A patch that implements only (A) will be DOA. The reason is that the planner can *never* "swap in a MatView" on its own authority, unless it can prove that this does not change the semantics of the query. Which it obviously will be unable to do unless there's a fully transparent continuous-update scheme in place. So the correct approach is to work on (B) first. When and if we get to a sufficiently transparent update implementation, we can think about changing the planner. regards, tom lane
Greg Smith wrote: > pavelbaros wrote: >> I am also waiting for approval for my repository named >> "materialized_view" on git.postgresql.org, so I could publish >> completed parts. > > Presuming that you're going to wander there and get assigned what > looks like an official repo name for this project is a > bit...optimistic. I would recommend that you publish to something > like github instead (you can fork http://github.com/postgres/postgres > ), and if the work looks good enough that it gets picked up by the > community maybe you migrate it onto the main site eventually. > git.postgresql.org is really not setup to be general hosting space for > everyone who has a PostgreSQL related project; almost every repo on > there belongs to someone who has already been a steady project > contributor for a number of years. Yes, you're true, I'm kind of newbe in this kind of project and specially in PostgreSQL. But I think it is best way to get into PostgreSQL. When I chose my bachelor thesis I did not know I could participate GSoC or try to make it commitable. Anyway I will make repo on github, so everybody could look at it, as soon as posible. <http://github.com/pbaros/postgres> > > (Switching to boilerplate mode for a paragraph...) You have picked a > PostgreSQL feature that is dramatically more difficult than it appears > to be, and I wouldn't expect you'll actually finish even a fraction of > your goals in a summer of work. You're at least in plentiful > company--most students do the same. As a rule, if you see a feature > on our TODO list that looks really useful and fun to work on, it's > only still there because people have tried multiple times to build it > completely but not managed to do so because it's harder than it > appears. This is certainly the case with materialized views. > > You've outlined a reasonable way to build a prototype that does a > limited implementation here. The issue is what it will take to extend > that into being production quality for the real-world uses of > materialized views. How useful your prototype is depends on how well > it implements a subset of that in a way that will get used by the > final design. > > The main hidden complexity in this particular project relates to > handling view refreshes. The non-obvious problem is that when the > view updates, you need something like a SQL MERGE to really handle > that in a robust way that doesn't conflict with concurrent access to > queries against the materialized view. And work on MERGE support is > itself blocked behind the fact that PostgreSQL doesn't have a good way > to lock access to a key value that doesn't exist yet--what other > databases call key range locking. See the notes for "Add SQL-standard > MERGE/REPLACE/UPSERT command" at http://wiki.postgresql.org/wiki/Todo > for more information. > > You can work around that to build a prototype by grabbing a full table > lock on the materialized view when updating it, but that's not a > production quality solution. Solving that little detail is actually > more work than the entire project you've outlined. Your suggested > implementation--"In function CloseIntoRel executor swap relfilenode's > of temp table and original table and finally delete temp table"--is > where the full table lock is going to end up at. The exact use cases > that need materialized views cannot handle a CLUSTER-style table > recreation each time that needs an exclusive lock to switchover, so > that whole part of your design is going to be a prototype that doesn't > work at all like what needs to get built to make this feature > committable. It's also not a reasonable assumption that you have > enough disk space to hold a second copy of the MV in a production system. For now I know it is not commitable in actual state, but for my thesis it is enough and I know it will not be commitable with this design at all. In case of GSoC it will depends on the time I will be able to spend on it, if I will consider some other design. > > Once there's a good way to merge updates, how to efficiently generate > them against the sort of large data sets that need materalized > views--so you just write out the updates rather than a whole new > copy--is itself a large project with a significant quantity of > academic research to absorb before starting. Dan Colish at Portland > State has been playing around with prototypes for the specific problem > of finding a good algorithm for view refreshing that is compatible > with PostgreSQL's execution model. He's already recognized the table > lock issue here and for the moment is ignoring that part. I don't > have a good feel yet for how long the targeted update code will take > to mature, but based on what I do know I suspect that little detail is > also a larger effort than the entire scope you're envisioning. > There's a reason why the MIT Press compendium "Materialized Views: > Techniques, Implementations, and Applications" is over 600 pages > long--I hope you've already started digging through that material. I would like to start to dig through that, but I'm in a hurry now. I already have made a small research on MV as part of my thesis. I also plan to continue study PostgreSQL and Materialized Views more into the depth, preferably as my master thesis. But I realize MV feature commitable to PostgreSQL is not project for one person, of course. > > Now, with all that said, that doesn't mean there's not a useful > project for you buried in this mess. The first two steps in your plan: > > 1) create materialized view > 2) change rewriter > > Include building a prototype grammer, doing an initial executor > implementation, and getting some sort of rewriter working. That is > potentially good groundwork to lay here. I would suggest that you > completely drop your step 3: > > 3) create command that takes snapshot (refresh MV) > > Because you cannot built that in a way that will be useful (and by > that I mean committable quality) until there's a better way to handle > updates than writing a whole new table and grabbing a full relation > lock to switch to it. To do a good job just on the first two steps > should take at least a whole summer anyway--there's a whole stack of > background research needed I haven't seen anyone do yet, and that > isn't on your plan yet. There is a precedent for taking this > approach. After getting stalled trying to add the entirety of easy > partitioning to PostgreSQL, the current scope has been scaled back to > just trying to get the syntax and on-disk structure right, then finish > off the implementation. See > http://wiki.postgresql.org/wiki/Table_partitioning to get an idea how > that's been broken into those two major chunks. > Anyway thanks for all of your advices and help. best regards, Pavel Baros
2010/4/20 Pavel <baros.p@seznam.cz>: > For now I know it is not commitable in actual state, but for my thesis it is > enough and I know it will not be commitable with this design at all. In case > of GSoC it will depends on the time I will be able to spend on it, if I will > consider some other design. I am not sure about this, but I would think we would not want to accept the project unless you intend to try to make it committable. I haven't looked at your actual code to see how much work I think that would take. ...Robert