Re: Materialized views WIP patch - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Materialized views WIP patch
Msg-id: CA+TgmoazPwB1eMW3gRnYJE9trJ1RjS_r9yBOF-eMOoGpZBn73Q@mail.gmail.com
In response to: Re: Materialized views WIP patch (Kevin Grittner <kgrittn@ymail.com>)
List: pgsql-hackers
On Tue, Mar 5, 2013 at 7:15 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> I don't think I disagree with any of what Simon says other than his
> feelings about the planning cost. Imagine that there are ten MVs
> that might apply to a complex query, but some of them are mutually
> exclusive, so there are a large number of permutations of MVs which
> could be used to replace parts of the original query. And maybe
> some of the base tables have indexes which could reduce execution
> cost which aren't present in some or all of the MVs. And each MV
> has a number of indexes. The combinatorial explosion of possible
> plans would make it hard to constrain plan time without resorting
> to some crude rules about what to choose. That's not an unsolvable
> problem, but I see it as a pretty big problem.

I'm not sure I agree. Suppose you have a query like SELECT * FROM a
INNER JOIN b ON a.x = b.x INNER JOIN c ON a.y = c.y WHERE <some
stuff>. The query planner will construct paths for scans on a, b, and
c. Then it will construct joinrels for (a b), (a c), (b c), and
eventually (a b c), and calculate a set of promising paths for each
of them. If there is a materialized view available for one of those
joinrels, all we really need to do is add the possible paths for
scanning that materialized view to the joinrel. They'll either be
better than the existing paths, or they won't. If they're better, the
paths that would otherwise have been chosen get kicked out. If
they're worse, the materialized-view paths get kicked out. Either
way, we don't have any more paths than we would have otherwise - so
no combinatorial explosion.

There is one case where we might end up with more paths than we had
before. Suppose there's a materialized view on the query SELECT a.x,
a.y, a.z, b.t FROM a INNER JOIN b ON a.x = b.x ORDER BY a.z, and the
user enters just that query. Suppose further that the materialized
view has an index on column z, but table a does not. In that case,
the best path for the joinrel (a b) not involving the materialized
view will be an unordered path, but we could scan the materialized
view using the index and so will have a path ordered by a.z to add to
the joinrel. That path will stick around even if it's more expensive
than the unordered path, because we know it avoids a final sort. So
in that case we do have more paths, but they are potentially useful
paths, so I don't see that as a problem.
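To make that argument concrete, here is a toy C sketch of the
dominance rule just described: a path survives only if no other path
for the same joinrel is both cheaper and at least as well sorted, and
any path it dominates gets kicked out. This is a simplified model for
illustration, not PostgreSQL's actual add_path(); all of the names
and costs in it are invented.

    /*
     * Toy model of path competition within a single joinrel.  The kept
     * path list stays small no matter how many candidates are offered,
     * which is why adding matview paths can't cause a combinatorial
     * explosion.  Not PostgreSQL's add_path(); names are hypothetical.
     */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct ToyPath
    {
        const char *desc;    /* human-readable description */
        double      cost;    /* estimated total cost */
        bool        sorted;  /* delivers the desired output ordering? */
    } ToyPath;

    /* Add a candidate path; returns the new number of kept paths. */
    static int
    add_toy_path(ToyPath *paths, int npaths, ToyPath newpath)
    {
        int         nkept = 0;

        /* Discard the new path if an existing path dominates it. */
        for (int i = 0; i < npaths; i++)
            if (paths[i].cost <= newpath.cost &&
                (paths[i].sorted || !newpath.sorted))
                return npaths;

        /* Kick out existing paths that the new path dominates. */
        for (int i = 0; i < npaths; i++)
            if (!(newpath.cost <= paths[i].cost &&
                  (newpath.sorted || !paths[i].sorted)))
                paths[nkept++] = paths[i];

        paths[nkept++] = newpath;
        return nkept;
    }

    int
    main(void)
    {
        ToyPath paths[8];
        int     n = 0;

        /* Mirrors the example above; the costs are made up. */
        n = add_toy_path(paths, n,
                         (ToyPath) {"join a/b directly", 100, false});
        n = add_toy_path(paths, n,
                         (ToyPath) {"seqscan of matview", 60, false});
        n = add_toy_path(paths, n,
                         (ToyPath) {"indexscan of matview on z", 140, true});

        for (int i = 0; i < n; i++)
            printf("kept: %s (cost %.0f, %s)\n", paths[i].desc,
                   paths[i].cost,
                   paths[i].sorted ? "ordered by a.z" : "unordered");
        return 0;
    }

Running this keeps exactly two paths: the cheap matview seqscan (the
direct join is kicked out) and the more expensive matview indexscan
(it survives despite its cost because it avoids a final sort).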
It seems to me that the tricky part of this is not that it might add
a lot more paths (because I don't think it will, and if it does, I
think they're useful paths), but that figuring out whether or not a
materialized view matches any particular joinrel might be expensive.
I think we're going to need a pretty accurate heuristic for quickly
discarding materialized views that can't possibly be relevant to the
query as a whole, or to particular joinrels.

There are a couple of possible ways to approach that. The most manual
of these is probably to have a command like ALTER TABLE x { ENABLE |
DISABLE } REWRITE MATERIALIZED y, where the user has to explicitly
ask for materialized views to be considered, or else they aren't.
That sort of fine-grained control might have other benefits as well.

I think a more automated solution is also possible, if we want it.
For a materialized view to match a query, all of the baserels in the
materialized view must also be present in the query. (Actually, there
are situations where this isn't true - e.g. the materialized view has
an extra table, but it's joined in a way that could be pruned away by
the join removal code - but I think we could ignore such
somewhat-pathological cases, at least initially.) It seems to me that
if we could figure out a very cheap way to throw away all of the
materialized views that don't pass that basic test, we'd be
reasonably close to a workable solution. A database with tons of
materialized views defined on the same set of target relations might
still have some planning-time problems, but so might a database with
tons of indexes.

In that regard, what I was thinking about is to use something sort of
like a Bloom filter. Essentially, produce a "fingerprint" for each
materialized view. For the sake of argument, let's say the
fingerprint is a 64-bit integer, although it could be a bit string of
any length. To construct the fingerprint, hash the OID of each
relation involved in the view onto a bit position between 0 and 63,
and set the bits for all relations involved in the materialized view.
Then construct a fingerprint for the query in the same way. Any
materialized view where (matview_fingerprint & query_fingerprint) !=
matview_fingerprint needn't be considered; moreover, for a given
joinrel, any materialized view where matview_fingerprint !=
joinrel_fingerprint (computed using the same scheme) needn't be
considered. Of course, a matching fingerprint doesn't mean that the
materialized view matches, or even necessarily that it involves the
correct set of relations, so there's a lot more double-checking to be
done afterwards before you can decide that the view is in fact a
match. But at least it's a fast way to throw away some large
percentage of the materialized views that are definitely NOT relevant
to the query, and to do more detailed evaluation only on the ones
that might be matches. The chances of false positives can be reduced
if you care to make the fingerprints bigger (and thus more expensive
to compare).
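For concreteness, here is a minimal C sketch of that fingerprinting
scheme under the 64-bit assumption. The hash function, the Oid
stand-in type, and the example OIDs are all invented for
illustration; an actual implementation would presumably use
PostgreSQL's own types and hashing infrastructure.

    /*
     * Sketch of the fingerprint scheme described above: hash each
     * relation OID to a bit position in 0..63, OR the bits together,
     * and use a subset test to cheaply discard non-matching matviews.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t Oid;       /* stand-in for PostgreSQL's Oid */

    /* Map one relation OID onto a bit position between 0 and 63;
     * any reasonable integer hash works, this one is illustrative. */
    static uint64_t
    oid_bit(Oid relid)
    {
        return UINT64_C(1) << ((relid * 2654435761u) % 64);
    }

    /* OR together the bits for every relation involved. */
    static uint64_t
    fingerprint(const Oid *relids, int nrels)
    {
        uint64_t    fp = 0;

        for (int i = 0; i < nrels; i++)
            fp |= oid_bit(relids[i]);
        return fp;
    }

    /* A matview can be relevant to the query only if its fingerprint
     * bits are a subset of the query's.  Hash collisions allow false
     * positives (hence detailed matching afterwards) but never false
     * negatives. */
    static bool
    matview_might_match_query(uint64_t matview_fp, uint64_t query_fp)
    {
        return (matview_fp & query_fp) == matview_fp;
    }

    /* For a given joinrel, the fingerprints must match exactly. */
    static bool
    matview_might_match_joinrel(uint64_t matview_fp, uint64_t joinrel_fp)
    {
        return matview_fp == joinrel_fp;
    }

    int
    main(void)
    {
        Oid query_rels[] = {16384, 16385, 16386};  /* a, b, c (made up) */
        Oid mv_rels[]    = {16384, 16385};         /* matview over a, b */

        uint64_t    qfp = fingerprint(query_rels, 3);
        uint64_t    mfp = fingerprint(mv_rels, 2);

        printf("matview might match query: %s\n",
               matview_might_match_query(mfp, qfp) ? "yes" : "no");
        printf("matview might match joinrel (a b c): %s\n",
               matview_might_match_joinrel(mfp, qfp) ? "yes" : "no");
        return 0;
    }

Widening the fingerprint to a longer bit string, as suggested above,
just means more words in the same AND-and-compare test.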
All that having been said, it's hard for me to imagine that anyone
really cares about any of this until we have an incremental update
feature, which right now we don't. Actually, I'm betting that's going
to be significantly harder than automatic query rewrite, when all is
said and done. Automatic query rewrite, if and when we get it, will
not be easy, and will require a bunch of work from someone with a
good understanding of the planner, but it strikes me as the sort of
thing that might work out to one large project - and then it's done.
Whereas incremental update sounds to me like a series of projects
over a series of releases targeting various special cases, where we
can always point to some improvements vs. release N-1, but we're
never actually done and able to move on to the next thing. As a
roadmap goes, I think that's OK. Even a reasonably simplistic and
partial implementation of incremental update will benefit a lot of
users. But in terms of relative difficulty, it's not at all obvious
to me that that's the easier part of the project.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company