Re: Materialized views WIP patch - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Materialized views WIP patch
Msg-id CA+TgmoazPwB1eMW3gRnYJE9trJ1RjS_r9yBOF-eMOoGpZBn73Q@mail.gmail.com
In response to Re: Materialized views WIP patch  (Kevin Grittner <kgrittn@ymail.com>)
Responses Re: Materialized views WIP patch  (Kevin Grittner <kgrittn@ymail.com>)
Re: Materialized views WIP patch  (Nicolas Barbier <nicolas.barbier@gmail.com>)
Re: Materialized views WIP patch  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Materialized views WIP patch  (Greg Stark <stark@mit.edu>)
List pgsql-hackers
On Tue, Mar 5, 2013 at 7:15 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> I don't think I disagree with any of what Simon says other than his
> feelings about the planning cost.  Imagine that there are ten MVs
> that might apply to a complex query, but some of them are mutually
> exclusive, so there are a large number of permutations of MVs which
> could be used to replace parts of the original query.  And maybe
> some of base tables have indexes which could reduce execution cost
> which aren't present in some or all of the MVs.  And each MV has a
> number of indexes.  The combinatorial explosion of possible plans
> would make it hard to constrain plan time without resorting to some
> crude rules about what to choose.  That's not an unsolvable
> problem, but I see it as a pretty big problem.

I'm not sure I agree.  Suppose you have a query like SELECT * FROM a
INNER JOIN b ON a.x = b.x INNER JOIN c ON a.y = c.y WHERE <some
stuff>.  The query planner will construct paths for scans on a, b, and
c.  Then it will construct joinrels for (a b), (a c), (b c), and
eventually (a b c) and calculate a set of promising paths for each of
them.  If there is a materialized view available for one of those
joinrels, all we really need to do is add the possible paths for
scanning that materialized view to the joinrel.  They'll either be
better than the existing paths, or they won't.  If they're better, the
paths that otherwise would have gotten chosen will get kicked out.  If
they're worse, the materialized-view paths will get kicked out.
Either way, we don't have any more paths than we would have otherwise
- so no combinatorial explosion.

There is one case where we might end up with more paths than we had
before.  Suppose there's a materialized view on the query SELECT a.x,
a.y, a.z, b.t FROM a INNER JOIN b ON a.x = b.x ORDER BY a.z and the
user enters just that query.  Suppose further that the materialized
view has an index on column z, but table a does not.  In that case,
the best path for the joinrel (a b) not involving the materialized
view will be an unordered path, but we could scan the materialized
view using the index and so will have a path that is ordered by a.z to
add to the joinrel.  This path will stick around even if it's more
expensive than the unordered path because we know it avoids a final
sort.  So in that case we do have more paths, but they are potentially
useful paths, so I don't see that as a problem.
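To make the path-competition argument in the last two paragraphs concrete, here is a toy sketch in Python. This is not planner code; Path, add_path, the dominance test, and the costs are all invented for illustration, loosely modeled on the planner's add_path():

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Path:
    name: str
    cost: float
    sort_key: Optional[str] = None  # column the output is ordered by, if any

def add_path(paths: List[Path], new: Path) -> None:
    """Keep `new` only if no existing path is both no more expensive and at
    least as useful for ordering; discard any existing paths `new` dominates."""
    def dominates(a: Path, b: Path) -> bool:
        # a dominates b if a costs no more and provides at least as useful
        # an ordering (same sort key, or b is unordered anyway)
        return a.cost <= b.cost and (b.sort_key is None or a.sort_key == b.sort_key)
    if any(dominates(p, new) for p in paths):
        return                                   # new path loses outright
    paths[:] = [p for p in paths if not dominates(new, p)]
    paths.append(new)

# joinrel (a b): an unordered join path, plus a matview scan ordered by a.z
paths: List[Path] = []
add_path(paths, Path("hashjoin(a,b)", cost=100.0))
add_path(paths, Path("indexscan(matview, z)", cost=120.0, sort_key="a.z"))
# both survive: the matview path is pricier but avoids a final sort
add_path(paths, Path("mergejoin(a,b)", cost=150.0))  # dominated, discarded
```

The real add_path() also weighs startup cost, parameterization, and full pathkey lists, but the shape of the argument is the same: the matview path survives only because it offers an ordering no surviving path provides.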

It seems to me that the tricky part of this is not that it might add a
lot more paths (because I don't think it will, and if it does I think
they're useful paths), but that figuring out whether or not a
materialized view matches any particular joinrel might be expensive.
I think we're going to need a pretty accurate heuristic for quickly
discarding materialized views that can't possibly be relevant to the
query as a whole, or to particular joinrels.  There are a couple of
possible ways to approach that.  The most manual of these is probably
to have a command like ALTER TABLE x {ENABLE | DISABLE } REWRITE
MATERIALIZED y, where the user has to explicitly ask for materialized
views to be considered, or else they aren't.  That sort of
fine-grained control might have other benefits as well.

I think a more automated solution is also possible, if we want it.
For a materialized view to match a query, all of the baserels in the
materialized view must also be present in the query.  (Actually, there
are situations where this isn't true; e.g. the materialized view might
have an extra table that is joined in a way the join removal code
could prune away.  But I think we could ignore such
somewhat-pathological cases at least initially.)  It seems to me that
if we could figure out a very-cheap way to throw away all of the
materialized views that don't pass that basic test, we'd be reasonably
close to a workable solution.  A database with tons of materialized
views defined on the same set of target relations might still have
some planning time problems, but so might a database with tons of
indexes.
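As an illustration of that basic test, a pre-filter over some hypothetical catalog data (the matview names and OIDs below are invented) could be as simple as a set-containment check:

```python
# Hypothetical catalog: each matview's set of underlying base-relation OIDs.
matview_baserels = {
    "mv_ab":  {1001, 1002},        # a JOIN b
    "mv_abc": {1001, 1002, 1003},  # a JOIN b JOIN c
    "mv_ad":  {1001, 1004},        # a JOIN d
}

def candidate_matviews(query_rels: set) -> list:
    """Keep only matviews whose base relations all appear in the query."""
    return [mv for mv, rels in matview_baserels.items() if rels <= query_rels]

# Query over a, b, c: mv_ad references d, which the query lacks, so it drops out.
print(candidate_matviews({1001, 1002, 1003}))  # -> ['mv_ab', 'mv_abc']
```

The exact check is cheap per matview, but with very many matviews even this loop adds up, which is where a cheaper first-stage filter comes in.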

In that regard, what I was thinking about is to use something sort of
like a Bloom filter.  Essentially, produce a "fingerprint" for each
materialized view.  For the sake of argument, let's say that the
fingerprint is a 64-bit integer, although it could be a bit string of
any length.  To construct the fingerprint, hash the OID of each
relation involved in the view onto a bit position between 0 and 63.
Set the bits for all relations involved in the materialized view.
Then, construct a fingerprint for the query in the same way.  Any
materialized view where (matview_fingerprint & query_fingerprint) !=
matview_fingerprint needn't be considered; moreover, for a given
joinrel, any materialized view where matview_fingerprint !=
joinrel_fingerprint (computed using the same scheme) needn't be
considered.  Of course, a matching fingerprint doesn't mean that the
materialized view matches, or even necessarily that it involves the
correct set of relations, so there's a lot more double-checking that
has to be done afterwards before you can decide that the view is in
fact a match, but at least it's a fast way to throw away some large
percentage of materialized views that are definitely NOT relevant to
the query, and only do more detailed evaluation on the ones that might
be matches.  The chances of false positives can be reduced if you care
to make the fingerprints bigger (and thus more expensive to compare).
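A toy version of that fingerprinting scheme might look like the following. Python's built-in hash() stands in for whatever OID hash would actually be used; none of this is patch code:

```python
def fingerprint(oids) -> int:
    """Hash each relation OID onto one of 64 bit positions and OR them together."""
    fp = 0
    for oid in oids:
        fp |= 1 << (hash(oid) % 64)
    return fp

def may_match(matview_oids, query_oids) -> bool:
    """Cheap pre-filter: every bit set in the matview's fingerprint must also
    be set in the query's.  False means 'definitely irrelevant'; True still
    requires the exact relation-set check, since distinct OIDs can hash to
    the same bit position."""
    mv_fp, q_fp = fingerprint(matview_oids), fingerprint(query_oids)
    return (mv_fp & q_fp) == mv_fp

# matview over relations {10, 20} against a query over {10, 20, 30}
print(may_match({10, 20}, {10, 20, 30}))  # True: worth the detailed check
print(may_match({10, 40}, {10, 20, 30}))  # False: 40 has no bit in the query
```

For the per-joinrel test described above, the comparison would instead be fingerprint equality, since the matview must involve exactly the joinrel's relations rather than a subset.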

All that having been said, it's hard for me to imagine that anyone
really cares about any of this until we have an incremental update
feature, which right now we don't.  Actually, I'm betting that's going
to be significantly harder than automatic-query-rewrite, when all is
said and done.  Automatic-query-rewrite, if and when we get it, will
not be easy and will require a bunch of work from someone with a good
understanding of the planner, but it strikes me as the sort of thing
that might work out to one large project and then it's done.  Whereas,
incremental update sounds to me like a series of projects over a
series of releases targeting various special cases, where we can
always point to some improvements vs. release N-1 but we're never
actually done and able to move on to the next thing.  As a roadmap
goes, I think that's OK.  Even a reasonably simplistic and partial
implementation of incremental update will benefit a lot of users.  But
in terms of relative difficulty, it's not at all obvious to me that
that's the easier part of the project.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
