Re: [HACKERS] longer-term optimizer musings - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [HACKERS] longer-term optimizer musings
Msg-id 26327.922289494@sss.pgh.pa.us
In response to longer-term optimizer musings  (Erik Riedel <riedel+@CMU.EDU>)
Responses Re: [HACKERS] longer-term optimizer musings
List pgsql-hackers

Erik Riedel <riedel+@CMU.EDU> writes:
> This should be considered a long-term "think about" and "probably
> ignore as too specific" optimization.

Not at all --- it looks like useful stuff to work on.

As Bruce pointed out, the current generation of Postgres developers
don't understand the optimizer very well.  (Bruce and I have both been
digging into it a little, but we certainly welcome anyone else who
wants to study it.)  The optimizer has been suffering from software rot
for several releases now, and as far as I can tell there were a lot of
things that it never really did right in the first place.  So take what
you see with a grain of salt.

> [Problem 1 - no cost estimates are done for sort nodes, and costs are
> not passed up the query tree for Sorts, Groups, or Aggregates]

These things probably need to be fixed.  I have noticed that there are
places where the code does not bother to fill in estimates, for example
in a hash join the hash subnode never gets filled in, but it probably
doesn't matter as long as the top hash node does get filled in.  The
important thing is to propagate reasonable estimates upwards.
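The upward propagation Tom describes can be sketched with a toy plan tree. This is purely illustrative (the node names and costing are hypothetical, not Postgres's actual planner structures): the point is that a node whose estimate omits its children's costs, such as a Sort node estimated at zero, misleads every node above it.

```python
# Toy plan tree illustrating upward cost propagation (hypothetical
# structure; not Postgres's actual Plan nodes or cost formulas).

class PlanNode:
    def __init__(self, name, self_cost, children=()):
        self.name = name
        self.self_cost = self_cost      # cost of this node's own work
        self.children = list(children)

    def total_cost(self):
        # A node's estimate must include everything beneath it;
        # otherwise nodes further up (e.g. an Aggregate over a Sort)
        # see misleadingly cheap inputs.
        return self.self_cost + sum(c.total_cost() for c in self.children)

scan = PlanNode("SeqScan", 100.0)
sort = PlanNode("Sort", 25.0, [scan])    # a zero here is the bug described
agg  = PlanNode("Aggregate", 5.0, [sort])
print(agg.total_cost())  # 130.0
```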

> [Problem 2 - the Aggregate node is way over-estimated by assuming the
> output is as large as the input, but how can you do better? ]

An excellent point.  Bruce's idea of a default 10% estimate seems
reasonable to me (and of course, recognize the non-group-by case).
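Bruce's suggested default amounts to a one-line heuristic, sketched below. The 10% figure is just the proposed default, and the non-group-by case is the one Tom flags: an aggregate without GROUP BY always produces exactly one row.

```python
# Sketch of the proposed default estimate for Aggregate output size.
# The 10% factor is a heuristic default, not a derived value.

def estimate_agg_rows(input_rows, has_group_by):
    if not has_group_by:
        return 1  # e.g. SELECT count(*) FROM t yields exactly one row
    return max(1, int(input_rows * 0.10))

print(estimate_agg_rows(10000, True))   # 1000
print(estimate_agg_rows(10000, False))  # 1
```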

> [ get column min/max values and ] multiply all these values together

You'd have to watch out for integer overflow in this calculation ---
it would be safer to do it in floating point, I think.  A more serious
issue is knowing what the granularity of the type is.  For
example, with a float8 column the min and max values might be 1 and 4,
but that doesn't entitle you to assume that there are only 4 values.
You could really only apply this optimization to int, bool, and char(1)
columns, I think.  Of course, there are plenty of those out there, so
it might still be worth doing.
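The range-product bound, with both caveats applied, might look like the sketch below: the product is computed in floating point to dodge integer overflow, it is only meaningful for discrete types like int, bool, and char(1), and the result is additionally capped by the input row count, since a GROUP BY can never produce more groups than input rows. The function name and interface are illustrative.

```python
# Sketch of bounding the number of GROUP BY groups by the product of
# per-column value ranges. Only valid for discrete types (int, bool,
# char(1)); a float8 column with min 1 and max 4 admits far more than
# 4 distinct values. Float arithmetic avoids integer overflow.

def estimate_groups(column_ranges, input_rows):
    # column_ranges: list of (min, max) pairs for discrete-valued columns
    product = 1.0
    for lo, hi in column_ranges:
        product *= (hi - lo + 1)  # count of possible values in the range
    # Never estimate more groups than there are input rows.
    return min(product, float(input_rows))

# Two int columns spanning 1..4 and 0..9 allow at most 4 * 10 = 40
# combinations, but a 25-row input can yield at most 25 groups.
print(estimate_groups([(1, 4), (0, 9)], 1000))  # 40.0
print(estimate_groups([(1, 4), (0, 9)], 25))    # 25.0
```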

> Also note that none of this actually speeds up even my query, it only
> makes the optimizer estimate much closer to the actual query cost
> (which is what I care about for the work I am doing).

Well, that could result in a more intelligently chosen plan further up
the tree, so it *could* lead to a faster query.  However, this would
only be true if there were important choices to be made at higher tree
levels.  I suppose you would have to be looking at a subselect involving
GROUP BY for this to really make much difference in practice.
        regards, tom lane

