Re: [HACKERS] longer-term optimizer musings - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: [HACKERS] longer-term optimizer musings |
Date | |
Msg-id | 26327.922289494@sss.pgh.pa.us |
In response to | longer-term optimizer musings (Erik Riedel <riedel+@CMU.EDU>) |
Responses | Re: [HACKERS] longer-term optimizer musings |
List | pgsql-hackers |
Erik Riedel <riedel+@CMU.EDU> writes:
> This should be considered a long-term "think about" and "probably
> ignore as too specific" optimization.

Not at all --- it looks like useful stuff to work on. As Bruce pointed
out, the current generation of Postgres developers don't understand the
optimizer very well. (Bruce and I have both been digging into it a
little, but we certainly welcome anyone else who wants to study it.)
The optimizer has been suffering from software rot for several releases
now, and as far as I can tell there were a lot of things that it never
really did right in the first place. So take what you see with a grain
of salt.

> [Problem 1 - no cost estimates are done for sort nodes, and costs are
> not passed up the query tree for Sorts, Groups, or Aggregates]

These things probably need to be fixed. I have noticed that there are
places where the code does not bother to fill in estimates; for
example, in a hash join the hash subnode never gets filled in, but that
probably doesn't matter as long as the top hash node does get filled
in. The important thing is to propagate reasonable estimates upwards.
[See the first sketch below.]

> [Problem 2 - the Aggregate node is way over-estimated by assuming the
> output is as large as the input, but how can you do better? ]

An excellent point. Bruce's idea of a default 10% estimate seems
reasonable to me (and of course, recognize the non-group-by case, where
the aggregate produces exactly one output row). [See the second sketch
below.]

> [ get column min/max values and ] multiply all these values together

You'd have to watch out for integer overflow in this calculation --- it
would be safer to do it in floating point, I think. [See the third
sketch below.] A more serious issue is how you know what the
granularity of the type is. For example, with a float8 column the min
and max values might be 1 and 4, but that doesn't entitle you to assume
that there are only 4 possible values. You could really only apply this
optimization to int, bool, and char(1) columns, I think. Of course,
there are plenty of those out there, so it might still be worth doing.

> Also note that none of this actually speeds up even my query, it only
> makes the optimizer estimate much closer to the actual query cost
> (which is what I care about for the work I am doing).

Well, that could result in a more intelligently chosen plan further up
the tree, so it *could* lead to a faster query. However, this would
only be true if there were important choices to be made at higher tree
levels. I suppose you would have to be looking at a subselect involving
GROUP BY for this to really make much difference in practice.

regards, tom lane
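The three sketches referenced in brackets above are illustrative only;
every name in them is invented for this page and none of it is the
actual Postgres source.

First sketch: a minimal model of the missing Sort-node cost estimate
and its upward propagation, assuming a standard N log N comparison-sort
cost model. The type and function names (`PlanEstimate`,
`cost_sort_sketch`, `cmp_cost`) are hypothetical.

```c
#include <math.h>

/*
 * Hypothetical sketch, not the real optimizer code: estimate the cost
 * of a Sort node and pass the size estimate upward.  A comparison sort
 * does O(N log N) comparisons, so a simple model is
 *     cost = input_cost + cmp_cost * N * log2(N)
 * and since sorting does not change the number of tuples, the row
 * estimate passes through unchanged.
 */
typedef struct PlanEstimate
{
    double  rows;               /* estimated number of output tuples */
    double  cost;               /* estimated total cost, arbitrary units */
} PlanEstimate;

static PlanEstimate
cost_sort_sketch(PlanEstimate input, double cmp_cost)
{
    PlanEstimate result;
    double  n = (input.rows < 2.0) ? 2.0 : input.rows;  /* avoid log2(0) */

    result.rows = input.rows;   /* a sort preserves its input rows */
    result.cost = input.cost + cmp_cost * n * log2(n);
    return result;
}
```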
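Second sketch: the default output-size rule for Aggregate nodes
discussed in the message --- one row for a plain aggregate, and Bruce's
suggested 10% of the input when there is a GROUP BY and nothing better
is known. The function name is again hypothetical.

```c
/*
 * Hypothetical sketch of a default Aggregate row estimate.  Without
 * GROUP BY an aggregate yields exactly one row; with GROUP BY and no
 * column statistics, fall back to 10% of the input (Bruce's default).
 */
static double
estimate_agg_rows_sketch(double input_rows, int has_group_by)
{
    if (!has_group_by)
        return 1.0;                     /* plain aggregate: one row */

    double est = 0.10 * input_rows;     /* default: 10% of the input */
    return (est < 1.0) ? 1.0 : est;     /* never estimate below one row */
}
```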
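Third sketch: the min/max product computed in floating point, as Tom
suggests, so that multiplying many per-column ranges cannot overflow an
integer. Each entry is (max - min + 1) for one grouping column, which
per the granularity caveat is only meaningful for discrete types such
as int, bool, or char(1). The clamp reflects the fact that a grouping
can never produce more groups than it has input rows. Names are
hypothetical.

```c
/*
 * Hypothetical sketch of the min/max product estimate for the number
 * of groups.  Done in double precision to avoid integer overflow, and
 * clamped because the group count cannot exceed the input row count.
 */
static double
estimate_group_count_sketch(const double *range_sizes, int ncols,
                            double input_rows)
{
    double  product = 1.0;

    for (int i = 0; i < ncols; i++)
    {
        product *= range_sizes[i];      /* (max - min + 1) per column */
        if (product > input_rows)
            return input_rows;          /* clamp: no more groups than rows */
    }
    return product;
}
```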