Re: [HACKERS] optimizer and type question - Mailing list pgsql-hackers
From: Tom Lane
Subject: Re: [HACKERS] optimizer and type question
Msg-id: 23292.922151563@sss.pgh.pa.us
In response to: optimizer and type question (Erik Riedel <riedel+@CMU.EDU>)
Responses: Re: [HACKERS] optimizer and type question
           Re: [HACKERS] optimizer and type question
List: pgsql-hackers
Erik Riedel <riedel+@CMU.EDU> writes:
> [ optimizer doesn't find relevant pg_statistic entry ]

It's clearly a bug that the selectivity code is not finding this tuple.
If your analysis is correct, then selectivity estimation has *never*
worked properly, or at least not in recent memory :-(. Yipes. Bruce and
I found a bunch of other problems in the optimizer recently, so it
doesn't faze me to assume that this is broken too.

> the "offending" line is setting the staop to InvalidOid (i.e. 0).
> Question 2 - is this right? Is the intent for 0 to serve as a
> "wildcard",

My thought is that the staop column ought to be the OID of the
comparison operator that was used to determine the sort order of the
column. Without a sort op, the lowest and highest keys in the column
are not well defined, so it makes no sense to assert "these are the
lowest and highest values" without providing the sort op that
determined that. (For sufficiently complex data types one could
reasonably have multiple ordering operators. A crude example is
sorting on "circumference" versus "area" for polygons.) But typically
the sort op will be the "<" operator for the column's data type.

So the vacuum code is definitely broken --- it's not storing the sort
op that it used. The code in gethilokey might be broken too, depending
on how it produces the operator it tries to match against the tuple.
For example, if the actual operator in the query is any of < <= > >=
on int4, then int4lt ought to be used to probe the pg_statistic table.
I'm not sure we have adequate info in pg_operator or pg_type to let
the optimizer determine the right thing to probe with :-(

> The immediate next thing that intltsel() does, near line 122
> in selfuncs.c, is call atol() on the strings from gethilokey(). And
> guess what it comes up with?
> low = 1
> high = 12
> because it calls atol() on '01-02-1992' and '12-01-1998'.
> This clearly isn't right, it should get some large integer that
> includes the year and day in the result. Then it should compare
> reasonably with my constant from the where clause and give a decent
> selectivity value. This leads to a re-visit of Question 1.
> Question 4 - should date "<=" use a dateltsel() function instead of
> intltsel() as oprrest?

This is clearly busted as well. I'm not sure that creating dateltsel()
is the right fix, however, because if you go down that path then every
single datatype needs its own selectivity function; that's more than
we need. What we really want here is to be able to map datatype values
into some sort of numeric range so that we can compute what fraction
of the low-key-to-high-key range is on each side of the probe value
(the constant taken from the query). This general concept will apply
to many scalar types, so what we want is a type-specific mapping
function and a less-specific fraction-computing function. Offhand I'd
say that we want intltsel() and floatltsel(), plus conversion routines
that can produce either int4 or float8 from a data type as seems
appropriate. Anything that couldn't map to one or the other would have
to supply its own selectivity function.

> Or is the problem in the way attribute values are stored in
> pg_statistic by vacuum analyze?

Looks like it converts the low and high values to text and stores them
that way. Ugly as can be :-( but I'm not sure there is a good
alternative. We have no "wild card" column type AFAIK, which is what
these columns of pg_statistic would have to be to allow storage of
unconverted min and max values.

I think you've found a can of worms here. Congratulations ;-)

			regards, tom lane