Re: [HACKERS] optimizer and type question - Mailing list pgsql-hackers
From | Tom Lane
---|---
Subject | Re: [HACKERS] optimizer and type question
Date |
Msg-id | 24453.922208517@sss.pgh.pa.us
In response to | Re: [HACKERS] optimizer and type question (Erik Riedel <riedel+@CMU.EDU>)
List | pgsql-hackers
Erik Riedel <riedel+@CMU.EDU> writes:
> OK, building on your high-level explanation, I am attaching a patch that
> attempts to do something "better" than the current code. Note that I
> have only tested this with the date type and my particular query.

Glad to see you working on this. I don't like the details of your patch too much though ;-). Here are some suggestions for making it better.

1. I think just removing staop from the lookup in gethilokey is OK for now, though I'm dubious about Bruce's thought that we could delete that field entirely. As you observe, vacuum will not currently put more than one tuple for a column into pg_statistic, so we can just do the lookup with relid and attno and leave it at that. But I think we ought to leave the field there, with the idea that vacuum might someday compute more than one statistic for a data column. Fixing vacuum to put its sort op into the field is a good idea in the meantime.

2. The type conversion you're doing in gethilokey is a mess; I think what you ought to make it do is simply the inbound conversion of the string from pg_statistic into the internal representation for the column's datatype, and return that value as a Datum. It also needs a cleaner success/failure return convention --- this business with "n" return is ridiculously type-specific. Also, the best and easiest way to find the type to convert to is to look up the column type in the info for the given relid, not search pg_proc with the staop value. (I'm not sure that will even work, since there are pg_proc entries with wildcard argument types.)

3. The atol() calls currently found in intltsel are a type-specific cheat on what is conceptually a two-step process:

 * Convert the string stored in pg_statistic back to the internal form for the column data type.
 * Generate a numeric representation of the data value that can be used as an estimate of the range of values in the table.
The second step is trivial for integers, which may obscure the fact that there are two steps involved, but nonetheless there are. If you think about applying selectivity logic to strings, say, it becomes clear that the second step is a necessary component of the process. Furthermore, the second step must also be applied to the probe value that's being passed into the selectivity operator. (The probe value is already in internal form, of course; but it is not necessarily in a useful numeric form.)

We can do the first of these steps by applying the appropriate "XXXin" conversion function for the column data type, as you have done. The interesting question is how to do the second one. A really clean solution would require adding a column to pg_type that points to a function that will do the appropriate conversion. I'd be inclined to make all of these functions return "double" (float8) and just have one top-level selectivity routine for all data types that can use range-based selectivity logic.

We could probably hack something together that would not use an explicit conversion function for each data type, but instead would rely on type-specific assumptions inside the selectivity routines. We'd need many more selectivity routines though (at least one for each of int, float4, float8, and text data types) so I'm not sure we'd really save any work compared to doing it right.

BTW, now that I look at this issue it's real clear that the selectivity entries in pg_operator are horribly broken.
The intltsel/intgtsel selectivity routines are currently applied to 32 distinct data types:

regression=> select distinct typname,oprleft from pg_operator, pg_type
regression-> where pg_type.oid = oprleft
regression-> and oprrest in (103,104);
typname  |oprleft
---------+-------
_aclitem |   1034
abstime  |    702
bool     |     16
box      |    603
bpchar   |   1042
char     |     18
cidr     |    650
circle   |    718
date     |   1082
datetime |   1184
float4   |    700
float8   |    701
inet     |    869
int2     |     21
int4     |     23
int8     |     20
line     |    628
lseg     |    601
macaddr  |    829
money    |    790
name     |     19
numeric  |   1700
oid      |     26
oid8     |     30
path     |    602
point    |    600
polygon  |    604
text     |     25
time     |   1083
timespan |   1186
timestamp|   1296
varchar  |   1043
(32 rows)

many of which are very obviously not compatible with integer for *any* purpose. It looks to me like a lot of data types were added to pg_operator just by copy-and-paste, without paying attention to whether the selectivity routines were actually correct for the data type.

As the code stands today, the bogus entries don't matter because gethilokey always fails, so we always get 1/3 as the selectivity estimate for any comparison operator (except = and != of course). I had actually noticed that fact and assumed that it was supposed to work that way :-(. But, clearly, there is code in here that is *trying* to be smarter.

As soon as we fix gethilokey so that it can succeed, we will start getting essentially-random selectivity estimates for those data types that aren't actually binary-compatible with integer. That will not do; we have to do something about the issue.

			regards, tom lane