Thread: Implied Functional Index use
Currently, functional indexes can be used by a query that explicitly mentions the exact phrasing of the functional index within the WHERE clause. IMHO it is feasible to extend the range of WHERE clauses for which functional indexes can be used by using implication, much in the same way that we use transitive closure in join queries now. This has some powerful and important uses, just one of which is index prefix compression (so read on). If we have a subclause within a WHERE clause of the form col1 = k1 AND col2 = k2 AND ... colN = kN this implies that the following clause is also true: col1 = k1 AND col2 = k2 AND ... colN = kN AND F(col1, col2 ... colN) = F(k1, k2 ... kN) iff: - the function F is IMMUTABLE, since that definition implies that the output is always the same for any set of constant inputs. - the operator that connects each col1,k1 pair must be defined as true equality (note that in the above example what looks like the same equality operator is in fact context dependent on the datatypes) - the datatypes of each col1, k1 pair must match If we have a Functional Index F(col1, col2...colN) then we can use the second implied form of the subclause to recognise that the index can be useful for the query. This would then allow a query plan that uses an IndexScan on the functional index with a re-check filter of the original query clause. An example might be a query using the clausesurname = 'RIGGS' would allow us to use an index defined asmetaphone(surname) even though the original application was written before we added the index. Note that the index would be significantly smaller than an index on the full surname, as well as avoiding some of the hotspots caused by certain most-frequent-values in the data. An alternative example might be to define an index onsubstr(text_col, 1, 100) which still allows searching for longer strings without the need to store the complete string in the index. This is effectively index prefix compression, which is not directly possible with pgsql because of the requirements of datatype encapsulation. (Note that this avoids the need to have very long index keys also). One difficulty to this is defining which operators represent true-equality. There isn't a definition of this currently for operators and we cannot assume that an operator called "=" has the required properties, even if that is true for most built-in types. An example of true-equality and its difficulties is for the FLOAT type minus-zero (-0) and plus-zero (+0) have different byte representations yet when compared with the standard FLOAT comparison operator +0 and -0 would be considered equal. If we were to put each value through a hash-like function such as md5() then we would clearly get different answers. To take advantage of this, I propose to - add code to be called from allpaths.c: when we check functional indexes, if they exist and yet are not usable because of a lack of explicit clauses we will check to see if there any clauses that can be used to imply the correct use of the functional index. This is in many ways similar to existing constraint exclusion code. - add a new boolean to pg_operator to allow us to define which operators offer true equality - add a new keyword EQUALITY to CREATE/ALTER OPERATOR (which I think implies HASHES and MERGES if they are not mentioned explicitly). No promises for 8.2, but does anyone have further input? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes: > ... > - add a new boolean to pg_operator to allow us to define which operators > offer true equality > ... This would be useful for other purposes too, as we keep coming up against "what's the equality operator for this datatype" problems. However, the restriction to "true" equality, such that we can assume x = y implies f(x) = f(y) for every immutable function f on the datatype (note this need not necessarily mean bitwise equality --- it depends on what operations the datatype provides), seems like a problem. For instance, the ordinary = operators on float and numeric are NOT true equality, nor do we provide any true equality in this sense for these common datatypes. We could hardly get away with using this concept to drive foreign-key comparisons, if it doesn't work for float or numeric. We could invent some more-complex concept involving "well, this is equality, but there are some functions for which f(x) might differ from f(y) anyway" and then mark the presumably-few functions that could produce divergent results --- examples are sgn() for float8 and anything dependent on dscale for numeric. This seems ugly and error prone however. Anyone have a better idea? regards, tom lane
Am Dienstag, 11. Juli 2006 23:31 schrieb Tom Lane: > We could invent some more-complex concept involving "well, this is > equality, but there are some functions for which f(x) might differ > from f(y) anyway" and then mark the presumably-few functions that > could produce divergent results --- examples are sgn() for float8 > and anything dependent on dscale for numeric. This seems ugly and > error prone however. From a mathematical point of view, it seems cleaner to attach this property to functions, not operators, namely, "this function preserves the following relations". This would allow extending Simon's idea to other operators such as > and < and possibly more mind-boggling cases with geometric operators and such. Admittedly, this would put a lot of additional typing load on function authors, but perhaps it's safer that way. -- Peter Eisentraut http://developer.postgresql.org/~petere/
On Tue, 2006-07-11 at 17:31 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > ... > > - add a new boolean to pg_operator to allow us to define which operators > > offer true equality > > ... > > This would be useful for other purposes too, as we keep coming up > against "what's the equality operator for this datatype" problems. > However, the restriction to "true" equality, such that we can assume > x = y implies f(x) = f(y) for every immutable function f on the datatype > (note this need not necessarily mean bitwise equality --- it depends on > what operations the datatype provides), seems like a problem. For > instance, the ordinary = operators on float and numeric are NOT true > equality, nor do we provide any true equality in this sense for these > common datatypes. We could hardly get away with using this concept > to drive foreign-key comparisons, if it doesn't work for float or > numeric. Normally, I would not suggest that we do things only for certain data types only. In this case however, it seems that the reason it would work only for INTEGER and TEXT data types is that they are simple atomic datatypes that have the required properties. So doing this for those datatypes only seems permissable on a theoretical basis, rather than just because we can't be bothered to do it for more complex types. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Simon Riggs <simon@2ndquadrant.com> writes: > Normally, I would not suggest that we do things only for certain data > types only. In this case however, it seems that the reason it would work > only for INTEGER and TEXT data types is that they are simple atomic > datatypes that have the required properties. So doing this for those > datatypes only seems permissable on a theoretical basis, rather than > just because we can't be bothered to do it for more complex types. There's nothing simple nor atomic about TEXT, and in fact until very recently text_eq was NOT true equality by this definition. See discussions about hu_HU locale back in December. A number of people thought that fix was an ugly kluge, and so we may someday go back to a behavior in which text_eq is again not true equality --- in particular I'm dubious that such a restriction can survive once we support multiple encodings/collations in the same database. More generally, I don't believe in hacks that only work for a small number of built-in types: to me, that's prima facie evidence that you haven't thought the problem through. regards, tom lane
> > > - add a new boolean to pg_operator to allow us to define which > > > operators offer true equality ... > > > > This would be useful for other purposes too, as we keep coming up > > against "what's the equality operator for this datatype" problems. > > However, the restriction to "true" equality, such that we can assume x > > = y implies f(x) = f(y) for every immutable function f on the datatype Maybe we could have a tri (or more) state flag for the equality operators. ' ' .. not an equality op 'e' .. equality 's' .. strict equality (op only true iff the binary representation is equal) Andreas
On Wed, 2006-07-12 at 22:34 -0400, Tom Lane wrote: > More generally, I don't believe in hacks that only work for a small > number of built-in types: to me, that's prima facie evidence that you > haven't thought the problem through. I agree an attempt at a simple definition of the required functional properties hasn't yielded a great solution, so we must go deeper. On Wed, 2006-07-12 at 15:09 +0200, Peter Eisentraut wrote: > From a mathematical point of view, it seems cleaner to attach this property to > functions, not operators, namely, "this function preserves the following > relations". This would allow extending Simon's idea to other operators such > as > and < and possibly more mind-boggling cases with geometric operators and > such. Well, in PG, operators are functions that define various special properties of their inputs/outputs, so it seems the correct place to put these definitions. I agree it seems more normalised to place this information at the function itself, but that is not current precedent. On Thu, 2006-07-13 at 09:14 +0200, Zeugswetter Andreas DCP SD wrote: > Maybe we could have a tri (or more) state flag for the equality > operators. > > ' ' .. not an equality op > 'e' .. equality > 's' .. strict equality (op only true iff the binary representation is > equal) Tom has pointed out that the required functional properties are actually a fourth state between equality and full binary equality. I was trying to avoid introducing new single-use properties, but I think that is the only way here. My concern was not the complexity of specifying this for function authors, but the problem of making an incorrect selection leading to incorrect query results. It seems we need this in the catalog: > ' ' .. not an equality op > 'e' .. equality (sufficient for FKs)'f' .. functional equality (sufficient for this proposed optimisation) > 's' .. strict equality (op only true iff the binary representation is > equal) We're breaking new ground here, but that's a good thing. I'd much rather have an optimisable and extensible type system than a hard-wired one. There is a problem of implication here, AFAICS: When a user SQL asks WHERE col1 = 7 which equality level is meant when several exist? We currently already assume that they mean e-equality, since it is the most useful and intuitive meaning. That is not as strict as f-equality, so we would not be able to make implications from that. I'll think on this some more, but not for 8.2. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
> There is a problem of implication here, AFAICS: > When a user SQL asks > WHERE col1 = 7 > which equality level is meant when several exist? Well, the operator must be unique, so there is no problem. Unique in the sense that an operator with the same name ('=' in this case) and argument types cannot exist for more than one level of equality. (and the level should not have an effect on the resolution) So, when we see col1 = 7 we lookup the equality level of the operator and decide whether it is strict enough for the particular optimization. Andreas