Thread: Index-only scan for "f(x)" without "x"
Referencing the example given in the documentation for index-only scans [0], we consider an index: CREATE INDEX tab_f_x ON tab (f(x)); This index currently will not be used for an index-scan for the following query since the planner isn't smart enough to know that "x" is not needed: SELECT f(x) FROM tab WHERE f(x) < 1; However, any function applied to a column for an index expression is required to be immutable so as far as I can tell the planner doesn't have to be very smart to know that the index can indeed be used for an index-only scan (without having "x" included). One interesting use-case for this is to be able to create space-efficient indexes for raw log data. For example, for each type of message (which might be encoded as JSON), one could create a partial index with the relevant fields extracted and converted into native data types and use index-only scanning to query. This is not particularly attractive today because the message itself would need to be added to the index effectively duplicating the log data. In the same vein, being able to add this auxiliary data (which is basically immutable expressions on one or more columns) explicitly using INCLUDE would make the technique actually reliable. This is not possible right now since expression are not supported as included columns. What's required in order to move forward on these capabilities? [0] https://www.postgresql.org/docs/current/indexes-index-only-scans.html
Malthe <mborch@gmail.com> writes: > Referencing the example given in the documentation for index-only > scans [0], we consider an index: > CREATE INDEX tab_f_x ON tab (f(x)); > This index currently will not be used for an index-scan for the > following query since the planner isn't smart enough to know that "x" > is not needed: > SELECT f(x) FROM tab WHERE f(x) < 1; > However, any function applied to a column for an index expression is > required to be immutable so as far as I can tell the planner doesn't > have to be very smart to know that the index can indeed be used for an > index-only scan (without having "x" included). The problem is that the planner's initial analysis of the query tree concludes that the scan of "tab" has to return "x", because it looks through the tree for plain Vars, and "x" is what it's going to find. This conclusion is in fact true for any plan that doesn't involve scanning a suitable index on f(x), so it's not something we can just dispense with. To back up from that and conclude that the indexscan doesn't really need to return "x" because every use of it is in the context "f(x)" seems like a pretty expensive proposition, especially once you start considering index expressions that are more complex than a single function call. A brute-force matching search could easily be O(N^2) or worse in the size of the query tree, which is a lot to pay when there's a pretty high chance of failure. (IOW, we have to expect that most queries on "tab" are in fact not going to be able to use that index, so we can't afford to spend a lot of planning time just because the index exists.) > What's required in order to move forward on these capabilities? Some non-brute-force solution to the above problem. regards, tom lane
On Thu, 23 Jan 2020 at 00:00, Tom Lane <tgl@sss.pgh.pa.us> wrote: > The problem is that the planner's initial analysis of the query tree > concludes that the scan of "tab" has to return "x", because it looks > through the tree for plain Vars, and "x" is what it's going to find. > This conclusion is in fact true for any plan that doesn't involve > scanning a suitable index on f(x), so it's not something we can just > dispense with. If f(x) was actually added to the table as a virtual generated column V then the planner could perhaps reasonably find that a particular expression depends in part on V (rather than the set of real columns underlying the virtual generated column). That is, the planner now knows that it needs to return V = "f(x)" rather than "x" and so it can match the requirement to our index and decide to use an index-only scan. > To back up from that and conclude that the indexscan doesn't really > need to return "x" because every use of it is in the context "f(x)" > seems like a pretty expensive proposition, especially once you start > considering index expressions that are more complex than a single > function call. A brute-force matching search could easily be O(N^2) > or worse in the size of the query tree, which is a lot to pay when > there's a pretty high chance of failure. (IOW, we have to expect > that most queries on "tab" are in fact not going to be able to use > that index, so we can't afford to spend a lot of planning time just > because the index exists.) In the approach outlined above, the set of expressions to match on would be limited by the set of virtual generated columns defined on the table — some sort of prefix tree that allows efficient matching on a given expression syntax tree? There would be no cost to this on a table that had no virtual generated columns. The question is whether this is a reasonable take on virtual generated columns. --- regards