Thread: Index-only scan for "f(x)" without "x"

Index-only scan for "f(x)" without "x"

From
Malthe
Date:
Referencing the example given in the documentation for index-only
scans [0], we consider an index:

CREATE INDEX tab_f_x ON tab (f(x));

This index currently will not be used for an index-scan for the
following query since the planner isn't smart enough to know that "x"
is not needed:

SELECT f(x) FROM tab WHERE f(x) < 1;

However, any function applied to a column for an index expression is
required to be immutable so as far as I can tell the planner doesn't
have to be very smart to know that the index can indeed be used for an
index-only scan (without having "x" included).

One interesting use-case for this is to be able to create
space-efficient indexes for raw log data. For example, for each type
of message (which might be encoded as JSON), one could create a
partial index with the relevant fields extracted and converted into
native data types and use index-only scanning to query. This is not
particularly attractive today because the message itself would need to
be added to the index effectively duplicating the log data.

In the same vein, being able to add this auxiliary data (which is
basically immutable expressions on one or more columns) explicitly
using INCLUDE would make the technique actually reliable. This is not
possible right now since expression are not supported as included
columns.

What's required in order to move forward on these capabilities?

[0] https://www.postgresql.org/docs/current/indexes-index-only-scans.html



Re: Index-only scan for "f(x)" without "x"

From
Tom Lane
Date:
Malthe <mborch@gmail.com> writes:
> Referencing the example given in the documentation for index-only
> scans [0], we consider an index:
> CREATE INDEX tab_f_x ON tab (f(x));
> This index currently will not be used for an index-scan for the
> following query since the planner isn't smart enough to know that "x"
> is not needed:
> SELECT f(x) FROM tab WHERE f(x) < 1;
> However, any function applied to a column for an index expression is
> required to be immutable so as far as I can tell the planner doesn't
> have to be very smart to know that the index can indeed be used for an
> index-only scan (without having "x" included).

The problem is that the planner's initial analysis of the query tree
concludes that the scan of "tab" has to return "x", because it looks
through the tree for plain Vars, and "x" is what it's going to find.
This conclusion is in fact true for any plan that doesn't involve
scanning a suitable index on f(x), so it's not something we can just
dispense with.

To back up from that and conclude that the indexscan doesn't really
need to return "x" because every use of it is in the context "f(x)"
seems like a pretty expensive proposition, especially once you start
considering index expressions that are more complex than a single
function call.  A brute-force matching search could easily be O(N^2)
or worse in the size of the query tree, which is a lot to pay when
there's a pretty high chance of failure.  (IOW, we have to expect
that most queries on "tab" are in fact not going to be able to use
that index, so we can't afford to spend a lot of planning time just
because the index exists.)

> What's required in order to move forward on these capabilities?

Some non-brute-force solution to the above problem.

            regards, tom lane



Re: Index-only scan for "f(x)" without "x"

From
Malthe
Date:
On Thu, 23 Jan 2020 at 00:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The problem is that the planner's initial analysis of the query tree
> concludes that the scan of "tab" has to return "x", because it looks
> through the tree for plain Vars, and "x" is what it's going to find.
> This conclusion is in fact true for any plan that doesn't involve
> scanning a suitable index on f(x), so it's not something we can just
> dispense with.

If f(x) was actually added to the table as a virtual generated column
V then the planner could perhaps reasonably find that a particular
expression depends in part on V (rather than the set of real columns
underlying the virtual generated column).

That is, the planner now knows that it needs to return V = "f(x)"
rather than "x" and so it can match the requirement to our index and
decide to use an index-only scan.

> To back up from that and conclude that the indexscan doesn't really
> need to return "x" because every use of it is in the context "f(x)"
> seems like a pretty expensive proposition, especially once you start
> considering index expressions that are more complex than a single
> function call.  A brute-force matching search could easily be O(N^2)
> or worse in the size of the query tree, which is a lot to pay when
> there's a pretty high chance of failure.  (IOW, we have to expect
> that most queries on "tab" are in fact not going to be able to use
> that index, so we can't afford to spend a lot of planning time just
> because the index exists.)

In the approach outlined above, the set of expressions to match on
would be limited by the set of virtual generated columns defined on
the table — some sort of prefix tree that allows efficient matching on
a given expression syntax tree? There would be no cost to this on a
table that had no virtual generated columns.

The question is whether this is a reasonable take on virtual generated columns.

       ---    regards