Thread: BUG #16993: Query Planner does not use index for EXISTS-query on a large table with all equal foreign key values

The following bug has been logged on the website:

Bug reference:      16993
Logged by:          A Schnabl
Email address:      a.schnabl@synedra.com
PostgreSQL version: 13.2
Operating system:   Windows 10
Description:

Hello,

I can reproduce a query planner problem on a fresh default installation of
postgres-13.2 on my Windows 10 machine. I also saw it occur on
postgres-12.6 installations on Windows 10 and CentOS 8. I used the following
table setup:

CREATE TABLE data_node (id BIGINT PRIMARY KEY);
CREATE TABLE data_entry (id BIGINT PRIMARY KEY, node_fk BIGINT NOT NULL,
FOREIGN KEY (node_fk) REFERENCES data_node (id));
CREATE INDEX node_ix ON data_entry USING BTREE (node_fk);
INSERT INTO data_node (id) SELECT generate_series(1,10);
INSERT INTO data_entry (id, node_fk) SELECT s, 2 FROM
generate_series(1,10000000) s;
VACUUM ANALYZE data_node;
VACUUM ANALYZE data_entry;

With the following query, I select all entries of the data_node table (only
a single one with the data given here) that are referenced by the data_entry
table:

SELECT * FROM data_node
WHERE EXISTS (
 SELECT 1 FROM data_entry WHERE node_fk = data_node.id
);

The query takes a few seconds to complete, although I expect it to be much
faster (as shown below, it can be performed in less than a millisecond) due
to the node_ix index. EXPLAIN ANALYZE shows the following query plan:

Merge Join  (cost=179053.41..179054.24 rows=1 width=8) (actual
time=2140.820..2140.822 rows=1 loops=1)
  Merge Cond: (data_node.id = data_entry.node_fk)
  ->  Index Only Scan using data_node_pkey on data_node  (cost=0.14..8.29
rows=10 width=8) (actual time=0.006..0.010 rows=3 loops=1)
        Heap Fetches: 0
  ->  Sort  (cost=179053.27..179053.28 rows=1 width=8) (actual
time=2140.805..2140.806 rows=1 loops=1)
        Sort Key: data_entry.node_fk
        Sort Method: quicksort  Memory: 25kB
        ->  HashAggregate  (cost=179053.25..179053.26 rows=1 width=8)
(actual time=2140.796..2140.796 rows=1 loops=1)
              Group Key: data_entry.node_fk
              Batches: 1  Memory Usage: 24kB
              ->  Seq Scan on data_entry  (cost=0.00..154053.60 rows=9999860
width=8) (actual time=0.036..973.376 rows=10000000 loops=1)
Planning Time: 0.192 ms
Execution Time: 2140.873 ms

Note that a sequential scan on data_entry is used here rather than some
index (only) scans on node_ix. If I

SET enable_seqscan = false;

and EXPLAIN ANALYZE the query again, I get the following result:

Nested Loop Semi Join  (cost=0.57..12.85 rows=1 width=8) (actual
time=0.877..0.920 rows=1 loops=1)
  ->  Index Only Scan using data_node_pkey on data_node  (cost=0.14..8.29
rows=10 width=8) (actual time=0.006..0.011 rows=10 loops=1)
        Heap Fetches: 0
  ->  Index Only Scan using node_ix on data_entry  (cost=0.43..178143.98
rows=9999860 width=8) (actual time=0.089..0.089 rows=0 loops=10)
        Index Cond: (node_fk = data_node.id)
        Heap Fetches: 0
Planning Time: 0.328 ms
Execution Time: 0.959 ms

Due to the huge performance gap between the plans, I think that the query
planner should use the node_ix index here even without enable_seqscan being
set to false.


PG Bug reporting form <noreply@postgresql.org> writes:
> ... EXPLAIN ANALYZE shows the following query plan:

> Merge Join  (cost=179053.41..179054.24 rows=1 width=8) (actual
> time=2140.820..2140.822 rows=1 loops=1)

> [ but with ]  SET enable_seqscan = false:

> Nested Loop Semi Join  (cost=0.57..12.85 rows=1 width=8) (actual
> time=0.877..0.920 rows=1 loops=1)

Yeah, that seems *extremely* broken.  The planner correctly estimates
the second plan as much cheaper, so why did it fail to choose that one
without help?  Will look into this.

            regards, tom lane



PG Bug reporting form <noreply@postgresql.org> writes:
> CREATE TABLE data_node (id BIGINT PRIMARY KEY);
> CREATE TABLE data_entry (id BIGINT PRIMARY KEY, node_fk BIGINT NOT NULL,
> FOREIGN KEY (node_fk) REFERENCES data_node (id));
> CREATE INDEX node_ix ON data_entry USING BTREE (node_fk);
> INSERT INTO data_node (id) SELECT generate_series(1,10);
> INSERT INTO data_entry (id, node_fk) SELECT s, 2 FROM
> generate_series(1,10000000) s;
> VACUUM ANALYZE data_node;
> VACUUM ANALYZE data_entry;
> ...
> SELECT * FROM data_node
> WHERE EXISTS (
>  SELECT 1 FROM data_entry WHERE node_fk = data_node.id
> );

I looked into this.  The reason for the strange behavior is that,
thanks to the extremely skewed data distribution in data_entry.node_fk,
the indexscan you're hoping it would use is actually costed as being
more expensive than a seqscan.  That causes it to get discarded at
the relation planning level.  Later on, when we consider join plans,
there's a special exception that understands that a semijoin with
an inner indexscan that uses all the join quals is particularly
efficient; which is why the desirable plan is correctly given a
small cost, *if* we consider it at all.  But if the indexscan was
already thrown away, we won't.  Of course, disabling the seqscan
allows the indexscan to survive the scan-level path tournament.
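To make the numbers concrete, here is a back-of-the-envelope sketch of that scan-level comparison (the formulas are deliberate simplifications, not PostgreSQL's actual cost model; the cost figures are taken from the EXPLAIN output above):

```python
# Rough sketch of why the indexscan loses the scan-level "tournament".
# Simplified arithmetic only -- not the real cost model.

ntuples = 10_000_000   # rows in data_entry
ndistinct = 1          # every node_fk value is 2

# var_eq_non_const: the comparison value is unknown at plan time, so it
# is assumed to be one of the ndistinct values present in the column.
selectivity = 1.0 / ndistinct
estimated_rows = int(selectivity * ntuples)

# Total costs reported by the two plans above:
seqscan_cost = 154053.60
indexscan_cost = 178143.98   # costed as if it must return all 10M rows

# The indexscan looks strictly worse at the scan level and is thrown
# away, so the cheap semijoin plan is never built from it.
print(estimated_rows)                  # 10000000
print(indexscan_cost > seqscan_cost)   # True
```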

I'm not sure what a good fix for this would look like.  We could imagine
trying to mark indexscan paths that satisfy has_indexed_join_quals()
at time of generation, and then giving them a free pass in add_path().
However
(1) as written, has_indexed_join_quals() depends on the particular
join being considered.  That might be just an artifact of coding
convenience, but I'm not sure.
(2) add_path() is enough of a hot spot that giving it yet more things
to worry about is fairly unattractive.
(3) allowing more paths to survive the tournament slows things down
overall.

Tweaking the example shows that you need a *very* skewed data
distribution to make this happen, so maybe we should write it off
as a corner case that's not worth fixing.  I'm for sure not eager
to slow planning down overall to improve this case, so I don't
much like the above proposal even if it could be made to work.

A different idea is to reconsider the selectivity estimation rules
with an eye to cutting the estimated cost of the indexscan.  The
thing that I'm wondering about here is that we're estimating that
using eqsel() rather than eqjoinsel(), and it seems like the latter
might give us a more useful value.  AFAIR we never thought hard
about that aspect when we invented parameterized plans.  But for
sure, changing that would be a research project with a lot of
potential impact.
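As a rough illustration of how the two estimation rules diverge on this example (simplified: eqjoinsel_inner also consults MCV lists, which are omitted here, and the "1 / max(ndistinct)" form is only the no-stats fallback shape of the rule):

```python
# Simplified selectivity shapes, assuming no MCV lists are involved.
# eqsel-style estimate for "node_fk = <unknown value>":
#   1 / ndistinct of the indexed column.
# eqjoinsel-style estimate for "node_fk = data_node.id":
#   1 / max(ndistinct of either side of the join).

nd_inner = 1    # data_entry.node_fk: one distinct value
nd_outer = 10   # data_node.id: ten distinct values

eqsel_estimate = 1.0 / nd_inner                      # 1.0
eqjoinsel_estimate = 1.0 / max(nd_inner, nd_outer)   # 0.1

# Applied to the 10M-row table, the join-style rule would cost the
# parameterized indexscan as fetching ~1M rows per probe rather than
# all 10M -- still a large overestimate here, but far less pessimistic.
print(eqsel_estimate, eqjoinsel_estimate)   # 1.0 0.1
```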

A narrower fix would be to hack var_eq_non_const so that it doesn't
assume that the comparison value must be one of the entries in the
column.  But it seems like whatever change we made in that line would
be a very unprincipled hack, because what are you going to assume
instead?

Anyway, short answer is that this is more complicated than it
looks, and there isn't any simple improvement we can make.

            regards, tom lane



On Thu, 6 May 2021 at 05:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> A narrower fix would be to hack var_eq_non_const so that it doesn't
> assume that the comparison value must be one of the entries in the
> column.  But it seems like whatever change we made in that line would
> be a very unprincipled hack, because what are you going to assume
> instead?

Yeah, this is the same problem as I was mentioning in [1].

My solution was to go for that "unprincipled hack" in var_eq_non_const().

I'm not sure I 100% agree that it's a complete hack; you don't really
have to change the n_distinct by much to get the good plan.  It's a
massive risk to assume that the given value will *always* be the
single distinct value that's indexed.

# SELECT * FROM data_node WHERE EXISTS (SELECT 1 FROM data_entry WHERE
node_fk = data_node.id);
Time: 4807.956 ms (00:04.808)
# alter table data_entry alter column node_fk set (n_distinct = 2);
# analyze data_entry;
# SELECT * FROM data_node WHERE EXISTS (SELECT 1 FROM data_entry WHERE
node_fk = data_node.id);
Time: 3.930 ms

I just feel like it's a huge risk to reject an index path of a column
with 1 distinct value with the assumption that the value that's going
to be looked up *is* that 1 distinct value.  If the index lookup is
done on any of the other 2^64-1 values (in this case) then the index
path would be a *major* win when compared to a seqscan path. The risk
to reward ratio of what we do now is outrageous.
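The effect of the n_distinct tweak above can be sketched as follows (this assumes, purely for illustration, that the parameterized indexscan's estimated cost scales linearly with the selectivity estimate; the cost figures come from the earlier plans):

```python
# Why a small n_distinct bump is enough to flip the plan.
# Simplified arithmetic only -- not the real cost model.

seqscan_cost = 154053.60         # from the seqscan plan above
indexscan_cost_full = 178143.98  # indexscan costed at selectivity 1.0

def indexscan_cost(ndistinct):
    # Assumption: indexscan cost scales roughly linearly with the
    # estimated fraction of rows fetched (1 / ndistinct).
    return indexscan_cost_full * (1.0 / ndistinct)

# With the default n_distinct = 1, the index path looks more expensive
# than the seqscan and is discarded at the scan level:
print(indexscan_cost(1) > seqscan_cost)   # True

# Bumping n_distinct to 2 halves the estimate, the index path survives
# the scan-level comparison, and the semijoin logic at the join level
# then prices the plan correctly:
print(indexscan_cost(2) < seqscan_cost)   # True
```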

David

[1] https://www.postgresql.org/message-id/CAApHDvpbJHwMZ1U-nzU0kBxu0kwMpBvyL+AFWvFAmurypSo1SQ@mail.gmail.com



David Rowley <dgrowleyml@gmail.com> writes:
> On Thu, 6 May 2021 at 05:47, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A narrower fix would be to hack var_eq_non_const so that it doesn't
>> assume that the comparison value must be one of the entries in the
>> column.  But it seems like whatever change we made in that line would
>> be a very unprincipled hack, because what are you going to assume
>> instead?

> Yeah, this is the same problem as I was mentioning in [1]
> My solution was to go for that "unprincipled hack" in var_eq_non_const().
> I'm not sure I 100% agree that it's a complete hack; you don't really
> have to change the n_distinct by much to get the good plan.  It's a
> massive risk to assume that the given value will *always* be the
> single distinct value that's indexed.

Yeah, I agree that it doesn't seem great to let var_eq_non_const return
1.0 when it has no idea what the comparison value is.  However, that
doesn't translate to having much confidence in any other value either.
The actual number-of-rows-fetched, given that we know the column
contents are all the same value, is either zero or the whole table.
It's hard to do much with that; and biasing it towards believing the
optimistic value over the pessimistic value seems dangerous.

In any case, I think the real issue here is that we know that *in use*,
the indexscan will fetch either zero or one row, and be pretty quick
either way.  The problem is that that knowledge is being applied
at the join level, which is too late to save the path from losing.
How could we move that knowledge down to the scan path costs?

            regards, tom lane



On Thu, 6 May 2021 at 15:28, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, I agree that it doesn't seem great to let var_eq_non_const return
> 1.0 when it has no idea what the comparison value is.  However, that
> doesn't translate to having much confidence in any other value either.
> The actual number-of-rows-fetched, given that we know the column
> contents are all the same value, is either zero or the whole table.
> It's hard to do much with that; and biasing it towards believing the
> optimistic value over the pessimistic value seems dangerous.

Maybe we should also be looking at var_eq_non_const()'s 'other' field
and seeing if that's a Var with stats and dividing the results by the
number of distinct values in that, or at least doing something closer
to what eqjoinsel_inner() does when it's prompted with the same
problem.

David



David Rowley <dgrowleyml@gmail.com> writes:
> On Thu, 6 May 2021 at 15:28, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, I agree that it doesn't seem great to let var_eq_non_const return
>> 1.0 when it has no idea what the comparison value is.  However, that
>> doesn't translate to having much confidence in any other value either.
>> The actual number-of-rows-fetched, given that we know the column
>> contents are all the same value, is either zero or the whole table.
>> It's hard to do much with that; and biasing it towards believing the
>> optimistic value over the pessimistic value seems dangerous.

> Maybe we should also be looking at var_eq_non_const()'s 'other' field
> and seeing if that's a Var with stats and dividing the results by the
> number of distinct values in that, or at least doing something closer
> to what eqjoinsel_inner() does when it's prompted with the same
> problem.

Yeah, as I mentioned upthread, it's tempting to think about applying
join selectivity rules when dealing with a parameterized path's
restriction-qual-that's-really-a-join-qual.  I'm not sure if that
selectivity is expressed the right way as things stand (i.e.,
fraction of the Cartesian product size isn't the same thing as
fraction of the inner relation).  In any case that seems like a bit
of a research problem.

            regards, tom lane