Thread: Seq scan instead of index scan querying single row from primary key on large table

Hi all,

I've been optimizing queries for a long time, and I don't think I've
ever seen something more surprising to me than this -- sufficiently so
that I wanted to ask if others thought it implied a bug. It's possible
my mental model for the planner is broken in some significant way, or
that I'm overlooking something obvious, so please let me know if
that's the case too.

We have a query like this (all tables renamed to be generic):

SELECT DISTINCT "objects"."pk"
FROM "objects"
LEFT OUTER JOIN "objects" "allowed_objects_objects"
  ON "allowed_objects_objects"."pk" = "objects"."allowed_object_fk"
 AND "objects"."allowed_object_fk" IS NOT NULL
LEFT OUTER JOIN "facilitated_object_metadata"
  ON "facilitated_object_metadata"."object_fk" = "objects"."pk"
LEFT OUTER JOIN "object_audits"
  ON "object_audits"."object_fk" = "objects"."pk"
LEFT OUTER JOIN "objects" "action_objects_objects"
  ON "action_objects_objects"."allowed_object_fk" IS NOT NULL
 AND "action_objects_objects"."allowed_object_fk" = "objects"."pk"
LEFT OUTER JOIN "objects" "return_objects_objects"
  ON "return_objects_objects"."returned_object_fk" IS NOT NULL
 AND "return_objects_objects"."returned_object_fk" = "objects"."pk"
LEFT OUTER JOIN "object_taxs" ON "object_taxs"."object_fk" = "objects"."pk"
LEFT OUTER JOIN "object_configs" ON "object_configs"."object_fk" = "objects"."pk"
LEFT OUTER JOIN "object_statuses" ON "object_statuses"."object_fk" = "objects"."pk"
LEFT OUTER JOIN "object_edits" ON "object_edits"."object_fk" = "objects"."pk"
LEFT OUTER JOIN "audit_answers"
  ON "audit_answers"."requester_type" = 'Transaction'
 AND "audit_answers"."requester_fk" = "objects"."pk"
WHERE "objects"."pk" = 131690144
LIMIT 1;

This is ORM-generated, so, granted, there's some slight oddity in the
combination of "SELECT DISTINCT pk" and "LIMIT 1".

The plan generated by the planner changed suddenly one morning this
week, and in a very surprising way: the innermost scan (of "objects")
started choosing a seq scan, despite the cost from that node being
very high and an index scan being possible -- it's the primary key and
we're restricting on a single value, so intuitively we know, and
indeed the planner knows, that there will only be a single row
returned.

Here's the surprising plan:

Limit  (cost=3.42..4.00 rows=1 width=8)
  ->  Limit  (cost=3.42..4.00 rows=1 width=8)
        ->  Nested Loop Left Join  (cost=3.42..7939067.28 rows=13777920 width=8)
              ->  Nested Loop Left Join  (cost=2.86..7766839.95 rows=1059840 width=8)
                    ->  Nested Loop Left Join  (cost=2.44..7753589.13 rows=211968 width=8)
                          ->  Nested Loop Left Join  (cost=1.86..7750902.24 rows=138 width=8)
                                ->  Nested Loop Left Join  (cost=1.29..7750895.86 rows=1 width=8)
                                      ->  Nested Loop Left Join  (cost=1.29..7750895.85 rows=1 width=8)
                                            ->  Nested Loop Left Join  (cost=0.86..7750893.39 rows=1 width=8)
                                                  ->  Nested Loop Left Join  (cost=0.43..7750890.93 rows=1 width=8)
                                                        ->  Seq Scan on objects  (cost=0.00..7750888.47 rows=1 width=16)
                                                              Filter: (pk = 131690144)
                                                        ->  Index Only Scan using index_facilitated_object_metadata_on_object_fk on facilitated_object_metadata  (cost=0.43..2.45 rows=1 width=8)
                                                              Index Cond: (object_fk = 131690144)
                                                  ->  Index Only Scan using index_objects_on_allowed_object_fk_not_null on objects action_objects_objects  (cost=0.43..2.45 rows=1 width=8)
                                                        Index Cond: (allowed_object_fk = 131690144)
                                            ->  Index Only Scan using index_objects_on_returned_object_fk_not_null on objects return_objects_objects  (cost=0.43..2.45 rows=1 width=8)
                                                  Index Cond: (returned_object_fk = 131690144)
                                      ->  Seq Scan on object_taxs  (cost=0.00..0.00 rows=1 width=8)
                                            Filter: (object_fk = 131690144)
                                ->  Index Only Scan using index_object_audits_on_object_fk on object_audits  (cost=0.57..5.00 rows=138 width=8)
                                      Index Cond: (object_fk = 131690144)
                          ->  Materialize  (cost=0.57..41.13 rows=1536 width=8)
                                ->  Index Only Scan using index_object_configs_on_object_id on object_configs  (cost=0.57..33.45 rows=1536 width=8)
                                      Index Cond: (object_fk = 131690144)
                    ->  Materialize  (cost=0.42..2.84 rows=5 width=8)
                          ->  Index Only Scan using index_adjustment_responses_on_response_fk on object_edits  (cost=0.42..2.81 rows=5 width=8)
                                Index Cond: (object_fk = 131690144)
              ->  Materialize  (cost=0.56..3.36 rows=13 width=8)
                    ->  Index Only Scan using index_audit_answers_on_requester_type_and_fk on audit_answers  (cost=0.56..3.30 rows=13 width=8)
                          Index Cond: ((requester_type = 'Object'::bt_object_or_alternate_object) AND (requester_fk = 131690144))


Note the innermost table scan:
Seq Scan on objects  (cost=0.00..7750888.47 rows=1 width=16)
   Filter: (pk = 131690144)

If I set enable_seqscan = off, then I get the old plan:

Limit  (cost=4.12..4.13 rows=1 width=8)
  ->  Limit  (cost=4.12..4.13 rows=1 width=8)
        ->  Nested Loop Left Join  (cost=4.12..188183.54 rows=13777920 width=8)
              ->  Nested Loop Left Join  (cost=3.55..15956.20 rows=1059840 width=8)
                    ->  Nested Loop Left Join  (cost=3.13..2705.38 rows=211968 width=8)
                          ->  Nested Loop Left Join  (cost=2.55..18.49 rows=138 width=8)
                                ->  Nested Loop Left Join  (cost=1.98..12.11 rows=1 width=8)
                                      ->  Nested Loop Left Join  (cost=1.86..9.96 rows=1 width=8)
                                            ->  Nested Loop Left Join  (cost=1.43..7.50 rows=1 width=8)
                                                  ->  Nested Loop Left Join  (cost=0.99..5.04 rows=1 width=8)
                                                        ->  Index Scan using objects_pkey on objects  (cost=0.57..2.58 rows=1 width=16)
                                                              Index Cond: (pk = 131690144)
                                                        ->  Index Only Scan using index_facilitated_object_metadata_on_object_fk on facilitated_object_metadata  (cost=0.43..2.45 rows=1 width=8)
                                                              Index Cond: (object_fk = 131690144)
                                                  ->  Index Only Scan using index_objects_on_allowed_object_fk_not_null on objects action_objects_objects  (cost=0.43..2.45 rows=1 width=8)
                                                        Index Cond: (allowed_object_fk = 131690144)
                                            ->  Index Only Scan using index_objects_on_returned_object_fk_not_null on objects return_objects_objects  (cost=0.43..2.45 rows=1 width=8)
                                                  Index Cond: (returned_object_fk = 131690144)
                                      ->  Index Only Scan using index_object_taxs_on_object_fk on object_taxs  (cost=0.12..2.14 rows=1 width=8)
                                            Index Cond: (object_fk = 131690144)
                                ->  Index Only Scan using index_object_audits_on_object_fk on object_audits  (cost=0.57..5.00 rows=138 width=8)
                                      Index Cond: (object_fk = 131690144)
                          ->  Materialize  (cost=0.57..41.13 rows=1536 width=8)
                                ->  Index Only Scan using index_object_configs_on_object_id on object_configs  (cost=0.57..33.45 rows=1536 width=8)
                                      Index Cond: (object_fk = 131690144)
                    ->  Materialize  (cost=0.42..2.84 rows=5 width=8)
                          ->  Index Only Scan using index_adjustment_responses_on_response_fk on object_edits  (cost=0.42..2.81 rows=5 width=8)
                                Index Cond: (object_fk = 131690144)
              ->  Materialize  (cost=0.56..3.36 rows=13 width=8)
                    ->  Index Only Scan using index_audit_answers_on_requester_type_and_fk on audit_answers  (cost=0.56..3.30 rows=13 width=8)
                          Index Cond: ((requester_type = 'Object'::bt_object_or_alternate_object) AND (requester_fk = 131690144))


Notice the innermost table scan is the expected index scan:
 ->  Index Scan using objects_pkey on objects  (cost=0.57..2.58 rows=1 width=16)
       Index Cond: (pk = 131690144)

The join order stays the same between the two plans. The scan on
object_taxs changes from a seq scan to an index scan (the table is
empty, so neither would be a problem), obviously because sequential
scans are disabled.

Analyzing object_audits during the incident did *not* change the plan,
but a later auto-analyze of that same table seemed to change the plan
back. That said, we're now back at the bad plan (without setting
enable_seqscan = off). I don't understand why that analyze should
affect anything at all, given that the join order stays the same
between the good and bad plans, and given that I'd expect the planner
to strongly prefer the index scan because of its far lower cost for
the estimated number of tuples. Analyzing objects doesn't help.
Removing the DISTINCT does not fix the plan.

Removing the LIMIT 1 does fix the plan and removes the double LIMIT
node. The second LIMIT node is implicit from the DISTINCT, and it
doesn't seem to me that an outer LIMIT node should change the cost or
path chosen by an identical LIMIT node inside it; indeed, the costs
appear to be the same on both.

The only explanation I can come up with is that the startup cost for
the seq scan is 0 while for the index scan it is 0.57, and the LIMIT
at the top level is causing the planner to care about startup cost.

Assuming that's true, I think the early-return cost scaling for the
LIMIT is being applied very naively to the seq scan node. Or perhaps
the issue is that a seq scan like this shouldn't really have a startup
cost of 0 for its first tuple -- that 0 is presumably the cost before
any tuples are returned, *without* having applied the filter. That
seems slightly odd to me, because the cost of getting the first row
out of that node -- in my naive view, thinking about it for all of 5
seconds -- should be calculated by applying the filter (and thus the
likelihood that the filter matches right away). If we did that, this
path would never win. But that 0.00 startup cost for a seq scan with a
filter shows up on PG14 and PG11 as well, not just PG16, so this isn't
something that changed recently.

To recap: the row estimate is correct, and the estimated (high) total
cost of the seq scan is correct, but the seq scan is chosen over the
index scan anyway for a plan that returns a single "random" row
selected by primary key.
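As a sanity check on that hypothesis, here's the arithmetic in Python
applied to the top join nodes of the two plans above. This is my
reconstruction of the pro-rata LIMIT costing, not actual planner code:

```python
# Reconstructed LIMIT costing (assumption: a LIMIT of N rows charges
# the input's startup cost plus its run cost scaled by N/estimated_rows):
def limit_cost(startup, total, est_rows, wanted=1):
    return startup + (total - startup) * wanted / est_rows

# Top join node of each plan above (both estimate 13777920 rows):
bad  = limit_cost(3.42, 7939067.28, 13777920)  # seq scan plan
good = limit_cost(4.12,  188183.54, 13777920)  # index scan plan

print(f"seq scan plan:   {bad:.2f}")   # ~4.00
print(f"index scan plan: {good:.2f}")  # ~4.13
# The seq scan plan's lower startup cost lets it "win" by ~0.13.
```

If that reconstruction is right, it reproduces the 4.00 vs 4.13 Limit
costs shown in the plans, which is why the seq scan plan is chosen.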

Am I right to be surprised here?

James Coleman



James Coleman <jtc331@gmail.com> writes:
> The plan generated by the planner changed suddenly one morning this
> week, and in a very surprising way: the innermost scan (of "objects")
> started choosing a seq scan, despite the cost from that node being
> very high and an index scan being possible

That looks weird to me too, but given the number of tables involved
I wonder what you have join_collapse_limit/from_collapse_limit set
to.  If those are smaller than the query's relation count it could
be that this is an artifact of optimization of a join subproblem.
However, if it's the very same query you've been generating all along,
this theory doesn't really explain the sudden change of plan choice.

Also, even granting the bad-peephole-optimization theory, it seems
like the best path for the objects table alone would still have been
the index scan, so I'm confused too.  What nondefault planner settings
have you got?  ("EXPLAIN (SETTINGS)" would help with answering that
accurately.)

Are you really sure nothing changed about what the ORM is emitting?

            regards, tom lane



On Thu, Jul 18, 2024 at 2:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> James Coleman <jtc331@gmail.com> writes:
> > The plan generated by the planner changed suddenly one morning this
> > week, and in a very surprising way: the innermost scan (of "objects")
> > started choosing a seq scan, despite the cost from that node being
> > very high and an index scan being possible
>
> That looks weird to me too, but given the number of tables involved
> I wonder what you have join_collapse_limit/from_collapse_limit set
> to.  If those are smaller than the query's relation count it could
> be that this is an artifact of optimization of a join subproblem.
> However, if it's the very same query you've been generating all along,
> this theory doesn't really explain the sudden change of plan choice.

Those GUCs are both set to 8, so... that could be a factor, except
that, as you noted, if that variable isn't changing during the
flip-flop of plans, it's hard to make it the explanation. See below
for more on this.

> Also, even granting the bad-peephole-optimization theory, it seems
> like the best path for the objects table alone would still have been
> the index scan, so I'm confused too.

I'm glad I'm not the only one.

> What nondefault planner settings
> have you got?  ("EXPLAIN (SETTINGS)" would help with answering that
> accurately.)

Settings: max_parallel_workers = '24', maintenance_io_concurrency =
'1', effective_cache_size = '264741832kB', work_mem = '16MB',
random_page_cost = '1', search_path = 'object_shard,public'

Of those I'd expect random_page_cost to steer it towards the index
scan rather than away from it. The others don't seem to me like they
should affect which path would be the best one for the objects table
alone.

> Are you really sure nothing changed about what the ORM is emitting?

Yes. Aside from the fact that the application code didn't change, we
reproduced the problem by restoring a physical snapshot of the
database: we got the bad plan, and then got it to change to the good
plan by analyzing object_audits. Additionally, on the live production
database (well, a read replica anyway), it had switched back to the
bad plan today, and then an hour later it had swapped back again for
the same exact query.

Thanks,
James Coleman



James Coleman <jtc331@gmail.com> writes:
> On Thu, Jul 18, 2024 at 2:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Are you really sure nothing changed about what the ORM is emitting?

> Yes, aside from the fact that application code didn't change, we
> reproduced the problem by restoring a physical snapshot of the
> database and were able to get the bad plan and then get it to change
> to the good plan by analyzing object_audits.

OK.  After staring at this for awhile, I think I see what's happening,
and you're right that it's basically a problem with how we do LIMIT.

Consider this simple test case:

regression=# create table t1 (id int primary key, link int);
CREATE TABLE
regression=# insert into t1 select id, id/1000 from generate_series(1,1000000) id;
INSERT 0 1000000
regression=# vacuum analyze t1;
VACUUM
regression=# set enable_indexscan TO 0;
SET
regression=# set enable_bitmapscan TO 0;
SET
regression=# set max_parallel_workers_per_gather TO 0;
SET
regression=# explain select * from t1 where id = 42 limit 1;
                         QUERY PLAN
------------------------------------------------------------
 Limit  (cost=0.00..16925.00 rows=1 width=8)
   ->  Seq Scan on t1  (cost=0.00..16925.00 rows=1 width=8)
         Filter: (id = 42)
(3 rows)

Because we'll pick the winning Limit plan based just on its
total_cost, this set of cost assignments amounts to assuming that the
one matching tuple will appear at the end of the seqscan.  That's
overly conservative perhaps, but it's fine.

But now look what happens when we join to another table that has many
matching rows:

regression=# explain select * from t1 t1 left join t1 t2 on (t1.id=t2.link) where t1.id = 42 limit 1;
                              QUERY PLAN
-----------------------------------------------------------------------
 Limit  (cost=0.00..33.96 rows=1 width=16)
   ->  Nested Loop Left Join  (cost=0.00..33859.97 rows=997 width=16)
         ->  Seq Scan on t1  (cost=0.00..16925.00 rows=1 width=8)
               Filter: (id = 42)
         ->  Seq Scan on t1 t2  (cost=0.00..16925.00 rows=997 width=8)
               Filter: (link = 42)
(6 rows)

We have, basically, completely forgotten about that conservative
positioning estimate: this set of costs amounts to assuming that the
first matching tuple will be found 1/1000th of the way through the
entire join, which is a lot less than what we'd just decided it would
cost to retrieve the first t1 tuple.  In your example, with over 13
million rows to be joined, that allows us to make a ridiculously small
estimate of the time to find the first match, and so it's able to
win a cost comparison against the far-more-sanely-estimated indexscan
plan.
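For what it's worth, that 33.96 falls straight out of the pro-rata
assumption; a quick check (my arithmetic mirroring the costing
described above, not planner code):

```python
# Pro-rata check: with a join estimated at 997 rows and total cost
# 33859.97, LIMIT 1 is costed as if the first match arrives 1/997th
# of the way through the join's output.
join_startup, join_total, join_rows = 0.00, 33859.97, 997

limit_total = join_startup + (join_total - join_startup) / join_rows
print(f"{limit_total:.2f}")  # 33.96, matching the Limit node above
```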

I'm not sure what to do about this.  I think that things might work
out better if we redefined the startup cost as "estimated cost to
retrieve the first tuple", rather than its current very-squishy
definition as "cost to initialize the scan".  That would end up
with the LIMIT node having a cost that's at least the sum of the
startup costs of the input scans, which would fix this problem.
But changing that everywhere would be a lotta work, and I'm far
from sure that it would not have any negative side-effects.
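To make that concrete with the numbers from the original report (rough
arithmetic, not planner code): under the redefinition, the filtered
seq scan's startup cost would be roughly its total cost, since the
single match is assumed to lie at the end of the heap.

```python
# Sketch of the proposed redefinition (assumption: startup cost
# becomes "estimated cost to retrieve the first tuple"):
seqscan_total   = 7750888.47  # filtered seq scan on objects, bad plan
indexscan_total = 2.58        # objects_pkey index scan, good plan

# A LIMIT over the seq-scan join could then never be costed below the
# seq scan's first-tuple cost, while the index plan stays around 4.13,
# so the index plan would win the cost comparison:
limit_floor = seqscan_total
assert limit_floor > indexscan_total
```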

            regards, tom lane



Tom Lane <tgl@sss.pgh.pa.us> writes:

> I'm not sure what to do about this.  I think that things might work
> out better if we redefined the startup cost as "estimated cost to
> retrieve the first tuple", rather than its current very-squishy
> definition as "cost to initialize the scan".

Actually I wanted to raise this question a very long time ago when I
read the code, but I didn't have an example proving it could cause any
real impact, so I didn't ask.

startup_cost is supposed to be the cost to retrieve the *first
tuple*, so for a query like "SELECT * FROM t WHERE foo", the IO cost
to retrieve the first tuple is obviously not 0. (I think it could be
total_cost / rows?) At the same time, the startup_cost of an IndexScan
is more rigorous: it counts the IO blocks from the root to the leaf
nodes. I think there is an inconsistency here as well.

> That would end up
> with the LIMIT node having a cost that's at least the sum of the
> startup costs of the input scans, which would fix this problem.

great to know this.

> But changing that everywhere would be a lotta work.

In my understanding, the only place we would need to change is the
startup_cost in cost_seqscan. I must be wrong about that, but I'd like
to know where.

> and I'm far from sure that it would not have any negative
> side-effects. 

Yes; however, I think the semantics would be more correct than before.

-- 
Best Regards
Andy Fan