Thread: Seq scan instead of index scan querying single row from primary key on large table
Seq scan instead of index scan querying single row from primary key on large table
From
James Coleman
Date:
Hi all, I've been optimizing queries for a long time, and I don't think I've ever seen something more surprising to me than this -- sufficiently so that I wanted to ask if others thought it implied a bug. It's possible my mental model for the planner is broken in some significant way, or that I'm overlooking something obvious, so please let me know if that's the case too. We have a query like this (all tables renamed to be generic): SELECT DISTINCT "objects"."pk" FROM "objects" LEFT OUTER JOIN "objects" "allowed_objects_objects" ON "allowed_objects_objects"."pk" = "objects"."allowed_object_fk" AND "objects"."allowed_object_fk" IS NOT NULL LEFT OUTER JOIN "facilitated_object_metadata" ON "facilitated_object_metadata"."object_fk" = "objects"."pk" LEFT OUTER JOIN "object_audits" ON "object_audits"."object_fk" = "objects"."pk" LEFT OUTER JOIN "objects" "action_objects_objects" ON "action_objects_objects"."allowed_object_fk" IS NOT NULL AND "action_objects_objects"."allowed_object_fk" = "objects"."pk" LEFT OUTER JOIN "objects" "return_objects_objects" ON "return_objects_objects"."returned_object_fk" IS NOT NULL AND "return_objects_objects"."returned_object_fk" = "objects"."pk" LEFT OUTER JOIN "object_taxs" ON "object_taxs"."object_fk" = "objects"."pk" LEFT OUTER JOIN "object_configs" ON "object_configs"."object_fk" = "objects"."pk" LEFT OUTER JOIN "object_statuses" ON "object_statuses"."object_fk" = "objects"."pk" LEFT OUTER JOIN "object_edits" ON "object_edits"."object_fk" = "objects"."pk" LEFT OUTER JOIN "audit_answers" ON "audit_answers"."requester_type" = 'Transaction' AND "audit_answers"."requester_fk" = "objects"."pk" WHERE "objects"."pk" = 131690144 LIMIT 1; This is ORM generated, so granted there's some slight oddity with the "DISTINCT pk" and then the "LIMIT 1". The plan generated by the planner changed suddenly one morning this week, and in a very surprising way: the innermost scan (of "objects") started choosing a seq scan, despite the cost from that node being very high and an index scan being possible -- it's the primary key and we're restricting on a single value, so intuitively we know, and indeed the planner knows, that there will only be a single row returned. Here's the surprising plan: Limit (cost=3.42..4.00 rows=1 width=8) -> Limit (cost=3.42..4.00 rows=1 width=8) -> Nested Loop Left Join (cost=3.42..7939067.28 rows=13777920 width=8) -> Nested Loop Left Join (cost=2.86..7766839.95 rows=1059840 width=8) -> Nested Loop Left Join (cost=2.44..7753589.13 rows=211968 width=8) -> Nested Loop Left Join (cost=1.86..7750902.24 rows=138 width=8) -> Nested Loop Left Join (cost=1.29..7750895.86 rows=1 width=8) -> Nested Loop Left Join (cost=1.29..7750895.85 rows=1 width=8) -> Nested Loop Left Join (cost=0.86..7750893.39 rows=1 width=8) -> Nested Loop Left Join (cost=0.43..7750890.93 rows=1 width=8) -> Seq Scan on objects (cost=0.00..7750888.47 rows=1 width=16) Filter: (pk = 131690144) -> Index Only Scan using index_facilitated_object_metadata_on_object_fk on facilitated_object_metadata (cost=0.43..2.45 rows=1 width=8) Index Cond: (object_fk = 131690144) -> Index Only Scan using index_objects_on_allowed_object_fk_not_null on objects action_objects_objects (cost=0.43..2.45 rows=1 width=8) Index Cond: (allowed_object_fk = 131690144) -> Index Only Scan using index_objects_on_returned_object_fk_not_null on objects return_objects_objects (cost=0.43..2.45 rows=1 width=8) Index Cond: (returned_object_fk = 131690144) -> Seq Scan on object_taxs (cost=0.00..0.00 rows=1 width=8) Filter: (object_fk = 131690144) -> Index Only Scan using index_object_audits_on_object_fk on object_audits (cost=0.57..5.00 rows=138 width=8) Index Cond: (object_fk = 131690144) -> Materialize (cost=0.57..41.13 rows=1536 width=8) -> Index Only Scan using index_object_configs_on_object_id on object_configs (cost=0.57..33.45 rows=1536 width=8) Index Cond: (object_fk = 131690144) -> Materialize (cost=0.42..2.84 rows=5 width=8) -> Index Only Scan using index_adjustment_responses_on_response_fk on object_edits (cost=0.42..2.81 rows=5 width=8) Index Cond: (object_fk = 131690144) -> Materialize (cost=0.56..3.36 rows=13 width=8) -> Index Only Scan using index_audit_answers_on_requester_type_and_fk on audit_answers (cost=0.56..3.30 rows=13 width=8) Index Cond: ((requester_type = 'Object'::bt_object_or_alternate_object) AND (requester_fk = 131690144)) Note the innermost table scan: Seq Scan on objects (cost=0.00..7750888.47 rows=1 width=16) Filter: (pk = 131690144) If I set enable_seqscan = off, then I get the old plan: Limit (cost=4.12..4.13 rows=1 width=8) -> Limit (cost=4.12..4.13 rows=1 width=8) -> Nested Loop Left Join (cost=4.12..188183.54 rows=13777920 width=8) -> Nested Loop Left Join (cost=3.55..15956.20 rows=1059840 width=8) -> Nested Loop Left Join (cost=3.13..2705.38 rows=211968 width=8) -> Nested Loop Left Join (cost=2.55..18.49 rows=138 width=8) -> Nested Loop Left Join (cost=1.98..12.11 rows=1 width=8) -> Nested Loop Left Join (cost=1.86..9.96 rows=1 width=8) -> Nested Loop Left Join (cost=1.43..7.50 rows=1 width=8) -> Nested Loop Left Join (cost=0.99..5.04 rows=1 width=8) -> Index Scan using objects_pkey on objects (cost=0.57..2.58 rows=1 width=16) Index Cond: (pk = 131690144) -> Index Only Scan using index_facilitated_object_metadata_on_object_fk on facilitated_object_metadata (cost=0.43..2.45 rows=1 width=8) Index Cond: (object_fk = 131690144) -> Index Only Scan using index_objects_on_allowed_object_fk_not_null on objects action_objects_objects (cost=0.43..2.45 rows=1 width=8) Index Cond: (allowed_object_fk = 131690144) -> Index Only Scan using index_objects_on_returned_object_fk_not_null on objects return_objects_objects (cost=0.43..2.45 rows=1 width=8) Index Cond: (returned_object_fk = 131690144) -> Index Only Scan using index_object_taxs_on_object_fk on object_taxs (cost=0.12..2.14 rows=1 width=8) Index Cond: (object_fk = 131690144) -> Index Only Scan using index_object_audits_on_object_fk on object_audits (cost=0.57..5.00 rows=138 width=8) Index Cond: (object_fk = 131690144) -> Materialize (cost=0.57..41.13 rows=1536 width=8) -> Index Only Scan using index_object_configs_on_object_id on object_configs (cost=0.57..33.45 rows=1536 width=8) Index Cond: (object_fk = 131690144) -> Materialize (cost=0.42..2.84 rows=5 width=8) -> Index Only Scan using index_adjustment_responses_on_response_fk on object_edits (cost=0.42..2.81 rows=5 width=8) Index Cond: (object_fk = 131690144) -> Materialize (cost=0.56..3.36 rows=13 width=8) -> Index Only Scan using index_audit_answers_on_requester_type_and_fk on audit_answers (cost=0.56..3.30 rows=13 width=8) Index Cond: ((requester_type = 'Object'::bt_object_or_alternate_object) AND (requester_fk = 131690144)) Notice the innermost table scan is the expect index scan: -> Index Scan using objects_pkey on objects (cost=0.57..2.58 rows=1 width=16) Index Cond: (pk = 131690144) The join order stays the same between the two plans. The scan on object_taxs changes from a seq scan to an index scan (the table is empty so neither would be a problem) obviously because of disabled sequence scans. Analyzing object_audits during the incident did *not* change the plan, but a later auto-analyze of that same table seemed to change the plan back. That being said, we're now back at the bad plan (without setting enable_seqscan = off). I don't understand why that analyze should really affect anything at all given that join order stays the same between good/bad plans and given that I'd expect the planner to strongly prefer the index scan given its far lower cost for the number of estimated tuples. Analyzing objects doesn't help. Removing the DISTINCT does not fix the plan. Removing the LIMIT 1 does fix the plan and removes the double LIMIT node in the plan. The second LIMIT node is implicit from the DISTINCT, and it doesn't seem to me that a LIMIT node should change the cost/path chosen of an identical limit node inside it, and indeed the cost seems to be the same at both. The only explanation I can come up with is that the startup cost for the seq scan is 0 while for the index scan is 0.57, and the LIMIT at the top level is causing the planner to care about startup cost. Assuming that's true I think the early return cost multiplication of the LIMIT is being applied very naively on the seq scan node. Or perhaps the issue is that the startup cost for a single tuple on a seq scan like this shouldn't really have a startup cost of 0 -- that cost is presumably for tuples being returned _without_ having applied the filter. That seems slightly odd to me, because the cost of getting the first row out of that node -- in my naive view thinking about it for all of 5 seconds -- should be calculated based on applying the filter (and thus the likelihood that that filter matches right away). If we did that then this path would never win. But that 0.00 startup cost for the seq scan with a filter shows up in PG14 and PG11 also, not just PG16, so that's not something that's changed. To recap: the estimation of rows is correct, the estimated high (total) cost of the seq scan is correct, but the seq scan is chosen over the index scan anyway for a plan that returns a single "random" row based on the primary key. Am I right to be surprised here? James Coleman
Re: Seq scan instead of index scan querying single row from primary key on large table
From
Tom Lane
Date:
James Coleman <jtc331@gmail.com> writes: > The plan generated by the planner changed suddenly one morning this > week, and in a very surprising way: the innermost scan (of "objects") > started choosing a seq scan, despite the cost from that node being > very high and an index scan being possible That looks weird to me too, but given the number of tables involved I wonder what you have join_collapse_limit/from_collapse_limit set to. If those are smaller than the query's relation count it could be that this is an artifact of optimization of a join subproblem. However, if it's the very same query you've been generating all along, this theory doesn't really explain the sudden change of plan choice. Also, even granting the bad-peephole-optimization theory, it seems like the best path for the objects table alone would still have been the index scan, so I'm confused too. What nondefault planner settings have you got? ("EXPLAIN (SETTINGS)" would help with answering that accurately.) Are you really sure nothing changed about what the ORM is emitting? regards, tom lane
Re: Seq scan instead of index scan querying single row from primary key on large table
From
James Coleman
Date:
On Thu, Jul 18, 2024 at 2:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > James Coleman <jtc331@gmail.com> writes: > > The plan generated by the planner changed suddenly one morning this > > week, and in a very surprising way: the innermost scan (of "objects") > > started choosing a seq scan, despite the cost from that node being > > very high and an index scan being possible > > That looks weird to me too, but given the number of tables involved > I wonder what you have join_collapse_limit/from_collapse_limit set > to. If those are smaller than the query's relation count it could > be that this is an artifact of optimization of a join subproblem. > However, if it's the very same query you've been generating all along, > this theory doesn't really explain the sudden change of plan choice. Those gucs are both set to 8, so...that could be a factor, except that as you noted if that's not a variable that's changing during the flip-flop of plans, then it's hard to make that the explanation. See below for more on this. > Also, even granting the bad-peephole-optimization theory, it seems > like the best path for the objects table alone would still have been > the index scan, so I'm confused too. I'm glad I'm not the only one. > What nondefault planner settings > have you got? ("EXPLAIN (SETTINGS)" would help with answering that > accurately.) Settings: max_parallel_workers = '24', maintenance_io_concurrency = '1', effective_cache_size = '264741832kB', work_mem = '16MB', random_page_cost = '1', search_path = 'object_shard,public' Of those I'd expect random_page_cost to steer it towards the index scan rather than away from it. The others don't seem to me like they should effect which path would be the best one for the objects table alone. > Are you really sure nothing changed about what the ORM is emitting? Yes, aside from the fact that application code didn't change, we reproduced the problem by restoring a physical snapshot of the database and were able to get the bad plan and then get it to change to the good plan by analyzing object_audits. Additionally on the live production database (well, a read replica anyway) today it'd switched back to the bad plan, and then an hour layer it'd swapped back again for the same exact query. Thanks, James Coleman
Re: Seq scan instead of index scan querying single row from primary key on large table
From
Tom Lane
Date:
James Coleman <jtc331@gmail.com> writes: > On Thu, Jul 18, 2024 at 2:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Are you really sure nothing changed about what the ORM is emitting? > Yes, aside from the fact that application code didn't change, we > reproduced the problem by restoring a physical snapshot of the > database and were able to get the bad plan and then get it to change > to the good plan by analyzing object_audits. OK. After staring at this for awhile, I think I see what's happening, and you're right that it's basically a problem with how we do LIMIT. Consider this simple test case: regression=# create table t1 (id int primary key, link int); CREATE TABLE regression=# insert into t1 select id, id/1000 from generate_series(1,1000000) id; INSERT 0 1000000 regression=# vacuum analyze t1; VACUUM regression=# set enable_indexscan TO 0; SET regression=# set enable_bitmapscan TO 0; SET regression=# set max_parallel_workers_per_gather TO 0; SET regression=# explain select * from t1 where id = 42 limit 1; QUERY PLAN ------------------------------------------------------------ Limit (cost=0.00..16925.00 rows=1 width=8) -> Seq Scan on t1 (cost=0.00..16925.00 rows=1 width=8) Filter: (id = 42) (3 rows) Because we'll pick the winning Limit plan based just on its total_cost, this set of cost assignments amounts to assuming that the one matching tuple will appear at the end of the seqscan. That's overly conservative perhaps, but it's fine. But now look what happens when we join to another table that has many matching rows: regression=# explain select * from t1 t1 left join t1 t2 on (t1.id=t2.link) where t1.id = 42 limit 1; QUERY PLAN ----------------------------------------------------------------------- Limit (cost=0.00..33.96 rows=1 width=16) -> Nested Loop Left Join (cost=0.00..33859.97 rows=997 width=16) -> Seq Scan on t1 (cost=0.00..16925.00 rows=1 width=8) Filter: (id = 42) -> Seq Scan on t1 t2 (cost=0.00..16925.00 rows=997 width=8) Filter: (link = 42) (6 rows) We have, basically, completely forgotten about that conservative positioning estimate: this set of costs amounts to assuming that the first matching tuple will be found 1/1000th of the way through the entire join, which is a lot less than what we'd just decided it would cost to retrieve the first t1 tuple. In your example, with over 13 million rows to be joined, that allows us to make a ridiculously small estimate of the time to find the first match, and so it's able to win a cost comparison against the far-more-sanely-estimated indexscan plan. I'm not sure what to do about this. I think that things might work out better if we redefined the startup cost as "estimated cost to retrieve the first tuple", rather than its current very-squishy definition as "cost to initialize the scan". That would end up with the LIMIT node having a cost that's at least the sum of the startup costs of the input scans, which would fix this problem. But changing that everywhere would be a lotta work, and I'm far from sure that it would not have any negative side-effects. regards, tom lane
Re: Seq scan instead of index scan querying single row from primary key on large table
From
Andy Fan
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes: > I'm not sure what to do about this. I think that things might work > out better if we redefined the startup cost as "estimated cost to > retrieve the first tuple", rather than its current very-squishy > definition as "cost to initialize the scan". Actually I wanted to raise this question very long time ago when I read the code, but I don't have a example to prove it can cause any real impact, then I didn't ask it. startup_cost is defined by the cost to retrieve the *first tuple*, so for the query like "SELECT * FROM t WHERE foo", the IO cost to retrieve the first tpule is obviously not 0. (I think it can be total_cost / rows?) at the same time, the startup_cost of IndexScan is more restricted, it counts the IO blocks from root -> leaf nodes. I think there is a inconsistent issue as well. > That would end up > with the LIMIT node having a cost that's at least the sum of the > startup costs of the input scans, which would fix this problem. great to know this. > But changing that everywhere would be a lotta work. In my understanding, the only place we need to change is the startup_cost in cost_seqscan, I must be wrong now, but I want to know where is it. > and I'm far from sure that it would not have any negative > side-effects. Yes, I think it is a semantics correct than before however. -- Best Regards Andy Fan