Re: Query planner issue with preferring primary key over a better index when using ORDER BY and LIMIT - Mailing list pgsql-general

From Francisco Olarte
Subject Re: Query planner issue with preferring primary key over a better index when using ORDER BY and LIMIT
Date
Msg-id CA+bJJbzOpKRG_5S=Vnak7xXjsaWyXi92H67CGByh5xBHX1Z=rg@mail.gmail.com
Whole thread Raw
In response to Re: Query planner issue with preferring primary key over a better index when using ORDER BY and LIMIT  (Alan Hodgson <ahodgson@lists.simkin.ca>)
List pgsql-general
Alan:

On Mon, 6 Dec 2021 at 18:58, Alan Hodgson <ahodgson@lists.simkin.ca> wrote:
> On Mon, 2021-12-06 at 18:20 +0100, Francisco Olarte wrote:
> Can you post an explain analyze? To me it seems like the planner
> thinks shipment_import_id is randomly distributed and the table is
> well correlated with it's PK, so scanning it for the first id should
> be fast.
>    #explain analyze SELECT "shipment_import_records".* FROM
>    shipment_import_records" WHERE
>    shipment_import_records"."shipment_import_id" = 5090609 ORDER BY
>    "shipment_import_records"."id" ASC LIMIT 1;
>    ---------------------------------------------------------------------
>     Limit  (cost=0.44..873.08 rows=1 width=243) (actual
>    time=31689.725..31689.726 rows=1 loops=1)
>       ->  Index Scan using shipment_import_records_pkey on
>    shipment_import_records  (cost=0.44..5122405.71 rows=5870 width=243)
>    (actual time=31689.723..31689.724 rows=1 loops=1)
>             Filter: (shipment_import_id = 5090609)
>             Rows Removed by Filter: 28710802

This seems to be the problem, IIRC you said you had 29M rows, it is
index-scanning nearly all the table.

I'm not an explain guru, but It seems to me you have a correlation
undetected by the planer.


>       The biggest one (but yes "earlier"):
>
> # explain analyze SELECT "shipment_import_records".* FROM "shipment_import_records" WHERE
"shipment_import_records"."shipment_import_id"= 1247888 ORDER BY
 
> "shipment_import_records"."id" ASC LIMIT 1;
> --------
> Limit  (cost=0.44..426.59 rows=1 width=243) (actual time=8007.069..8007.070 rows=1 loops=1)
>   ->  Index Scan using shipment_import_records_pkey on shipment_import_records  (cost=0.44..5126628.40 rows=12030
width=243)(actual time=8007.068..8007.068 rows=1 l
 
> oops=1)
>         Filter: (shipment_import_id = 1247888)
>         Rows Removed by Filter: 10929193
> Planning Time: 0.584 ms
> Execution Time: 8007.086 ms
> (6 rows)

mmm, the times seem to correlate with rows removed by filter ~ rows
read. To me it seems the planer thinks they are well distributed but
they are not ( from the names it seem like you insert a lot of
shipment_import_records with the same shipment_import_id, and they get
a serial or similar id, they are bunched. The planner thinks they are
distributed and tries to read the table in order, probably because it
correlates well with the id index, but when you select one of the
later-inserted imports it fails miserabley and has to scan the whole /
a lot of the table.


> And the smallest/latest, which actually uses the "right" index:
>  # explain analyze SELECT "shipment_import_records".* FROM "shipment_import_records" WHERE
"shipment_import_records"."shipment_import_id"= 5116174 ORDER BY
 
> "shipment_import_records"."id" ASC LIMIT 1;
> ------------------------
> Limit  (cost=145.44..145.44 rows=1 width=243) (actual time=0.018..0.018 rows=1 loops=1)
>   ->  Sort  (cost=145.44..145.64 rows=79 width=243) (actual time=0.017..0.018 rows=1 loops=1)
>         Sort Key: id
>         Sort Method: quicksort  Memory: 26kB
>         ->  Index Scan using index_shipment_import_records_on_shipment_import_id on shipment_import_records
(cost=0.44..145.05rows=79 width=243) (actual time=0.013
 
> ..0.014 rows=1 loops=1)
>               Index Cond: (shipment_import_id = 5116174)
> Planning Time: 0.104 ms
> Execution Time: 0.032 ms

This one found few rows, and probably the planner knew it ( rom the
rows=79 ) so it went on the import_record_id index route.

> But from the names of the field you may have correlation between
> shipment_import_id and id hidden somewhere ( like they are two serial
> growing together, you query for the latest shipment ids and it scans
> all the table ). An explain analyze should show that ( or three, one
> for that shipment import id, one for 1, one for a really big one )
> This is definitely the case. And we are generally looking for newer data for most operations.
> Thanks for looking at it.

Well, I have not solution for that, but at least I think the problem
is identified.

You could try building an index on shipment_import_id + shipment_id,
it would solve this kind of lookups, and would certainly pull its
weight if your ORM makes lots of this kind of queries,  but I'm not
sure on how to make the planner use it, but it may be worth testing.

Francisco Olarte.



pgsql-general by date:

Previous
From: Vikas Sharma
Date:
Subject: how to get value of parameter set in session for other user
Next
From: Francisco Olarte
Date:
Subject: Re: When Update balloons memory