Re: Planner incorrectly choosing seq scan over index scan - Mailing list pgsql-performance

From Meetesh Karia
Subject Re: Planner incorrectly choosing seq scan over index scan
Date
Msg-id fc5b04ca05080116564d1b3323@mail.gmail.com
Whole thread Raw
In response to Re: Planner incorrectly choosing seq scan over index scan  (John Arbash Meinel <john@arbash-meinel.com>)
Responses Re: Planner incorrectly choosing seq scan over index scan
List pgsql-performance
Thanks John.  I've answered your questions below:

Has lte_user and candidates617004 been recently ANALYZEd? All estimates,
except for the expected number of rows from lte_user seem to be okay.

I ANALYZEd both tables just before putting together my first email.  And, unfortunately, modifying the statistics target didn't help either.

Is user_id the primary key for lte_user?

Yes

I'm trying to figure out how you can get 50k rows, by searching a
primary key, against a 3.5k rows. Is user_id only part of the primary
key for lte_user?

Hmmm ... I missed that before.  But, that surprises me too.  Especially since sourceId in the candidates table has only 1 value.  Also, user_id is the complete primary key for lte_user.

Can you give us the output of:
\d lte_user
\d candidates617004

Sure, here they are:

lte=# \d lte_user
                 Table "public.lte_user"
    Column     |            Type             | Modifiers
---------------+-----------------------------+-----------
 user_id       | bigint                      | not null
 firstname     | character varying(255)      |
 lastname      | character varying(255)      |
 address1      | character varying(255)      |
 address2      | character varying(255)      |
 city          | character varying(255)      |
 state         | character varying(255)      |
 zip           | character varying(255)      |
 phone1        | character varying(255)      |
 phone2        | character varying(255)      |
 username      | character varying(255)      |
 password      | character varying(255)      |
 deleted       | boolean                     | not null
 ext_cust_id   | character varying(255)      |
 aboutme       | character varying(255)      |
 birthday      | timestamp without time zone |
 fm_id       | bigint                      |
 ar            | double precision            |
Indexes:
    "lte_user_pkey" PRIMARY KEY, btree (user_id)
    "idx_user_extid" btree (ext_cust_id)
    "idx_user_username" btree (username)
Foreign-key constraints:
    "fk_user_fm" FOREIGN KEY (fm_id) REFERENCES fm(fm_id)

lte=# \d candidates617004
       Table "public.candidates617004"
    Column    |       Type       | Modifiers
--------------+------------------+-----------
 fmid       | bigint           |
 sourceid     | bigint           |
 sr            | double precision |
 targetid     | bigint           |
 tr           | double precision |

Also, if you could describe the table layouts, that would help.

Sure.  The lte_user table is just a collection of users.  user_id is assigned uniquely using a sequence.  During some processing, we create a candidates table (candidates617004 in our case).  This table is usually a temp table.  sourceid is a user_id (in this case it is always 617004) and targetid is also a user_id (2860 distinct values out of 3467).  The rest of the information is either only used in the select clause or not used at all during this processing.

Did I miss something in the table layout description that would be helpful?

Thanks for your help!
Meetesh

>         ->  Sort  (cost=9912.07..9920.73 rows=3467 width=40) (actual
> time=1156.000..1156.000 rows=3467 loops=1)
>               Sort Key: c.sourceid
>               ->  Hash Join  (cost=8710.44..9708.21 rows=3467 width=40)
> (actual time=1125.000..1156.000 rows=3467 loops=1)
>                     Hash Cond: ("outer".targetid = "inner".user_id)
>                     ->  Seq Scan on candidates617004 c
> (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..0.000
> rows=3467 loops=1)
>                     ->  Hash  (cost=8011.95..8011.95 rows=279395
> width=16) (actual time=1125.000..1125.000 rows=0 loops=1)
>                           ->  Seq Scan on lte_user t
> (cost=0.00..8011.95 rows=279395 width=16) (actual time=0.000..670.000
> rows=279395 loops=1)
> Total runtime: 1406.000 ms
>
> enable_hashjoin disabled
> ----------------------------------------
> QUERY PLAN
> Sort  (cost=14355.37..14364.03 rows=3467 width=48) (actual
> time=391.000..391.000 rows=3467 loops=1)
>   Sort Key: c.sourceid, c.targetid
>   ->  Nested Loop  (cost=271.52..14151.51 rows=3467 width=48) (actual
> time=203.000..359.000 rows=3467 loops=1)
>         ->  Merge Join  (cost=271.52..3490.83 rows=3467 width=40)
> (actual time=203.000..218.000 rows=3467 loops=1)
>               Merge Cond: ("outer".user_id = "inner".sourceid)
>               ->  Index Scan using lte_user_pkey on lte_user s
> (cost=0.00..16837.71 rows=279395 width=16) (actual time=0.000..126.000
> rows=50034 loops=1)
>               ->  Sort  (cost=271.52..280.19 rows=3467 width=32) (actual
> time=15.000..30.000 rows=3467 loops=1)
>                     Sort Key: c.sourceid
>                     ->  Seq Scan on candidates617004 c
> (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..0.000
> rows=3467 loops=1)
>         ->  Index Scan using lte_user_pkey on lte_user t
> (cost=0.00..3.03 rows=1 width=16) (actual time=0.031..0.036 rows=1
> loops=3467)
>               Index Cond: ("outer".targetid = t.user_id)
> Total runtime: 406.000 ms
>
> random_page_cost set to 1.5
> ----------------------------------------------
> QUERY PLAN
> Sort  (cost= 12702.62..12711.29 rows=3467 width=48) (actual
> time=1407.000..1407.000 rows=3467 loops=1)
>   Sort Key: c.sourceid, c.targetid
>   ->  Merge Join  (cost=9912.07..12498.77 rows=3467 width=48) (actual
> time=1391.000..1407.000 rows=3467 loops=1)
>         Merge Cond: ("outer".user_id = "inner".sourceid)
>         ->  Index Scan using lte_user_pkey on lte_user s
> (cost=0.00..12807.34 rows=279395 width=16) (actual time=0.000..46.000
> rows=50034 loops=1)
>         ->  Sort  (cost=9912.07..9920.73 rows=3467 width=40) (actual
> time=1188.000..1188.000 rows=3467 loops=1)
>               Sort Key: c.sourceid
>               ->  Hash Join  (cost=8710.44..9708.21 rows=3467 width=40)
> (actual time=1157.000..1188.000 rows=3467 loops=1)
>                     Hash Cond: ("outer".targetid = "inner".user_id)
>                     ->  Seq Scan on candidates617004 c
> (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..15.000
> rows=3467 loops=1)
>                     ->  Hash  (cost=8011.95..8011.95 rows=279395
> width=16) (actual time=1157.000..1157.000 rows=0 loops=1)
>                           ->  Seq Scan on lte_user t
> (cost=0.00..8011.95 rows=279395 width=16) (actual time=0.000..750.000
> rows=279395 loops=1)
> Total runtime: 1422.000 ms
>
> random_page_cost set to 1.5 and enable_hashjoin set to false
> --------------------------------------------------------------------------------------------------
> QUERY PLAN
> Sort  (cost=13565.58..13574.25 rows=3467 width=48) (actual
> time=390.000..390.000 rows=3467 loops=1)
>   Sort Key: c.sourceid, c.targetid
>   ->  Nested Loop  (cost=271.52..13361.73 rows=3467 width=48) (actual
> time=203.000..360.000 rows=3467 loops=1)
>         ->  Merge Join  (cost=271.52..2762.88 rows=3467 width=40)
> (actual time=203.000..250.000 rows=3467 loops=1)
>               Merge Cond: ("outer".user_id = "inner".sourceid)
>               ->  Index Scan using lte_user_pkey on lte_user s
> (cost=0.00..12807.34 rows=279395 width=16) (actual time=0.000..48.000
> rows=50034 loops=1)
>               ->  Sort  (cost=271.52..280.19 rows=3467 width=32) (actual
> time=15.000..31.000 rows=3467 loops=1)
>                     Sort Key: c.sourceid
>                     ->  Seq Scan on candidates617004 c
> (cost=0.00..67.67 rows=3467 width=32) (actual time=0.000..15.000
> rows=3467 loops=1)
>         ->  Index Scan using lte_user_pkey on lte_user t
> (cost= 0.00..3.02 rows=1 width=16) (actual time=0.023..0.023 rows=1
> loops=3467)
>               Index Cond: ("outer".targetid = t.user_id)
> Total runtime: 406.000 ms
>
> Thanks,
> Meetesh




pgsql-performance by date:

Previous
From: Meetesh Karia
Date:
Subject: Re: Planner incorrectly choosing seq scan over index scan
Next
From: Tom Lane
Date:
Subject: Re: Planner incorrectly choosing seq scan over index scan