Thread: Bad Planner Statistics for Uneven distribution.
I discussed this with a few members of #postgresql freenode this morning. I'll keep it breif; [note: i have cleaned out columns not relevant]
I have two tables, brands and models_brands. The first has about 300 records, the later about 350,000 records. The number of distinct brands in the models_brands table is < 10.
=# \d models_brands
Table "public.models_brands"
Column | Type | Modifiers
--------+-----------------------+-----------
model | integer | not null
brand | integer | not null
Indexes:
"models_brands_brand" btree (brand)
Foreign-key constraints:
"models_brands_brand_fkey" FOREIGN KEY (brand) REFERENCES brands(brand_id) ON UPDATE CASCADE ON DELETE CASCADE
"models_brands_model_fkey" FOREIGN KEY (model) REFERENCES models(model_id) ON UPDATE CASCADE ON DELETE CASCADE
Table "public.models_brands"
Column | Type | Modifiers
--------+-----------------------+-----------
model | integer | not null
brand | integer | not null
Indexes:
"models_brands_brand" btree (brand)
Foreign-key constraints:
"models_brands_brand_fkey" FOREIGN KEY (brand) REFERENCES brands(brand_id) ON UPDATE CASCADE ON DELETE CASCADE
"models_brands_model_fkey" FOREIGN KEY (model) REFERENCES models(model_id) ON UPDATE CASCADE ON DELETE CASCADE
a=# \d brands;
Table "public.brands"
Column | Type | Modifiers
------------+------------------------+-----------------------------------------------------------
brand_id | integer | not null default nextval('brands_brand_id_seq'::regclass)
brand_name | character varying(255) | not null
Indexes:
"brands_pkey" PRIMARY KEY, btree (brand_id)
Table "public.brands"
Column | Type | Modifiers
------------+------------------------+-----------------------------------------------------------
brand_id | integer | not null default nextval('brands_brand_id_seq'::regclass)
brand_name | character varying(255) | not null
Indexes:
"brands_pkey" PRIMARY KEY, btree (brand_id)
Now the plans/problems..
=# set enable_seqscan to on;
SET
=# explain analyze select distinct brand from models_brands;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Unique (cost=46300.70..48148.15 rows=4 width=4) (actual time=3699.691..6215.216 rows=4 loops=1)
-> Sort (cost=46300.70..47224.43 rows=369489 width=4) (actual time=3699.681..5027.069 rows=369489 loops=1)
Sort Key: brand
-> Seq Scan on models_brands (cost=0.00..6411.89 rows=369489 width=4) (actual time=0.040..1352.997 rows=369489 loops=1)
Total runtime: 6223.666 ms
(5 rows)
SET
=# explain analyze select distinct brand from models_brands;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Unique (cost=46300.70..48148.15 rows=4 width=4) (actual time=3699.691..6215.216 rows=4 loops=1)
-> Sort (cost=46300.70..47224.43 rows=369489 width=4) (actual time=3699.681..5027.069 rows=369489 loops=1)
Sort Key: brand
-> Seq Scan on models_brands (cost=0.00..6411.89 rows=369489 width=4) (actual time=0.040..1352.997 rows=369489 loops=1)
Total runtime: 6223.666 ms
(5 rows)
=# set enable_seqscan to off;
SET
=# explain analyze select distinct brand from models_brands;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.00..863160.68 rows=4 width=4) (actual time=0.131..2584.779 rows=4 loops=1)
-> Index Scan using models_brands_brand on models_brands (cost=0.00..862236.96 rows=369489 width=4) (actual time=0.122..1440.809 rows=369489 loops=1)
Total runtime: 2584.871 ms
(3 rows)
SET
=# explain analyze select distinct brand from models_brands;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.00..863160.68 rows=4 width=4) (actual time=0.131..2584.779 rows=4 loops=1)
-> Index Scan using models_brands_brand on models_brands (cost=0.00..862236.96 rows=369489 width=4) (actual time=0.122..1440.809 rows=369489 loops=1)
Total runtime: 2584.871 ms
(3 rows)
Picks the wrong plan here. Should pick the index with seqscanning enabled.
More (as a different wording/query)... (as suggested by others on irc)
=# set enable_seqscan to on;
SET
=# explain analyze select brand_id from brands where exists (select 1 from models_brands where brand = brands.brand_id);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Seq Scan on brands (cost=0.00..30.09 rows=152 width=4) (actual time=7742.460..62567.543 rows=4 loops=1)
Filter: (subplan)
SubPlan
-> Seq Scan on models_brands (cost=0.00..7335.61 rows=92372 width=0) (actual time=206.467..206.467 rows=0 loops=303)
Filter: (brand = $0)
Total runtime: 62567.626 ms
SET
=# explain analyze select brand_id from brands where exists (select 1 from models_brands where brand = brands.brand_id);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Seq Scan on brands (cost=0.00..30.09 rows=152 width=4) (actual time=7742.460..62567.543 rows=4 loops=1)
Filter: (subplan)
SubPlan
-> Seq Scan on models_brands (cost=0.00..7335.61 rows=92372 width=0) (actual time=206.467..206.467 rows=0 loops=303)
Filter: (brand = $0)
Total runtime: 62567.626 ms
a=# set enable_seqscan to off;
SET
SET
=# explain analyze select brand_id from brands where exists (select 1 from models_brands where brand = brands.brand_id);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on brands (cost=100000000.00..100000715.90 rows=152 width=4) (actual time=0.615..3.710 rows=4 loops=1)
Filter: (subplan)
SubPlan
-> Index Scan using models_brands_brand on models_brands (cost=0.00..216410.97 rows=92372 width=0) (actual time=0.008..0.008 rows=0 loops=303)
Index Cond: (brand = $0)
Total runtime: 3.790 ms
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on brands (cost=100000000.00..100000715.90 rows=152 width=4) (actual time=0.615..3.710 rows=4 loops=1)
Filter: (subplan)
SubPlan
-> Index Scan using models_brands_brand on models_brands (cost=0.00..216410.97 rows=92372 width=0) (actual time=0.008..0.008 rows=0 loops=303)
Index Cond: (brand = $0)
Total runtime: 3.790 ms
It was also tried to similar results with a LIMIT 1 in the subquery for exist.
More...
Seqscan still off..
=# explain analyze select distinct brand_id from brands inner join models_brands on (brand_id = brand);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.00..867782.58 rows=303 width=4) (actual time=0.391..4898.579 rows=4 loops=1)
-> Merge Join (cost=0.00..866858.85 rows=369489 width=4) (actual time=0.383..3749.771 rows=369489 loops=1)
Merge Cond: ("outer".brand_id = "inner".brand)
-> Index Scan using brands_pkey on brands (cost=0.00..15.53 rows=303 width=4) (actual time=0.080..0.299 rows=60 loops=1)
-> Index Scan using models_brands_brand on models_brands (cost=0.00..862236.96 rows=369489 width=4) (actual time=0.013..1403.175 rows=369489 loops=1)
Total runtime: 4898.697 ms
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.00..867782.58 rows=303 width=4) (actual time=0.391..4898.579 rows=4 loops=1)
-> Merge Join (cost=0.00..866858.85 rows=369489 width=4) (actual time=0.383..3749.771 rows=369489 loops=1)
Merge Cond: ("outer".brand_id = "inner".brand)
-> Index Scan using brands_pkey on brands (cost=0.00..15.53 rows=303 width=4) (actual time=0.080..0.299 rows=60 loops=1)
-> Index Scan using models_brands_brand on models_brands (cost=0.00..862236.96 rows=369489 width=4) (actual time=0.013..1403.175 rows=369489 loops=1)
Total runtime: 4898.697 ms
=# set enable_seqscan to on;
SET
=# explain analyze select distinct brand_id from brands inner join models_brands on (brand_id = brand);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=46300.70..52770.04 rows=303 width=4) (actual time=3742.046..8560.833 rows=4 loops=1)
-> Merge Join (cost=46300.70..51846.32 rows=369489 width=4) (actual time=3742.035..7406.677 rows=369489 loops=1)
Merge Cond: ("outer".brand_id = "inner".brand)
-> Index Scan using brands_pkey on brands (cost=0.00..15.53 rows=303 width=4) (actual time=0.077..0.407 rows=60 loops=1)
-> Sort (cost=46300.70..47224.43 rows=369489 width=4) (actual time=3741.584..5051.348 rows=369489 loops=1)
Sort Key: models_brands.brand
-> Seq Scan on models_brands (cost=0.00..6411.89 rows=369489 width=4) (actual time=0.027..1346.178 rows=369489 loops=1)
Total runtime: 8589.502 ms
(8 rows)
SET
=# explain analyze select distinct brand_id from brands inner join models_brands on (brand_id = brand);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=46300.70..52770.04 rows=303 width=4) (actual time=3742.046..8560.833 rows=4 loops=1)
-> Merge Join (cost=46300.70..51846.32 rows=369489 width=4) (actual time=3742.035..7406.677 rows=369489 loops=1)
Merge Cond: ("outer".brand_id = "inner".brand)
-> Index Scan using brands_pkey on brands (cost=0.00..15.53 rows=303 width=4) (actual time=0.077..0.407 rows=60 loops=1)
-> Sort (cost=46300.70..47224.43 rows=369489 width=4) (actual time=3741.584..5051.348 rows=369489 loops=1)
Sort Key: models_brands.brand
-> Seq Scan on models_brands (cost=0.00..6411.89 rows=369489 width=4) (actual time=0.027..1346.178 rows=369489 loops=1)
Total runtime: 8589.502 ms
(8 rows)
Hope that helps
Kevin McArthur
"Kevin McArthur" <Kevin@StormTide.ca> writes: > -> Seq Scan on models_brands (cost=0.00..6411.89 rows=369489 width=4) (actual time=0.040..1352.997 rows=369489loops=1) > ... > -> Index Scan using models_brands_brand on models_brands (cost=0.00..862236.96 rows=369489 width=4) (actual time=0.122..1440.809rows=369489 loops=1) > Picks the wrong plan here. Should pick the index with seqscanning enabled. It's really not possible for a full-table indexscan to be faster than a seqscan, and not very credible for it even to be approximately as fast. I suspect your second query here is the beneficiary of the first query having fetched all the pages into cache. In general, if you want to optimize for a mostly-cached database, you need to reduce random_page_cost below its default value ... regards, tom lane
Tom, On 7/21/06, Tom Lane <tgl@sss.pgh.pa.us> wrote: > It's really not possible for a full-table indexscan to be faster than a > seqscan, and not very credible for it even to be approximately as fast. > I suspect your second query here is the beneficiary of the first query > having fetched all the pages into cache. In general, if you want to > optimize for a mostly-cached database, you need to reduce > random_page_cost below its default value ... We discussed this case on IRC and the problem was not the first set of queries but the second one: select brand_id from brands where exists (select 1 from models_brands where brand = brands.brand_id);). Isn't there any way to make PostgreSQL have a better estimation here: -> Index Scan using models_brands_brand on models_brands (cost=0.00..216410.97 rows=92372 width=0) (actual time=0.008..0.008 rows=0 loops=303) Index Cond: (brand = $0) I suppose it's because the planner estimates that there will be 92372 result rows that it chooses the seqscan instead of the index scan. ALTER STATISTICS didn't change anything. IIRC, there were already a few threads about the same sort of estimation problem and there wasn't any solution to solve this problem. Do you have any hint/ideas? -- Guillaume
"Guillaume Smet" <guillaume.smet@gmail.com> writes: > Isn't there any way to make PostgreSQL have a better estimation here: > -> Index Scan using models_brands_brand on models_brands > (cost=0.00..216410.97 rows=92372 width=0) (actual time=0.008..0.008 > rows=0 loops=303) > Index Cond: (brand = $0) Note that the above plan extract is pretty misleading, because it doesn't account for the implicit "LIMIT 1" of an EXISTS() clause. What the planner is *actually* imputing to this plan is 216410.97/92372 cost units, or about 2.34. However that applies to the seqscan variant as well. I think the real issue with Kevin's example is that when doing an EXISTS() on a brand_id that doesn't actually exist in the table, the seqscan plan has worst-case behavior (ie, scan the whole table) while the indexscan plan still manages to be cheap. Because his brands table has so many brand_ids that aren't in the table, that case dominates the results. Not sure how we could factor that risk into the cost estimates. The EXISTS code could probably special-case it reasonably well for the simplest seqscan and indexscan subplans, but I don't see what to do with more general subqueries (like joins). regards, tom lane