Why is the optimiser choosing a sub-optimal plan?

From: Stephen Cresswell
I have a table with two indexes...

CREATE TABLE mobile_summary_usage
(
   import   text,
   msisdn   text,
   type     text,
   total    integer,
   day      date,
   cycle    text
);

CREATE INDEX mobile_summary_usage_msisdn_cycle ON mobile_summary_usage
USING btree (msisdn, cycle);

CREATE INDEX mobile_summary_usage_cycle ON mobile_summary_usage USING
btree (cycle);


We insert approximately 2M records into this table each day. Whenever
someone wants to see the total amount of voice calls, text messages, or
data they've used, we query the table with the following:

SELECT msisdn, type, sum (total), units
FROM mobile_summary_usage msu, mobile_summary_type mst
WHERE type = id AND msisdn = ? AND cycle = ?
GROUP BY msisdn, type, units;

Where:
msisdn is a mobile number
cycle is a billing cycle, e.g. 2016-10
mobile_summary_type contains 3 rows, one for each usage type.
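
mobile_summary_type itself is a small lookup table, something along
these lines (simplified; the actual column types may differ):

CREATE TABLE mobile_summary_type
(
   id      text PRIMARY KEY,   -- joined against mobile_summary_usage.type
   units   text                -- e.g. minutes / messages / MB
);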

Everything was working fine until we flipped over from 2016-10 to
2016-11. Then, instead of averaging well below 0.5 seconds to
respond, Postgres started taking over a second.
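
For reference, the plans below come from running the query through
EXPLAIN ANALYZE with the placeholders filled in with a sample number
and cycle (the same values that appear in the plan output), along the
lines of:

EXPLAIN ANALYZE
SELECT msisdn, type, sum (total), units
FROM mobile_summary_usage msu, mobile_summary_type mst
WHERE type = id AND msisdn = '07700900331' AND cycle = '2016-10'
GROUP BY msisdn, type, units;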

Running EXPLAIN ANALYZE on the above query shows that in 2016-10, when
there are approximately 100M rows, Postgres uses the compound (msisdn,
cycle) index. This has a cost of 3218.98 and takes 0.071 ms.

HashAggregate  (cost=3213.12..3218.98 rows=586 width=52) (actual
time=0.071..0.071 rows=0 loops=1)
  Group Key: msu.msisdn, msu.type, mst.units
  ->  Hash Join  (cost=62.54..3205.15 rows=797 width=52) (actual
time=0.069..0.069 rows=0 loops=1)
        Hash Cond: (msu.type = mst.id)
        ->  Bitmap Heap Scan on mobile_summary_usage msu
(cost=32.74..3164.39 rows=797 width=20) (actual time=0.037..0.037
rows=0 loops=1)
              Recheck Cond: ((msisdn = '07700900331'::text) AND (cycle
= '2016-10'::text))
              ->  Bitmap Index Scan on
mobile_summary_usage_msisdn_cycle  (cost=0.00..32.54 rows=797 width=0)
(actual time=0.036..0.036 rows=0 loops=1)
                    Index Cond: ((msisdn = '07700900331'::text) AND
(cycle = '2016-10'::text))
        ->  Hash  (cost=18.80..18.80 rows=880 width=64) (actual
time=0.026..0.026 rows=4 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 9kB
              ->  Seq Scan on mobile_summary_type mst
(cost=0.00..18.80 rows=880 width=64) (actual time=0.014..0.016 rows=4
loops=1)
Planning time: 0.197 ms
Execution time: 0.125 ms


When I re-run the query for 2016-11 (currently 4M rows), Postgres uses
the simpler "cycle" index. The cost is 12.79 but the actual time taken
is 1412.609 ms.

HashAggregate  (cost=12.78..12.79 rows=1 width=52) (actual
time=1412.609..1412.609 rows=0 loops=1)
Execution time: 1412.674 ms
  Group Key: msu.msisdn, msu.type, mst.units
  ->  Nested Loop  (cost=0.72..12.77 rows=1 width=52) (actual
time=1412.606..1412.606 rows=0 loops=1)
        ->  Index Scan using mobile_summary_usage_cycle on
mobile_summary_usage msu  (cost=0.57..4.59 rows=1 width=20) (actual
time=1412.604..1412.604 rows=0 loops=1)
        ->  Index Scan using mobile_summary_type_pkey on
mobile_summary_type mst  (cost=0.15..8.17 rows=1 width=64) (never
executed)
              Rows Removed by Filter: 3932875
              Index Cond: (id = msu.type)
              Index Cond: (cycle = '2016-11'::text)
              Filter: (msisdn = '07700900331'::text)



I understand there are a whole host of reasons why Postgres may choose
different plans based on data volumes, but in this case, despite the
lower cost, the performance is significantly worse. Is there any
explanation for why it's making such a poor decision, and any
recommendations for how to fix it?

Any help appreciated.

Re: Why is the optimiser choosing a sub-optimal plan?

From: Tom Lane
Stephen Cresswell <pgsql-performance@stephen-cresswell.net> writes:
> I have a table with two indexes...

(1) Tell us about the other table, mobile_summary_type.

(2) Did you transcribe the second query plan correctly?  I have a hard
time believing that EXPLAIN printed two Index Cond lines for the same
indexscan.

(3) What PG version is this, exactly?

(4) Are you doing anything funny like disabling autovacuum/autoanalyze?
The rowcount estimates in the "good" plan seem rather far away from
reality, and it's not obvious why, particularly here:

>               ->  Seq Scan on mobile_summary_type mst
> (cost=0.00..18.80 rows=880 width=64) (actual time=0.014..0.016 rows=4
> loops=1)
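
For example, the exact server version and the last (auto)analyze times
for the two tables can be checked with something like:

SELECT version();

SELECT relname, last_analyze, last_autoanalyze, last_autovacuum
FROM pg_stat_user_tables
WHERE relname IN ('mobile_summary_usage', 'mobile_summary_type');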

            regards, tom lane