Re: PostgreSQL's query planner is using the wrong index, what can I do to improve this situation? - Mailing list pgsql-performance

From David G Johnston
Subject Re: PostgreSQL's query planner is using the wrong index, what can I do to improve this situation?
Date
Msg-id 1399236684254-5802377.post@n5.nabble.com
Whole thread Raw
List pgsql-performance
olavgg wrote
> I have a table with 4 indexes =>
>     "stock_trade_pkey" PRIMARY KEY, btree (id)
>     "stock_trade_source_idx" btree (source_id)
>     "stock_trade_stock_id_time_idx" btree (stock_id, "time")
>     "stock_trade_time_idx" btree ("time")
>
> This table store time series data, basically every trade happening on a
> stock every day.
>
> However I have two similar queries that use completely different index,
> which has HUGE impact on performance.
>
> *********** QUERY START ************
> myfinance=> EXPLAIN (buffers,analyze)
> SELECT COUNT(1)
> FROM stock_trade st
> WHERE st.stock_id = any(array(
>     SELECT s.id FROM stock s WHERE s.exchange_id IN(1,2,3))
> )
> AND st.time BETWEEN '2014-04-22 00:00' AND '2014-04-22 23:59';
>
> QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=19148.27..19148.37 rows=1 width=0) (actual
> time=3644.474..3644.475 rows=1 loops=1)
>    Buffers: shared hit=5994 read=1524
>    InitPlan 1 (returns $0)
>      ->  Index Scan using stock_exchange_idx on stock s
> (cost=28.38..794.17 rows=1482 width=8) (actual time=0.066..4.412 rows=1486
> loops=1)
>            Index Cond: (exchange_id = ANY ('{1,2,3}'::bigint[]))
>            Buffers: shared hit=34
>    ->  Index Only Scan using stock_trade_stock_id_time_idx on stock_trade
> st  (cost=58.50..14380.10 rows=15896 width=0) (actual time=8.033..3071.828
> rows=395019 loops=1)
>          Index Cond: ((stock_id = ANY ($0)) AND ("time" >= '2014-04-22
> 00:00:00'::timestamp without time zone) AND ("time" <= '2014-04-22
> 23:59:00'::timestamp without time zone))
>          Heap Fetches: 0
>          Buffers: shared hit=5994 read=1524
>  Total runtime: 3644.604 ms
> *********** QUERY END ************
>
> This query is using the 'stock_trade_stock_id_time_idx' multi-column
> index, with good performance.
> However once I change the date to a more recent one, it is suddenly using
> another and MUCH slower index...
>
> *********** QUERY START ************
> myfinance=> EXPLAIN (buffers,analyze)
> SELECT COUNT(1)
> FROM stock_trade st
> WHERE st.stock_id = any(array(
>     SELECT s.id FROM stock s WHERE s.exchange_id IN(1,2,3))
> )
> AND st.time BETWEEN '2014-05-02 00:00' AND '2014-05-02 23:59';
>
> QUERY PLAN
>
---------------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=859.78..859.88 rows=1 width=0) (actual
> time=115505.403..115505.405 rows=1 loops=1)
>    Buffers: shared hit=4433244
>    InitPlan 1 (returns $0)
>      ->  Index Scan using stock_exchange_idx on stock s
> (cost=28.38..794.17 rows=1482 width=8) (actual time=0.047..4.361 rows=1486
> loops=1)
>            Index Cond: (exchange_id = ANY ('{1,2,3}'::bigint[]))
>            Buffers: shared hit=34
>    ->  Index Scan using stock_trade_time_idx on stock_trade st
> (cost=57.50..65.35 rows=1 width=0) (actual time=7.415..114921.242
> rows=395834 loops=1)
>          Index Cond: (("time" >= '2014-05-02 00:00:00'::timestamp without
> time zone) AND ("time" <= '2014-05-02 23:59:00'::timestamp without time
> zone))
>          Filter: (stock_id = ANY ($0))
>          Rows Removed by Filter: 6903136
>          Buffers: shared hit=4433244
>  Total runtime: 115505.545 ms
> *********** QUERY END ************
>
> As you see, now it is using the 'stock_trade_time_idx' index.
> I have a similar problem when using IN or EXISTS for stock_id's, it will
> automatically chose the wrong index. But when I tried with
> any(array($subquery)), the right index would be chosen for data that is a
> few days old(Not sure why the query planner is behaving like this).
>
> I've tried running VACUUM and ANALYZE without any effect. Are there other
> things I can do?

I suspect the the index-only aspect of the first plan is what is giving the
largest performance boost. As time passes the likelihood of having data be
all-visible increases.  I do not know how or if it is possible to force
visibility for this purpose.

Time is likely more selective than stock_id so for the multiple-column index
time should probably the first listed field.

The planner figures being more selective and filtering is going to be faster
than scanning the much larger section of index covered by the stock_id(s)
and then going and fetching those pages and then checking them for
visibility.  But if it can get most or all of the data directly from the
index then the savings are substantial enough to use the compound index.
Others will comment more broadly on the trade-off between the two
indexes/plans but the answer in this case is that the best index would be
(time, stock); I don't think you can improve the query of the more recent
data beyond that since index-only scans are not likely something you can
force in this situation.

David J.





--
View this message in context:
http://postgresql.1045698.n5.nabble.com/PostgreSQL-s-query-planner-is-using-the-wrong-index-what-can-I-do-to-improve-this-situation-tp5802349p5802377.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.


pgsql-performance by date:

Previous
From: Vitalii Tymchyshyn
Date:
Subject: Re: Optimize query for listing un-read messages
Next
From: Stelios Mavromichalis
Date:
Subject: Re: recently and selectively slow, but very simple, update query....