Index usage with sub select or inner joins - Mailing list pgsql-performance

From Julien Theulier
Subject Index usage with sub select or inner joins
Date
Msg-id 003c01c944c9$c1d92ac0$458b8040$@com
Whole thread Raw
In response to Re: Using index for IS NULL query  (Tomas Vondra <tv@fuzzy.cz>)
Responses Re: Index usage with sub select or inner joins  (Joshua Tolley <eggyknap@gmail.com>)
List pgsql-performance
Hello,

I am doing some performances testing on Postgres & I discovered the
following behavior, when using 2 different ways of writing selects (but
doing the same aggregations at the end):
1. test case 1, using outer join:
create table test2 as
select
soj_session_log_id, pv_timestamp, vi_pv_id,a.item_id,
coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp -
INTERVAL '3 day') then 1 else 0 end)) as recent_sales_3d1,
coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp -
INTERVAL '7 day') then 1 else 0 end)) as recent_sales_7d1,
coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp -
INTERVAL '14 day') then 1 else 0 end)) as recent_sales_14d1,
coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp -
INTERVAL '30 day') then 1 else 0 end)) as recent_sales_30d1,
coalesce(sum(case when (bid_date<pv_timestamp and bid_date>=pv_timestamp -
INTERVAL '60 day') then 1 else 0 end)) as recent_sales_60d1
from bm_us_views_main_1609 a
left outer join bm_us_bids b on (b.item_id=a.item_id and
b.bid_date<a.pv_timestamp and (b.bid_date>=a.pv_timestamp - INTERVAL '60
day'))
where a.item_type in (7,9) and qty>1
group by soj_session_log_id, pv_timestamp, vi_pv_id, a.item_id;;

This query doesn't use any index according to the explain plan:
"HashAggregate  (cost=672109.07..683054.81 rows=182429 width=49)"
"  ->  Merge Left Join  (cost=646489.83..668004.42 rows=182429 width=49)"
"        Merge Cond: (a.item_id = b.item_id)"
"        Join Filter: ((b.bid_date < a.pv_timestamp) AND (b.bid_date >=
(a.pv_timestamp - '60 days'::interval)))"
"        ->  Sort  (cost=331768.62..332224.69 rows=182429 width=41)"
"              Sort Key: a.item_id"
"              ->  Seq Scan on bm_us_views_main_1609 a
(cost=0.00..315827.08 rows=182429 width=41)"
"                    Filter: ((item_type = ANY ('{7,9}'::numeric[])) AND
(qty > 1))"
"        ->  Sort  (cost=314669.01..320949.52 rows=2512205 width=19)"
"              Sort Key: b.item_id"
"              ->  Seq Scan on bm_us_bids b  (cost=0.00..47615.05
rows=2512205 width=19)"

2. Test case 2, using sub queries:
create table test2 as
select
soj_session_log_id, pv_timestamp, vi_pv_id,item_id,
coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and
bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '3 day' group by
item_id ),0) as recent_sales_3d,
coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and
bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '7 day' group by
item_id ),0) as recent_sales_7d,
coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and
bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '14 day' group
by item_id ),0) as recent_sales_14d,
coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and
bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '30 day' group
by item_id ),0) as recent_sales_30d,
coalesce((select count(*) from bm_us_bids b where b.item_id=a.item_id and
bid_date<pv_timestamp and bid_date>=pv_timestamp - INTERVAL '60 day' group
by item_id ),0) as recent_sales_60d
from bm_us_views_main_1609 a
where item_type in (7,9) and qty>1;

This query uses indexes  according to the explain plan:
"Seq Scan on bm_us_views_main_1609 a  (cost=0.00..8720230.77 rows=182429
width=41)"
"  Filter: ((item_type = ANY ('{7,9}'::numeric[])) AND (qty > 1))"
"  SubPlan"
"    ->  GroupAggregate  (cost=0.00..9.21 rows=1 width=11)"
"          ->  Index Scan using bm_us_bids_item_ix on bm_us_bids b
(cost=0.00..9.20 rows=1 width=11)"
"                Index Cond: ((item_id = $0) AND (bid_date < $1) AND
(bid_date >= ($1 - '60 days'::interval)))"
"    ->  GroupAggregate  (cost=0.00..9.21 rows=1 width=11)"
"          ->  Index Scan using bm_us_bids_item_ix on bm_us_bids b
(cost=0.00..9.20 rows=1 width=11)"
"                Index Cond: ((item_id = $0) AND (bid_date < $1) AND
(bid_date >= ($1 - '30 days'::interval)))"
"    ->  GroupAggregate  (cost=0.00..9.21 rows=1 width=11)"
"          ->  Index Scan using bm_us_bids_item_ix on bm_us_bids b
(cost=0.00..9.20 rows=1 width=11)"
"                Index Cond: ((item_id = $0) AND (bid_date < $1) AND
(bid_date >= ($1 - '14 days'::interval)))"
"    ->  GroupAggregate  (cost=0.00..9.21 rows=1 width=11)"
"          ->  Index Scan using bm_us_bids_item_ix on bm_us_bids b
(cost=0.00..9.20 rows=1 width=11)"
"                Index Cond: ((item_id = $0) AND (bid_date < $1) AND
(bid_date >= ($1 - '7 days'::interval)))"
"    ->  GroupAggregate  (cost=0.00..9.21 rows=1 width=11)"
"          ->  Index Scan using bm_us_bids_item_ix on bm_us_bids b
(cost=0.00..9.20 rows=1 width=11)"
"                Index Cond: ((item_id = $0) AND (bid_date < $1) AND
(bid_date >= ($1 - '3 days'::interval)))"

The index bm_us_bids_item_ix is on columns item_id, bidder_id, bid_date


QUESTION: Why the planner choose seq scan in the first case & indexes scan
in the second case? In a more general way, I observed that the planner has
difficulties to select index scans & does in almost all the cases seq scan,
when doing join queries. After investigations, it looks like when you join
table a with table b on a column x and y and you have an index on column x
only, the planner is not able to choose the index scan. You have to build
the index corresponding exactly to the join statement btw the 2 tables

For example,by creating an new index on item_id and bid_date, the planner
has been able to choose this last index in both cases. Would it be possible
that the planner can choose in any case the closest index for queries having
outer join

Last thing, I am running Postgres 8.3.4 on a Windows laptop having 3.5Gb
RAM, 161Gb disk and dual core 2.5Gz processor

Regards,
Julien Theulier

Attachment

pgsql-performance by date:

Previous
From: Matthew Wakeling
Date:
Subject: Re: Using index for IS NULL query
Next
From: Joshua Tolley
Date:
Subject: Re: Index usage with sub select or inner joins