Re: Curious problem of using BETWEEN with start and end being the same versus EQUALS '=' - Mailing list pgsql-sql
From | Venky Kandaswamy
---|---
Subject | Re: Curious problem of using BETWEEN with start and end being the same versus EQUALS '='
Date |
Msg-id | 776CCF725798BE4ABEC2521B9AEDBB354319E66C@BY2PRD0511MB429.namprd05.prod.outlook.com
In response to | Re: Curious problem of using BETWEEN with start and end being the same versus EQUALS '=' (Tom Lane <tgl@sss.pgh.pa.us>)
List | pgsql-sql
Thanks for the quick and detailed response, Tom. Yes, I did add a redundant WHERE clause with a restriction on b.date_id to the range queries. This appears to speed things up, since it does an index scan on the b table before the merge join. We will get more intelligent about query generation (our system generates queries on the fly) to work around this problem.

________________________________________
Venky Kandaswamy
Principal Engineer, Adchemy Inc.
925-200-7124
________________________________________
From: Tom Lane [tgl@sss.pgh.pa.us]
Sent: Tuesday, January 15, 2013 2:30 PM
To: Venky Kandaswamy
Cc: pgsql-general@postgresql.org; pgsql-sql@postgresql.org
Subject: Re: [SQL] Curious problem of using BETWEEN with start and end being the same versus EQUALS '='

Venky Kandaswamy <venky@adchemy.com> writes:
> On 9.1, I am running into a curious issue.

It's not very curious at all, or at least people on pgsql-performance (the right list for this sort of question) would have figured it out quickly. You're getting a crummy plan because of a crummy row estimate. When you do this:

> WHERE a.date_id = 20120228

you get this:

> "  ->  Index Scan using alps_agg_date_id on bi2003.alps_agg a  (cost=0.00..17870.00 rows=26292 width=1350) (actual time=0.047..142.383 rows=36132 loops=1)"
> "        Output: a.date_id, a.page_group, a.page, a.int_alloc_type, a.componentset, a.adc_visit, upper((a.adc_visit)::text)"
> "        Index Cond: (a.date_id = 20120228)"
> "        Filter: ((a.page)::text = 'ddi_671'::text)"

26K estimated rows versus 36K actual isn't the greatest estimate in the world, but it's plenty good enough.
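The redundant-restriction workaround Venky describes might look roughly like this. This is a sketch, not the actual generated query: only the table names, the date_id column, and the output columns come from the EXPLAIN fragments quoted in this thread; the join shape and select list are assumptions.

```sql
-- Hypothetical shape of the generated query. With a range predicate the
-- planner does not propagate the restriction across the equijoin on its own,
-- so the same range is repeated on b.date_id explicitly.
SELECT a.date_id, a.page, b.eventcountset
FROM bi2003.alps_agg a
JOIN bi2003.event_agg b ON b.date_id = a.date_id
WHERE a.date_id BETWEEN 20120228 AND 20120228
  AND b.date_id BETWEEN 20120228 AND 20120228  -- redundant, added by hand
  AND a.page = 'ddi_671';
```

The second range predicate is logically implied by the join condition, but spelling it out lets the planner use the index on event_agg.date_id directly, which is the speedup Venky reports.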
But when you do this:

> WHERE a.date_id BETWEEN 20120228 AND 20120228

you get this:

> "  ->  Index Scan using alps_agg_date_id on bi2003.alps_agg a  (cost=0.00..10.12 rows=1 width=1350)"
> "        Output: a.date_id, a.adc_visit, a.page_group, a.page, a.int_alloc_type, a.componentset, a.variation_tagset, a.page_instance"
> "        Index Cond: ((a.date_id >= 20120228) AND (a.date_id <= 20120228))"
> "        Filter: ((a.page)::text = 'ddi_671'::text)"

so the bogus estimate of only one row causes the planner to pick an entirely different plan, which would probably be a great choice if there were indeed only one such row, but with 36000 of them it's horrid.

The reason the row estimate is so crummy is that a zero-width interval is an edge case for range estimates. We've seen this before, although usually it's not quite this bad. There's been some talk of making the estimate for "x >= a AND x <= b" always be at least as much as the estimate for "x = a", but this would increase the cost of making the estimate by quite a bit, and make things actually worse in some cases (in particular, if a > b then a nil estimate is indeed the right thing).

You might look into whether queries formed like "date_id >= 20120228 AND date_id < 20120229" give you more robust estimates at the edge cases.

BTW, I notice in your EXPLAIN results that the same range restriction has been propagated to b.date_id:

> "  ->  Index Scan using event_agg_date_id on bi2003.event_agg b  (cost=0.00..10.27 rows=1 width=1694)"
> "        Output: b.date_id, b.vcset, b.eventcountset, b.eventvalueset"
> "        Index Cond: ((b.date_id >= 20120228) AND (b.date_id <= 20120228))"

I'd expect that to happen automatically for a simple equality constraint, but not for a range constraint. Did you do that manually and not tell us about it?

                        regards, tom lane
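Tom's suggested half-open formulation, sketched against the same hypothetical query shape (table and column names are taken from the EXPLAIN output quoted above; everything else is assumed):

```sql
-- Half-open range: for integer date_id values this selects exactly the same
-- rows as BETWEEN 20120228 AND 20120228, but the interval is no longer
-- zero-width, sidestepping the edge case in the planner's range
-- selectivity estimate.
SELECT a.date_id, a.page, b.eventcountset
FROM bi2003.alps_agg a
JOIN bi2003.event_agg b ON b.date_id = a.date_id
WHERE a.date_id >= 20120228
  AND a.date_id <  20120229
  AND a.page = 'ddi_671';
```

The half-open form also generalizes cleanly to multi-day ranges (lower bound inclusive, upper bound exclusive), so a query generator can emit one pattern for both the single-day and the range case.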