Thread: The planner hates me.
"PostgreSQL 8.2.4 on i386-pc-solaris2.10, compiled by GCC gcc (GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath)" (test environment) Picture a table called 'transaction' with 1 million rows. most (> 99% of the records have date1 and date2 values in the past (spread over 4 years) 99.99% of the records have a state OTHER than 'I'. CREATE TABLE transaction ( amount numeric(10,2) NOT NULL, date1 date NOT NULL, state character(1) NOT NULL, date2 date DEFAULT date(now()) ) CREATE INDEX t_date1_index ON transaction USING btree (date1); CREATE INDEX t_date2_index ON transaction USING btree (date2); CREATE INDEX t_state_index ON transaction USING btree (state); explain analyze select sum(amount), dates.date as date from transaction t join (select get_dates as date from get_dates('09/17/08','09/24/08')) dates on (t.state='I' or t.date1 >= dates.date) group by dates.date get_dates simply returns each date between (and including) the passed dates ...and is a 'stable' function). Yes...I know I probably could have used a generate_series or something...but this was written before I knew anything about that "HashAggregate (cost=1290485.15..1290485.40 rows=200 width=17) (actual time=277.804..277.809 rows=8 loops=1)" " -> Nested Loop (cost=270.37..1123134.88 rows=334700533 width=17) (actual time=3.182..153.741 rows=120536 loops=1)" " -> Function Scan on get_dates (cost=0.00..1.25 rows=1000 width=4) (actual time=0.057..0.065 rows=8 loops=1)" " -> Bitmap Heap Scan on "transaction" t (cost=270.37..618.60 rows=336357 width=22) (actual time=3.093..10.958 rows=15067 loops=8)" " Recheck Cond: ((t.state = 'I'::bpchar) OR (t.date1 >= get_dates.get_dates))" " -> BitmapOr (cost=270.37..270.37 rows=336357 width=0) (actual time=2.853..2.853 rows=0 loops=8)" " -> Bitmap Index Scan on t_state_index (cost=0.00..8.97 rows=4971 width=0) (actual time=2.842..2.842 rows=15067 loops=8)" " Index Cond: (state = 'I'::bpchar)" " -> Bitmap Index Scan on t_date1_index (cost=0.00..252.99 rows=331387 width=0) (actual 
time=0.009..0.009 rows=2 loops=8)" " Index Cond: (t.date1 >= get_dates.get_dates)" "Total runtime: 277.883 ms" Uses bitmap scans for optimization and performs admirably. BUT...when I change the query thusly... (adding in an addition AND clause) explain analyze select sum(amount), dates.date as date from transaction t join (select get_dates as date from get_dates('09/17/08','09/24/08')) dates on (t.state='I' or t.date1 >= dates.date) and t.date2 < dates.date group by dates.date "HashAggregate (cost=1222618.09..1222618.34 rows=200 width=15) (actual time=7538.193..7538.199 rows=8 loops=1)" " -> Nested Loop (cost=0.00..1166174.15 rows=112887885 width=15) (actual time=0.889..7411.997 rows=120522 loops=1)" " Join Filter: ((t.state = 'I'::bpchar) OR (t.date1 >= get_dates.get_dates))" " -> Function Scan on get_dates (cost=0.00..1.25 rows=1000 width=4) (actual time=0.055..0.062 rows=8 loops=1)" " -> Index Scan using t_date2_index on "transaction" t (cost=0.00..590.77 rows=328800 width=24) (actual time=0.018..492.348 rows=986273 loops=8)" " Index Cond: (t.date2 < get_dates.get_dates)" "Total runtime: 7538.259 ms" I am pulling my hair out. Why does it insist on using the index on the date2 field? It was doing so well with the bitmap indexing on the other fields...if the planner would simply do the same with date2, all would be well. I really need The date2 portion of the query to filter AFTER the first two clauses, because those are quicker and narrows the data down to a manageable size.... Any suggestions on query or index changes? I still need the date2 index for other queries....but the fact that it gets used exclusively for this query drives me batty!. I've tried various combinations of multi-column indexes to no avail...because I still need the date2 index for other reasons, this query ALWAYS chooses it for some reason. Any help would be appreciated.
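(For readers following along: the get_dates() function itself isn't shown in the thread, only described. A minimal Python sketch of the behavior it's described as having, with an illustrative name, would be:)

```python
from datetime import date, timedelta

def get_dates(start, end):
    """Yield each date from start through end, inclusive --
    mirroring the set-returning SQL function described above."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

# The '09/17/08'..'09/24/08' range used in the query covers 8 dates,
# matching the rows=8 reported by the Function Scan in the plan.
dates = list(get_dates(date(2008, 9, 17), date(2008, 9, 24)))
```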
"Jeff Amiel" <JAmiel@istreamimaging.com> writes: > select sum(amount), dates.date as date > from transaction t > join (select get_dates as date from > get_dates('09/17/08','09/24/08')) dates on > (t.state='I' or t.date1 >= dates.date) and t.date2 > < dates.date > group by dates.date The problem you've got here is that the planner has got absolutely no visibility into the behavior of get_dates(). In particular it doesn't realize that the values being generated are close to the end of the range of dates that are in the table, and thus the date1 >= dates.date condition is far more selective than the date2 < dates.date condition. If you look closely at the rowcount estimates you'll see that those are actually being estimated the same, to within roundoff error. So looking at two indexes instead of one doesn't look like a win to it. So far as I can see the only way to improve this is to break it up into separate queries, one for each date, with the date being specified explicitly in the query. regards, tom lane
-----Original Message-----
From: Hoover, Jeffrey [mailto:jhoover@jcvi.org]

Change t.date2 < dates.date to t.date2+0 < dates.date. This will prevent the query from trying to use the index on date2, because the clause now references an expression and not the column itself:

    explain analyze
    select sum(amount), dates.date as date
    from transaction t
    join (select get_dates as date
          from get_dates('09/17/08','09/24/08')) dates
      on (t.state='I' or t.date1 >= dates.date)
     and t.date2+0 < dates.date
    group by dates.date

Yup... that did the trick. Whodathunkit?
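(The effect Hoover is exploiting -- a plain b-tree index becomes unusable once the indexed column is wrapped in any expression -- is easy to reproduce outside PostgreSQL. A small sketch using Python's sqlite3 as a stand-in planner; table and column names are illustrative, and since the demo column is TEXT the no-op expression here is `|| ''` rather than the `+0` used on the date column above:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (amount REAL, date2 TEXT)")
conn.execute("CREATE INDEX t_date2_index ON txn (date2)")

# Plain column comparison: the planner is free to use t_date2_index.
plan_plain = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM txn WHERE date2 < '2008-09-17'"
).fetchall()

# Wrapping the column in a no-op expression hides the index from the
# planner, forcing a sequential scan -- the same trick as date2+0.
plan_expr = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM txn WHERE date2 || '' < '2008-09-17'"
).fetchall()

# The plan detail text is the last field of each row: the first plan
# typically reads SEARCH ... USING INDEX t_date2_index, the second SCAN.
```

In PostgreSQL the cleaner modern fix is an OFFSET 0 subquery or (on recent versions) adjusting statistics, but the expression trick works on 8.2.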
On Thu, Sep 25, 2008 at 8:24 AM, Jeff Amiel <JAmiel@istreamimaging.com> wrote:
> "PostgreSQL 8.2.4 on i386-pc-solaris2.10, compiled by GCC gcc (GCC)
> 3.4.3 (csl-sol210-3_4-branch+sol_rpath)"
> (test environment)

Update your pgsql version to 8.2.10 or whatever is latest. There was some pathological planner behaviour that Tom fixed; I believe it was fixed between those versions.

> Picture a table called 'transaction' with 1 million rows.
> Most (> 99%) of the records have date1 and date2 values in the past
> (spread over 4 years).
> 99.99% of the records have a state OTHER than 'I'.

I'd definitely make a partial or functional index on that, then. If you test for not 'I', make it on the date with that:

    create index xyz on transaction (somedate_field[, another_date_field]...)
    where state <> 'I';

I don't know if that would help, but I've seen it give the planner a hint and produce a useful index. Or, for queries where it needs the opposite, same as before but with ... where state = 'I'. Just a thought. Functional / partial indexes can make a big difference if you get them right, but don't create one unless you know it helps, because of the update costs.
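(A runnable sketch of the partial-index idea, again using sqlite3 as a stand-in since SQLite also supports partial indexes; names are illustrative. Because 99.99% of rows have state <> 'I', an index on date1 restricted to the rare state = 'I' rows is tiny, and the planner can use it whenever the query's WHERE clause implies the index predicate:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (amount REAL, state TEXT, date1 TEXT)")

# Partial index in the spirit of Scott's suggestion: index date1
# only for the rare state = 'I' rows.
conn.execute("CREATE INDEX t_date1_i ON txn (date1) WHERE state = 'I'")

# A query whose predicate implies state = 'I' can use the partial index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM txn WHERE state = 'I' AND date1 >= '2008-09-17'"
).fetchall()
# The plan detail (last field) typically mentions t_date1_i.
```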
On Thu, Sep 25, 2008 at 9:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The problem you've got here is that the planner has got absolutely no
> visibility into the behavior of get_dates(). In particular it doesn't
> realize that the values being generated are close to the end of the
> range of dates that are in the table, and thus the date1 >= dates.date
> condition is far more selective than the date2 < dates.date condition.
> If you look closely at the rowcount estimates you'll see that those are
> actually being estimated the same, to within roundoff error. So looking
> at two indexes instead of one doesn't look like a win to it.

Couldn't they make a simple immutable function and index on that?
"Scott Marlowe" <scott.marlowe@gmail.com> writes: > On Thu, Sep 25, 2008 at 9:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The problem you've got here is that the planner has got absolutely no >> visibility into the behavior of get_dates(). > Couldn't they make a simple immutable function and index on that? Maybe, but I can't think of a different index definition that would work better. The Right Way (tm) to do this would be something like create temp table dates as select * from get_dates(...); analyze dates; ... original select, but join against temp table ... which would leave the planner armed with some stats about the range of dates of interest. Unfortunately, this isn't going to help Jeff today, because scalargtjoinsel/scalarltjoinsel are just stubs :-(. It's not an area of the planner than anyone's ever got round to working on. regards, tom lane
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

The Right Way (tm) to do this would be something like

    create temp table dates as select * from get_dates(...);
    analyze dates;
    ... original select, but join against temp table ...

which would leave the planner armed with some stats about the range of dates of interest.
------------

I actually tried that, but minus the analyze..... :(

Btw, congrats to whoever contributed the bitmap scans (v8.1, I believe). The ability to use multiple indexes for a single table in the same query is fantastic!!
Change t.date2 < dates.date to t.date2+0 < dates.date. This will prevent the query from trying to use the index on date2, because the clause now references an expression and not the column itself:

    explain analyze
    select sum(amount), dates.date as date
    from transaction t
    join (select get_dates as date
          from get_dates('09/17/08','09/24/08')) dates
      on (t.state='I' or t.date1 >= dates.date)
     and t.date2+0 < dates.date
    group by dates.date

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Jeff Amiel
Sent: Thursday, September 25, 2008 10:24 AM
To: pgsql-general@postgresql.org
Subject: [GENERAL] The planner hates me.

[original message snipped]