Thread: The planner hates me.
"PostgreSQL 8.2.4 on i386-pc-solaris2.10, compiled by GCC gcc (GCC) 3.4.3 (csl-sol210-3_4-branch+sol_rpath)" (test environment) Picture a table called 'transaction' with 1 million rows. most (> 99% of the records have date1 and date2 values in the past (spread over 4 years) 99.99% of the records have a state OTHER than 'I'. CREATE TABLE transaction ( amount numeric(10,2) NOT NULL, date1 date NOT NULL, state character(1) NOT NULL, date2 date DEFAULT date(now()) ) CREATE INDEX t_date1_index ON transaction USING btree (date1); CREATE INDEX t_date2_index ON transaction USING btree (date2); CREATE INDEX t_state_index ON transaction USING btree (state); explain analyze select sum(amount), dates.date as date from transaction t join (select get_dates as date from get_dates('09/17/08','09/24/08')) dates on (t.state='I' or t.date1 >= dates.date) group by dates.date get_dates simply returns each date between (and including) the passed dates ...and is a 'stable' function). Yes...I know I probably could have used a generate_series or something...but this was written before I knew anything about that "HashAggregate (cost=1290485.15..1290485.40 rows=200 width=17) (actual time=277.804..277.809 rows=8 loops=1)" " -> Nested Loop (cost=270.37..1123134.88 rows=334700533 width=17) (actual time=3.182..153.741 rows=120536 loops=1)" " -> Function Scan on get_dates (cost=0.00..1.25 rows=1000 width=4) (actual time=0.057..0.065 rows=8 loops=1)" " -> Bitmap Heap Scan on "transaction" t (cost=270.37..618.60 rows=336357 width=22) (actual time=3.093..10.958 rows=15067 loops=8)" " Recheck Cond: ((t.state = 'I'::bpchar) OR (t.date1 >= get_dates.get_dates))" " -> BitmapOr (cost=270.37..270.37 rows=336357 width=0) (actual time=2.853..2.853 rows=0 loops=8)" " -> Bitmap Index Scan on t_state_index (cost=0.00..8.97 rows=4971 width=0) (actual time=2.842..2.842 rows=15067 loops=8)" " Index Cond: (state = 'I'::bpchar)" " -> Bitmap Index Scan on t_date1_index (cost=0.00..252.99 rows=331387 width=0) (actual 
time=0.009..0.009 rows=2 loops=8)" " Index Cond: (t.date1 >= get_dates.get_dates)" "Total runtime: 277.883 ms" Uses bitmap scans for optimization and performs admirably. BUT...when I change the query thusly... (adding in an addition AND clause) explain analyze select sum(amount), dates.date as date from transaction t join (select get_dates as date from get_dates('09/17/08','09/24/08')) dates on (t.state='I' or t.date1 >= dates.date) and t.date2 < dates.date group by dates.date "HashAggregate (cost=1222618.09..1222618.34 rows=200 width=15) (actual time=7538.193..7538.199 rows=8 loops=1)" " -> Nested Loop (cost=0.00..1166174.15 rows=112887885 width=15) (actual time=0.889..7411.997 rows=120522 loops=1)" " Join Filter: ((t.state = 'I'::bpchar) OR (t.date1 >= get_dates.get_dates))" " -> Function Scan on get_dates (cost=0.00..1.25 rows=1000 width=4) (actual time=0.055..0.062 rows=8 loops=1)" " -> Index Scan using t_date2_index on "transaction" t (cost=0.00..590.77 rows=328800 width=24) (actual time=0.018..492.348 rows=986273 loops=8)" " Index Cond: (t.date2 < get_dates.get_dates)" "Total runtime: 7538.259 ms" I am pulling my hair out. Why does it insist on using the index on the date2 field? It was doing so well with the bitmap indexing on the other fields...if the planner would simply do the same with date2, all would be well. I really need The date2 portion of the query to filter AFTER the first two clauses, because those are quicker and narrows the data down to a manageable size.... Any suggestions on query or index changes? I still need the date2 index for other queries....but the fact that it gets used exclusively for this query drives me batty!. I've tried various combinations of multi-column indexes to no avail...because I still need the date2 index for other reasons, this query ALWAYS chooses it for some reason. Any help would be appreciated.
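(For readers following along: the get_dates() function itself isn't shown in the thread, only described. A minimal Python sketch of the behavior it's described as having, with an illustrative name, would be:)

```python
from datetime import date, timedelta

def get_dates(start, end):
    """Yield each date from start through end, inclusive --
    mirroring the set-returning SQL function described above."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

# The '09/17/08'..'09/24/08' range used in the query covers 8 dates,
# matching the rows=8 reported by the Function Scan in the plan.
dates = list(get_dates(date(2008, 9, 17), date(2008, 9, 24)))
```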
"Jeff Amiel" <JAmiel@istreamimaging.com> writes: > select sum(amount), dates.date as date > from transaction t > join (select get_dates as date from > get_dates('09/17/08','09/24/08')) dates on > (t.state='I' or t.date1 >= dates.date) and t.date2 > < dates.date > group by dates.date The problem you've got here is that the planner has got absolutely no visibility into the behavior of get_dates(). In particular it doesn't realize that the values being generated are close to the end of the range of dates that are in the table, and thus the date1 >= dates.date condition is far more selective than the date2 < dates.date condition. If you look closely at the rowcount estimates you'll see that those are actually being estimated the same, to within roundoff error. So looking at two indexes instead of one doesn't look like a win to it. So far as I can see the only way to improve this is to break it up into separate queries, one for each date, with the date being specified explicitly in the query. regards, tom lane
-----Original Message-----
From: Hoover, Jeffrey [mailto:jhoover@jcvi.org]

Change t.date2 < dates.date to t.date2+0 < dates.date. This will prevent the query from trying to use the index on date2, because the clause now references an expression and not the column itself:

    explain analyze
    select sum(amount), dates.date as date
    from transaction t
    join (select get_dates as date
          from get_dates('09/17/08','09/24/08')) dates
      on (t.state='I' or t.date1 >= dates.date)
     and t.date2+0 < dates.date
    group by dates.date

Yup... that did the trick. Whodathunkit?
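(The effect Hoover is exploiting -- a plain b-tree index becomes unusable once the indexed column is wrapped in any expression -- is easy to reproduce outside PostgreSQL. A small sketch using Python's sqlite3 as a stand-in planner; table and column names are illustrative, and since the demo column is TEXT the no-op expression here is `|| ''` rather than the `+0` used on the date column above:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (amount REAL, date2 TEXT)")
conn.execute("CREATE INDEX t_date2_index ON txn (date2)")

# Plain column comparison: the planner is free to use t_date2_index.
plan_plain = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM txn WHERE date2 < '2008-09-17'"
).fetchall()

# Wrapping the column in a no-op expression hides the index from the
# planner, forcing a sequential scan -- the same trick as date2+0.
plan_expr = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM txn WHERE date2 || '' < '2008-09-17'"
).fetchall()

# The plan detail text is the last field of each row: the first plan
# typically reads SEARCH ... USING INDEX t_date2_index, the second SCAN.
```

In PostgreSQL the cleaner modern fix is an OFFSET 0 subquery or (on recent versions) adjusting statistics, but the expression trick works on 8.2.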
On Thu, Sep 25, 2008 at 8:24 AM, Jeff Amiel <JAmiel@istreamimaging.com> wrote:
> "PostgreSQL 8.2.4 on i386-pc-solaris2.10, compiled by GCC gcc (GCC)
> 3.4.3 (csl-sol210-3_4-branch+sol_rpath)"
> (test environment)

Update your pgsql version to 8.2.10 or whatever is latest. There was some pathological planner behaviour that Tom fixed; I believe it was fixed between those versions.

> Picture a table called 'transaction' with 1 million rows.
> Most (> 99%) of the records have date1 and date2 values in the past
> (spread over 4 years).
> 99.99% of the records have a state OTHER than 'I'.

I'd definitely make a partial or functional index on that, then. If you test for not 'I', make it on the date with that:

    create index xyz on transaction (somedate_field[, another_date_field]...)
    where state <> 'I';

I don't know if that would help, but I've seen it give the planner a hint and produce a useful index. Or, for queries where it needs the opposite, same as before but with ... where state = 'I'. Just a thought. Functional / partial indexes can make a big difference if you get them right, but don't create one unless you know it helps, because of the update costs.
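(A runnable sketch of the partial-index idea, again using sqlite3 as a stand-in since SQLite also supports partial indexes; names are illustrative. Because 99.99% of rows have state <> 'I', an index on date1 restricted to the rare state = 'I' rows is tiny, and the planner can use it whenever the query's WHERE clause implies the index predicate:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (amount REAL, state TEXT, date1 TEXT)")

# Partial index in the spirit of Scott's suggestion: index date1
# only for the rare state = 'I' rows.
conn.execute("CREATE INDEX t_date1_i ON txn (date1) WHERE state = 'I'")

# A query whose predicate implies state = 'I' can use the partial index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM txn WHERE state = 'I' AND date1 >= '2008-09-17'"
).fetchall()
# The plan detail (last field) typically mentions t_date1_i.
```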
On Thu, Sep 25, 2008 at 9:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The problem you've got here is that the planner has got absolutely no
> visibility into the behavior of get_dates(). In particular it doesn't
> realize that the values being generated are close to the end of the
> range of dates that are in the table, and thus the date1 >= dates.date
> condition is far more selective than the date2 < dates.date condition.
> If you look closely at the rowcount estimates you'll see that those are
> actually being estimated the same, to within roundoff error. So looking
> at two indexes instead of one doesn't look like a win to it.

Couldn't they make a simple immutable function and index on that?
"Scott Marlowe" <scott.marlowe@gmail.com> writes: > On Thu, Sep 25, 2008 at 9:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The problem you've got here is that the planner has got absolutely no >> visibility into the behavior of get_dates(). > Couldn't they make a simple immutable function and index on that? Maybe, but I can't think of a different index definition that would work better. The Right Way (tm) to do this would be something like create temp table dates as select * from get_dates(...); analyze dates; ... original select, but join against temp table ... which would leave the planner armed with some stats about the range of dates of interest. Unfortunately, this isn't going to help Jeff today, because scalargtjoinsel/scalarltjoinsel are just stubs :-(. It's not an area of the planner than anyone's ever got round to working on. regards, tom lane
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]

The Right Way (tm) to do this would be something like

    create temp table dates as select * from get_dates(...);
    analyze dates;
    ... original select, but join against temp table ...

which would leave the planner armed with some stats about the range of dates of interest.
------------

I actually tried that, but minus the analyze..... :(

Btw, congrats to whoever contributed the bitmap scans (v8.1, I believe). The ability to use multiple indexes for a single table in the same query is fantastic!!
Change t.date2 < dates.date to t.date2+0 < dates.date. This will prevent the query from trying to use the index on date2, because the clause now references an expression and not the column itself:

    explain analyze
    select sum(amount), dates.date as date
    from transaction t
    join (select get_dates as date
          from get_dates('09/17/08','09/24/08')) dates
      on (t.state='I' or t.date1 >= dates.date)
     and t.date2+0 < dates.date
    group by dates.date

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Jeff Amiel
Sent: Thursday, September 25, 2008 10:24 AM
To: pgsql-general@postgresql.org
Subject: [GENERAL] The planner hates me.

[original message snipped]