Thread: How to force Nested Loop plan?

How to force Nested Loop plan?

From
Rob Nagler
Date:
I'm trying to understand how I can get the planner to always do the
right thing with this query:

    EXPLAIN ANALYZE
    SELECT
    aa_t.min_date_time
    FROM
    aa_t
    , bb_t
    , cc_t
    WHERE bb_t.bb_id = aa_t.bb_id
    AND aa_t.realm_id = cc_t.realm_id
    AND aa_t.server_id = 21
    ORDER BY aa_t.min_date_time desc
    LIMIT 1
    OFFSET 674
    ;

There's an extreme elbow in the performance curve around the sum of
LIMIT and OFFSET.  The two plans follow.  First for the query above:

 Limit  (cost=21569.56..21601.56 rows=1 width=84) (actual time=59.60..59.69 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..110535.66 rows=3454 width=84) (actual time=0.19..59.20 rows=676 loops=1)
         ->  Nested Loop  (cost=0.00..93177.46 rows=3454 width=65) (actual time=0.14..44.41 rows=676 loops=1)
                ->  Index Scan Backward using aa_t20 on aa_t  (cost=0.00..76738.77 rows=3454 width=46) (actual time=0.10..31.30 rows=676 loops=1)
                      Filter: (server_id = 21::numeric)
                ->  Index Scan using cc_t1 on cc_t  (cost=0.00..4.75 rows=1 width=19) (actual time=0.01..0.01 rows=1 loops=676)
                      Index Cond: ("outer".realm_id = cc_t.realm_id)
          ->  Index Scan using bb_t1 on bb_t  (cost=0.00..5.01 rows=1 width=19) (actual time=0.02..0.02 rows=1 loops=676)
                Index Cond: (bb_t.bb_id = "outer".bb_id)
 Total runtime: 59.89 msec
(10 rows)

Setting OFFSET to 675 in the above query results in this
100-times-slower plan:

 Limit  (cost=21614.48..21614.48 rows=1 width=84) (actual time=4762.39..4762.39 rows=1 loops=1)
   ->  Sort  (cost=21612.79..21621.42 rows=3454 width=84) (actual time=4761.45..4761.92 rows=677 loops=1)
         Sort Key: aa_t.min_date_time
         ->  Merge Join  (cost=21139.96..21409.80 rows=3454 width=84) (actual time=4399.80..4685.24 rows=41879 loops=1)
               Merge Cond: ("outer".bb_id = "inner".bb_id)
               ->  Sort  (cost=8079.83..8184.53 rows=41879 width=19) (actual time=936.99..967.37 rows=41879 loops=1)
                     Sort Key: bb_t.bb_id
                      ->  Seq Scan on bb_t  (cost=0.00..4864.79 rows=41879 width=19) (actual time=0.06..729.60 rows=41879 loops=1)
                ->  Sort  (cost=13060.13..13068.76 rows=3454 width=65) (actual time=3462.76..3493.97 rows=41879 loops=1)
                      Sort Key: aa_t.bb_id
                      ->  Merge Join  (cost=12794.42..12857.14 rows=3454 width=65) (actual time=2923.62..3202.78 rows=41879 loops=1)
                            Merge Cond: ("outer".realm_id = "inner".realm_id)
                            ->  Sort  (cost=12762.78..12771.41 rows=3454 width=46) (actual time=2920.78..2950.87 rows=41879 loops=1)
                                  Sort Key: aa_t.realm_id
                                  ->  Index Scan using aa_t5 on aa_t  (cost=0.00..12559.79 rows=3454 width=46) (actual time=0.18..2589.22 rows=41879 loops=1)
                                        Index Cond: (server_id = 21::numeric)
                            ->  Sort  (cost=31.64..32.78 rows=455 width=19) (actual time=2.54..33.12 rows=42163 loops=1)
                                  Sort Key: cc_t.realm_id
                                  ->  Seq Scan on cc_t  (cost=0.00..11.55 rows=455 width=19) (actual time=0.04..0.86 rows=455 loops=1)
 Total runtime: 4792.84 msec
(20 rows)

Twiddling effective_cache_size and random_page_cost allows for a
larger LIMIT+OFFSET number, but not large enough.  These tests were
made with effective_cache_size = 400000 and random_page_cost = 4.

I can increase the LIMIT+OFFSET elbow to 1654 by changing the
query thusly:

< AND aa_t.server_id = 21
---
> AND aa_t.server_id IN (21, 0)

The value 0 is an invalid server_id, so I know it won't be returned.
However, I've got 41K rows that could be returned by this query, and
that number is growing; 1654 is obviously not enough.  (aa_t is 690K
rows, bb_t is 41K rows, and cc_t is 500 rows.)

If I drop the ORDER BY, the query goes much faster, but the query is
useless without the ORDER BY.

I've figured out that the second plan is slow because it is writing a
huge result set to disk (over 200MB).  This doesn't make sense to me,
since sort_mem is 32000.

Is there a way to tell the optimizer to always use the Nested Loop
plan instead of the Merge Join plan?  Turning off enable_mergejoin
globally is obviously not an option.
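
The closest I've found is disabling it around a single statement,
which isn't workable for our dynamically generated SQL; for reference,
the per-transaction form would be roughly this (a sketch, assuming a
server new enough to have SET LOCAL, which reverts at transaction
end):

    BEGIN;
    SET LOCAL enable_mergejoin = off;
    -- the SELECT above goes here; with merge joins off the planner
    -- falls back to the Nested Loop plan
    COMMIT;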

Thanks,
Rob



Re: How to force Nested Loop plan?

From
Tom Lane
Date:
Rob Nagler <nagler@bivio.biz> writes:
> I'm trying to understand how I can get the planner to always do the
> right thing with this query:

>     SELECT
>     aa_t.min_date_time
>     FROM
>     aa_t
>     , bb_t
>     , cc_t
>     WHERE bb_t.bb_id = aa_t.bb_id
>     AND aa_t.realm_id = cc_t.realm_id
>     AND aa_t.server_id = 21
>     ORDER BY aa_t.min_date_time desc
>     LIMIT 1
>     OFFSET 674


>                ->  Index Scan Backward using aa_t20 on aa_t  (cost=0.00..76738.77 rows=3454 width=46) (actual time=0.10..31.30 rows=676 loops=1)
>                      Filter: (server_id = 21::numeric)

The reason the planner does not much like this plan is that it's
estimating that quite a lot of rows will have to be hit in min_date_time
order before it finds enough rows with server_id = 21.  Thus the high
cost estimate for the above step.

I suspect that the reason you like this plan is that there's actually
substantial correlation between server_id and min_date_time, such that
the required rows are found quickly.  Before trying to force the planner
into what you consider an optimal plan, you had better ask yourself
whether you can expect that correlation to hold up in the future.
If not, your plan could become pessimal pretty quickly.

I'd suggest creating a double-column index:

    create index aa_ti on aa_t(server_id, min_date_time);

and altering the query to read

    ORDER BY aa_t.server_id DESC, aa_t.min_date_time DESC

(you need this kluge to make sure the planner recognizes that the new
index matches the ORDER BY request).  Then you should get a plan with
a much smaller cost coefficient for this step.
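
Concretely, against the tables in your query that would be something
like this (a sketch, untested):

    create index aa_ti on aa_t (server_id, min_date_time);

    SELECT aa_t.min_date_time
    FROM aa_t, bb_t, cc_t
    WHERE bb_t.bb_id = aa_t.bb_id
      AND aa_t.realm_id = cc_t.realm_id
      AND aa_t.server_id = 21
    ORDER BY aa_t.server_id DESC, aa_t.min_date_time DESC
    LIMIT 1 OFFSET 675;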

            regards, tom lane

PS: does server_id really need to be NUMERIC?  Why not integer, or at
worst bigint?

Re: How to force Nested Loop plan?

From
Rob Nagler
Date:
Tom Lane writes:
> The reason the planner does not much like this plan is that it's
> estimating that quite a lot of rows will have to be hit in min_date_time
> order before it finds enough rows with server_id = 21.  Thus the high
> cost estimate for the above step.

Thanks for the speedy and useful reply!  More questions follow. :)

Very interesting.  How does it know "quite a lot"?  Is there something
I can do to get the planner to analyze the data better?

> I suspect that the reason you like this plan is that there's actually
> substantial correlation between server_id and min_date_time, such that
> the required rows are found quickly.  Before trying to force the planner
> into what you consider an optimal plan, you had better ask yourself
> whether you can expect that correlation to hold up in the future.
> If not, your plan could become pessimal pretty quickly.

The correlation holds.  min_date_time increases over time as records
are inserted.  server_id is uniformly distributed over time.  There's
no randomness.  There is at least one 21 record for every value of
min_date_time.  21 is a special server_id containing aggregate
(denormalized) data for the other servers.  I thought about putting it
in a separate table, but this would complicate the code as the data is
identical to the non-aggregated case.

Do you have any suggestions for organizing the data/query now that you
know this?

> I'd suggest creating a double-column index:

Thanks.  I'll try this.

I'm a very big fan of declarative programming.  However, there's a
danger in declarative programming when the interpreter isn't smart
enough.  When I add this index, I will slow down inserts (about
20K/day) and increase data size (this is the second largest table in
the database).  Moreover, if the planner is later improved, I'd have
to fix my code, delete the index, etc.

Is there a way of giving the planner direct hints as in Oracle?  They
can be ignored when the optimizer is improved, just as "register" is
ignored by C compilers nowadays.

Adding the extra index and ORDER BY is also not easy in our case.  The
query is dynamically generated.  I can force the query ORDER BY to be
whatever I want, but I would lose the ability to do interesting
things, like the automatic generation of ORDER BY when someone clicks
on a column header in the application.  Indeed there are times when
people want to sort on other columns in the query. I reduced the
problem to the salient details for my post to this board.  What if the
ORDER BY was:

    ORDER BY aa_t.server_id DESC, cc_t.name ASC

Would the planner do the right thing?

> PS: does server_id really need to be NUMERIC?  Why not integer, or at
> worst bigint?

It is a NUMERIC(18).  It could be a bigint.  What would be the change
in performance of this query if we changed it to bigint?

BTW, my customer is probably going to be switching to Oracle.  This
particular query has been one of the reasons.  Maybe this change will
help us stay with Postgres.

Thanks,
Rob



Re: How to force Nested Loop plan?

From
Tom Lane
Date:
Rob Nagler <nagler@bivio.biz> writes:
> Tom Lane writes:
>> The reason the planner does not much like this plan is that it's
>> estimating that quite a lot of rows will have to be hit in min_date_time
>> order before it finds enough rows with server_id = 21.

> Very interesting.  How does it know "quite a lot"?

It doesn't, because it has no cross-column-correlation stats.  The
default assumption is that there's no correlation.
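
You can give ANALYZE more to work with on the individual columns; it
still won't know about cross-column correlation, but sharper
single-column numbers can change the row-count estimates.  Roughly (a
sketch; 100 vs. the default target of 10):

    ALTER TABLE aa_t ALTER COLUMN server_id SET STATISTICS 100;
    ANALYZE aa_t;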

> server_id is uniformly distributed over time.  There's
> no randomness.  There is at least one 21 record for every value of
> min_date_time.

That doesn't really tell me anything.  What's the proportion of 21
records out of the total table?

> 21 is a special server_id containing aggregate
> (denormalized) data for the other servers.  I thought about putting it
> in a separate table, but this would complicate the code as the data is
> identical to the non-aggregated case.

Hm.  If most of your queries are for id 21, an alternative approach is
to create single-column partial indexes:

    create index fooi on foo (min_date_time) where server_id = 21;

This reduces the cost of maintaining the index but also makes it useful
*only* for id = 21 queries.  On the plus side, you don't need to hack
the ORDER BY clause to get your queries to use it.  Your choice...
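
Against your actual table that would be something like this (a sketch;
the index name is just illustrative):

    create index aa_t_sid21 on aa_t (min_date_time) where server_id = 21;

The planner will only consider it when the query's WHERE clause
implies the index predicate, which your server_id = 21 condition does.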

> What if the ORDER BY was:
>     ORDER BY aa_t.server_id DESC, cc_t.name ASC
> Would the planner do the right thing?

What do you consider the right thing?  cc_t.name doesn't seem connected
to this table at all --- or did I miss something?

> It is a NUMERIC(18).  It could be a bigint.  What would be the change
> in performance of this query if we changed it to bigint?

Hard to say.  I can tell you that the raw comparison operator is a lot
quicker for bigint than for numeric, but I don't have any hard numbers
about what percentage of total CPU time is involved.  You'd pretty much
have to try it for yourself to see what the effect is in your queries.

If you've been generically using NUMERIC(n) where you could be using
integer or bigint, then I think you've probably paid a high price
without knowing it.  I don't know what Oracle's cost tradeoffs are for
these datatypes, but I can tell you that Postgres's integer types are
way faster (and more compact) than our NUMERIC.
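
Should you decide to convert, and if your server version predates a
direct ALTER COLUMN TYPE command, the change is a manual dance along
these lines (an untested sketch; any indexes or constraints on the
column have to be recreated afterwards):

    BEGIN;
    ALTER TABLE aa_t ADD COLUMN server_id_new bigint;
    UPDATE aa_t SET server_id_new = server_id;
    ALTER TABLE aa_t DROP COLUMN server_id;
    ALTER TABLE aa_t RENAME COLUMN server_id_new TO server_id;
    COMMIT;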

            regards, tom lane

Re: How to force Nested Loop plan?

From
Ron Johnson
Date:
On Sat, 2003-08-30 at 10:47, Rob Nagler wrote:
> Tom Lane writes:
[snip]
> enough.  When I add this index, I will slow down inserts (about
> 20K/day) and increase data size (this is the second largest table in
[snip]

Since I gather that this is a web-site, can we presume that the
inserts are clumped into an 8 hour range?  20,000/8 = 2,500/hour,
which is 41.67/minute, or about 0.69/second.  If you can't do 0.69
inserts/second, something is wrong, and it ain't hardware, and it
ain't Postgresql...

> > PS: does server_id really need to be NUMERIC?  Why not integer, or at
> > worst bigint?
>
> It is a NUMERIC(18).  It could be a bigint.  What would be the change
> in performance of this query if we changed it to bigint?

http://www.postgresql.org/docs/7.3/static/datatype.html#DATATYPE-INT
http://www.postgresql.org/docs/7.3/static/datatype.html#DATATYPE-NUMERIC-DECIMAL

Scalars are faster than arbitrary precision types.  Small (32 bit)
scalars are faster than big (64 bit) scalars on x86 h/w.

--
-----------------------------------------------------------------
Ron Johnson, Jr. ron.l.johnson@cox.net
Jefferson, LA USA

"Adventure is a sign of incompetence"
Stephanson, great polar explorer


Re: How to force Nested Loop plan?

From
Rob Nagler
Date:
Tom Lane writes:
> That doesn't really tell me anything.  What's the proportion of 21
> records out of the total table?

Currently we have about 15 servers, so roughly 6% of the rows carry
the value 21, uniformly distributed over time.

>     create index fooi on foo (min_date_time) where server_id = 21;
>
> This reduces the cost of maintaining the index but also makes it useful
> *only* for id = 21 queries.  On the plus side, you don't need to hack
> the ORDER BY clause to get your queries to use it.  Your choice...

I like that better, thanks.  After testing I found the elbow was at
1610 records with this index, but this clause still yields better
performance at 1654 records:

    AND aa_t.server_id IN (21, 0)

This is independent of the existence of the new index.

Interestingly, when I drop the aa_t5 index, the elbow goes up to
1729 with the IN (21, 0) query.

You might ask: why do I have an index at all?  That's from my Oracle
experience.  server_id is a foreign key into the server table.  If you
don't create an index on a foreign key, Oracle locks the entire
foreign table when you modify the local table.  With an index, it only
locks a row in the foreign table.  This causes a major bottleneck, but
in this case the server table is static.  Therefore, the index is
superfluous, and since there are only 16 values, the index should be
a bitmap index (Oracle speak, sorry, I don't know the PG term).
Dropping the index probably won't change any of the other queries, I
think.
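
Since DDL is transactional in Postgres, the drop can be tested
without committing to it (a sketch; DROP INDEX locks the table, so
test box only):

    BEGIN;
    DROP INDEX aa_t5;
    -- re-run the EXPLAIN ANALYZE from the top of the thread here
    ROLLBACK;  -- the index comes right back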

Without the aa_t5 index and after the elbow, the Index Scan is
replaced with a Seq Scan, which is just about as fast, but still 50
times slower than before the elbow:

 Limit  (cost=34071.30..34071.31 rows=1 width=84) (actual time=5111.14..5111.15 rows=1 loops=1)
   ->  Sort  (cost=34066.98..34075.61 rows=3454 width=84) (actual time=5108.74..5109.96 rows=1733 loops=1)
         Sort Key: aa_t.min_date_time
         ->  Merge Join  (cost=33801.26..33863.98 rows=3454 width=84) (actual time=4868.62..5020.58 rows=41879 loops=1)
               Merge Cond: ("outer".realm_id = "inner".realm_id)
               ->  Sort  (cost=31.64..32.78 rows=455 width=19) (actual time=3.06..3.38 rows=415 loops=1)
                     Sort Key: cc_t.realm_id
                      ->  Seq Scan on cc_t  (cost=0.00..11.55 rows=455 width=19) (actual time=0.05..0.99 rows=455 loops=1)
                ->  Sort  (cost=33769.63..33778.26 rows=3454 width=65) (actual time=4865.20..4895.28 rows=41879 loops=1)
                      Sort Key: aa_t.realm_id
                      ->  Merge Join  (cost=33296.79..33566.63 rows=3454 width=65) (actual time=4232.52..4541.24 rows=41879 loops=1)
                            Merge Cond: ("outer".bb_id = "inner".bb_id)
                            ->  Sort  (cost=25216.97..25225.60 rows=3454 width=46) (actual time=3213.53..3243.65 rows=41879 loops=1)
                                  Sort Key: aa_t.bb_id
                                  ->  Seq Scan on aa_t  (cost=0.00..25013.97 rows=3454 width=46) (actual time=20.07..2986.11 rows=41879 loops=1)
                                        Filter: (server_id = 21::numeric)
                            ->  Sort  (cost=8079.83..8184.53 rows=41879 width=19) (actual time=1018.95..1049.37 rows=41879 loops=1)
                                  Sort Key: bb_t.bb_id
                                  ->  Seq Scan on bb_t  (cost=0.00..4864.79 rows=41879 width=19) (actual time=0.04..810.88 rows=41879 loops=1)
 Total runtime: 5141.22 msec

What I'm not sure about is why it decides to switch modes so "early",
i.e., at about 5% of the table size or less.  It seems that an Index
Scan would give better mileage than a Seq Scan for possibly up to 50%
of the table in this case.  I clearly don't understand the internals,
but the elbow seems rather sharp to me.

> > What if the ORDER BY was:
> >     ORDER BY aa_t.server_id DESC, cc_t.name ASC
> > Would the planner do the right thing?
>
> What do you consider the right thing?
> cc_t.name doesn't seem connected
> to this table at all --- or did I miss something?

Sorry, this is a red herring.  Please ignore.

> If you've been generically using NUMERIC(n) where you could be using
> integer or bigint, then I think you've probably paid a high price
> without knowing it.  I don't know what Oracle's cost tradeoffs are for
> these datatypes, but I can tell you that Postgres's integer types are
> way faster (and more compact) than our NUMERIC.

I'll try to figure out what the price is in our case.  I think Oracle
does a pretty good job on data compression for NUMERIC.  I haven't
dealt with a large Postgres database until this one, so I guess it's
time to learn. :)

We actually have been quite pleased with Postgres's performance
without paying much attention to it before this.  When we first set up
Oracle, we got into all of its parameters pretty heavily.  With
Postgres, we just tried it and it worked.  This is the first query
where we ran out of ideas to try.

BTW, everybody's help on this list is fantastic.  Usually, I can find
the answer to my question (and have been doing so for 3 years) on this
list without asking.

Thanks,
Rob



Re: How to force Nested Loop plan?

From
Ron Johnson
Date:
On Sat, 2003-08-30 at 15:42, Rob Nagler wrote:
[snip]
> We actually have been quite pleased with Postgres's performance
> without paying much attention to it before this.  When we first set up
> Oracle, we got into all of its parameters pretty heavily.  With
> Postgres, we just tried it and it worked.  This is the first query
> where we ran out of ideas to try.

Dumb question: given your out-of-the-box satisfaction, could it be
that postgresql.conf hasn't been tweaked?

--
-----------------------------------------------------------------
Ron Johnson, Jr. ron.l.johnson@cox.net
Jefferson, LA USA

484,246 sq mi are needed for 6 billion people to live, 4 persons
per lot, in lots that are 60'x150'.
That is ~ California, Texas and Missouri.
Alternatively, France, Spain and The United Kingdom.


Re: How to force Nested Loop plan?

From
Tom Lane
Date:
Rob Nagler <nagler@bivio.biz> writes:
> What I'm not sure about is why it decides to switch modes so "early",
> i.e., at about 5% of the table size or less.

Given the default cost parameters and cost models, that's the correct
place to switch.  Since the estimate evidently doesn't match reality
for your case, you might want to play with the parameters.  Reducing
random_page_cost would be the first thing I'd try.  Some people think
that increasing effective_cache_size is a good idea too, though I feel
that that has only marginal impact on the planner's choices.

Keep in mind though that you seem to be experimenting with a
fully-cached database; you may find that the planner's beliefs more
nearly approach reality when actual I/O has to occur.

Another thing I'd be interested to know about is how closely the
physical order of the table entries correlates with min_date_time.
A high correlation reduces the actual cost of the indexscan (since
visiting the rows in index order becomes less of a random-access
proposition).  We are aware that the planner doesn't model this effect
very well at present ...
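
You can see the planner's notion of that correlation in pg_stats
after an ANALYZE, e.g.:

    SELECT attname, correlation
    FROM pg_stats
    WHERE tablename = 'aa_t';

Values near 1 or -1 mean the heap is close to index order for that
column; values near 0 mean essentially random placement.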

            regards, tom lane

Re: How to force Nested Loop plan?

From
Rob Nagler
Date:
Ron Johnson writes:
> Dumb question: given your out-of-the-box satisfaction, could it be
> that postgresql.conf hasn't been tweaked?

Here are the modified values:

shared_buffers = 8000
wal_buffers = 80
sort_mem = 32000
effective_cache_size = 400000
random_page_cost = 4
autocommit = false
timezone = UTC

I ran a test with effective_cache_size set to a high value to see
what would happen, and also adjusted random_page_cost:

    random_page_cost    effective_cache_size    elbow
    4                   40000                   675
    0.5                 40000                   592
    0.1                 40000                   392
    4                   1000                    30

My conclusion is that random_page_cost should be left alone, and that
a higher effective_cache_size is better.
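
For anyone repeating the experiment: the two settings can be flipped
per session rather than by editing postgresql.conf, e.g.:

    SET random_page_cost = 0.5;
    SET effective_cache_size = 40000;
    -- re-run the EXPLAIN ANALYZE from the top of the thread,
    -- walking OFFSET up until the plan flips
    RESET random_page_cost;
    RESET effective_cache_size;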

BTW, the hardware is 2 x 2.4GHz Xeon, 1.2GB RAM, SCSI (Linux software
RAID) with 10K RPM disks.  This is close to the production box,
although we are planning to add more memory to production.

Rob



Re: How to force Nested Loop plan?

From
Rob Nagler
Date:
Tom Lane writes:
> Keep in mind though that you seem to be experimenting with a
> fully-cached database; you may find that the planner's beliefs more
> nearly approach reality when actual I/O has to occur.

My hope is that the entire database should fit in memory.  This may
not be the case right now with only 1GB, but it should be close.
The pgsql/data/base/NNN directory is about 1.5GB on production.  I'm
pretty sure with constant vacuuming, we could keep that size down.
A pgdump is about 60MB now, growing at about .5MB a day.

> Another thing I'd be interested to know about is how closely the
> physical order of the table entries correlates with min_date_time.

Probably "pretty close".  The primary key of aa_t is (bb_id,
server_id), and bb_id is a sequence.  aa_t is updated heavily on
production, but these tests are on a fresh import, so vacuuming and
index order are not factors.  We do a reload every now and then to
improve performance on production.  min_date_time is highly correlated
with bb_id, because both are increasing constantly.  server_id is one
of 16 values.

> A high correlation reduces the actual cost of the indexscan (since
> visiting the rows in index order becomes less of a random-access
> proposition).  We are aware that the planner doesn't model this effect
> very well at present ...

Oracle's optimizer is lacking here, too.  The best optimizer I've seen
was at Tandem, and even then hints were required.

Are there plans for explicit hints to the planner?

Thanks,
Rob



Re: How to force Nested Loop plan?

From
Tom Lane
Date:
Rob Nagler <nagler@bivio.biz> writes:
> Are there plans for explicit hints to the planner?

Personally, I'm philosophically opposed to planner hints; see previous
discussions in the archives.

            regards, tom lane

Re: How to force Nested Loop plan?

From
Ron Johnson
Date:
On Sun, 2003-08-31 at 18:12, Tom Lane wrote:
> Rob Nagler <nagler@bivio.biz> writes:
> > Are there plans for explicit hints to the planner?
>
> Personally, I'm philosophically opposed to planner hints; see previous
> discussions in the archives.

How about (if you don't already do it) ranked (or approximately
ranked) b-tree indexes, where each node also stores the (approximate)
count of tuple pointers under it?

This way, the planner would know whether (and how badly) a tree is
skewed, and (approximately) how many tuples a given WHERE predicate
resolves to.

--
-----------------------------------------------------------------
Ron Johnson, Jr. ron.l.johnson@cox.net
Jefferson, LA USA

"Fair is where you take your cows to be judged."
Unknown


Re: How to force Nested Loop plan?

From
Tom Lane
Date:
Ron Johnson <ron.l.johnson@cox.net> writes:
> How about (if you don't already do it) ranked (or approximately
> ranked) b-tree indexes, where each node also stores the (approximate)
> count of tuple pointers under it?
> This way, the planner would know whether or how skewed a tree is,
> and (approximately) how many tuples a given WHERE predicate resolves
> to.

Why is that better than our existing implementation of column statistics?

            regards, tom lane