Thread: incorrect row estimates for primary key join

incorrect row estimates for primary key join

From: Ben

hello postgresql experts --

i have a strange row estimate problem, which unfortunately i have trouble reproducing for anything but very large tables, which makes boiling down to a simple example hard.  i'm using version 9.1.1, all tables have had analyze run on them.

here's the example : i have a large table (billions of rows) with a five column primary key and a sixth value column.  for confidentiality purposes i can't reproduce the exact schema here, but it is essentially

create table bigtable (
  id1 integer not null,
  id2 date not null,
  id3 integer not null,
  id4 time not null,
  id5 integer not null,
  value real not null,
  primary key (id1, id2, id3, id4, id5)
);

for various reasons there is only one id1 in the table right now, though in the future there will be more; also the primary key was added after table creation with alter table, though that shouldn't matter.

i need to select out a somewhat arbitrary collection of rows out of bigtable.  to do so i generate a temporary table

create table jointable (
  id1 integer not null,
  id2 date not null,
  id3 integer not null,
  id4 time not null,
  id5 integer not null
);

and then perform a join against this table.

if jointable doesn't have many rows, the planner picks a nested loop over jointable and a primary key lookup on
bigtable. in the following, for expository purposes, jointable has 10 rows.  we can see the planner knows this. 

explain select * from bigtable join jointable using (id1, id2, id3, id4, id5);

 Nested Loop  (cost=0.00..6321.03 rows=145 width=28)
   ->  Seq Scan on jointable  (cost=0.00..1.10 rows=10 width=24)
   ->  Index Scan using bigtable_pkey on bigtable  (cost=0.00..631.97 rows=1 width=28)
         Index Cond: ((id1 = jointable.id1) AND (id2 = jointable.id2) AND (id3 = jointable.id3) AND (id4 = jointable.id4) AND (id5 = jointable.id5))
(4 rows)

as you increase the number of rows in jointable, the planner switches to a sort + merge.  in this case jointable has roughly 2 million rows.

 Merge Join  (cost=727807979.29..765482193.16 rows=18212633 width=28)
   Merge Cond: ((bigtable.id1 = jointable.id1) AND (bigtable.id2 = jointable.id2) AND (bigtable.id3 = jointable.id3) AND (bigtable.id4 = jointable.id4) AND (bigtable.id5 = jointable.id5))
   ->  Sort  (cost=727511795.16..735430711.00 rows=3167566336 width=28)
         Sort Key: bigtable.id3, bigtable.id1, bigtable.id2, bigtable.id4, bigtable.id5
         ->  Seq Scan on bigtable  (cost=0.00..76085300.36 rows=3167566336 width=28)
   ->  Materialize  (cost=295064.70..305399.26 rows=2066911 width=24)
         ->  Sort  (cost=295064.70..300231.98 rows=2066911 width=24)
               Sort Key: jointable.id3, jointable.id1, jointable.id2, jointable.id4, jointable.id5
               ->  Seq Scan on jointable  (cost=0.00..35867.11 rows=2066911 width=24)
(9 rows)

the choice of sort + merge is really bad here, given the size of bigtable (3 billion rows and counting.)

some questions :

1 - first off, why isn't the sort happening on the primary key, so that bigtable does not have to be sorted?

2 - more importantly, since we are joining on the primary key, shouldn't the row estimate coming out of the join be limited by the number of rows in jointable?

for example, it is strange to see that if i put in a non-limiting limit statement (something bigger than the number of rows in jointable) it switches back to a nested loop + index scan :

explain select * from bigtable join jointable using (id1, id2, id3, id4, id5) limit 2500000;

 Limit  (cost=0.00..178452647.11 rows=2500000 width=28)
   ->  Nested Loop  (cost=0.00..1306127545.35 rows=18297957 width=28)
         ->  Seq Scan on jointable  (cost=0.00..35867.11 rows=2066911 width=24)
         ->  Index Scan using bigtable_pkey on bigtable  (cost=0.00..631.88 rows=1 width=28)
               Index Cond: ((id1 = jointable.id1) AND (id2 = jointable.id2) AND (id3 = jointable.id3) AND (id4 = jointable.id4) AND (id5 = jointable.id5))
(5 rows)

am i not understanding the query planner, or is this a known issue in the query planner, or have i stumbled onto something amiss?  unfortunately any synthetic examples i was able to make (up to 10 million rows) did not exhibit this behavior, which makes it hard to test.

best regards, ben



Re: incorrect row estimates for primary key join

From: bricklen

On Mon, Jun 24, 2013 at 3:18 PM, Ben <midfield@gmail.com> wrote:

> create table jointable (
>   id1 integer not null,
>   id2 date not null,
>   id3 integer not null,
>   id4 time not null,
>   id5 integer not null
> );
>
> and then perform a join against this table.

Is it safe to assume you ANALYZEd the jointable after creating it? (I assume so, just checking)
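
Something along these lines is usually enough to confirm (a sketch only, table names taken from your example):

ANALYZE jointable;

-- how many rows and pages the planner thinks each table has
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname IN ('bigtable', 'jointable');

-- when the tables were last analyzed, manually or by autovacuum
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('bigtable', 'jointable');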

 
> as you increase the number of rows in jointable, the planner switches to a sort + merge.  in this case jointable has roughly 2 million rows.

Can you post the output of:

SELECT version();
SELECT name, current_setting(name), source
FROM pg_settings
WHERE source NOT IN ('default', 'override');


Re: incorrect row estimates for primary key join

From: Ben

hello --

On Jun 24, 2013, at 4:23 PM, bricklen wrote:

> Is it safe to assume you ANALYZEd the jointable after creating it? (I assume so, just checking)

yes, jointable was analyzed.  both tables were further analyzed after any changes.

> Can you post the output of:
>
> SELECT version();
> SELECT name, current_setting(name), source
> FROM pg_settings
> WHERE source NOT IN ('default', 'override');

                                      version
---------------------------------------------------------------------------------------
 PostgreSQL 9.1.1 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 4.6.2, 64-bit
(1 row)

             name             |  current_setting   |        source
------------------------------+--------------------+----------------------
 checkpoint_completion_target | 0.9                | configuration file
 checkpoint_segments          | 16                 | configuration file
 DateStyle                    | ISO, MDY           | configuration file
 default_statistics_target    | 50                 | configuration file
 default_text_search_config   | pg_catalog.english | configuration file
 effective_cache_size         | 5632MB             | configuration file
 lc_messages                  | en_US.UTF-8        | configuration file
 lc_monetary                  | en_US.UTF-8        | configuration file
 lc_numeric                   | en_US.UTF-8        | configuration file
 lc_time                      | en_US.UTF-8        | configuration file
 listen_addresses             | *                  | configuration file
 log_destination              | stderr             | configuration file
 log_line_prefix              | %t %d %u           | configuration file
 log_timezone                 | US/Pacific         | environment variable
 logging_collector            | on                 | configuration file
 maintenance_work_mem         | 480MB              | configuration file
 max_connections              | 300                | configuration file
 max_stack_depth              | 2MB                | environment variable
 max_wal_senders              | 3                  | configuration file
 search_path                  | public             | user
 shared_buffers               | 1920MB             | configuration file
 TimeZone                     | US/Pacific         | environment variable
 wal_buffers                  | 8MB                | configuration file
 wal_keep_segments            | 128                | configuration file
 wal_level                    | hot_standby        | configuration file
 work_mem                     | 48MB               | configuration file
(26 rows)

hope this helps!

thanks, ben

Re: incorrect row estimates for primary key join

From: Kevin Grittner

Ben <midfield@gmail.com> wrote:

> PostgreSQL 9.1.1 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 4.6.2, 64-bit

Consider applying the latest bug fixes for 9.1 -- which would leave
you showing 9.1.9.

http://www.postgresql.org/support/versioning/

> default_statistics_target    | 50                | configuration file

Why did you change this from the default of 100?

> effective_cache_size        | 5632MB            | configuration file

How much RAM is on this machine?  What else is running on it?
(Normally people set this to 50% to 75% of total RAM.  Lower values
discourage index usage in queries like your example.)

Do you get a different plan if you set cpu_tuple_cost = 0.03?  How
about 0.05?  You can set this just for a single connection and run
explain on the query to do a quick check.
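
For example, in one psql session (a sketch only, reusing the join query from your first message):

-- affects only the current session
SET cpu_tuple_cost = 0.03;
EXPLAIN SELECT * FROM bigtable JOIN jointable USING (id1, id2, id3, id4, id5);

SET cpu_tuple_cost = 0.05;
EXPLAIN SELECT * FROM bigtable JOIN jointable USING (id1, id2, id3, id4, id5);

RESET cpu_tuple_cost;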

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: incorrect row estimates for primary key join

From: Julien Cigar

On 06/25/2013 00:18, Ben wrote:
> hello postgresql experts --
>
> i have a strange row estimate problem, which unfortunately i have trouble reproducing for anything but very large tables, which makes boiling down to a simple example hard.  i'm using version 9.1.1, all tables have had analyze run on them.
>
> here's the example : i have a large table (billions of rows) with a five column primary key and a sixth value column.
for confidentiality purposes i can't reproduce the exact schema here, but it is essentially 
>
> create table bigtable (
>    id1 integer not null,
>    id2 date not null,
>    id3 integer not null,
>    id4 time not null,
>    id5 integer not null,
>    value real not null,
>    primary key (id1, id2, id3, id4, id5)
> );
>
> for various reasons there is only one id1 in the table right now, though in the future there will be more; also the primary key was added after table creation with alter table, though that shouldn't matter.
>
> i need to select out a somewhat arbitrary collection of rows out of bigtable.  to do so i generate a temporary table
>
> create table jointable (
>    id1 integer not null,
>    id2 date not null,
>    id3 integer not null,
>    id4 time not null,
>    id5 integer not null
> );
>
> and then perform a join against this table.
>
> if jointable doesn't have many rows, the planner picks a nested loop over jointable and a primary key lookup on
bigtable. in the following, for expository purposes, jointable has 10 rows.  we can see the planner knows this. 
>
> explain select * from bigtable join jointable using (id1, id2, id3, id4, id5);
>
>   Nested Loop  (cost=0.00..6321.03 rows=145 width=28)
>     ->  Seq Scan on jointable  (cost=0.00..1.10 rows=10 width=24)
>     ->  Index Scan using bigtable_pkey on bigtable  (cost=0.00..631.97 rows=1 width=28)
>           Index Cond: ((id1 = jointable.id1) AND (id2 = jointable.id2) AND (id3 = jointable.id3) AND (id4 = jointable.id4) AND (id5 = jointable.id5))
> (4 rows)
>
> as you increase the number of rows in jointable, the planner switches to a sort + merge.  in this case jointable has roughly 2 million rows.
>
>   Merge Join  (cost=727807979.29..765482193.16 rows=18212633 width=28)
>    Merge Cond: ((bigtable.id1 = jointable.id1) AND (bigtable.id2 = jointable.id2) AND (bigtable.id3 = jointable.id3) AND (bigtable.id4 = jointable.id4) AND (bigtable.id5 = jointable.id5))
>     ->  Sort  (cost=727511795.16..735430711.00 rows=3167566336 width=28)
>           Sort Key: bigtable.id3, bigtable.id1, bigtable.id2, bigtable.id4, bigtable.id5
>           ->  Seq Scan on bigtable  (cost=0.00..76085300.36 rows=3167566336 width=28)
>     ->  Materialize  (cost=295064.70..305399.26 rows=2066911 width=24)
>           ->  Sort  (cost=295064.70..300231.98 rows=2066911 width=24)
>                 Sort Key: jointable.id3, jointable.id1, jointable.id2, jointable.id4, jointable.id5
>                 ->  Seq Scan on jointable  (cost=0.00..35867.11 rows=2066911 width=24)
> (9 rows)

can you show us the explain analyze version ?

> the choice of sort + merge is really bad here, given the size of bigtable (3 billion rows and counting.)
>
> some questions :
>
> 1 - first off, why isn't the sort happening on the primary key, so that bigtable does not have to be sorted?
>
> 2 - more importantly, since we are joining on the primary key, shouldn't the row estimate coming out of the join be limited by the number of rows in jointable?
>
> for example, it is strange to see that if i put in a non-limiting limit statement (something bigger than the number of rows in jointable) it switches back to a nested loop + index scan :
>
> explain select * from bigtable join jointable using (id1, id2, id3, id4, id5) limit 2500000;
>
>   Limit  (cost=0.00..178452647.11 rows=2500000 width=28)
>     ->  Nested Loop  (cost=0.00..1306127545.35 rows=18297957 width=28)
>           ->  Seq Scan on jointable  (cost=0.00..35867.11 rows=2066911 width=24)
>           ->  Index Scan using bigtable_pkey on bigtable  (cost=0.00..631.88 rows=1 width=28)
>                 Index Cond: ((id1 = jointable.id1) AND (id2 = jointable.id2) AND (id3 = jointable.id3) AND (id4 = jointable.id4) AND (id5 = jointable.id5))
> (5 rows)
>
> am i not understanding the query planner, or is this a known issue in the query planner, or have i stumbled onto something amiss?  unfortunately any synthetic examples i was able to make (up to 10 million rows) did not exhibit this behavior, which makes it hard to test.
>
> best regards, ben
>
>
>


--
No trees were killed in the creation of this message.
However, many electrons were terribly inconvenienced.



Re: incorrect row estimates for primary key join

From: Ben

hello --

thanks kevin for the tuning advice, i will answer your questions below and try different tuning configurations and report back.  but first allow me to take a step back and ask a couple of simple questions :

it seems to me that an equality join between two relations (call them A and B) using columns in relation B with a unique constraint should yield row estimates which are at most equal to the row estimates for relation A.  my questions are

1 - is this correct?

2 - does the postgresql planner implement this when generating row estimates?

it seems like if the answers to 1 and 2 are yes, then the row estimates for my join should always come back less than or equal to the estimates for jointable, regardless of what the query plan is.  indeed this is what i find experimentally for smaller examples.  what is perplexing to me is why this is not true for this large table.  (the fact that the table size is greater than 2^31 is probably a red herring but hasn't escaped my attention.)  while i do have a performance issue (i'd like for it to select the index scan) which might be solved by better configuration, that at the moment is a secondary question -- right now i'm interested in why the row estimates are off.

moving on to your remarks :

On Jun 25, 2013, at 6:20 AM, Kevin Grittner wrote:

> Ben <midfield@gmail.com> wrote:
>
>> PostgreSQL 9.1.1 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 4.6.2, 64-bit
>
> Consider applying the latest bug fixes for 9.1 -- which would leave
> you showing 9.1.9.

i will bring it up with our ops people.  do you have any particular fixes in mind, or is this a (very sensible) blanket
suggestion?

>> default_statistics_target    | 50                | configuration file
>
> Why did you change this from the default of 100?

sorry, i do not know.  it is possible this was copied from the configuration of a different server, which is serving some very very large tables with gist indexes, where the statistics do not help the selectivity estimations much if at all (as far as i can tell gist indexes often use hard-coded selectivity estimates as opposed to using the statistics.)  in that case it is an oversight and i will correct it.  but i believe the statistics for the tables in question are close enough, and certainly do not explain the off row estimates in the query plan.

>> effective_cache_size        | 5632MB            | configuration file
>
> How much RAM is on this machine?  What else is running on it?
> (Normally people set this to 50% to 75% of total RAM.  Lower values
> discourage index usage in queries like your example.)

24GB.  i can up it to 12 or 16GB and report back.

> Do you get a different plan if you set cpu_tuple_cost = 0.03?  How
> about 0.05?  You can set this just for a single connection and run
> explain on the query to do a quick check.

setting cpu_tuple_cost to 0.03 or 0.05 has no effect on the choice of plan or the row estimates for the un-limited query or the limited query.

best regards, ben



Re: incorrect row estimates for primary key join

From: Kevin Grittner

Ben <midfield@gmail.com> wrote:

> it seems to me that an equality join between two relations (call them A and B)
> using columns in relation B with a unique constraint should yield row estimates
> which are at most equal to the row estimates for relation A.  my questions are
>
> 1 - is this correct?
>
> 2 - does the postgresql planner implement this when generating row estimates?

That seems intuitive, but some of the estimates need to be made
before all such information is available.  Maybe we can do
something about that some day....

> while i do have a performance issue (i'd like for it to select
> the index scan) which might be solved by better configuration,
> that at the moment is a secondary question -- right now i'm
> interested in why the row estimates are off.

Maybe someone else will jump in here with more details than I can
provide (at least without hours digging in the source code).

> On Jun 25, 2013, at 6:20 AM, Kevin Grittner wrote:
>> Ben <midfield@gmail.com> wrote:
>>
>>> PostgreSQL 9.1.1 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux)
>>> 4.6.2, 64-bit
>>
>> Consider applying the latest bug fixes for 9.1 -- which would leave
>> you showing 9.1.9.
>
> i will bring it up with our ops people.  do you have any particular fixes in
> mind, or is this a (very sensible) blanket suggestion?

I do recommend staying up-to-date in general (subject to roll-out
procedures).  We try very hard not to change any behavior that
isn't a clear bug from one minor release to the next, precisely so
that people can apply these critical bug fixes with confidence that
things won't break.  There is a fix for a pretty significant
security vulnerability you are currently missing, which would be my
top concern; but it wouldn't be surprising if there were a planner
bug in 9.1.1 which is fixed in 9.1.9.

>> Do you get a different plan if you set cpu_tuple_cost = 0.03?  How
>> about 0.05?  You can set this just for a single connection and run
>> explain on the query to do a quick check.
>
> setting cpu_tuple_cost to 0.03 or 0.05 has no effect on the choice of plan or
> the row estimates for the un-limited query or the limited query.

That wouldn't affect row estimates, but it would tend to encourage
index usage, because it would increase the estimated cost of
reading each row.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: incorrect row estimates for primary key join

From: Tom Lane

Kevin Grittner <kgrittn@ymail.com> writes:
> Ben <midfield@gmail.com> wrote:
>> it seems to me that an equality join between two relations (call them A and B)
>> using columns in relation B with a unique constraint should yield row estimates
>> which are at most equal to the row estimates for relation A.  my questions are
>>
>> 1 - is this correct?
>>
>> 2 - does the postgresql planner implement this when generating row estimates?

> That seems intuitive, but some of the estimates need to be made
> before all such information is available.  Maybe we can do
> something about that some day....
> Maybe someone else will jump in here with more details than I can
> provide (at least without hours digging in the source code).

It does not attempt to match up query WHERE clauses with indexes during
selectivity estimation, so the existence of a multi-column unique
constraint wouldn't help it improve the estimate.

In the case at hand, I doubt that a better result rowcount estimate
would have changed the planner's opinion of how to do the join.  The OP
seems to be imagining that 2 million index probes into a large table
would be cheap, but that's hardly true.  It's quite likely that the
mergejoin actually is the best way to do the query.  If it isn't really
best on his hardware, I would think that indicates a need for some
tuning of the cost parameters.  Another thing that might be helpful for
working with such large tables is increasing work_mem, to make hashes
and sorts run faster.
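
Something like the following, per session, would let you compare plans (the numbers are only illustrative, not recommendations):

-- illustrative values; tune against your own hardware and caching behavior
SET work_mem = '512MB';        -- lets large sorts and hashes run in memory
SET random_page_cost = 2.0;    -- if the index is mostly cached, random I/O is cheaper than the default assumes
EXPLAIN SELECT * FROM bigtable JOIN jointable USING (id1, id2, id3, id4, id5);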

            regards, tom lane


Re: incorrect row estimates for primary key join

From: Ben

On Jun 25, 2013, at 4:36 PM, Tom Lane wrote:

>> That seems intuitive, but some of the estimates need to be made
>> before all such information is available.  Maybe we can do
>> something about that some day....
>> Maybe someone else will jump in here with more details than I can
>> provide (at least without hours digging in the source code).
>
> It does not attempt to match up query WHERE clauses with indexes during
> selectivity estimation, so the existence of a multi-column unique
> constraint wouldn't help it improve the estimate.

thanks tom, that answered my question.

> In the case at hand, I doubt that a better result rowcount estimate
> would have changed the planner's opinion of how to do the join.  The OP
> seems to be imagining that 2 million index probes into a large table
> would be cheap, but that's hardly true.  It's quite likely that the
> mergejoin actually is the best way to do the query.  If it isn't really
> best on his hardware, I would think that indicates a need for some
> tuning of the cost parameters.  Another thing that might be helpful for
> working with such large tables is increasing work_mem, to make hashes
> and sorts run faster.

i apologize if i seemed like i was presuming to know what the best query plan is.  i fully understand that the query planner sometimes makes unintuitive decisions which turn out to be for the best, having experienced it first hand many times.  since i've nudged my company to use postgresql (instead of mysql/sqlite), we've been very happy with it.  also, having tried my hand (and failing) at making good gist selectivity estimators, i think i've got a not-completely-ignorant 10,000 ft view of the trade-offs it tries to make, when sequential scans are better than repeated index lookups, et cetera.  i'm writing because i found this example, which shows yet another thing i don't understand about the query planner, and i am trying to learn more about it.

you've already answered my main question (whether or not unique constraints are used to help row estimation.)  there are a couple more issues which i don't quite understand :

1) when i give a hint to the query planner to not expect more than number-of-rows-in-jointable (via a limit), it switches to a nested loop + index scan, but with the same row estimates.  i'll show the plan i had in the first email :

Limit  (cost=0.00..178452647.11 rows=2500000 width=28)
  ->  Nested Loop  (cost=0.00..1306127545.35 rows=18297957 width=28)
        ->  Seq Scan on jointable  (cost=0.00..35867.11 rows=2066911 width=24)
        ->  Index Scan using bigtable_pkey on bigtable  (cost=0.00..631.88 rows=1 width=28)
              Index Cond: ((id1 = jointable.id1) AND (id2 = jointable.id2) AND (id3 = jointable.id3) AND (id4 = jointable.id4) AND (id5 = jointable.id5))
(5 rows)

before, i was misreading this as saying the planner was going to execute the nested loop fully (rows=18 million), and then limit the results.  i am now reading it as saying that the inner nested loop will be short-circuited after it generates enough rows.  if this is true, it seems to imply that, in a query plan with deeply nested inner nested loops, one should read the inner loop row estimates with a grain of salt, as there might be limits (potentially many levels outwards) which can short-circuit them.  am i wrong about this?

2) when doing the sort+merge join, it chooses to sort bigtable rather than use an index scan.  i've tried to give hints by requesting the results come in primary key order, but it keeps sorting by a permutation of the primary key and then resorting the join results at the end.  so obviously the random seek cost dominates the sequential read + sort (which i find surprising, but again i am happy to be surprised.)  that seems fine for a query which is going to touch the whole table.  but i can't seem to come up with a query which would ever favor using an index scan.  for example this :

explain select * from bigtable order by (id1, id2, id3, id4, id5) limit 1;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Limit  (cost=91923132.04..91923132.04 rows=1 width=28)
   ->  Sort  (cost=91923132.04..99842047.88 rows=3167566336 width=28)
         Sort Key: (ROW(id1, id2, id3, id4, id5))
         ->  Seq Scan on bigtable  (cost=0.00..76085300.36 rows=3167566336 width=28)
(4 rows)

(apologies, bigtable has grown since i first started this thread.)  shouldn't an index scan definitely be fastest here?  you don't need to touch the whole table or index.  maybe there is something i have misconfigured here?

best regards, ben

Re: incorrect row estimates for primary key join

From: Marcin Mańk

On Wed, Jun 26, 2013 at 2:29 AM, Ben <midfield@gmail.com> wrote:

> shouldn't an index scan definitely be fastest here?  you don't need to touch the whole table or index.  maybe there is something i have misconfigured here?
>

How about you try increasing work_mem? I think a hash join may be the
best plan here, and it won't get chosen with low work_mem.
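
For example (the values are purely illustrative, and enable_mergejoin is turned off only to see what the hash join would cost):

BEGIN;
SET LOCAL work_mem = '1GB';              -- illustrative; enough to hash the ~2M-row jointable
SET LOCAL enable_mergejoin = off;        -- only for comparison, not for production use
EXPLAIN SELECT * FROM bigtable JOIN jointable USING (id1, id2, id3, id4, id5);
ROLLBACK;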

Regards
Marcin Mańk


Re: incorrect row estimates for primary key join

From: Ben

On Jun 26, 2013, at 5:22 PM, Marcin Mańk wrote:

> On Wed, Jun 26, 2013 at 2:29 AM, Ben <midfield@gmail.com> wrote:
>
>> shouldn't an index scan definitely be fastest here?  you don't need to touch the whole table or index.  maybe there is something i have misconfigured here?
>>
>
> How about you try increasing work_mem? I think a hash join may be the
> best plan here, and it won't get chosen with low work_mem.

i will increase work_mem and experiment for the other queries, but the query which i was asking about in this particular question was looking up the single smallest key in the primary key index, which seems like it shouldn't need to touch more than one key, since it can just get the first one from an in-order index traversal.  of course with my earlier bigtable/jointable join question increasing work_mem makes a lot of sense.

best regards, ben