Thread: Row estimates for empty tables

Row estimates for empty tables

From
Christophe Pettus
Date:
I realize I've never quite known this; where does the planner get the row estimates for an empty table?  Example:

psql (11.8)
Type "help" for help.

xof=# CREATE TABLE t (i integer, t text, j integer);
CREATE TABLE
xof=# VACUUM ANALYZE t;
VACUUM
xof=# EXPLAIN ANALYZE SELECT * FROM t;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..22.00 rows=1200 width=40) (actual time=0.015..0.015 rows=0 loops=1)
 Planning Time: 5.014 ms
 Execution Time: 0.094 ms
(3 rows)

xof=# INSERT INTO t values(1, 'this', 2);
INSERT 0 1
xof=# EXPLAIN ANALYZE SELECT * FROM t;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..22.00 rows=1200 width=40) (actual time=0.010..0.011 rows=1 loops=1)
 Planning Time: 0.039 ms
 Execution Time: 0.021 ms
(3 rows)

xof=# VACUUM ANALYZE t;
VACUUM
xof=# EXPLAIN ANALYZE SELECT * FROM t;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..1.01 rows=1 width=13) (actual time=0.008..0.008 rows=1 loops=1)
 Planning Time: 0.069 ms
 Execution Time: 0.019 ms
(3 rows)

xof=# DELETE FROM t;
DELETE 0
xof=# VACUUM ANALYZE t;
VACUUM
xof=# EXPLAIN ANALYZE SELECT * FROM t;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..29.90 rows=1990 width=13) (actual time=0.004..0.004 rows=0 loops=1)
 Planning Time: 0.034 ms
 Execution Time: 0.015 ms
(3 rows)


--
-- Christophe Pettus
   xof@thebuild.com




Re: Row estimates for empty tables

From
David Rowley
Date:
On Fri, 24 Jul 2020 at 16:01, Christophe Pettus <xof@thebuild.com> wrote:
> I realize I've never quite known this; where does the planner get the row estimates for an empty table?  Example:

We just assume there are 10 pages if the relation has not yet been
vacuumed or analyzed. The row estimates you see are the number of
times 1 tuple is likely to fit onto a single page multiplied by the
assumed 10 pages.  If you had made your table wider then the planner
would have assumed fewer rows

There's a comment that justifies the 10 pages, which, as of master is
in table_block_relation_estimate_size(). It'll be somewhere else in
pg12.

* HACK: if the relation has never yet been vacuumed, use a minimum size
* estimate of 10 pages.  The idea here is to avoid assuming a
* newly-created table is really small, even if it currently is, because
* that may not be true once some data gets loaded into it.  Once a vacuum
* or analyze cycle has been done on it, it's more reasonable to believe
* the size is somewhat stable.
*
* (Note that this is only an issue if the plan gets cached and used again
* after the table has been filled.  What we're trying to avoid is using a
* nestloop-type plan on a table that has grown substantially since the
* plan was made.  Normally, autovacuum/autoanalyze will occur once enough
* inserts have happened and cause cached-plan invalidation; but that
* doesn't happen instantaneously, and it won't happen at all for cases
* such as temporary tables.)
*
* We approximate "never vacuumed" by "has relpages = 0", which means this
* will also fire on genuinely empty relations.  Not great, but
* fortunately that's a seldom-seen case in the real world, and it
* shouldn't degrade the quality of the plan too much anyway to err in
* this direction.
*
* If the table has inheritance children, we don't apply this heuristic.
* Totally empty parent tables are quite common, so we should be willing
* to believe that they are empty.

The code which decides if the table has been vacuumed here assumes it
has not if pg_class.relpages == 0. So even if you were to manually
vacuum the table the code here would think it's not yet been vacuumed.

David



Re: Row estimates for empty tables

From
Tom Lane
Date:
David Rowley <dgrowleyml@gmail.com> writes:
> On Fri, 24 Jul 2020 at 16:01, Christophe Pettus <xof@thebuild.com> wrote:
>> I realize I've never quite known this; where does the planner get the row estimates for an empty table?  Example:

> We just assume there are 10 pages if the relation has not yet been
> vacuumed or analyzed. The row estimates you see are the number of
> times 1 tuple is likely to fit onto a single page multiplied by the
> assumed 10 pages.  If you had made your table wider then the planner
> would have assumed fewer rows

Yeah.  Also note that since we have no ANALYZE stats in this scenario,
the row width estimate is going to be backed into via some guesses
based on column data types.  (It's fine for fixed-width types, much
less fine for var-width.)

There's certainly not a lot besides tradition to justify the exact
numbers used in this case.  However, we do have a good deal of
practical experience to justify the principle of "never assume a
table is empty, or even contains just one row, unless you're really
sure of that".  Otherwise you tend to end up with nestloop joins that
will perform horrifically if you were wrong.  The other join types
are notably less brittle.

            regards, tom lane



Re: Row estimates for empty tables

From
Christophe Pettus
Date:

> On Jul 24, 2020, at 06:48, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> There's certainly not a lot besides tradition to justify the exact
> numbers used in this case.

Since we already special-case parent tables for partition sets, would a storage parameter that lets you either tell the
planner"no, really, zero is reasonable here" or sets a minimum number of rows to plan for be reasonable?  I happened to
getbit by this tracking down an issue where several tables in a large query had zero rows, and the planner's assumption
ofa few pages worth caused some sub-optimal plans.  The performance hit wasn't huge, but they were being joined to some
*very*large tables, and the differences added up. 
--
-- Christophe Pettus
   xof@thebuild.com




Re: Row estimates for empty tables

From
Pavel Stehule
Date:


pá 24. 7. 2020 v 16:38 odesílatel Christophe Pettus <xof@thebuild.com> napsal:


> On Jul 24, 2020, at 06:48, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> There's certainly not a lot besides tradition to justify the exact
> numbers used in this case.

Since we already special-case parent tables for partition sets, would a storage parameter that lets you either tell the planner "no, really, zero is reasonable here" or sets a minimum number of rows to plan for be reasonable?  I happened to get bit by this tracking down an issue where several tables in a large query had zero rows, and the planner's assumption of a few pages worth caused some sub-optimal plans.  The performance hit wasn't huge, but they were being joined to some *very* large tables, and the differences added up.

I did this patch ten years ago.  GoodData application https://www.gooddata.com/  uses Postgres lot, and this application stores some results in tables (as guard against repeated calculations). Lot of these tables have zero or one row.

Although we ran an ANALYZE over all tables - the queries on empty tables had very bad plans, and I had to fix it by this patch. Another company uses a fake one row in table - so there is no possibility to have a really empty table.

It is an issue for special, not typical applications (this situation is typical for some OLAP patterns)  - it is not too often - but some clean solution (instead hacking postgres) can be nice.

Regards

Pavel
--
-- Christophe Pettus
   xof@thebuild.com



Re: Row estimates for empty tables

From
Christophe Pettus
Date:
> On Jul 24, 2020, at 12:14, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> this application stores some results in tables (as guard against repeated calculations). Lot of these tables have
zeroor one row.  

Yes, that's the situation we encountered, too.  It's not very common (and even less common, I would assume, that it
resultsin a bad plan), but it did in this case. 

--
-- Christophe Pettus
   xof@thebuild.com




Re: Row estimates for empty tables

From
Tom Lane
Date:
Pavel Stehule <pavel.stehule@gmail.com> writes:
> pá 24. 7. 2020 v 16:38 odesílatel Christophe Pettus <xof@thebuild.com>
> napsal:
>> Since we already special-case parent tables for partition sets, would a
>> storage parameter that lets you either tell the planner "no, really, zero
>> is reasonable here" or sets a minimum number of rows to plan for be
>> reasonable?

> It is an issue for special, not typical applications (this situation is
> typical for some OLAP patterns)  - it is not too often - but some clean
> solution (instead hacking postgres) can be nice.

The core issue here is "how do we know whether the table is likely to stay
empty?".  I can think of a couple of more or less klugy solutions:

1. Arrange to send out a relcache inval when adding the first page to
a table, and then remove the planner hack for disbelieving relpages = 0.
I fear this'd be a mess from a system structural standpoint, but it might
work fairly transparently.

2. Establish the convention that vacuuming or analyzing an empty table
is what you do to tell the system that this state is going to persist.
That's more or less what the existing comments in plancat.c envision,
but we never made a definition for how the occurrence of that event
would be recorded in the catalogs, other than setting relpages > 0.
Rather than adding another pg_class column, I'm tempted to say that
vacuum/analyze should set relpages to a minimum of 1, even if the
relation has zero pages.  That does get the job done:

regression=# create table foo(f1 text);
CREATE TABLE
regression=# explain select * from foo;
                       QUERY PLAN                       
--------------------------------------------------------
 Seq Scan on foo  (cost=0.00..23.60 rows=1360 width=32)
(1 row)

regression=# vacuum foo;  -- doesn't help
VACUUM
regression=# explain select * from foo;
                       QUERY PLAN                       
--------------------------------------------------------
 Seq Scan on foo  (cost=0.00..23.60 rows=1360 width=32)
(1 row)
regression=# update pg_class set relpages = 1 where relname = 'foo';
UPDATE 1
regression=# explain select * from foo;
                     QUERY PLAN                     
----------------------------------------------------
 Seq Scan on foo  (cost=0.00..0.00 rows=1 width=32)
(1 row)

(We're still estimating one row, but that's as a result of different
decisions that I'm not nearly as willing to compromise on...)

            regards, tom lane



Re: Row estimates for empty tables

From
Christophe Pettus
Date:

> On Jul 24, 2020, at 14:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Rather than adding another pg_class column, I'm tempted to say that
> vacuum/analyze should set relpages to a minimum of 1, even if the
> relation has zero pages.

If there's not an issue about relpages != actual pages on disk, that certain seems straight-forward, and no *more*
hackythan the current situation. 

--
-- Christophe Pettus
   xof@thebuild.com




Re: Row estimates for empty tables

From
Justin Pryzby
Date:
On Fri, Jul 24, 2020 at 09:14:04PM +0200, Pavel Stehule wrote:
> pá 24. 7. 2020 v 16:38 odesílatel Christophe Pettus <xof@thebuild.com> napsal:
> > Since we already special-case parent tables for partition sets, would a
> > storage parameter that lets you either tell the planner "no, really, zero
> > is reasonable here" or sets a minimum number of rows to plan for be
> > reasonable?  I happened to get bit by this tracking down an issue where
> > several tables in a large query had zero rows, and the planner's assumption
> > of a few pages worth caused some sub-optimal plans.  The performance hit
> > wasn't huge, but they were being joined to some *very* large tables, and
> > the differences added up.
> 
> I did this patch ten years ago.  GoodData application
> https://www.gooddata.com/  uses Postgres lot, and this application stores
> some results in tables (as guard against repeated calculations). Lot of
> these tables have zero or one row.
> 
> Although we ran an ANALYZE over all tables - the queries on empty tables
> had very bad plans, and I had to fix it by this patch. Another company uses
> a fake one row in table - so there is no possibility to have a really empty
> table.
> 
> It is an issue for special, not typical applications (this situation is
> typical for some OLAP patterns)  - it is not too often - but some clean
> solution (instead hacking postgres) can be nice.

On Mon, Aug 24, 2020 at 09:43:49PM +0200, Pavel Stehule wrote:
> This patch is just a workaround that works well 10 years (but for one
> special use case) - nothing more. Without this patch that application
> cannot work ever.

My own workaround was here:
https://www.postgresql.org/message-id/20200427181034.GA28974@telsasoft.com
|... 1) create an child table: CREATE TABLE x_child() INHERITS(x)
|and, 2) change the query to use "select from ONLY".
|
|(1) allows the planner to believe that the table really is empty, a conclusion
|it otherwise avoids and (2) avoids decending into the child (for which the
|planner would likewise avoid the conclusion that it's actually empty).

-- 
Justin