analyze stats: child vs parent - Mailing list pgsql-performance

From Justin Pryzby
Subject analyze stats: child vs parent
Date
Msg-id 20171230055630.GF4172@telsasoft.com
Whole thread Raw
List pgsql-performance
This topic is confusing to lots of people, usually including myself, so I'm
hoping to clarify it at least to myself, and maybe provide a good reference or
doc update for others in the future.

autovacuum/analyze automatically scans tables being inserted/updated/deleted
and updates their statistics in pg_class and pg_statistic.  Since PG 9.0 [0,1],
ANALYZE (can) include stats of child tables along with stats of the (ONLY)
parent table.  But, autoanalyze still doesn't know to analyze (typical) empty
parent tables, which need to be manually ANALYZEd to include stats for their
children.

...which leaves one wondering: "which stats are being used?, and why are we
keeping two and apparently sometimes not looking at both/either" ?

I think the explanation is this:
 - Parent table stats without children (pg_statistic.stainherit='f') [2] are
used if you query SELECT ONLY).  Simple enough.

 - Postgres uses rowcount estimate as the primary component of query planning.
When planning a query involving a parent table, its rowcount estimate is
obtained as the sum of the rowcounts for its child nodes (appendrels) - if a
table is excluded by query exclusion, it doesn't even show up in the plan, and
if only a fraction of its rows are returned due to a restrictive clause, that's
reflected in its rowcount estimate and in the estimate of the parent.  So child
tables need to be analyzed for their rowcount (and also for their column stats
which affect rowcount).

 - But, column stats (ndistinct, most-common values, and histogram) are
relatively big, and there's nothing implemented (yet) to intelligently combine
them across child tables in a query.  So postgres, having obtained a rowcount
estimate for parent tables involved in a query, having determined how (or one
way) to join the tables, needs to determine how many rows are expected to
result be output by a join, which uses on parent table's column stats
(ndistinct, MCV list, histogram).

Is that mostly right ?

Toward the future: maybe, with declarative partitioning, combining
selectivities as in [3] is possible now without objectionable planning overhead
(?)

Justin

References
[0] https://www.postgresql.org/docs/9.0/static/release-9-0.html#AEN102560
[1] https://www.postgresql.org/message-id/flat/2674.1262040064%40sss.pgh.pa.us#2674.1262040064@sss.pgh.pa.us
[2] https://www.postgresql.org/docs/current/static/catalog-pg-statistic.html
[3] https://www.postgresql.org/message-id/flat/7363.1426537103%40sss.pgh.pa.us#7363.1426537103@sss.pgh.pa.us
more:
https://www.postgresql.org/message-id/7806.1310233430@sss.pgh.pa.us
https://www.postgresql.org/message-id/4B3875C6020000250002D97D@gw.wicourts.gov
https://www.postgresql.org/message-id/29559.1287206562%40sss.pgh.pa.us


pgsql-performance by date:

Previous
From: Justin Pryzby
Date:
Subject: Re: partitioning an existing table
Next
From: Robert Blayzor
Date:
Subject: Re: partitioning an existing table