Home > mailing lists

analyze stats: child vs parent - Mailing list pgsql-performance

From	Justin Pryzby
Subject	analyze stats: child vs parent
Date	December 30, 2017 08:56:30
Msg-id	20171230055630.GF4172@telsasoft.com Whole thread Raw
List	pgsql-performance

Tree view

This topic is confusing to lots of people, usually including myself, so I'm
hoping to clarify it at least to myself, and maybe provide a good reference or
doc update for others in the future.

autovacuum/analyze automatically scans tables being inserted/updated/deleted
and updates their statistics in pg_class and pg_statistic.  Since PG 9.0 [0,1],
ANALYZE (can) include stats of child tables along with stats of the (ONLY)
parent table.  But, autoanalyze still doesn't know to analyze (typical) empty
parent tables, which need to be manually ANALYZEd to include stats for their
children.

...which leaves one wondering: "which stats are being used?, and why are we
keeping two and apparently sometimes not looking at both/either" ?

I think the explanation is this:
 - Parent table stats without children (pg_statistic.stainherit='f') [2] are
used if you query SELECT ONLY).  Simple enough.

 - Postgres uses rowcount estimate as the primary component of query planning.
When planning a query involving a parent table, its rowcount estimate is
obtained as the sum of the rowcounts for its child nodes (appendrels) - if a
table is excluded by query exclusion, it doesn't even show up in the plan, and
if only a fraction of its rows are returned due to a restrictive clause, that's
reflected in its rowcount estimate and in the estimate of the parent.  So child
tables need to be analyzed for their rowcount (and also for their column stats
which affect rowcount).

 - But, column stats (ndistinct, most-common values, and histogram) are
relatively big, and there's nothing implemented (yet) to intelligently combine
them across child tables in a query.  So postgres, having obtained a rowcount
estimate for parent tables involved in a query, having determined how (or one
way) to join the tables, needs to determine how many rows are expected to
result be output by a join, which uses on parent table's column stats
(ndistinct, MCV list, histogram).

Is that mostly right ?

Toward the future: maybe, with declarative partitioning, combining
selectivities as in [3] is possible now without objectionable planning overhead
(?)

Justin

References
[0] https://www.postgresql.org/docs/9.0/static/release-9-0.html#AEN102560
[1] https://www.postgresql.org/message-id/flat/2674.1262040064%40sss.pgh.pa.us#2674.1262040064@sss.pgh.pa.us
[2] https://www.postgresql.org/docs/current/static/catalog-pg-statistic.html
[3] https://www.postgresql.org/message-id/flat/7363.1426537103%40sss.pgh.pa.us#7363.1426537103@sss.pgh.pa.us
more:
https://www.postgresql.org/message-id/7806.1310233430@sss.pgh.pa.us
https://www.postgresql.org/message-id/4B3875C6020000250002D97D@gw.wicourts.gov
https://www.postgresql.org/message-id/29559.1287206562%40sss.pgh.pa.us

pgsql-performance by date:

From: Justin Pryzby
Date: 30 December 2017, 08:38:21
Subject: Re: partitioning an existing table

From: Robert Blayzor
Date: 30 December 2017, 17:19:05
Subject: Re: partitioning an existing table

analyze stats: child vs parent - Mailing list pgsql-performance

Previous

Next