Re: Statistics Import and Export - Mailing list pgsql-hackers

From: Andres Freund
Subject: Re: Statistics Import and Export
Msg-id: x5yo5ly5u3dhzfaq3hrzeqhju3io5zmwk4yqrkge3ywcflffck@ghbbgsfv56yb
In response to: Re: Statistics Import and Export (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

On 2025-02-21 15:23:00 -0500, Andres Freund wrote:
> On 2025-02-20 01:39:34 -0800, Jeff Davis wrote:
> > Committed with some revisions on top of v48:
>
> This made the pg_upgrade tests considerably slower.
>
> In an assert build without optimization (since that's what I use for normal
> dev work):
>
> 1fd1bd87101^     65.03s
> 1fd1bd87101      86.84s
>
>
> Looking at the times in the regress_log, I see:
> [...]
> Which to me rather strongly suggests pg_dump has gotten a *lot* slower with
> this change.

Indeed. While the slowdown is worse with assertions and without compiler
optimizations, it's pretty bad otherwise too.

Optimized, non-cassert pg_dump and server, with the regression database contents:

$ time ./src/bin/pg_dump/pg_dump regression > /dev/null

real    0m1.314s
user    0m0.189s
sys    0m0.059s

$ time ./src/bin/pg_dump/pg_dump --no-statistics regression > /dev/null

real    0m0.472s
user    0m0.179s
sys    0m0.035s


Unoptimized, cassert server and pg_dump:

$ time ./src/bin/pg_dump/pg_dump regression > /dev/null

real    0m9.008s
user    0m0.396s
sys    0m0.108s


$ time ./src/bin/pg_dump/pg_dump --no-statistics regression > /dev/null

real    0m2.590s
user    0m0.347s
sys    0m0.037s


Looking at the query log, the biggest culprit is a *lot* of additional
queries, I think primarily these two:


SELECT c.oid::regclass AS relation, current_setting('server_version_num') AS version,
       c.relpages, c.reltuples, c.relallvisible
FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public' AND c.relname = 'alpha_neg_p2'

SELECT c.oid::regclass AS relation, s.attname, s.inherited,
       current_setting('server_version_num') AS version,
       s.null_frac, s.avg_width, s.n_distinct, s.most_common_vals, s.most_common_freqs,
       s.histogram_bounds, s.correlation, s.most_common_elems, s.most_common_elem_freqs,
       s.elem_count_histogram, s.range_length_histogram, s.range_empty_frac,
       s.range_bounds_histogram
FROM pg_stats s
JOIN pg_namespace n ON n.nspname = s.schemaname
JOIN pg_class c ON c.relname = s.tablename AND c.relnamespace = n.oid
WHERE s.schemaname = 'public' AND s.tablename = 'alpha_neg_p2'
ORDER BY s.attname, s.inherited


I think there are a few things wrong here:

1) Why do we need to plan this over and over? Tom put in a fair bit of work a
   while ago to make frequently-issued queries use prepared statements.

   In this case we spend more time replanning the query than executing it.
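   As a rough sketch of the idea at the SQL level (the statement name and
   parameter shape here are illustrative, not what pg_dump actually does):

   ```sql
   -- Hypothetical: prepare once per pg_dump run, execute once per table.
   PREPARE relstats(text, text) AS
     SELECT c.oid::regclass AS relation, c.relpages, c.reltuples, c.relallvisible
     FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE n.nspname = $1 AND c.relname = $2;

   EXECUTE relstats('public', 'alpha_neg_p2');
   ```

   After a few executions the server can switch to a generic plan, so the
   planning cost is paid a handful of times rather than once per table.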

2) Querying this one-by-one makes this much more expensive than if it were
   queried in a batched fashion, for multiple tables at once.  This is
   especially true if actually executed over network, rather than locally.
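   A batched variant could look roughly like this (table names are just
   examples); one round trip then covers many relations:

   ```sql
   -- Hypothetical batched form: fetch relation-level stats for many
   -- tables at once instead of one query per table.
   SELECT c.oid::regclass AS relation, c.relpages, c.reltuples, c.relallvisible
   FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
   WHERE n.nspname = 'public'
     AND c.relname = ANY (ARRAY['alpha_neg_p2', 'alpha_neg_p1', 'beta_neg_p0']);
   ```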

3) The query is unnecessarily expensive due to repeated joins gathering the
   same information.  pg_stats has a join to pg_namespace and pg_class, but
   then the query above joins to both *again*.

   And afaict the joins in the pg_stats query are pretty useless? Isn't all
   that information already available in pg_stats? I guess you did that to get
   it as a ::regclass, but isn't that already known, why requery it?
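   I.e. something along these lines ought to suffice, since the caller
   already knows which relation it is asking about (sketch, not a drop-in
   replacement for the query above):

   ```sql
   -- Hypothetical: pg_stats already filters by schemaname/tablename, so
   -- the extra joins to pg_namespace and pg_class can be dropped entirely.
   SELECT s.attname, s.inherited, s.null_frac, s.avg_width, s.n_distinct,
          s.most_common_vals, s.most_common_freqs, s.histogram_bounds,
          s.correlation
   FROM pg_stats s
   WHERE s.schemaname = 'public' AND s.tablename = 'alpha_neg_p2'
   ORDER BY s.attname, s.inherited;
   ```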


4) Why do we need to fetch the version twice for every table? That can't be
   right; it won't change while pg_dump is running.
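   It should be enough to fetch it once, right after connecting:

   ```sql
   -- Hypothetical: query the server version a single time per connection
   -- and cache it client-side, instead of once (or twice) per table.
   SELECT current_setting('server_version_num');
   ```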


Greetings,

Andres Freund


