Re: Statistics Import and Export: difference in statistics dumped - Mailing list pgsql-hackers

From Ashutosh Bapat
Subject Re: Statistics Import and Export: difference in statistics dumped
Date
Msg-id CAExHW5ttputnXg231arxPWWScP7MkZjqRNjt4jPeht5GfypErw@mail.gmail.com
Whole thread Raw
In response to Re: Statistics Import and Export  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: Statistics Import and Export: difference in statistics dumped
List pgsql-hackers
Hi Jeff,
I am changing the subject on this email and thus creating a new thread
to discuss this issue.

On Fri, Feb 28, 2025 at 8:02 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Tue, 2025-02-25 at 11:11 +0530, Ashutosh Bapat wrote:
> > So the dumped statistics are not restored exactly. The reason for
> > this
> > is the table statistics is dumped before dumping ALTER TABLE ... ADD
> > CONSTRAINT command which changes the statistics. I think all the
> > pg_restore_relation_stats() calls should be dumped after all the
> > schema and data modifications have been done. OR what's the point in
> > dumping statistics only to get rewritten even before restore
> > finishes.
>
> In your example, it's not so bad because the stats are actually better:
> the index is built after the data is present, and therefore relpages
> and reltuples are correct.
>
> The problem is more clear if you use --no-data. If you load data,
> ANALYZE, pg_dump --no-data, then reload the sql file, then the stats
> are lost.
>
> That workflow is very close to what pg_upgrade does. We solved the
> problem for pg_upgrade in commit 71b66171d0 by simply not updating the
> statistics when building an index and IsBinaryUpgrade.
>
> To solve the issue with dump --no-data, I propose that we change the
> test in 71b66171d0 to only update the stats if the physical relpages is
> non-zero.

I don't think I understand the patch well, but here's one question: If
a table is truncated and index is rebuilt would the code in patch stop
it from updating the stats? If yes, that looks problematic.

>
> Patch attached:
>
>  * If the dump is --no-data, or during pg_upgrade, the table will be
> empty, so the physical relpages will be zero and the restored stats
> won't be overwritten.
>
>  * If (like in your example) the dump includes data, the new stats are
> based on real data, so they are better anyway. This is sort of like the
> case where autoanalyze kicks in.
>
>  * If the dump is --statistics-only, then there won't be any indexes
> created in the SQL file, so when you restore the stats, they will
> remain until you do something else to change them.
>
>  * If your example really is a problem, you'd need to dump first with -
> -no-statistics, and then with --statistics-only, and restore the two
> SQL files in order.

There are few problems

1. If there are thousands of tables with primary key constraints, we
have twice the number of calls to pg_restore_relation_stats() of which
only half will be useful. The stats written by the first set of calls
will be overwritten by the second set of calls. The time spent in
executing the first set of calls can be saved completely and to some
extent time dumping the calls as well. It will be some measurable
improvement I think.

2. We aren't restoring the statistics faithfully - as mentioned in
Greg's reply. If users dump and restore with autovacuum turned off,
they will be surprised to see the statistics to be different on the
original and restored database - which may have other effects like
change in plans.

3. The test I am building over at [1] is aimed at testing whether the
objects dumped get restored faithfully by comparing dumps from the
original and restored database. That's a bit crude method but is being
used by some of our tests. I think it will be good to test statistics
as well in that test. But if it's not going to be same on the original
and the restored database we can not test it. For now, I have used
--no-statistics.

>
>
> Alternatively, we could put stats into SECTION_POST_DATA, which was
> already discussed[*], and we decided against it (though there was not a
> clear consensus).

I haven't looked at the code which dumps the statistics, but it does
seem simple dump the statistics after the constraint creation command
for the tables with primary key constraint. That will dump
not-up-to-date statistics and might overwrite the statistics


[1] https://www.postgresql.org/message-id/CAExHW5sBbMki6Xs4XxFQQF3C4Wx3wxkLAcySrtuW3vrnOxXDNQ%40mail.gmail.com

--
Best Wishes,
Ashutosh Bapat



pgsql-hackers by date:

Previous
From: Maxim Orlov
Date:
Subject: Re: Proposal: Limitations of palloc inside checkpointer
Next
From: Bertrand Drouvot
Date:
Subject: Re: per backend WAL statistics