Re: Statistics Import and Export: difference in statistics dumped - Mailing list pgsql-hackers
From | Ashutosh Bapat |
---|---|
Subject | Re: Statistics Import and Export: difference in statistics dumped |
Date | |
Msg-id | CAExHW5ttputnXg231arxPWWScP7MkZjqRNjt4jPeht5GfypErw@mail.gmail.com Whole thread Raw |
In response to | Re: Statistics Import and Export (Jeff Davis <pgsql@j-davis.com>) |
Responses |
Re: Statistics Import and Export: difference in statistics dumped
|
List | pgsql-hackers |
Hi Jeff, I am changing the subject on this email and thus creating a new thread to discuss this issue. On Fri, Feb 28, 2025 at 8:02 AM Jeff Davis <pgsql@j-davis.com> wrote: > > On Tue, 2025-02-25 at 11:11 +0530, Ashutosh Bapat wrote: > > So the dumped statistics are not restored exactly. The reason for > > this > > is the table statistics is dumped before dumping ALTER TABLE ... ADD > > CONSTRAINT command which changes the statistics. I think all the > > pg_restore_relation_stats() calls should be dumped after all the > > schema and data modifications have been done. OR what's the point in > > dumping statistics only to get rewritten even before restore > > finishes. > > In your example, it's not so bad because the stats are actually better: > the index is built after the data is present, and therefore relpages > and reltuples are correct. > > The problem is more clear if you use --no-data. If you load data, > ANALYZE, pg_dump --no-data, then reload the sql file, then the stats > are lost. > > That workflow is very close to what pg_upgrade does. We solved the > problem for pg_upgrade in commit 71b66171d0 by simply not updating the > statistics when building an index and IsBinaryUpgrade. > > To solve the issue with dump --no-data, I propose that we change the > test in 71b66171d0 to only update the stats if the physical relpages is > non-zero. I don't think I understand the patch well, but here's one question: If a table is truncated and index is rebuilt would the code in patch stop it from updating the stats? If yes, that looks problematic. > > Patch attached: > > * If the dump is --no-data, or during pg_upgrade, the table will be > empty, so the physical relpages will be zero and the restored stats > won't be overwritten. > > * If (like in your example) the dump includes data, the new stats are > based on real data, so they are better anyway. This is sort of like the > case where autoanalyze kicks in. > > * If the dump is --statistics-only, then there won't be any indexes > created in the SQL file, so when you restore the stats, they will > remain until you do something else to change them. > > * If your example really is a problem, you'd need to dump first with - > -no-statistics, and then with --statistics-only, and restore the two > SQL files in order. There are few problems 1. If there are thousands of tables with primary key constraints, we have twice the number of calls to pg_restore_relation_stats() of which only half will be useful. The stats written by the first set of calls will be overwritten by the second set of calls. The time spent in executing the first set of calls can be saved completely and to some extent time dumping the calls as well. It will be some measurable improvement I think. 2. We aren't restoring the statistics faithfully - as mentioned in Greg's reply. If users dump and restore with autovacuum turned off, they will be surprised to see the statistics to be different on the original and restored database - which may have other effects like change in plans. 3. The test I am building over at [1] is aimed at testing whether the objects dumped get restored faithfully by comparing dumps from the original and restored database. That's a bit crude method but is being used by some of our tests. I think it will be good to test statistics as well in that test. But if it's not going to be same on the original and the restored database we can not test it. For now, I have used --no-statistics. > > > Alternatively, we could put stats into SECTION_POST_DATA, which was > already discussed[*], and we decided against it (though there was not a > clear consensus). I haven't looked at the code which dumps the statistics, but it does seem simple dump the statistics after the constraint creation command for the tables with primary key constraint. That will dump not-up-to-date statistics and might overwrite the statistics [1] https://www.postgresql.org/message-id/CAExHW5sBbMki6Xs4XxFQQF3C4Wx3wxkLAcySrtuW3vrnOxXDNQ%40mail.gmail.com -- Best Wishes, Ashutosh Bapat
pgsql-hackers by date: