Re: Statistics Import and Export - Mailing list pgsql-hackers

From Greg Sabino Mullane
Subject Re: Statistics Import and Export
Msg-id CAKAnmmJyXoVFttrdCz=R-oZnbcJifzSidP9Z2wHO=pU4-AEb0g@mail.gmail.com
In response to Re: Statistics Import and Export (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
> Can you expand on some of those cases?

Certainly. I think one of the problems is that because this patch is solving a pg_upgrade issue, the focus is on the "dump and restore" scenarios. But pg_dump is used for much more than that, especially "dump and examine".

Although pg_dump is meant to be a canonical, logical representation of your schema and data, the stats add a non-deterministic element to that. Statistical sampling is random, so pg_dump output changes with each run. (Yes, COPY output can also change, but much less so, as I argue below.)

One use case is a program that is simply using pg_dump to verify that nothing has modified your table data (I'll use a single table for these examples, but obviously this applies to a whole database as well). So let's say we create a table and populate it at time X, then check back at a later time to verify things are still exactly as we left them.

dropdb gregtest
createdb gregtest
pgbench gregtest -i 2> /dev/null
pg_dump gregtest -t pgbench_accounts > a1
sleep 10
pg_dump gregtest -t pgbench_accounts > a2
diff a1 a2 | cut -c1-50

100078c100078
<       'histogram_bounds', '{2,964,1921,2917,3892,4935
---
>       'histogram_bounds', '{7,989,1990,2969,3973,4977

While COPY is not going to promise a particular output order, that order should not change except after deliberate actions: insert, update, delete, truncate, vacuum full, cluster (off the top of my head). What should not change the output is a background process gathering some metadata, or someone running a database-wide ANALYZE.
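
As a quick sanity check of that claim (a rough sketch reusing the gregtest database from above; your exact histogram values will differ), a plain ANALYZE is enough to change the dump even though no data has moved:

pg_dump gregtest -t pgbench_accounts > b1
psql gregtest -c 'ANALYZE pgbench_accounts'
pg_dump gregtest -t pgbench_accounts > b2
diff b1 b2 | cut -c1-50   # the statistics lines should differ; the COPY data should not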


Another use case is someone rolling out their schema to a QA box. All the table definitions and data are checked into a git repository, with a checksum. They want to roll it out, and then verify that everything is exactly as they expect it to be. Or the program is part of a test suite that does a sanity check that the database is in an exact known state before starting.
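
A minimal sketch of what that check might look like (schema.sql and schema.md5 are placeholder names for whatever the repo actually stores):

createdb qa_check
psql qa_check -f schema.sql
pg_dump qa_check > current.dump
md5sum -c schema.md5   # schema.md5 records the expected checksum of current.dump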

(Our system catalogs are very difficult to work with when reverse engineering objects. Thus, many programs rely on pg_dump to do the heavy lifting for them. Parsing the text file generated by pg_dump is much easier than trying to manipulate the system catalogs.)

So let's say the process is to create a new database, load things into it, and then checksum the result. We can simulate that with pgbench:

dropdb qa1; dropdb qa2
createdb qa1; createdb qa2
pgbench qa1 -i 2>/dev/null
pgbench qa2 -i 2>/dev/null
pg_dump qa1 > dump1; pg_dump qa2 > dump2

$ md5sum dump1
39a2da5e51e8541e9a2c025c918bf463  dump1

This md5sum does not match our repo! It doesn't even match the other one:

$ md5sum dump2
4a977657dfdf910cb66c875d29cfebf2  dump2

It's the stats, of course, which have added a dose of randomness that was not there before and made our checksums useless:

$ diff dump1 dump2 | cut -c1-50
100172c100172
<       'histogram_bounds', '{1,979,1974,2952,3973,4900
---
>       'histogram_bounds', '{8,1017,2054,3034,4045,513

With --no-statistics, the diff shows no difference, and the md5sum is always the same.
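
For anyone who wants to verify that locally (reusing the qa1 and qa2 databases from above):

pg_dump --no-statistics qa1 > ns1; pg_dump --no-statistics qa2 > ns2
diff ns1 ns2     # should print nothing
md5sum ns1 ns2   # the two checksums should match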

Just to be clear, I love this patch, and I love the fact that one of our major upgrade warts is finally getting fixed. I've tried fixing it myself a few times over the last decade or so, but lacked the skills to do so. :) So I am thrilled to have this finally done. I just don't think it should be enabled by default for everything using pg_dump. For the record, I would not strongly object to having stats on by default for binary dumps, although I would prefer them off.

So why not just expect people to modify their programs to use --no-statistics for cases like this? That's certainly an option, but it's going to break a lot of existing things, and create branching code:

old code:
pg_dump mydb -f pg.dump

new code:
version=$(pg_dump --version | sed 's/.* //; s/[^0-9].*//')   # major version of pg_dump
if [ "$version" -ge 18 ]; then
  pg_dump --no-statistics mydb -f pg.dump
else
  pg_dump mydb -f pg.dump
fi

Also, anything trained to parse pg_dump output will have to learn about the new SELECT pg_restore_ calls with their multi-line format. (I'm not 100% sure we don't already have multi-line calls somewhere; things like "SELECT setval" and "SELECT set_config" are single-line, but there may be existing exceptions.)
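
For a rough sense of what a parser is now up against, grepping a stats-enabled dump for the new calls shows how many of these multi-line statements get added (the exact names and layout may still shift as the patch evolves):

grep -c 'pg_restore_' dump1               # rough count of the new statistics calls
grep -A5 'pg_restore_' dump1 | head -20   # a peek at the multi-line format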


Cheers,
Greg

--
Enterprise Postgres Software Products & Tech Support
