Thread: [PROPOSAL] Backup and recovery of pg_statistic

[PROPOSAL] Backup and recovery of pg_statistic

From: Dmitry Ivanov
Hi hackers,

This is my proposal for 'pg_dump' enhancements that enable backup and
recovery of the statistical data stored in the 'pg_statistic' table. Statistics
are very valuable to the query planner, so support engineers currently have to
regenerate them manually with ANALYZE whenever they migrate or restore a
database from a backup, which is unpleasant. This proposal aims to improve the
overall situation.


Problem description
===================

Currently there is no way to back up 'pg_statistic', because columns of
'anyarray' type cannot be reconstructed solely from their textual
representation. Meanwhile, all that is needed to solve this problem is a small
extension capable of retrieving the element type of an 'anyarray' object and
recreating this particular 'anyarray' object using the 'array_in' procedure.
Another vital feature is the ability to transform the various object
identifiers stored in this table into textual representations in order to make
them portable. Such functionality is easy to implement, which is why I've
written a tiny proof-of-concept extension that demonstrates that recovery is
possible.


Design
======

Several things come to mind when we think about 'pg_statistic' recovery:

* A procedure written in C is needed in order to determine the element type
of an 'anyarray' object. The returned 'oid' will be used to reconstruct the
'anyarray' object using the 'array_in' procedure, for example:

    arr := array_in('{1,2,3}', 'text'::regtype::oid, -1);
    anyarray_elemtype(arr) -> 25 ('text'::regtype::oid)

* The columns 'starelid' (relation identifier) and 'staopN' (operator
identifiers) have type 'oid', so their values could be invalid within a new DB,
because the object IDs of newly recovered relations and operators might have
changed during recovery. Such values should be substituted with proper casts
from schema-qualified names, for example:

    65554 (starelid) -> 'public.test'::regclass::oid
    15 (staopN)      -> 'pg_catalog.=(pg_catalog.int4, pg_catalog.int8)'::regoperator::oid

Note that every object name is schema-qualified in order to avoid naming
conflicts.

* The type of the column referenced by the 'staattnum' column also needs to
be checked for the sake of consistency: it should remain the same, otherwise
the collected stats won't be of any use.

* The main procedure will simply generate a bunch of INSERT queries (one
query per row of 'pg_statistic') which can be saved to a text file, as
sketched below.
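
A hand-written illustration of one such query, laid out against the 9.4
catalog (the table, column and values are invented; 'to_attnum' is the helper
described in the 'Interface' section below):

    INSERT INTO pg_catalog.pg_statistic VALUES (
        'public.test'::regclass::oid,           -- starelid
        to_attnum('public.test', 'id'),         -- staattnum
        false,                                  -- stainherit
        0, 4, -1,                               -- stanullfrac, stawidth, stadistinct
        2, 0, 0, 0, 0,                          -- stakind1..5: histogram in slot 1
        'pg_catalog.<(pg_catalog.int4, pg_catalog.int4)'::regoperator::oid,  -- staop1
        0, 0, 0, 0,                             -- staop2..5
        NULL, NULL, NULL, NULL, NULL,           -- stanumbers1..5
        array_in('{1,500,1000}', 'pg_catalog.int4'::regtype::oid, -1),       -- stavalues1
        NULL, NULL, NULL, NULL                  -- stavalues2..5
    );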


Proof of concept
================

Interface
---------

Currently there's a PoC extension which contains several functions:

dump_statistic() - returns a set of INSERT queries;

anyarray_elemtype(anyarray) - returns the object identifier of the element
type;

to_schema_qualified_operator(opid oid) - converts the 'opid' (operator ID) to a
schema-qualified operator name;
to_schema_qualified_relname(relid oid) - converts the 'relid' (relation ID) to
a schema-qualified relation name;
to_schema_qualified_type(typid oid) - converts the 'typid' (type ID) to a
schema-qualified type name;

to_attname(rel text, colnum smallint) - returns the name of the Nth column of
the specified table 'rel';
to_attnum(rel text, col text) - converts the table name 'rel' and the column
name 'col' to a column number;

to_atttype(rel text, col text) - returns the type of the column 'col';
to_atttype(rel text, colnum smallint) - overloaded for the column number
'colnum';

The extension is compatible with PostgreSQL 9.4 and above.
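
For example, given a hypothetical table 'public.test (id int4, name text)',
the helpers behave roughly as follows (my illustration, not captured output):

    SELECT to_schema_qualified_relname('public.test'::regclass::oid);  -- public.test
    SELECT to_attname('public.test', 2::smallint);                     -- name
    SELECT to_attnum('public.test', 'name');                           -- 2
    SELECT to_atttype('public.test', 'name');                          -- pg_catalog.text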

Usage example
-------------

DB=# \copy (select dump_statistic()) to 'stat_backup.sql'
$ psql DB < stat_backup.sql


Proposed changes to pg_dump
===========================

Now that the approach has been developed, it may be applied to improve the
'pg_dump' utility. Some minor code changes would make 'pg_dump' emit
specially-formed recovery INSERTs for 'pg_statistic' in 'binary-upgrade'
mode, thus allowing saved stats to be restored after an upgrade.


Conclusion
==========

I've attached a tarball with the sources so that anyone can try the extension.

--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: [PROPOSAL] Backup and recovery of pg_statistic

From: Simon Riggs
On 27 November 2015 at 15:52, Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:

> Currently there is no way to back up 'pg_statistic', because columns of
> 'anyarray' type cannot be reconstructed solely from their textual
> representation. Meanwhile, all that is needed to solve this problem is a small
> extension capable of retrieving the element type of an 'anyarray' object and
> recreating this particular 'anyarray' object using the 'array_in' procedure.

Please submit a patch on core for this. 

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PROPOSAL] Backup and recovery of pg_statistic

From: Bruce Momjian
On Fri, Nov 27, 2015 at 06:52:59PM +0300, Dmitry Ivanov wrote:
> Proposed changes to pg_dump
> ===========================
> 
> Now that the approach has been developed, it may be applied to improve the 
> 'pg_dump' utility. Some minor code changes would make 'pg_dump' emit 
> specially-formed recovery INSERTs for 'pg_statistic' in 'binary-upgrade' 
> mode, thus allowing saved stats to be restored after an upgrade.

I think this is great.  My only question for the list is to what extent
this dumping creates new compatibility requirements between major
versions.  While pg_upgrade could disable it, pg_dump can't, because it
doesn't know what PG version it is being loaded into.  Can we create a
format that is portable to all future PG versions, or at least detect
incompatibility?  I am thinking we would need some pg_statistic version
number in the dump data that can cause the data to be ignored.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +



Re: [PROPOSAL] Backup and recovery of pg_statistic

From: Tom Lane
Bruce Momjian <bruce@momjian.us> writes:
> On Fri, Nov 27, 2015 at 06:52:59PM +0300, Dmitry Ivanov wrote:
>> Now that the approach has been developed, it may be applied to improve the 
>> 'pg_dump' utility. Some minor code changes would make 'pg_dump' emit 
>> specially-formed recovery INSERTs for 'pg_statistic' in 'binary-upgrade' 
>> mode, thus allowing saved stats to be restored after an upgrade.

> I think this is great.  My only question for the list is to what extent
> this dumping creates new compatibility requirements between major
> versions.

That concern is exactly the reason why we never did this originally.
In particular, emitting raw INSERTs into pg_statistic is just plain
foolish; we have changed the rowtype of pg_statistic in the past and
are likely to do so again.  At a minimum we would need such a facility
to be proof against addition of more statistic "slots" (more columns)
to pg_statistic.

While at Salesforce, I did some work on this problem, and came up
with code that emitted calls to functions that would insert data for
a single "slot".  That is, the dump consisted of calls to functions
defined more or less like this:

    pg_load_statistics_slot(table_oid regclass,
                            attr_name name,
                            slot_kind int,
                            slot_op oid,
                            slot_numbers float4[],
                            slot_values anyarray);

and there was another one to load the non-slot columns of a pg_statistic
entry.  And of course there was code to emit such a dump, producing one
dump statement per occupied "slot" in pg_statistic plus one call to
the other function per pg_statistic row.
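
For concreteness, a dump might have contained pairs of calls along these
lines (illustrative values; the non-slot loader's exact name and argument
list here are approximate):

    SELECT pg_load_statistics_row('public.test'::regclass, 'id',
                                  false,          -- inherited stats?
                                  0.0, 4, -1.0);  -- nullfrac, width, ndistinct
    SELECT pg_load_statistics_slot('public.test'::regclass, 'id',
                                   2,             -- slot kind: histogram
                                   'pg_catalog.<(pg_catalog.int4, pg_catalog.int4)'::regoperator::oid,
                                   NULL::float4[],           -- slot_numbers
                                   '{1,500,1000}'::int4[]);  -- slot_values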

An API like this seems a good deal more future-proof than plain INSERTs.
Even if the underlying pg_statistic representation changes, such functions
could try to adapt the presented data to the new format, or at worst they
could be redefined as no-ops.

The major problem that I never had a very good solution for is how the
"load" function could validate that the presented data was actually valid
for the specified table attribute's data type and slot kind.  The absolute
minimum that you would want it to do is to cross-check that the
slot_values array has the correct element datatype, because if it doesn't
the backend is just about certain to crash when it tries to use the data.
I think there are probably other assumptions that would need to be
validated to be safe, depending on the code that looks at each slot kind.
I have not found an answer other than the load function knowing all there
is to know about each defined slot kind; which seems like a maintenance
problem, not to mention breaking extensions (like PostGIS) that define
their own slot kinds.
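
The element-type check itself is easy enough to sketch; assuming an
anyarray_elemtype() helper like the one proposed upthread, something like
this (shown as plpgsql for brevity; the real thing would live inside the C
load function, and this form only covers slot kinds whose values are of the
column's own type, e.g. MCV or histogram):

    CREATE FUNCTION check_slot_values(table_oid regclass, attr_name name,
                                      slot_values anyarray) RETURNS void AS $$
    DECLARE
        col_type oid;
    BEGIN
        SELECT atttypid INTO col_type FROM pg_catalog.pg_attribute
         WHERE attrelid = table_oid AND attname = attr_name;
        IF anyarray_elemtype(slot_values) <> col_type THEN
            RAISE EXCEPTION 'element type of slot_values does not match column %',
                            attr_name;
        END IF;
    END;
    $$ LANGUAGE plpgsql;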

Failing any good solution to that, we could probably get by by restricting
use of these functions to superusers (which is certainly where direct use
of INSERTs would have to stop).  But that puts a pretty considerable crimp
in the usefulness of the stats dump/load process.  Ideally, ordinary users
could use this facility to transfer statistics for their own tables, just
as they can use pg_dump ... but without adequate validity checking, it's
way too much of a security hazard to allow that.

I don't have access to this code anymore, but could probably persuade
Salesforce to donate it.  Or we could just write something similar
from scratch.
        regards, tom lane



Re: [PROPOSAL] Backup and recovery of pg_statistic

From: Dmitry Ivanov
> That concern is exactly the reason why we never did this originally.
> In particular, emitting raw INSERTs into pg_statistic is just plain
> foolish; we have changed the rowtype of pg_statistic in the past and
> are likely to do so again.  At a minimum we would need such a facility
> to be proof against addition of more statistic "slots" (more columns)
> to pg_statistic.

While this approach may indeed look dumb, it is intended to be used only in 
conjunction with the 'binary-upgrade' option, which effectively means that the 
pg_dump-generated INSERT statements have to be compatible only with the release 
that includes this very pg_dump version. Thus, there's no need for validation.

> And of course there was code to emit such a dump, producing one
> dump statement per occupied "slot" in pg_statistic plus one call to
> the other function per pg_statistic row.

> An API like this seems a good deal more future-proof than plain INSERTs.

This sounds really good, but I doubt that it is necessary if the goal is just 
to preserve statistical data during an upgrade.

> Ideally, ordinary users
> could use this facility to transfer statistics for their own tables, just
> as they can use pg_dump ... but without adequate validity checking, it's
> way too much of a security hazard to allow that.

This could be implemented separately from pg_dump if needed. The present 
proposal aims at preserving statistical data during a database upgrade, which 
is a rather important feature required by many DBAs. Of course, it would be 
great if there were a way for users to dump and restore stats for their own 
tables on a whim, but in my opinion that is far less important.




Re: [PROPOSAL] Backup and recovery of pg_statistic

From: Tom Lane
Dmitry Ivanov <d.ivanov@postgrespro.ru> writes:
>> That concern is exactly the reason why we never did this originally.
>> In particular, emitting raw INSERTs into pg_statistic is just plain
>> foolish; we have changed the rowtype of pg_statistic in the past and
>> are likely to do so again.  At a minimum we would need such a facility
>> to be proof against addition of more statistic "slots" (more columns)
>> to pg_statistic.

> While this approach may indeed look dumb, it is intended to be used only in 
> conjunction with the 'binary-upgrade' option, which effectively means that the 
> pg_dump-generated INSERT statements have to be compatible only with the release 
> that includes this very pg_dump version. Thus, there's no need for validation.

There's still the problem that it won't work across a major version
upgrade that makes any change whatsoever to the rowtype of pg_statistic.
That's a sufficient reason to reject it.

If the amount of infrastructure required were trivial, I'd probably be
okay with just putting it in as a kluge that pg_upgrade could use for
some version transitions and not others.  But your proposal seems quite
complicated, which makes me think we could solve the more general problem
for roughly comparable amounts of work.
        regards, tom lane



Re: [PROPOSAL] Backup and recovery of pg_statistic

From: Dmitry Ivanov
> There's still the problem that it won't work across a major version
> upgrade that makes any change whatsoever to the rowtype of pg_statistic.

I'm sorry if my previous explanation was poor; this time I am going to be 
more detailed.

Suppose you want to upgrade from 9.4 to 9.6. In that case you would use the 
pg_upgrade utility provided by release 9.6, which means that it is pg_dump 
that would have to connect to the older instance and prepare tuples to be 
inserted into the pg_statistic of the newer instance. The pg_dump utility 
would have to convert statistical data to the new format (for example, by 
adding placeholders for new columns), so the generated INSERT statements would 
be fine provided that pg_dump is up to date.

The documentation states that we should always run the pg_upgrade binary of 
the new server, not the old one 
[http://www.postgresql.org/docs/9.5/static/pgupgrade.html, Usage, #9]. This 
means that pg_upgrade will definitely use a fresh version of the pg_dump 
utility that is aware of all possible pitfalls.

Furthermore, each INSERT statement consists of textually-serialized columns 
of pg_statistic. Columns of 'anyarray' type are deserialized using the 
'array_in' procedure, which performs various sanity checks, including an 
element type check. Thus it is not possible to insert an anyarray object that 
would crash the server.
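
For instance (invoking 'array_in' directly; the exact error text may vary 
between versions):

    SELECT array_in('{1,2,foo}', 'pg_catalog.int4'::regtype::oid, -1);
    -- ERROR:  invalid input syntax for integer: "foo"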


-- 
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: [PROPOSAL] Backup and recovery of pg_statistic

From: Dmitry Ivanov
Here's the patch with an 'array_elemtype' procedure which returns an array's
element type as an oid. It should apply on top of commit fc995b. I wasn't sure
whether I should start a new thread.
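
For reference, with the patch applied a call looks something like this (my
example, not taken from the patch):

    SELECT array_elemtype(ARRAY[1, 2, 3]);  -- 23, i.e. 'int4'::regtype::oid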


--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Re: [PROPOSAL] Backup and recovery of pg_statistic

From: Bruce Momjian
On Thu, Dec 24, 2015 at 05:28:11PM +0300, Dmitry Ivanov wrote:
> Suppose you want to upgrade from 9.4 to 9.6. In that case you would use the 
> pg_upgrade utility provided by release 9.6, which means that it is pg_dump 
> that would have to connect to the older instance and prepare tuples to be 
> inserted into the pg_statistic of the newer instance. The pg_dump utility 
> would have to convert statistical data to the new format (for example, by 
> adding placeholders for new columns), so the generated INSERT statements would 
> be fine provided that pg_dump is up to date.
> 
> The documentation states that we should always run the pg_upgrade binary of 
> the new server, not the old one 
> [http://www.postgresql.org/docs/9.5/static/pgupgrade.html, Usage, #9]. This 
> means that pg_upgrade will definitely use a fresh version of the pg_dump 
> utility that is aware of all possible pitfalls.
> 
> Furthermore, each INSERT statement consists of textually-serialized columns 
> of pg_statistic. Columns of 'anyarray' type are deserialized using the 
> 'array_in' procedure, which performs various sanity checks, including an 
> element type check. Thus it is not possible to insert an anyarray object that 
> would crash the server.

My idea was to do the insert via a function, with a version number at
the head:
SELECT pg_stats_insert('1.0', '{row value}');

When the pg_stats format is changed, the version number is bumped.  The
backend would know the pg_stats version it supports and either remap or
ignore pg_stats_insert() calls for older versions.

To get more complicated, you could have a version number for data types
too and just invalidate inserts for data type format changes, rather
than requiring the entire pg_stats version to be bumped.  I am not sure
how we would consistently record the data type name.  pg_upgrade
preserves pg_type.oid, so that would work for it, but pg_type.oid is not
preserved for non-pg_upgrade usage of non-builtin data types.  For those
cases, I guess the type name would be sufficient.
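
A minimal sketch of such a function (all names and the exact version-handling 
policy here are hypothetical):

    CREATE FUNCTION pg_stats_insert(fmt_version text, row_value text)
    RETURNS void AS $$
    BEGIN
        IF fmt_version <> '1.0' THEN
            -- Unknown formats are skipped rather than loaded; a remapping
            -- branch for known older versions could be added here.
            RAISE WARNING 'ignoring statistics in unsupported format %',
                          fmt_version;
            RETURN;
        END IF;
        -- ... decompose row_value and insert it into pg_statistic here ...
    END;
    $$ LANGUAGE plpgsql;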

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +