Thread: [PROPOSAL] Backup and recovery of pg_statistic
Hi hackers,

This is my proposal for 'pg_dump' utility enhancements which enable backup and
recovery of statistical data stored in the 'pg_statistic' table. Statistical
data is very valuable for the query planner, so support engineers have to
regenerate it manually with ANALYZE whenever they migrate or restore a database
from a backup, which is unpleasant. This proposal aims to improve the overall
situation.

Problem description
===================

Currently there is no way to back up 'pg_statistic' because columns of
'anyarray' type cannot be reconstructed solely from their textual
representation. Meanwhile, all that is needed to solve this problem is a small
extension capable of retrieving the element type of an 'anyarray' object and
recreating that particular 'anyarray' object using the 'array_in' procedure.
Another vital feature is the ability to transform the various object
identifiers stored in this table into textual representations in order to make
them portable. Such functionality could be implemented easily, which is why
I've put together a tiny proof-of-concept extension that demonstrates that
recovery is possible.

Design
======

Several things come to mind when we think of 'pg_statistic' recovery:

* A procedure written in C is needed in order to determine the element type of
  an 'anyarray' value. The returned 'oid' will be used to reconstruct the
  'anyarray' object using the 'array_in' procedure:

      arr := array_in('{1,2,3}', 'text'::regtype::oid, -1);
      anyarray_elemtype(arr) -> 25 ('text'::regtype::oid)

* The columns 'starelid' (relation identifier) and 'staop' (operator
  identifier) have type 'oid', so their values could be invalid within a new
  DB, because the object IDs of newly recovered relations and operators might
  have changed during recovery. These kinds of values should be substituted
  with proper casts to internal types, for example:

      65554 (relid)  ->  'public.test'::regclass::oid
      15 (staopN)    ->  'pg_catalog.=(pg_catalog.int4, pg_catalog.int8)'

  Note that every type is schema-qualified in order to avoid naming conflicts.

* The type of the column referenced by the 'staattnum' column also needs to be
  checked for the sake of consistency. It should remain the same, otherwise the
  collected stats won't be of any use.

* The main procedure will simply generate a bunch of INSERT queries (one query
  per row of 'pg_statistic') which can be saved to a text file.

Proof of concept
================

Interface
---------

Currently there's a PoC extension which contains several functions:

  dump_statistic() - returns a set of INSERT queries;

  anyarray_elemtype(anyarray) - returns the object identifier of the element
  type;

  to_schema_qualified_operator(opid oid) - converts the 'opid' (operator ID) to
  a schema-qualified operator name;

  to_schema_qualified_relname(relid oid) - converts the 'relid' (relation ID)
  to a schema-qualified relation name;

  to_schema_qualified_type(typid oid) - converts the 'typid' (type ID) to a
  schema-qualified type name;

  to_attname(rel text, colnum smallint) - returns the name of the Nth column of
  the specified table 'rel';

  to_attnum(rel text, col text) - converts the table name 'rel' and the column
  name 'col' to a column number;

  to_atttype(rel text, col text) - returns the type of the column 'col';

  to_atttype(rel text, colnum smallint) - overloaded for the column number
  'colnum';

The extension is compatible with versions 9.4 and above.
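To give a feel for how these helpers fit together, here is a rough illustration
(it assumes the PoC extension is installed; the table name 'test' and the
results shown are examples, not guaranteed output):

  SELECT anyarray_elemtype(array_in('{1,2,3}', 'text'::regtype::oid, -1));
  --> 25, i.e. 'text'::regtype::oid

  SELECT to_schema_qualified_relname('test'::regclass::oid);
  --> public.test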
Usage example
-------------

  DB=# \copy (select dump_statistic()) to 'stat_backup.sql'

  $ psql DB < stat_backup.sql

Proposed changes to pg_dump
===========================

Now that the approach has been developed, it may be applied to improve the
'pg_dump' utility. Some minor code changes would make 'pg_dump' emit
specially-formed recovery INSERTs for 'pg_statistic' in 'binary-upgrade' mode,
thus allowing us to restore the saved stats after an upgrade.

Conclusion
==========

I've attached a tarball with the sources so that anyone can try the extension.

--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
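For illustration, each recovery INSERT generated this way would embed portable
expressions along these lines instead of raw OIDs or bare array literals (a
sketch only, not actual dump_statistic() output; 'public.test' and the three
columns shown are just examples):

  SELECT 'public.test'::regclass::oid                             AS starelid,
         '=(integer,integer)'::regoperator::oid                   AS staop1,
         array_in('{1,2,3}', 'pg_catalog.int4'::regtype::oid, -1) AS stavalues1;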
On 27 November 2015 at 15:52, Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:
> Currently there is no way to back up 'pg_statistic' because columns of
> 'anyarray' type cannot be reconstructed solely from their textual
> representation. Meanwhile, all that is needed to solve this problem is a small
> extension capable of retrieving the element type of an 'anyarray' object and
> recreating that particular 'anyarray' object using the 'array_in' procedure.

Please submit a patch on core for this.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Nov 27, 2015 at 06:52:59PM +0300, Dmitry Ivanov wrote:
> Proposed changes to pg_dump
> ===========================
>
> Now that the approach has been developed, it may be applied to improve the
> 'pg_dump' utility. Some minor code changes would make 'pg_dump' emit
> specially-formed recovery INSERTs for 'pg_statistic' in 'binary-upgrade'
> mode, thus allowing us to restore the saved stats after an upgrade.

I think this is great.  My only question for the list is how much does this
dumping create new compatibility requirements between major versions.

While pg_upgrade could disable it, pg_dump can't because it doesn't know what
PG version it is being loaded into.  Can we create a format that is portable
to all future PG versions, or at least detect incompatibility?  I am thinking
we would need some pg_statistic version number in the dump data that can cause
the data to be ignored.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Roman grave inscription         +
Bruce Momjian <bruce@momjian.us> writes:
> On Fri, Nov 27, 2015 at 06:52:59PM +0300, Dmitry Ivanov wrote:
>> Now that the approach has been developed, it may be applied to improve the
>> 'pg_dump' utility. Some minor code changes would make 'pg_dump' emit
>> specially-formed recovery INSERTs for 'pg_statistic' in 'binary-upgrade'
>> mode, thus allowing us to restore the saved stats after an upgrade.

> I think this is great.  My only question for the list is how much does
> this dumping create new compatibility requirements between major
> versions.

That concern is exactly the reason why we never did this originally.
In particular, emitting raw INSERTs into pg_statistic is just plain foolish;
we have changed the rowtype of pg_statistic in the past and are likely to do
so again.  At a minimum we would need such a facility to be proof against
addition of more statistic "slots" (more columns) to pg_statistic.

While at Salesforce, I did some work on this problem, and came up with code
that emitted calls to functions that would insert data for a single "slot".
That is, the dump consisted of calls to functions defined more or less like
this:

    pg_load_statistics_slot(table_oid regclass,
                            attr_name name,
                            slot_kind int,
                            slot_op oid,
                            slot_numbers float4[],
                            slot_values anyarray);

and there was another one to load the non-slot columns of a pg_statistic
entry.  And of course there was code to emit such a dump, producing one dump
statement per occupied "slot" in pg_statistic plus one call to the other
function per pg_statistic row.

An API like this seems a good deal more future-proof than plain INSERTs.
Even if the underlying pg_statistic representation changes, such functions
could try to adapt the presented data to the new format, or at worst they
could be redefined as no-ops.

The major problem that I never had a very good solution for is how the "load"
function could validate that the presented data was actually valid for the
specified table attribute's data type and slot kind.  The absolute minimum
that you would want it to do is to cross-check that the slot_values array has
the correct element datatype, because if it doesn't, the backend is just about
certain to crash when it tries to use the data.  I think there are probably
other assumptions that would need to be validated to be safe, depending on the
code that looks at each slot kind.  I have not found an answer other than the
load function knowing all there is to know about each defined slot kind, which
seems like a maintenance problem, not to mention breaking extensions (like
PostGIS) that define their own slot kinds.

Failing any good solution to that, we could probably get by by restricting use
of these functions to superusers (which is certainly where direct use of
INSERTs would have to stop).  But that puts a pretty considerable crimp in the
usefulness of the stats dump/load process.  Ideally, ordinary users could use
this facility to transfer statistics for their own tables, just as they can
use pg_dump ... but without adequate validity checking, it's way too much of a
security hazard to allow that.

I don't have access to this code anymore, but could probably persuade
Salesforce to donate it.  Or we could just write something similar from
scratch.

			regards, tom lane
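Purely for illustration, a dump produced under such an API might consist of
statements shaped like this one (pg_load_statistics_slot is only the sketched
name above, not an existing function; the table, column, and values are
invented, with slot_kind 1 standing for a most-common-values slot):

  SELECT pg_load_statistics_slot('public.test'::regclass,   -- table
                                 'id',                       -- attribute
                                 1,                          -- MCV slot kind
                                 '=(integer,integer)'::regoperator::oid,
                                 '{0.30,0.25,0.10}'::float4[],  -- frequencies
                                 '{5,7,42}'::int4[]);           -- common values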
> That concern is exactly the reason why we never did this originally.
> In particular, emitting raw INSERTs into pg_statistic is just plain
> foolish; we have changed the rowtype of pg_statistic in the past and
> are likely to do so again. At a minimum we would need such a facility
> to be proof against addition of more statistic "slots" (more columns)
> to pg_statistic.

While this approach may indeed look dumb, it is intended to be used only in
conjunction with the 'binary-upgrade' option, which effectively means that the
pg_dump-generated INSERT statements have to be compatible only with the
release that includes this very pg_dump version.  Thus, there's no need for
validation.

> And of course there was code to emit such a dump, producing one
> dump statement per occupied "slot" in pg_statistic plus one call to
> the other function per pg_statistic row.
> An API like this seems a good deal more future-proof than plain INSERTs.

This sounds really good, but I doubt that it is necessary if we just want to
preserve statistical data during an upgrade.

> Ideally, ordinary users
> could use this facility to transfer statistics for their own tables, just
> as they can use pg_dump ... but without adequate validity checking, it's
> way too much of a security hazard to allow that.

This could be implemented separately from pg_dump if needed.  The current
proposal aims at preserving statistical data during a database upgrade, which
is a rather important feature required by many DBAs.  Of course, it would be
great if there were a way for a user to dump and restore stats for his own
tables on a whim, but, in my opinion, that is far less important.
Dmitry Ivanov <d.ivanov@postgrespro.ru> writes:
>> That concern is exactly the reason why we never did this originally.
>> In particular, emitting raw INSERTs into pg_statistic is just plain
>> foolish; we have changed the rowtype of pg_statistic in the past and
>> are likely to do so again. At a minimum we would need such a facility
>> to be proof against addition of more statistic "slots" (more columns)
>> to pg_statistic.

> While this approach may indeed look dumb, it is intended to be used only in
> conjunction with the 'binary-upgrade' option, which effectively means that
> the pg_dump-generated INSERT statements have to be compatible only with the
> release that includes this very pg_dump version. Thus, there's no need for
> validation.

There's still the problem that it won't work across a major version upgrade
that makes any change whatsoever to the rowtype of pg_statistic.  That's a
sufficient reason to reject it.

If the amount of infrastructure required were trivial, I'd probably be okay
with just putting it in as a kluge that pg_upgrade could use for some version
transitions and not others.  But your proposal seems quite complicated, which
makes me think we could solve the more general problem for roughly comparable
amounts of work.

			regards, tom lane
> There's still the problem that it won't work across a major version
> upgrade that makes any change whatsoever to the rowtype of pg_statistic.

I'm sorry if my previous explanation was poor; this time I am going to be
detailed.

Suppose you want to upgrade from 9.4 to 9.6.  In that case you would use the
pg_upgrade utility provided by release 9.6, which means that it's pg_dump that
would have to connect to the older instance and prepare tuples to be inserted
into the pg_statistic of the newer instance.  The pg_dump utility would have
to convert the statistical data to the new format (for example, add
placeholders for new columns), so the generated INSERT statements would be
fine provided that pg_dump is up to date.

The documentation states that we should always run the pg_upgrade binary of
the new server, not the old one
[http://www.postgresql.org/docs/9.5/static/pgupgrade.html, Usage, #9].  This
means that pg_upgrade will definitely use a fresh version of the pg_dump
utility that is aware of all possible pitfalls.

Furthermore, each INSERT statement consists of textually-serialized columns of
pg_statistic.  Columns of 'anyarray' type are deserialized using the
'array_in' procedure, which performs various sanity checks, including the
element type check.  Thus it is not possible to insert an anyarray object
which will cause server death.

--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
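As a quick illustration of that last point (the exact error text varies
between server versions):

  SELECT array_in('{1,2,3}', 'pg_catalog.int4'::regtype::oid, -1);
  --> {1,2,3}

  SELECT array_in('{one,two}', 'pg_catalog.int4'::regtype::oid, -1);
  -- ERROR:  invalid input syntax for integer: "one"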
Here's the patch with an 'array_elemtype' procedure which returns an array's
element type as an oid.  It should apply on top of commit fc995b.  I wasn't
sure whether I should start a new thread.

--
Dmitry Ivanov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On Thu, Dec 24, 2015 at 05:28:11PM +0300, Dmitry Ivanov wrote:
> Suppose you want to upgrade from 9.4 to 9.6. In that case you would use the
> pg_upgrade utility provided by release 9.6, which means that it's pg_dump
> that would have to connect to the older instance and prepare tuples to be
> inserted into the pg_statistic of the newer instance. The pg_dump utility
> would have to convert the statistical data to the new format (for example,
> add placeholders for new columns), so the generated INSERT statements would
> be fine provided that pg_dump is up to date.
>
> The documentation states that we should always run the pg_upgrade binary of
> the new server, not the old one
> [http://www.postgresql.org/docs/9.5/static/pgupgrade.html, Usage, #9]. This
> means that pg_upgrade will definitely use a fresh version of the pg_dump
> utility that is aware of all possible pitfalls.
>
> Furthermore, each INSERT statement consists of textually-serialized columns
> of pg_statistic. Columns of 'anyarray' type are deserialized using the
> 'array_in' procedure, which performs various sanity checks, including the
> element type check. Thus it is not possible to insert an anyarray object
> which will cause server death.

My idea was to do the insert via a function, with a version number at the
head:

	SELECT pg_stats_insert('1.0', '{row value}');

When the pg_stats format is changed, the version number is bumped.  The
backend would know the pg_stats version it supports and either remap or ignore
pg_stats_insert() calls for older versions.

To get more complicated, you could have a version number for data types too,
and just invalidate inserts for data type format changes, rather than
requiring the entire pg_stats version to be bumped.

I am not sure how we would consistently record the data type name.
pg_upgrade preserves pg_type.oid, so that would work for it, but pg_type.oid
is not preserved for non-pg_upgrade usage of non-builtin data types.  For
those cases, I guess the type name would be sufficient.

--
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Roman grave inscription         +
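A very rough sketch of how such a versioned entry point could behave
(hypothetical: the function, its signature, and the '1.0' format tag are
illustrative only, not a worked-out proposal):

  CREATE FUNCTION pg_stats_insert(fmt_version text, row_value text)
  RETURNS void LANGUAGE plpgsql AS $$
  BEGIN
      IF fmt_version <> '1.0' THEN
          -- data written for an unknown pg_statistic format: skip it
          RAISE WARNING 'ignoring statistics in unsupported format %',
                        fmt_version;
          RETURN;
      END IF;
      -- a known format would be decoded here and applied to pg_statistic,
      -- remapping columns if the server's own format is newer
  END;
  $$;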