Thread: pg_upgrade project: high-level design proposal of in-place upgrade facility

pg_upgrade project: high-level design proposal of in-place upgrade facility

From

"Serguei A. Mokhov"

Date:

30 September 2004, 23:49:23

Hello dear all,

[Please CC your replies to me as I am on the digest mode]

Here's finally a very high-level design proposal of the pg_upgrade feature
I was handwaiving a couple of weeks ago. Since, I am almost done with the
moving, I can allocate some time for this for 8.1/8.2.

If this topic is of interest to you, please read on until the very end
before flaming or bashing the ideas out.  I had designed that thing and
kept updating (the design) more or less regularly, and also reflected some
issues from the nearby threads [1] and [2].

This design is very high-level at the moment and is not very detailed.  I
will need to figure out more stuff as I go and design some aspects in
finer detail.  I started to poke around asking for initdb-forcing code
paths in [3], but got no response so far.  But I guess if the general idea
or, rather, ideas accepted I will insist on more information more
aggressively :) if I can't figure something out for myself.

[1] http://archives.postgresql.org/pgsql-hackers/2004-09/msg00000.php
[2] http://archives.postgresql.org/pgsql-hackers/2004-09/msg00382.php
[3] http://archives.postgresql.org/pgsql-hackers/2004-08/msg01594.php

Comments are very welcome, especially _*CONSTRUCTIVE*_...

Thank you, and now sit back and read...

CONTENTS:
=========

1. The Need
1. Utilities and User's View of the pg_upgrade Feature
2. Storage Management  - Storage Managers and the smgr API
3. Source Code Maintenance Aspects
2. The Upgrade Sequence
4. Proposed Implementation Plan  - initdb() API  - upgrade API


1. The Need
-----------

It's been a problem for PG for quite awhile now to have a less painful
upgrade procedure with every new revision of PostgreSQL, so the
dump/restore sequence is required.  That can take a while for a production
DB, while keeping it offline.  The new replication-related solutions, such
as Slony I, pg_pool, and others can remedy the problem somewhat, but
require to roughly double the storage requirements of a given database
while replicating from the older server to a newer one.

The proposed implementation of an in-server pg_upgrade facility attempts
to address both issues at the same time -- a possibility to keep the
server running and upgrading lazily w/o doubling the storage requirements
(there will be some extra disk space taken, but far from doubling the
size).  The in-process upgrade will not take much of down time and won't
require that much memory/disk/network resources as replication solutions
do.


Prerequisites
-------------

Ideally, the (maybe not so anymore) ambitious goal is to simply be able to
"drop in" the new binaries of the new server and kick off on the older
version of data files. I think is this feasible now a lot more than before
since we have those things available, which should ease up the
implementation:
 - bgwriter - pg_autovacuum (the one to be integrated into the backend in 8.1) - smgr API for pluggable storage
managers- initdb in C - ...
 

initdb in C, bgwriter and pg_autovacuum, and pluggable storage manager
have made the possibility of creation of the Upgrade Subsystem for
PostgreSQL to be something more reasonable, complete, feasible, and sane
to a point.


Utilities and the User's (DBA) View of the Feature
--------------------------------------------------

Two instances exist:
  pg_upgrade (in C)
      A standalone utility to upgrade the binary on-disk format from one      version to another when the database is
offline.     We should always have this as an option.      pg_upgrade will accept sub/super set of
pg_dump(all)/pg_restore     options that do not require a connection. I haven't      thought through this in detail
yet.
  pg_autoupgrade
      a postgres subprocess, modeled after bgwriter and pg_autovacuum      daemons.  This will work when the database
systemis running      on old data directory, and lazily converting relations to the new      format.
 

pg_autoupgrade daemon can be triggered by the following events in addition
to the lazy upgrade process:
  "SQL" level: UPGRADE <ALL | relation_name [, relation_name]> [NOW | time]

While the database won't be offline running over older database files,
SELECT/read-only queries would be allowed using older storage managers*.
Any write operation on old data will act using write-invalidate approach
that will force the upgrade the affected relations to the new format to be
scheduled after the relation-in-progress.

(* See the "Storage Management" section.)

Availability of the relations while upgrade is in progress is likely to be
the same as in VACUUM FULL for that relation, i.e. the entire relation is
locked until the upgrade is complete.  Maybe we could optimize that by
locking only particular pages of relations, I have to figure that out.

The upgrade of indices can be done using REINDEX, which seems far less
complicated than trying to convert its on-disk representation.  This has
to be done after the relation is converted.  Alternatively, the index
upgrade can simply be done by "CREATE INDEX" after the upgrade of
relations.

The relations to be upgraded are ordered according to some priority, e.g.
system relations being first, then user-owned relations.  System relations
upgrade is forced upon the postmaster startup, and then user relations are
processed lazily.

So, in a sense, pg_autoupgrade will act like a proxy choosing appropriate
storage manager (like a driver) between the new server and the old data
file upgrading them on-demand.  For that purpose we might need to add a
pg_upgradeproxy to intercept backend requests and use appropriate storage
manager.  There will be one proxy process per backend.


Storage Management
==================

Somebody has made a possibility to plug a different storage manager in
postgres and we even had two of them at some point . for the magnetic disk
and the main memory.  The main memory one is gone, but the smgr API is
still there.  Some were dubious why we would ever need another third-party
storage manager, but here I propose to "plug in" storage managers from the
older Postgres versions itself!  Here is where the pluggable storage
manager API would be handy once fully resurrected.  Instead of trying to
plug some third party storage managers it will primarily be used by the
storage managers of different versions of Postgres.

We can take the storage manager code from the past maintenance releases,
namely 6.5.3, 7.0.3, 7.1.3, 7.2.5, 7.3.7, 7.4.5, and 8.0, and arrange them
in appropriate fashion and have them implement the API properly.  Anyone
can contribute a storage manager as they see fit, there's no need to get
them all at once.  As a trial implementation I will try to do the last
three or four maybe.


Where to put relations being upgraded?
--------------------------------------

At the beginning of the upgrade process if pg detects the old version of
data files, it moves them under $PGDATA/<ver>, and keeps the old relations
there until upgraded.  The relations to be upgraded will be kept in the
pg_upgrade_catalog.  Once all relations upgraded, the <ver> directory is
removed and the auto and proxy processes are shut down.  The contents of
the pg_upgrade_catalog emptied.  The only issue remains is how to deal
with tablespaces (or LOCATION in 7.* releases) elsewhere .- this can
probably be addressed in the similar fashion, but having a
/my/tablespace/<ver> directory.

Source Code Maintenance
=======================

Now, after the above some of you may get scared on the amount of similar
code to possibly maintain in all those storage managers, but in reality
they would require as much maintenance as the corresponding releases do
get back-patched in that code area, and some are not being maintained for
quite some time already.  Plus, I should be around to maintain it, should
this become realized.

Release-time Maintenance
------------------------

For maintenance of pg_upgrade itself, one will have to fork out a new
storage manager from the previous stable release and "register" it within
the system.  Alternatively, the new storage manager can be forked when the
new release cycle begins.  Additionally, a pg_upgrade version has to be
added implementing the API steps outlined in the pg_upgrade API section.


Implementation Steps
====================

To materialize the above idea, I'd proceed as follows:

*) Provide the initdb() API (quick)

*) Resurrect the pluggable storage manager API to be usable for the  purpose.

*) Document it

*) Implement pg_upgrade API for 8.0 and 7.4.5.

*) Extract 8.0 and 7.4.5 storage managers and have them implement the API  as a proof of concept.  Massage the API as
needed.

*) Document the process of adding new storage managers and pg_upgrade  drivers.

*) Extract other versions storage managers.


pg_upgrade sequence
-------------------

pg_upgrade API for the steps below to update for the next release.

What to do with WAL?? Maybe upgrade can simply be done using WAL replay
with old WAL manager? Not, fully, because not everything is in WAL, but
some WAL recovery maybe needed in case the server was not shutdown cleanly
before the upgrade.

pg_upgrade will proceed as follows:

- move PGDATA to PGDATA/<major pg version>
- move tablespaces likewise
- optional recovery from WAL in case old server was not shutdown properly
-? Shall I upgrade PITR logs of 8.x??? So one can recover to a point-in-time in the upgraded database?
- CLUSTER all old data
- ANALYZE all old data
- initdb() new system catalogs
- Merge in modifications from old system catalogs
- upgrade schemas/users -- variations
- upgrade user relations

Upgrade API:
------------

First draft, to be refined multiple times, but to convey the ideas behind:

moveData() movePGData() moveTablespaces() 8.0+ moveDbLocation() < 8.0

preliminaryRecovery()- WAL??- PITR 8.0+??

preliminaryCleanup() CLUSTER -- recover some dead space ANALYZE -- gives us stats

upgradeSystemInfo() initdb() mergeOldCatalogs() mergeOldTemplates()

upgradeUsers()

upgradeSchemas() - > 7.2, else NULL

upgradeUserRelations() upgradeIndices()   DROP/CREATE

upgradeInit()
{

}

The main body in pseudocode:

upgradeLoop()
{   moveData();   preliminaryRecovery();   preliminaryCleanup();   upgradeSystemInfo();   upgradeUsers();
upgradeSchemas();  upgradeUserRelations();
 
}

Something along these lines the API would be:

typedef struct t_upgrade
{bool        (*moveData) (void);bool        (*preliminaryRecovery) (void);        /* may be NULL */bool
(*preliminaryCleanup)(void);        /* may be NULL */bool        (*upgradeSystemInfo) (void);        /* may be NULL
*/bool       (*upgradeUsers) (void);        /* may be NULL */bool        (*upgradeSchemas) (void);        /* may be
NULL*/bool        (*upgradeUserRelations) (void);        /* may be NULL */
 
} t_upgrade;


The above sequence is executed by either pg_upgrade utility uninterrupted
or by the pg_autoupgrade daemon. In the former the upgrade priority is
simply by OID, in the latter also, but can be overridden by the user using
the UPGRADE command to schedule relations upgrade, write operation can
also change such schedule, with user's selected choice to be first.  The
more write requests a relation receives while in the upgrade queue, its
priority increases; thus, the relation with most hits is on top. In case
of tie, OID is the decision mark.

Some issues to look into:

- catalog merger
- a crash in the middle of upgrade
- PITR logs for 8.x+
- ...

Flames and Handwaiving
----------------------

Okay, flame is on, but before you flame, mind you, this is a very initial
version of the design.  Some of the ideas may seem far fetched, the
contents may seem messy, but I believe it's now more doable than ever and
I am willing to put effort in it for the next release or two and then
maintain it afterwards.  It's not going to be done in one shot maybe, but
incrementally, using input, feedback, and hints from you, guys.

Thank you for reading till this far :-) I.d like to hear from you if any
of this made sense to you.

Truly yours,

-- 
Serguei A. Mokhov            |  /~\    The ASCII
Computer Science Department  |  \ / Ribbon Campaign
Concordia University         |   X    Against HTML
Montreal, Quebec, Canada     |  / \      Email!

Re: pg_upgrade project: high-level design proposal of in-place upgrade facility

From

Tom Lane

Date:

01 October 2004, 00:37:44

"Serguei A. Mokhov" <mokhov@cs.concordia.ca> writes:
> Comments are very welcome, especially _*CONSTRUCTIVE*_...

This is fundamentally wrong, because you are assigning the storage
manager functionality that it does not have.  In particular, the
storage manager knows nothing of the contents or format of the files
it is managing, and so you can't realistically expect to use the smgr
switch as a way to support access to tables with different internal
formats.  The places that change in interesting ways across versions
are usually far above the smgr switch.

I don't believe in the idea of incremental "lazy" upgrades very much
either.  It certainly will not work on the system catalogs --- you have
to convert those in a big-bang fashion, because how are you going to
find the other stuff otherwise?  And the real problem with it IMHO is
that if something goes wrong partway through the process, you're in
deep trouble because you have no way to back out.  You can't just revert
to the old version because it won't understand your data, and your old
backups that are compatible with it are now out of date.  If there are
going to be any problems, you really need to find out about them
immediately while your old backups are still current, not in a "lazy"
fashion.

The design we'd pretty much all bought into six months ago involved
being able to do in-place upgrades when the format/contents of user
relations and indexes is not changing.  All you'd have to do is dump
and restore the schema data (system catalogs) which is a reasonably
short process even on a large DB, so the big-bang nature of the
conversion isn't a problem.  Admittedly this will not work for every
single upgrade, but we had agreed that we could schedule upgrades
so that the majority of releases do not change user data.  Historically
that's been mostly true anyway, even without any deliberate attempt to
group user-data-changing features together.

I think the last major discussion about it started here:
http://archives.postgresql.org/pgsql-hackers/2003-12/msg00379.php
(I got distracted by other stuff and never did the promised work,
but I still think the approach is sound.)
        regards, tom lane

Re: pg_upgrade project: high-level design proposal of

From

"Serguei A. Mokhov"

Date:

01 October 2004, 01:42:29

On Thu, 30 Sep 2004, Tom Lane wrote:

> Date: Thu, 30 Sep 2004 16:36:02 -0400
>
> "Serguei A. Mokhov" <mokhov@cs.concordia.ca> writes:
> > Comments are very welcome, especially _*CONSTRUCTIVE*_...
>
> This is fundamentally wrong, because you are assigning the storage
> manager functionality that it does not have.  In particular, the

Maybe, that's why I was asking of all init-db forcing paths, so I can go
level above smgr to upgrade stuff, let say in access/ and other parts. I
did ask that before and never got a reply. So the concept of "Storage
Managers" may and will go well beyond the smgt API. That's the design
refinement stage.

> I don't believe in the idea of incremental "lazy" upgrades very much
> either.  It certainly will not work on the system catalogs --- you have
> to convert those in a big-bang fashion,

I never proposed to do that to system catalogs, on the contrary, I said
the system catalogs are to be upgraded upon the postmaster startup. only
user relations are upgraded lazily:

> The relations to be upgraded are ordered according to some priority,
> e.g.  system relations being first, then user-owned relations.  System
> relations upgrade is forced upon the postmaster startup, and then user
> relations are processed lazily.

So looks like we don't disagree here :)

> The design we'd pretty much all bought into six months ago involved
> being able to do in-place upgrades when the format/contents of user
> relations and indexes is not changing.  All you'd have to do is dump and
> restore the schema data (system catalogs) which is a reasonably short
> process even on a large DB, so the big-bang nature of the conversion
> isn't a problem.  Admittedly this will not work for every single
> upgrade, but we had agreed that we could schedule upgrades so that the
> majority of releases do not change user data.  Historically that's been
> mostly true anyway, even without any deliberate attempt to group
> user-data-changing features together.

Annoyingly enough, they still do change.

> I think the last major discussion about it started here:
> http://archives.postgresql.org/pgsql-hackers/2003-12/msg00379.php
> (I got distracted by other stuff and never did the promised work,
> but I still think the approach is sound.)

I'll go over that discussion and maybe will combine useful ideas together.
I'll open a pgfoundry project to develop it there and then will submit for
evaluation UNLESS you reserved it for yourself, Tom, to fullfill the
promise... If anybody objects the pgfoundry idea to test the concepts,
I'll apply for a project there.

Thank you for the comments!

>             regards, tom lane
>

-- 
Serguei A. Mokhov            |  /~\    The ASCII
Computer Science Department  |  \ / Ribbon Campaign
Concordia University         |   X    Against HTML
Montreal, Quebec, Canada     |  / \      Email!

Re: pg_upgrade project: high-level design proposal of in-place

From

Bruce Momjian

Date:

04 June 2005, 23:33:44

I have added a link to this thread on the TODO list under TODO.detail.

---------------------------------------------------------------------------

Serguei A. Mokhov wrote:
> Hello dear all,
> 
> [Please CC your replies to me as I am on the digest mode]
> 
> Here's finally a very high-level design proposal of the pg_upgrade feature
> I was handwaiving a couple of weeks ago. Since, I am almost done with the
> moving, I can allocate some time for this for 8.1/8.2.
> 
> If this topic is of interest to you, please read on until the very end
> before flaming or bashing the ideas out.  I had designed that thing and
> kept updating (the design) more or less regularly, and also reflected some
> issues from the nearby threads [1] and [2].
> 
> This design is very high-level at the moment and is not very detailed.  I
> will need to figure out more stuff as I go and design some aspects in
> finer detail.  I started to poke around asking for initdb-forcing code
> paths in [3], but got no response so far.  But I guess if the general idea
> or, rather, ideas accepted I will insist on more information more
> aggressively :) if I can't figure something out for myself.
> 
> [1] http://archives.postgresql.org/pgsql-hackers/2004-09/msg00000.php
> [2] http://archives.postgresql.org/pgsql-hackers/2004-09/msg00382.php
> [3] http://archives.postgresql.org/pgsql-hackers/2004-08/msg01594.php
> 
> Comments are very welcome, especially _*CONSTRUCTIVE*_...
> 
> Thank you, and now sit back and read...
> 
> CONTENTS:
> =========
> 
> 1. The Need
> 1. Utilities and User's View of the pg_upgrade Feature
> 2. Storage Management
>    - Storage Managers and the smgr API
> 3. Source Code Maintenance Aspects
> 2. The Upgrade Sequence
> 4. Proposed Implementation Plan
>    - initdb() API
>    - upgrade API
> 
> 
> 1. The Need
> -----------
> 
> It's been a problem for PG for quite awhile now to have a less painful
> upgrade procedure with every new revision of PostgreSQL, so the
> dump/restore sequence is required.  That can take a while for a production
> DB, while keeping it offline.  The new replication-related solutions, such
> as Slony I, pg_pool, and others can remedy the problem somewhat, but
> require to roughly double the storage requirements of a given database
> while replicating from the older server to a newer one.
> 
> The proposed implementation of an in-server pg_upgrade facility attempts
> to address both issues at the same time -- a possibility to keep the
> server running and upgrading lazily w/o doubling the storage requirements
> (there will be some extra disk space taken, but far from doubling the
> size).  The in-process upgrade will not take much of down time and won't
> require that much memory/disk/network resources as replication solutions
> do.
> 
> 
> Prerequisites
> -------------
> 
> Ideally, the (maybe not so anymore) ambitious goal is to simply be able to
> "drop in" the new binaries of the new server and kick off on the older
> version of data files. I think is this feasible now a lot more than before
> since we have those things available, which should ease up the
> implementation:
> 
>   - bgwriter
>   - pg_autovacuum (the one to be integrated into the backend in 8.1)
>   - smgr API for pluggable storage managers
>   - initdb in C
>   - ...
> 
> initdb in C, bgwriter and pg_autovacuum, and pluggable storage manager
> have made the possibility of creation of the Upgrade Subsystem for
> PostgreSQL to be something more reasonable, complete, feasible, and sane
> to a point.
> 
> 
> Utilities and the User's (DBA) View of the Feature
> --------------------------------------------------
> 
> Two instances exist:
> 
>    pg_upgrade (in C)
> 
>        A standalone utility to upgrade the binary on-disk format from one
>        version to another when the database is offline.
>        We should always have this as an option.
>        pg_upgrade will accept sub/super set of pg_dump(all)/pg_restore
>        options that do not require a connection. I haven't
>        thought through this in detail yet.
> 
>    pg_autoupgrade
> 
>        a postgres subprocess, modeled after bgwriter and pg_autovacuum
>        daemons.  This will work when the database system is running
>        on old data directory, and lazily converting relations to the new
>        format.
> 
> pg_autoupgrade daemon can be triggered by the following events in addition
> to the lazy upgrade process:
> 
>    "SQL" level: UPGRADE <ALL | relation_name [, relation_name]> [NOW | time]
> 
> While the database won't be offline running over older database files,
> SELECT/read-only queries would be allowed using older storage managers*.
> Any write operation on old data will act using write-invalidate approach
> that will force the upgrade the affected relations to the new format to be
> scheduled after the relation-in-progress.
> 
> (* See the "Storage Management" section.)
> 
> Availability of the relations while upgrade is in progress is likely to be
> the same as in VACUUM FULL for that relation, i.e. the entire relation is
> locked until the upgrade is complete.  Maybe we could optimize that by
> locking only particular pages of relations, I have to figure that out.
> 
> The upgrade of indices can be done using REINDEX, which seems far less
> complicated than trying to convert its on-disk representation.  This has
> to be done after the relation is converted.  Alternatively, the index
> upgrade can simply be done by "CREATE INDEX" after the upgrade of
> relations.
> 
> The relations to be upgraded are ordered according to some priority, e.g.
> system relations being first, then user-owned relations.  System relations
> upgrade is forced upon the postmaster startup, and then user relations are
> processed lazily.
> 
> So, in a sense, pg_autoupgrade will act like a proxy choosing appropriate
> storage manager (like a driver) between the new server and the old data
> file upgrading them on-demand.  For that purpose we might need to add a
> pg_upgradeproxy to intercept backend requests and use appropriate storage
> manager.  There will be one proxy process per backend.
> 
> 
> Storage Management
> ==================
> 
> Somebody has made a possibility to plug a different storage manager in
> postgres and we even had two of them at some point . for the magnetic disk
> and the main memory.  The main memory one is gone, but the smgr API is
> still there.  Some were dubious why we would ever need another third-party
> storage manager, but here I propose to "plug in" storage managers from the
> older Postgres versions itself!  Here is where the pluggable storage
> manager API would be handy once fully resurrected.  Instead of trying to
> plug some third party storage managers it will primarily be used by the
> storage managers of different versions of Postgres.
> 
> We can take the storage manager code from the past maintenance releases,
> namely 6.5.3, 7.0.3, 7.1.3, 7.2.5, 7.3.7, 7.4.5, and 8.0, and arrange them
> in appropriate fashion and have them implement the API properly.  Anyone
> can contribute a storage manager as they see fit, there's no need to get
> them all at once.  As a trial implementation I will try to do the last
> three or four maybe.
> 
> 
> Where to put relations being upgraded?
> --------------------------------------
> 
> At the beginning of the upgrade process if pg detects the old version of
> data files, it moves them under $PGDATA/<ver>, and keeps the old relations
> there until upgraded.  The relations to be upgraded will be kept in the
> pg_upgrade_catalog.  Once all relations upgraded, the <ver> directory is
> removed and the auto and proxy processes are shut down.  The contents of
> the pg_upgrade_catalog emptied.  The only issue remains is how to deal
> with tablespaces (or LOCATION in 7.* releases) elsewhere .- this can
> probably be addressed in the similar fashion, but having a
> /my/tablespace/<ver> directory.
> 
> Source Code Maintenance
> =======================
> 
> Now, after the above some of you may get scared on the amount of similar
> code to possibly maintain in all those storage managers, but in reality
> they would require as much maintenance as the corresponding releases do
> get back-patched in that code area, and some are not being maintained for
> quite some time already.  Plus, I should be around to maintain it, should
> this become realized.
> 
> Release-time Maintenance
> ------------------------
> 
> For maintenance of pg_upgrade itself, one will have to fork out a new
> storage manager from the previous stable release and "register" it within
> the system.  Alternatively, the new storage manager can be forked when the
> new release cycle begins.  Additionally, a pg_upgrade version has to be
> added implementing the API steps outlined in the pg_upgrade API section.
> 
> 
> Implementation Steps
> ====================
> 
> To materialize the above idea, I'd proceed as follows:
> 
> *) Provide the initdb() API (quick)
> 
> *) Resurrect the pluggable storage manager API to be usable for the
>    purpose.
> 
> *) Document it
> 
> *) Implement pg_upgrade API for 8.0 and 7.4.5.
> 
> *) Extract 8.0 and 7.4.5 storage managers and have them implement the API
>    as a proof of concept.  Massage the API as needed.
> 
> *) Document the process of adding new storage managers and pg_upgrade
>    drivers.
> 
> *) Extract other versions storage managers.
> 
> 
> pg_upgrade sequence
> -------------------
> 
> pg_upgrade API for the steps below to update for the next release.
> 
> What to do with WAL?? Maybe upgrade can simply be done using WAL replay
> with old WAL manager? Not, fully, because not everything is in WAL, but
> some WAL recovery maybe needed in case the server was not shutdown cleanly
> before the upgrade.
> 
> pg_upgrade will proceed as follows:
> 
> - move PGDATA to PGDATA/<major pg version>
> - move tablespaces likewise
> - optional recovery from WAL in case old server was not shutdown properly
> -? Shall I upgrade PITR logs of 8.x??? So one can recover to a
>   point-in-time in the upgraded database?
> - CLUSTER all old data
> - ANALYZE all old data
> - initdb() new system catalogs
> - Merge in modifications from old system catalogs
> - upgrade schemas/users
>   -- variations
> - upgrade user relations
> 
> Upgrade API:
> ------------
> 
> First draft, to be refined multiple times, but to convey the ideas behind:
> 
> moveData()
>   movePGData()
>   moveTablespaces() 8.0+
>   moveDbLocation() < 8.0
> 
> preliminaryRecovery()
>  - WAL??
>  - PITR 8.0+??
> 
> preliminaryCleanup()
>   CLUSTER -- recover some dead space
>   ANALYZE -- gives us stats
> 
> upgradeSystemInfo()
>   initdb()
>   mergeOldCatalogs()
>   mergeOldTemplates()
> 
> upgradeUsers()
> 
> upgradeSchemas()
>   - > 7.2, else NULL
> 
> upgradeUserRelations()
>   upgradeIndices()
>     DROP/CREATE
> 
> upgradeInit()
> {
> 
> }
> 
> The main body in pseudocode:
> 
> upgradeLoop()
> {
>     moveData();
>     preliminaryRecovery();
>     preliminaryCleanup();
>     upgradeSystemInfo();
>     upgradeUsers();
>     upgradeSchemas();
>     upgradeUserRelations();
> }
> 
> Something along these lines the API would be:
> 
> typedef struct t_upgrade
> {
>     bool        (*moveData) (void);
>     bool        (*preliminaryRecovery) (void);        /* may be NULL */
>     bool        (*preliminaryCleanup) (void);        /* may be NULL */
>     bool        (*upgradeSystemInfo) (void);        /* may be NULL */
>     bool        (*upgradeUsers) (void);        /* may be NULL */
>     bool        (*upgradeSchemas) (void);        /* may be NULL */
>     bool        (*upgradeUserRelations) (void);        /* may be NULL */
> } t_upgrade;
> 
> 
> The above sequence is executed by either pg_upgrade utility uninterrupted
> or by the pg_autoupgrade daemon. In the former the upgrade priority is
> simply by OID, in the latter also, but can be overridden by the user using
> the UPGRADE command to schedule relations upgrade, write operation can
> also change such schedule, with user's selected choice to be first.  The
> more write requests a relation receives while in the upgrade queue, its
> priority increases; thus, the relation with most hits is on top. In case
> of tie, OID is the decision mark.
> 
> Some issues to look into:
> 
> - catalog merger
> - a crash in the middle of upgrade
> - PITR logs for 8.x+
> - ...
> 
> Flames and Handwaiving
> ----------------------
> 
> Okay, flame is on, but before you flame, mind you, this is a very initial
> version of the design.  Some of the ideas may seem far fetched, the
> contents may seem messy, but I believe it's now more doable than ever and
> I am willing to put effort in it for the next release or two and then
> maintain it afterwards.  It's not going to be done in one shot maybe, but
> incrementally, using input, feedback, and hints from you, guys.
> 
> Thank you for reading till this far :-) I.d like to hear from you if any
> of this made sense to you.
> 
> Truly yours,
> 
> -- 
> Serguei A. Mokhov            |  /~\    The ASCII
> Computer Science Department  |  \ / Ribbon Campaign
> Concordia University         |   X    Against HTML
> Montreal, Quebec, Canada     |  / \      Email!
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
> 
>                http://www.postgresql.org/docs/faqs/FAQ.html
> 

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073

Re: pg_upgrade project: high-level design proposal of

From

"Serguei A. Mokhov"

Date:

04 June 2005, 23:42:25

On Sat, 4 Jun 2005, Bruce Momjian wrote:

> Date: Sat, 4 Jun 2005 19:33:40 -0400 (EDT)
>
> I have added a link to this thread on the TODO list under TODO.detail.

Thanks Bruce (and Josh Berkus) for reminding me about this. I should be
able to follow this up in early July.

-s

> Serguei A. Mokhov wrote:
> > Hello dear all,
> >
> > [Please CC your replies to me as I am on the digest mode]
> >
> > Here's finally a very high-level design proposal of the pg_upgrade feature
> > I was handwaiving a couple of weeks ago. Since, I am almost done with the
> > moving, I can allocate some time for this for 8.1/8.2.
> >
> > If this topic is of interest to you, please read on until the very end
> > before flaming or bashing the ideas out.  I had designed that thing and
> > kept updating (the design) more or less regularly, and also reflected some
> > issues from the nearby threads [1] and [2].
> >
> > This design is very high-level at the moment and is not very detailed.  I
> > will need to figure out more stuff as I go and design some aspects in
> > finer detail.  I started to poke around asking for initdb-forcing code
> > paths in [3], but got no response so far.  But I guess if the general idea
> > or, rather, ideas accepted I will insist on more information more
> > aggressively :) if I can't figure something out for myself.
> >
> > [1] http://archives.postgresql.org/pgsql-hackers/2004-09/msg00000.php
> > [2] http://archives.postgresql.org/pgsql-hackers/2004-09/msg00382.php
> > [3] http://archives.postgresql.org/pgsql-hackers/2004-08/msg01594.php
> >
> > Comments are very welcome, especially _*CONSTRUCTIVE*_...
> >
> > Thank you, and now sit back and read...
> >
> > CONTENTS:
> > =========
> >
> > 1. The Need
> > 1. Utilities and User's View of the pg_upgrade Feature
> > 2. Storage Management
> >    - Storage Managers and the smgr API
> > 3. Source Code Maintenance Aspects
> > 2. The Upgrade Sequence
> > 4. Proposed Implementation Plan
> >    - initdb() API
> >    - upgrade API
> >
> >
> > 1. The Need
> > -----------
> >
> > It's been a problem for PG for quite awhile now to have a less painful
> > upgrade procedure with every new revision of PostgreSQL, so the
> > dump/restore sequence is required.  That can take a while for a production
> > DB, while keeping it offline.  The new replication-related solutions, such
> > as Slony I, pg_pool, and others can remedy the problem somewhat, but
> > require to roughly double the storage requirements of a given database
> > while replicating from the older server to a newer one.
> >
> > The proposed implementation of an in-server pg_upgrade facility attempts
> > to address both issues at the same time -- a possibility to keep the
> > server running and upgrading lazily w/o doubling the storage requirements
> > (there will be some extra disk space taken, but far from doubling the
> > size).  The in-process upgrade will not take much of down time and won't
> > require that much memory/disk/network resources as replication solutions
> > do.
> >
> >
> > Prerequisites
> > -------------
> >
> > Ideally, the (maybe not so anymore) ambitious goal is to simply be able to
> > "drop in" the new binaries of the new server and kick off on the older
> > version of data files. I think is this feasible now a lot more than before
> > since we have those things available, which should ease up the
> > implementation:
> >
> >   - bgwriter
> >   - pg_autovacuum (the one to be integrated into the backend in 8.1)
> >   - smgr API for pluggable storage managers
> >   - initdb in C
> >   - ...
> >
> > initdb in C, bgwriter and pg_autovacuum, and pluggable storage manager
> > have made the possibility of creation of the Upgrade Subsystem for
> > PostgreSQL to be something more reasonable, complete, feasible, and sane
> > to a point.
> >
> >
> > Utilities and the User's (DBA) View of the Feature
> > --------------------------------------------------
> >
> > Two instances exist:
> >
> >    pg_upgrade (in C)
> >
> >        A standalone utility to upgrade the binary on-disk format from one
> >        version to another when the database is offline.
> >        We should always have this as an option.
> >        pg_upgrade will accept sub/super set of pg_dump(all)/pg_restore
> >        options that do not require a connection. I haven't
> >        thought through this in detail yet.
> >
> >    pg_autoupgrade
> >
> >        a postgres subprocess, modeled after bgwriter and pg_autovacuum
> >        daemons.  This will work when the database system is running
> >        on old data directory, and lazily converting relations to the new
> >        format.
> >
> > pg_autoupgrade daemon can be triggered by the following events in addition
> > to the lazy upgrade process:
> >
> >    "SQL" level: UPGRADE <ALL | relation_name [, relation_name]> [NOW | time]
> >
> > While the database won't be offline running over older database files,
> > SELECT/read-only queries would be allowed using older storage managers*.
> > Any write operation on old data will act using write-invalidate approach
> > that will force the upgrade the affected relations to the new format to be
> > scheduled after the relation-in-progress.
> >
> > (* See the "Storage Management" section.)
> >
> > Availability of the relations while upgrade is in progress is likely to be
> > the same as in VACUUM FULL for that relation, i.e. the entire relation is
> > locked until the upgrade is complete.  Maybe we could optimize that by
> > locking only particular pages of relations, I have to figure that out.
> >
> > The upgrade of indices can be done using REINDEX, which seems far less
> > complicated than trying to convert its on-disk representation.  This has
> > to be done after the relation is converted.  Alternatively, the index
> > upgrade can simply be done by "CREATE INDEX" after the upgrade of
> > relations.
> >
> > The relations to be upgraded are ordered according to some priority, e.g.
> > system relations being first, then user-owned relations.  System relations
> > upgrade is forced upon the postmaster startup, and then user relations are
> > processed lazily.
> >
> > So, in a sense, pg_autoupgrade will act like a proxy choosing appropriate
> > storage manager (like a driver) between the new server and the old data
> > file upgrading them on-demand.  For that purpose we might need to add a
> > pg_upgradeproxy to intercept backend requests and use appropriate storage
> > manager.  There will be one proxy process per backend.
> >
> >
> > Storage Management
> > ==================
> >
> > Somebody has made a possibility to plug a different storage manager in
> > postgres and we even had two of them at some point . for the magnetic disk
> > and the main memory.  The main memory one is gone, but the smgr API is
> > still there.  Some were dubious why we would ever need another third-party
> > storage manager, but here I propose to "plug in" storage managers from the
> > older Postgres versions itself!  Here is where the pluggable storage
> > manager API would be handy once fully resurrected.  Instead of trying to
> > plug some third party storage managers it will primarily be used by the
> > storage managers of different versions of Postgres.
> >
> > We can take the storage manager code from the past maintenance releases,
> > namely 6.5.3, 7.0.3, 7.1.3, 7.2.5, 7.3.7, 7.4.5, and 8.0, and arrange them
> > in appropriate fashion and have them implement the API properly.  Anyone
> > can contribute a storage manager as they see fit, there's no need to get
> > them all at once.  As a trial implementation I will try to do the last
> > three or four maybe.
> >
> >
> > Where to put relations being upgraded?
> > --------------------------------------
> >
> > At the beginning of the upgrade process if pg detects the old version of
> > data files, it moves them under $PGDATA/<ver>, and keeps the old relations
> > there until upgraded.  The relations to be upgraded will be kept in the
> > pg_upgrade_catalog.  Once all relations upgraded, the <ver> directory is
> > removed and the auto and proxy processes are shut down.  The contents of
> > the pg_upgrade_catalog emptied.  The only issue remains is how to deal
> > with tablespaces (or LOCATION in 7.* releases) elsewhere .- this can
> > probably be addressed in the similar fashion, but having a
> > /my/tablespace/<ver> directory.
> >
> > Source Code Maintenance
> > =======================
> >
> > Now, after the above some of you may get scared on the amount of similar
> > code to possibly maintain in all those storage managers, but in reality
> > they would require as much maintenance as the corresponding releases do
> > get back-patched in that code area, and some are not being maintained for
> > quite some time already.  Plus, I should be around to maintain it, should
> > this become realized.
> >
> > Release-time Maintenance
> > ------------------------
> >
> > For maintenance of pg_upgrade itself, one will have to fork out a new
> > storage manager from the previous stable release and "register" it within
> > the system.  Alternatively, the new storage manager can be forked when the
> > new release cycle begins.  Additionally, a pg_upgrade version has to be
> > added implementing the API steps outlined in the pg_upgrade API section.
> >
> >
> > Implementation Steps
> > ====================
> >
> > To materialize the above idea, I'd proceed as follows:
> >
> > *) Provide the initdb() API (quick)
> >
> > *) Resurrect the pluggable storage manager API to be usable for the
> >    purpose.
> >
> > *) Document it
> >
> > *) Implement pg_upgrade API for 8.0 and 7.4.5.
> >
> > *) Extract 8.0 and 7.4.5 storage managers and have them implement the API
> >    as a proof of concept.  Massage the API as needed.
> >
> > *) Document the process of adding new storage managers and pg_upgrade
> >    drivers.
> >
> > *) Extract other versions storage managers.
> >
> >
> > pg_upgrade sequence
> > -------------------
> >
> > pg_upgrade API for the steps below to update for the next release.
> >
> > What to do with WAL?? Maybe upgrade can simply be done using WAL replay
> > with old WAL manager? Not, fully, because not everything is in WAL, but
> > some WAL recovery maybe needed in case the server was not shutdown cleanly
> > before the upgrade.
> >
> > pg_upgrade will proceed as follows:
> >
> > - move PGDATA to PGDATA/<major pg version>
> > - move tablespaces likewise
> > - optional recovery from WAL in case old server was not shutdown properly
> > -? Shall I upgrade PITR logs of 8.x??? So one can recover to a
> >   point-in-time in the upgraded database?
> > - CLUSTER all old data
> > - ANALYZE all old data
> > - initdb() new system catalogs
> > - Merge in modifications from old system catalogs
> > - upgrade schemas/users
> >   -- variations
> > - upgrade user relations
> >
> > Upgrade API:
> > ------------
> >
> > First draft, to be refined multiple times, but to convey the ideas behind:
> >
> > moveData()
> >   movePGData()
> >   moveTablespaces() 8.0+
> >   moveDbLocation() < 8.0
> >
> > preliminaryRecovery()
> >  - WAL??
> >  - PITR 8.0+??
> >
> > preliminaryCleanup()
> >   CLUSTER -- recover some dead space
> >   ANALYZE -- gives us stats
> >
> > upgradeSystemInfo()
> >   initdb()
> >   mergeOldCatalogs()
> >   mergeOldTemplates()
> >
> > upgradeUsers()
> >
> > upgradeSchemas()
> >   - > 7.2, else NULL
> >
> > upgradeUserRelations()
> >   upgradeIndices()
> >     DROP/CREATE
> >
> > upgradeInit()
> > {
> >
> > }
> >
> > The main body in pseudocode:
> >
> > upgradeLoop()
> > {
> >     moveData();
> >     preliminaryRecovery();
> >     preliminaryCleanup();
> >     upgradeSystemInfo();
> >     upgradeUsers();
> >     upgradeSchemas();
> >     upgradeUserRelations();
> > }
> >
> > Something along these lines the API would be:
> >
> > typedef struct t_upgrade
> > {
> >     bool        (*moveData) (void);
> >     bool        (*preliminaryRecovery) (void);        /* may be NULL */
> >     bool        (*preliminaryCleanup) (void);        /* may be NULL */
> >     bool        (*upgradeSystemInfo) (void);        /* may be NULL */
> >     bool        (*upgradeUsers) (void);        /* may be NULL */
> >     bool        (*upgradeSchemas) (void);        /* may be NULL */
> >     bool        (*upgradeUserRelations) (void);        /* may be NULL */
> > } t_upgrade;
> >
> >
> > The above sequence is executed by either pg_upgrade utility uninterrupted
> > or by the pg_autoupgrade daemon. In the former the upgrade priority is
> > simply by OID, in the latter also, but can be overridden by the user using
> > the UPGRADE command to schedule relations upgrade, write operation can
> > also change such schedule, with user's selected choice to be first.  The
> > more write requests a relation receives while in the upgrade queue, its
> > priority increases; thus, the relation with most hits is on top. In case
> > of tie, OID is the decision mark.
> >
> > Some issues to look into:
> >
> > - catalog merger
> > - a crash in the middle of upgrade
> > - PITR logs for 8.x+
> > - ...
> >
> > Flames and Handwaiving
> > ----------------------
> >
> > Okay, flame is on, but before you flame, mind you, this is a very initial
> > version of the design.  Some of the ideas may seem far fetched, the
> > contents may seem messy, but I believe it's now more doable than ever and
> > I am willing to put effort in it for the next release or two and then
> > maintain it afterwards.  It's not going to be done in one shot maybe, but
> > incrementally, using input, feedback, and hints from you, guys.
> >
> > Thank you for reading till this far :-) I.d like to hear from you if any
> > of this made sense to you.
> >
> > Truly yours,

-- 
Serguei A. Mokhov            |  /~\    The ASCII
Computer Science Department  |  \ / Ribbon Campaign
Concordia University         |   X    Against HTML
Montreal, Quebec, Canada     |  / \      Email!