Re: Schema version management - Mailing list pgsql-hackers
From | Merlin Moncure
Subject | Re: Schema version management
Date |
Msg-id | CAHyXU0xD82sRH+djGP9FcUxBFody_JDAqZUjdjLZ40nN=ED6Xw@mail.gmail.com
In response to | Re: Schema version management (Merlin Moncure <mmoncure@gmail.com>)
Responses | Re: Schema version management
List | pgsql-hackers
On Tue, Jul 10, 2012 at 5:24 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Sun, May 20, 2012 at 2:41 PM, Joel Jacobson <joel@trustly.com> wrote:
>> Hi,
>>
>> I just read a very interesting post about "schema version management".
>>
>> Quote: "You could set it up so that every developer gets their own
>> test database, sets up the schema there, takes a dump, and checks that
>> in. There are going to be problems with that, including that dumps
>> produced by pg_dump are ugly and optimized for restoring, not for
>> developing with, and they don't have a deterministic output order." (
>> http://petereisentraut.blogspot.com/2012/05/my-anti-take-on-database-schema-version.html
>> )
>>
>> Back in December 2010, I suggested a new option to pg_dump, --split,
>> which would write the schema definition of each object to a separate
>> file:
>>
>> http://archives.postgresql.org/pgsql-hackers/2010-12/msg02285.php
>>
>> Instead of a huge plain-text schema file, impossible to version
>> control, all tables/sequences/views/functions are written to separate
>> files, allowing the use of a version control system, such as git, to
>> do proper version control.
>>
>> The "deterministic output order" problem mentioned in the post above
>> is not a problem if each object (table/sequence/view/function/etc.) is
>> written to the same filename every time.
>> The tree of files and their content will be identical, no matter the
>> order in which the objects are dumped.
>>
>> I remember a lot of hackers were very positive about this option, but
>> we somehow failed to agree on the naming of files in the tree
>> structure. I'm sure we can work that out, though.
>>
>> I use this feature in production: I have a cron job which dumps the
>> schema every hour, committing any changes to a separate git branch for
>> each database installation, such as production, development and test.
>> If no changes to the schema have been made, nothing is committed to
>> git, since none of the files have changed.
>>
>> It is then drop-dead simple to diff two different branches of the
>> database schema, such as development and production, or to diff
>> different revisions, allowing point-in-time comparison of the schema.
>>
>> This is an example of the output of git log --summary for one of the
>> automatic commits to our production database's git repo:
>>
>> --
>> commit 18c31f8162d851b0dac3bad7e80529ef2ed18be3
>> Author: Production Database <production.database@trustly.com>
>> Date:   Fri May 4 15:00:04 2012 +0200
>>
>>     Update of database schema Linux DB0 2.6.26-2-amd64 #1 SMP Wed Aug
>>     19 22:33:18 UTC 2009 x86_64 GNU/Linux Fri, 04 May 2012 15:00:04 +0200
>>
>>  create mode 100644 gluepay-split/public/CONSTRAINT/openingclosingbalances_pkey.sql
>>  create mode 100644 gluepay-split/public/CONSTRAINT/openingclosingbalances_source_key.sql
>>  create mode 100644 gluepay-split/public/SEQUENCE/seqopeningclosingbalance.sql
>>  create mode 100644 gluepay-split/public/TABLE/openingclosingbalances.sql
>> --
>>
>> Here we can see that we apparently deployed a new table,
>> "openingclosingbalances", around Fri May 4 15:00:04.
>>
>> Without any manual work, I'm able to follow all changes actually
>> _deployed_ in each database.
>>
>> At my company, a highly database-centric, stored-procedure-intensive
>> business dealing with mission-critical monetary transactions, we've
>> been using this technique to do schema version management without any
>> hassle for the last two years.
>>
>> Hopefully this can add to the list of various possible _useful_ schema
>> version management methods.
>
> What does your patch do that you can't already do with pg_restore?
>
> create function foo(a int, b int, c text) returns int as $$ select 0;
> $$ language sql;
> CREATE FUNCTION
>
> pg_dump -Fc postgres -s > postgres.dump
> pg_restore -l postgres.dump | grep FUNCTION
> 196; 1255 32939 FUNCTION public foo(integer, integer, text) merlin
>
> pg_restore -P "foo(integer, integer, text)" postgres.dump
> <function body follows>
>
> It's fairly easy to wrap pg_restore with a small script that extracts
> function bodies and writes them out to file names. This is a great
> and underused feature, so I'd argue that if you wanted to formalize
> per-object file extraction you should be looking at expanding
> pg_restore, not pg_dump.

After extensive off-list discussion with Joel, it became clear that
per-object dumping ability really belongs in pg_restore. It already has
some machinery for that, and it has the nice property that you can pull
objects out of dumps long after the fact, not just at the time the dump
is taken. It then remains to be worked out whether pg_restore should be
given the ability to write directly to files, as Joel was gunning for, or
simply be extended to improve upon the current TOC-based facilities, or
both.

As noted, choosing a reversible, unambiguous filename based on the name
of the database object is nastiness of the first order. For example,
besides the issues already mentioned, some filesystems (Windows) use
case-insensitive entries. What do you do about that? Given that all the
reasonable escape characters are reserved or unusable for some reason,
pretty soon you arrive at the point of view that you need some sort of
proxy identifier in the filename to give uniqueness so that you can ditch
all the unpleasantness, just as is done with relfilenode (maybe using the
oid, I don't know). I've worked on the exact same problem as Joel, but in
my case I was able to leverage known naming conventions in the code; an
in-core solution should not do that, especially when it comes to
backup/restore.

So, just thinking out loud here, maybe the way forward is to try and
tackle stuff in order of controversy:

1) Add more options to pg_restore to dump other stuff besides tables and
functions.

2) Add smarter object selectors for extraction to pg_restore ('all
functions', 'objects in list <list>', 'all schema objects'), thereby
eliminating some of the necessary surrounding scripting, especially if
you can direct the output to a program which handles the writing out, for
example something along the lines of

  pg_restore --objects=ALL --handler="dumper.sh" schema.dump

where dumper.sh is a user-supplied program that consumes the output and
takes the object name, class, etc. as arguments. Just thinking out loud
here, but maybe that's cleaner than hacking specific filename conventions
directly into pg_restore -- this punts the absolute minimum to userland
(what the filename is and where the file is to be stored).

3) Figure out a way to dump those selected objects to filenames not
supplied by the user. It's tempting to relax #3 so that filenames are
only guaranteed unique within a particular extraction, but that would
foil SCM interaction, I think. Maybe #3 is not really solvable and a
hybrid solution needs to be worked out -- for example, hashing the object
signature for the filename and dumping a TOC file along with the
extraction:

__restore_toc.sql:
function foo(a int, b int)   -> ABCDEF.sql
function "Foo"(a int, b int) -> 132456.sql
etc.
Not very pleasant, but at least internally consistent and SCM-friendly.

merlin
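For concreteness, here is one way the small wrapper script mentioned above
could look -- a minimal sketch that relies only on the pg_restore -l / -P
behavior shown in the quoted example. The dump and output paths and the
crude underscore-based file naming are illustrative assumptions, not
anything pg_restore itself provides:

#!/bin/sh
# Sketch: extract every function from a custom-format dump into its own
# .sql file, using pg_restore -l (list the TOC) and -P (print one function
# selected by its signature).
set -eu

DUMP=${1:-schema.dump}     # created with: pg_dump -Fc -s dbname > schema.dump
OUTDIR=${2:-functions}
mkdir -p "$OUTDIR"

# TOC entries for functions look like:
#   196; 1255 32939 FUNCTION public foo(integer, integer, text) merlin
pg_restore -l "$DUMP" | grep -E '^[0-9]+; [0-9]+ [0-9]+ FUNCTION ' |
while read -r entry; do
    # keep only the signature, e.g. "foo(integer, integer, text)": drop the
    # numeric prefix, the FUNCTION keyword and the schema, then the owner
    sig=$(printf '%s\n' "$entry" |
          sed -e 's/^[0-9]*; [0-9]* [0-9]* FUNCTION [^ ]* //' -e 's/ [^ ]*$//')

    # crude, lossy filename: replace filesystem-unfriendly characters with "_"
    fname=$(printf '%s\n' "$sig" | tr ' (),"' '_____' | tr -s '_')

    pg_restore -P "$sig" "$DUMP" > "$OUTDIR/${fname}.sql"
done

Note that overloaded functions, quoted identifiers and names differing
only in case will collide under this naming, which is exactly the
filename-ambiguity problem discussed above and the motivation for the
hashed-filename-plus-TOC idea in point 3.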