Thread: WIP Patch: pg_dump structured

WIP Patch: pg_dump structured

From

Attila Soki

Date:

12 March 2023, 23:36:36

Hi all,

I was looking for a way to track actual schema changes after database migrations
in a VCS. Preferably, the schema definition should come from a trusted source
like pg_dump and should consist of small files.
This patch was born out of that need.

This patch adds the structured output format to pg_dump.
This format is a plaintext output split up into multiple files and the
resulting small files are stored in a directory path based on the dumped object.
This format can be restored by feeding its plaintext toc file (restore-dump.sql)
to psql. The output is also suitable for manipulating the files with standard
editing tools.

This patch is a WIP (V1). The patch is against master and it compiles
successfully on macOS 13.2.1 aarch64 and on Debian 11 arm64.
To test, execute pg_dump --format=structured --file=/path/to/outputdir dbname

What do you think of this feature, any chance it will be added to pg_dump once
the patch is ready?
Is the chosen name "structured" appropriate?

Thanks for any feedback.

--
Attila Soki

Attachment

v1-wip-pg_dump_structured.patch

Re: WIP Patch: pg_dump structured

From

Tom Lane

Date:

12 March 2023, 23:50:06

Attila Soki <pgsql@attilasoki.com> writes:
> This patch adds the structured output format to pg_dump.
> This format is a plaintext output split up into multiple files and the
> resulting small files are stored in a directory path based on the dumped object.

Won't this fail completely with SQL objects whose names aren't suitable
to be pathname components?  "A/B" is a perfectly good name so far as
SQL is concerned.  You could also have problems with collisions on
case-insensitive filesystems.

> This format can be restored by feeding its plaintext toc file (restore-dump.sql)
> to psql. The output is also suitable for manipulating the files with standard
> editing tools.

This seems a little contradictory: if you want to edit the individual
files, you'd have to also update restore-dump.sql, or else it's pointless.
It might make more sense to consider this as a write-only dump format
and not worry about whether it can be restored directly.

> What do you think of this feature, any chance it will be added to pg_dump once
> the patch is ready?

I'm not clear on how big the use-case is.  It's not really obvious to
me that this'd have any benefit over the existing plain-text dump
capability.  You can edit those files too, at least till the schema
gets too big for your editor.  (But if you've got many many thousand
SQL objects, a file-per-SQL-object directory will also be no fun to
deal with.)

            regards, tom lane

Re: WIP Patch: pg_dump structured

From

Attila Soki

Date:

13 March 2023, 00:56:06

> On 12 Mar 2023, at 21:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Attila Soki <pgsql@attilasoki.com> writes:
>> This patch adds the structured output format to pg_dump.
>> This format is a plaintext output split up into multiple files and the
>> resulting small files are stored in a directory path based on the dumped object.
>
> Won't this fail completely with SQL objects whose names aren't suitable
> to be pathname components?  "A/B" is a perfectly good name so far as
> SQL is concerned.  You could also have problems with collisions on
> case-insensitive filesystems.

The “A/B” case is handled in _CleanFilename function, the slash and other
problematic characters are replaced.
You are right about the case-insensivity, this is not handled and will fail. I forgot
to handle that. I trying to find a way to handle this.

>
>> This format can be restored by feeding its plaintext toc file (restore-dump.sql)
>> to psql. The output is also suitable for manipulating the files with standard
>> editing tools.
>
> This seems a little contradictory: if you want to edit the individual
> files, you'd have to also update restore-dump.sql, or else it's pointless.
> It might make more sense to consider this as a write-only dump format
> and not worry about whether it can be restored directly.

The main motivation was to track changes with VCS at the file (object) level,
editing small files was intended as a second possible use case.
I did not know that a write-only format would go.

>
>> What do you think of this feature, any chance it will be added to pg_dump once
>> the patch is ready?
>
> I'm not clear on how big the use-case is.  It's not really obvious to
> me that this'd have any benefit over the existing plain-text dump
> capability.  You can edit those files too, at least till the schema
> gets too big for your editor.  (But if you've got many many thousand
> SQL objects, a file-per-SQL-object directory will also be no fun to
> deal with.)

I use something like this (a previous version) to track several thousand
objects. But I'm not sure if that would have a wide user base.
Therefore the wip to see if there is interest in this feature.
I think the advantage of having many small files is that it is recognizable
which file (object) is involved in a commit and that the SQL functions and
tables get a change history.

Thank you for your feedback.

Regards,
Attila Soki

Re: WIP Patch: pg_dump structured

From

Attila Soki

Date:

23 March 2023, 18:34:05

On 12 Mar 2023, at 22:56, Attila Soki <pgsql@attilasoki.com> wrote:
>> On 12 Mar 2023, at 21:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> Won't this fail completely with SQL objects whose names aren't suitable
>> to be pathname components?  "A/B" is a perfectly good name so far as
>> SQL is concerned.  You could also have problems with collisions on
>> case-insensitive filesystems.
>

> You are right about the case-insensivity, this is not handled and will fail. I forgot
> to handle that. I trying to find a way to handle this.

Hi Tom,

Thank you for your feedback.

This is an updated version of the pg_dump structured wip patch (V2) with the
following changes:
- to avoid path collisions on case insensitive filessystems, all path components
  created from user input are suffixed with the hex representation of a 32 bit
  hash. “A/B” and “a/b” will get different suffixes.
- all path components are now filesystem safe

All this is a proposal, if you know a better solution please let me know.

This patch is a WIP (V2). The patch is against master and it compiles
successfully on macOS 13.2.1 aarch64 and on Debian 11 arm64.
To test, execute pg_dump --format=structured --file=/path/to/outputdir dbname

>
>>
>>> This format can be restored by feeding its plaintext toc file (restore-dump.sql)
>>> to psql. The output is also suitable for manipulating the files with standard
>>> editing tools.
>>
>> This seems a little contradictory: if you want to edit the individual
>> files, you'd have to also update restore-dump.sql, or else it's pointless.
>> It might make more sense to consider this as a write-only dump format
>> and not worry about whether it can be restored directly.
>
> The main motivation was to track changes with VCS at the file (object) level,
> editing small files was intended as a second possible use case.
> I did not know that a write-only format would go.

Declaring this format as a write-only dump would allow a more flexible directory
structure since we wouldn't have to maintain the restore order.

>
>>
>>> What do you think of this feature, any chance it will be added to pg_dump once
>>> the patch is ready?
>>
>> I'm not clear on how big the use-case is.  It's not really obvious to
>> me that this'd have any benefit over the existing plain-text dump
>> capability.  You can edit those files too, at least till the schema
>> gets too big for your editor.  (But if you've got many many thousand
>> SQL objects, a file-per-SQL-object directory will also be no fun to
>> deal with.)

Here is a sample use case to demonstrate how this format could be used to track
schema changes with git. The main difference from using the existing plain-text
schema dump is, that this format makes it possible to keep a history of the
actual changes made to the individual objects. For example, to determine which
migrations have changed the foo function.

# import the schema into the repository
cd /path/to/my_app_code
pg_dump --format=structured --schema-only --file=foo_schema foodb
git add foo_schema --all
git commit foo_schema -m'initial commit foo_schema'

# make changes in the db
(my_app migrate foodb)
(psql foodb < tweak.sql)

# get a fresh dump
rm -rf foo_schema
pg_dump --format=structured --schema-only --file=foo_schema foodb

# now inspect the changes under foo_schema: there may be changed, new and
# missing files
git status foo_schema

# commit all schema changes
git add foo_schema -u
git commit foo_schema -m'changes from migration foodb'

# later, inspect changes
git log --stat

# show the history of one object
git log -p -- "foo_schema/path/to/FUNCTIONS/foo.sql"

Sure, the user base for this is narrow.

Thanks for any feedback.

—
Best regards
Attila Soki

Attachment

v2-wip-pg_dump_structured.patch