Thread: WIP Patch: pg_dump structured
Hi all, I was looking for a way to track actual schema changes after database migrations in a VCS. Preferably, the schema definition should come from a trusted source like pg_dump and should consist of small files. This patch was born out of that need. This patch adds the structured output format to pg_dump. This format is a plaintext output split up into multiple files and the resulting small files are stored in a directory path based on the dumped object. This format can be restored by feeding its plaintext toc file (restore-dump.sql) to psql. The output is also suitable for manipulating the files with standard editing tools. This patch is a WIP (V1). The patch is against master and it compiles successfully on macOS 13.2.1 aarch64 and on Debian 11 arm64. To test, execute pg_dump --format=structured --file=/path/to/outputdir dbname What do you think of this feature, any chance it will be added to pg_dump once the patch is ready? Is the chosen name "structured" appropriate? Thanks for any feedback. -- Attila Soki
Attachment
Attila Soki <pgsql@attilasoki.com> writes: > This patch adds the structured output format to pg_dump. > This format is a plaintext output split up into multiple files and the > resulting small files are stored in a directory path based on the dumped object. Won't this fail completely with SQL objects whose names aren't suitable to be pathname components? "A/B" is a perfectly good name so far as SQL is concerned. You could also have problems with collisions on case-insensitive filesystems. > This format can be restored by feeding its plaintext toc file (restore-dump.sql) > to psql. The output is also suitable for manipulating the files with standard > editing tools. This seems a little contradictory: if you want to edit the individual files, you'd have to also update restore-dump.sql, or else it's pointless. It might make more sense to consider this as a write-only dump format and not worry about whether it can be restored directly. > What do you think of this feature, any chance it will be added to pg_dump once > the patch is ready? I'm not clear on how big the use-case is. It's not really obvious to me that this'd have any benefit over the existing plain-text dump capability. You can edit those files too, at least till the schema gets too big for your editor. (But if you've got many many thousand SQL objects, a file-per-SQL-object directory will also be no fun to deal with.) regards, tom lane
> On 12 Mar 2023, at 21:50, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Attila Soki <pgsql@attilasoki.com> writes: >> This patch adds the structured output format to pg_dump. >> This format is a plaintext output split up into multiple files and the >> resulting small files are stored in a directory path based on the dumped object. > > Won't this fail completely with SQL objects whose names aren't suitable > to be pathname components? "A/B" is a perfectly good name so far as > SQL is concerned. You could also have problems with collisions on > case-insensitive filesystems. The “A/B” case is handled in _CleanFilename function, the slash and other problematic characters are replaced. You are right about the case-insensivity, this is not handled and will fail. I forgot to handle that. I trying to find a way to handle this. > >> This format can be restored by feeding its plaintext toc file (restore-dump.sql) >> to psql. The output is also suitable for manipulating the files with standard >> editing tools. > > This seems a little contradictory: if you want to edit the individual > files, you'd have to also update restore-dump.sql, or else it's pointless. > It might make more sense to consider this as a write-only dump format > and not worry about whether it can be restored directly. The main motivation was to track changes with VCS at the file (object) level, editing small files was intended as a second possible use case. I did not know that a write-only format would go. > >> What do you think of this feature, any chance it will be added to pg_dump once >> the patch is ready? > > I'm not clear on how big the use-case is. It's not really obvious to > me that this'd have any benefit over the existing plain-text dump > capability. You can edit those files too, at least till the schema > gets too big for your editor. (But if you've got many many thousand > SQL objects, a file-per-SQL-object directory will also be no fun to > deal with.) I use something like this (a previous version) to track several thousand objects. But I'm not sure if that would have a wide user base. Therefore the wip to see if there is interest in this feature. I think the advantage of having many small files is that it is recognizable which file (object) is involved in a commit and that the SQL functions and tables get a change history. Thank you for your feedback. Regards, Attila Soki
On 12 Mar 2023, at 22:56, Attila Soki <pgsql@attilasoki.com> wrote: >> On 12 Mar 2023, at 21:50, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Won't this fail completely with SQL objects whose names aren't suitable >> to be pathname components? "A/B" is a perfectly good name so far as >> SQL is concerned. You could also have problems with collisions on >> case-insensitive filesystems. > > You are right about the case-insensivity, this is not handled and will fail. I forgot > to handle that. I trying to find a way to handle this. Hi Tom, Thank you for your feedback. This is an updated version of the pg_dump structured wip patch (V2) with the following changes: - to avoid path collisions on case insensitive filessystems, all path components created from user input are suffixed with the hex representation of a 32 bit hash. “A/B” and “a/b” will get different suffixes. - all path components are now filesystem safe All this is a proposal, if you know a better solution please let me know. This patch is a WIP (V2). The patch is against master and it compiles successfully on macOS 13.2.1 aarch64 and on Debian 11 arm64. To test, execute pg_dump --format=structured --file=/path/to/outputdir dbname > >> >>> This format can be restored by feeding its plaintext toc file (restore-dump.sql) >>> to psql. The output is also suitable for manipulating the files with standard >>> editing tools. >> >> This seems a little contradictory: if you want to edit the individual >> files, you'd have to also update restore-dump.sql, or else it's pointless. >> It might make more sense to consider this as a write-only dump format >> and not worry about whether it can be restored directly. > > The main motivation was to track changes with VCS at the file (object) level, > editing small files was intended as a second possible use case. > I did not know that a write-only format would go. Declaring this format as a write-only dump would allow a more flexible directory structure since we wouldn't have to maintain the restore order. > >> >>> What do you think of this feature, any chance it will be added to pg_dump once >>> the patch is ready? >> >> I'm not clear on how big the use-case is. It's not really obvious to >> me that this'd have any benefit over the existing plain-text dump >> capability. You can edit those files too, at least till the schema >> gets too big for your editor. (But if you've got many many thousand >> SQL objects, a file-per-SQL-object directory will also be no fun to >> deal with.) Here is a sample use case to demonstrate how this format could be used to track schema changes with git. The main difference from using the existing plain-text schema dump is, that this format makes it possible to keep a history of the actual changes made to the individual objects. For example, to determine which migrations have changed the foo function. # import the schema into the repository cd /path/to/my_app_code pg_dump --format=structured --schema-only --file=foo_schema foodb git add foo_schema --all git commit foo_schema -m'initial commit foo_schema' # make changes in the db (my_app migrate foodb) (psql foodb < tweak.sql) # get a fresh dump rm -rf foo_schema pg_dump --format=structured --schema-only --file=foo_schema foodb # now inspect the changes under foo_schema: there may be changed, new and # missing files git status foo_schema # commit all schema changes git add foo_schema -u git commit foo_schema -m'changes from migration foodb' # later, inspect changes git log --stat # show the history of one object git log -p -- "foo_schema/path/to/FUNCTIONS/foo.sql" Sure, the user base for this is narrow. Thanks for any feedback. — Best regards Attila Soki