Thread: Non-text mode for pg_dumpall
Tom and Nathan opined recently that providing for non-text mode for pg_dumpall would be a Good Thing (TM). Not having it has been a long-standing complaint, so I've decided to give it a go.

I think we would need to restrict it to directory mode, at least to begin with. I would have a toc.dat with a different magic block (say "PGGLO" instead of "PGDMP") containing the global entries (roles, tablespaces, databases). Then for each database there would be a subdirectory (named for its toc entry) with a standard directory mode dump for that database. These could be generated in parallel (possibly by pg_dumpall calling pg_dump for each database). pg_restore, on detecting a global-type toc.dat, would restore the globals and then each of the databases (again possibly in parallel).

I'm sure there are many wrinkles I haven't thought of, but I don't see any insurmountable obstacles, just a significant amount of code. Barring the unforeseen, my main aim is to have a preliminary patch by the September CF. Following that I would turn my attention to using it in pg_upgrade.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
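[Editor's note: the on-disk layout Andrew describes might look roughly like this. This is a mocked-up sketch using placeholder directory and database names; only toc.dat and the "PGGLO"/"PGDMP" magic markers come from the proposal, and the detection logic is invented for illustration.]

```shell
#!/bin/sh
# Hypothetical layout of a directory-mode pg_dumpall archive, mocked up
# with placeholder files (no real dump data).
mkdir -p cluster_dump/db_1 cluster_dump/db_2
printf 'PGGLO' > cluster_dump/toc.dat        # global TOC: roles, tablespaces, databases
printf 'PGDMP' > cluster_dump/db_1/toc.dat   # standard per-database directory dump
printf 'PGDMP' > cluster_dump/db_2/toc.dat

# How a restore tool might tell the two archive kinds apart:
# read the 5-byte magic block at the start of toc.dat.
magic=$(head -c 5 cluster_dump/toc.dat)
if [ "$magic" = "PGGLO" ]; then
    echo "global archive: restore globals, then each database subdirectory"
else
    echo "single-database archive"
fi
```

Running this prints the "global archive" branch, since the top-level toc.dat carries the proposed "PGGLO" marker.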
On Mon, Jun 10, 2024 at 08:58:49AM -0400, Andrew Dunstan wrote:
> Tom and Nathan opined recently that providing for non-text mode for
> pg_dumpall would be a Good Thing (TM). Not having it has been a
> long-standing complaint, so I've decided to give it a go.

Thank you!

> I think we would need to restrict it to directory mode, at least to begin
> with. I would have a toc.dat with a different magic block (say "PGGLO"
> instead of "PGDMP") containing the global entries (roles, tablespaces,
> databases). Then for each database there would be a subdirectory (named for
> its toc entry) with a standard directory mode dump for that database. These
> could be generated in parallel (possibly by pg_dumpall calling pg_dump for
> each database). pg_restore on detecting a global-type toc.dat would restore
> the globals and then each of the databases (again possibly in parallel).

I'm curious why we couldn't also support the "custom" format.

> Following that I would turn my attention to using it in pg_upgrade.

+1

--
nathan
On 2024-06-10 Mo 10:14, Nathan Bossart wrote:
> On Mon, Jun 10, 2024 at 08:58:49AM -0400, Andrew Dunstan wrote:
>> Tom and Nathan opined recently that providing for non-text mode for
>> pg_dumpall would be a Good Thing (TM). Not having it has been a
>> long-standing complaint, so I've decided to give it a go.
> Thank you!
>
>> I think we would need to restrict it to directory mode, at least to begin
>> with. I would have a toc.dat with a different magic block (say "PGGLO"
>> instead of "PGDMP") containing the global entries (roles, tablespaces,
>> databases). Then for each database there would be a subdirectory (named for
>> its toc entry) with a standard directory mode dump for that database. These
>> could be generated in parallel (possibly by pg_dumpall calling pg_dump for
>> each database). pg_restore on detecting a global-type toc.dat would restore
>> the globals and then each of the databases (again possibly in parallel).
> I'm curious why we couldn't also support the "custom" format.

We could, but the housekeeping would be a bit harder. We'd need to keep pointers to the offsets of the per-database TOCs (I don't want to have a single per-cluster TOC). And we can't produce a custom-format archive in parallel, so I'd rather start with something we can produce in parallel.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Mon, Jun 10, 2024 at 4:14 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Mon, Jun 10, 2024 at 08:58:49AM -0400, Andrew Dunstan wrote:
> Tom and Nathan opined recently that providing for non-text mode for
> pg_dumpall would be a Good Thing (TM). Not having it has been a
> long-standing complaint, so I've decided to give it a go.
Thank you!
Indeed, this has been quite annoying!
> I think we would need to restrict it to directory mode, at least to begin
> with. I would have a toc.dat with a different magic block (say "PGGLO"
> instead of "PGDMP") containing the global entries (roles, tablespaces,
> databases). Then for each database there would be a subdirectory (named for
> its toc entry) with a standard directory mode dump for that database. These
> could be generated in parallel (possibly by pg_dumpall calling pg_dump for
> each database). pg_restore on detecting a global-type toc.dat would restore
> the globals and then each of the databases (again possibly in parallel).
I'm curious why we couldn't also support the "custom" format.
Or maybe even a combo - a directory of custom format files? Plus that one special file being globals? I'd say that's what most use cases I've seen would prefer.
On Mon, Jun 10, 2024 at 10:51:42AM -0400, Andrew Dunstan wrote:
> On 2024-06-10 Mo 10:14, Nathan Bossart wrote:
>> I'm curious why we couldn't also support the "custom" format.
>
> We could, but the housekeeping would be a bit harder. We'd need to keep
> pointers to the offsets of the per-database TOCs (I don't want to have a
> single per-cluster TOC). And we can't produce it in parallel, so I'd rather
> start with something we can produce in parallel.

Got it.

--
nathan
On Mon, Jun 10, 2024 at 04:52:06PM +0200, Magnus Hagander wrote:
> On Mon, Jun 10, 2024 at 4:14 PM Nathan Bossart <nathandbossart@gmail.com>
> wrote:
>> I'm curious why we couldn't also support the "custom" format.
>
> Or maybe even a combo - a directory of custom format files? Plus that one
> special file being globals? I'd say that's what most use cases I've seen
> would prefer.

Is there a particular advantage to that approach as opposed to just using "directory" mode for everything? I know pg_upgrade uses "custom" mode for each of the databases, so a combo approach would be a closer match to the existing behavior, but that doesn't strike me as an especially strong reason to keep doing it that way.

--
nathan
On Mon, Jun 10, 2024 at 5:03 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Mon, Jun 10, 2024 at 04:52:06PM +0200, Magnus Hagander wrote:
> On Mon, Jun 10, 2024 at 4:14 PM Nathan Bossart <nathandbossart@gmail.com>
> wrote:
>> I'm curious why we couldn't also support the "custom" format.
>
> Or maybe even a combo - a directory of custom format files? Plus that one
> special file being globals? I'd say that's what most use cases I've seen
> would prefer.
Is there a particular advantage to that approach as opposed to just using
"directory" mode for everything? I know pg_upgrade uses "custom" mode for
each of the databases, so a combo approach would be a closer match to the
existing behavior, but that doesn't strike me as an especially strong
reason to keep doing it that way.
A gazillion files to deal with? Much easier to work with individual custom files if you're moving databases around and things like that. Much easier to monitor eg sizes/dates if you're using it for backups.
It's not things that are make-it-or-break-it or anything, but there are some smaller things that definitely can be useful.
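[Editor's note: a small self-contained sketch of the monitoring point Magnus makes. The database and file names are invented, and the mock files stand in for real dumps; it only illustrates why one custom-format file per database is easier to inspect than a directory-mode tree with many internal files.]

```shell
#!/bin/sh
# Mock backups: one custom-format file per database (names are hypothetical).
mkdir -p backups
printf 'PGDMP...' > backups/appdb.dump
printf 'PGDMP...' > backups/reporting.dump

# Per-database sizes and dates are one command against named files:
ls -l backups/*.dump

# ...whereas a directory-mode tree scatters each database over many files,
# so any size/date check has to aggregate:
mkdir -p dumptree/appdb dumptree/reporting
: > dumptree/appdb/toc.dat
: > dumptree/reporting/toc.dat
echo "files in directory-mode tree: $(find dumptree -type f | wc -l)"
```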
On Mon, Jun 10, 2024 at 05:45:19PM +0200, Magnus Hagander wrote:
> On Mon, Jun 10, 2024 at 5:03 PM Nathan Bossart <nathandbossart@gmail.com>
> wrote:
>> Is there a particular advantage to that approach as opposed to just using
>> "directory" mode for everything? I know pg_upgrade uses "custom" mode for
>> each of the databases, so a combo approach would be a closer match to the
>> existing behavior, but that doesn't strike me as an especially strong
>> reason to keep doing it that way.
>
> A gazillion files to deal with? Much easier to work with individual custom
> files if you're moving databases around and things like that.
> Much easier to monitor eg sizes/dates if you're using it for backups.
>
> It's not things that are make-it-or-break-it or anything, but there are
> some smaller things that definitely can be useful.

Makes sense, thanks for elaborating.

--
nathan
Magnus Hagander <magnus@hagander.net> writes:
> On Mon, Jun 10, 2024 at 5:03 PM Nathan Bossart <nathandbossart@gmail.com>
> wrote:
>> Is there a particular advantage to that approach as opposed to just using
>> "directory" mode for everything?

> A gazillion files to deal with? Much easier to work with individual custom
> files if you're moving databases around and things like that.
> Much easier to monitor eg sizes/dates if you're using it for backups.

You can always tar up the directory tree after-the-fact if you want one file. Sure, that step's not parallelized, but I think we'd need some non-parallelized copying to create such a file anyway.

regards, tom lane
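[Editor's note: the after-the-fact step Tom describes is just tar over the dump directory. A minimal sketch; the directory contents are mocked with placeholder files, whereas in reality the tree would come from something like `pg_dump -Fd -j 4 -f dumpdir mydb`.]

```shell
#!/bin/sh
# Mock a directory-mode dump tree (placeholder files, not a real dump).
mkdir -p dumpdir
printf 'PGDMP' > dumpdir/toc.dat
: > dumpdir/3001.dat.gz

# The single-file step is a plain, non-parallel tar over the finished tree.
tar -cf dump.tar dumpdir
tar -tf dump.tar    # lists dumpdir/, dumpdir/toc.dat, dumpdir/3001.dat.gz
```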
On 2024-06-10 Mo 12:21, Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Mon, Jun 10, 2024 at 5:03 PM Nathan Bossart <nathandbossart@gmail.com>
>> wrote:
>>> Is there a particular advantage to that approach as opposed to just using
>>> "directory" mode for everything?
>> A gazillion files to deal with? Much easier to work with individual custom
>> files if you're moving databases around and things like that.
>> Much easier to monitor eg sizes/dates if you're using it for backups.
> You can always tar up the directory tree after-the-fact if you want
> one file. Sure, that step's not parallelized, but I think we'd need
> some non-parallelized copying to create such a file anyway.

Yeah. I think I can probably allow for Magnus' suggestion fairly easily, but if I have to choose I'm going to go for the format that can be produced with the maximum parallelism.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On Mon, Jun 10, 2024 at 6:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Magnus Hagander <magnus@hagander.net> writes:
> On Mon, Jun 10, 2024 at 5:03 PM Nathan Bossart <nathandbossart@gmail.com>
> wrote:
>> Is there a particular advantage to that approach as opposed to just using
>> "directory" mode for everything?
> A gazillion files to deal with? Much easier to work with individual custom
> files if you're moving databases around and things like that.
> Much easier to monitor eg sizes/dates if you're using it for backups.
You can always tar up the directory tree after-the-fact if you want
one file. Sure, that step's not parallelized, but I think we'd need
some non-parallelized copying to create such a file anyway.
That would require double the disk space.
But you can also just run pg_dump manually on each database plus a pg_dumpall -g like people are doing today -- I thought this whole thing was about making it more convenient :)
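[Editor's note: the present-day workaround Magnus refers to, sketched below. The database names are placeholders, and RUN=echo makes the script a dry run so it works without a live cluster; set RUN= (empty) to actually dump. In practice the database list would come from something like `psql -Atc "SELECT datname FROM pg_database WHERE NOT datistemplate"`.]

```shell
#!/bin/sh
# Dry-run sketch of today's workaround: globals via pg_dumpall -g,
# then one custom-format pg_dump per database.
RUN=echo    # set RUN= to execute for real against a running cluster

$RUN pg_dumpall --globals-only --file=globals.sql
for db in appdb reporting; do     # placeholder database names
    $RUN pg_dump --format=custom --file="$db.dump" "$db"
done
```

With RUN=echo this only prints the commands it would run, which also makes it easy to eyeball before pointing it at a real cluster.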