Thread: Re: request clarification on pg_restore documentation

Re: request clarification on pg_restore documentation

From

Bruce Momjian

Date:

06 October 2022, 17:28:15

On Thu, Sep 29, 2022 at 02:30:13PM +0000, PG Doc comments form wrote:
> The following documentation comment has been logged on the website:
> 
> Page: https://www.postgresql.org/docs/14/app-pgrestore.html
> Description:
> 
> pg_restore seems to have two ways to restore data:
> 
> --section=data 
> 
> or
> 
>  --data-only
> 
> There is this cryptic warning that --data-only is "similar to but for
> historical reasons different from" --section=data
> 
> But there is no further explanation of what those differences are or what
> might be missed or different in your restore if you pick one option or the
> other. Maybe one or the other option is the "preferred current way" and one
> is the "historical way" or they are aimed at different types of use cases,
> but that's not clear.

[Thread moved from docs to hackers because there are behavioral issues.]

Very good question.  I dug into this and found this commit which says
--data-only and --section=data were equivalent:

    commit a4cd6abcc9
    Author: Andrew Dunstan <andrew@dunslane.net>
    Date:   Fri Dec 16 19:09:38 2011 -0500
    
        Add --section option to pg_dump and pg_restore.
    
        Valid values are --pre-data, data and post-data. The option can be
        given more than once. --schema-only is equivalent to
-->        --section=pre-data --section=post-data. --data-only is equivalent
-->        to --section=data.
    
and then this commit which says they are not:

    commit 4317e0246c
    Author: Tom Lane <tgl@sss.pgh.pa.us>
    Date:   Tue May 29 23:22:14 2012 -0400
    
        Rewrite --section option to decouple it from --schema-only/--data-only.
    
-->        The initial implementation of pg_dump's --section option supposed that the
-->        existing --schema-only and --data-only options could be made equivalent to
-->        --section settings.  This is wrong, though, due to dubious but long since
        set-in-stone decisions about where to dump SEQUENCE SET items, as seen in
        bug report from Martin Pitt.  (And I'm not totally convinced there weren't
        other bugs, either.)  Undo that coupling and instead drive --section
        filtering off current-section state tracked as we scan through the TOC
        list to call _tocEntryRequired().
    
        To make sure those decisions don't shift around and hopefully save a few
        cycles, run _tocEntryRequired() only once per TOC entry and save the result
        in a new TOC field.  This required minor rejiggering of ACL handling but
        also allows a far cleaner implementation of inhibit_data_for_failed_table.
    
        Also, to ensure that pg_dump and pg_restore have the same behavior with
        respect to the --section switches, add _tocEntryRequired() filtering to
        WriteToc() and WriteDataChunks(), rather than trying to implement section
        filtering in an entirely orthogonal way in dumpDumpableObject().  This
        required adjusting the handling of the special ENCODING and STDSTRINGS
        items, but they were pretty weird before anyway.

and this commit which made them closer:

    commit 5a39114fe7
    Author: Tom Lane <tgl@sss.pgh.pa.us>
    Date:   Fri Oct 26 12:12:42 2012 -0400
    
        In pg_dump, dump SEQUENCE SET items in the data not pre-data section.
    
        Represent a sequence's current value as a separate TableDataInfo dumpable
        object, so that it can be dumped within the data section of the archive
-->        rather than in pre-data.  This fixes an undesirable inconsistency between
-->        the meanings of "--data-only" and "--section=data", and also fixes dumping
        of sequences that are marked as extension configuration tables, as per a
        report from Marko Kreen back in July.  The main cost is that we do one more
        SQL query per sequence, but that's probably not very meaningful in most
        databases.

Looking at the restore code, I see --data-only disabling triggers, while
--section=data doesn't.  I also tested --data-only vs. --section=data in
pg_dump for the regression database and saw the only differences as the
creation and comments on large objects, e.g.,

    -- Name: 2121; Type: BLOB; Schema: -; Owner: postgres
    --
    
    SELECT pg_catalog.lo_create('2121');


    ALTER LARGE OBJECT 2121 OWNER TO postgres;

    --
    -- Name: LARGE OBJECT 2121; Type: COMMENT; Schema: -; Owner: postgres
    --

    COMMENT ON LARGE OBJECT 2121 IS 'testing comments';

but the large object _data_ was dumped in both cases.

So, where does this leave us?  We know we need --section=data because
the pre/post-data options are clearly useful, so why would someone use
--data-only vs. --section=data.  We don't document why to use one rather
than the other, so the --data-only option looks useless to me.  Do we
remove it, adjust it, or leave it alone?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson

Re: request clarification on pg_restore documentation

From

Bruce Momjian

Date:

28 October 2023, 01:33:38

Does anyone have a suggestion on how to handle this issue?  The report
is a year old.

---------------------------------------------------------------------------

On Thu, Oct  6, 2022 at 10:28:15AM -0400, Bruce Momjian wrote:
> On Thu, Sep 29, 2022 at 02:30:13PM +0000, PG Doc comments form wrote:
> > The following documentation comment has been logged on the website:
> > 
> > Page: https://www.postgresql.org/docs/14/app-pgrestore.html
> > Description:
> > 
> > pg_restore seems to have two ways to restore data:
> > 
> > --section=data 
> > 
> > or
> > 
> >  --data-only
> > 
> > There is this cryptic warning that --data-only is "similar to but for
> > historical reasons different from" --section=data
> > 
> > But there is no further explanation of what those differences are or what
> > might be missed or different in your restore if you pick one option or the
> > other. Maybe one or the other option is the "preferred current way" and one
> > is the "historical way" or they are aimed at different types of use cases,
> > but that's not clear.
> 
> [Thread moved from docs to hackers because there are behavioral issues.]
> 
> Very good question.  I dug into this and found this commit which says
> --data-only and --section=data were equivalent:
> 
>     commit a4cd6abcc9
>     Author: Andrew Dunstan <andrew@dunslane.net>
>     Date:   Fri Dec 16 19:09:38 2011 -0500
>     
>         Add --section option to pg_dump and pg_restore.
>     
>         Valid values are --pre-data, data and post-data. The option can be
>         given more than once. --schema-only is equivalent to
> -->        --section=pre-data --section=post-data. --data-only is equivalent
> -->        to --section=data.
>     
> and then this commit which says they are not:
> 
>     commit 4317e0246c
>     Author: Tom Lane <tgl@sss.pgh.pa.us>
>     Date:   Tue May 29 23:22:14 2012 -0400
>     
>         Rewrite --section option to decouple it from --schema-only/--data-only.
>     
> -->        The initial implementation of pg_dump's --section option supposed that the
> -->        existing --schema-only and --data-only options could be made equivalent to
> -->        --section settings.  This is wrong, though, due to dubious but long since
>         set-in-stone decisions about where to dump SEQUENCE SET items, as seen in
>         bug report from Martin Pitt.  (And I'm not totally convinced there weren't
>         other bugs, either.)  Undo that coupling and instead drive --section
>         filtering off current-section state tracked as we scan through the TOC
>         list to call _tocEntryRequired().
>     
>         To make sure those decisions don't shift around and hopefully save a few
>         cycles, run _tocEntryRequired() only once per TOC entry and save the result
>         in a new TOC field.  This required minor rejiggering of ACL handling but
>         also allows a far cleaner implementation of inhibit_data_for_failed_table.
>     
>         Also, to ensure that pg_dump and pg_restore have the same behavior with
>         respect to the --section switches, add _tocEntryRequired() filtering to
>         WriteToc() and WriteDataChunks(), rather than trying to implement section
>         filtering in an entirely orthogonal way in dumpDumpableObject().  This
>         required adjusting the handling of the special ENCODING and STDSTRINGS
>         items, but they were pretty weird before anyway.
> 
> and this commit which made them closer:
> 
>     commit 5a39114fe7
>     Author: Tom Lane <tgl@sss.pgh.pa.us>
>     Date:   Fri Oct 26 12:12:42 2012 -0400
>     
>         In pg_dump, dump SEQUENCE SET items in the data not pre-data section.
>     
>         Represent a sequence's current value as a separate TableDataInfo dumpable
>         object, so that it can be dumped within the data section of the archive
> -->        rather than in pre-data.  This fixes an undesirable inconsistency between
> -->        the meanings of "--data-only" and "--section=data", and also fixes dumping
>         of sequences that are marked as extension configuration tables, as per a
>         report from Marko Kreen back in July.  The main cost is that we do one more
>         SQL query per sequence, but that's probably not very meaningful in most
>         databases.
> 
> Looking at the restore code, I see --data-only disabling triggers, while
> --section=data doesn't.  I also tested --data-only vs. --section=data in
> pg_dump for the regression database and saw the only differences as the
> creation and comments on large objects, e.g.,
> 
>     -- Name: 2121; Type: BLOB; Schema: -; Owner: postgres
>     --
>     
>     SELECT pg_catalog.lo_create('2121');
> 
> 
>     ALTER LARGE OBJECT 2121 OWNER TO postgres;
> 
>     --
>     -- Name: LARGE OBJECT 2121; Type: COMMENT; Schema: -; Owner: postgres
>     --
> 
>     COMMENT ON LARGE OBJECT 2121 IS 'testing comments';
> 
> but the large object _data_ was dumped in both cases.
> 
> So, where does this leave us?  We know we need --section=data because
> the pre/post-data options are clearly useful, so why would someone use
> --data-only vs. --section=data.  We don't document why to use one rather
> than the other, so the --data-only option looks useless to me.  Do we
> remove it, adjust it, or leave it alone?
> 
> -- 
>   Bruce Momjian  <bruce@momjian.us>        https://momjian.us
>   EDB                                      https://enterprisedb.com
> 
>   Indecision is a decision.  Inaction is an action.  Mark Batterson
> 
> 
> 

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.

Re: request clarification on pg_restore documentation

From

Tom Lane

Date:

28 October 2023, 01:53:00

Bruce Momjian <bruce@momjian.us> writes:
> Does anyone have a suggestion on how to handle this issue?

It might be that the later decision to change the representation
of sequence dumps would make it possible to undo 4317e0246c
and go back to having --schema-only/--data-only be true aliases
for --section.  But it'd take some research and probably end up
causing some behavioral changes (eg. trigger handling as you note).

Much the same research would be needed if you just wanted to
document the current state of affairs more clearly.

The real issue here is that --schema-only/--data-only do a few
things that are not within --section's remit, such as trigger
adjustments.  Do we want to cause --section to have those effects
too?  I dunno.  Do we want to give up those extra behaviors?
Almost certainly not.

Either way, I'm not personally planning to put effort into that
anytime soon.

            regards, tom lane