Thread: Pg 16: will pg_dump & pg_restore be faster?

Pg 16: will pg_dump & pg_restore be faster?

From
Ron
Date:
https://www.postgresql.org/about/news/postgresql-16-beta-1-released-2643/ 
says "PostgreSQL 16 can also improve the performance of concurrent bulk 
loading of data using COPY up to 300%."

Since pg_dump & pg_restore use COPY (or something very similar), will the 
speed increase translate to higher speeds for those utilities?

-- 
Born in Arizona, moved to Babylonia.



Re: Pg 16: will pg_dump & pg_restore be faster?

From
David Rowley
Date:
On Wed, 31 May 2023 at 08:54, Ron <ronljohnsonjr@gmail.com> wrote:
> https://www.postgresql.org/about/news/postgresql-16-beta-1-released-2643/
> says "PostgreSQL 16 can also improve the performance of concurrent bulk
> loading of data using COPY up to 300%."
>
> Since pg_dump & pg_restore use COPY (or something very similar), will the
> speed increase translate to higher speeds for those utilities?

I think the improvements to relation extension only help when multiple
backends need to extend the relation at the same time.  pg_restore can
have multiple workers, but the tasks that each worker performs are
only divided as far as an entire table, i.e. 2 workers will never be
working on the same table at the same time. So there is no concurrency
in terms of 2 or more workers working on loading data into the same
table at the same time.

It might be an interesting project now that we have TidRange scans, to
have pg_dump split larger tables into chunks so that they can be
restored in parallel.

David



Re: Pg 16: will pg_dump & pg_restore be faster?

From
Bruce Momjian
Date:
On Wed, May 31, 2023 at 09:14:20AM +1200, David Rowley wrote:
> On Wed, 31 May 2023 at 08:54, Ron <ronljohnsonjr@gmail.com> wrote:
> > https://www.postgresql.org/about/news/postgresql-16-beta-1-released-2643/
> > says "PostgreSQL 16 can also improve the performance of concurrent bulk
> > loading of data using COPY up to 300%."
> >
> > Since pg_dump & pg_restore use COPY (or something very similar), will the
> > speed increase translate to higher speeds for those utilities?
> 
> I think the improvements to relation extension only help when multiple
> backends need to extend the relation at the same time.  pg_restore can
> have multiple workers, but the tasks that each worker performs are
> only divided as far as an entire table, i.e. 2 workers will never be
> working on the same table at the same time. So there is no concurrency
> in terms of 2 or more workers working on loading data into the same
> table at the same time.
> 
> It might be an interesting project now that we have TidRange scans, to
> have pg_dump split larger tables into chunks so that they can be
> restored in parallel.

Uh, the release notes say:

    <!--
    Author: Andres Freund <andres@anarazel.de>
    2023-04-06 [00d1e02be] hio: Use ExtendBufferedRelBy() to extend tables more eff
    Author: Andres Freund <andres@anarazel.de>
    2023-04-06 [26158b852] Use ExtendBufferedRelTo() in XLogReadBufferExtended()
    -->
    
    <listitem>
    <para>
    Allow more efficient addition of heap and index pages (Andres Freund)
    </para>
    </listitem>

There is no mention of concurrency being a requirement.  Is it wrong?  I
think there was a question of whether you had to add _multiple_ blocks
ot get a benefit, not if concurrency was needed.  This email about the
release notes didn't mention the concurrent requirement:

    https://www.postgresql.org/message-id/20230521171341.jjxykfsefsek4kzj%40awork3.anarazel.de


    While the case of extending by multiple pages improved the most, even
    extending by a single page at a time got a good bit more scalable. Maybe
    just "Improve efficiency of extending relations"?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Pg 16: will pg_dump & pg_restore be faster?

From
David Rowley
Date:
On Wed, 31 May 2023 at 13:13, Bruce Momjian <bruce@momjian.us> wrote:
> There is no mention of concurrency being a requirement.  Is it wrong?  I
> think there was a question of whether you had to add _multiple_ blocks
> ot get a benefit, not if concurrency was needed.  This email about the
> release notes didn't mention the concurrent requirement:

My understanding had been that concurrency was required, but I see the
commit message for 00d1e02be mentions:

> Even single threaded
> COPY is measurably faster, primarily due to not dirtying pages while
> extending, if supported by the operating system (see commit 4d330a61bb1).

If that's the case then maybe the beta release notes could be edited
slightly to reflect this. Maybe something like:

"Relation extensions have been improved allowing faster bulk loading
of data using COPY. These improvements are more significant when
multiple processes are concurrently loading data into the same table."

The current text of "PostgreSQL 16 can also improve the performance of
concurrent bulk loading of data using COPY up to 300%." does lead me
to believe that nothing has been done to improve things when only a
single backend is involved.

David



Re: Pg 16: will pg_dump & pg_restore be faster?

From
Bruce Momjian
Date:
On Wed, May 31, 2023 at 02:05:10PM +1200, David Rowley wrote:
> On Wed, 31 May 2023 at 13:13, Bruce Momjian <bruce@momjian.us> wrote:
> > There is no mention of concurrency being a requirement.  Is it wrong?  I
> > think there was a question of whether you had to add _multiple_ blocks
> > ot get a benefit, not if concurrency was needed.  This email about the
> > release notes didn't mention the concurrent requirement:
> 
> My understanding had been that concurrency was required, but I see the
> commit message for 00d1e02be mentions:
> 
> > Even single threaded
> > COPY is measurably faster, primarily due to not dirtying pages while
> > extending, if supported by the operating system (see commit 4d330a61bb1).
> 
> If that's the case then maybe the beta release notes could be edited
> slightly to reflect this. Maybe something like:
> 
> "Relation extensions have been improved allowing faster bulk loading
> of data using COPY. These improvements are more significant when
> multiple processes are concurrently loading data into the same table."

The release notes don't normally get into details on the magnitude of
the improvement in various circumstances.

> The current text of "PostgreSQL 16 can also improve the performance of
> concurrent bulk loading of data using COPY up to 300%." does lead me
> to believe that nothing has been done to improve things when only a
> single backend is involved.

Yes, agreed.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Pg 16: will pg_dump & pg_restore be faster?

From
David Rowley
Date:
On Wed, 31 May 2023 at 14:11, Bruce Momjian <bruce@momjian.us> wrote:
>
> On Wed, May 31, 2023 at 02:05:10PM +1200, David Rowley wrote:
> > "Relation extensions have been improved allowing faster bulk loading
> > of data using COPY. These improvements are more significant when
> > multiple processes are concurrently loading data into the same table."
>
> The release notes don't normally get into details on the magnitude of
> the improvement in various circumstances.

Sorry, I meant the release announcement rather than the release notes here.

David



Re: Pg 16: will pg_dump & pg_restore be faster?

From
Bruce Momjian
Date:
On Wed, May 31, 2023 at 02:18:25PM +1200, David Rowley wrote:
> On Wed, 31 May 2023 at 14:11, Bruce Momjian <bruce@momjian.us> wrote:
> >
> > On Wed, May 31, 2023 at 02:05:10PM +1200, David Rowley wrote:
> > > "Relation extensions have been improved allowing faster bulk loading
> > > of data using COPY. These improvements are more significant when
> > > multiple processes are concurrently loading data into the same table."
> >
> > The release notes don't normally get into details on the magnitude of
> > the improvement in various circumstances.
> 
> Sorry, I meant the release announcement rather than the release notes here.

Oh, yeah, that gets into more details, sure.

There is also the major features list at the top of the release notes
--- that needs adjustment based on release text I have updated recently,
but I don't feel I control that list.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Pg 16: will pg_dump & pg_restore be faster?

From
Bruce Momjian
Date:
On Tue, May 30, 2023 at 10:28:58PM -0400, Bruce Momjian wrote:
> On Wed, May 31, 2023 at 02:18:25PM +1200, David Rowley wrote:
> > On Wed, 31 May 2023 at 14:11, Bruce Momjian <bruce@momjian.us> wrote:
> > >
> > > On Wed, May 31, 2023 at 02:05:10PM +1200, David Rowley wrote:
> > > > "Relation extensions have been improved allowing faster bulk loading
> > > > of data using COPY. These improvements are more significant when
> > > > multiple processes are concurrently loading data into the same table."
> > >
> > > The release notes don't normally get into details on the magnitude of
> > > the improvement in various circumstances.
> > 
> > Sorry, I meant the release announcement rather than the release notes here.
> 
> Oh, yeah, that gets into more details, sure.
> 
> There is also the major features list at the top of the release notes
> --- that needs adjustment based on release text I have updated recently,
> but I don't feel I control that list.

Oh, I now remember I added that and reworded it when I did, so I have
now adjusted it to match the new OUTER parallelism text.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: Pg 16: will pg_dump & pg_restore be faster?

From
Andres Freund
Date:
Hi,

On 2023-05-30 21:13:08 -0400, Bruce Momjian wrote:
> On Wed, May 31, 2023 at 09:14:20AM +1200, David Rowley wrote:
> > On Wed, 31 May 2023 at 08:54, Ron <ronljohnsonjr@gmail.com> wrote:
> > > https://www.postgresql.org/about/news/postgresql-16-beta-1-released-2643/
> > > says "PostgreSQL 16 can also improve the performance of concurrent bulk
> > > loading of data using COPY up to 300%."
> > >
> > > Since pg_dump & pg_restore use COPY (or something very similar), will the
> > > speed increase translate to higher speeds for those utilities?
> > 
> > I think the improvements to relation extension only help when multiple
> > backends need to extend the relation at the same time.  pg_restore can
> > have multiple workers, but the tasks that each worker performs are
> > only divided as far as an entire table, i.e. 2 workers will never be
> > working on the same table at the same time. So there is no concurrency
> > in terms of 2 or more workers working on loading data into the same
> > table at the same time.
> > 
> > It might be an interesting project now that we have TidRange scans, to
> > have pg_dump split larger tables into chunks so that they can be
> > restored in parallel.
> 
> Uh, the release notes say:
> 
>     <!--
>     Author: Andres Freund <andres@anarazel.de>
>     2023-04-06 [00d1e02be] hio: Use ExtendBufferedRelBy() to extend tables more eff
>     Author: Andres Freund <andres@anarazel.de>
>     2023-04-06 [26158b852] Use ExtendBufferedRelTo() in XLogReadBufferExtended()
>     -->
>     
>     <listitem>
>     <para>
>     Allow more efficient addition of heap and index pages (Andres Freund)
>     </para>
>     </listitem>
> 
> There is no mention of concurrency being a requirement.  Is it wrong?  I
> think there was a question of whether you had to add _multiple_ blocks
> ot get a benefit, not if concurrency was needed.  This email about the
> release notes didn't mention the concurrent requirement:

>     https://www.postgresql.org/message-id/20230521171341.jjxykfsefsek4kzj%40awork3.anarazel.de

There's multiple improvements that work together to get the overall
improvement. One part of that is filesystem interactions, another is holding
the relation extension lock for a *much* shorter time. The former helps
regardless of concurrency, the latter only with concurrency.

Regards,

Andres



Re: Pg 16: will pg_dump & pg_restore be faster?

From
"Jonathan S. Katz"
Date:
On 5/30/23 10:05 PM, David Rowley wrote:

> My understanding had been that concurrency was required, but I see the
> commit message for 00d1e02be mentions:
> 
>> Even single threaded
>> COPY is measurably faster, primarily due to not dirtying pages while
>> extending, if supported by the operating system (see commit 4d330a61bb1).
> 
> If that's the case then maybe the beta release notes could be edited
> slightly to reflect this. Maybe something like:
> 
> "Relation extensions have been improved allowing faster bulk loading
> of data using COPY. These improvements are more significant when
> multiple processes are concurrently loading data into the same table."
> 
> The current text of "PostgreSQL 16 can also improve the performance of
> concurrent bulk loading of data using COPY up to 300%." does lead me
> to believe that nothing has been done to improve things when only a
> single backend is involved.

Typically once a release announcement is out, we'll only edit it if it's 
inaccurate. I don't think the statement in the release announcement is 
inaccurate, as it specifies that concurrent bulk loading is faster.

I had based the description on what Andres described in the original 
discussion and through reading[1], which showed a "measurable" 
improvement as the commit message said, but it was not to the same 
degree as concurrently loading. It does still seem impactful -- the 
results show up to 20% improvement on a single backend -- but the bigger 
story was around the concurrency.

I'm -0.5 for revising the announcement, but I also don't want people to 
miss out on testing this. I'd be OK with this:

"PostgreSQL 16 can also improve the performance of bulk loading of data, 
with some tests showing using up to 300% improvement when concurrently 
executing `COPY` commands."

Thanks,

Jonathan

[1] 
https://www.postgresql.org/message-id/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de


Attachment

Re: Pg 16: will pg_dump & pg_restore be faster?

From
David Rowley
Date:
On Sat, 3 Jun 2023 at 00:14, Jonathan S. Katz <jkatz@postgresql.org> wrote:
> Typically once a release announcement is out, we'll only edit it if it's
> inaccurate. I don't think the statement in the release announcement is
> inaccurate, as it specifies that concurrent bulk loading is faster.

Understood.  I had thought that the policy might be that if there's
room for and reason enough to make improvements, then we probably
should.  We do aim to still make improvements to fix any problem with
the software that's the topic of the announcement, maybe it's strange
that we want to lock down what we write about that software just
before the beta1 release.

> I'm -0.5 for revising the announcement, but I also don't want people to
> miss out on testing this. I'd be OK with this:
>
> "PostgreSQL 16 can also improve the performance of bulk loading of data,
> with some tests showing using up to 300% improvement when concurrently
> executing `COPY` commands."

I might have just misunderstood the release notes based on my
misunderstanding of Andres's work that it only improved things when
multiple backends were extending the relation at the same time.  The
release announcement did seem to confirm that there had to be
concurrency, so it might be good to not lead anyone else down into
thinking that only concurrent cases are faster. I certainly understand
that's where the big wins are.

I'm fine with your proposed wording.

David