Thread: Pg 16: will pg_dump & pg_restore be faster?
https://www.postgresql.org/about/news/postgresql-16-beta-1-released-2643/ says "PostgreSQL 16 can also improve the performance of concurrent bulk loading of data using COPY up to 300%." Since pg_dump & pg_restore use COPY (or something very similar), will the speed increase translate to higher speeds for those utilities? -- Born in Arizona, moved to Babylonia.
On Wed, 31 May 2023 at 08:54, Ron <ronljohnsonjr@gmail.com> wrote: > https://www.postgresql.org/about/news/postgresql-16-beta-1-released-2643/ > says "PostgreSQL 16 can also improve the performance of concurrent bulk > loading of data using COPY up to 300%." > > Since pg_dump & pg_restore use COPY (or something very similar), will the > speed increase translate to higher speeds for those utilities? I think the improvements to relation extension only help when multiple backends need to extend the relation at the same time. pg_restore can have multiple workers, but the tasks that each worker performs are only divided as far as an entire table, i.e. 2 workers will never be working on the same table at the same time. So there is no concurrency in terms of 2 or more workers working on loading data into the same table at the same time. It might be an interesting project now that we have TidRange scans, to have pg_dump split larger tables into chunks so that they can be restored in parallel. David
On Wed, May 31, 2023 at 09:14:20AM +1200, David Rowley wrote: > On Wed, 31 May 2023 at 08:54, Ron <ronljohnsonjr@gmail.com> wrote: > > https://www.postgresql.org/about/news/postgresql-16-beta-1-released-2643/ > > says "PostgreSQL 16 can also improve the performance of concurrent bulk > > loading of data using COPY up to 300%." > > > > Since pg_dump & pg_restore use COPY (or something very similar), will the > > speed increase translate to higher speeds for those utilities? > > I think the improvements to relation extension only help when multiple > backends need to extend the relation at the same time. pg_restore can > have multiple workers, but the tasks that each worker performs are > only divided as far as an entire table, i.e. 2 workers will never be > working on the same table at the same time. So there is no concurrency > in terms of 2 or more workers working on loading data into the same > table at the same time. > > It might be an interesting project now that we have TidRange scans, to > have pg_dump split larger tables into chunks so that they can be > restored in parallel. Uh, the release notes say: <!-- Author: Andres Freund <andres@anarazel.de> 2023-04-06 [00d1e02be] hio: Use ExtendBufferedRelBy() to extend tables more eff Author: Andres Freund <andres@anarazel.de> 2023-04-06 [26158b852] Use ExtendBufferedRelTo() in XLogReadBufferExtended() --> <listitem> <para> Allow more efficient addition of heap and index pages (Andres Freund) </para> </listitem> There is no mention of concurrency being a requirement. Is it wrong? I think there was a question of whether you had to add _multiple_ blocks ot get a benefit, not if concurrency was needed. This email about the release notes didn't mention the concurrent requirement: https://www.postgresql.org/message-id/20230521171341.jjxykfsefsek4kzj%40awork3.anarazel.de While the case of extending by multiple pages improved the most, even extending by a single page at a time got a good bit more scalable. Maybe just "Improve efficiency of extending relations"? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
On Wed, 31 May 2023 at 13:13, Bruce Momjian <bruce@momjian.us> wrote: > There is no mention of concurrency being a requirement. Is it wrong? I > think there was a question of whether you had to add _multiple_ blocks > ot get a benefit, not if concurrency was needed. This email about the > release notes didn't mention the concurrent requirement: My understanding had been that concurrency was required, but I see the commit message for 00d1e02be mentions: > Even single threaded > COPY is measurably faster, primarily due to not dirtying pages while > extending, if supported by the operating system (see commit 4d330a61bb1). If that's the case then maybe the beta release notes could be edited slightly to reflect this. Maybe something like: "Relation extensions have been improved allowing faster bulk loading of data using COPY. These improvements are more significant when multiple processes are concurrently loading data into the same table." The current text of "PostgreSQL 16 can also improve the performance of concurrent bulk loading of data using COPY up to 300%." does lead me to believe that nothing has been done to improve things when only a single backend is involved. David
On Wed, May 31, 2023 at 02:05:10PM +1200, David Rowley wrote: > On Wed, 31 May 2023 at 13:13, Bruce Momjian <bruce@momjian.us> wrote: > > There is no mention of concurrency being a requirement. Is it wrong? I > > think there was a question of whether you had to add _multiple_ blocks > > ot get a benefit, not if concurrency was needed. This email about the > > release notes didn't mention the concurrent requirement: > > My understanding had been that concurrency was required, but I see the > commit message for 00d1e02be mentions: > > > Even single threaded > > COPY is measurably faster, primarily due to not dirtying pages while > > extending, if supported by the operating system (see commit 4d330a61bb1). > > If that's the case then maybe the beta release notes could be edited > slightly to reflect this. Maybe something like: > > "Relation extensions have been improved allowing faster bulk loading > of data using COPY. These improvements are more significant when > multiple processes are concurrently loading data into the same table." The release notes don't normally get into details on the magnitude of the improvement in various circumstances. > The current text of "PostgreSQL 16 can also improve the performance of > concurrent bulk loading of data using COPY up to 300%." does lead me > to believe that nothing has been done to improve things when only a > single backend is involved. Yes, agreed. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
On Wed, 31 May 2023 at 14:11, Bruce Momjian <bruce@momjian.us> wrote: > > On Wed, May 31, 2023 at 02:05:10PM +1200, David Rowley wrote: > > "Relation extensions have been improved allowing faster bulk loading > > of data using COPY. These improvements are more significant when > > multiple processes are concurrently loading data into the same table." > > The release notes don't normally get into details on the magnitude of > the improvement in various circumstances. Sorry, I meant the release announcement rather than the release notes here. David
On Wed, May 31, 2023 at 02:18:25PM +1200, David Rowley wrote: > On Wed, 31 May 2023 at 14:11, Bruce Momjian <bruce@momjian.us> wrote: > > > > On Wed, May 31, 2023 at 02:05:10PM +1200, David Rowley wrote: > > > "Relation extensions have been improved allowing faster bulk loading > > > of data using COPY. These improvements are more significant when > > > multiple processes are concurrently loading data into the same table." > > > > The release notes don't normally get into details on the magnitude of > > the improvement in various circumstances. > > Sorry, I meant the release announcement rather than the release notes here. Oh, yeah, that gets into more details, sure. There is also the major features list at the top of the release notes --- that needs adjustment based on release text I have updated recently, but I don't feel I control that list. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
On Tue, May 30, 2023 at 10:28:58PM -0400, Bruce Momjian wrote: > On Wed, May 31, 2023 at 02:18:25PM +1200, David Rowley wrote: > > On Wed, 31 May 2023 at 14:11, Bruce Momjian <bruce@momjian.us> wrote: > > > > > > On Wed, May 31, 2023 at 02:05:10PM +1200, David Rowley wrote: > > > > "Relation extensions have been improved allowing faster bulk loading > > > > of data using COPY. These improvements are more significant when > > > > multiple processes are concurrently loading data into the same table." > > > > > > The release notes don't normally get into details on the magnitude of > > > the improvement in various circumstances. > > > > Sorry, I meant the release announcement rather than the release notes here. > > Oh, yeah, that gets into more details, sure. > > There is also the major features list at the top of the release notes > --- that needs adjustment based on release text I have updated recently, > but I don't feel I control that list. Oh, I now remember I added that and reworded it when I did, so I have now adjusted it to match the new OUTER parallelism text. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Hi, On 2023-05-30 21:13:08 -0400, Bruce Momjian wrote: > On Wed, May 31, 2023 at 09:14:20AM +1200, David Rowley wrote: > > On Wed, 31 May 2023 at 08:54, Ron <ronljohnsonjr@gmail.com> wrote: > > > https://www.postgresql.org/about/news/postgresql-16-beta-1-released-2643/ > > > says "PostgreSQL 16 can also improve the performance of concurrent bulk > > > loading of data using COPY up to 300%." > > > > > > Since pg_dump & pg_restore use COPY (or something very similar), will the > > > speed increase translate to higher speeds for those utilities? > > > > I think the improvements to relation extension only help when multiple > > backends need to extend the relation at the same time. pg_restore can > > have multiple workers, but the tasks that each worker performs are > > only divided as far as an entire table, i.e. 2 workers will never be > > working on the same table at the same time. So there is no concurrency > > in terms of 2 or more workers working on loading data into the same > > table at the same time. > > > > It might be an interesting project now that we have TidRange scans, to > > have pg_dump split larger tables into chunks so that they can be > > restored in parallel. > > Uh, the release notes say: > > <!-- > Author: Andres Freund <andres@anarazel.de> > 2023-04-06 [00d1e02be] hio: Use ExtendBufferedRelBy() to extend tables more eff > Author: Andres Freund <andres@anarazel.de> > 2023-04-06 [26158b852] Use ExtendBufferedRelTo() in XLogReadBufferExtended() > --> > > <listitem> > <para> > Allow more efficient addition of heap and index pages (Andres Freund) > </para> > </listitem> > > There is no mention of concurrency being a requirement. Is it wrong? I > think there was a question of whether you had to add _multiple_ blocks > ot get a benefit, not if concurrency was needed. This email about the > release notes didn't mention the concurrent requirement: > https://www.postgresql.org/message-id/20230521171341.jjxykfsefsek4kzj%40awork3.anarazel.de There's multiple improvements that work together to get the overall improvement. One part of that is filesystem interactions, another is holding the relation extension lock for a *much* shorter time. The former helps regardless of concurrency, the latter only with concurrency. Regards, Andres
On 5/30/23 10:05 PM, David Rowley wrote: > My understanding had been that concurrency was required, but I see the > commit message for 00d1e02be mentions: > >> Even single threaded >> COPY is measurably faster, primarily due to not dirtying pages while >> extending, if supported by the operating system (see commit 4d330a61bb1). > > If that's the case then maybe the beta release notes could be edited > slightly to reflect this. Maybe something like: > > "Relation extensions have been improved allowing faster bulk loading > of data using COPY. These improvements are more significant when > multiple processes are concurrently loading data into the same table." > > The current text of "PostgreSQL 16 can also improve the performance of > concurrent bulk loading of data using COPY up to 300%." does lead me > to believe that nothing has been done to improve things when only a > single backend is involved. Typically once a release announcement is out, we'll only edit it if it's inaccurate. I don't think the statement in the release announcement is inaccurate, as it specifies that concurrent bulk loading is faster. I had based the description on what Andres described in the original discussion and through reading[1], which showed a "measurable" improvement as the commit message said, but it was not to the same degree as concurrently loading. It does still seem impactful -- the results show up to 20% improvement on a single backend -- but the bigger story was around the concurrency. I'm -0.5 for revising the announcement, but I also don't want people to miss out on testing this. I'd be OK with this: "PostgreSQL 16 can also improve the performance of bulk loading of data, with some tests showing using up to 300% improvement when concurrently executing `COPY` commands." Thanks, Jonathan [1] https://www.postgresql.org/message-id/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
Attachment
On Sat, 3 Jun 2023 at 00:14, Jonathan S. Katz <jkatz@postgresql.org> wrote: > Typically once a release announcement is out, we'll only edit it if it's > inaccurate. I don't think the statement in the release announcement is > inaccurate, as it specifies that concurrent bulk loading is faster. Understood. I had thought that the policy might be that if there's room for and reason enough to make improvements, then we probably should. We do aim to still make improvements to fix any problem with the software that's the topic of the announcement, maybe it's strange that we want to lock down what we write about that software just before the beta1 release. > I'm -0.5 for revising the announcement, but I also don't want people to > miss out on testing this. I'd be OK with this: > > "PostgreSQL 16 can also improve the performance of bulk loading of data, > with some tests showing using up to 300% improvement when concurrently > executing `COPY` commands." I might have just misunderstood the release notes based on my misunderstanding of Andres's work that it only improved things when multiple backends were extending the relation at the same time. The release announcement did seem to confirm that there had to be concurrency, so it might be good to not lead anyone else down into thinking that only concurrent cases are faster. I certainly understand that's where the big wins are. I'm fine with your proposed wording. David