Thread: pg_upgrade parallelism

pg_upgrade parallelism

From: Jaime Casanova
Date: Wed, 17 Nov 2021 14:44:52 -0500
Hi,

Currently the pg_upgrade docs say:

"""
    <para>
     The <option>--jobs</option> option allows multiple CPU cores to be used
     for copying/linking of files and to dump and reload database schemas
     in parallel;  a good place to start is the maximum of the number of
     CPU cores and tablespaces.  This option can dramatically reduce the
     time to upgrade a multi-database server running on a multiprocessor
     machine.
    </para>
"""

This makes the user think that the --jobs option can use all CPU
cores, which is not true, or that it has something to do with multiple
databases, which is true only to some extent.

What that option really improves is upgrading servers with multiple
tablespaces.  Of course, if --link or --clone is used pg_upgrade is
still very fast, but with the --copy option the behavior is not what
one might expect.

As an example, a customer with a 25 TB database, 40 cores, and lots of
RAM used --jobs=35 and got only 7 processes (they have 6 tablespaces),
and the disks were not used at maximum speed either.  They expected 35
processes copying lots of files at the same time.

So, first I would like to improve the documentation.  What about
something like the attached?

Now, a couple of questions:

- in src/bin/pg_upgrade/file.c, copyFile() defines a buffer that
  determines the number of bytes used in each read()/write() when
  copying the relfilenode segments, and it is defined as (50 * BLCKSZ),
  which is 400 kB.  Isn't this too small?  (See the sketch after these
  questions.)

- why do we read()/write() at all?  Isn't there a faster way of copying
  the file?  I'm asking because I don't actually know.
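
For context, here is a minimal sketch of the loop in question (not the
actual source; error reporting is trimmed and the helper name is mine):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLCKSZ 8192
    #define COPY_BUF_SIZE (50 * BLCKSZ)    /* 400 kB, as in copyFile() */

    /* Copy src to dst with plain read()/write(); returns 0 on success. */
    static int
    copy_file_rw(const char *src, const char *dst)
    {
        char       *buf = malloc(COPY_BUF_SIZE);
        int         srcfd = open(src, O_RDONLY);
        int         dstfd = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600);
        ssize_t     nread = -1;
        int         rc = -1;

        if (buf && srcfd >= 0 && dstfd >= 0)
        {
            while ((nread = read(srcfd, buf, COPY_BUF_SIZE)) > 0)
                if (write(dstfd, buf, nread) != nread)
                    break;              /* short write: give up */
            rc = (nread == 0) ? 0 : -1; /* nread == 0 means clean EOF */
        }

        if (srcfd >= 0)
            close(srcfd);
        if (dstfd >= 0)
            close(dstfd);
        free(buf);
        return rc;
    }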

I'm trying to add more parallelism by copying individual segments
of a relfilenode in different processes.  Does anyone see a big
problem with trying to do that?  I'm asking because no one has done it
before, which might not be a good sign.

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL


Re: pg_upgrade parallelism

From: Jacob Champion
Date: Wed, 17 Nov 2021 20:04:41 +0000
On Wed, 2021-11-17 at 14:44 -0500, Jaime Casanova wrote:
> I'm trying to add more parallelism by copying individual segments
> of a relfilenode in different processes.  Does anyone see a big
> problem with trying to do that?  I'm asking because no one has done
> it before, which might not be a good sign.

I looked into speeding this up a while back, too. For the use case I
was looking at -- Greenplum, which has huge numbers of relfilenodes --
spinning disk I/O was absolutely the bottleneck and that is typically
not easily parallelizable. (In fact I felt at the time that Andres'
work on async I/O might be a better way forward, at least for some
filesystems.)

But you mentioned that you were seeing disks that weren't saturated, so
maybe some CPU optimization is still valuable? I am a little skeptical
that more parallelism is the way to do that, but numbers trump my
skepticism.

> - why do we read()/write() at all?  Isn't there a faster way of copying
>   the file?  I'm asking because I don't actually know.

I have idly wondered if something based on splice() would be faster,
but I haven't actually tried it.
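
Something like this, perhaps (an untested sketch; splice() is
Linux-only and needs a pipe as the intermediary between the two file
descriptors):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /*
     * Untested sketch: splice() moves pages between a file and a pipe
     * without copying through user space, so a pipe sits between the
     * source and destination file descriptors.
     */
    static int
    copy_file_splice(int src_fd, int dst_fd)
    {
        int         p[2];
        ssize_t     n;

        if (pipe(p) < 0)
            return -1;

        for (;;)
        {
            /* Pull up to 64 kB from the source into the pipe... */
            n = splice(src_fd, NULL, p[1], NULL, 65536, SPLICE_F_MOVE);
            if (n <= 0)
                break;                  /* 0 = EOF, < 0 = error */

            /* ...then push the same bytes from the pipe to the target. */
            if (splice(p[0], NULL, dst_fd, NULL, n, SPLICE_F_MOVE) != n)
            {
                n = -1;
                break;
            }
        }

        close(p[0]);
        close(p[1]);
        return (n == 0) ? 0 : -1;
    }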

But there is now support for copy-on-write with the clone mode, isn't
there? Or are you not able to take advantage of it?

--Jacob

Re: pg_upgrade parallelism

From: Justin Pryzby
Date: Wed, 17 Nov 2021 14:34 -0600
On Wed, Nov 17, 2021 at 02:44:52PM -0500, Jaime Casanova wrote:
> Hi,
> 
> Currently the pg_upgrade docs say:
> 
> """
>     <para>
>      The <option>--jobs</option> option allows multiple CPU cores to be used
>      for copying/linking of files and to dump and reload database schemas
>      in parallel;  a good place to start is the maximum of the number of
>      CPU cores and tablespaces.  This option can dramatically reduce the
>      time to upgrade a multi-database server running on a multiprocessor
>      machine.
>     </para>
> """
> 
> This makes the user think that the --jobs option can use all CPU
> cores, which is not true, or that it has something to do with multiple
> databases, which is true only to some extent.
> 
> What that option really improves is upgrading servers with multiple
> tablespaces.  Of course, if --link or --clone is used pg_upgrade is
> still very fast, but with the --copy option the behavior is not what
> one might expect.

> As an example, a customer with a 25 TB database, 40 cores, and lots of
> RAM used --jobs=35 and got only 7 processes (they have 6 tablespaces),
> and the disks were not used at maximum speed either.  They expected 35
> processes copying lots of files at the same time.

I would test this.  How long does it take to cp -r the data dirs vs.
pg_upgrade them?  If running 7 "cp" processes in parallel is faster than
the "copy" portion of pg_upgrade -j7, then pg_upgrade's file copying
should be optimized.

But if it's not faster, then maybe we should look at other options, like
your idea to copy relfilenodes (or their segments) in parallel.

> So, first I would like to improve the documentation.  What about
> something like the attached?

The relevant history is in commits
6f1b9e4efd94fc644f5de5377829d42e48c3c758
a89c46f9bc314ed549245d888da09b8c5cace104

--jobs originally parallelized pg_dump and pg_restore, and copying/linking
was added later.  So the docs should mention tablespaces, as you said, but
should also mention databases.  It may not be an issue for you, but pg_restore
is the slowest part of our pg_upgrades, since we have many partitions.

> Now, a couple of questions:
> 
> - in src/bin/pg_upgrade/file.c, copyFile() defines a buffer that
>   determines the number of bytes used in each read()/write() when
>   copying the relfilenode segments, and it is defined as (50 * BLCKSZ),
>   which is 400 kB.  Isn't this too small?

Maybe - you'll have to check :)

> - why do we read()/write() at all?  Isn't there a faster way of copying
>   the file?  I'm asking because I don't actually know.

No portable way.  Linux has this:
https://man7.org/linux/man-pages/man2/copy_file_range.2.html

But I just read:

|       First support for cross-filesystem copies was introduced in Linux
|       5.3.  Older kernels will return -EXDEV when cross-filesystem
|       copies are attempted.

To me that sounds like it may not be worth it, at least not quite yet.
But it would be good to test.
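
If someone does test it, a minimal loop might look like this (a sketch
only, assuming the caller falls back to the existing read()/write()
path when EXDEV is reported):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /*
     * Sketch: copy src_fd to dst_fd with copy_file_range() (Linux 4.5+).
     * Returns -2 on EXDEV so the caller can fall back to read()/write()
     * on pre-5.3 kernels when the files are on different filesystems.
     */
    static int
    copy_file_cfr(int src_fd, int dst_fd)
    {
        struct stat st;
        off_t       remaining;
        ssize_t     n;

        if (fstat(src_fd, &st) < 0)
            return -1;

        for (remaining = st.st_size; remaining > 0; remaining -= n)
        {
            n = copy_file_range(src_fd, NULL, dst_fd, NULL, remaining, 0);
            if (n < 0)
                return (errno == EXDEV) ? -2 : -1;
            if (n == 0)
                break;                  /* unexpected EOF */
        }
        return 0;
    }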

> I'm trying to add more parallelism by copying individual segments
> of a relfilenode in different processes.  Does anyone see a big
> problem with trying to do that?  I'm asking because no one has done
> it before, which might not be a good sign.

My concern would be that if there are too many jobs and the storage bogs
down, it could end up slower.

I think something like that should have a separate option, not just --jobs,
something like --parallel-in-tablespace.  The original implementation puts
processes across CPUs (for pg_dump/restore) and tablespaces (for I/O).
Maybe it should be possible to control those with separate options, too.

FWIW, we typically have only one database of any significance, but we do use
tablespaces, and I've used pg_upgrade --link since c. v9.0.  --jobs probably
helps pg_dump/restore at the few customers who have multiple DBs.  But it
probably doesn't help to parallelize --link across tablespaces (since our
tablespaces are actually on the same storage devices, but with different
filesystems).  I anticipate it might even make a few customers' upgrades a
bit slower, since --link is a metadata operation and probably involves a lot
of FS barriers, which the storage may be inadequate to support in parallel.

-- 
Justin



Re: pg_upgrade parallelism

From: Bruce Momjian
On Wed, Nov 17, 2021 at 02:44:52PM -0500, Jaime Casanova wrote:
> Hi,
> 
> Currently the pg_upgrade docs say:
> 
> """
>     <para>
>      The <option>--jobs</option> option allows multiple CPU cores to be used
>      for copying/linking of files and to dump and reload database schemas
>      in parallel;  a good place to start is the maximum of the number of
>      CPU cores and tablespaces.  This option can dramatically reduce the
>      time to upgrade a multi-database server running on a multiprocessor
>      machine.
>     </para>
> """
> 
> This makes the user think that the --jobs option can use all CPU
> cores, which is not true, or that it has something to do with multiple
> databases, which is true only to some extent.

Uh, the behavior is a little more complicated.  The --jobs option in
pg_upgrade is used to parallelize three operations:

*  copying relation files

*  dumping old cluster objects (via parallel_exec_prog())

*  creating objects in the new cluster (via parallel_exec_prog())

The last two basically operate on databases in parallel --- they can't
dump/load a single database in parallel, but they can dump/load several
databases in parallel.
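
Conceptually, the model for those two is one child per database, capped
at the --jobs limit; a rough sketch of the pattern (not the actual
parallel_exec_prog() code):

    #include <sys/wait.h>
    #include <unistd.h>

    /*
     * Rough sketch of the fork-per-database pattern: run one command
     * per database, with at most n_jobs children alive at a time.
     */
    static void
    run_per_database(char **cmds, int n_dbs, int n_jobs)
    {
        int         active = 0;

        for (int i = 0; i < n_dbs; i++)
        {
            if (active == n_jobs)
            {
                wait(NULL);             /* block until a child finishes */
                active--;
            }
            if (fork() == 0)
            {
                execl("/bin/sh", "sh", "-c", cmds[i], (char *) NULL);
                _exit(1);               /* exec failed */
            }
            active++;
        }
        while (active-- > 0)            /* reap the stragglers */
            wait(NULL);
    }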

The documentation you quote above is saying that you set jobs based on
the number of CPUs (for dump/reload, which are assumed to be CPU-bound)
and the number of tablespaces (which are assumed to be I/O-bound).

I am not sure how we can improve that text.  We could just say the max
of the number of databases and tablespaces, but then the number of CPUs
needs to be involved since, if you only have one CPU core, you don't
want parallel dumps/loads happening, as that will just cause CPU
contention with little benefit.  We mention tablespaces because even if
you only have one CPU core, since tablespace copying is I/O-bound, you
can still benefit from --jobs.

> What that option really improves is upgrading servers with multiple
> tablespaces.  Of course, if --link or --clone is used pg_upgrade is
> still very fast, but with the --copy option the behavior is not what
> one might expect.
> 
> As an example, a customer with a 25 TB database, 40 cores, and lots of
> RAM used --jobs=35 and got only 7 processes (they have 6 tablespaces),
> and the disks were not used at maximum speed either.  They expected 35
> processes copying lots of files at the same time.
> 
> So, first I would like to improve the documentation.  What about
> something like the attached?
> 
> Now, a couple of questions:
> 
> - in src/bin/pg_upgrade/file.c, copyFile() defines a buffer that
>   determines the number of bytes used in each read()/write() when
>   copying the relfilenode segments, and it is defined as (50 * BLCKSZ),
>   which is 400 kB.  Isn't this too small?

Uh, if you find that increasing that helps, we can increase it --- I
don't know how that value was chosen.  However, we are really just
copying the data into the kernel, not forcing it to storage, so I don't
know if a larger value would help.

> - why do we read()/write() at all?  Isn't there a faster way of copying
>   the file?  I'm asking because I don't actually know.

Uh, we could use buffered I/O, I guess, but again, would there be a
benefit?

> I'm trying to add more parallelism by copying individual segments
> of a relfilenode in different processes.  Does anyone see a big
> problem with trying to do that?  I'm asking because no one has done
> it before, which might not be a good sign.

I think we were assuming the copy would be I/O bound and that
parallelism wouldn't help in a single tablespace.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: pg_upgrade parallelism

From: Jacob Champion
Date: Tue, 23 Nov 2021 18:54:03 +0000
On Wed, 2021-11-17 at 14:34 -0600, Justin Pryzby wrote:
> On Wed, Nov 17, 2021 at 02:44:52PM -0500, Jaime Casanova wrote:
> > 
> > - why do we read()/write() at all?  Isn't there a faster way of copying
> >   the file?  I'm asking because I don't actually know.
> 
> No portable way.  Linux has this:
> https://man7.org/linux/man-pages/man2/copy_file_range.2.html
> 
> But I just read:
> 
> >       First support for cross-filesystem copies was introduced in Linux
> >       5.3.  Older kernels will return -EXDEV when cross-filesystem
> >       copies are attempted.
> 
> To me that sounds like it may not be worth it, at least not quite yet.
> But it would be good to test.

I think a downside of copy_file_range() is that filesystems might
perform a reflink under us, and to me that seems like something that
needs to be opted into via clone mode.

(https://lwn.net/Articles/846403/ is also good reading on some sharp
edges, though I doubt many of them apply to our use case.)

--Jacob

Re: pg_upgrade parallelism

From: Justin Pryzby
Date: Tue, 23 Nov 2021 13:51 -0600
On Tue, Nov 23, 2021 at 06:54:03PM +0000, Jacob Champion wrote:
> On Wed, 2021-11-17 at 14:34 -0600, Justin Pryzby wrote:
> > On Wed, Nov 17, 2021 at 02:44:52PM -0500, Jaime Casanova wrote:
> > > 
> > > - why do we read()/write() at all?  Isn't there a faster way of copying
> > >   the file?  I'm asking because I don't actually know.
> > 
> > No portable way.  Linux has this:
> > https://man7.org/linux/man-pages/man2/copy_file_range.2.html
> > 
> > But I just read:
> > 
> > >       First support for cross-filesystem copies was introduced in Linux
> > >       5.3.  Older kernels will return -EXDEV when cross-filesystem
> > >       copies are attempted.
> > 
> > To me that sounds like it may not be worth it, at least not quite yet.
> > But it would be good to test.

I realized that pg_upgrade doesn't copy between filesystems - it copies from
$tablespace/PG13/NNN to $tablespace/PG14/NNN.  So that's no issue.

And I did a bit of testing with this last weekend, and saw no performance
benefit from a larger buffer size, nor from copy_file_range(), nor from libc
stdio (fopen/fread/fwrite/fclose).

> I think a downside of copy_file_range() is that filesystems might
> perform a reflink under us, and to me that seems like something that
> needs to be opted into via clone mode.

You're referring to this:

|       copy_file_range()  gives  filesystems an opportunity to implement "copy
|    acceleration" techniques, such as the use of reflinks (i.e., two or more
|    i-nodes that share pointers to the same copy-on-write disk blocks) or
|    server-side-copy (in the case of NFS).

I don't see why that's an issue, though?  It's COW, not a hardlink.  It'd be
the same as if the filesystem implemented deduplication, right?  postgres
shouldn't notice nor care.

I guess you're concerned about someone who wants to be able to run pg_upgrade
and preserve the ability to start the old cluster in addition to the new.
But that'd work fine on a COW filesystem, right?

> (https://lwn.net/Articles/846403/ is also good reading on some sharp
> edges, though I doubt many of them apply to our use case.)

Yea, it doesn't seem the issues are relevant, other than to indicate that the
syscall is still evolving, which supports my initial conclusion.

-- 
Justin



Re: pg_upgrade parallelism

From
Jacob Champion
Date:
On Tue, 2021-11-23 at 13:51 -0600, Justin Pryzby wrote:
> 
> I guess you're concerned about someone who wants to be able to run pg_upgrade
> and preserve the ability to start the old cluster in addition to the new.

Right. What I'm worried about is, if disk space or write performance on
the new cluster is a concern, then having a copy-mode upgrade silently
use copy-on-write could be a problem if the DBA needs copy mode to
actually copy.

--Jacob

Re: pg_upgrade parallelism

From: Tom Lane
Jacob Champion <pchampion@vmware.com> writes:
> Right. What I'm worried about is, if disk space or write performance on
> the new cluster is a concern, then having a copy-mode upgrade silently
> use copy-on-write could be a problem if the DBA needs copy mode to
> actually copy.

Particularly for the cross-filesystem case, where it would not be
unreasonable to expect that one could dismount or destroy the old FS
immediately afterward.  I don't know if recent kernels try to make
that safe/transparent.

            regards, tom lane



Re: pg_upgrade parallelism

From: Jaime Casanova
On Wed, Nov 17, 2021 at 08:04:41PM +0000, Jacob Champion wrote:
> On Wed, 2021-11-17 at 14:44 -0500, Jaime Casanova wrote:
> > I'm trying to add more parallelism by copying individual segments
> > of a relfilenode in different processes.  Does anyone see a big
> > problem with trying to do that?  I'm asking because no one has done
> > it before, which might not be a good sign.
> 
> I looked into speeding this up a while back, too. For the use case I
> was looking at -- Greenplum, which has huge numbers of relfilenodes --
> spinning disk I/O was absolutely the bottleneck and that is typically
> not easily parallelizable. (In fact I felt at the time that Andres'
> work on async I/O might be a better way forward, at least for some
> filesystems.)
> 
> But you mentioned that you were seeing disks that weren't saturated, so
> maybe some CPU optimization is still valuable? I am a little skeptical
> that more parallelism is the way to do that, but numbers trump my
> skepticism.
> 

Sorry for being unresponsive for so long.  I added a new --jobs-per-disk
option; this is a simple patch I made for the customer, and I ignored all
the WIN32 parts because I don't know anything about that area.  I wanted
to complete that part, but it has been in the same state for two months now.

AFAIU, on WIN32 there is a different struct for the parameters of the
function that gets called on the thread.

I also decided to create a new reap_*_child() function to use with
the new parameter.
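
The shape of the approach is roughly this (an illustrative sketch with
made-up names, not the patch itself; copy_file() stands in for whatever
copy routine is in use):

    #include <stdbool.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Stand-in for the real copy routine; returns 0 on success. */
    extern int  copy_file(const char *src, const char *dst);

    static int  n_copy_children = 0;

    /* Reap one finished child, blocking only if asked to. */
    static void
    reap_copy_child(bool wait_for_child)
    {
        if (waitpid(-1, NULL, wait_for_child ? 0 : WNOHANG) > 0)
            n_copy_children--;
    }

    /* Fork one child per segment, capped at jobs_per_disk children. */
    static void
    copy_segment_async(const char *src, const char *dst, int jobs_per_disk)
    {
        while (n_copy_children >= jobs_per_disk)
            reap_copy_child(true);      /* wait for a slot to open */

        if (fork() == 0)
            _exit(copy_file(src, dst) == 0 ? 0 : 1);
        n_copy_children++;
    }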

Now, the customer went from copying the 25 TB in 6 hours to 4 hours
45 minutes, an improvement of about 20%!


> > - why do we read()/write() at all?  Isn't there a faster way of copying
> >   the file?  I'm asking because I don't actually know.
> 
> I have idly wondered if something based on splice() would be faster,
> but I haven't actually tried it.
> 

I tried it and got no better result.

> But there is now support for copy-on-write with the clone mode, isn't
> there? Or are you not able to take advantage of it?
> 

That's sadly not possible, because those are different disks.  And yes,
I know that's something pg_upgrade normally doesn't allow, but it's not
difficult to make it happen.

-- 
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
