Thread: pg_upgrade parallelism
Hi,

Currently the docs about pg_upgrade say:

"""
<para>
 The <option>--jobs</option> option allows multiple CPU cores to be used
 for copying/linking of files and to dump and reload database schemas
 in parallel;  a good place to start is the maximum of the number of
 CPU cores and tablespaces.  This option can dramatically reduce the
 time to upgrade a multi-database server running on a multiprocessor
 machine.
</para>
"""

This makes the user think that the --jobs option can use all CPU cores,
which is not true, or that it has something to do with multiple
databases, which is true only to some extent.

What that option really improves is upgrading servers with multiple
tablespaces. Of course, if --link or --clone is used pg_upgrade is still
very fast, but with the --copy option it is not what one would expect.

As an example, a customer with a 25TB database, 40 cores and lots of RAM
used --jobs=35 and got only 7 processes (they have 6 tablespaces), and
the disks were not used at maximum speed either. They expected 35
processes copying lots of files at the same time.

So, first I would like to improve the documentation. What about
something like the attached?

Now, a couple of questions:

- In src/bin/pg_upgrade/file.c, copyFile() defines a buffer that
  determines the number of bytes used per read()/write() when copying
  the relfilenode segments, and it is defined as (50 * BLCKSZ), which is
  400KB. Isn't this too small? (See the sketch below.)

- Why do we read()/write() at all? Isn't there a faster way of copying
  the file? I'm asking because I don't actually know.

I'm trying to add more parallelism by copying individual segments of a
relfilenode in different processes. Does anyone see a big problem in
trying to do that? I'm asking because no one has done it before, which
might not be a good sign.

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
Attachment
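For context, the loop in question works roughly like the following
simplified sketch. This is not the exact copyFile() source; error
handling, pg_fatal() reporting and Windows-specific branches are
omitted, but the (50 * BLCKSZ) buffer and the read()/write() shape
follow the description above.

/*
 * Simplified sketch of a copyFile()-style loop: copy one relfilenode
 * segment in 50 * BLCKSZ chunks using plain read()/write().
 */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192                     /* PostgreSQL default block size */
#define COPY_BUF_SIZE (50 * BLCKSZ)     /* the 400KB buffer in question */

static int
copy_file_sketch(const char *src, const char *dst)
{
    int     src_fd, dst_fd;
    char   *buffer;
    ssize_t nbytes;

    if ((src_fd = open(src, O_RDONLY, 0)) < 0)
        return -1;
    if ((dst_fd = open(dst, O_RDWR | O_CREAT | O_EXCL, 0600)) < 0)
    {
        close(src_fd);
        return -1;
    }
    if ((buffer = malloc(COPY_BUF_SIZE)) == NULL)
    {
        close(src_fd);
        close(dst_fd);
        return -1;
    }

    /* read()/write() in COPY_BUF_SIZE chunks until EOF */
    while ((nbytes = read(src_fd, buffer, COPY_BUF_SIZE)) > 0)
    {
        if (write(dst_fd, buffer, nbytes) != nbytes)
            break;
    }

    free(buffer);
    close(src_fd);
    close(dst_fd);
    return (nbytes == 0) ? 0 : -1;
}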
On Wed, 2021-11-17 at 14:44 -0500, Jaime Casanova wrote:
> I'm trying to add more parallelism by copying individual segments
> of a relfilenode in different processes. Does anyone see a big
> problem in trying to do that? I'm asking because no one has done it
> before, which might not be a good sign.

I looked into speeding this up a while back, too. For the use case I
was looking at -- Greenplum, which has huge numbers of relfilenodes --
spinning disk I/O was absolutely the bottleneck, and that is typically
not easily parallelizable. (In fact I felt at the time that Andres'
work on async I/O might be a better way forward, at least for some
filesystems.)

But you mentioned that you were seeing disks that weren't saturated, so
maybe some CPU optimization is still valuable? I am a little skeptical
that more parallelism is the way to do that, but numbers trump my
skepticism.

> - Why do we read()/write() at all? Isn't there a faster way of copying
>   the file? I'm asking because I don't actually know.

I have idly wondered if something based on splice() would be faster,
but I haven't actually tried it.

But there is now support for copy-on-write with the clone mode, isn't
there? Or are you not able to take advantage of it?

--Jacob
On Wed, Nov 17, 2021 at 02:44:52PM -0500, Jaime Casanova wrote:
> Hi,
>
> Currently the docs about pg_upgrade say:
>
> """
> <para>
>  The <option>--jobs</option> option allows multiple CPU cores to be used
>  for copying/linking of files and to dump and reload database schemas
>  in parallel;  a good place to start is the maximum of the number of
>  CPU cores and tablespaces.  This option can dramatically reduce the
>  time to upgrade a multi-database server running on a multiprocessor
>  machine.
> </para>
> """
>
> This makes the user think that the --jobs option can use all CPU cores,
> which is not true, or that it has something to do with multiple
> databases, which is true only to some extent.
>
> What that option really improves is upgrading servers with multiple
> tablespaces. Of course, if --link or --clone is used pg_upgrade is still
> very fast, but with the --copy option it is not what one would expect.
>
> As an example, a customer with a 25TB database, 40 cores and lots of RAM
> used --jobs=35 and got only 7 processes (they have 6 tablespaces), and
> the disks were not used at maximum speed either. They expected 35
> processes copying lots of files at the same time.

I would test this. How long does it take to cp -r the data dirs vs
pg_upgrade them? If running 7 "cp" in parallel is faster than the "copy"
portion of pg_upgrade -j7, then pg_upgrade's file copy should be
optimized. But if it's not faster, then maybe you should look at other
options, like your idea to copy filenodes (or their segments) in
parallel.

> So, first I would like to improve the documentation. What about
> something like the attached?

The relevant history is in commits 6f1b9e4efd94fc644f5de5377829d42e48c3c758
and a89c46f9bc314ed549245d888da09b8c5cace104.

--jobs originally parallelized pg_dump and pg_restore, and copying/linking
was added later. So the docs should mention tablespaces, as you said, but
should also mention databases. It may not be an issue for you, but
pg_restore is the slowest part of our pg_upgrades, since we have many
partitions.

> Now, a couple of questions:
>
> - In src/bin/pg_upgrade/file.c, copyFile() defines a buffer that
>   determines the number of bytes used per read()/write() when copying
>   the relfilenode segments, and it is defined as (50 * BLCKSZ), which is
>   400KB. Isn't this too small?

Maybe - you'll have to check :)

> - Why do we read()/write() at all? Isn't there a faster way of copying
>   the file? I'm asking because I don't actually know.

No portable way. Linux has this:
https://man7.org/linux/man-pages/man2/copy_file_range.2.html

But I just read:

| First support for cross-filesystem copies was introduced in Linux
| 5.3.  Older kernels will return -EXDEV when cross-filesystem
| copies are attempted.

To me that sounds like it may not be worth it, at least not quite yet.
But it would be good to test.

> I'm trying to add more parallelism by copying individual segments
> of a relfilenode in different processes. Does anyone see a big
> problem in trying to do that? I'm asking because no one has done it
> before, which might not be a good sign.

My concern would be that if there are too many jobs and the storage bogs
down, it could be slower. I think something like that should have a
separate option, not just --jobs. Like --parallel-in-tablespace.

The original implementation puts processes across CPUs (for
pg_dump/restore) and tablespaces (for I/O). Maybe it should be possible
to control those with separate options, too.
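A minimal sketch of what a copy based on copy_file_range(2) could look
like, keeping in mind the EXDEV limitation quoted above. The helper name
is hypothetical and this is not pg_upgrade code; it only illustrates the
syscall being discussed.

/*
 * Minimal sketch of a copy using copy_file_range(2) (Linux, glibc 2.27+).
 * On EXDEV (older kernels, cross-filesystem copies) a caller would fall
 * back to a plain read()/write() loop like the one sketched earlier.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

static int
copy_range_sketch(int src_fd, int dst_fd)
{
    struct stat st;
    off_t       remaining;
    ssize_t     copied;

    if (fstat(src_fd, &st) < 0)
        return -1;

    remaining = st.st_size;
    while (remaining > 0)
    {
        /* let the kernel move the bytes, possibly without a userspace copy */
        copied = copy_file_range(src_fd, NULL, dst_fd, NULL, remaining, 0);
        if (copied < 0)
            return -1;          /* errno == EXDEV means: fall back to read()/write() */
        if (copied == 0)
            break;              /* unexpected EOF on the source */
        remaining -= copied;
    }
    return (remaining == 0) ? 0 : -1;
}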
FWIW, we typically have only one database of any significance, but we do
use tablespaces, and I've used pg_upgrade --link since c. v9.0.

--jobs probably helps pg_dump/restore at a few customers who have
multiple DBs. But it probably doesn't help to parallelize --link across
tablespaces (since our tablespaces are actually on the same storage
devices, just with different filesystems). I anticipate it might even
make a few customers' upgrades a bit slower, since --link is a metadata
operation and probably involves a lot of FS barriers, which the storage
may be inadequate to support in parallel.

--
Justin
On Wed, Nov 17, 2021 at 02:44:52PM -0500, Jaime Casanova wrote:
> Hi,
>
> Currently the docs about pg_upgrade say:
>
> """
> <para>
>  The <option>--jobs</option> option allows multiple CPU cores to be used
>  for copying/linking of files and to dump and reload database schemas
>  in parallel;  a good place to start is the maximum of the number of
>  CPU cores and tablespaces.  This option can dramatically reduce the
>  time to upgrade a multi-database server running on a multiprocessor
>  machine.
> </para>
> """
>
> This makes the user think that the --jobs option can use all CPU cores,
> which is not true, or that it has something to do with multiple
> databases, which is true only to some extent.

Uh, the behavior is a little more complicated. The --jobs option in
pg_upgrade is used to parallelize three operations:

* copying relation files
* dumping old cluster objects (via parallel_exec_prog())
* creating objects in the new cluster (via parallel_exec_prog())

The last two basically operate on databases in parallel --- they can't
dump/load a single database in parallel, but they can dump/load several
databases in parallel.

The documentation you quote above is saying that you set jobs based on
the number of CPUs (for dump/reload, which is assumed to be CPU bound)
and the number of tablespaces (which is assumed to be I/O bound). I am
not sure how we can improve that text. We could just say the max of the
number of databases and tablespaces, but then the number of CPUs needs
to be involved since, if you only have one CPU core, you don't want
parallel dumps/loads happening since that will just cause CPU contention
with little benefit. We mention tablespaces because, even if you only
have one CPU core, tablespace copying is I/O bound, so you can still
benefit from --jobs.

> What that option really improves is upgrading servers with multiple
> tablespaces. Of course, if --link or --clone is used pg_upgrade is still
> very fast, but with the --copy option it is not what one would expect.
>
> As an example, a customer with a 25TB database, 40 cores and lots of RAM
> used --jobs=35 and got only 7 processes (they have 6 tablespaces), and
> the disks were not used at maximum speed either. They expected 35
> processes copying lots of files at the same time.
>
> So, first I would like to improve the documentation. What about
> something like the attached?
>
> Now, a couple of questions:
>
> - In src/bin/pg_upgrade/file.c, copyFile() defines a buffer that
>   determines the number of bytes used per read()/write() when copying
>   the relfilenode segments, and it is defined as (50 * BLCKSZ), which is
>   400KB. Isn't this too small?

Uh, if you find that increasing that helps, we can increase it --- I
don't know how that value was chosen. However, we are really just
copying the data into the kernel, not forcing it to storage, so I don't
know if a larger value would help.

> - Why do we read()/write() at all? Isn't there a faster way of copying
>   the file? I'm asking because I don't actually know.

Uh, we could use buffered I/O, I guess, but again, would there be a
benefit?

> I'm trying to add more parallelism by copying individual segments
> of a relfilenode in different processes. Does anyone see a big
> problem in trying to do that? I'm asking because no one has done it
> before, which might not be a good sign.

I think we were assuming the copy would be I/O bound and that
parallelism wouldn't help in a single tablespace.
--
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.
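To make the dispatch model described above a bit more concrete, here is
a rough sketch of "at most --jobs children at a time, each handling one
database or one tablespace". run_in_parallel() and work_fn are made-up
names for illustration; this is not the actual parallel_exec_prog() /
reap_child() code in pg_upgrade.

/*
 * Rough sketch of the --jobs dispatch model: fork at most n_jobs children,
 * each handling one unit of work (a database to dump/restore, or a
 * tablespace to copy).  Illustrative only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void (*work_fn)(const char *unit);

static void
run_in_parallel(const char **units, int n_units, int n_jobs, work_fn fn)
{
    int running = 0;

    for (int i = 0; i < n_units; i++)
    {
        if (running >= n_jobs)
        {
            wait(NULL);         /* block until one child finishes */
            running--;
        }

        pid_t pid = fork();
        if (pid == 0)
        {
            fn(units[i]);       /* child: process one database/tablespace */
            _exit(0);
        }
        else if (pid > 0)
            running++;
        else
        {
            perror("fork");
            exit(1);
        }
    }

    while (running-- > 0)       /* reap the remaining children */
        wait(NULL);
}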
On Wed, 2021-11-17 at 14:34 -0600, Justin Pryzby wrote:
> On Wed, Nov 17, 2021 at 02:44:52PM -0500, Jaime Casanova wrote:
> >
> > - Why do we read()/write() at all? Isn't there a faster way of copying
> >   the file? I'm asking because I don't actually know.
>
> No portable way. Linux has this:
> https://man7.org/linux/man-pages/man2/copy_file_range.2.html
>
> But I just read:
>
> | First support for cross-filesystem copies was introduced in Linux
> | 5.3.  Older kernels will return -EXDEV when cross-filesystem
> | copies are attempted.
>
> To me that sounds like it may not be worth it, at least not quite yet.
> But it would be good to test.

I think a downside of copy_file_range() is that filesystems might
perform a reflink under us, and to me that seems like something that
needs to be opted into via clone mode.

(https://lwn.net/Articles/846403/ is also good reading on some sharp
edges, though I doubt many of them apply to our use case.)

--Jacob
On Tue, Nov 23, 2021 at 06:54:03PM +0000, Jacob Champion wrote:
> On Wed, 2021-11-17 at 14:34 -0600, Justin Pryzby wrote:
> > On Wed, Nov 17, 2021 at 02:44:52PM -0500, Jaime Casanova wrote:
> > >
> > > - Why do we read()/write() at all? Isn't there a faster way of copying
> > >   the file? I'm asking because I don't actually know.
> >
> > No portable way. Linux has this:
> > https://man7.org/linux/man-pages/man2/copy_file_range.2.html
> >
> > But I just read:
> >
> > | First support for cross-filesystem copies was introduced in Linux
> > | 5.3.  Older kernels will return -EXDEV when cross-filesystem
> > | copies are attempted.
> >
> > To me that sounds like it may not be worth it, at least not quite yet.
> > But it would be good to test.

I realized that pg_upgrade doesn't copy between filesystems - it copies
from $tablespace/PG13/NNN to $tablespace/PG14/NNN. So that's no issue.

And I did a bit of testing with this last weekend, and saw no performance
benefit from a larger buffer size, nor from copy_file_range(), nor from
libc stdio (fopen/fread/fwrite/fclose).

> I think a downside of copy_file_range() is that filesystems might
> perform a reflink under us, and to me that seems like something that
> needs to be opted into via clone mode.

You're referring to this:

| copy_file_range() gives filesystems an opportunity to implement "copy
| acceleration" techniques, such as the use of reflinks (i.e., two or more
| i-nodes that share pointers to the same copy-on-write disk blocks) or
| server-side-copy (in the case of NFS).

I don't see why that's an issue, though. It's COW, not a hardlink. It'd
be the same as if the filesystem implemented deduplication, right?
Postgres shouldn't notice or care.

I guess you're concerned for someone who wants to be able to run
pg_upgrade and preserve the ability to start the old cluster in addition
to the new. But that'd work fine on a COW filesystem, right?

> (https://lwn.net/Articles/846403/ is also good reading on some sharp
> edges, though I doubt many of them apply to our use case.)

Yeah, it doesn't seem like those issues are relevant, other than to
indicate that the syscall is still evolving, which supports my initial
conclusion.

--
Justin
On Tue, 2021-11-23 at 13:51 -0600, Justin Pryzby wrote:
>
> I guess you're concerned for someone who wants to be able to run
> pg_upgrade and preserve the ability to start the old cluster in addition
> to the new.

Right. What I'm worried about is, if disk space or write performance on
the new cluster is a concern, then having a copy-mode upgrade silently
use copy-on-write could be a problem if the DBA needs copy mode to
actually copy.

--Jacob
Jacob Champion <pchampion@vmware.com> writes:
> Right. What I'm worried about is, if disk space or write performance on
> the new cluster is a concern, then having a copy-mode upgrade silently
> use copy-on-write could be a problem if the DBA needs copy mode to
> actually copy.

Particularly for the cross-filesystem case, where it would not be
unreasonable to expect that one could dismount or destroy the old FS
immediately afterward. I don't know if recent kernels try to make that
safe/transparent.

        regards, tom lane
On Wed, Nov 17, 2021 at 08:04:41PM +0000, Jacob Champion wrote:
> On Wed, 2021-11-17 at 14:44 -0500, Jaime Casanova wrote:
> > I'm trying to add more parallelism by copying individual segments
> > of a relfilenode in different processes. Does anyone see a big
> > problem in trying to do that? I'm asking because no one has done it
> > before, which might not be a good sign.
>
> I looked into speeding this up a while back, too. For the use case I
> was looking at -- Greenplum, which has huge numbers of relfilenodes --
> spinning disk I/O was absolutely the bottleneck, and that is typically
> not easily parallelizable. (In fact I felt at the time that Andres'
> work on async I/O might be a better way forward, at least for some
> filesystems.)
>
> But you mentioned that you were seeing disks that weren't saturated, so
> maybe some CPU optimization is still valuable? I am a little skeptical
> that more parallelism is the way to do that, but numbers trump my
> skepticism.

Sorry for being unresponsive for so long.

I did add a new --jobs-per-disk option. This is a simple patch I made
for the customer; I ignored all the WIN32 parts because I don't know
anything about that part. I wanted to complete it, but it has been in
the same state for two months now. AFAIU, there is a different struct
for the parameters of the function that will be called on the thread.
I also decided to create a new reap_*_child() function to use with the
new parameter.

Now, the customer went from copying 25TB in 6 hours to 4h 45min, which
is an improvement of about 20%!

> > - Why do we read()/write() at all? Isn't there a faster way of copying
> >   the file? I'm asking because I don't actually know.
>
> I have idly wondered if something based on splice() would be faster,
> but I haven't actually tried it.

I tried it and got no better result.

> But there is now support for copy-on-write with the clone mode, isn't
> there? Or are you not able to take advantage of it?

That's sadly not possible because those are different disks, and yes, I
know that's something pg_upgrade normally doesn't allow, but it is not
difficult to make that happen.

--
Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
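To make the per-segment idea concrete, here is a rough sketch of
dispatching the segment files of one relfilenode (relfilenode,
relfilenode.1, relfilenode.2, ...) to separate child processes, capped
by a "jobs per disk" limit. The names copy_segment(),
copy_relfilenode_parallel() and jobs_per_disk are hypothetical and this
is not the actual --jobs-per-disk patch; it only illustrates the
approach being discussed.

/*
 * Rough sketch: copy each segment of one relfilenode in its own child
 * process, with at most jobs_per_disk children at a time.
 */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* placeholder: in a real patch this would be the read()/write() copy loop */
static void
copy_segment(const char *src, const char *dst)
{
    fprintf(stderr, "copying %s -> %s\n", src, dst);
}

static void
copy_relfilenode_parallel(const char *src_base, const char *dst_base,
                          int jobs_per_disk)
{
    int running = 0;

    for (int segno = 0;; segno++)
    {
        char        src[1024], dst[1024];
        struct stat st;

        /* segment 0 has no suffix; later segments are ".1", ".2", ... */
        if (segno == 0)
        {
            snprintf(src, sizeof(src), "%s", src_base);
            snprintf(dst, sizeof(dst), "%s", dst_base);
        }
        else
        {
            snprintf(src, sizeof(src), "%s.%d", src_base, segno);
            snprintf(dst, sizeof(dst), "%s.%d", dst_base, segno);
        }

        if (stat(src, &st) != 0)
            break;              /* no more segments */

        if (running >= jobs_per_disk)
        {
            wait(NULL);         /* wait for a slot to free up */
            running--;
        }

        pid_t pid = fork();
        if (pid == 0)
        {
            copy_segment(src, dst);
            _exit(0);
        }
        else if (pid < 0)
        {
            perror("fork");
            break;
        }
        running++;
    }

    while (running-- > 0)       /* reap the remaining children */
        wait(NULL);
}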