Thread: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward
[PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward
From: Dimitrios Apostolou
Date:
Hello list,

I'm submitting a patch that addresses an almost 1-hour-long pause at the start of a parallel pg_restore of a big archive. Related discussion has taken place on the pgsql-performance mailing list at:

https://www.postgresql.org/message-id/flat/6bd16bdb-aa5e-0512-739d-b84100596035%40gmx.net

I think I explain it rather well in the commit message, so I paste it inline:

Improve the performance of parallel pg_restore (-j) from a custom format pg_dump archive that does not include data offsets, which typically happens when pg_dump has generated it by writing to stdout instead of a file.

In this case pg_restore workers get stuck in a constant loop of reading small chunks (4KB) and seeking forward short distances (around 10KB for a compressed archive):

    read(4, "..."..., 4096) = 4096
    lseek(4, 55544369152, SEEK_SET) = 55544369152
    read(4, "..."..., 4096) = 4096
    lseek(4, 55544381440, SEEK_SET) = 55544381440
    read(4, "..."..., 4096) = 4096
    lseek(4, 55544397824, SEEK_SET) = 55544397824
    read(4, "..."..., 4096) = 4096
    lseek(4, 55544414208, SEEK_SET) = 55544414208
    read(4, "..."..., 4096) = 4096
    lseek(4, 55544426496, SEEK_SET) = 55544426496

This happens because each worker scans the whole file until it finds the entry it wants, skipping forward over each block. In combination with the small block size of the custom format dump, this causes many seeks and low performance.

Fix by avoiding forward seeks for jumps shorter than 1MB; do sequential reads instead.

The performance gain can be significant, depending on the size of the dump and the I/O subsystem. On my local NVMe drive, read speeds for that phase of pg_restore increased from 150MB/s to 3GB/s.

This is my first patch submission, so all help is much appreciated.

Regards,
Dimitris

P.S. What is the recommended way to test a change, besides a generic make check? And how do I selectively run only the pg_dump/restore tests, in order to speed up my development routine?
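The fix described in the commit message above can be sketched as a small C helper. This is a minimal illustration under stated assumptions, not the code from the attached patch: the names `skip_forward`, `SKIP_READ_THRESHOLD` and the `can_seek` flag are hypothetical, and it assumes a stdio `FILE *` stream with POSIX `fseeko()`. Forward jumps shorter than the threshold are done by reading and discarding data, so the access pattern stays sequential; longer jumps (when the input is seekable) still use a seek.

```c
#define _POSIX_C_SOURCE 200809L
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical threshold: forward jumps shorter than this are done by
 * reading and discarding data instead of seeking. 1MB is the value
 * proposed in this thread.
 */
#define SKIP_READ_THRESHOLD ((off_t) 1024 * 1024)

/*
 * Hypothetical helper (not the actual pg_restore code): advance the
 * position of fp by len bytes. can_seek says whether the stream supports
 * fseeko(). Returns 0 on success, -1 on error or premature EOF.
 */
static int
skip_forward(FILE *fp, off_t len, int can_seek)
{
	char		buf[8192];

	assert(len >= 0);

	/*
	 * Long jump on a seekable stream: seek. Note that fseeko() past EOF is
	 * not an error; the next read will simply hit EOF.
	 */
	if (can_seek && len >= SKIP_READ_THRESHOLD)
		return fseeko(fp, len, SEEK_CUR) == 0 ? 0 : -1;

	/* Short jump (or unseekable input): read sequentially and discard. */
	while (len > 0)
	{
		size_t		chunk = len < (off_t) sizeof(buf) ? (size_t) len : sizeof(buf);

		if (fread(buf, 1, chunk, fp) != chunk)
			return -1;
		len -= (off_t) chunk;
	}
	return 0;
}
```

The trade-off is that the skipped bytes are still transferred from disk, which is why long jumps keep using `fseeko()`; the threshold decides where reading through stops paying off.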
Attachment
Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward
From: Dimitrios Apostolou
Date:
On Sat, 29 Mar 2025, Dimitrios Apostolou wrote:
>
> P.S. What is the recommended way to test a change, besides a generic make
> check? And how do I run selectively only the pg_dump/restore tests, in order
> to speed up my development routine?

I have tested it with:

    make -C src/bin/pg_dump check

It didn't break any tests, but I also don't see any difference there; the performance boost is noticeable only when restoring a huge archive that is missing offsets.

Would anyone volunteer to review this one-line patch?

Thanks,
Dimitris
Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward
From: Nathan Bossart
Date:
On Tue, Apr 01, 2025 at 09:33:32PM +0200, Dimitrios Apostolou wrote:
> It didn't break any test, but I also don't see any difference, the
> performance boost is noticeable only when restoring a huge archive that is
> missing offsets.

This seems generally reasonable to me, but how did you decide on 1MB as the threshold? Have you tested other values? Could the best threshold vary based on the workload and hardware?

--
nathan
Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward
From: Dimitrios Apostolou
Date:
Thanks. This is the first value I tried and it works well. In the archive I have, all blocks seem to be between 8 and 20KB, so the forward jumps before the change never even got close to 1MB. Could they be bigger in an uncompressed archive? Or in a future pg_dump that raises the block size? I don't really know, so it is difficult to test such scenarios, but it made sense to guard against these cases too.

I chose 1MB by doing a very crude calculation in my head: when would it be worth seeking forward instead of reading? On very slow drives, 60MB/s for sequential reads and 60 IOPS for random reads are plausible figures. In that worst case a single seek costs about 1/60 s, which is the time it takes to read 1MB sequentially, so it would be better to seek() forward only for lengths of over 1MB.

On 1 April 2025 22:04:00 CEST, Nathan Bossart <nathandbossart@gmail.com> wrote:
>On Tue, Apr 01, 2025 at 09:33:32PM +0200, Dimitrios Apostolou wrote:
>> It didn't break any test, but I also don't see any difference, the
>> performance boost is noticeable only when restoring a huge archive that is
>> missing offsets.
>
>This seems generally reasonable to me, but how did you decide on 1MB as the
>threshold? Have you tested other values? Could the best threshold vary
>based on the workload and hardware?
>
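The crude calculation in the reply above can be written down as a tiny C function. It relies on a simplified cost model (an assumption, not something measured in the thread): one random seek costs about 1/IOPS seconds, and in that time a sequential reader would consume throughput / IOPS bytes, so forward jumps shorter than that are cheaper to read through. The name `break_even_bytes` is hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical helper: break-even distance (in bytes) for a forward jump
 * under a simplified model. One seek costs roughly 1/iops seconds; during
 * that time a sequential reader could have consumed
 * seq_bytes_per_sec / iops bytes. Ignores caching, readahead and
 * per-syscall overhead.
 */
static double
break_even_bytes(double seq_bytes_per_sec, double iops)
{
	assert(seq_bytes_per_sec > 0 && iops > 0);
	return seq_bytes_per_sec / iops;
}
```

With the figures from the message, `break_even_bytes(60e6, 60)` gives 1e6 bytes, i.e. about 1MB. Because the model leaves out page-cache hits, readahead and syscall overhead, the best threshold on real hardware may well differ from this estimate.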
Re: [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
From: Dimitrios Apostolou
Date:
I just managed to run pgindent; here is v2 with the comment style fixed.