Thread: parallel pg_restore blocks on heavy random read I/O in all child processes


From: Dimitrios Apostolou
Hello list,

I noticed the weird behaviour that doing a pg_restore of a huge database
dump leads to constant read I/O (at about 15K IOPS from the NVMe drive
that holds the dump file) for about one hour. I believe it happens with
any -j value >= 2.

In particular, I get output like the following in the pg_restore log, only
a few seconds after running it:

pg_restore: launching item 12110 TABLE DATA yyy
pg_restore: processing data for table "public.yyy"
[ long pause ...]
pg_restore: finished item 12110 TABLE DATA yyy

The output varies a bit depending on when each process hits the bottleneck,
and it is hard to isolate each process's log since the output is not prefixed
with the PID, so take it with a grain of salt.

If I strace the pg_restore children, I see them read()ing 4K and lseek()ing a
couple of blocks forward, over and over:

read(4, "..."..., 4096) = 4096
lseek(4, 55544369152, SEEK_SET)         = 55544369152
read(4, "..."..., 4096) = 4096
lseek(4, 55544381440, SEEK_SET)         = 55544381440
read(4, "..."..., 4096) = 4096
lseek(4, 55544397824, SEEK_SET)         = 55544397824
read(4, "..."..., 4096) = 4096
lseek(4, 55544414208, SEEK_SET)         = 55544414208
read(4, "..."..., 4096) = 4096
lseek(4, 55544426496, SEEK_SET)         = 55544426496

No other syscalls are occurring (no sends or writes whatsoever).

After the ~1h pause, I don't notice the behaviour again during the rest of the
many-hour restore, even though there are plenty of larger tables in the dump.
Normal restore activity resumes, during which the pg_restore processes are
barely noticeable on the system and the COPY postgres backend processes do
most of the work.

Here is the exact pg_restore command that most recently reproduced the issue
for me, on PostgreSQL 17:

pg_restore  -vvv -U $DBUSER -j8 -d $DBNAME --data-only --no-tablespaces --no-table-access-method --strict-names
$DUMP_FILENAME

And here is the pg_dump command which has created the dump file, executed
on PostgreSQL 16.

pg_dump -v --format=custom --compress=zstd --no-toast-compression  $DBNAME   |  $send_to_remote_storage

The database is multiple TB in size, with a couple of thousand tables (notably,
one huge table with 1000 partitions). The zstd-compressed custom-format dump
file is 816 GB.

What do you think causes this? Is it something that can be improved?

Thanks in advance,
Dimitris



Dimitrios Apostolou <jimis@gmx.net> writes:
> I noticed the weird behaviour that doing a pg_restore of a huge database
> dump leads to constant read I/O (at about 15K IOPS from the NVMe drive
> that holds the dump file) for about one hour. I believe it happens with
> any -j value >= 2.

> In particular, I get output like the following in the pg_restore log, only
> a few seconds after running it:

> pg_restore: launching item 12110 TABLE DATA yyy
> pg_restore: processing data for table "public.yyy"
> [ long pause ...]
> pg_restore: finished item 12110 TABLE DATA yyy

I am betting that the problem is that the dump's TOC (table of
contents) lacks offsets to the actual data of the database objects,
and thus the readers have to reconstruct that information by scanning
the dump file.  Normally, pg_dump will back-fill offset data in the
TOC at completion of the dump, but if it's told to write to an
un-seekable output file then it cannot do that.
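
To give a rough idea of what that reconstruction amounts to, here is a
sketch (not the actual pg_restore code; read_int() and read_block_header()
are placeholders for the archiver's real readers, and error handling is
omitted):

#include <stdio.h>

static size_t read_int(FILE *fh);           /* placeholder: next chunk length */
static int    read_block_header(FILE *fh);  /* placeholder: block's dump ID   */

/*
 * Walk forward through the archive, reading each data block's header and
 * skipping its payload chunk by chunk, until the block belonging to the
 * wanted dump ID turns up.  Returns the file position of its first chunk.
 */
static long
find_data_block(FILE *fh, int want_dumpid)
{
    for (;;)
    {
        int     dumpid = read_block_header(fh);

        if (dumpid == want_dumpid)
            return ftell(fh);   /* data for this entry starts here */

        /* skip the block: (length, payload) chunks, a zero length ends it */
        for (size_t len = read_int(fh); len != 0; len = read_int(fh))
            fseek(fh, (long) len, SEEK_CUR);    /* one small read plus one
                                                 * seek per chunk, i.e. the
                                                 * read/lseek pattern seen
                                                 * in strace */
    }
}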

> And here is the pg_dump command which has created the dump file, executed
> on PostgreSQL 16.

> pg_dump -v --format=custom --compress=zstd --no-toast-compression  $DBNAME   |  $send_to_remote_storage

Yup, writing the output to a pipe would cause that ...

> What do you think causes this? Is it something that can be improved?

I don't see an easy way, and certainly no way that wouldn't involve
redefining the archive format.  Can you write the dump to a local
file rather than piping it immediately?

            regards, tom lane



On Thu, 20 Mar 2025, Tom Lane wrote:

> I am betting that the problem is that the dump's TOC (table of
> contents) lacks offsets to the actual data of the database objects,
> and thus the readers have to reconstruct that information by scanning
> the dump file.  Normally, pg_dump will back-fill offset data in the
> TOC at completion of the dump, but if it's told to write to an
> un-seekable output file then it cannot do that.

Thanks Tom, this makes sense! As you noticed, I'm piping the output, and
this was a conscious choice.

> I don't see an easy way, and certainly no way that wouldn't involve
> redefining the archive format.  Can you write the dump to a local
> file rather than piping it immediately?

Unfortunately I don't have enough space for that. I'm still testing, but the
way this is designed to work is to take an uncompressed pg_dump (unlike the
above, which was compressed for testing purposes) and send it to a backup
server that does its own deduplication and compression.

Further questions:

* Does the same happen in an uncompressed dump? Or maybe the offsets are
   pre-filled because they are predictable without compression?

* Should pg_dump print some warning for generating a lower quality format?

* The seeking pattern in pg_restore seems nonsensical to me: reading 4K,
   jumping 8-12K, repeat for the whole file? Consuming 15K IOPS for an
   hour. /Maybe/ something to improve there... Where can I read more about
   the format?

* Why doesn't it happen in single-process pg_restore?


Thank you!
Dimitris




Dimitrios Apostolou <jimis@gmx.net> writes:
> On Thu, 20 Mar 2025, Tom Lane wrote:
>> I am betting that the problem is that the dump's TOC (table of
>> contents) lacks offsets to the actual data of the database objects,
>> and thus the readers have to reconstruct that information by scanning
>> the dump file.  Normally, pg_dump will back-fill offset data in the
>> TOC at completion of the dump, but if it's told to write to an
>> un-seekable output file then it cannot do that.

> Further questions:

> * Does the same happen in an uncompressed dump? Or maybe the offsets are
>    pre-filled because they are predictable without compression?

Yes; no.  We don't know the size of a table's data as-dumped until
we've dumped it.

> * Should pg_dump print some warning for generating a lower quality format?

I don't think so.  In many use-cases this is irrelevant and the
warning would just be an annoyance.

> * The seeking pattern in pg_restore seems nonsensical to me: reading 4K,
>    jumping 8-12K, repeat for the whole file? Consuming 15K IOPS for an
>    hour. /Maybe/ something to improve there... Where can I read more about
>    the format?

It's reading data blocks (or at least the headers thereof), which have
a limited size.  I don't think that size has changed since circa 1999,
so maybe we could consider increasing it; but I doubt we could move
the needle very far that way.

> * Why doesn't it happen in single-process pg_restore?

A single-process restore is going to restore all the data in the order
it appears in the archive file, so no seeking is required.  Of course,
as soon as you ask for parallelism, that doesn't work too well.

Hypothetically, maybe the algorithm for handing out tables-to-restore
to parallel workers could pay attention to the distance to the data
... except that in the problematic case we don't have that
information.  I don't recall for sure, but I think that the order of
the TOC entries is not necessarily a usable proxy for the order of the
data entries.  It's unclear to me that overriding the existing
heuristic (biggest tables first, I think) would be a win anyway.

            regards, tom lane



On Sun, 23 Mar 2025, Tom Lane wrote:

> Dimitrios Apostolou <jimis@gmx.net> writes:
>> On Thu, 20 Mar 2025, Tom Lane wrote:
>>> I am betting that the problem is that the dump's TOC (table of
>>> contents) lacks offsets to the actual data of the database objects,
>>> and thus the readers have to reconstruct that information by scanning
>>> the dump file.  Normally, pg_dump will back-fill offset data in the
>>> TOC at completion of the dump, but if it's told to write to an
>>> un-seekable output file then it cannot do that.
>
>> Further questions:
>
>> * The seeking pattern in pg_restore seems nonsensical to me: reading 4K,
>>    jumping 8-12K, repeat for the whole file? Consuming 15K IOPS for an
>>    hour. /Maybe/ something to improve there... Where can I read more about
>>    the format?
>
> It's reading data blocks (or at least the headers thereof), which have
> a limited size.  I don't think that size has changed since circa 1999,
> so maybe we could consider increasing it; but I doubt we could move
> the needle very far that way.

If it could be substantially increased, it would cut the number of seeks by a
big factor and definitely make a difference. But would it have a negative
impact on the size of small backups?
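
As a rough back-of-envelope, taking the ~12-16KB jumps from my strace as a
typical chunk spacing: 816GB / ~14KB is about 60 million header reads, and
60 million reads at 15K IOPS is roughly 4000 seconds, i.e. just over an
hour. That matches the pause I saw, so larger blocks should shrink that
scan more or less proportionally.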

However, I would have expected the kernel to handle this read-and-seek
pattern better: the 4K read() should trigger a large pre-fetch (read-ahead)
that covers the next seek and read. I'm not sure why I don't see that
happening; it's a rather typical setup (NVMe SSD with ext4 fs on Linux 6.11).
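
If it helps anyone experiment, something along these lines (just a sketch,
not anything pg_restore does today) should tell the kernel to keep a large
read-ahead window on the archive's file descriptor despite the short
forward seeks:

#define _XOPEN_SOURCE 600
#include <fcntl.h>

/*
 * Advise the kernel that this file will be read (mostly) sequentially.
 * fd would be fileno() of the archive's FILE handle; offset 0 with len 0
 * means "the whole file".
 */
static void
hint_sequential(int fd)
{
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}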

As a workaround, pg_restore could read() the whole archive sequentially to
construct the offset table, instead of jumping around. I'm a bit split on this
one, as it would be much faster on the compressed archive (~800GB) but not
that much faster on the uncompressed one (>5TB).

I'm also under the impression that all the pg_restore child processes are
doing the same thing: seeking and reading through the whole file, each for its
own sake. Can you confirm or refute this?


Thank you for the detailed response.

Dimitris




Hello again,

I traced the seek-and-read behaviour of parallel pg_restore inside
_skipData() when called from _PrintTocData(). Since most of today's I/O
devices (both rotating and solid state) can read 1MB sequentially faster than
they can seek and read 4KB, I tried the following change:

diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index 55107b20058..262ba509829 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -618,31 +618,31 @@ _skipLOs(ArchiveHandle *AH)
   * Skip data from current file position.
   * Data blocks are formatted as an integer length, followed by data.
   * A zero length indicates the end of the block.
  */
  static void
  _skipData(ArchiveHandle *AH)
  {
         lclContext *ctx = (lclContext *) AH->formatData;
         size_t          blkLen;
         char       *buf = NULL;
         int                     buflen = 0;

         blkLen = ReadInt(AH);
         while (blkLen != 0)
         {
-               if (ctx->hasSeek)
+               if (ctx->hasSeek && blkLen > 1024 * 1024)
                 {
                         if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
                                 pg_fatal("error during file seek: %m");
                 }
                 else
                 {
                         if (blkLen > buflen)
                         {
                                 free(buf);
                                 buf = (char *) pg_malloc(blkLen);
                                 buflen = blkLen;
                         }
                         if (fread(buf, 1, blkLen, AH->FH) != blkLen)
                         {
                                 if (feof(AH->FH))


This simple change immensely improves the offset-table-building phase of the
parallel restore (maybe 10x, depending on the number of workers).

A problem that still remains is that this offset-table-building phase is done
in every worker process, which means that all workers scan the whole archive
almost in parallel. A more intrusive improvement would be to move this phase
to the parent process, before spawning the children.
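
Roughly the shape I have in mind, as a sketch only (scan_block_and_skip()
stands in for the existing header-reading and _skipData() logic; this is
not a patch):

#include <stdio.h>
#include <sys/types.h>

typedef struct
{
    int     dumpid;
    off_t   offset;
} DataBlockOffset;

static int scan_block_and_skip(FILE *fh);   /* placeholder: read one block's
                                             * header, skip its payload,
                                             * return its dump ID */

/*
 * One sequential pass over the archive in the parent, before forking any
 * workers: record where each dump ID's data block starts, so that workers
 * can fseeko() straight to their assigned entries.
 */
static void
build_offset_table(FILE *fh, DataBlockOffset *table, int nentries)
{
    for (int i = 0; i < nentries; i++)
    {
        table[i].offset = ftello(fh);           /* this block starts here */
        table[i].dumpid = scan_block_and_skip(fh);
    }
}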

What do you think?

Regards,
Dimitris


P.S. I also have a simple change that makes the -j1 switch mean "parallel,
but with one worker process", which I did for debugging purposes. Not sure if
it is of interest here.