Thread: parallel pg_restore blocks on heavy random read I/O on all children processes
parallel pg_restore blocks on heavy random read I/O on all children processes
From: Dimitrios Apostolou
Date:
Hello list,

I noticed the weird behaviour that doing a pg_restore of a huge database dump leads to constant read I/O (at about 15K IOPS from the NVMe drive that has the dump file) for about one hour. I believe it happens with any -j value >= 2.

In particular, I get output like the following in the pg_restore log, only a few seconds after running it:

pg_restore: launching item 12110 TABLE DATA yyy
pg_restore: processing data for table "public.yyy"
[ long pause ... ]
pg_restore: finished item 12110 TABLE DATA yyy

The output varies a bit, depending on when each process comes to the bottleneck point, and it is hard to isolate each process' log since the output is not prefixed with the PID, so take it with a grain of salt.

If I strace the pg_restore children, I see them read()ing 4K and seek()ing a couple of blocks forward, all the time:

read(4, "..."..., 4096)         = 4096
lseek(4, 55544369152, SEEK_SET) = 55544369152
read(4, "..."..., 4096)         = 4096
lseek(4, 55544381440, SEEK_SET) = 55544381440
read(4, "..."..., 4096)         = 4096
lseek(4, 55544397824, SEEK_SET) = 55544397824
read(4, "..."..., 4096)         = 4096
lseek(4, 55544414208, SEEK_SET) = 55544414208
read(4, "..."..., 4096)         = 4096
lseek(4, 55544426496, SEEK_SET) = 55544426496

No other syscalls are occurring (no sends or writes whatsoever). After the ~1h pause, I don't notice the behaviour again during the many-hours-long restore, even though there are plenty of larger tables in the dump. The normal restore activity resumes, during which the pg_restore processes are barely noticeable in the system and the COPY postgres backend processes do most of the work.

Here is the exact pg_restore command that last reproduced the issue for me, on PostgreSQL 17:

pg_restore -vvv -U $DBUSER -j8 -d $DBNAME --data-only --no-tablespaces --no-table-access-method --strict-names $DUMP_FILENAME

And here is the pg_dump command which has created the dump file, executed on PostgreSQL 16:

pg_dump -v --format=custom --compress=zstd --no-toast-compression $DBNAME | $send_to_remote_storage

The database is multiple TBs large, with a couple of thousand tables (notably one huge table with 1000 partitions). The zstd-compressed custom-format dump file is 816 GB in size.

What do you think causes this? Is it something that can be improved?

Thanks in advance,
Dimitris
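For reference, the per-worker syscall trace above can be captured with something like the following; the process-name match is an assumption about how the workers appear in the process list:

# Attach to all running pg_restore processes and log only read/lseek,
# with timestamps; pgrep -d, produces a comma-separated PID list.
strace -tt -e trace=read,lseek -p "$(pgrep -d, pg_restore)"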
Re: parallel pg_restore blocks on heavy random read I/O on all children processes
From: Tom Lane
Date:
Dimitrios Apostolou <jimis@gmx.net> writes:
> I noticed the weird behaviour that doing a pg_restore of a huge database
> dump, leads to constant read I/O (at about 15K IOPS from the NVMe drive
> that has the dump file) for about one hour. I believe it happens with
> any -j value>=2.

> In particular, I get output like the following in the pg_restore log, only
> a few seconds after running it:

> pg_restore: launching item 12110 TABLE DATA yyy
> pg_restore: processing data for table "public.yyy"
> [ long pause ...]
> pg_restore: finished item 12110 TABLE DATA yyy

I am betting that the problem is that the dump's TOC (table of contents) lacks offsets to the actual data of the database objects, and thus the readers have to reconstruct that information by scanning the dump file. Normally, pg_dump will back-fill offset data in the TOC at completion of the dump, but if it's told to write to an un-seekable output file then it cannot do that.

> And here is the pg_dump command which has created the dump file, executed
> on PostgreSQL 16.
> pg_dump -v --format=custom --compress=zstd --no-toast-compression $DBNAME | $send_to_remote_storage

Yup, writing the output to a pipe would cause that ...

> What do you think causes this? Is it something that can be improved?

I don't see an easy way, and certainly no way that wouldn't involve redefining the archive format. Can you write the dump to a local file rather than piping it immediately?

			regards, tom lane
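For reference, a minimal sketch of that suggestion, assuming there is enough local scratch space for the dump; the path and job count below are placeholders:

# Write the custom-format dump to a seekable local file, so pg_dump can
# seek back and fill in the TOC data offsets when the dump completes.
pg_dump -v --format=custom --compress=zstd --no-toast-compression \
        -f /scratch/db.dump "$DBNAME"

# Parallel restore workers can then jump straight to each table's data
# instead of scanning the archive to find it.
pg_restore -j8 -d "$DBNAME" --data-only /scratch/db.dump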
Re: parallel pg_restore blocks on heavy random read I/O on all children processes
From: Dimitrios Apostolou
Date:
On Thu, 20 Mar 2025, Tom Lane wrote:

> I am betting that the problem is that the dump's TOC (table of
> contents) lacks offsets to the actual data of the database objects,
> and thus the readers have to reconstruct that information by scanning
> the dump file. Normally, pg_dump will back-fill offset data in the
> TOC at completion of the dump, but if it's told to write to an
> un-seekable output file then it cannot do that.

Thanks Tom, this makes sense! As you noticed, I'm piping the output, and this was a conscious choice.

> I don't see an easy way, and certainly no way that wouldn't involve
> redefining the archive format. Can you write the dump to a local
> file rather than piping it immediately?

Unfortunately I don't have enough space for that. I'm still testing, but the way this is designed to work is to take an uncompressed pg_dump (unlike the above, which was compressed for testing purposes) and send it to a backup server having its own deduplication and compression.

Further questions:

* Does the same happen in an uncompressed dump? Or maybe the offsets are
  pre-filled because they are predictable without compression?

* Should pg_dump print some warning for generating a lower quality format?

* The seeking pattern in pg_restore seems non-sensical to me: reading 4K,
  jumping 8-12K, repeat for the whole file? Consuming 15K IOPS for an
  hour. /Maybe/ something to improve there... Where can I read more about
  the format?

* Why doesn't it happen in single-process pg_restore?

Thank you!
Dimitris
Re: parallel pg_restore blocks on heavy random read I/O on all children processes
From: Tom Lane
Date:
Dimitrios Apostolou <jimis@gmx.net> writes:
> On Thu, 20 Mar 2025, Tom Lane wrote:
>> I am betting that the problem is that the dump's TOC (table of
>> contents) lacks offsets to the actual data of the database objects,
>> and thus the readers have to reconstruct that information by scanning
>> the dump file. Normally, pg_dump will back-fill offset data in the
>> TOC at completion of the dump, but if it's told to write to an
>> un-seekable output file then it cannot do that.

> Further questions:

> * Does the same happen in an uncompressed dump? Or maybe the offsets are
> pre-filled because they are predictable without compression?

Yes; no. We don't know the size of a table's data as-dumped until we've dumped it.

> * Should pg_dump print some warning for generating a lower quality format?

I don't think so. In many use-cases this is irrelevant and the warning would just be an annoyance.

> * The seeking pattern in pg_restore seems non-sensical to me: reading 4K,
> jumping 8-12K, repeat for the whole file? Consuming 15K IOPS for an
> hour. /Maybe/ something to improve there... Where can I read more about
> the format?

It's reading data blocks (or at least the headers thereof), which have a limited size. I don't think that size has changed since circa 1999, so maybe we could consider increasing it; but I doubt we could move the needle very far that way.

> * Why doesn't it happen in single-process pg_restore?

A single-process restore is going to restore all the data in the order it appears in the archive file, so no seeking is required. Of course, as soon as you ask for parallelism, that doesn't work too well.

Hypothetically, maybe the algorithm for handing out tables-to-restore to parallel workers could pay attention to the distance to the data ... except that in the problematic case we don't have that information. I don't recall for sure, but I think that the order of the TOC entries is not necessarily a usable proxy for the order of the data entries. It's unclear to me that overriding the existing heuristic (biggest tables first, I think) would be a win anyway.

			regards, tom lane
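As for where to read more: the custom archive format is implemented in src/bin/pg_dump/pg_backup_custom.c. One way to look at an archive's TOC (the entry list, though not the per-block data offsets) is pg_restore's list mode; the file name below is a placeholder:

# Print the TOC entries of a custom-format archive, in entry order,
# including the dump IDs that show up in the verbose restore log.
pg_restore -l /scratch/db.dump | head -n 50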
Re: parallel pg_restore blocks on heavy random read I/O on all children processes
From: Dimitrios Apostolou
Date:
On Sun, 23 Mar 2025, Tom Lane wrote:

> Dimitrios Apostolou <jimis@gmx.net> writes:
>> On Thu, 20 Mar 2025, Tom Lane wrote:
>>> I am betting that the problem is that the dump's TOC (table of
>>> contents) lacks offsets to the actual data of the database objects,
>>> and thus the readers have to reconstruct that information by scanning
>>> the dump file. Normally, pg_dump will back-fill offset data in the
>>> TOC at completion of the dump, but if it's told to write to an
>>> un-seekable output file then it cannot do that.
>
>> Further questions:
>
>> * The seeking pattern in pg_restore seems non-sensical to me: reading 4K,
>> jumping 8-12K, repeat for the whole file? Consuming 15K IOPS for an
>> hour. /Maybe/ something to improve there... Where can I read more about
>> the format?
>
> It's reading data blocks (or at least the headers thereof), which have
> a limited size. I don't think that size has changed since circa 1999,
> so maybe we could consider increasing it; but I doubt we could move
> the needle very far that way.

If it could be substantially increased, it would cut the number of seeks by a big factor and definitely make a difference. But would it have a negative impact on the size of small backups?

However, I would have expected the kernel to handle this read-then-seek pattern better: the 4K read() should trigger a large read-ahead that covers the next seek and read. Not sure why I don't see that happening; it's a rather typical setup (NVMe SSD with an ext4 fs on Linux 6.11).

As a workaround, pg_restore could read() the whole archive sequentially to construct the TOC, instead of jumping around. I'm a bit split on this one, as it would be much faster on the compressed archive (~800GB) but not that much faster on the uncompressed one (>5TB).

I'm also under the impression that all the pg_restore child processes are doing the same thing: seeking and reading through the whole file, each for its own sake. Can you verify or disprove that?

Thank you for the detailed response.
Dimitris
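For what it's worth, one quick experiment on the read-ahead theory is to check, and temporarily raise, the readahead window of the device holding the dump file. The device name below is a placeholder, and blockdev counts in 512-byte sectors:

# Current readahead window, in KiB
cat /sys/block/nvme0n1/queue/read_ahead_kb

# The same setting via blockdev, in 512-byte sectors (256 sectors = 128 KiB)
blockdev --getra /dev/nvme0n1

# Temporarily raise it to 4 MiB and re-run the restore's scanning phase
blockdev --setra 8192 /dev/nvme0n1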
Re: parallel pg_restore blocks on heavy random read I/O on all children processes
From: Dimitrios Apostolou
Date:
Hello again,

I traced the seeking-reading behaviour of parallel pg_restore inside _skipData(), when called from _PrintTocData(). Since most of today's I/O devices (both rotating and solid state) can read 1MB sequentially faster than they can seek and read 4KB, I tried the following change:

diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index 55107b20058..262ba509829 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -618,31 +618,31 @@ _skipLOs(ArchiveHandle *AH)
  * Skip data from current file position.
  * Data blocks are formatted as an integer length, followed by data.
  * A zero length indicates the end of the block.
  */
 static void
 _skipData(ArchiveHandle *AH)
 {
     lclContext *ctx = (lclContext *) AH->formatData;
     size_t      blkLen;
     char       *buf = NULL;
     int         buflen = 0;
 
     blkLen = ReadInt(AH);
     while (blkLen != 0)
     {
-        if (ctx->hasSeek)
+        if (ctx->hasSeek && blkLen > 1024 * 1024)
         {
             if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
                 pg_fatal("error during file seek: %m");
         }
         else
         {
             if (blkLen > buflen)
             {
                 free(buf);
                 buf = (char *) pg_malloc(blkLen);
                 buflen = blkLen;
             }
 
             if (fread(buf, 1, blkLen, AH->FH) != blkLen)
             {
                 if (feof(AH->FH))

This simple change improves the offset-table building phase of the parallel restore immensely (10x maybe, depending on the number of workers).

A problem still remaining is that this offset-table building phase is done by every worker process, which means that all of the workers scan the whole archive, almost in parallel. A more intrusive improvement would be to move this phase into the parent process, before spawning the children. What do you think?

Regards,
Dimitris

P.S. I also have a simple change that makes the -j1 switch mean "parallel, but with one worker process", which I did for debugging purposes. Not sure if it is of interest here.
Re: parallel pg_restore blocks on heavy random read I/O on all children processes
From: Dimitrios Apostolou
Date:
Sent a proper patch to the pgsql-hackers mailing list, at: https://www.postgresql.org/message-id/flat/2edb7a57-b225-3b23-a680-62ba90658fec%40gmx.net