Re: parallel pg_restore blocks on heavy random read I/O on all children processes - Mailing list pgsql-performance

From Hannu Krosing
Subject Re: parallel pg_restore blocks on heavy random read I/O on all children processes
Date
Msg-id CAMT0RQTz7Zi99C66U2160Mmcxj+fBJ0OpD6Eq=aLZsdYaFZwBg@mail.gmail.com
In response to Re: parallel pg_restore blocks on heavy random read I/O on all children processes  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-performance
You may be interested in the patch "Adding pg_dump flag for parallel
export to pipes"[1], which allows using pipes with directory-format
parallel dump and restore.
There the offsets are implicitly taken care of by the file system.
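For context, the stock directory format already does parallel dump and
restore to regular files (one file per table, so each worker can seek
independently). A sketch of a typical invocation; the database names,
paths, and job count below are placeholders:

```shell
# Parallel dump to a directory-format archive; each table's data goes
# to its own file, so seekability (and thus TOC offsets) is a non-issue.
pg_dump -Fd -j 8 -f /backups/mydb.dir mydb

# Parallel restore from the same directory archive.
pg_restore -j 8 -d mydb_restored /backups/mydb.dir
```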

[1] https://www.postgresql.org/message-id/CAH5HC97p4kkpikar%2BswuC0Lx4YTVkE30sTsFX94tyzih7Cc_%3Dw%40mail.gmail.com

On Sun, Mar 23, 2025 at 4:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Dimitrios Apostolou <jimis@gmx.net> writes:
> > On Thu, 20 Mar 2025, Tom Lane wrote:
> >> I am betting that the problem is that the dump's TOC (table of
> >> contents) lacks offsets to the actual data of the database objects,
> >> and thus the readers have to reconstruct that information by scanning
> >> the dump file.  Normally, pg_dump will back-fill offset data in the
> >> TOC at completion of the dump, but if it's told to write to an
> >> un-seekable output file then it cannot do that.
>
> > Further questions:
>
> > * Does the same happen in an uncompressed dump? Or maybe the offsets are
> >    pre-filled because they are predictable without compression?
>
> Yes; no.  We don't know the size of a table's data as-dumped until
> we've dumped it.
>
> > * Should pg_dump print some warning for generating a lower quality format?
>
> I don't think so.  In many use-cases this is irrelevant and the
> warning would just be an annoyance.
>
> > * The seeking pattern in pg_restore seems non-sensical to me: reading 4K,
> >    jumping 8-12K, repeat for the whole file? Consuming 15K IOPS for an
> >    hour. /Maybe/ something to improve there... Where can I read more about
> >    the format?
>
> It's reading data blocks (or at least the headers thereof), which have
> a limited size.  I don't think that size has changed since circa 1999,
> so maybe we could consider increasing it; but I doubt we could move
> the needle very far that way.
>
> > * Why doesn't it happen in single-process pg_restore?
>
> A single-process restore is going to restore all the data in the order
> it appears in the archive file, so no seeking is required.  Of course,
> as soon as you ask for parallelism, that doesn't work too well.
>
> Hypothetically, maybe the algorithm for handing out tables-to-restore
> to parallel workers could pay attention to the distance to the data
> ... except that in the problematic case we don't have that
> information.  I don't recall for sure, but I think that the order of
> the TOC entries is not necessarily a usable proxy for the order of the
> data entries.  It's unclear to me that overriding the existing
> heuristic (biggest tables first, I think) would be a win anyway.
>
>                         regards, tom lane
>
>
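To make the seekability point above concrete, a sketch of the two ways a
custom-format dump can be produced (file, host, and database names are
placeholders):

```shell
# Written to a named file: pg_dump can seek back at completion and
# back-fill the data offsets in the TOC.
pg_dump -Fc -f mydb.dump mydb

# Written to a pipe: stdout is not seekable, the offsets stay unset,
# and a later parallel pg_restore must scan the archive so each
# worker can locate its data blocks.
pg_dump -Fc mydb | ssh backuphost 'cat > mydb.dump'
```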


