Home > mailing lists

Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward - Mailing list pgsql-hackers

From	Dimitrios Apostolou
Subject	Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward
Date	July 28 16:34:20
Msg-id	1po8os49-r63o-2923-p37n-12698o1qn7p0@tzk.arg Whole thread Raw
In response to	Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward (Dimitrios Apostolou <jimis@gmx.net>)
List	pgsql-hackers

Tree view

On Saturday 2025-06-14 18:17, Dimitrios Apostolou wrote:

> Out of curiosity I've tried the same with an uncompressed dump
> (--compress=none). Surprisingly it seems the blocksize is even smaller.
>
> With my patched pg_restore I only get 4K reads and nothing else on the strace
> output.
>
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096

To clarify this output again, I have a huge uncompressed custom format
dump without TOC (because pg_dump was writing to stdout), and at this
point pg_restore is going through the whole archive to find the items it
needs. Allow me to explain what goes on at this point since I have
better insight now.

The code in question, in pg_backup_custom.c:

/*
  * Skip data from current file position.
  * Data blocks are formatted as an integer length, followed by data.
  * A zero length indicates the end of the block.
*/
static void
_skipData(ArchiveHandle *AH)
{
     lclContext *ctx = (lclContext *) AH->formatData;
     size_t        blkLen;
     char       *buf = NULL;
     int            buflen = 0;

     blkLen = ReadInt(AH);
     while (blkLen != 0)
     {
         /* Sequential access is usually faster, so avoid seeking if the
          * jump forward is shorter than 1MB. */
         if (ctx->hasSeek && blkLen > 1024 * 1024)
         {
             if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
                 pg_fatal("error during file seek: %m");
         }
         else
         {
             if (blkLen > buflen)
             {
                 free(buf);
                 buf = (char *) pg_malloc(blkLen);
                 buflen = blkLen;
             }
             if (fread(buf, 1, blkLen, AH->FH) != blkLen)
             {
                 if (feof(AH->FH))
                     pg_fatal("could not read from input file: end of file");
                 else
                     pg_fatal("could not read from input file: %m");
             }
         }

         blkLen = ReadInt(AH);
     }

     free(buf);
}

blkLen is almost always a number around 35 to 38.
So fread() is called all the time doing reads of about ~35 bytes.
Then ReadInt() is actually doing getc() a few times.
And it loops over.

Libc is doing buffering of 4k, and that's how we end up seeing so many
4k reads. This also explains the ~80 lseek() between each 4k read() on
the unpatched version, mentioned in previous email.

I've tried setvbuf() like Thomas Munro suggested and I saw a significant
improvement by allocating and using a 1MB buffer for libc stream
buffering.

Question that remains: where is pg_dump setting this ~35B length block?
Is that easy to change without breaking old versions?

Thanks in advance,
Dimitris

pgsql-hackers by date:

From: Andrei Lepikhov
Date: 28 July, 16:29:06
Subject: Re: Pushing down a subquery relation's ppi_clauses, and more ...

From: Tom Lane
Date: 28 July, 16:58:08
Subject: Re: [PATCH] avoid double scanning in function byteain

Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward - Mailing list pgsql-hackers

Previous

Next