Re: [PATCH v4] parallel pg_restore: avoid disk seeks when jumping short distance forward - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [PATCH v4] parallel pg_restore: avoid disk seeks when jumping short distance forward
Date
Msg-id 526972.1761068213@sss.pgh.pa.us
Whole thread Raw
In response to Re: [PATCH v4] parallel pg_restore: avoid disk seeks when jumping short distance forward  (Dimitrios Apostolou <jimis@gmx.net>)
List pgsql-hackers
Dimitrios Apostolou <jimis@gmx.net> writes:
> On Tuesday 2025-10-21 00:15, Tom Lane wrote:
>> Not sure where we go from here, but clearly a bunch of research
>> is going to be needed to decide whether this is committable.

> pg_dump files from before your latest fix still exist, and they possibly 
> contain block header every 30 bytes (or however wide is the table rows). 
> A patch in pg_restore would vastly improve this use case.

Yes, that's a fair case to worry about.

> May I suggest the attached patch, which replaces fseeko() with fread() 
> if the distance is 32KB or less? Sounds rather improbable that this 
> would make things worse, but maybe it's possible to generate a dump file 
> with 32KB wide rows, and try restoring on various hardware?
> If this too is controversial, then we can reduce the number to 4KB. This 
> is the buffering that glibc does internally. By using the same in the 
> given patch, we avoid all the lseek(same-offset) repetitions between the 
> 4K reads. This should be a strict gain, with no downsides.

I spent some time strace'ing pg_restore on older dump files with
relatively small block sizes.  You are right that glibc seems quite
stupid about this: it appears to issue a kernel lseek() call for every
fseeko(), even though it keeps using the same 4K worth of data it had
previously read in.  I thought maybe that was an artifact of the
relatively old glibc in RHEL8, but glibc 2.40 from Fedora 41 is no
better.  I also checked current macOS, and it's marginally smarter:
it issues just one seek call per read call.  But that seek is still
useless, since the read is still 4K adjacent to the previous 4K.

(I wonder if maybe there is some POSIX standards compliance issue
involved here?  We don't care about the possibility that the disk
file is changing under us, but other applications do, and maybe
the kernel-visible seek calls are required for some reason.)

Putting in a minimum-block-size-to-seek check gets rid of the seek
calls entirely, producing a straight stream of reads, on all three
platforms.  I agree that doing this with a threshold of 4K seems
like a no-brainer, since that seems to be the common default stdio
buffer size.  Using a larger threshold, or setting up a larger buffer,
seems to risk platform-dependent results, so it would require more
performance testing than I care to do now.  Let's do 4K and call it
a day.

            regards, tom lane



pgsql-hackers by date:

Previous
From: Nathan Bossart
Date:
Subject: Re: abi-compliance-check failure due to recent changes to pg_{clear,restore}_{attribute,relation}_stats()
Next
From: Peter Eisentraut
Date:
Subject: Re: CI: Add task that runs pgindent