[PATCH v4] parallel pg_restore: avoid disk seeks when jumping short distance forward - Mailing list pgsql-hackers

From Dimitrios Apostolou
Subject [PATCH v4] parallel pg_restore: avoid disk seeks when jumping short distance forward
Date
Msg-id 9opr64ps-625r-667n-q19o-op35rs414n59@tzk.arg
Whole thread Raw
In response to Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [PATCH v4] parallel pg_restore: avoid disk seeks when jumping short distance forward
List pgsql-hackers
On Thursday 2025-10-16 19:01, Tom Lane wrote:

>> I think this is more or less committable, and then we could get
>> back to the original question of whether it's worth tweaking
>> pg_restore's seek-vs-scan behavior.
>
> And done.  Dimitrios, could you re-do your testing against current
> HEAD, and see if there's still a benefit to tweaking pg_restore's
> seek-vs-read decisions, and if so what's the best number?

Sorry for the delay, I hadn't realized I needed to generate a new
database dump using the current HEAD. So I did that, using
--compress=none and storing it on a compressed btrfs filesystem, since
that's my primary use case.

I noticed that things have improved immensely!
Using the test you suggested (see NOTE1):

     pg_restore -t last_table -f /dev/null  huge.pg_dump


1. The strace output is much more reasonable now; basically it's
    repeating the pattern

        read(4k)
        lseek(~128k forward)

   As a reminder, with old archives it was repeating the pattern:

        read(4k)
        lseek(4k forward)
        lseek(same offset as above) x ~80 times

2. The IO speed is better than before:

       On my 20TB HDD I get 30-50 MB/s read rate.

       With old archives I get 10-20 MB/s read rate.

3. Time to complete: ~25 min

4. CPU usage is low. With old archives the pg_restore process shows
    high *system* CPU (because of the number of syscalls).


I can't really compare the actual runtime between old and new dump,
because the two dumps are very different. But I have no doubt the new
dump is several times faster to seek through.


NOTE1: My original testcase was

           pg_restore -t last_table -j $NCPU -d testdb

        This testcase does not show as big an improvement,
        because every one of the parallel workers is
        concurrently seeking through the dump file.



*** All above was measured from master branch HEAD ***
277dec6514728e2d0d87c1279dd5e0afbf897428
Don't rely on zlib's gzgetc() macro.

*** Below I have applied attached patch ***


Regarding the attached patch (rebased, with an edited commit message): it
basically replaces seek(up to 1MB forward) with read(). The 1MB number
comes off the top of my head, but tweaking it anywhere between 128KB and
1MB wouldn't really change anything, given that the block size is now
128KB: the read() will always be chosen over the seek(). Do you know
of a real-world case with block sizes >128KB?

Anyway I tried it with the new archive from above.


1. strace output is a loop of the following:

         read(4k)
         read(~128k)

2. Read rate is 150-250 MB/s, basically the max that the HDD can give.

3. Time to complete: ~5 min

4. CPU usage: HIGH (63%), most likely because of the sheer amount
    of data it's parsing.


Regards,
Dimitris

