On Thursday 2025-10-16 19:01, Tom Lane wrote:
>> I think this is more or less committable, and then we could get
>> back to the original question of whether it's worth tweaking
>> pg_restore's seek-vs-scan behavior.
>
> And done. Dimitrios, could you re-do your testing against current
> HEAD, and see if there's still a benefit to tweaking pg_restore's
> seek-vs-read decisions, and if so what's the best number?
Sorry for the delay; I hadn't realized I needed to generate a new
database dump using the current HEAD. So I did that, using
--compress=none and storing it on a compressed btrfs filesystem, since
that's my primary use case.
I notice that things have improved immensely!
Using the test you suggested (see NOTE1):
pg_restore -t last_table -f /dev/null huge.pg_dump
1. The strace output is much more reasonable now; basically it's
repeating the pattern
read(4k)
lseek(~128k forward)
As a reminder, with old archives it was repeating the pattern:
read(4k)
lseek(4k forward)
lseek(same offset as above) x ~80 times
2. The IO speed is better than before:
On my 20TB HDD I get 30-50 MB/s read rate.
With old archives I get 10-20 MB/s read rate.
3. Time to complete: ~25 min
4. CPU usage is low. With old archives the pg_restore process shows
high *system* CPU (because of the amount of syscalls).
I can't really compare the actual runtime between the old and new
dumps, because the two dumps are very different. But I have no doubt
the new dump is several times faster to seek through.
NOTE1: My original testcase was
pg_restore -t last_table -j $NCPU -d testdb
This testcase does not show as big an improvement,
because every single one of the parallel workers
seeks through the dump file concurrently.
*** All of the above was measured on master branch HEAD ***
277dec6514728e2d0d87c1279dd5e0afbf897428
Don't rely on zlib's gzgetc() macro.
*** Below I have applied attached patch ***
Regarding the attached patch (rebased and edited commit message), it
basically replaces seek(up to 1MB forward) with read(). The 1MB number
is somewhat off the top of my head, but tweaking it anywhere between
128KB and 1MB wouldn't really change anything, given that the block
size is now 128KB: the read() will always be chosen over the seek().
Do you know of a real-world case with block sizes >128KB?
Anyway I tried it with the new archive from above.
1. strace output is a loop of the following:
read(4k)
read(~128k)
2. Read rate is between 150-250MB/s, basically the max the HDD can deliver.
3. Time to complete: ~5 min
4. CPU usage: HIGH (63%), most likely because of the sheer amount
of data it's parsing.
Regards,
Dimitris