On Thursday 2025-10-16 19:01, Tom Lane wrote:
>> I think this is more or less committable, and then we could get
>> back to the original question of whether it's worth tweaking
>> pg_restore's seek-vs-scan behavior.
>
> And done. Dimitrios, could you re-do your testing against current
> HEAD, and see if there's still a benefit to tweaking pg_restore's
> seek-vs-read decisions, and if so what's the best number?
Sorry for the delay; I hadn't realized I needed to generate a new
database dump using the current HEAD. So I did that, using
--compress=none and storing it on a compressed btrfs filesystem, since
that's my primary use case.
I notice that things have improved immensely!
Using the test you suggested (see NOTE1):
pg_restore -t last_table -f /dev/null huge.pg_dump
1. The strace output is much more reasonable now; basically it's
repeating the pattern
read(4k)
lseek(~128k forward)
As a reminder, with old archives it was repeating the pattern:
read(4k)
lseek(4k forward)
lseek(same offset as above) x ~80 times
2. The IO speed is better than before:
On my 20TB HDD I get 30-50 MB/s read rate.
With old archives I get 10-20 MB/s read rate.
3. Time to complete: ~25 min
4. CPU usage is low. With old archives the pg_restore process shows
high *system* CPU (because of the amount of syscalls).
I can't really compare the actual runtime between the old and new
dumps, because the two dumps are very different. But I have no doubt
the new dump is several times faster to seek through.
NOTE1: My original testcase was
pg_restore -t last_table -j $NCPU -d testdb
This testcase does not show as big an improvement,
because every single one of the parallel workers
seeks through the dump file concurrently.
*** All of the above was measured on master branch HEAD ***
277dec6514728e2d0d87c1279dd5e0afbf897428
Don't rely on zlib's gzgetc() macro.
*** Below I have applied attached patch ***
Regarding the attached patch (rebased and edited commit message), it
basically replaces seek(up to 1MB forward) with read(). The 1MB number
is somewhat off the top of my head, but tweaking it anywhere between
128KB and 1MB wouldn't really change anything, given that the block
size is now 128KB: the read() will always be chosen over the seek().
Do you know of a real-world case with block sizes >128KB?
Anyway I tried it with the new archive from above.
1. strace output is a loop of the following:
read(4k)
read(~128k)
2. Read rate is between 150-250MB/s, basically the max the HDD can deliver.
3. Time to complete: ~5 min
4. CPU usage: HIGH (63%), most likely because of the sheer amount
of data it's parsing.
Regards,
Dimitris