Re: Streamify more code paths - Mailing list pgsql-hackers

From: Xuneng Zhou
Subject: Re: Streamify more code paths
Msg-id: CABPTF7Wz8OCcr2f1CSdt-jRHfJAAdXZwwBokBD4baZAPyK7CgA@mail.gmail.com
In response to: Re: Streamify more code paths (Michael Paquier <michael@paquier.xyz>)
List: pgsql-hackers
On Thu, Mar 12, 2026 at 11:42 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Mar 12, 2026 at 06:33:08AM +0900, Michael Paquier wrote:
> > Thanks for doing that.  On my side, I am going to look at the gin and
> > hash vacuum paths first with more testing as these don't use a custom
> > callback.  I don't think that I am going to need a lot of convincing,
> > but I'd rather produce some numbers myself before doing something.
> > I'll tweak a mounting point with the delay trick, as well.
>
> While debug_io_direct has been helping a bit, the trick for the delay
> to throttle the IO activity has helped much more with my runtime
> numbers.  I have mounted a separate partition with a delay of 5ms,
> disabled checksums (this part did not make a real difference), and
> evicted shared buffers for relation and indexes before the VACUUM.
>
> Then I got better numbers.  Here is an extract:
> - worker=3:
> gin_vacuum (100k tuples)   base=  1448.2ms  patch=   572.5ms   2.53x
> ( 60.5%)  (reads=175→104, io_time=1382.70→506.64ms)
> gin_vacuum (300k tuples)   base=  3728.0ms  patch=  1332.0ms   2.80x
> ( 64.3%)  (reads=486→293, io_time=3669.89→1266.27ms)
> bloom_vacuum (100k tuples) base= 21826.8ms  patch= 17220.3ms   1.27x
> ( 21.1%)  (reads=485→117, io_time=4773.33→270.56ms)
> bloom_vacuum (300k tuples) base= 67054.0ms  patch= 53164.7ms   1.26x
> ( 20.7%)  (reads=1431.5→327.5, io_time=13880.2→381.395ms)
> - io_uring:
> gin_vacuum (100k tuples)   base=  1240.3ms  patch=   360.5ms   3.44x
> ( 70.9%)  (reads=175→104, io_time=1175.35→299.75ms)
> gin_vacuum (300k tuples)   base=  2829.9ms  patch=   642.0ms   4.41x
> ( 77.3%)  (reads=465.5→293, io_time=2768.46→579.04ms)
> bloom_vacuum (100k tuples) base= 22121.7ms  patch= 17532.3ms   1.26x
> ( 20.7%)  (reads=485→117, io_time=4850.46→285.28ms)
> bloom_vacuum (300k tuples) base= 67058.0ms  patch= 53118.0ms   1.26x
> ( 20.8%)  (reads=1431.5→327.5, io_time=13870.9→305.44ms)
>
> The higher the number of tuples, the better the performance for each
> individual operation, but the tests take a much longer time (tens of
> seconds vs tens of minutes).  For GIN, the numbers can be quite good
> once these reads are pushed.  For bloom, the runtime is improved, and
> the IO numbers are much better.
>
> At the end, I have applied these two parts.  Remains now the hash
> vacuum and the two parts for pgstattuple.
> --
> Michael

Thanks for running the benchmarks and pushing!

Here are the results of my tests with debug_io_direct and delay:

-- io_uring, medium size

bloom_vacuum_medium        base=  8355.2ms  patch=   715.0ms  11.68x
( 91.4%)  (reads=4732→1056, io_time=7699.47→86.52ms)
pgstattuple_medium         base=  4012.8ms  patch=   213.7ms  18.78x
( 94.7%)  (reads=2006→2006, io_time=4001.66→200.24ms)
pgstatindex_medium         base=  5490.6ms  patch=    37.9ms  144.88x
( 99.3%)  (reads=2745→173, io_time=5481.54→7.82ms)
hash_vacuum_medium         base= 34483.4ms  patch=  2703.5ms  12.75x
( 92.2%)  (reads=19166→3901, io_time=31948.33→308.05ms)
wal_logging_medium         base=  7778.6ms  patch=  7814.5ms   1.00x
( -0.5%)  (reads=2857→2845, io_time=11.84→11.45ms)

-- worker, medium size

bloom_vacuum_medium        base=  8376.2ms  patch=   747.7ms  11.20x
( 91.1%)  (reads=4732→1056, io_time=7688.91→65.49ms)
pgstattuple_medium         base=  4012.7ms  patch=   339.0ms  11.84x
( 91.6%)  (reads=2006→2006, io_time=4002.23→49.99ms)
pgstatindex_medium         base=  5490.3ms  patch=    38.3ms  143.23x
( 99.3%)  (reads=2745→173, io_time=5480.60→16.24ms)
hash_vacuum_medium         base= 34638.4ms  patch=  2940.2ms  11.78x
( 91.5%)  (reads=19166→3901, io_time=31881.61→242.01ms)
wal_logging_medium         base=  7440.1ms  patch=  7434.0ms   1.00x
(  0.1%)  (reads=2861→2825, io_time=10.62→10.71ms)

-- Setting read delay only
sudo dmsetup reload "$DM_DELAY_DEV" --table "0 $size delay $dev 0 $ms $dev 0 0"
Setting dm_delay on delayed to 2ms read / 0ms write
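As far as I know, `dmsetup reload` only loads the new table into the
inactive slot; it takes effect on the next suspend/resume cycle, so the
reload above needs to be followed by something like:

```shell
# Activate the table loaded by "dmsetup reload": suspend flushes
# outstanding I/O, resume swaps in the inactive table.
sudo dmsetup suspend "$DM_DELAY_DEV"
sudo dmsetup resume "$DM_DELAY_DEV"
```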

After setting the write delay to 0ms, I observe more pronounced
speedups overall. Since vacuum is write-intensive, delaying writes can
dominate the runtime and mask the read-path improvement we are
measuring. Dropping the write delay also shortens the runtime of the
test itself.

-- wal_logging
The wal_logging patch does not seem to benefit from streamification in
this configuration either.

-- Delay setup
For anyone wanting to reproduce the results with a simulated-latency
device, here is the setup I used.

1. Create a 50GB file-backed block device (enough for PG data + indexes)

sudo dd if=/dev/zero of=/srv/delay_disk.img bs=1M count=50000 status=progress
sudo losetup /dev/loop0 /srv/delay_disk.img

2. Create the dm_delay device with 2ms delay

sudo dmsetup create delayed \
  --table "0 $(sudo blockdev --getsz /dev/loop0) delay /dev/loop0 0 2"
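To sanity-check the mapping before formatting, `dmsetup table` prints
the active table; the trailing field(s) are the configured delay(s) in
ms:

```shell
# Show the active dm-delay table for the "delayed" device.
sudo dmsetup table delayed
```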

3. Format and mount it

sudo mkfs.ext4 /dev/mapper/delayed
sudo mkdir -p /srv/pg_delayed
sudo mount /dev/mapper/delayed /srv/pg_delayed
sudo chown $(whoami) /srv/pg_delayed

4. Run benchmark with WORKROOT pointing to the delayed device

WORKROOT=/srv/pg_delayed SIZES=medium REPS=3 \
  ./run_streaming_benchmark.sh --baseline --io-method io_uring \
    --test gin_vacuum --direct-io --io-delay 2  # plus the targeted patch
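A matching teardown, sketched by reversing the setup steps above (same
paths and device names):

```shell
# Reverse the setup: unmount the filesystem, remove the dm target,
# detach the loop device, and delete the backing file.
sudo umount /srv/pg_delayed
sudo dmsetup remove delayed
sudo losetup -d /dev/loop0
sudo rm /srv/delay_disk.img
```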


--
Best,
Xuneng

