Re: Streamify more code paths - Mailing list pgsql-hackers

From Xuneng Zhou
Subject Re: Streamify more code paths
Msg-id CABPTF7UA3sEw1ZpAj8qAKY6Xs71sk41X-pV43_iZHZz2U_AP=Q@mail.gmail.com
In response to Re: Streamify more code paths  (Xuneng Zhou <xunengzhou@gmail.com>)
Responses Re: Streamify more code paths
List pgsql-hackers
On Thu, Mar 12, 2026 at 12:39 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> On Thu, Mar 12, 2026 at 11:42 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Thu, Mar 12, 2026 at 06:33:08AM +0900, Michael Paquier wrote:
> > > Thanks for doing that.  On my side, I am going to look at the gin and
> > > hash vacuum paths first with more testing as these don't use a custom
> > > callback.  I don't think that I am going to need a lot of convincing,
> > > but I'd rather produce some numbers myself before doing something.
> > > I'll tweak a mount point with the delay trick, as well.
> >
> > While debug_io_direct has been helping a bit, the delay trick to
> > throttle the IO activity has helped much more with my runtime
> > numbers.  I have mounted a separate partition with a delay of 5ms,
> > disabled checksums (this part did not make a real difference), and
> > evicted shared buffers for the relation and indexes before the VACUUM.
> >
> > Then I got better numbers.  Here is an extract:
> > - worker=3:
> > gin_vacuum (100k tuples)   base=  1448.2ms  patch=   572.5ms   2.53x
> > ( 60.5%)  (reads=175→104, io_time=1382.70→506.64ms)
> > gin_vacuum (300k tuples)   base=  3728.0ms  patch=  1332.0ms   2.80x
> > ( 64.3%)  (reads=486→293, io_time=3669.89→1266.27ms)
> > bloom_vacuum (100k tuples) base= 21826.8ms  patch= 17220.3ms   1.27x
> > ( 21.1%)  (reads=485→117, io_time=4773.33→270.56ms)
> > bloom_vacuum (300k tuples) base= 67054.0ms  patch= 53164.7ms   1.26x
> > ( 20.7%)  (reads=1431.5→327.5, io_time=13880.2→381.395ms)
> > - io_uring:
> > gin_vacuum (100k tuples)   base=  1240.3ms  patch=   360.5ms   3.44x
> > ( 70.9%)  (reads=175→104, io_time=1175.35→299.75ms)
> > gin_vacuum (300k tuples)   base=  2829.9ms  patch=   642.0ms   4.41x
> > ( 77.3%)  (reads=465.5→293, io_time=2768.46→579.04ms)
> > bloom_vacuum (100k tuples) base= 22121.7ms  patch= 17532.3ms   1.26x
> > ( 20.7%)  (reads=485→117, io_time=4850.46→285.28ms)
> > bloom_vacuum (300k tuples) base= 67058.0ms  patch= 53118.0ms   1.26x
> > ( 20.8%)  (reads=1431.5→327.5, io_time=13870.9→305.44ms)
> >
> > The higher the number of tuples, the better the performance for each
> > individual operation, but the tests take a much longer time (tens of
> > seconds vs tens of minutes).  For GIN, the numbers can be quite good
> > once these reads are pushed.  For bloom, the runtime is improved, and
> > the IO numbers are much better.
> >
>
> -- io_uring, medium size
>
> bloom_vacuum_medium        base=  8355.2ms  patch=   715.0ms  11.68x
> ( 91.4%)  (reads=4732→1056, io_time=7699.47→86.52ms)
> pgstattuple_medium         base=  4012.8ms  patch=   213.7ms  18.78x
> ( 94.7%)  (reads=2006→2006, io_time=4001.66→200.24ms)
> pgstatindex_medium         base=  5490.6ms  patch=    37.9ms  144.88x
> ( 99.3%)  (reads=2745→173, io_time=5481.54→7.82ms)
> hash_vacuum_medium         base= 34483.4ms  patch=  2703.5ms  12.75x
> ( 92.2%)  (reads=19166→3901, io_time=31948.33→308.05ms)
> wal_logging_medium         base=  7778.6ms  patch=  7814.5ms   1.00x
> ( -0.5%)  (reads=2857→2845, io_time=11.84→11.45ms)
>
> -- worker, medium size
> bloom_vacuum_medium        base=  8376.2ms  patch=   747.7ms  11.20x
> ( 91.1%)  (reads=4732→1056, io_time=7688.91→65.49ms)
> pgstattuple_medium         base=  4012.7ms  patch=   339.0ms  11.84x
> ( 91.6%)  (reads=2006→2006, io_time=4002.23→49.99ms)
> pgstatindex_medium         base=  5490.3ms  patch=    38.3ms  143.23x
> ( 99.3%)  (reads=2745→173, io_time=5480.60→16.24ms)
> hash_vacuum_medium         base= 34638.4ms  patch=  2940.2ms  11.78x
> ( 91.5%)  (reads=19166→3901, io_time=31881.61→242.01ms)
> wal_logging_medium         base=  7440.1ms  patch=  7434.0ms   1.00x
> (  0.1%)  (reads=2861→2825, io_time=10.62→10.71ms)
>

Our io_time metric currently measures only read time and ignores write
I/O, which can be misleading. We now separate it into read_time and
write_time.

-- write-delay 2 ms
WORKROOT=/srv/pg_delayed SIZES=small REPS=3 \
./run_streaming_benchmark.sh --baseline --io-method worker \
    --io-workers 12 --test hash_vacuum --direct-io --read-delay 2 \
    --write-delay 2 \
    v6-0004-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch

hash_vacuum_small          base= 16652.8ms  patch= 13493.2ms   1.23x
( 19.0%)  (reads=2338→815, read_time=4136.19→884.79ms,
writes=6218→6206, write_time=12313.81→12289.58ms)

-- write-delay 0 ms
WORKROOT=/srv/pg_delayed SIZES=small REPS=3 \
./run_streaming_benchmark.sh --baseline --io-method worker \
    --io-workers 12 --test hash_vacuum --direct-io --read-delay 2 \
    --write-delay 0 \
    v6-0004-Streamify-hash-index-VACUUM-primary-bucket-page-r.patch

hash_vacuum_small          base=  4310.2ms  patch=  1146.7ms   3.76x
( 73.4%)  (reads=2338→815, read_time=4002.24→833.47ms,
writes=6218→6206, write_time=186.69→140.96ms)
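The flat speedup in the write-delay run is just Amdahl's law: streamified reads only shave off the read portion, and with a 2ms write delay the writes dominate the patched runtime. A back-of-the-envelope check against the numbers above (not output of the benchmark script):

```python
# Fraction of total runtime spent in write I/O, patched runs:
def write_share(total_ms: float, write_ms: float) -> float:
    return write_ms / total_ms

# --write-delay 2: 12289.58ms of 13493.2ms total
print(f"{write_share(13493.2, 12289.58):.0%}")   # roughly 91%
# --write-delay 0: 140.96ms of 1146.7ms total
print(f"{write_share(1146.7, 140.96):.0%}")      # roughly 12%
```

With ~91% of the patched runtime stuck in writes, even an infinitely fast read path could not push the speedup much past the observed 1.23x.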

--
Best,
Xuneng

