Re: EXPLAIN: showing ReadStream / prefetch stats - Mailing list pgsql-hackers
| From | Andres Freund |
|---|---|
| Subject | Re: EXPLAIN: showing ReadStream / prefetch stats |
| Date | |
| Msg-id | bu6faqsvinpiyw5no5cb76j67oazdvmsi732xf7qmwava6lewo@7qsndjfzyqkd |
| In response to | Re: EXPLAIN: showing ReadStream / prefetch stats (Tomas Vondra <tomas@vondra.me>) |
| Responses | Re: EXPLAIN: showing ReadStream / prefetch stats |
| List | pgsql-hackers |
Hi,
On 2026-03-31 15:16:57 +0200, Tomas Vondra wrote:
> From 1bd80f66019aea773c2b6f46212172de592efb60 Mon Sep 17 00:00:00 2001
> From: Tomas Vondra <tomas@vondra.me>
> Date: Tue, 31 Mar 2026 13:44:23 +0200
> Subject: [PATCH v6 2/5] switch explain to unaligned for json/xml/yaml
LGTM.
> From cd5437f5f49e84b8ae3f8cb8f0df4e605c1b5592 Mon Sep 17 00:00:00 2001
> From: Tomas Vondra <tomas@vondra.me>
> Date: Tue, 31 Mar 2026 13:47:04 +0200
> Subject: [PATCH v6 3/5] explain: show prefetch stats in EXPLAIN (ANALYZE)
>
> This adds details about AIO / prefetch for executor nodes using a
> ReadStream. Right now this applies only to BitmapHeapScan, because
> that's the only scan node using a ReadStream and collecting
> instrumentation from workers.
s/Right now/As of this commit ... support for ... will be added in subsequent commits/
Not for now, but I wonder if we ought to introduce a REPRODUCIBLE option for
EXPLAIN [ANALYZE] that sets BUFFERS, COSTS, IO, ... the right way for the
regression tests, instead of having to go through and change a gazillion tests
every time.
> diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
> index 20ff58aa782..5cfe8e24615 100644
> --- a/contrib/amcheck/verify_heapam.c
> +++ b/contrib/amcheck/verify_heapam.c
> @@ -477,7 +477,8 @@ verify_heapam(PG_FUNCTION_ARGS)
> MAIN_FORKNUM,
> stream_cb,
> stream_data,
> - 0);
> + 0,
> + NULL);
>
> while ((ctx.buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
> {
Kinda wondering if, instead of adding a NULL to a lot of places, we should
add a
read_stream_enable_stats(stream, stats)
or such.
> --- a/src/backend/commands/explain.c
> +++ b/src/backend/commands/explain.c
> @@ -13,6 +13,8 @@
> */
> #include "postgres.h"
>
> +#include "access/genam.h"
Why genam.h?
> +#include "access/tableam.h"
I'd probably include access/relscan.h instead.
> +/*
> + * show_scan_io_usage
> + * show info about prefetching for a seq/bitmap scan
> + *
> + * Shows summary of stats for leader and workers (if any).
> + */
> +static void
> +show_scan_io_usage(ScanState *planstate, ExplainState *es)
> +{
> + Plan *plan = planstate->ps.plan;
> + IOStats stats;
> +
> + if (!es->io)
> + return;
> +
> + /* scan not started or no prefetch stats */
> + if (!(planstate &&
> + planstate->ss_currentScanDesc &&
> + planstate->ss_currentScanDesc->rs_instrument))
> + return;
> +
> + /* Initialize counters with stats from the local process first */
> + switch (nodeTag(plan))
> + {
That "local process first" comment looks a bit odd here, given that at that level
it's for a block that does both?
> + case T_BitmapHeapScan:
> + {
> + SharedBitmapHeapInstrumentation *sinstrument
> + = ((BitmapHeapScanState *) planstate)->sinstrument;
> +
> + /* collect prefetch statistics from the read stream */
> + stats = planstate->ss_currentScanDesc->rs_instrument->io;
> +
> + /*
> + * get the sum of the counters set within each and every
> + * process
> + */
> + if (sinstrument)
> + {
> + for (int i = 0; i < sinstrument->num_workers; ++i)
> + {
> + BitmapHeapScanInstrumentation *winstrument = &sinstrument->sinstrument[i];
> +
> + AccumulateIOStats(&stats, &winstrument->stats.io);
> + }
I'm not entirely sure how useful it is to accumulate the stats from worker
into the leader's stats. Doesn't that mean that the leader's stats can't ever
be viewed in isolation? I'm also not really clear that it's very useful to
just smush the stats of the different streams together.
> +/*
> + * show_io_usage
> + * show info about I/O prefetching for a single worker
> + *
> + * Shows prefetching stats for a parallel scan worker.
> + */
> +static void
> +show_io_usage(PlanState *planstate, ExplainState *es, int worker)
s/show_io_usage/show_worker_io_usage/?
Yes, the other functions called in the per-worker block don't have _worker_ in
them, but afaict they aren't specific to being used in the context of a worker.
> +/*
> + * Update stream stats with current pinned buffer depth.
> + *
> + * Called once per buffer returned to the consumer in read_stream_next_buffer().
> + * Records the number of pinned buffers at that moment, so we can compute the
> + * average look-ahead depth.
> + */
> +static inline void
> +read_stream_count_prefetch(ReadStream *stream)
> +{
> + IOStats *stats = stream->stats;
> +
> + if (stats == NULL)
> + return;
> +
> + stats->prefetch_count++;
> + stats->distance_sum += stream->pinned_buffers;
> + if (stream->pinned_buffers > stats->distance_max)
> + stats->distance_max = stream->pinned_buffers;
I'd probably use Max().
> @@ -380,6 +435,9 @@ read_stream_start_pending_read(ReadStream *stream)
> Assert(stream->ios_in_progress < stream->max_ios);
> stream->ios_in_progress++;
> stream->seq_blocknum = stream->pending_read_blocknum + nblocks;
> +
> + /* update I/O stats */
> + read_stream_count_io(stream, nblocks, stream->ios_in_progress);
> }
>
> /*
> @@ -544,7 +602,8 @@ read_stream_begin_impl(int flags,
> ForkNumber forknum,
> ReadStreamBlockNumberCB callback,
> void *callback_private_data,
> - size_t per_buffer_data_size)
> + size_t per_buffer_data_size,
> + IOStats *stats)
> {
> ReadStream *stream;
> size_t size;
> @@ -703,6 +762,11 @@ read_stream_begin_impl(int flags,
> stream->seq_until_processed = InvalidBlockNumber;
> stream->temporary = SmgrIsTemp(smgr);
>
> + /* set capacity */
> + stream->stats = stats;
> + if (stream->stats)
> + stream->stats->distance_capacity = max_pinned_buffers;
> +
The comments here don't seem to add much to me.
> +/* ---------------------
> + * Instrumentation information about read streams
> + * ---------------------
> + */
> +typedef struct IOStats
> +{
> + /* number of buffers returned to consumer (for averaging distance) */
> + uint64 prefetch_count;
> +
> + /* sum of pinned_buffers sampled at each buffer return */
> + uint64 distance_sum;
> +
> + /* maximum actual pinned_buffers observed during the scan */
> + int16 distance_max;
> +
> + /* maximum possible look-ahead distance (max_pinned_buffers) */
> + int16 distance_capacity;
> +
> + /* number of waits for a read (for the I/O) */
> + uint64 wait_count;
> +
> + /* I/O stats */
> + uint64 io_count; /* number of I/Os */
> + uint64 io_nblocks; /* sum of blocks for all I/Os */
> + uint64 io_in_progress; /* sum of in-progress I/Os */
IO stats in a struct named IO stats ;)
> /* ---------------------
> * Instrumentation information for indexscans (amgettuple and amgetbitmap)
> * ---------------------
> */
> +
> typedef struct IndexScanInstrumentation
> {
> /* Index search count (incremented with pgstat_count_index_scan call) */
Spurious newline?
> @@ -4068,6 +4068,27 @@ show_scan_io_usage(ScanState *planstate, ExplainState *es)
> /* Initialize counters with stats from the local process first */
> switch (nodeTag(plan))
> {
> + case T_SeqScan:
> + {
> + SharedSeqScanInstrumentation *sinstrument
> + = ((SeqScanState *) planstate)->sinstrument;
> +
> + /* collect prefetch statistics from the read stream */
> + stats = planstate->ss_currentScanDesc->rs_instrument->io;
> +
> + /* get the sum of the counters set within each and every process */
> + if (sinstrument)
> + {
> + for (int i = 0; i < sinstrument->num_workers; ++i)
> + {
> + SeqScanInstrumentation *winstrument = &sinstrument->sinstrument[i];
> +
> + AccumulateIOStats(&stats, &winstrument->stats.io);
> + }
> + }
> +
> + break;
> + }
> case T_BitmapHeapScan:
> {
> SharedBitmapHeapInstrumentation *sinstrument
It's a bit sad how much repetition there is between the per-node support for
each node type. Not sure there's a great solution though.
Perhaps it'd be nicer if we instead summed up the stats in
ExecSeqScanRetrieveInstrumentation() etc.?
> @@ -119,6 +119,13 @@ explain_filter
> <Actual-Rows>N.N</Actual-Rows>
> <Actual-Loops>N</Actual-Loops>
> <Disabled>false</Disabled>
> + <Average-Prefetch-Distance>N.N</Average-Prefetch-Distance>
> + <Max-Prefetch-Distance>N</Max-Prefetch-Distance>
> + <Prefetch-Capacity>N</Prefetch-Capacity>
> + <I-O-Count>N</I-O-Count>
> + <I-O-Waits>N</I-O-Waits>
> + <Average-I-O-Size>N.N</Average-I-O-Size>
> + <Average-I-Os-In-Progress>N.N</Average-I-Os-In-Progress>
> <Shared-Hit-Blocks>N</Shared-Hit-Blocks>
> <Shared-Read-Blocks>N</Shared-Read-Blocks>
> <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>
The I-O looks a bit ridiculous. But since people here seem to like I/O much
better than IO (which I don't personally get), it's perhaps the right choice
:(. I guess we also already have that precedent due to track_io_timing.
Greetings,
Andres Freund