Home > mailing lists

Re: EXPLAIN: showing ReadStream / prefetch stats - Mailing list pgsql-hackers

From	Tomas Vondra
Subject	Re: EXPLAIN: showing ReadStream / prefetch stats
Date	March 19 23:57:32
Msg-id	2551c884-af06-4663-83d2-cd5688c3ae12@vondra.me Whole thread Raw
In response to	Re: EXPLAIN: showing ReadStream / prefetch stats (Lukas Fittl <lukas@fittl.com>)
Responses	Re: EXPLAIN: showing ReadStream / prefetch stats
List	pgsql-hackers

Tree view

Hi Lukas,

Thanks for the review. I've been working on cleaning up the patch,
reordering the changes a bit. I ran into most of the issues you
mentioned, and the attached v4 should address most of them.

I got rid of the "old" approach entirely, and just do everything through
the scan descriptor. I reorganized the patch series to make more sense -
it's split into three pieces (0001 is a prerequisite patch, not really a
part of the changes):

* 0002 - Basic changes, adding the info to a single scan node that
already uses shared instrumentation (=BitmapHeapScan). It adds the
counting to read_stream, and printing to explain. It also adjusts all
the failing tests checking query plans (usually by disabling I/O).

* 0003 - Adds support for SeqScan (has to setup the instrumentation for
parallel workers, ...).

* 0004 - Adds support for TidRangeScan (has to setup the instrumentation
for parallel workers, ...).

I think it'd make sense to commit these parts separately.

The 0002 part could be made smaller by introducing a new read_stream
helper, so that we don't need to add a parameter to all the places
calling read_stream_begin_relation(). But that's a detail.

More comments inline.

On 3/19/26 19:11, Lukas Fittl wrote:
> On Wed, Mar 18, 2026 at 3:41 PM Tomas Vondra <tomas@vondra.me> wrote:
>> The 0003 also changes the EXPLAIN to enable IO by default, just like we
>> do for BUFFERS. It seems like a reasonable precedent to me.
> 
> One side effect of that is that the tests now fail for me locally,
> because the specific values are system-dependent. Attached a patch
> (nocfbot-0002) that fixed that for me.
> 

Right. I haven't addressed that in v3, but with v4 the tests should be
passing (after each patch).

> There is one detail maybe calling out specifically on JSON output:
> Currently Postgres always emits all fields in JSON output, even if
> they are zero. The code that you have in v3 skips the "I/O" group when
> the value is zero, which doesn't work well with how current regression
> tests are written. I'm definitely not a fan of the unnecessary
> verbosity of JSON EXPLAIN output, but I'd suggest we don't break with
> the tradition here, and instead always output the "I/O" group in
> non-text formats. Also attached a patch for that (nocfbot-0001).
> 

Yeah, I noticed that too, but I wasn't sure what to do about it. The I/O
stats are conditional on (io_count > 0) because it calculates average
I/O size etc. If there are no I/Os, we could probably print 0.0.

Another thing I noticed in the non-text formats is that this is the only
case with explicit groups, which adds nesting. It looks weird, so I plan
to remove that, and will add some "prefix" to the items.

> Overall I think the abstraction here seems reasonable if we're
> primarily focused on getting the per-node instrumentation taken care
> of.
> 

OK, good. I think passing the stats through a scan descriptor works OK.
I still have a strange feeling about it - we have the TAM interface and
yet we choose to pass data in other ways. It's better than global
variables, of course. But it's not the only piece of data returned
through the scan descriptor, so at least there's a precedent.

> That said, two thoughts on an example EXPLAIN output I just ran:
> 
> 1) I do wonder if its a bit confusing that we propagate I/O timings up
> the EXPLAIN tree, but not the "I/O" information - I realize fixing
> that would be a bit involved though, e.g. we'd have to invent
> accumulation logic in explain.c. It'd also maybe make people thing
> this covers things like temporary file reads/etc.
> 

I'm not sure it makes sense to combine this information from multiple
nodes. If you have multiple read streams with very different average I/O
size and/or prefetch distances, what does the "global average" mean?

I also don't think it's very useful - the information about distance
and/or I/O size is most useful when analyzing individual nodes. What
would it tell me for multiple nodes combined?

So I don't think it's something we need, and we can add it later.

> 2) The ordering of "I/O Timings" in relation to "I/O" feels off to me
> (since they're not next to each other) - maybe we should re-order I/O
> Timings to come before Buffers in show_buffer_usage to address that?
> 

Yeah, maybe. I don't think we have any explicit order, but why not.

Thanks!

-- 
Tomas Vondra

Attachment

pgsql-hackers by date:

From: Zsolt Parragi
Date: 19 March, 23:48:27
Subject: Re: pg_waldump: support decoding of WAL inside tarfile

From: Andrew Dunstan
Date: 20 March, 00:30:06
Subject: Re: pg_get__*_ddl consolidation

Re: EXPLAIN: showing ReadStream / prefetch stats - Mailing list pgsql-hackers

Attachment

Previous

Next