Re: EXPLAIN: showing ReadStream / prefetch stats - Mailing list pgsql-hackers
| From | Tomas Vondra |
|---|---|
| Subject | Re: EXPLAIN: showing ReadStream / prefetch stats |
| Date | |
| Msg-id | 2551c884-af06-4663-83d2-cd5688c3ae12@vondra.me Whole thread Raw |
| In response to | Re: EXPLAIN: showing ReadStream / prefetch stats (Lukas Fittl <lukas@fittl.com>) |
| Responses |
Re: EXPLAIN: showing ReadStream / prefetch stats
|
| List | pgsql-hackers |
Hi Lukas, Thanks for the review. I've been working on cleaning up the patch, reordering the changes a bit. I ran into most of the issues you mentioned, and the attached v4 should address most of them. I got rid of the "old" approach entirely, and just do everything through the scan descriptor. I reorganized the patch series to make more sense - it's split into three pieces (0001 is a prerequisite patch, not really a part of the changes): * 0002 - Basic changes, adding the info to a single scan node that already uses shared instrumentation (=BitmapHeapScan). It adds the counting to read_stream, and printing to explain. It also adjusts all the failing tests checking query plans (usually by disabling I/O). * 0003 - Adds support for SeqScan (has to setup the instrumentation for parallel workers, ...). * 0004 - Adds support for TidRangeScan (has to setup the instrumentation for parallel workers, ...). I think it'd make sense to commit these parts separately. The 0002 part could be made smaller by introducing a new read_stream helper, so that we don't need to add a parameter to all the places calling read_stream_begin_relation(). But that's a detail. More comments inline. On 3/19/26 19:11, Lukas Fittl wrote: > On Wed, Mar 18, 2026 at 3:41 PM Tomas Vondra <tomas@vondra.me> wrote: >> The 0003 also changes the EXPLAIN to enable IO by default, just like we >> do for BUFFERS. It seems like a reasonable precedent to me. > > One side effect of that is that the tests now fail for me locally, > because the specific values are system-dependent. Attached a patch > (nocfbot-0002) that fixed that for me. > Right. I haven't addressed that in v3, but with v4 the tests should be passing (after each patch). > There is one detail maybe calling out specifically on JSON output: > Currently Postgres always emits all fields in JSON output, even if > they are zero. The code that you have in v3 skips the "I/O" group when > the value is zero, which doesn't work well with how current regression > tests are written. I'm definitely not a fan of the unnecessary > verbosity of JSON EXPLAIN output, but I'd suggest we don't break with > the tradition here, and instead always output the "I/O" group in > non-text formats. Also attached a patch for that (nocfbot-0001). > Yeah, I noticed that too, but I wasn't sure what to do about it. The I/O stats are conditional on (io_count > 0) because it calculates average I/O size etc. If there are no I/Os, we could probably print 0.0. Another thing I noticed in the non-text formats is that this is the only case with explicit groups, which adds nesting. It looks weird, so I plan to remove that, and will add some "prefix" to the items. > Overall I think the abstraction here seems reasonable if we're > primarily focused on getting the per-node instrumentation taken care > of. > OK, good. I think passing the stats through a scan descriptor works OK. I still have a strange feeling about it - we have the TAM interface and yet we choose to pass data in other ways. It's better than global variables, of course. But it's not the only piece of data returned through the scan descriptor, so at least there's a precedent. > That said, two thoughts on an example EXPLAIN output I just ran: > > 1) I do wonder if its a bit confusing that we propagate I/O timings up > the EXPLAIN tree, but not the "I/O" information - I realize fixing > that would be a bit involved though, e.g. we'd have to invent > accumulation logic in explain.c. It'd also maybe make people thing > this covers things like temporary file reads/etc. > I'm not sure it makes sense to combine this information from multiple nodes. If you have multiple read streams with very different average I/O size and/or prefetch distances, what does the "global average" mean? I also don't think it's very useful - the information about distance and/or I/O size is most useful when analyzing individual nodes. What would it tell me for multiple nodes combined? So I don't think it's something we need, and we can add it later. > 2) The ordering of "I/O Timings" in relation to "I/O" feels off to me > (since they're not next to each other) - maybe we should re-order I/O > Timings to come before Buffers in show_buffer_usage to address that? > Yeah, maybe. I don't think we have any explicit order, but why not. Thanks! -- Tomas Vondra
Attachment
pgsql-hackers by date: